A Snakemake workflow that automates cell-free DNA fragmentation feature extraction with FinaleToolkit — supporting hg38 and T2T-CHM13, parallel processing, SLURM, and BED/BAM/CRAM inputs.
| Installation | Reference Setup | Usage | Parameters |
|---|---|---|---|
| Install | Reference setup | Usage | YAML parameters |
| Genomes | Mappability | SLURM | Output naming |
Two genomes are supported out of the box, each with a runnable example config and a one-command reference setup:
| Genome | Example config | Setup command |
|---|---|---|
| hg38 | params.hg38.yaml | scripts/setup_reference.sh hg38 supplement 500 |
| T2T-CHM13 (hs1) | params.t2t-chm13.yaml | scripts/setup_reference.sh t2t-chm13 supplement 500 |
git clone https://github.com/epifluidlab/finaletoolkit_workflow
cd finaletoolkit_workflow
conda env create -f environment.yml
conda activate finaletoolkit_workflow

Core tools (installed by the environment): finaletoolkit, snakemake, bedtools, htslib,
samtools, pybigwig.
The workflow needs per-genome supplement files (chrom sizes, .2bit, interval bins, blacklist, gap, and
a mappability bigWig). scripts/setup_reference.sh builds all of them, with the exact filenames the
example configs expect:
# T2T-CHM13: UCSC hs1 2bit/sizes + BEDbase excluderanges blacklist + Zenodo mappability track
scripts/setup_reference.sh t2t-chm13 supplement 500
# hg38: UCSC 2bit/sizes + Boyle-Lab Blacklist + finaletoolkit gap-bed + Zenodo mappability track
scripts/setup_reference.sh hg38 supplement 500

This writes into supplement/:
<g>.chrom.sizes <g>.2bit <g>.<N>kb.bins <g>.delfi.chrom.sizes <g>.blacklist.bed <g>.45mer.mappability.bw
( + hg38.gap.bed for hg38 only — T2T-CHM13 is gap-free )
(g = hg38 or chm13). <g>.delfi.chrom.sizes is chr1–22, X, Y (DELFI requires
centromere-bearing contigs; the general bins keep chrM/chrY). Everything except the mappability bigWig is built live from public sources
(UCSC, Boyle-Lab Blacklist, BEDbase). The *.bins.filtered interval file used by DELFI is generated
automatically during the run by the mappability filter.
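
Before launching a run, it can help to confirm that the setup step produced everything listed above. The following is a minimal sketch, not part of the workflow: the helper names are hypothetical, and it assumes that the third argument to setup_reference.sh (500 in the examples) is the bin size in kb that fills the `<N>` placeholder.

```python
from pathlib import Path


def expected_supplement_files(genome: str, bin_kb: int) -> list[str]:
    """Filenames this README lists for supplement/ (genome is 'hg38' or 'chm13')."""
    names = [
        f"{genome}.chrom.sizes",
        f"{genome}.2bit",
        f"{genome}.{bin_kb}kb.bins",  # assumption: <N> is the bin size in kb
        f"{genome}.delfi.chrom.sizes",
        f"{genome}.blacklist.bed",
        f"{genome}.45mer.mappability.bw",
    ]
    if genome == "hg38":
        # T2T-CHM13 is gap-free, so only hg38 gets a gap file.
        names.append("hg38.gap.bed")
    return names


def missing_supplement_files(supplement_dir: str, genome: str, bin_kb: int) -> list[str]:
    """Return the expected files that are not present in supplement_dir."""
    root = Path(supplement_dir)
    return [n for n in expected_supplement_files(genome, bin_kb) if not (root / n).exists()]
```

Calling `missing_supplement_files("supplement", "chm13", 500)` returns an empty list once the setup command has completed successfully.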
- Put input fragment/alignment files in input/ (or set input_dir).
- Pick a config and run:

snakemake --configfile params.t2t-chm13.yaml --cores <N> \
    --rerun-incomplete --default-resources "tmpdir='./tmp'"

--cores sets CPU cores; --jobs caps concurrent jobs. Use -n for a dry run to preview the DAG.
The SLURM executor plugin is already in environment.yml. Set your account/partition in
slurm_profile/config.yaml, then submit:
./ftk_exc.sh params.t2t-chm13.yaml ./tmp   # runs in the background -> snakemake.log

Interval bins are kept only if their mean mappability over the bin is ≥ mappability_threshold
(filtered by scripts/mappability_filter.py). Because the filter uses the per-bin mean, the kind of
track matters:
- Continuous track (recommended): each position = 1 / (k-mer occurrences), so the bin mean is the average mappability (a position that maps twice contributes 0.5). The shipped hg38 and T2T-CHM13 tracks are both continuous 45-mer GenMap tracks (-E 0), hosted on Zenodo (record 20724659) and fetched automatically by setup_reference.sh. To build one for another genome / k-mer:

      conda create -n genmap -c bioconda -c conda-forge genmap ucsc-bedgraphtobigwig
      conda activate genmap
      scripts/genmap_mappability.sh genome.fa supplement/<g>.chrom.sizes \
          supplement/<g>.45mer.mappability.bw 45 0   # memory/time-heavy: use a compute node
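
The per-bin mean rule can be illustrated with a toy example. This is only a sketch of the arithmetic described above, using a plain list of per-base values; it is not the code in scripts/mappability_filter.py, which reads a bigWig track, and the function names are hypothetical.

```python
def bin_mean(per_base_values):
    """Mean of the per-base mappability values inside one bin."""
    if not per_base_values:
        raise ValueError("bin has no positions")
    return sum(per_base_values) / len(per_base_values)


def keep_bin(per_base_values, mappability_threshold):
    """A bin is kept when its mean mappability is >= the threshold."""
    return bin_mean(per_base_values) >= mappability_threshold


# Toy 10 bp bin on a continuous track: 8 unique positions (1 / 1 = 1.0)
# and 2 positions whose k-mer occurs twice (1 / 2 = 0.5).
toy_bin = [1.0] * 8 + [0.5] * 2

print(bin_mean(toy_bin))        # 0.9
print(keep_bin(toy_bin, 0.9))   # True
print(keep_bin(toy_bin, 0.95))  # False
```

Because twice-mapping positions count as 0.5 rather than 0, a continuous track lowers the bin mean gradually as repeat content increases.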
T2T-CHM13 specifics: the assembly is gap-free and has no finaletoolkit gap preset, so DELFI runs
without a gap file (centromeres are already removed by the mappability + blacklist filtering); the
blacklist is the excluderanges T2T.excluderanges set.
DELFI uses <g>.delfi.chrom.sizes (chr1–22, X, Y) because finaletoolkit crashes on contigs without a
centromere (e.g. chrM) when a gap file is supplied.
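
To make the contig restriction concrete, here is a sketch of how a chrom.sizes listing could be reduced to chr1–22, X, Y. It is an illustration only: setup_reference.sh builds the real `<g>.delfi.chrom.sizes`, and its implementation may differ. The function name and the sample input are hypothetical.

```python
# Contigs kept for DELFI according to this README: chr1-22, chrX, chrY.
DELFI_CONTIGS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def filter_chrom_sizes(lines, keep=DELFI_CONTIGS):
    """Keep (name, length) pairs from chrom.sizes lines whose contig is in `keep`."""
    keep_set = set(keep)
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in keep_set:
            kept.append((fields[0], int(fields[1])))
    return kept


# Illustrative input in the two-column chrom.sizes layout (name, length).
sample = [
    "chr1\t248956422",
    "chrM\t16569",
    "chr1_KI270706v1_random\t175055",
    "chrY\t57227415",
]
print(filter_chrom_sizes(sample))
```

In this sample, chrM and the unplaced scaffold are dropped, while chr1 and chrY are retained.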
The continuous 45-mer mappability bigWigs (hg38, T2T-CHM13) are deposited in the FinaleToolkit
Dataset on Zenodo — DOI 10.5281/zenodo.20724659
(concept DOI 10.5281/zenodo.14284132 always resolves to the
latest version). All other supplement files are built live from public sources (UCSC, Boyle-Lab
Blacklist, BEDbase) by setup_reference.sh.
- Required: input_dir, output_dir, file_format (bed.gz, frag.gz, bam, or cram).
- Optional: supplement_dir, interval_file, mappability_file, mappability_threshold, and any FinaleToolkit command.
- Commands: enable a FinaleToolkit command with underscores instead of hyphens (e.g. adjust-wps → adjust_wps: True); set flags by appending _<flag> (e.g. coverage_mapq: 30). The workflow respects command dependencies (e.g. mds needs end_motifs, regional_mds needs interval_end_motifs). The per-region motif-diversity step is regional_mds. See params.yaml for the fully annotated list of every option.
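
The key-naming convention above can be sketched in a few lines of Python. The helpers are hypothetical and only mirror the rule as stated here (hyphens become underscores; a flag is appended after an underscore); the directory values are placeholders, and params.yaml remains the authoritative list of options.

```python
def command_key(command: str) -> str:
    """Config key that enables a FinaleToolkit command, e.g. adjust-wps -> adjust_wps."""
    return command.replace("-", "_")


def flag_key(command: str, flag: str) -> str:
    """Config key for a command flag, e.g. (coverage, mapq) -> coverage_mapq."""
    return f"{command_key(command)}_{flag}"


# A config assembled as a dict; written out as YAML it would use the same keys.
config = {
    "input_dir": "input",     # placeholder path
    "output_dir": "output",   # placeholder path
    "file_format": "bed.gz",
    command_key("adjust-wps"): True,
    command_key("coverage"): True,
    flag_key("coverage", "mapq"): 30,
}

for key, value in config.items():
    print(f"{key}: {value}")
```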
- Filtered files get .filtered before the format (e.g. file.filtered.bed.gz).
- Command outputs insert the command name (e.g. file.frag_length_intervals.bed).
- Each input is processed for every enabled command.
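
The naming rules above can be expressed as a small sketch. It reproduces only the two examples given here; the function names are hypothetical, the output extension is passed in because it depends on the command, and the workflow's actual rules may handle cases this sketch does not.

```python
# Input formats accepted by file_format, per this README.
FORMATS = ("bed.gz", "frag.gz", "bam", "cram")


def split_format(filename: str):
    """Split 'file.bed.gz' into ('file', 'bed.gz') for any supported format."""
    for fmt in FORMATS:
        suffix = "." + fmt
        if filename.endswith(suffix):
            return filename[: -len(suffix)], fmt
    raise ValueError(f"unsupported format: {filename}")


def filtered_name(filename: str) -> str:
    """Insert '.filtered' before the format."""
    stem, fmt = split_format(filename)
    return f"{stem}.filtered.{fmt}"


def command_output_name(filename: str, command: str, ext: str) -> str:
    """Insert the command name between the stem and the output extension."""
    stem, _ = split_format(filename)
    return f"{stem}.{command}.{ext}"


print(filtered_name("file.bed.gz"))
print(command_output_name("file.bed.gz", "frag_length_intervals", "bed"))
```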
Li JW, Bandaru R, Baliga K, Liu Y (2025). FinaleToolkit: accelerating cell-free DNA fragmentation
analysis with a high-speed computational toolkit. Bioinformatics Advances.
- Ravi Bandaru: [email protected]
- James Li: [email protected]
- Kundan Baliga: [email protected]
- Yaping Liu: [email protected]
See LICENSE.
