epifluidlab/finaletoolkit_workflow

finaletoolkit_workflow

A Snakemake workflow that automates cell-free DNA (cfDNA) fragmentation feature extraction with FinaleToolkit. It supports hg38 and T2T-CHM13, parallel processing, SLURM execution, and BED/BAM/CRAM inputs.

Contents

Install, Reference setup, Usage, YAML parameters, Genomes, Mappability, SLURM, Output naming

Genome support

Two genomes are supported out of the box, each with a runnable example config and a one-command reference setup:

Genome             Example config          Setup command
hg38               params.hg38.yaml        scripts/setup_reference.sh hg38 supplement 500
T2T-CHM13 (hs1)    params.t2t-chm13.yaml   scripts/setup_reference.sh t2t-chm13 supplement 500

Installation

git clone https://github.com/epifluidlab/finaletoolkit_workflow
cd finaletoolkit_workflow
conda env create -f environment.yml
conda activate finaletoolkit_workflow

Core tools (installed by the environment): finaletoolkit, snakemake, bedtools, htslib, samtools, pybigwig.

Reference & supplement setup

The workflow needs per-genome supplement files (chrom sizes, .2bit, interval bins, blacklist, gap, and a mappability bigWig). scripts/setup_reference.sh builds all of them, with the exact filenames the example configs expect:

# T2T-CHM13: UCSC hs1 2bit/sizes + BEDbase excluderanges blacklist + Zenodo mappability track
scripts/setup_reference.sh t2t-chm13 supplement 500

# hg38: UCSC 2bit/sizes + Boyle-Lab Blacklist + finaletoolkit gap-bed + Zenodo mappability track
scripts/setup_reference.sh hg38 supplement 500

This writes into supplement/:

<g>.chrom.sizes   <g>.2bit   <g>.<N>kb.bins   <g>.delfi.chrom.sizes   <g>.blacklist.bed   <g>.45mer.mappability.bw
                  ( + hg38.gap.bed for hg38 only — T2T-CHM13 is gap-free )

Here <g> is hg38 or chm13. <g>.delfi.chrom.sizes lists only chr1–22, X, and Y, because DELFI requires centromere-bearing contigs; the general bins keep chrM/chrY. Everything except the mappability bigWig is built live from public sources (UCSC, Boyle-Lab Blacklist, BEDbase). The *.bins.filtered interval file used by DELFI is not produced by the setup script: the mappability filter generates it automatically during the run.
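Before launching a run, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of the workflow: the helper names are hypothetical, and it assumes that the third argument to setup_reference.sh (500 in the commands above) is the bin size in kb that fills the <N> placeholder.

```python
from pathlib import Path
from typing import List


def expected_supplement_files(genome: str, bin_kb: int) -> List[str]:
    """List the supplement filenames described above for one genome prefix."""
    names = [
        f"{genome}.chrom.sizes",
        f"{genome}.2bit",
        f"{genome}.{bin_kb}kb.bins",  # assumption: <N> is the bin size in kb
        f"{genome}.delfi.chrom.sizes",
        f"{genome}.blacklist.bed",
        f"{genome}.45mer.mappability.bw",
    ]
    if genome == "hg38":
        # Only hg38 ships a gap file; T2T-CHM13 is gap-free.
        names.append("hg38.gap.bed")
    return names


def missing_supplement_files(supplement_dir: str, genome: str, bin_kb: int) -> List[str]:
    """Return the expected filenames that are absent from supplement_dir."""
    root = Path(supplement_dir)
    return [
        name
        for name in expected_supplement_files(genome, bin_kb)
        if not (root / name).exists()
    ]
```

Calling missing_supplement_files("supplement", "chm13", 500) returns an empty list once the setup script has written every expected file.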

Usage

  1. Put input fragment/alignment files in input/ (or set input_dir).
  2. Pick a config and run:
snakemake --configfile params.t2t-chm13.yaml --cores <N> \
  --rerun-incomplete --default-resources "tmpdir='./tmp'"

--cores sets the number of CPU cores Snakemake may use; --jobs caps the number of concurrent jobs. Add -n for a dry run that previews the planned jobs without executing them.
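If the run is launched from a Python script rather than a shell, the same command can be assembled as an argument list. This is only a sketch: build_snakemake_command is a hypothetical helper, the core count of 8 is an arbitrary example value, and the flags are exactly those shown above.

```python
import shlex
from typing import List


def build_snakemake_command(configfile: str, cores: int, dry_run: bool = False) -> List[str]:
    """Assemble the snakemake invocation shown above as an argv list."""
    cmd = [
        "snakemake",
        "--configfile", configfile,
        "--cores", str(cores),
        "--rerun-incomplete",
        "--default-resources", "tmpdir='./tmp'",
    ]
    if dry_run:
        cmd.append("-n")  # dry run: list planned jobs without executing them
    return cmd


cmd = build_snakemake_command("params.t2t-chm13.yaml", cores=8, dry_run=True)
print(shlex.join(cmd))  # shell-quoted form, useful for logging
```

The list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems around the tmpdir value.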

SLURM execution

The SLURM executor plugin is already in environment.yml. Set your account/partition in slurm_profile/config.yaml, then submit:

./ftk_exc.sh params.t2t-chm13.yaml ./tmp     # runs in the background -> snakemake.log

Mappability

An interval bin is kept only if its mean mappability is ≥ mappability_threshold (the filtering is done by scripts/mappability_filter.py). Because the filter uses the per-bin mean, the kind of track matters:

  • Continuous track (recommended) — each position = 1 / (k-mer occurrences), so the bin mean is the average mappability (a position that maps twice contributes 0.5). The shipped hg38 and T2T-CHM13 tracks are both continuous 45-mer GenMap tracks (-E 0), hosted on Zenodo (record 20724659) and fetched automatically by setup_reference.sh. To build one for another genome / k-mer:

    conda create -n genmap -c bioconda -c conda-forge genmap ucsc-bedgraphtobigwig
    conda activate genmap
    scripts/genmap_mappability.sh genome.fa supplement/<g>.chrom.sizes \
        supplement/<g>.45mer.mappability.bw 45 0   # memory/time-heavy: use a compute node
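To see why a continuous track suits a mean-based filter, the per-bin calculation can be sketched in plain Python. This is an illustration only and not the code in scripts/mappability_filter.py: the function names, the toy occurrence counts, and the threshold of 0.9 are all assumptions made for the example.

```python
from typing import List


def bin_mean_mappability(kmer_occurrences: List[int]) -> float:
    """Mean of 1 / occurrences over the positions of one bin."""
    scores = [1.0 / n for n in kmer_occurrences]
    return sum(scores) / len(scores)


def keep_bin(kmer_occurrences: List[int], threshold: float) -> bool:
    """Keep the bin if its mean mappability is at least the threshold."""
    return bin_mean_mappability(kmer_occurrences) >= threshold


# Toy bin of four positions: three unique k-mers and one that maps twice.
toy_bin = [1, 1, 1, 2]
mean = bin_mean_mappability(toy_bin)  # (1 + 1 + 1 + 0.5) / 4 = 0.875
print(mean, keep_bin(toy_bin, threshold=0.9))  # 0.875 False
```

A single twice-mapping position lowers the mean gradually instead of discarding the bin outright, which is what makes the threshold meaningful on a continuous track.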

T2T-CHM13 specifics: the assembly is gap-free and has no finaletoolkit gap preset, so DELFI runs without a gap file (centromeres are already removed by the mappability + blacklist filtering); the blacklist is the excluderanges T2T.excluderanges set. DELFI uses <g>.delfi.chrom.sizes (chr1–22, X, Y) because finaletoolkit crashes on contigs without a centromere (e.g. chrM) when a gap file is supplied.

Data availability

The continuous 45-mer mappability bigWigs (hg38, T2T-CHM13) are deposited in the FinaleToolkit Dataset on Zenodo — DOI 10.5281/zenodo.20724659 (concept DOI 10.5281/zenodo.14284132 always resolves to the latest version). All other supplement files are built live from public sources (UCSC, Boyle-Lab Blacklist, BEDbase) by setup_reference.sh.

YAML parameters

  • Required: input_dir, output_dir, file_format (bed.gz, frag.gz, bam, or cram).
  • Optional: supplement_dir, interval_file, mappability_file, mappability_threshold, and any FinaleToolkit command.
  • Commands: enable a FinaleToolkit command by writing its name with underscores instead of hyphens (e.g. adjust-wps becomes adjust_wps: True); set a flag by appending _<flag> to the command name (e.g. coverage_mapq: 30). The workflow respects command dependencies (e.g. mds needs end_motifs, regional_mds needs interval_end_motifs). The per-region motif-diversity step is regional_mds. See params.yaml for the fully annotated list of every option.
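The naming rule and the two dependencies mentioned above can be sketched as follows. The helper names and the PREREQUISITES table are hypothetical and cover only the examples given in this section; how the workflow itself resolves dependencies is not shown here.

```python
from typing import Dict, Optional, Set

# Only the dependencies named in the text above; the workflow may define more.
PREREQUISITES: Dict[str, str] = {
    "mds": "end_motifs",
    "regional_mds": "interval_end_motifs",
}


def config_key(command: str, flag: Optional[str] = None) -> str:
    """Turn a FinaleToolkit command (and optional flag) into a YAML key."""
    key = command.replace("-", "_")
    return f"{key}_{flag}" if flag else key


def implied_prerequisites(config: Dict[str, object]) -> Set[str]:
    """Return the prerequisite commands implied by the enabled commands."""
    return {
        prereq
        for command, prereq in PREREQUISITES.items()
        if config.get(command)
    }


print(config_key("adjust-wps"))                # adjust_wps
print(config_key("coverage", "mapq"))          # coverage_mapq
print(implied_prerequisites({"mds": True}))    # {'end_motifs'}
```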

Output file naming

  • Filtered files get .filtered before the format (e.g. file.filtered.bed.gz).
  • Command outputs insert the command name (e.g. file.frag_length_intervals.bed).
  • Each input is processed for every enabled command.
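The two naming rules can be expressed as small string helpers. This is a sketch with hypothetical function names; the output extension is passed in explicitly because it depends on the command, as the frag_length_intervals example suggests.

```python
def filtered_name(filename: str, file_format: str) -> str:
    """Insert '.filtered' before the format suffix of an input filename."""
    suffix = "." + file_format
    if not filename.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    stem = filename[: -len(suffix)]
    return f"{stem}.filtered{suffix}"


def command_output_name(stem: str, command: str, extension: str) -> str:
    """Insert the command name between the file stem and the output extension."""
    return f"{stem}.{command}.{extension}"


print(filtered_name("file.bed.gz", "bed.gz"))                      # file.filtered.bed.gz
print(command_output_name("file", "frag_length_intervals", "bed"))  # file.frag_length_intervals.bed
```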

Citation

Li JW, Bandaru R, Baliga K, Liu Y (2025). FinaleToolkit: accelerating cell-free DNA fragmentation analysis with a high-speed computational toolkit. Bioinformatics Advances.

Contact

License

See LICENSE.
