Skip to content

odcambc/STROMBOLI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

STROMBOLI: Sequencing Through Rapid Optimization of Mutational Barcode Orientation and Linkage Identification

A Snakemake pipeline for nanopore-based barcode variant mapping

This is a Snakemake pipeline for processing barcoded long-read sequencing data. The pipeline is designed to take noisy nanopore data, robustly identify and cluster barcode sequences, then call consensus variants for each barcode cluster.

Quick start

git clone https://github.com/odcambc/STROMBOLI
cd STROMBOLI
conda env create --file stromboli_env.yaml
conda activate STROMBOLI

Note: for ARM64 Macs, try the following (assuming Rosetta is installed):

git clone https://github.com/odcambc/STROMBOLI
cd STROMBOLI
CONDA_SUBDIR=osx-64 conda env create --file stromboli_env.yaml
conda activate STROMBOLI

If the environment installed and activated properly, edit the configuration files in the config directory as needed. Then run the pipeline with:

snakemake -s workflow/Snakefile --software-deployment-method conda --cores 16

Overview

The pipeline proceeds in the following steps:

  1. Barcode identification: Putative barcode sequences are identified by matching constant flanking regions using cutadapt.
  2. Barcode clustering: The barcodes are clustered and canonical sequences are identified using starcode.
  3. Variant sequence grouping: All reads corresponding to a single barcode are grouped together.
  4. Read mapping: Each barcode group is mapped to a reference genome using minimap2.
  5. Variant calling: Variants are called using bcftools.
  6. Variant aggregation For each sample, all variants are combined into a single barcode-> variant mapping file.

Roadmap

  • Complete pipeline (all rules and scripts) — runs end to end to a per-sample results/{sample}.variants.tsv
    • Scatter-gather rules for handling the barcode groups generated by starcode
  • Include fastq qualities when making cluster fastas
  • Add tests (unit + a synthetic ground-truth integration test; see Testing)
  • Add documentation (usage, installation, etc.) — in progress
  • Detect and flag barcode clashes — merged clusters (distinct barcodes pooled by starcode) and mixed barcodes (collisions, seen as intermediate-AF variants in single_qc) are flagged and excluded from the mapping, and recorded with the reason in results/{sample}.flagged.tsv. Thresholds: clash_merge_fraction, clash_mixed_af, exclude_clashes.
    • Robustly detect balanced (~50/50) collisions — the haploid caller resolves these to reference, so they produce no variant to flag; this folds into the per-position purity scan of the WT-calling item below.
  • QC and visualization output — three layers merged into one MultiQC report (raw-data quality + standard tool outputs + pipeline-specific metrics):
    • Emit a per-sample QC summary JSON (results/qc/{sample}.stromboli_qc.json, schema_version 4) aggregating the metrics the pipeline already produces — cluster_qc.tsv, flagged.tsv, the per-barcode af column, cutadapt read counts. Useful on its own to eyeball a run.
    • Raw-data QC with NanoPlot (the long-read standard; FastQC's per-cycle model doesn't transfer to ONT's kb-scale, heavy-tailed read-length and per-read quality distributions). Two runs per sample: the raw input reads (library/instrument health) and the post-cutadapt barcode FASTQ (extraction yield/quality) — the delta between them is the barcode-detection rate (cutadapt runs --discard-untrimmed). NanoPlot emits its own NanoStats.txt, parsed by MultiQC's built-in NanoStat module (no separate NanoStat call). Outputs under results/qc/nanoplot/{raw,trimmed}/{sample}/. Note: each report's sample name is injected into the NanoStats.txt General summary: line (NanoPlot leaves it blank) so the two stages stay distinct in MultiQC — its default filename cleaning strips a .trimmed tail but keeps .raw, which would otherwise desync them.
    • Top-level multiqc aggregation rule — one all-samples target that depends on every per-sample QC artifact and renders results/multiqc/multiqc_report.html. Inputs: the two NanoStat reports, *.cutadapt.json (built-in module), and the *.stromboli_qc.json summaries (via the plugin below). rule all / get_input (in workflow/rules/common.smk) now targets this report. MultiQC is passed the declared inputs explicitly (not a directory scan) so the report is scoped to the run and can't pick up stale artifacts from another config; settings in config/multiqc_config.yaml.
    • MultiQC plugin multiqc-stromboli (separate pip-installable repo) consuming the summary JSON (schema_version 4) — general-stats table, clash bar plot, and plots for variant-consequence (DMS class breakdown), variant types (SNV/MNV/indel), ORF coverage profile + evenness (Gini, %-covered), cluster size, allele fraction, variants-per-barcode, barcodes-per-variant (mapping redundancy), cluster purity, and barcode composition (length, GC), all verified end-to-end against real output. Every metric is derived from data already in the per-barcode tables — coverage from amino_acid_change codon positions (ORF length via the orf config param), variant type from INDEL/dna_change, redundancy from variants.tsv — with no extra pipeline rules.
    • Deliberately not emitting samtools stats / bcftools stats: the per-barcode scatter produces thousands-plus of tiny transient BAMs/BCFs, with no single alignment/variant artifact to summarize. Mapping and coverage QC stay in the per-cluster metrics (cluster_qc.tsv) instead.
    • Env: add nanoplot and multiqc to stromboli_env.yaml (both on bioconda).
  • Pre-run parameter tuning/QC: run the cheap front of the pipeline (cutadapt → starcode → clustering) on a subsample and print a few numbers — barcode detection rate, cluster count, cluster-size distribution vs candidate min_cluster_size, and the FDR estimate at candidate depth/AF thresholds — to sanity-check parameters before the full run. A small script, not a parameter-space-exploration engine.
  • (Separate project — likely its own module/web app) Library & sequencing-run design planner — plan barcode length vs variant/insert size, plan sequencing depth for variant mapping, and set FDR / cluster-size / AF thresholds from design parameters. The core statistical model is prototyped — read-capture + barcode-collision and length requirements across random vs error-correcting barcode sets, validated against a real run (a ~5% ONT error rate needs ~25–31 bp; HiFi ~13–15 bp), building on the tools/fdr_estimator.py math. Remaining: package the analysis as a standalone module and add the planner front-end.
  • Ship a recommended single_qc "high-recall" config profile — revised by analysis on a real shallow library: single_qc floods systematic homopolymer indels that depth does not remove (it must be paired with indel filtering), and its SNV-recall edge only appears at high depth — the two modes converge by depth ≈20. The substantive lever is a per-barcode depth floor, not the mode; the new cluster_depth column lets a run be filtered to a target FDR without discarding barcodes. A profile should bundle indel-dropping + a depth floor rather than just min_cluster_size ≈ 15.
  • Extend tools/fdr_estimator.py to model a distribution of systematic (homopolymer) error fractions rather than a single f_sys (real hotspots span ~0.3–0.8)
  • (Deferred) Vectorized multi-sample calling for very large libraries. Tag reads with RG=barcode, map once, and run a single multi-sample mpileup/call/csq (each barcode a sample column) instead of the per-barcode scatter. The prototype (experiments/prototype_multisample.py) measured ~7× faster than the fused per-barcode pipeline, single-threaded, ~190× fewer process spawns, and recovered the synthetic truth exactly. Not adopted because it collapses the per-barcode evidence trail that keeps errors inspectable (FDR, coverage/no-call, and WT calling all reason per cluster). Revisit only if library scale (≳500k barcodes) forces it — and it needs (a) FP/recall-equivalence validation on hard data and (b) batching into sample-groups to bound the multi-sample matrix.
  • Perform parameter optimization
    • cutadapt stringency
      • Add a minimum barcode-length filter — filter_awk bounds only the max (max_barcode_length); short mis-extractions (e.g. 5–12 bp, which collide trivially) currently pass through.
    • starcode clustering parameters
      • Revisit barcode_distance — analyzed on a real run: -d 5 sits just below the collision-percolation cliff (at -d ≥6 the 20-mer graph percolates and collapses distinct barcodes), so it is appropriate; only ~3 merges were actually flagged (starcode's abundance-ratio gating protects equal-abundance barcodes), and the low-depth tail is real low-abundance molecules, not unmerged error satellites. The real lever is barcode design (longer / error-correcting sets), not the clustering distance.
    • minimap2 mapping parameters
    • bcftools variant calling parameters
      • Minimum barcode group size to call variants? — min_cluster_size; choose it with tools/fdr_estimator.py
      • Q-score weighting? — use_qual (consensus) / single_qc allele-fraction calling
      • Annotate each call with its cluster read depth (cluster_depth column on results/{sample}.variants.tsv) so a run can be filtered to a target FDR post-hoc — per-barcode calls below ~10 reads are noise-dominated regardless of calling mode.
      • Affirmatively call WT as WT (depth-gated reference calls); never impute missing data as WT. A barcode with no variant is only wild-type where it has sufficient coverage — uncovered/low-depth positions are explicit no-calls, not reference. Requires moving past bcftools call -v (variants-only) to a per-barcode callability/coverage track so "confirmed reference" is distinguishable from "no data".

Variant-calling modes

The per-cluster variant caller is selectable via calling_mode in the config:

  • double (default): build a consensus for the cluster, re-map it, then call. A depth-1 call with very high precision (near-zero false positives) that is robust even for shallow clusters, at some cost in recall near indels.
  • single_qc: call directly on the cluster read pileup and filter by ALT allele fraction (qc_min_af, qc_min_alt_reads). This keeps the read-depth information the consensus discards and recovers more true variants, but needs adequate depth — pair it with a larger min_cluster_size.

tools/fdr_estimator.py estimates the expected false-positive variant rate per cluster (and barcode-collision rate) as a function of sequence length, barcode length, error rate, depth and AF threshold, to guide those threshold choices:

python tools/fdr_estimator.py                  # report with amplicon defaults
python tools/fdr_estimator.py --tau 0.85 --error-rate 0.04

Barcode clash detection

A barcode → variant assignment is untrustworthy when reads from two different variants end up under one barcode, either because distinct barcodes were merged by starcode (within barcode_distance) or because of a barcode collision (two molecules, one barcode). Both are flagged and excluded from results/{sample}.variants.tsv, and listed with the reason in results/{sample}.flagged.tsv:

  • merged — the cluster's 2nd-most-abundant member barcode holds ≥ clash_merge_fraction of the reads (a real second barcode, not an error cloud).
  • mixed — in single_qc, a variant called at intermediate allele fraction (clash_mixed_af ≤ AF < qc_min_af) indicates two variants sharing the barcode.

Set exclude_clashes: false to keep flagged barcodes in the mapping (still recorded in the flagged file). If clashes are a large fraction of a library, that itself signals a design problem (too-short barcodes, too-large barcode_distance, or low diversity) — see tools/fdr_estimator.py's collision estimate.

Testing

pytest -m "not integration"   # fast unit tests for the pure-Python scripts
pytest -m integration         # full pipeline on synthetic data (needs the conda env)

tests/generate_synthetic_data.py builds a small, seeded dataset with a known barcode→mutation truth table; the integration test asserts the pipeline recovers it in both calling modes. experiments/ holds the analyses used to compare the calling strategies and calibrate thresholds.

Installation

Dependencies

Via conda (recommended)

The simplest way to handle dependencies is with Conda and the provided environment file.

conda env create --file stromboli_env.yaml

If using an ARM64 Mac, try the following:

CONDA_SUBDIR=osx-64 conda env create --file stromboli_env.yaml

Manually

The following are the dependencies required to run the pipeline:

License

This is licensed under the MIT license. See the LICENSE file for details.

Contributing

Contributions and feedback are welcome. Please submit an issue or pull request.

Getting help

For any issues, please open an issue on the GitHub repository. For questions or feedback, email Chris.

About

STROMBOLI is a snakemake pipeline for associating barcodes with genetic variants using long-read sequencing

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages