STROMBOLI: Sequencing Through Rapid Optimization of Mutational Barcode Orientation and Linkage Identification
A Snakemake pipeline for nanopore-based barcode variant mapping
This is a Snakemake pipeline for processing barcoded long-read sequencing data. The pipeline is designed to take noisy nanopore data, robustly identify and cluster barcode sequences, then call consensus variants for each barcode cluster.
git clone https://github.com/odcambc/STROMBOLI
cd STROMBOLI
conda env create --file stromboli_env.yaml
conda activate STROMBOLINote: for ARM64 Macs, try the following (assuming Rosetta is installed):
git clone https://github.com/odcambc/STROMBOLI
cd STROMBOLI
CONDA_SUBDIR=osx-64 conda env create --file stromboli_env.yaml
conda activate STROMBOLIIf the environment installed and activated properly,
edit the configuration files in the config directory as needed. Then run the pipeline with:
snakemake -s workflow/Snakefile --software-deployment-method conda --cores 16The pipeline proceeds in the following steps:
- Barcode identification: Putative barcode sequences are identified by matching constant flanking regions using cutadapt.
- Barcode clustering: The barcodes are clustered and canonical sequences are identified using starcode.
- Variant sequence grouping: All reads corresponding to a single barcode are grouped together.
- Read mapping: Each barcode group is mapped to a reference genome using minimap2.
- Variant calling: Variants are called using bcftools.
- Variant aggregation For each sample, all variants are combined into a single barcode-> variant mapping file.
- Complete pipeline (all rules and scripts) — runs end to end to a per-sample
results/{sample}.variants.tsv- Scatter-gather rules for handling the barcode groups generated by starcode
- Include fastq qualities when making cluster fastas
- Add tests (unit + a synthetic ground-truth integration test; see Testing)
- Add documentation (usage, installation, etc.) — in progress
- Detect and flag barcode clashes — merged clusters (distinct barcodes pooled by starcode) and mixed barcodes (collisions, seen as intermediate-AF variants in
single_qc) are flagged and excluded from the mapping, and recorded with the reason inresults/{sample}.flagged.tsv. Thresholds:clash_merge_fraction,clash_mixed_af,exclude_clashes.- Robustly detect balanced (~50/50) collisions — the haploid caller resolves these to reference, so they produce no variant to flag; this folds into the per-position purity scan of the WT-calling item below.
- QC and visualization output — three layers merged into one MultiQC report
(raw-data quality + standard tool outputs + pipeline-specific metrics):
- Emit a per-sample QC summary JSON (
results/qc/{sample}.stromboli_qc.json,schema_version4) aggregating the metrics the pipeline already produces —cluster_qc.tsv,flagged.tsv, the per-barcodeafcolumn, cutadapt read counts. Useful on its own to eyeball a run. - Raw-data QC with NanoPlot (the long-read standard; FastQC's per-cycle model
doesn't transfer to ONT's kb-scale, heavy-tailed read-length and per-read quality
distributions). Two runs per sample: the raw input reads (library/instrument health)
and the post-cutadapt barcode FASTQ (extraction yield/quality) — the delta between
them is the barcode-detection rate (cutadapt runs
--discard-untrimmed). NanoPlot emits its ownNanoStats.txt, parsed by MultiQC's built-in NanoStat module (no separate NanoStat call). Outputs underresults/qc/nanoplot/{raw,trimmed}/{sample}/. Note: each report's sample name is injected into theNanoStats.txtGeneral summary:line (NanoPlot leaves it blank) so the two stages stay distinct in MultiQC — its default filename cleaning strips a.trimmedtail but keeps.raw, which would otherwise desync them. - Top-level
multiqcaggregation rule — one all-samples target that depends on every per-sample QC artifact and rendersresults/multiqc/multiqc_report.html. Inputs: the two NanoStat reports,*.cutadapt.json(built-in module), and the*.stromboli_qc.jsonsummaries (via the plugin below).rule all/get_input(inworkflow/rules/common.smk) now targets this report. MultiQC is passed the declared inputs explicitly (not a directory scan) so the report is scoped to the run and can't pick up stale artifacts from another config; settings inconfig/multiqc_config.yaml. - MultiQC plugin
multiqc-stromboli(separate pip-installable repo) consuming the summary JSON (schema_version 4) — general-stats table, clash bar plot, and plots for variant-consequence (DMS class breakdown), variant types (SNV/MNV/indel), ORF coverage profile + evenness (Gini, %-covered), cluster size, allele fraction, variants-per-barcode, barcodes-per-variant (mapping redundancy), cluster purity, and barcode composition (length, GC), all verified end-to-end against real output. Every metric is derived from data already in the per-barcode tables — coverage fromamino_acid_changecodon positions (ORF length via theorfconfig param), variant type fromINDEL/dna_change, redundancy fromvariants.tsv— with no extra pipeline rules. - Deliberately not emitting
samtools stats/bcftools stats: the per-barcode scatter produces thousands-plus of tiny transient BAMs/BCFs, with no single alignment/variant artifact to summarize. Mapping and coverage QC stay in the per-cluster metrics (cluster_qc.tsv) instead. - Env: add
nanoplotandmultiqctostromboli_env.yaml(both on bioconda).
- Emit a per-sample QC summary JSON (
- Pre-run parameter tuning/QC: run the cheap front of the pipeline (cutadapt →
starcode → clustering) on a subsample and print a few numbers — barcode detection
rate, cluster count, cluster-size distribution vs candidate
min_cluster_size, and the FDR estimate at candidate depth/AF thresholds — to sanity-check parameters before the full run. A small script, not a parameter-space-exploration engine. - (Separate project — likely its own module/web app) Library & sequencing-run design planner — plan barcode length vs variant/insert size, plan sequencing depth for variant mapping, and set FDR / cluster-size / AF thresholds from design parameters. The core statistical model is prototyped — read-capture + barcode-collision and length requirements across random vs error-correcting barcode sets, validated against a real run (a ~5% ONT error rate needs ~25–31 bp; HiFi ~13–15 bp), building on the
tools/fdr_estimator.pymath. Remaining: package the analysis as a standalone module and add the planner front-end. - Ship a recommended
single_qc"high-recall" config profile — revised by analysis on a real shallow library:single_qcfloods systematic homopolymer indels that depth does not remove (it must be paired with indel filtering), and its SNV-recall edge only appears at high depth — the two modes converge by depth ≈20. The substantive lever is a per-barcode depth floor, not the mode; the newcluster_depthcolumn lets a run be filtered to a target FDR without discarding barcodes. A profile should bundle indel-dropping + a depth floor rather than justmin_cluster_size≈ 15. - Extend
tools/fdr_estimator.pyto model a distribution of systematic (homopolymer) error fractions rather than a singlef_sys(real hotspots span ~0.3–0.8) - (Deferred) Vectorized multi-sample calling for very large libraries. Tag reads with
RG=barcode, map once, and run a single multi-samplempileup/call/csq(each barcode a sample column) instead of the per-barcode scatter. The prototype (experiments/prototype_multisample.py) measured ~7× faster than the fused per-barcode pipeline, single-threaded, ~190× fewer process spawns, and recovered the synthetic truth exactly. Not adopted because it collapses the per-barcode evidence trail that keeps errors inspectable (FDR, coverage/no-call, and WT calling all reason per cluster). Revisit only if library scale (≳500k barcodes) forces it — and it needs (a) FP/recall-equivalence validation on hard data and (b) batching into sample-groups to bound the multi-sample matrix. - Perform parameter optimization
- cutadapt stringency
- Add a minimum barcode-length filter —
filter_awkbounds only the max (max_barcode_length); short mis-extractions (e.g. 5–12 bp, which collide trivially) currently pass through.
- Add a minimum barcode-length filter —
- starcode clustering parameters
- Revisit
barcode_distance— analyzed on a real run:-d 5sits just below the collision-percolation cliff (at-d ≥6the 20-mer graph percolates and collapses distinct barcodes), so it is appropriate; only ~3 merges were actually flagged (starcode's abundance-ratio gating protects equal-abundance barcodes), and the low-depth tail is real low-abundance molecules, not unmerged error satellites. The real lever is barcode design (longer / error-correcting sets), not the clustering distance.
- Revisit
- minimap2 mapping parameters
- bcftools variant calling parameters
- Minimum barcode group size to call variants? —
min_cluster_size; choose it withtools/fdr_estimator.py - Q-score weighting? —
use_qual(consensus) /single_qcallele-fraction calling - Annotate each call with its cluster read depth (
cluster_depthcolumn onresults/{sample}.variants.tsv) so a run can be filtered to a target FDR post-hoc — per-barcode calls below ~10 reads are noise-dominated regardless of calling mode. - Affirmatively call WT as WT (depth-gated reference calls); never impute missing data as WT. A barcode with no variant is only wild-type where it has sufficient coverage — uncovered/low-depth positions are explicit no-calls, not reference. Requires moving past
bcftools call -v(variants-only) to a per-barcode callability/coverage track so "confirmed reference" is distinguishable from "no data".
- Minimum barcode group size to call variants? —
- cutadapt stringency
The per-cluster variant caller is selectable via calling_mode in the config:
double(default): build a consensus for the cluster, re-map it, then call. A depth-1 call with very high precision (near-zero false positives) that is robust even for shallow clusters, at some cost in recall near indels.single_qc: call directly on the cluster read pileup and filter by ALT allele fraction (qc_min_af,qc_min_alt_reads). This keeps the read-depth information the consensus discards and recovers more true variants, but needs adequate depth — pair it with a largermin_cluster_size.
tools/fdr_estimator.py estimates the expected false-positive variant rate per
cluster (and barcode-collision rate) as a function of sequence length, barcode
length, error rate, depth and AF threshold, to guide those threshold choices:
python tools/fdr_estimator.py # report with amplicon defaults
python tools/fdr_estimator.py --tau 0.85 --error-rate 0.04A barcode → variant assignment is untrustworthy when reads from two different variants
end up under one barcode, either because distinct barcodes were merged by starcode
(within barcode_distance) or because of a barcode collision (two molecules, one
barcode). Both are flagged and excluded from results/{sample}.variants.tsv, and listed
with the reason in results/{sample}.flagged.tsv:
- merged — the cluster's 2nd-most-abundant member barcode holds ≥
clash_merge_fractionof the reads (a real second barcode, not an error cloud). - mixed — in
single_qc, a variant called at intermediate allele fraction (clash_mixed_af≤ AF <qc_min_af) indicates two variants sharing the barcode.
Set exclude_clashes: false to keep flagged barcodes in the mapping (still recorded in
the flagged file). If clashes are a large fraction of a library, that itself signals a
design problem (too-short barcodes, too-large barcode_distance, or low diversity) —
see tools/fdr_estimator.py's collision estimate.
pytest -m "not integration" # fast unit tests for the pure-Python scripts
pytest -m integration # full pipeline on synthetic data (needs the conda env)tests/generate_synthetic_data.py builds a small, seeded dataset with a known
barcode→mutation truth table; the integration test asserts the pipeline recovers it
in both calling modes. experiments/ holds the analyses used to compare the calling
strategies and calibrate thresholds.
The simplest way to handle dependencies is with Conda and the provided environment file.
conda env create --file stromboli_env.yamlIf using an ARM64 Mac, try the following:
CONDA_SUBDIR=osx-64 conda env create --file stromboli_env.yamlThe following are the dependencies required to run the pipeline:
This is licensed under the MIT license. See the LICENSE file for details.
Contributions and feedback are welcome. Please submit an issue or pull request.
For any issues, please open an issue on the GitHub repository. For questions or feedback, email Chris.