STROMBOLI: Sequencing Through Rapid Optimization of Mutational Barcode Orientation and Linkage Identification

A Snakemake pipeline for nanopore-based barcode variant mapping

STROMBOLI processes barcoded long-read sequencing data. It takes noisy nanopore reads, identifies and clusters the barcode sequences, then calls consensus variants for each barcode cluster.

Quick start

git clone https://github.com/odcambc/STROMBOLI
cd STROMBOLI
conda env create --file stromboli_env.yaml
conda activate STROMBOLI

Note: for ARM64 Macs, try the following (assuming Rosetta is installed):

git clone https://github.com/odcambc/STROMBOLI
cd STROMBOLI
CONDA_SUBDIR=osx-64 conda env create --file stromboli_env.yaml
conda activate STROMBOLI

If the environment installed and activated properly, edit the configuration files in the config directory as needed. Then run the pipeline with:

snakemake -s workflow/Snakefile --software-deployment-method conda --cores 16

Overview

The pipeline proceeds in the following steps:

Barcode identification: Putative barcode sequences are identified by matching constant flanking regions using cutadapt.
Barcode clustering: The barcodes are clustered and canonical sequences are identified using starcode.
Variant sequence grouping: All reads corresponding to a single barcode are grouped together.
Read mapping: Each barcode group is mapped to a reference genome using minimap2.
Variant calling: Variants are called using bcftools.
Variant aggregation: For each sample, all variants are combined into a single barcode → variant mapping file.
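
As an illustration of the variant sequence grouping step above, here is a minimal Python sketch that collects read IDs under the canonical barcode of their cluster. The function name, the two input dictionaries, and the example barcodes are hypothetical; the pipeline's own scripts may represent this data differently.

```python
from collections import defaultdict


def group_reads_by_cluster(read_barcodes, barcode_to_canonical):
    """Group read IDs under the canonical barcode of their cluster.

    read_barcodes: dict of read ID -> raw barcode extracted from that read.
    barcode_to_canonical: dict of raw barcode -> canonical (cluster) barcode.
    Reads whose raw barcode was not assigned to any cluster are skipped.
    """
    groups = defaultdict(list)
    for read_id, raw_barcode in read_barcodes.items():
        canonical = barcode_to_canonical.get(raw_barcode)
        if canonical is not None:
            groups[canonical].append(read_id)
    return dict(groups)


# Hypothetical inputs: ACGA is an error variant of ACGT, GGGG was not clustered.
read_barcodes = {"r1": "ACGT", "r2": "ACGA", "r3": "TTTT", "r4": "GGGG"}
barcode_to_canonical = {"ACGT": "ACGT", "ACGA": "ACGT", "TTTT": "TTTT"}
groups = group_reads_by_cluster(read_barcodes, barcode_to_canonical)
```

Each resulting group is what would then be mapped and variant-called as one unit.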

Roadmap

Variant-calling modes

The per-cluster variant caller is selectable via calling_mode in the config:

double (default): build a consensus for the cluster, re-map it, then call variants on the re-mapped consensus. Because the call is made on a single consensus sequence (depth 1), precision is very high (near-zero false positives) and holds even for shallow clusters, at some cost in recall near indels.
single_qc: call directly on the cluster read pileup and filter by ALT allele fraction and ALT read count (qc_min_af, qc_min_alt_reads). This keeps the read-depth information the consensus discards and recovers more true variants, but it needs adequate depth, so pair it with a larger min_cluster_size.
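
The single_qc filter described above can be pictured with a short Python sketch. It assumes the filter keeps a variant only when both thresholds are met; the function name and the default values are hypothetical placeholders, not the pipeline's actual defaults or implementation.

```python
def passes_single_qc(alt_reads, depth, qc_min_af=0.7, qc_min_alt_reads=3):
    """Return True if a pileup variant clears both single_qc thresholds.

    alt_reads: number of reads in the cluster supporting the ALT allele.
    depth: total number of reads covering the site.
    The default thresholds are illustrative assumptions only.
    """
    if depth <= 0:
        return False
    allele_fraction = alt_reads / depth
    return allele_fraction >= qc_min_af and alt_reads >= qc_min_alt_reads


# A well-supported variant, a shallow cluster, and a low-fraction variant.
well_supported = passes_single_qc(alt_reads=9, depth=10)
too_shallow = passes_single_qc(alt_reads=2, depth=2)
low_fraction = passes_single_qc(alt_reads=5, depth=10)
```

The shallow case shows why this mode benefits from a larger min_cluster_size: a cluster of two reads can reach an allele fraction of 1.0 and still fail the read-count threshold.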

tools/fdr_estimator.py estimates the expected false-positive variant rate per cluster (and barcode-collision rate) as a function of sequence length, barcode length, error rate, depth and AF threshold, to guide those threshold choices:

python tools/fdr_estimator.py                  # report with amplicon defaults
python tools/fdr_estimator.py --tau 0.85 --error-rate 0.04
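
To give a feel for how such an estimate depends on depth and threshold, the sketch below uses a simple binomial model. It is not the model implemented in tools/fdr_estimator.py. It assumes that tau is the allele-fraction threshold, that errors are independent substitutions spread evenly over the three wrong bases, and that indels (which are common in nanopore data) can be ignored. All function names are hypothetical.

```python
import math


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def expected_false_positives(seq_len, depth, error_rate, tau):
    """Expected number of false substitution calls in one cluster.

    A false call at a site needs at least ceil(tau * depth) reads to carry
    the same wrong base; each read does so with probability error_rate / 3.
    """
    min_alt_reads = max(1, math.ceil(tau * depth))
    p_specific_base = error_rate / 3
    per_site = 3 * binomial_tail(depth, min_alt_reads, p_specific_base)
    return seq_len * per_site


# Illustrative values: a 1000 bp sequence at a 4% per-base error rate.
fp_depth_1 = expected_false_positives(1000, depth=1, error_rate=0.04, tau=0.85)
fp_depth_10 = expected_false_positives(1000, depth=10, error_rate=0.04, tau=0.85)
```

Under these assumptions the expected false-positive count falls steeply as depth grows, which is the trade-off the thresholds are meant to balance.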

Barcode clash detection

A barcode → variant assignment is untrustworthy when reads from two different variants end up under one barcode, either because distinct barcodes were merged by starcode (within barcode_distance) or because of a barcode collision (two molecules, one barcode). Both are flagged and excluded from results/{sample}.variants.tsv, and listed with the reason in results/{sample}.flagged.tsv:

merged — the cluster's 2nd-most-abundant member barcode holds ≥ clash_merge_fraction of the reads (a real second barcode, not an error cloud).
mixed — in single_qc, a variant called at intermediate allele fraction (clash_mixed_af ≤ AF < qc_min_af) indicates two variants sharing the barcode.
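
The two flags above can be expressed as a small Python sketch. It assumes the merge fraction is measured against all reads in the cluster, and that a cluster can carry both flags at once. The function name, input shapes, and default thresholds are hypothetical.

```python
def clash_reasons(member_counts, variant_afs,
                  clash_merge_fraction=0.2, clash_mixed_af=0.2, qc_min_af=0.7):
    """Return the list of clash flags that apply to one barcode cluster.

    member_counts: dict of member barcode -> read count within the cluster.
    variant_afs: allele fractions of the variants seen in the cluster pileup.
    The default thresholds are illustrative assumptions only.
    """
    reasons = []
    total_reads = sum(member_counts.values())
    counts = sorted(member_counts.values(), reverse=True)
    # merged: the second-most-abundant member barcode is too large to be noise.
    if len(counts) > 1 and total_reads > 0:
        if counts[1] / total_reads >= clash_merge_fraction:
            reasons.append("merged")
    # mixed: a variant sits at an intermediate allele fraction.
    if any(clash_mixed_af <= af < qc_min_af for af in variant_afs):
        reasons.append("mixed")
    return reasons


merged_case = clash_reasons({"ACGT": 60, "ACGA": 40}, [])
mixed_case = clash_reasons({"ACGT": 98, "ACGA": 2}, [0.4])
clean_case = clash_reasons({"ACGT": 100}, [0.95])
```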

Set exclude_clashes: false to keep flagged barcodes in the mapping (still recorded in the flagged file). If clashes are a large fraction of a library, that itself signals a design problem (too-short barcodes, too-large barcode_distance, or low diversity) — see tools/fdr_estimator.py's collision estimate.
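
For intuition about why short barcodes or low diversity produce collisions, the sketch below computes the chance that a given molecule shares its barcode with at least one other molecule. It assumes barcodes are drawn uniformly at random from all 4^k sequences, which real libraries only approximate, and it is not the estimate implemented in tools/fdr_estimator.py. The function name is hypothetical.

```python
def collision_probability(barcode_length, n_molecules):
    """P(a given molecule shares its barcode with another molecule).

    Assumes each molecule draws its barcode uniformly and independently
    from the 4 ** barcode_length possible sequences.
    """
    n_barcodes = 4 ** barcode_length
    return 1.0 - (1.0 - 1.0 / n_barcodes) ** (n_molecules - 1)


# Illustrative comparison: the same library size with short and long barcodes.
p_short = collision_probability(barcode_length=8, n_molecules=10_000)
p_long = collision_probability(barcode_length=16, n_molecules=10_000)
```

Under this model, lengthening the barcode lowers the collision probability sharply for a fixed library size.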

Testing

pytest -m "not integration"   # fast unit tests for the pure-Python scripts
pytest -m integration         # full pipeline on synthetic data (needs the conda env)

tests/generate_synthetic_data.py builds a small, seeded dataset with a known barcode→mutation truth table; the integration test asserts the pipeline recovers it in both calling modes. experiments/ holds the analyses used to compare the calling strategies and calibrate thresholds.
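
The idea of a seeded truth table can be sketched as follows. This is not the code in tests/generate_synthetic_data.py; the function name, arguments, and the choice of one point substitution per barcode are assumptions made for illustration.

```python
import random


def make_truth_table(reference, n_barcodes, barcode_length, seed):
    """Build a reproducible barcode -> (position, ref base, alt base) table.

    Each barcode is unique and is assigned one random point substitution
    in the reference. The same seed always yields the same table.
    """
    if n_barcodes > 4 ** barcode_length:
        raise ValueError("not enough distinct barcodes of this length")
    rng = random.Random(seed)
    truth = {}
    while len(truth) < n_barcodes:
        barcode = "".join(rng.choice("ACGT") for _ in range(barcode_length))
        if barcode in truth:
            continue  # keep barcodes unique
        position = rng.randrange(len(reference))
        ref_base = reference[position]
        alt_base = rng.choice([b for b in "ACGT" if b != ref_base])
        truth[barcode] = (position, ref_base, alt_base)
    return truth


reference = "ACGTACGTACGTACGTACGT"
truth_a = make_truth_table(reference, n_barcodes=5, barcode_length=12, seed=42)
truth_b = make_truth_table(reference, n_barcodes=5, barcode_length=12, seed=42)
```

Because the generator is seeded, a test can regenerate the same table on every run and compare the pipeline output against it.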

Installation

Dependencies

Via conda (recommended)

The simplest way to handle dependencies is with Conda and the provided environment file.

conda env create --file stromboli_env.yaml

If using an ARM64 Mac, try the following:

CONDA_SUBDIR=osx-64 conda env create --file stromboli_env.yaml

Manually

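To install manually, provide the tools the pipeline calls, which include Snakemake, cutadapt, starcode, minimap2, and bcftools. See stromboli_env.yaml for the full environment specification.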

License

This is licensed under the MIT license. See the LICENSE file for details.

Contributing

Contributions and feedback are welcome. Please submit an issue or pull request.

Getting help

For any issues, please open an issue on the GitHub repository. For questions or feedback, email Chris.
