strain_mapper is a pipeline for mapping short read sequences of bacteria to a given reference.
strain_mapper maps short read sequences to a given reference genome using bowtie2. It generates a VCF containing genotype likelihoods for the alignment using bcftools mpileup and subsequently uses bcftools call to call the variants. This variant information is then used to create a consensus sequence based on the mapped reads.
The pipeline will build reference and bowtie2 indexes if it doesn't find them in the same directory as the supplied --reference.
All relevant intermediate files are currently published in process-specific directories within the supplied --outdir directory.
-
Load nextflow and singularity modules:
module load nextflow ISG/singularity
-
Clone the repo:
git clone --recurse-submodules [email protected]:sanger-pathogens/pipelines/strain_mapper.git cd strain_mapper
-
Start the pipeline
For example input, please see Generating a manifest.Example:
nextflow run . --manifest_of_reads ./test_data/inputs/new_manifest.csv --reference ./test_data/inputs/ref/GCF_000011265.1.fna --outdir my_outputIt is good practice to submit a dedicated job for the nextflow master process (use the
oversubscribedqueue):bsub -o output.o -e error.e -q oversubscribed -R "select[mem>4000] rusage[mem=4000]" -M4000 nextflow run . --manifest_of_reads ./test_data/inputs/new_manifest.csv --reference ./test_data/inputs/ref/GCF_000011265.1.fna --outdir my_output
See usage for all available pipeline options.
-
Once your run has finished, check output in the
outdirand clean up any intermediate files. To do this (assuming no other pipelines are running from the current working directory) run:rm -rf work .nextflow*
Manifests supplied as an argument to --manifest_of_reads should be of of the following format:
ID,R1,R2
test_id,./test_data/inputs/test_1.fastq.gz,./test_data/inputs/test_2.fastq.gzWhere column ID can be an arbitrary sample identifier, R1 is a .fastq.gz file of forward reads, R2 is the mate .fastq.gz file containing reverse reads.
Scripts have been developed to generate manifests appropriate for this pipeline:
-
To generate a manifest from a file of lane identifiers visible to
pf, use this script. -
To generate a manifest from a file of custom .fastq.gz paths, use this script.
Please run --help on these scripts for more information on script usage.
Usage:
strain-mapper [--manifest_of_reads <path to manifest>] [--manifest_of_lanes <path to manifest>] [--studyid <study_id>, [--runid <run_id>, [--laneid <lane_id>, [--plexid <plex_id>]]]] --reference <path to reference> --outdir <path to results folder>
Input parameters:
Sequencing reads:
There are two ways of providing input reads, which can be combined:
1) through direct input of compressed fastq sequence reads files. This kind of input is passed by specifying the paths to the
read files via a manifest listing the pair of read files pertaing to a sample, one per row.
--manifest_of_reads Manifest containing per-sample paths to .fastq.gz files (optional)
2) through specification of data to be downloaded from iRODS. Each sample is defined by a combination of study, run, lane and plex ids
(these ids correspond to the reference of the sequencing experiment). Run, lane and plex ids are not mandatory: when provided, these
parameters gradually restrict of files to be downloaded; when ommitted, samples for all possible values are retrieved.
This information can be provided via a combination of workflow parameters passed on through command line options: --study, --runid,
--laneid and --plexid; this defines a single sequencing dataset based on a combination of study, run, lane and plex ids.
--studyid ID of sequencing study including read data to use as pipeline input (mandatory)
--runid ID of sequencing run including read data to use as pipeline input (mandatory)
--laneid ID of sequencing lane (as in a lane within of a flow cell) including read data to use as pipeline input (mandatory)
--plexid ID of sequencing lane multiplex tag index including read data to use as pipeline input (mandatory)
Alternatively, the user can provide a manifest listing a batch of such combinations.
--manifest_of_lanes Manifest containing specification of data to be downloaded from iRODS (optional)
Each row defines a sequencing dataset based on a combination of study, run, lane and plex ids.
Run, lane and plex ids are not mandatory (field in csv file can be left blank);
when provided, these gradually restrict of files to be downloaded.
NB: the real lane id is different from the the so-called "lane" id, a term commonly used in Sanger referring to this sequencing run output unit, usually labelled with this syntax: 48106_1#83.
In this, the run id is 48106, the (real) lane id is 1 and the plex id is 83.
Other input parameters:
--reference Reference to map reads against (mandatory)
Output parameters:
--outdir Specify output directory [default: ./results] (optional)
General options:
--help Print summary of main parameters and options (optional)
--help_all Print extensive list of parameters and options (optional)
--help Print this help message (optional)The pipeline applies stringent quality filters to ensure high-confidence consensus sequences:
- Quality Thresholds:
-
Minimum quality score:
≥50 (QUAL) -
Read support:
-
≥3 forward reads (
INFO/ADF[0]) -
≥3 reverse reads (
INFO/ADR[0])
-
-
Coverage: ≥8 total reads at position (
INFO/DP)
- Genotype Requirements:
-
Only homozygous calls (
0/0or1/1) are included -
All multiallelic/heterozygous calls (
0/1) are marked as'Het'and excluded
- Filter Classification:
-
'PASS': Meets all quality thresholds -
'Het': Passes quality but has heterozygous genotype (excluded) -
'LowQual': Fails one or more quality thresholds (excluded)
To edit these filters, please follow the documentation as seen on the BCFTOOLS view page under expressions:
https://samtools.github.io/bcftools/bcftools.html#expressions
When using --skip_filtering:
-
All variants are included regardless of:
- Quality scores - Read depth - Genotype (including heterozygous calls) -
Results may contain:
- Lower-confidence variants - More ambiguous positions - Potential sequencing artifacts
NOTE Unfiltered mode is not recommended for standard consensus generation but may be useful for debugging or specialized analyses.
You will find the variant (.vcf.gz) file and the curated consensus sequence (.fasta) for each sample under the corresponding sample directory within the supplied --outdir directory. If you also wish to retain the sorted BAM files, please use --keep_sorted_bam=true.
Developer contributions to this pipeline will only be accepted if all pipeline tests pass. To check:
-
Make your changes.
-
Download the test data. A utility script is provided:
python3 scripts/download_test_data.py -
Install
nf-test(>=0.7.0) and run the tests:nf-test test tests/*.nf.testIf running on Sanger HPC cluster, add the option
--profile sanger_local. -
Submit a PR.
strain_mapper was originally produced by Marta Matuszewska and adapted for nextflow by PAM informatics.
For further information or help, don't hesitate to get in touch via [email protected].
If you use strain_mapper for your analysis, please cite the following doi: <PLACEHOLDER_FOR_CITATION>
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.