SegmentQTL is a segmentation-aware molecular quantitative trait loci (molQTL) analysis tool designed for copy-number–driven cancers. It incorporates genomic segmentation data to improve QTL mapping accuracy by filtering out associations disrupted by structural variations. This approach prevents spurious signals caused by breakpoints, ensuring biologically meaningful genotype-phenotype associations.
SegmentQTL uses an allele-specific genotype model: each variant is represented by two log-ratio dosages, ALTlr (alternate allele) and REFlr (reference allele), supplied as separate per-chromosome files. From these, the tool internally derives the contrast d = REFlr − ALTlr used in association testing, and supports four analysis modes:
- nominal – per-variant association testing.
- perm – permutation-based gene-level scan with Freedman–Lane residualization.
- finemap – joint multi-variant fitting per cis-window using a missing-aware Elastic Net with stability selection.
- validate – assess generalization of finemapped models on an independent cohort.
The tool handles large datasets by parallelizing work across multiple CPU cores (see --num_cores).
Installation requires Python and pip (the Python package installer) to be installed beforehand:
git clone https://github.com/HautaniemiLab/SegmentQTL.git
cd SegmentQTL
# (Optional, but recommended) Create a virtual environment
python -m venv <my-venv>
source <my-venv>/bin/activate
pip install -r requirements.txt

SegmentQTL is executed via the command line with various options to control input data, analysis modes, and computational resources. The key arguments are:
- --mode – Specifies the analysis mode:
  - nominal: per-variant association testing.
  - perm: permutation-based gene-level scan.
  - finemap: joint Elastic Net finemapping per cis-window.
  - validate: validation of finemap models on an independent cohort.
- --chromosome – Chromosome number (e.g., 21 or X). Supports the chr prefix (e.g., chr21).
- --genotypes – Path to genotype data directory containing per-chromosome ALTlr / REFlr CSVs (see Genotype Files).
- --quantifications – Path to CSV file containing phenotype quantifications (e.g., gene expression). Provide the file with quantifications for the whole genome.
- --covariates – Path to CSV file with sample-level covariate data.
- --copynumber – Path to CSV file with phenotype-level copy-number data (CNlr). In perm mode it is required for Freedman–Lane residualization; in finemap/validate it is included as an unpenalised predictor.
- --segmentation – Path to segmentation file with breakpoint data.
- --out_dir – Directory where results are saved.
- --phenotype_covariate – Path to additional phenotype-level covariate CSV. Optional; treated as an unpenalised predictor in finemap/validate.
- --window – Window size in base pairs for cis-mapping (default: 1,000,000 bp).
- --num_cores – Number of CPU cores to use for parallel processing (default: 1).
- --all_variants – (nominal mode) Test all variants for a given phenotype. Provide a phenotype ID or use without a value to process all phenotypes.
- --perm_method – (perm mode) Method used for permutation (beta or direct). Default: beta.
- --num_permutations – (perm mode) Number of permutations per phenotype (default: 5000).
- --record_aic – (nominal/perm mode) Record AIC scores for associations.
- --neg_control – (nominal/perm mode) Run trans negative-control mode. For each gene on chromosome c, tests variants from chromosome c+1 (wrapping). Used for calibration diagnostics.
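
The "c+1 (wrapping)" rule used by --neg_control can be illustrated with a small sketch. The chromosome order below (1 to 22, then X) and the helper name `neg_control_chromosome` are assumptions made for illustration; the order SegmentQTL actually uses may differ.

```python
# Hypothetical chromosome order; the order used by the tool is an assumption here.
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X"]


def neg_control_chromosome(chrom):
    """Return the chromosome that follows `chrom`, wrapping from the last to the first."""
    # Accept both "8" and "chr8", mirroring the --chromosome argument.
    name = chrom[3:] if chrom.startswith("chr") else chrom
    idx = CHROM_ORDER.index(name)
    return CHROM_ORDER[(idx + 1) % len(CHROM_ORDER)]


for c in ("8", "22", "chrX"):
    print(c, "->", neg_control_chromosome(c))
```

Because a gene is paired with variants from a different chromosome, no cis effect is expected, so any excess of small p-values points to miscalibration rather than biology.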
- --alpha_en – Elastic Net mixing parameter (1 = Lasso, 0 = Ridge). Default: 0.5.
- --coverage_tau – Minimum fraction of samples observed for a variant. Default: 0.6.
- --n_bootstrap – Number of stability-selection bootstrap resamples. Default: 200.
- --subsample_frac – Fraction of samples per bootstrap resample. Default: 0.8.
- --n_lambda – Number of lambda grid points for CV-based selection. Default: 30.
- --lambda_ratio – Lower-bound ratio lam_min / lam_max. Default: 0.01.
- --cv_tau – Range-based CV tolerance for sparsity. Default: 0.8.
- --min_obs_boot – Minimum observed entries per variant within each bootstrap subsample. Default: 20.
- --phenotype_id – Restrict run to a single phenotype.
- --compute_r2 – Compute R² for baseline vs full model and include in output.
- --r2_stability_threshold – Minimum stability score for variant selection in R² computation. Default: 0.6.
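
To make --n_lambda and --lambda_ratio concrete: together they describe a grid of penalty strengths between lam_max and lam_min = lambda_ratio × lam_max. The sketch below builds such a grid with logarithmic spacing, which is a common choice for Elastic Net paths. The log spacing and the function name `lambda_grid` are assumptions, not a description of SegmentQTL internals.

```python
import math


def lambda_grid(lam_max, n_lambda=30, lambda_ratio=0.01):
    """Return n_lambda penalties from lam_max down to lam_max * lambda_ratio (log-spaced)."""
    if n_lambda < 2:
        return [lam_max]
    log_max = math.log(lam_max)
    log_min = math.log(lam_max * lambda_ratio)
    step = (log_min - log_max) / (n_lambda - 1)
    return [math.exp(log_max + i * step) for i in range(n_lambda)]


# lam_max = 2.0 is an arbitrary illustrative value.
grid = lambda_grid(lam_max=2.0)
print(len(grid), grid[0], grid[-1])
```

With the defaults, the smallest penalty is one hundredth of the largest, and each step shrinks the penalty by the same multiplicative factor.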
Validation reuses the discovery (main) cohort inputs (--genotypes, --quantifications, --covariates, --segmentation, --copynumber, --phenotype_covariate) and adds an independent cohort:
- --val_genotypes, --val_quantifications, --val_segmentation – required validation-cohort inputs.
- --val_covariates, --val_copynumber, --val_phenotype_covariate – optional validation-cohort inputs.
- --validation_mode – recalibrated (default) refits the unpenalised block on validation and freezes the genetic component; frozen reuses everything from the discovery fit.
- --finemap_results_dir – Reuse pre-computed finemap_<chr>.csv (recommended). Skips refitting the Elastic Net.
- --validate_with_bootstrap / --validation_stability_threshold – Mask discovery betas below a stability threshold.
- --restrict_to_supported_phenotypes, --support_definition, --support_min_stability – Limit validation scoring to phenotypes supported in discovery.
- --bootstrap_ci / --n_boot_ci – Paired bootstrap CIs for R² and calibration slope.
- --n_permutations – Validation phenotype-label permutation null (set > 0 to enable).
- --save_model_audit – Write a long-format per-phenotype model audit CSV.
SegmentQTL requires five main inputs: genotypes (ALTlr + REFlr per chromosome), quantifications, sample covariates, copy number, and segmentation. An optional sixth input adds phenotype-level covariates. Below are the required formats and examples for each.
The --genotypes argument should point to a directory containing two files per chromosome, one for each allele:
genotypes/
chr1_ALTlr.csv
chr1_REFlr.csv
chr2_ALTlr.csv
chr2_REFlr.csv
...
chrX_ALTlr.csv
chrX_REFlr.csv
Each pair encodes log-ratio allelic dosages for the same set of variants and samples.
- ID: Variant identifier in the format chr:pos:ref:alt (e.g., chr8:123456:A:G). The ID column must be identical (and in the same order) between the ALTlr and REFlr files of a chromosome.
- <sample1>, <sample2>, ...: Per-sample log-ratio dosage values.
SegmentQTL does not perform the normalization itself; it assumes the genotype files already contain floored, shifted log-ratios computed from the alternate and reference allele copy numbers (ALTcn, REFcn) and the sample ploidy. Internally, SegmentQTL works with the contrast d = REFlr − ALTlr.
See AlleleDoser for upstream computation.
| ID | sample1 | sample2 | sample3 |
|---|---|---|---|
| chr8:123456:A:G | 1.85 | 2.10 | 0.40 |
| chr8:123789:T:C | 2.00 | 1.95 | 1.20 |
chr8_REFlr.csv has the identical ID column with the corresponding REFlr values.
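
The relationship between the two files and the quantities used in testing can be sketched as follows: the contrast d = REFlr − ALTlr and the sum s = REFlr + ALTlr are computed per variant and sample once the ID columns are confirmed to match. This is a minimal illustration using only the standard library. The REFlr values are invented, the helper names are hypothetical, and missing values are not handled.

```python
import csv
import io

ALT_CSV = """ID,sample1,sample2,sample3
chr8:123456:A:G,1.85,2.10,0.40
chr8:123789:T:C,2.00,1.95,1.20
"""

# Hypothetical REFlr values, invented for illustration only.
REF_CSV = """ID,sample1,sample2,sample3
chr8:123456:A:G,2.00,2.00,1.50
chr8:123789:T:C,1.90,2.05,2.10
"""


def read_dosages(text):
    """Parse a dosage CSV into (ordered variant IDs, {variant: {sample: value}})."""
    rows = list(csv.DictReader(io.StringIO(text)))
    ids = [row["ID"] for row in rows]
    values = {
        row["ID"]: {k: float(v) for k, v in row.items() if k != "ID"}
        for row in rows
    }
    return ids, values


def contrast_and_sum(alt_text, ref_text):
    """Return (d, s) with d = REFlr - ALTlr and s = REFlr + ALTlr per variant and sample."""
    alt_ids, alt = read_dosages(alt_text)
    ref_ids, ref = read_dosages(ref_text)
    # The two files must list the same variants in the same order.
    if alt_ids != ref_ids:
        raise ValueError("ALTlr and REFlr ID columns differ in content or order")
    d = {v: {smp: ref[v][smp] - alt[v][smp] for smp in alt[v]} for v in alt_ids}
    s = {v: {smp: ref[v][smp] + alt[v][smp] for smp in alt[v]} for v in alt_ids}
    return d, s


d, s = contrast_and_sum(ALT_CSV, REF_CSV)
print(d["chr8:123456:A:G"])
```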
The --quantifications argument should point to a CSV file containing normalized phenotype levels (e.g., gene expression) for all samples across the genome.
- chr: Chromosome where the phenotype is located (e.g., chr1, chrX).
- start: Start position of the phenotype.
- end: End position of the phenotype.
- gene_id: Unique identifier for the phenotype (e.g., Ensembl gene ID).
- <sample1>, <sample2>, ...: Normalized phenotype values per sample.
| chr | start | end | gene_id | sample1 | sample2 | sample3 |
|---|---|---|---|---|---|---|
| chr8 | 123000 | 124000 | ENSG00000123 | 1.21 | 0.98 | 1.34 |
| chr8 | 130000 | 132000 | ENSG00000456 | 0.87 | 1.05 | 0.92 |
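
The phenotype coordinates combine with --window to decide which variants count as cis for a gene. The sketch below keeps variants on the same chromosome whose position lies within `window` base pairs of the gene interval. Measuring the window from both ends of the gene is an assumption (the tool may anchor it differently), and the function names and the last two variant IDs are hypothetical.

```python
def parse_variant_id(variant_id):
    """Split a chr:pos:ref:alt identifier into its four parts."""
    chrom, pos, ref, alt = variant_id.split(":")
    return chrom, int(pos), ref, alt


def cis_variants(variant_ids, gene_chr, gene_start, gene_end, window=1_000_000):
    """Return variants on gene_chr within `window` bp of [gene_start, gene_end]."""
    lower = max(0, gene_start - window)
    upper = gene_end + window
    selected = []
    for vid in variant_ids:
        chrom, pos, _, _ = parse_variant_id(vid)
        if chrom == gene_chr and lower <= pos <= upper:
            selected.append(vid)
    return selected


variants = [
    "chr8:123456:A:G",
    "chr8:123789:T:C",
    "chr8:5000000:G:T",  # hypothetical variant, too far from the gene
    "chr9:123500:C:A",   # hypothetical variant on another chromosome
]
print(cis_variants(variants, "chr8", 123000, 124000))
```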
The --covariates argument should point to a CSV file containing sample-level covariate values. The first row has n entries (one per sample); each subsequent row has n + 1 entries (the covariate name followed by n values).
- Row 1: Sample IDs only (e.g., sample1,sample2,sample3).
- Row 2+: First cell is the covariate name, followed by the values for each sample.
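
Because the header row is one cell shorter than the data rows, a generic CSV loader may misalign the columns. The sketch below reads the layout explicitly with the standard library. The covariate names (age, purity), their values, and the function name `read_covariates` are hypothetical.

```python
import csv
import io

# Hypothetical covariates file: header has n cells, data rows have n + 1.
COV_CSV = """sample1,sample2,sample3
age,61,54,70
purity,0.8,0.65,0.9
"""


def read_covariates(text):
    """Return (sample IDs, {covariate name: {sample: value}})."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    samples = rows[0]
    covariates = {}
    for row in rows[1:]:
        name, values = row[0], row[1:]
        # Each data row must carry exactly one value per sample.
        if len(values) != len(samples):
            raise ValueError(f"covariate {name!r}: expected {len(samples)} values, got {len(values)}")
        covariates[name] = {smp: float(v) for smp, v in zip(samples, values)}
    return samples, covariates


samples, covariates = read_covariates(COV_CSV)
print(samples, covariates["purity"])
```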
The --copynumber argument should point to a CSV file containing phenotype-level copy number values (CNlr) for each sample.
- gene_id: Ensembl gene ID or equivalent identifier.
- <sample1>, <sample2>, ...: Copy number values per sample.
| gene_id | sample1 | sample2 | sample3 |
|---|---|---|---|
| ENSG00000123 | 2.10 | 1.85 | 1.92 |
| ENSG00000456 | 1.75 | 2.30 | 2.00 |
The --segmentation argument should point to a CSV file with structural segmentation data for each sample. This is used to determine if a variant and gene are on the same intact genomic segment.
- sample: Sample ID.
- chr: Chromosome identifier.
- startpos: Start coordinate of the segment.
- endpos: End coordinate of the segment.
| sample | chr | startpos | endpos |
|---|---|---|---|
| sample1 | chr8 | 100000 | 200000 |
| sample1 | chr8 | 200001 | 300000 |
| sample2 | chr8 | 120000 | 250000 |
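
The segment-consistency idea can be sketched with the example rows above: a sample contributes to a variant-gene test only if one of its segments contains both the variant and the gene. Requiring the whole gene interval to lie inside the segment is an assumption about the exact rule, and the function names are hypothetical.

```python
import csv
import io

SEG_CSV = """sample,chr,startpos,endpos
sample1,chr8,100000,200000
sample1,chr8,200001,300000
sample2,chr8,120000,250000
"""


def load_segments(text):
    """Group segments as {(sample, chr): [(startpos, endpos), ...]}."""
    segments = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["sample"], row["chr"])
        segments.setdefault(key, []).append((int(row["startpos"]), int(row["endpos"])))
    return segments


def on_same_segment(segments, sample, chrom, variant_pos, gene_start, gene_end):
    """True if a single segment of `sample` covers both the variant and the gene."""
    for start, end in segments.get((sample, chrom), []):
        if start <= variant_pos <= end and start <= gene_start and gene_end <= end:
            return True
    return False


segments = load_segments(SEG_CSV)
# Gene at chr8:123000-124000, variant at position 210000.
for smp in ("sample1", "sample2"):
    print(smp, on_same_segment(segments, smp, "chr8", 210000, 123000, 124000))
```

In this example sample1 has a breakpoint between the gene and the variant and would be dropped for that pair, while sample2 would be kept. Filtering of this kind is what makes number_of_samples vary from one association to the next.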
The optional --phenotype_covariate argument points to a CSV with one phenotype-level covariate per sample (same layout as the copy number file). When provided, it is included as an unpenalised predictor in finemap and validate.
The primary output of nominal / perm modes is a CSV with per-(phenotype, variant) statistics.
| Column Name | Description |
|---|---|
| phenotype | Phenotype identifier. |
| variant | Variant identifier (best variant per phenotype in perm mode). |
| number_of_samples | Effective number of samples after segment-consistency filtering. |
| beta_s | Slope on the sum s = REFlr + ALTlr (allelic-burden / dosage covariate). |
| se_s | Standard error of beta_s. |
| beta_d | Slope on the contrast d = REFlr − ALTlr (allele-specific effect of interest). |
| se_d | Standard error of beta_d. |
| t_stat_d | t-statistic for beta_d. |
| nominal_p | Nominal p-value for beta_d. |
| r2_alt | R² of the alternative model. |
| p_adj | Permutation-adjusted p-value (perm mode only; NaN in nominal). |
| chr | Chromosome. |
When --record_aic is set, additional columns aic_null, aic_alt, delta_aic_alt_minus_null are written.
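
As a usage illustration, the sketch below filters a perm-mode result table by p_adj using only the standard library. The rows are invented, only a subset of the columns is shown, and the 0.05 threshold and the function name `significant_hits` are arbitrary choices. Empty and NaN entries are both treated as missing, since the on-disk encoding of NaN is not specified here.

```python
import csv
import io
import math

# Hypothetical excerpt of a result file; values are invented.
RESULTS_CSV = """phenotype,variant,number_of_samples,beta_d,se_d,nominal_p,p_adj,chr
ENSG00000123,chr8:123456:A:G,95,0.42,0.10,0.00005,0.003,chr8
ENSG00000456,chr8:123789:T:C,88,0.05,0.12,0.68,0.91,chr8
ENSG00000789,chr8:125000:G:T,90,0.30,0.11,0.008,NaN,chr8
"""


def significant_hits(text, threshold=0.05):
    """Return (phenotype, variant, p_adj) for rows with a defined p_adj below threshold."""
    hits = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["p_adj"].strip()
        p_adj = float(raw) if raw else math.nan
        # Rows without a permutation-adjusted p-value are skipped.
        if not math.isnan(p_adj) and p_adj < threshold:
            hits.append((row["phenotype"], row["variant"], p_adj))
    return hits


print(significant_hits(RESULTS_CSV))
```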
finemap mode writes finemap_<chr>.csv with one row per (phenotype, selected variant). Key columns include phenotype, variant, n_samples, n_variants, mean_d, sd_d, stability_score, mean_beta, sign_consistency, lambda_selected, beta_full (refit Elastic Net coefficient on standardised d), beta_full_raw (back-transformed to raw d), beta_cnlr, and effect_interpretation. A companion finemap_bootstrap_nonzero_<chr>.csv records per-bootstrap selections, and finemap_r2_<chr>.csv is written when --compute_r2 is set.
validate mode writes three files: validate_<chr>.csv (per-phenotype generalization metrics), validate_residuals_<chr>.csv (per-(phenotype, validation sample) predictions and residuals), and validate_model_<chr>.csv (long-format model audit; only when --save_model_audit is set). Metrics include RMSE/MAE, descriptive R², calibration (intercept + slope with HC3 robust Wald test), genetic transfer slope rho, burden-stratified R², and optional bootstrap and permutation p-values.
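
For orientation, the simplest of these metrics can be computed by hand from observed and predicted phenotype values. The sketch below shows RMSE, MAE, and a calibration intercept and slope obtained by ordinary least squares of observed on predicted. It is an assumption that the tool defines calibration this way. The HC3 robust test and the other metrics are not covered, and the data are invented.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def calibration(observed, predicted):
    """Least-squares fit observed = intercept + slope * predicted; returns (intercept, slope)."""
    n = len(observed)
    mean_pred = sum(predicted) / n
    mean_obs = sum(observed) / n
    sxx = sum((p - mean_pred) ** 2 for p in predicted)
    sxy = sum((p - mean_pred) * (o - mean_obs) for p, o in zip(predicted, observed))
    slope = sxy / sxx
    return mean_obs - slope * mean_pred, slope


# Invented values: predictions are too compressed, so the slope exceeds 1.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 3.0]
print(rmse(observed, predicted), mae(observed, predicted), calibration(observed, predicted))
```

A slope near 1 with an intercept near 0 indicates predictions on the right scale; a slope above 1, as in this toy example, indicates predictions that vary less than the observed values.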
These examples assume your inputs follow the layout described above and that you are at the root of the SegmentQTL folder.
Per-variant nominal association testing on chromosome 8 with 4 cores:
python -m segmentqtl --mode nominal --chromosome 8 --num_cores 4 \
--genotypes data/genotypes --quantifications data/quantifications.csv \
--covariates data/covariates.csv --copynumber data/copynumbers.csv \
--segmentation data/segments.csv --out_dir results/

Gene-level scan with 1000 permutations using the beta approximation:
python -m segmentqtl --mode perm --chromosome 8 --num_permutations 1000 \
--perm_method beta --num_cores 4 \
--genotypes data/genotypes --quantifications data/quantifications.csv \
--covariates data/covariates.csv --copynumber data/copynumbers.csv \
--segmentation data/segments.csv --out_dir results/

Joint Elastic Net finemapping with stability selection on chromosome 8:
python -m segmentqtl --mode finemap --chromosome 8 --num_cores 4 \
--genotypes data/genotypes --quantifications data/quantifications.csv \
--covariates data/covariates.csv --copynumber data/copynumbers.csv \
--segmentation data/segments.csv \
--n_bootstrap 200 --compute_r2 --out_dir results/

Reuse a prior finemap run and validate on a held-out cohort, restricting scoring to discovery-supported phenotypes:
python -m segmentqtl --mode validate --chromosome 8 --num_cores 4 \
--genotypes data/genotypes --quantifications data/quantifications.csv \
--covariates data/covariates.csv --copynumber data/copynumbers.csv \
--segmentation data/segments.csv \
--val_genotypes valdata/genotypes --val_quantifications valdata/quantifications.csv \
--val_covariates valdata/covariates.csv --val_copynumber valdata/copynumbers.csv \
--val_segmentation valdata/segments.csv \
--finemap_results_dir results/ \
--restrict_to_supported_phenotypes --support_min_stability 0.6 \
--out_dir validation_results/

Run all applicable variants for one phenotype:
python -m segmentqtl --mode nominal --all_variants ENSG00000003987 \
--chromosome 8 --num_cores 1 \
--genotypes data/genotypes --quantifications data/quantifications.csv \
--covariates data/covariates.csv --copynumber data/copynumbers.csv \
--segmentation data/segments.csv --out_dir results/

If you use SegmentQTL in your work, please cite:
Samuel Leppiniemi, et al. SegmentQTL: Identifying genetic variants influencing molecular phenotypes in copy number-driven cancers. bioRxiv, 2025. https://doi.org/10.1101/2025.07.28.667150

