AFQuery enables fast allele frequency queries on user-defined subsets of local genomic cohorts, without rescanning VCFs.
AFQuery is a bitmap-indexed engine that efficiently recomputes AC/AN/AF for dynamically defined subcohorts (e.g., by phenotype, sex, or sequencing technology), a common requirement in ACMG/AMP variant classification. It stores per-variant genotype data as Roaring Bitmaps in Parquet files and resolves sample filters into bitmaps that can be intersected in microseconds, enabling sub-100 ms queries on large cohorts. The system accounts for ploidy in sex chromosomes, adjusts AN based on sequencing technology, supports incremental updates, and runs locally using a file-based setup (Parquet + SQLite) without requiring server or cloud infrastructure.
- You need allele frequencies for phenotype or user-defined subcohorts
- You work with mixed sequencing technologies or capture kit versions (WGS, WES, targeted panels)
- You require fast, repeated queries without rescanning VCFs
- You want a local, reproducible workflow without cloud or cluster dependencies
- Dynamic subcohort queries (<100 ms) — bitmap intersections at query time; no VCF re-scan required
- Technology-aware — avoids bias when mixing WGS, WES, and panels using different BED capture indexes
- Ploidy-aware — correct handling of sex chromosomes (PAR/non-PAR, chrX, chrY)
- ACMG-compatible allele counting — AC/AN/AF computed per standard definitions
- Flexible metadata filtering — arbitrary labels (ICD-10, HPO, custom fields) with inclusion/exclusion rules
- Incremental updates — add or remove samples and update metadata without rebuilding the database
- VCF annotation — annotate variants using subcohort-specific frequencies
- FILTER/call quality tracking — failed calls (FILTER!=PASS) tracked per variant and reported as N_FAIL
- Batch and region queries — query a single locus, a genomic region, or a list of variants from a file
- Bulk CSV export — export all variant frequencies with optional disaggregation by sex, technology, or phenotype
- Audit changelog — all database operations logged with timestamps and operator notes
- Database validation — integrity checks with scripted exit codes
- Portable and serverless — file-based system, no infrastructure required
- Query latency: <100 ms (tested up to 50,000 samples)
- Storage: ~2 bytes/sample/variant
- Scales to millions of variants per chromosome
| | AFQuery | bcftools | GATK GenomicsDB | Hail |
|---|---|---|---|---|
| Technology-aware AN | Yes | No | No | No |
| Metadata filtering | Arbitrary labels | No | No | Custom code |
| Ploidy-aware sex chromosomes | Yes | Manual | No | Manual |
| Dynamic subcohort queries | Yes | No | Limited | Requires code |
| FILTER/call quality tracking | Per variant | Manual | No | Manual |
| Incremental updates | Yes | No | Yes | No |
| Infrastructure required | None | None | Java/server | Spark cluster |
| Query latency (50K samples) | <100 ms | ~5 min | <1 min | 1–2 min |
AFQuery pre-indexes per-variant genotype data as Roaring Bitmaps stored in Parquet files. Each variant row holds three bitmaps: heterozygous carriers, homozygous alt carriers, and samples with FILTER!=PASS. Sample metadata (sex, phenotype, technology) is pre-serialized as bitmaps in SQLite.
At query time, the requested sample filter is resolved to a single candidate bitmap via bitmap intersections and differences — taking microseconds regardless of cohort size. For each variant, the candidate bitmap is intersected with the genotype bitmaps to compute AC/AN/AF. AN accounts for WES capture regions (via BED-indexed interval trees) and for ploidy on sex chromosomes (males are haploid on non-PAR chrX and chrY).
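The query path described above can be illustrated with a minimal sketch. It is not AFQuery's implementation: plain Python sets stand in for Roaring Bitmaps, and the names `METADATA`, `resolve_filter`, `allele_stats`, and `covered_techs` are hypothetical. Capture-region lookup is reduced to a caller-supplied list of technologies that cover the position, and FILTER!=PASS tracking is left out.

```python
# Plain Python sets stand in for Roaring Bitmaps; each element is a sample index.
# All names and data here are hypothetical and exist only for illustration.
METADATA = {
    "sex:female": {0, 1, 2},
    "sex:male": {3, 4, 5},
    "phenotype:E11.9": {0, 1, 3, 4},
    "tech:WGS": {0, 3},
    "tech:WES": {1, 2, 4, 5},
}
ALL_SAMPLES = set(range(6))


def resolve_filter(include, exclude=()):
    """Intersect all included labels, then subtract every excluded label."""
    candidates = set(ALL_SAMPLES)
    for label in include:
        candidates &= METADATA[label]
    for label in exclude:
        candidates -= METADATA[label]
    return candidates


def allele_stats(candidates, het, hom_alt, covered_techs, haploid_males=False):
    """Return (AC, AN, AF) for one variant over the candidate samples."""
    # Only samples whose technology covers this position contribute to AN.
    covered = set()
    for tech in covered_techs:
        covered |= METADATA[f"tech:{tech}"]
    eligible = candidates & covered

    # On non-PAR chrX and chrY, males are counted as haploid.
    males = eligible & METADATA["sex:male"] if haploid_males else set()
    diploid = eligible - males

    # Diploid samples contribute two alleles each, haploid males one.
    an = 2 * len(diploid) + len(males)
    # Het carriers add 1 and hom-alt carriers add 2; a carrier haploid male adds 1.
    ac = (len(het & diploid) + 2 * len(hom_alt & diploid)
          + len((het | hom_alt) & males))
    af = ac / an if an else None
    return ac, an, af


# Example: one variant with two heterozygous carriers and one hom-alt carrier.
subcohort = resolve_filter(["phenotype:E11.9"])
print(allele_stats(subcohort, het={0, 4}, hom_alt={1},
                   covered_techs=["WGS", "WES"]))  # (4, 8, 0.5)
```

The filter is resolved once per query, so the per-variant work is a few set intersections and counts. That is the property the bitmap index relies on.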
- VCF files: normalized and consistent with the selected genome build (GRCh37 or GRCh38)
- Sample metadata: must include sex, sequencing technology, and any fields used for filtering (e.g., phenotype)
- BED files (optional): define capture regions for each sequencing technology
Example workflow from raw VCFs to query, export, and annotation:
pip install afquery
# Docker: see Installation docs for docker pull / run usage
# Build the database
afquery create-db --manifest samples.tsv --output-dir ./db/ --genome-build GRCh38
# Inspect the database
afquery info --db ./db/
# Query a single position, filtered to a phenotype
afquery query --db ./db/ --locus chr1:925952 --phenotype E11.9 --sex female
# Query a genomic region
afquery query --db ./db/ --region chr1:900000-1000000
# Export BRCA1 variant frequencies to CSV
afquery dump --db ./db/ --output all_variants.csv --chrom chr17 --start 43044292 --end 43170327
# Annotate a VCF with cohort frequencies
afquery annotate --db ./db/ --input patient.vcf --output annotated.vcf --threads 12
# Add new samples to an existing database
afquery update-db --db ./db/ --add-samples new_samples.tsv

If you use AFQuery, please cite:
AFQuery: fast, metadata-aware allele frequency queries on local genomic cohorts.
(manuscript in preparation)