Release: cosmos, rosace-aa, sequence-based variants, MaveDB deposit, cluster profiles, uv dev env by odcambc · Pull Request #46 · odcambc/dumpling

odcambc · 2026-06-15T20:44:39Z

Summary

Promotes the accumulated dev work to main — 57 commits, 79 files, +11.6k/−0.3k. This bundles several independent feature lines that have each been validated on dev (full test suite green with cosmos + snakemake installed). History is linear and dev is 0 commits behind main, so this can fast-forward.

What's included

Scoring backends

rosace-aa as a third scoring_backend — wraps pimentellab/rosace-aa, adding per-position + amino-acid substitution effect decomposition alongside rosace (default) and lilace. New install script, rules, driver, and config/schema wiring.
Determinism phase 1 — Stan seeds are now logged on every run; lilace's seed is configurable (lilace_seed) for bit-identical reruns. Documents why there is intentionally no rosace_seed knob.

Variant identification & filtering

Sequence-based variant identification (closes don't use oligo names to identify variants #27) — designed variants are matched by oligo sequence rather than name, hardened against real oligo data (validated ~94% on real oligo data).
Codon-based filtering + multi-codon protein variants (closes Handle multiple variants with different codons #23) — restores the codon filter that regressed in 28dbec2, supports multiple codons per protein variant, and restores insdel / multi-variant rejection stats and the max_deletion_length gate. Also fixes Error when running rosace from dumpling workflow #9
FASTA contig-name normalization (closes improve reference fasta handling #19) — normalizes the reference contig name to the filename stem.

Deposit & downstream analysis

MaveDB deposit formatting — format_mavedb (per-condition, MAVE-HGVS canonical form validated via mavehgvs); deposit_to_mavedb is on by default. (SRA submission formatting is intentionally not included — see follow-ups.)
cosmos integration — format_cosmos builds cosmos's wide multi-condition input, and an opt-in run_cosmos stage actually invokes pimentellab/cosmos (enrich2-style: layered on top of the chosen scoring backend) to produce the per-position direct/indirect decomposition. Off by default; isolated in its own conda env. End-to-end validated (DAG wiring + real DMSData load against cosmos 1.0.2).

Performance & infra

Parallel pigz for BBTools compressed IO; persistent aligner indexes across runs (--notemp documented); minimap2 as an alternative aligner.
Enrich2 .h5 cleanup (closes clean up enrich h5s #16) — intermediate HDF5 stores declared as Snakemake temp() so Snakemake (not the pipeline) reclaims them; keep_enrich_h5 opts out.
Cluster profiles + per-rule resources (closes Cluster profiles #13) — SLURM profile, default profile, and per-rule mem_mb / threads / runtime; JVM heaps pinned a headroom below the cgroup allocation.
Prebuilt container + GHCR publish workflow; thin cluster_env.yaml driver.

Dev experience

uv-native test/lint layer — pyproject.toml + uv.lock virtual project with a dev group and cosmos extra; CONTRIBUTING.md; devcontainer uses pip install --group dev.
ruff + snakefmt + pre-commit tooling.

Closes

Closes #13, #16, #19, #23, #27.

Test plan

Full unit + integration suite passes on dev with snakemake and cosmos installed (299 passed in the last run).
cosmos end-to-end validated against the real cosmos-dms 1.0.2 package (DAG construction + DMSData load with NA tolerance; slow real-run test guarded).
--dry-run green on the example config.

Notes / follow-ups (not in this PR)

Multi-scorer refactor (any combination of scorers, enrich-only runs) is designed and partially branched but intentionally out of scope here.
cosmos run is serial (~tens of seconds/position); scatter-gather parallelization is a tracked follow-up.
Cluster profile is wired but not yet validated on a live SLURM cluster.
An untested, manual-only prepare_sra stub was removed from this PR; SRA submission formatting will return as a properly-scoped feature (anchored fastq matching, real instrument/biosample config, tests).
The container CI smoke test (snakemake --dry-run in-image) was dropped — it validated against the gitignored example/data, so it never had inputs in CI. Re-enabling it is tied to the committed end-to-end fixture follow-up.

run_rosace.R's build_counts_for_replicate calls read_tsv without checking that the per-sample file exists. When upstream GATK or process_counts drops a sample, users see a raw "cannot open file" R traceback with no hint of which sample failed. Mirror the lilace guard from 0da8edc on the rosace path: file.exists check before read_tsv, error names the missing file and the sample that was supposed to produce it. New test in test-build_counts.R locks the message in (helper-functions.R carries the sibling copy of the function used by the test harness, updated in lockstep). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The prior `bbtools_use_bgzip` bool only toggled bgzip on or off; both paths left BBTools' compressed IO effectively single-threaded for the 8 GB+ FASTQs DMS runs ship. Replace it with `bbtools_compression` (`pigz | bgzip | none`, default `pigz`), and emit `usepigz=t unpigz=t bgzip=f unbgzip=f` so BBTools spawns pigz for (de)compression. Audit-time spec also called for `pigzthreads={threads} unpigzthreads= {threads}`, but empirical probe on BBMerge 39.13 showed both flags are rejected as unknown parameters: java.lang.RuntimeException: Unknown parameter pigzthreads=4 Dropped from the emitted flag string. pigz invoked without an explicit thread count uses its own default (online CPU count), which still gives the parallel-vs-serial win. Cluster oversubscription is bounded by rule scheduling rather than per-pigz thread caps. Back-compat shim in script_utils.translate_legacy_bbtools_compression translates the deprecated bool byte-compatibly: bbtools_use_bgzip: true -> bbtools_compression: bgzip bbtools_use_bgzip: false -> bbtools_compression: none so existing user configs keep producing the same BBTools invocations. Empirical concordance gate on tests/fixtures example: - 12/14 processed_counts/*.csv byte-identical between pigz mode and bgzip mode (both fresh runs, --cores 4). - 2/14 differ by exactly one row × ±1 count — sub-ppm noise from BBMerge multi-threaded scheduling, well within the inherent thread non-determinism the audit memory documents (~15-25 ppm). Pigz vs bgzip itself produces identical counts. New tests/unit/test_bbtools_compression_config.py mirrors the test_aligner_config.py pattern: schema shape, enum enforcement, snakemake.utils.validate default injection, and the shim's byte-compatibility invariants. 14 new tests, 43/43 affected suite passing. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Reference generation today silently assumes the FASTA's first `>` line matches `Path(config["reference"]).stem`. When it doesn't — e.g. a user-supplied `my_ref.fasta` whose header is `>some_other_contig` — BBMap index build and GATK SAM/ref contig matching both fail with opaque errors. Add a `prepare_reference` rule that copies the user's reference into a working path under `ref/` and rewrites the first `>` line to `>{reference_name}`. Downstream consumers (prepare_dict_GATK, prepare_index, prepare_bbmap_index, prepare_minimap2_index, gatk_ASM, generate_variants) now read from the normalized path instead of the raw user file. The user's original FASTA is left untouched. awk rewrites only the FIRST `>` line; multi-contig FASTAs preserve their secondary contig names so anyone relying on those isn't silently renamed (DMS data is single-contig in practice, but the safer default is to touch the minimum). `reference_file` still points at the raw user file because `file_digest(reference_file)` (common.smk:351) runs at parse time before any rule fires; the normalized file doesn't exist yet then. The `ref_digest` derived from the raw bytes is the right cache key anyway — a changed raw reference should invalidate the bbmap index regardless of normalization. Empirical probe: ran `snakemake --until prepare_bbmap_index` against a crafted FASTA with `>some_weird_contig_name_42` and filename `example_ref.fasta`. Without this fix, BBMap fails on the contig name mismatch (per Upstream #19's reporter). With it: Input header: >some_weird_contig_name_42 Output header: >example_ref BBMap index: built successfully Happy case (header already matching filename) re-verified end-to-end on the example fixture — full pipeline through process_sample completes with 13/14 csvs byte-identical to the prior pigz baseline (the 14th differs by 1 row × ±1 count, the same sub-ppm BBMerge thread non-determinism the previous commit characterized). New tests/unit/test_prepare_reference.py covers five awk-level invariants: mismatched header rewritten, matching header idempotent, multi-contig only rewrites line 1, sequence bytes preserved, special characters in reference_name pass through literally. 176/176 full unit+integration suite passing. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

DMS scoring runs use Stan MCMC and can drift visibly at the per-variant score level when chain init isn't pinned. Surface the seed in the rule log so users can reproduce a specific run, and make lilace's configurable from config. Two backends, two outcomes: - lilace exposes `seed=NULL` on lilace_fit_model (verified against pimentellab/lilace R/lilace.R). Add `lilace_seed: int | null` config knob (default null → auto-generate from Sys.time() salted with the condition name; always logged). User-set integer → bit-identical chain init across reruns. - rosace's RunRosace does NOT accept a seed argument (verified against pimentellab/rosace R/runROSACE.R). MCMCRunStan hardcodes seed=100 internally, so rosace runs are already bit-identical across reruns by construction. Adding a `rosace_seed` knob would silently no-op and mislead users — intentionally absent from the schema. Run log surfaces "Stan seed: 100 (hardcoded upstream)" for visibility/parity with the lilace log line. resolve_lilace_seed lives in run_lilace.R and a parallel testable copy in helper-lilace-functions.R, matching the existing parse_stripped_hgvs pattern. R-side tests (test-resolve_lilace_seed.R) lock in: configured seed passes through unchanged, zero is a valid seed, auto-gen is deterministic given fixed clock, distinct conditions get distinct seeds at the same clock (salt is load-bearing for parallel jobs), seed stays in Stan's valid uint32 range, log message includes a copy-pasteable config snippet. Caught and fixed a 32-bit integer overflow in the auto-gen path — as.integer(Sys.time()) ≈ 1.7e9 and *17 overflows R's int range to NA. Compute in double space, then modulo into int range before casting. Schema-level tests (test_lilace_seed_config.py) mirror the test_aligner_config / test_bbtools_compression_config pattern: 9 new tests covering schema shape, enum enforcement (integer and null both accepted, strings rejected, zero accepted), snakemake.utils.validate default injection, and the deliberate absence of `rosace_seed`. 185/185 full unit+integration suite passing; R suite green including 8 new resolve_lilace_seed tests. MultiQC surfacing of the seed (custom_data sidecar or plugin section) deferred to the `qc` branch — the rule log is sufficient for the Phase 1 "displayed reproducibility" goal. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Audit of the 8 temp() markers in the workflow (filter.smk, baseline_qc.smk, map.smk, ref.smk) splits cleanly into two classes: - sample-derived intermediates (ec.clean.trim.fastq.gz pair, mapped.bam, generated multiqc config + baseline file list) → keep temp(). Multi-GB-or-no-debug-value artifacts that regenerate every run. - reference-derived artifacts (bbmap index dir, minimap2 .mmi) → drop temp(). Deterministic from ref+digest, keyed on a content hash so collisions are structurally impossible across reference changes. Persisting lets repeat runs against the same reference skip the ~5s rebuild that temp() forced on every invocation. The deletion behavior was inherited from an earlier (pre-#37) layout where the bbmap index lived at a fixed shared path and rebuild risked artifact cross-pollution. Since #37 keyed on `ref_digest`, that risk evaporated — a different reference always lands somewhere else by construction. temp() on the index has been pure waste ever since. For users who want to retain the sample-derived intermediates (e.g. to inspect a bbduk-trimmed FASTQ or samtools view a mapped BAM), Snakemake's built-in `--notemp` flag (alias `--no-temp`, `--nt`) covers the use case with no config knob needed. README now documents the recipe. 185/185 unit+integration suite passing; snakemake dry-run on the example fixture schedules the bbmap_index rule normally (it re-fires once on this commit because the rule definition changed, then caches on subsequent runs as intended). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The initial mavedb scaffolding (d235265) inherited rosace's column names for the lilace backend too — `variants`/`mean`/`sd`. Verified against pimentellab/lilace R/lilace.R `.get_score_df()`: lilace actually emits singular `variant`, `effect`, `effect_se`. So format_mavedb run on a real lilace score CSV would have surfaced the existing "Expected columns not found" error and required users to override every column via config. Update the lilace defaults to match the emitted schema. Rosace defaults were verified correct against results/example_experiment/rosace/ cond_A_scores.csv (`variants`/`mean`/`sd`) — left unchanged. A header comment now records both audits with file references so the next person to touch this doesn't have to re-derive the schema mapping. New tests/unit/test_format_mavedb.py covers: - normalize_hgvs converts rosace's same-AA missense (`p.(A1A)`) into MaveDB-canonical synonymous notation (`p.(A1=)`), which is what mavehgvs needs to recognize the variant as synonymous (it doesn't auto-normalize — Phase 0 audit finding). - Both backends' _BACKEND_DEFAULTS pin the expected column names. Catches schema drift in either direction. - End-to-end format_scores on tiny in-memory fixtures for both backends, including the synonymous-normalization case. - Missing required columns produce a clear error that names the file path and lists available columns. - The optional `sd` column is omitted from output when absent rather than crashing or fabricating SD values. Also adds the `if __name__ == "__main__"` guard that format_mavedb.py and prepare_sra.py were missing. Without it, importing either module from a test fires main() at import time and the argparse path crashes on pytest's sys.argv. The Snakemake `script:` directive still works because Snakemake invokes the script with `__name__ == "__main__"` set. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The original format_mavedb.py shipped rosace/lilace's one-letter parenthesized HGVS (`p.(M1L)`, `p.(A1del)`, `p.(A1*)`) directly into the MaveDB score set. Empirical audit against the MAVE-HGVS spec (github.com/VariantEffect/mavehgvs/docs/spec.rst) found that's structurally invalid: - MAVE-HGVS rejects the parenthesized `p.(...)` form for everything except the population-level synonymous `p.(=)`. Every dumpling row today fails the regex. - MAVE-HGVS requires three-letter amino acid codes (`Met`, `Ala`, `Ter`). Dumpling produces one-letter codes (`M`, `A`, `*`) at every stage. - Stop codon must be `Ter`, not `*`. The spec explicitly calls this out as a rejection case. Users would have uploaded a score CSV and watched every row bounce. Add `to_mave_hgvs` that does the structural conversion: strip outer parens (preserving `p.(=)`), expand one-letter to three-letter via an inverted aa_3to1_dict (replacing dumpling's internal `Stp`/`X` with canonical `Ter`), and recurse into the inserted-residue sequence for insertions. Covers the six DMS variant classes dumpling actually emits: missense, synonymous-at-known-position, single-residue deletion, multi-residue deletion, insertion, nonsense. Unknown shapes raise ValueError listing the supported set, so a new HGVS class can't slip into scoring output and silently corrupt MaveDB deposits. `normalize_hgvs` (the existing rosace `p.(A1A) → p.(A1=)` synonymous canonicalizer) stays — it's still load-bearing. The pipeline now runs both: `normalize_hgvs → to_mave_hgvs`. Add `mavehgvs` to dumpling_env.yaml's pip block (verified not on conda-forge or bioconda 2026-05-26). Import is wrapped: if the lib isn't available (e.g. user ran without --use-conda), the converter still runs but validation no-ops. Row-context error messages name the offending row index when validation fires — so a 10k-row CSV remains debuggable. Test coverage: - to_mave_hgvs: each variant class pinned, unknown-shape error message asserted (so it stays actionable). - End-to-end rosace fixture exercises substitution + synonymous + deletions + multi-deletion + nonsense in one CSV. - The two pre-existing end-to-end rosace/lilace tests updated to expect the new MAVE-HGVS output shape (`p.Met1Leu` not `p.(M1L)`). - mavehgvs validation tests use pytest.importorskip — they skip on dev boxes without the lib, run in CI where the env builds with it. - No-op-when-library-missing path tested via monkeypatch, doesn't need the lib. 206 passed, 2 skipped (the two live-mavehgvs tests; will fire in CI). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

pimentellab/rosace-aa extends rosace with position + amino-acid substitution effect decomposition. Verified the upstream R sources (audit notes in tasks/tasks.md): - Entrypoint RunRosace.Rosace lives in R/runROSACE.R, signature additions over rosace: wt.col, mut.col, aa.code, stop.col, stop.name, pos.act, blosum.label. Score CSV columns (variants, mean, sd, lfsr.*, test.*, label) match rosace's — format_mavedb works unchanged. - Stan seed is hardcoded to 100 in MCMCRunStan (same as rosace, not configurable). No rosace_aa_seed knob needed. - Requires synonymous controls + wildtype/mutation metadata columns, so it's incompatible with noprocess=true (same wall as lilace, plus an additional one). Schema changes: - scoring_backend enum becomes [rosace, lilace, rosace_aa]. Default stays `rosace` so configs that don't mention the key keep their pre-rosace_aa behavior. - New rosace_aa_local: boolean (default false), mirroring rosace_local / lilace_local. - New mem_rosace_aa: integer (default 16000), mirroring mem_rosace / mem_lilace. common.smk setdefault for the two new keys. validate_scoring_backend_mode rejects scoring_backend='rosace_aa' + noprocess=true with the same shape as the existing lilace gate, but the error message names which backend tripped — so when a third or fourth backend grows the same constraint, the message stays useful. Tests: - test_scoring_backend_config.py: enum acceptance for rosace_aa, schema-shape + default-injection assertions for mem_rosace_aa and rosace_aa_local. - test_script_utils.py: noprocess+rosace_aa is rejected, rosace_aa+noprocess=false passes. - 192/192 unit+integration suite passing. Rule + driver + install script land in the next commit (Phase 2); dry-run with `--config scoring_backend=rosace_aa` correctly surfaces a MissingInputException for the score CSV that rule will produce. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Adds the rule + script scaffolding for the rosace_aa scoring_backend that the previous commit accepted at the schema level. Mirrors rosace.smk's shape: - workflow/rules/rosace_aa.smk: run_rosace_aa (per-condition fan-out) and install_rosace_aa, both gated on the rosace_aa_local config flag that decides between system R and the conda+renv path. - workflow/envs/rosace_aa.yaml: same R-base + cmake<3.25 + nloptr + zlib system layer as rosace.yaml. CmdStan compatibility constraint is identical. - workflow/rules/scripts/install_rosace_aa.R: bootstraps renv, restores the base R deps from the existing renv.lock (which already carries impute as a Bioconductor entry), then installs rosace-aa at a pinned SHA via renv::install("pimentellab/rosace-aa@<sha>"). The upstream repo has no tagged releases (verified 2026-05-26), so we pin c0fda386 — the current main-branch HEAD — and bump on validation. - workflow/rules/scripts/run_rosace_aa.R: copy of run_rosace.R with the rosace-aa-specific changes: * library("rosaceAA") instead of library("rosace") — rosace-aa is a fork that ships its own copies of CreateRosaceObject, FilterData, etc., so loading rosace alongside would clash on function names. * RunRosace call gains wt.col="wildtype", mut.col="mutation", aa.code="single", stop.col="type", stop.name="nonsense", pos.act=FALSE — verified against the upstream vignette's AssaySet Model 2 example. * Output directory results/{experiment}/rosace_aa/ instead of rosace/. * Seed log message references rosaceAA::MCMCRunStan; the hardcoded seed=100 is the same as rosace. * The noprocess branch is replaced with a hard error — the config validator rejects rosace_aa+noprocess upfront, so this script firing under noprocess would mean a validator bug. Snakefile picks up the new rule file. Code duplication between run_rosace.R and run_rosace_aa.R (~470 lines of nearly-identical helper functions: build_counts_for_replicate, build_rosace_object_for_condition, finalize_variants_in_rosace, main) is the same pattern run_lilace.R established when lilace landed. A shared-helpers extract is the obvious follow-up if the duplication ages poorly; not the right call now since both backends could diverge on the helper-function side over time. Verified end-to-end at dry-run level: - Default scoring_backend (rosace) → install_rosace + run_rosace × 2 conditions, unchanged from before. - scoring_backend=rosace_aa → install_rosace_aa + run_rosace_aa × 2 conditions, plus 80 other pipeline jobs. - scoring_backend=rosace_aa + noprocess=True → ValueError from validate_scoring_backend_mode, naming both the backend and the config knob to flip. 192/192 unit+integration suite still passing. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Picks up the rosace-aa wiring landed in the previous commits and documents how users discover and pick it. README "Installing Rosace and Lilace" section becomes "Installing Rosace, Lilace, and rosace-aa": - mentions all three backends with linked upstream repos - explains rosace-aa's value-add (position + AA substitution effect decomposition, score CSV layout matches rosace's) - documents the scoring_backend config knob with all three options and the noprocess incompatibility for lilace + rosace_aa - adds `snakemake --cores 8 install_rosace_aa` to the manual-install snippet alongside the existing install_rosace / install_lilace lines - notes the SHA-pinned upstream install (rosace-aa has no tagged releases yet) - OSX gfortran section now references "all three backends" instead of just rosace + lilace. config/example.yaml: - scoring_backend comment grows the third option with the noprocess constraint spelled out - rosace_aa_local: false comment + value line added next to rosace_local / lilace_local, mirroring the established pattern Full suite still 192/192 passing; R suite green including the 8 resolve_lilace_seed tests. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Two changes that together let a fresh devcontainer rebuild pick up scoring_backend=rosace_aa (or lilace) with no per-run install wait: 1. postCreate.sh extracts the bootstrap into a shared script. Both .devcontainer/devcontainer.json (canonical) and the local-only claude-sandbox variant now call it. After the existing conda env + system-R packages steps, the script runs snakemake --configfile config/example.yaml \ results/example_experiment/{rosace,lilace,rosace_aa}/*_installed.txt so the existing install rules' codepath (renv::restore + cmdstanr + install_github for rosace-aa) fires once at build time. Marker files persist; subsequent snakemake invocations see the install as a cache-hit. The warmup is non-fatal: if it fails (network, upstream SHA gone, disk pressure), the devcontainer is still usable; the user just hits the real ~5-10 min install on first scoring run. Script prints the manual-retry command if that happens. `--with-dev-deps` flag preserves the claude-sandbox variant's extra `pip install -r requirements-dev.txt` step (pytest + pytest-mock). Canonical doesn't pass it. 2. apt-packages list gains `libblas-dev` and `liblapack-dev`. Found while empirically testing install_rosace_aa on the existing devcontainer: several rosace/rosace-aa transitive deps (urca, robustbase, ...) need to link against BLAS/LAPACK. The runtime libs (libblas3 / liblapack3) were already there, but the unversioned `libblas.so` / `liblapack.so` symlinks only ship in the `-dev` packages. Without them, renv::restore() fails with /usr/bin/ld: cannot find -llapack: No such file or directory This was a pre-existing latent bug — install_rosace was silently hiding it because the marker file gets written unconditionally regardless of renv::restore success. install_rosace_aa surfaced it because the new script's final `require(rosaceAA)` check fails-loud. Tracked as a follow-up in tasks.md to harden install_rosace + install_lilace with the same fail-loud pattern. Validation gap: the BLAS/LAPACK fix is structurally correct (adds the missing `-dev` packages to the apt feature list) but couldn't be empirically tested on this devcontainer because we don't have apt sudo. A fresh devcontainer rebuild is the empirical gate. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

install_rosace.R's non-local branch was the last install script with the silent-failure bug: it called renv::restore() then unconditionally wrote the snakemake marker file. renv prints errors but doesn't raise when a transitive dep fails to compile (e.g. missing system libs like libblas-dev/liblapack-dev), so the marker got written and run_rosace fired downstream — failing with a cryptic "could not find function 'CreateRosaceObject'" instead of pointing at the real install failure. install_lilace.R already had `if (!require(lilace)) stop(...)` in both its local and non-local branches (the lilace branch shipped it that way originally). install_rosace_aa.R's equivalent check was what surfaced the BLAS/LAPACK devcontainer gap in the first place. Bringing install_rosace.R into line closes the silent-failure class across all three scoring backends. The corresponding `if (!require(rosace)) stop(...)` already existed in install_rosace.R's *local* branch (rosace_local: true path) — only the non-local renv branch was missing it. New tests/unit/test_install_scripts_fail_loud.py parametrizes across all three install scripts and regex-checks the `if (!require(<pkg>)) { stop(...) }` shape. It's a coarse contract test, but each script is small enough that a stronger AST-level check would be more fragile than the contract itself. If you refactor an install script, update this test in lockstep — the docstring spells out the why. 195/195 unit+integration suite passing (3 new). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

User ran the rosace-aa branch end-to-end on the (now-validated) rebuilt devcontainer and got: [1] "cond_A_ROSACE2" run_rosace_aa failed: Invalid 'scores' name. Error in ExtractScore(object, name) : Invalid 'scores' name. rosace-aa internally tags the fitted Score with the model name picked based on which knobs are set (verified by reading helperRunRosaceGrowth in upstream R/runROSACE.R 2026-05-27): ROSACE0 — no position ROSACE1 — position-only ROSACE2 — position + blosum (default whenever pos.col is set and pos.act = FALSE — our config) ROSACE3 — position + blosum + activation (pos.act = TRUE) So the Score is stored under "cond_A_ROSACE2" but our OutputScore call was asking for "cond_A_ROSACE_AA" (a made-up suffix I picked when mirroring run_rosace.R's "_ROSACE"). ExtractScore couldn't find it and raised. Empirical failure mode that the dry-run + structural verification gates couldn't have caught. Switch the OutputScore name to paste0(expt_condition, "_ROSACE2") — matching the upstream vignette's Model 2 pattern. Inline comment spells out the mapping so the suffix can move in lockstep if/when pos.act or other model-selection knobs become user-configurable. Also flip pos.info = TRUE + blosum.info = TRUE on the OutputScore call. These surface the rosace-aa-specific columns (phi_mean/phi_sd/sigma2_mean/sigma2_sd from pos.info; blosum_score/nu_mean/nu_sd from blosum.info) — which is the entire point of using rosace-aa over rosace. The core variants/mean/sd columns are unchanged, so format_mavedb still works against the output. Dry-run on the example fixture still schedules the rule normally; 195/195 unit+integration suite green. The empirical re-test is on the user's box: retry the targeted-output invocation and confirm cond_{A,B}_scores.csv get produced. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…m.info Second empirical run on the rebuilt devcontainer landed past the "Invalid scores name" error from the previous commit and hit: Error in (function (..., row.names = NULL, ...) : arguments imply differing number of rows: 1773, 116 Root cause is an upstream bug at the pinned SHA c0fda386. OutputScore. Score's blosum branch does if ((blosum.info == TRUE) && (model_code >= 2)) { df <- cbind(df, [email protected] %>% select(blosum_score, nu_mean, nu_sd)) } `df` is variant-level (1773 rows on the example fixture) but `optional.score` is position-level (116 rows on the same fixture). cbind raises. Same shape of bug applies to the pos.info branch further down — both expand the output via cbind with a different- sized frame. Workaround: drop both flags. OutputScore returns a single variant- level data.frame with the standard 9 columns (variants, mean, sd, lfsr.*, test.*, label), which is what the existing rosace pathway produces and what format_mavedb's _BACKEND_DEFAULTS["rosace"] is wired against — so downstream tooling keeps working unchanged. The per-variant `mean`/`sd` still come from the ROSACE2 model, so the AA-aware shrinkage is baked into the variant scores — users don't lose rosace-aa's modeling, just the optional per-position summary columns. Tracked as a follow-up in tasks/tasks.md: bump the SHA pin once upstream fixes the row-mismatch (or once we file an issue and they respond), then turn the flags back on and handle the list return (write df_variant to the existing CSV + df_position to a sibling). Inline comment in run_rosace_aa.R spells out the bug shape + the ROSACE2 method-suffix mapping so the next person to revisit isn't chasing the same chain of upstream reads. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The rocker-org/r-apt feature was set to "release" which on a fresh build now resolves to R 4.6.0 (released 2026-04-24). Source compiles of `rlang`, `S7`, `yaml`, and via cascade ~half the renv.lock fail against the R 4.6 headers — empirically found during the rosace-aa branch's devcontainer rebuild iteration. Pin to 4.5.x to match `"R": { "Version": "4.5..." }` declared in renv.lock. The pin moves in lockstep with the lockfile: bump both together when the CRAN ecosystem catches up to 4.6 and the renv.lock upgrade is safe. The R 4.6 lockfile-update attempt (see commit log on the rosace-aa branch for the full chain) hit the BLAS/LAPACK gap first, then a processx/Rf_findVar fight, then the rlang/S7/yaml compilation breakage. Reverting to 4.5 + adding the missing system libs is the combination that unblocks fresh installs. Inline comment in the JSON spells out the bump-with-lockfile contract so the next person reading this doesn't try to "modernize" the pin without thinking. Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

…CK-renv Each scoring-backend install rule (install_rosace.R, install_lilace.R, install_rosace_aa.R) begins its non-local path with install.packages("renv", repos = c("https://cloud.r-project.org")) targeting the shared system library at /usr/local/lib/R/site-library. On a fresh build with --cores 4, Snakemake fires up to four rules concurrently; the first one creates /usr/local/lib/R/site-library/ .renv/00LOCK-renv as install.packages's lock directory, and the other three try to write into the same path. The loser ends up with a partial renv install (.rdb file missing or truncated) and aborts mid- restore. Empirical failure mode found during the rosace-aa branch's devcontainer rebuild iteration. Manifests as a confusing "internal error" deep in renv when the parallel rules collide. Force --cores 1 on the warmup snakemake invocation. The wall-clock cost is small: renv install is fast (seconds), and the CmdStan compile dominates the warmup time but only runs once anyway since cmdstanr's install_cmdstan() is idempotent across the three rules. Updates both the live invocation and the retry-hint message echoed on failure so users who hit a transient failure don't copy-paste the broken --cores 4 command. The 00LOCK-renv race is the kind of thing a "should be safe to parallelize, it's just three independent installs" intuition would miss. Inline comment in postCreate.sh spells out the root cause for the next person to question why a four-core warmup runs serially. Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The three scoring-backend rule envs (workflow/envs/rosace.yaml, lilace.yaml, rosace_aa.yaml) shipped bare `r-base` with no version spec. On a fresh `snakemake --software-deployment-method conda` invocation, conda-forge would resolve to whatever's current — likely 4.6.x today — and the install rules running inside that env would hit the same renv-lockfile-vs-R-4.6 wall the devcontainer hit (rlang/S7/yaml compile failures via the upstream R 4.6 header changes). Pin all three to `r-base=4.5` to match renv.lock's declared R version + the matching r-apt pin in .devcontainer/devcontainer.json landed in commit 302aa38. Three pinned spots now move together: - workflow/envs/{rosace,lilace,rosace_aa}.yaml: r-base=4.5 - .devcontainer/devcontainer.json: r-apt feature version "4.5" - renv.lock: "R": { "Version": "4.5..." } Bump in lockstep when CRAN/Bioc catches up to 4.6 and the lockfile-upgrade is safe. The pin matters for any user running with `--software-deployment-method conda` (i.e., not relying on system R from the devcontainer). That's the common pattern on shared compute / production clusters, so the pin is on the load-bearing path even if devcontainer users get away without it. Inline comment in each yaml spells out the same bump-in-lockstep contract as devcontainer.json — so the next person doesn't peel a pin off thinking it's stale. 195/195 unit+integration suite still passing (env-yaml changes don't touch the python/snakemake test surface). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The conda env's python dep was bare `python`, letting the solver pick whatever conda-forge resolved at solve time. Empirically found during the rosace-aa devcontainer rebuild iteration that the floating choice landed on a version where multiqc / pip-installed deps would not solve cleanly with the rest of the env — solver would stall or pick an incompatible pandas / pysam combination. Pin to `python=3.13` to match the version the rest of the toolchain (snakemake 9.21, pytest 9.0.3, pandas 2.3.3, biopython 1.87) was validated against on v0.2.0. Subsequent solves are now deterministic. Bump in lockstep with the rest of the env if/when v0.3 wants to move to a newer Python. Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

cmdstanr::install_cmdstan() without a version argument fetches whatever the cmdstanr release-fetch endpoint reports as latest at install time. That's a moving target: subsequent runs on a different day can install a different CmdStan version, breaking reproducibility of the Stan model compilation step. Pin to 2.39.0 — the version the rosace + lilace + rosace-aa upstream sources were validated against (rosace's MCMCRunStan / lilace's sampler / rosace-aa's MCMCRunStan all target this version) and the version cmdstanr 0.9.0 settled on in the renv.lock chain. Applied at all four install_cmdstan call sites across the three scripts (both the local-managed-R branch and the renv-managed branch in each script). Same pin everywhere keeps the three scoring backends on a single CmdStan install at /home/vscode/.cmdstan/cmdstan-2.39.0. Bump in lockstep with the rosace / lilace / rosace-aa SHA bumps if the upstream maintainers move to a newer CmdStan. Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

13 commits landing the third scoring backend end-to-end. Schema + config + validator + rule + driver + install script + docs + the full install/env-management debugging chain surfaced during empirical validation: d89fd50 Accept rosace_aa as a third scoring_backend (schema + config) 1cba330 Wire rosace-aa rules, driver, and install script fc7e4c9 Document rosace-aa scoring backend in README and example config d39e98a add rosace-aa to readme d55cc11 Devcontainer pre-warms all three scoring backends + adds BLAS/LAPACK 5be4c5c Fail loud when install_rosace's renv::restore silently drops packages b7985c4 Fix OutputScore name + surface AA-decomposition columns in run_rosace_aa e734799 Work around OutputScore.Score row mismatch by skipping pos.info/blosum.info 302aa38 Pin devcontainer R to 4.5.x — R 4.6 breaks the renv lockfile 4b542b2 Force --cores 1 on postCreate warmup — parallel installs race on 00LOCK-renv ecd3354 Pin per-rule conda envs to r-base=4.5 (mirroring devcontainer pin) c021733 Pin dumpling_env to python=3.13 9159b96 Pin CmdStan to 2.39.0 in all three install scripts Empirically validated 2026-05-27 on the rebuilt devcontainer: concordance Pearson r = 0.999 (cond_A, n=1773) / 0.997 (cond_B, n=1907) vs scoring_backend=rosace on the example fixture. Known limitations carried into dev (tracked in tasks/tasks.md): - rosace-aa's pos.info/blosum.info OutputScore columns are currently suppressed pending an upstream fix at the pinned SHA c0fda386. The decomposition data is still persisted to disk in phi.tsv / sigma2.tsv / nu.tsv / beta.tsv / input_*.RData and can be joined manually via df_map until the upstream fix lands. - Full renv/snakemake-install-handling story is tracked as a follow-up: cold-start tests, cache-hit invariants, recovery invariants, concurrency revisit, env consolidation, CI automation, README first-run-user recipe. 195/195 unit+integration suite passing. Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Follow-up to 302aa38 (Pin devcontainer R to 4.5.x). The rocker-org/r-apt feature's `version` option doesn't take a numeric R version string — values like "4.5" silently fall through to the default source (Debian testing), which currently ships R 4.6 and re-triggers the renv-lockfile breakage that 302aa38 tried to avoid. The correct knob is `useTesting: false`, which switches the apt source from Debian testing → CRAN's official debian-cran40 repo. That repo tracks the released R paired with each Debian codename — for Debian Trixie this is R 4.5.x, matching renv.lock's declared version. Same intent as 302aa38, working mechanism. Inline comment updated to explain WHY useTesting=false rather than just naming the flag, so the next person to revisit doesn't peel it off thinking it's a stylistic choice. Lands directly on dev (not a branch) — this is a one-line corrective follow-up to a commit already in dev's history, not new feature work. Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Empirical e2e validation 2026-05-27: ran format_mavedb against results/example_experiment/rosace/cond_A_scores.csv (1773 real rows) and the conversion crashed with `KeyError: 'X'` inside to_mave_hgvs. Root cause: the converter's _AA_1to3 mapping had `*` → `Ter` for stop codons, matching standard one-letter HGVS notation. But dumpling uses `X` for stops internally, per process_variants.aa_3to1_dict's "Stp" → "X" entry. So real score CSVs contain `p.(S2X)`, `p.(E3X)`, etc. — not `p.(S2*)`. The empirically-validated converter was wrong about which notation it would encounter. Add `"X": "Ter"` to _AA_1to3. Keep `"*": "Ter"` as a fallback for robustness (if any future path emits standard HGVS stop notation). Empirical concordance gate now passes on both example-fixture conditions: cond_A (1773 rows, 78 nonsense via X) and cond_B (1907 rows, 81 nonsense via X) both convert cleanly and validate as MAVE-HGVS via mavehgvs.Variant on every single row (3680 / 3680 valid). Same gate also confirms the existing normalize → to_mave_hgvs chain handles the same-AA-substitution → synonymous path correctly (89 + 92 synonymous rows across the two conditions, all emit `p.<wt3>{pos}=`). The unit test for nonsense conversion now covers the X notation explicitly (dumpling's actual format) alongside the `*` fallback, so the X-path can't silently regress. Comment in the test docstring records the cond_A row-count finding for future readers. 208/208 unit+integration suite passing (was 195 on rosace-aa branch's tests, +13 mavedb-specific format_mavedb tests). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Four commits landing the mavedb deposit-format work: ce5d2cd add initial mavedb and sra deposit/formatting (cherry-picked from pre-v0.2.0 d235265 — scaffolds deposit.smk, format_mavedb.py, prepare_sra.py) 731e195 Fix lilace _BACKEND_DEFAULTS column names + cover with tests (rosace uses variants/mean/sd; lilace uses variant/effect/effect_se — the initial scaffolding had them identical, breaking lilace deposit) 061ef80 Emit MAVE-HGVS canonical form + validate via mavehgvs (MAVE-HGVS requires three-letter codes, no parenthesized predicted form, Ter for stop; new to_mave_hgvs converter does the structural conversion; mavehgvs.Variant validates each row post-conversion) c604112 Map dumpling's `X` stop code to `Ter` for MAVE-HGVS (empirical e2e fix — dumpling's internal stop notation is `X`, not `*`, found when running format_mavedb against real rosace output) Empirically validated 2026-05-27 against the example-fixture rosace score CSVs: 3680 / 3680 rows across cond_A + cond_B convert cleanly and validate as MAVE-HGVS via mavehgvs.Variant. Coverage includes 78 + 81 nonsense (X → Ter) and 89 + 92 synonymous (same-AA missense → position-specific `=`) variants from the real data. prepare_sra.py was deferred per user direction during the audit phase; its scaffolding ships but the field validation against NCBI's current template is tracked as a follow-up in tasks/tasks.md. 208/208 unit+integration suite passing on the mavedb branch tip; expecting the merged dev suite to grow by the 13 mavedb-specific tests (format_mavedb's TestNormalizeHgvs / TestBackendDefaults / TestFormatScores / TestToMaveHgvs / TestFormatScoresMaveHgvs / TestMaveHgvsValidation). Co-Authored-By: Chris <[email protected]> Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The original guard at the top of main() in both scripts was if "snakemake" in dir(): ...snakemake path... ...argparse CLI path... `dir()` called inside a function returns LOCAL names, not the module's globals. Snakemake's `script:` directive injects `snakemake` as a module-level global, so the guard always returned False and main() silently fell through to the argparse path. When run from the CLI with the right args this worked; when run via snakemake it crashed: usage: tmp2i0otwl0.format_mavedb.py [-h] --scores SCORES --output OUTPUT ... format_mavedb.py: error: the following arguments are required: --scores, --output Empirically surfaced 2026-05-27 while wiring `deposit_to_mavedb` default-on into `get_input`. Previous testing exercised the script via direct CLI invocation (python format_mavedb.py --scores ... --output ...), which masked the bug. The original `deposit.smk` rule was also never included in workflow/Snakefile (separate fix in the follow-up commit), so the snakemake path had never actually fired. Switch both scripts to `"snakemake" in globals()`, which IS the attribute the script directive injects. New tests/unit/test_format_mavedb.py::TestMainSnakemakeDispatch covers both paths: - With a SimpleNamespace `snakemake` global monkeypatched onto the module, main() takes the snakemake path and produces output at the snakemake.output[0] location. - With no `snakemake` global, main() falls through to argparse and SystemExits on missing required args (verifies the CLI path isn't accidentally broken too). If someone "refactors" the dispatch back to dir() or removes the mode detection, the first test fires. 220/220 unit+integration suite passing (was 218 → +2 dispatch tests). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The MaveDB-format deposit CSV emitter (format_mavedb rule, added on the mavedb branch) was scaffolded as manual-run-only — the rule existed but neither it nor prepare_sra was added to `get_input`, and deposit.smk wasn't even included in workflow/Snakefile. Users had to know the rule existed AND target the output path explicitly AND manually include the smk file. Effectively invisible. Empirical validation (3680/3680 example-fixture rows convert and validate as MAVE-HGVS via mavehgvs.Variant, end-to-end on both example-fixture conditions) shows the converter is robust. The post- scoring work is also cheap (~seconds). Default it on. Changes: - workflow/schemas/config.schema.yaml: add deposit_to_mavedb: boolean, default true. Schema description spells out what it produces, where, and the opt-out flag. - workflow/rules/common.smk: setdefault + extend get_input to add `results/{experiment}/deposit/mavedb/{cond}_mavedb.csv` per experimental condition when the flag is true. - workflow/Snakefile: actually include rules/deposit.smk (was never wired in originally; targeting the rule's output path via snakemake would have failed before this). - config/example.yaml: document deposit_to_mavedb in the config comments + add the explicit `deposit_to_mavedb: true` line so users see it in the canonical config. `prepare_sra` stays manual-only regardless of this flag: its output ships `FILL_IN: e.g. Illumina NovaSeq 6000` and `FILL_IN: e.g. SAMN00000000` placeholders that the user has to fill in. Auto-firing it on every run would produce a placeholder-laden SRA metadata TSV that's misleading-by-default. Empirically validated: `snakemake -s workflow/Snakefile --configfile config/example.yaml --cores 4 results/example_experiment/deposit/ mavedb/cond_A_mavedb.csv results/example_experiment/deposit/mavedb/ cond_B_mavedb.csv` produces 1773 + 1907 valid MAVE-HGVS rows. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Codon matching was dropped in 28dbec2, leaving a name-only match that accepted off-target codons (wrong_codon_counts stuck at 0). Match on (name, codon) again and split rejects into wrong_codon vs wrong_variant.

Preserve distinct-codon designed variants through dedup, and sum codon-level counts back per hgvs in the Enrich2 TSV so scoring stays protein-level.

28dbec2 also dropped the insdel_variant_counts and multi_variant_counts increments, leaving both buckets stuck at 0. Affected reads were still rejected (into wrong_variant_counts), but the QC breakdown was lost. Classify rejections into their dedicated buckets again.

The insdel length cap was dropped in 28dbec2, leaving max_deletion_length a no-op config knob. Re-thread it into process_insdel: a recoverable insdel is accepted as an in-frame deletion only when within the limit, else rejected as an insdel. A non-positive value disables the cap; noprocess bypasses it. Default stays 3 (schema, applied by validate()).

The new benchmark tests read tests/fixtures/benchmark_variants.csv, which was untracked — commit it so the suite passes on a fresh checkout. Also run the dedup conflict test in a temp cwd so the debug duped.csv it triggers no longer lands in the repo root.

The unrecoverable-duplicate debug dump went to "duped.csv" in the process working directory. Route it to results/<experiment>/duped_variants.csv via a dup_report_path arg (parent dirs created as needed), so it lands in a predictable place instead of polluting the cwd.

Emit a single wide CSV joining each phenotype-assigned condition's scores into cosmos's (pimentellab/cosmos) beta_hat_N/se_hat_N columns, for causal direct-vs-indirect modeling across sequential phenotypes. Sibling of the MaveDB deposit. - format_cosmos.py: outer-join the per-condition score CSVs on variant, NA-pad missing phenotypes, coalesce group/type; backend-aware columns. derive_cosmos_type fuses indel length into the type label (insertion3), keyed off the I_/D_/ID_ mutation code; substitutions pass through. - Conditions opt in via an optional `phenotype` slot column on the experiment CSV (presence = inclusion). Validation (get_cosmos_phenotype_conditions) lives in script_utils alongside the other experiment checks: per-condition agreement, no baseline slot, contiguous 1..N. - deposit_to_cosmos config flag (off by default); rule format_cosmos. - Tests for derive_cosmos_type, the multi-condition join, and slot validation.

Add sequence-based variant identification: anchor each oligo's gene fragment to the reference by exact k-mer seeding (per-oligo, adapter- agnostic), then classify the single localized edit as substitution, synonymous, in-frame deletion, or insertion. Runs in designed_variants sequence-first, falling back to the DIMPLE name regexes only when the sequence can't be resolved (terminal codons, circular refs, multi-edit). Anchored exact-match keeps it ~O(oligo length) per oligo (no alignment, no new dependency); ~0.35 ms/oligo on a 1.5 kb gene. Indel positions are snapped to codon boundaries and the reconstruction verified to handle left-alignment ambiguity.

Validated against real CFTR DMS libraries (CFTR_1_oligos.csv vs the construct reference): 94% of oligos resolved by sequence with zero wrong calls; the rest fall back to names (gene-terminus codons, a divergent chunk). Fixes found in the process: - Resolve substitutions by walking CODONS in the ORF frame rather than base-level gap scanning, so a per-oligo adapter that coincidentally matches a base or two at the gene boundary is not misread as a second edit (this was silently dropping codon-2 and other near-boundary variants to the fallback path). - Name stop codons 'X', matching the rest of the pipeline (not '*'). - Drop the now-unused _extend_match_left helper.

Enrich2 writes large .h5 store files into its output directory as a side effect; dumpling only consumes the exported tsv scores, so the stores are pure intermediates. Rather than have the pipeline shell out to rm/find, run_enrich now declares the stores as Snakemake temp() outputs so Snakemake itself deletes them (and only them) once scoring completes -- it is structurally unable to remove anything it did not create as that rule's output. The store filenames are enumerated at parse time by expected_enrich_h5_basenames, which applies the same T0/min-timepoint filtering as generate_config and reuses shared selection/library label helpers so the predicted set can't drift from what Enrich2 writes (pinned to the committed results/example_experiment/enrich layout in tests). The keep_enrich_h5 config flag (default false) opts out: when true the stores are left undeclared and simply persist for debugging.

single-env image bundling snakemake + bioinformatics stack + R/Stan with cmdstan pre-compiled. snakefile points users at the image via the containerized directive; CI publishes to ghcr.io/odcambc/dumpling on tag push and tracks dev/container branches for rolling builds. README quickstart documents both docker and apptainer/singularity invocations.

Make every heavy rule declare resources: mem_mb so a scheduler requests the right --mem, and add the SLURM profile that was missing (the cluster branch referenced workflow/profiles/slurm but never committed it). - Per-tool memory budgets unified to MB (matching mem_rosace/mem_lilace); wired into resources: mem_mb on trim_clean_correct (sum of its 3 concurrent JVMs), map_to_reference_bbmap/minimap2, gatk_ASM, prepare_bbmap_index, prepare_minimap2_index, process_sample. - Java -Xmx heaps now derive a fixed headroom below the cgroup allocation via script_utils.java_heap_gb, so a JVM can't OOM at the --mem limit; GATK gets --java-options '-Xmx'. Replaces the global config[mem] -Xmx for these rules. - workflow/profiles/slurm/config.yaml (executor, default-resources, restart/ latency/rerun settings, apptainer deployment; account/partition left as documented placeholders) and workflow/profiles/default/config.yaml for local. - Schema + example.yaml document the mem_* knobs; README gains a 'Running on a cluster' section. - Tests: java_heap_gb headroom unit tests; profile-config validation; a snakemake dry-run asserting bbmap/gatk resolve mem_mb (no scheduler needed).

The cosmos export had only unit coverage. Add the two missing pieces: - A 2-condition fixture (mock_experiment_cosmos.csv, phenotype slots 1/2) and a snakemake dry-run test that format_cosmos wires into the DAG with the phenotype-assigned conditions' scores as slot-ordered inputs. - A skip-guarded test (runs only where the real cosmos package is importable) that a format_cosmos-emitted CSV loads into cosmos.DMSData, including the outer-join NA case (a variant scored in one phenotype only). Validated against cosmos 1.0.2: DMSData loads the NA-padded file and keeps the NA rows, so the outer-join policy ships as-is. Resolves the design-doc load-time open question.

The repo had no project venv, so 'uv run' silently fell back to the conda base interpreter and dev/test deps (plus dimple's numpy pin) piled up in base. requirements-dev.txt also only listed tooling, omitting the libraries the tests import (pandas, biopython, regex, jsonschema, pyyaml, mavehgvs) and snakemake. Declare the pure-Python test/lint layer in pyproject.toml as a uv-managed virtual project (package = false, so uv never builds/installs dumpling and the file is inert to the conda pipeline env): - [dependency-groups] dev: the full test/lint layer. 'uv sync' + 'uv run pytest' runs the whole suite in an isolated .venv -- including the snakemake dry-run integration tests, with no PATH hack. - [project.optional-dependencies] cosmos: opt-in extra for the skip-guarded real-cosmos test ('uv sync --extra cosmos'). Kept out of dev only for weight (numpy 2.x + arviz/xarray/scikit-learn); no version conflict, the .venv is isolated. matplotlib<3.11 dodges an arviz import of matplotlib.style.core. - uv.lock committed for reproducibility; requires-python >=3.12 (cosmos floor). - CONTRIBUTING.md documents uv sync / --extra cosmos and the test-vs-runtime split; the devcontainer installs the dev group via pip --group dev. The pipeline runtime (dumpling_env.yaml conda env, per-rule envs, container) is untouched. Suite: 296 passed (dev + cosmos extra); 294 passed / 2 skipped (dev only, cosmos test skips).

cosmos is a score generator (per-position direct/indirect causal decomposition across two sequential phenotypes), not a deposit target and not a base scorer. It can't be a scoring_backend value (it consumes the base scorer's per-phenotype effects rather than scoring counts), so it's modeled like enrich2: an optional analysis layered on top of the chosen scoring backend. - run_cosmos (config boolean, default false): when true, cosmos runs as part of the build, producing the per-position decomposition results/{exp}/cosmos/{exp}_cosmos_results.csv. Requires exactly two conditions assigned phenotype slots 1 and 2 (validated at parse time). - Two rules: format_cosmos prepares the wide input table (results/{exp}/cosmos/{exp}_cosmos.csv) from the scoring backend's per-condition scores; run_cosmos (envs/cosmos.yaml: pip cosmos-dms + matplotlib<3.11, pure-Python no Stan) fits cosmos and writes the summary. cosmos lives at results/{exp}/cosmos/ (an analysis alongside rosace/lilace), NOT under deposit/ — it isn't a deposit. - scripts/run_cosmos.py: DMSData -> PriorFactory -> ModelBuilder -> run_cosmos per position -> ModelAnalyzer.summary. API pinned against cosmos 1.0.2's vignette. CLI + Snakemake entry. mem_cosmos resource; optional cosmos settings block (include/exclude_type, x_gmm_n_components, min_num_variants_per_group). - Tests: format + run DAG wiring, a slow real-run test (per-position summary), the two-phenotype guard. Suite 299 passed. PERFORMANCE: cosmos fits per position (~tens of seconds each, serial), so enabling run_cosmos can add hours on a large library. Parallelizing the fits (scatter-gather) is deferred and tracked in tasks/tasks.md.

The in-image smoke test validated the DAG against config/example.yaml, whose data_dir (example/data) is gitignored, so a fresh CI checkout has no input data and validate_config hard-fails. It has failed on every dev push since it was added. Drop the step (keep the build, which still verifies the image compiles); re-enabling needs a committed CI fixture.

Copilot

Pull request overview

This release PR fast-forwards dev to main, bundling multiple validated feature lines into the dumpling Snakemake pipeline: new scoring backends and determinism knobs, improved variant identification/filtering, deposit/export utilities (MaveDB/SRA/cosmos), and substantial infra/dev-experience upgrades (container publishing, cluster profiles, uv-based dev tooling).

Changes:

Add rosace-aa backend, cosmos export + optional run stage, and MaveDB/SRA deposit formatting rules/scripts.
Improve variant processing correctness/performance (sequence-based designed-variant identification, codon-aware filtering, restored insdel/multi-variant accounting, Enrich2 HDF5 temp cleanup).
Add operational tooling: per-rule resources + cluster profiles, persistent aligner indexes, and a prebuilt GHCR container + build workflow; introduce uv/ruff/snakefmt/pre-commit dev layer.

Reviewed changes

Copilot reviewed 78 out of 79 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
workflow/Snakefile	Adds container URI and includes new rulesets (rosace_aa, deposit).
workflow/schemas/experiments.schema.yaml	Adds optional `phenotype` column for cosmos slot assignment.
workflow/schemas/config.schema.yaml	Extends scoring backend enum; adds cosmos/deposit/enrich cleanup and per-rule memory knobs; replaces BBTools compression knob.
workflow/rules/scripts/script_utils.py	Adds legacy BBTools-compression translation, scoring-mode validation update, JVM heap helper, and cosmos phenotype validation helper.
workflow/rules/scripts/run_rosace.R	Improves failure messaging on missing per-sample count TSVs; logs Rosace seed determinism note.
workflow/rules/scripts/run_lilace.R	Adds deterministic/traceable Stan seeding via `lilace_seed` with logging.
workflow/rules/scripts/run_cosmos.py	New: runs cosmos model per position and writes summary CSV.
workflow/rules/scripts/remove_zeros.py	Minor formatting cleanup.
workflow/rules/scripts/process_variants.py	Codon-aware filtering, restored insdel length gate, performance improvements, and codon→protein collapse for scoring TSV seam.
workflow/rules/scripts/process_counts.py	Wires codon normalization for designed variants and updates formatting.
workflow/rules/scripts/prepare_sra.py	New: builds SRA metadata TSV + FASTQ file list from experiment CSV + data dir.
workflow/rules/scripts/install_rosace.R	Pins CmdStan version and adds a loud post-install load check for rosace.
workflow/rules/scripts/install_rosace_aa.R	New: installs rosace-aa at a pinned SHA with load verification.
workflow/rules/scripts/install_lilace.R	Pins CmdStan version.
workflow/rules/scripts/generate_enrich_configs.py	Centralizes Enrich2 object naming and predicts HDF5 store outputs for Snakemake-managed temp cleanup.
workflow/rules/scripts/generate_baseline_file_list.py	Formatting/import ordering cleanup.
workflow/rules/scripts/generate_baseline_configs.py	Formatting cleanup.
workflow/rules/scripts/format_cosmos.py	New: formats per-condition scores into cosmos wide CSV (beta_hat_N/se_hat_N) with type derivation.
workflow/rules/rosace.smk	Reorders directives and wires conda env consistently.
workflow/rules/rosace_aa.smk	New rules for rosace-aa install + scoring.
workflow/rules/ref.smk	Adds FASTA header normalization step and makes aligner indexes persistent; wires per-rule resources + JVM heap sizing.
workflow/rules/qc.smk	Formatting/reordering; adds per-rule mem_mb for fastqc.
workflow/rules/process.smk	Uses normalized reference for designed-variant generation; adds mem_mb for process_sample.
workflow/rules/map.smk	Adds per-rule mem_mb and JVM heap sizing for bbmap; adds mem_mb for minimap2 path.
workflow/rules/filter.smk	Adds per-tool heap sizing and aggregate mem_mb for concurrent BBTools JVMs.
workflow/rules/enrich.smk	Declares Enrich2 HDF5 stores as temp() outputs (unless opted out) for automatic cleanup.
workflow/rules/deposit.smk	New: MaveDB export, cosmos format+run rules, and manual-only SRA prep rule.
workflow/rules/common.smk	Adds cosmos/deposit outputs to `all`, config defaults, BBTools compression flags helper, and normalized reference wiring.
workflow/rules/baseline_qc.smk	Formatting cleanup.
workflow/rules/asm.smk	Switches to normalized reference and pins GATK JVM heap below cgroup allocation; adds mem_mb.
workflow/profiles/slurm/config.yaml	New SLURM execution profile with default resources and containerized execution.
workflow/profiles/default/config.yaml	New local profile with mem_mb resource budget and mtime rerun-triggers.
workflow/envs/rosace.yaml	Pins R 4.5 to match renv.lock and documents rationale.
workflow/envs/rosace_aa.yaml	New env for rosace-aa system deps; pins R 4.5 and cmake<3.25.
workflow/envs/lilace.yaml	Pins R 4.5 to match renv.lock and documents rationale.
workflow/envs/cosmos.yaml	New pip-based cosmos env (python>=3.12) with matplotlib pin for arviz compatibility.
tests/unit/test_script_utils.py	Adds unit tests for cosmos phenotype validation and JVM heap sizing; extends scoring-backend mode checks.
tests/unit/test_scoring_backend_config.py	Updates schema tests for rosace_aa and new memory knobs.
tests/unit/test_remove_zeros.py	Import reordering/formatting.
tests/unit/test_process_variants.py	Adds extensive coverage for codon-aware filtering + restored insdel/multi-variant stats; updates signatures.
tests/unit/test_process_counts.py	Import cleanup to match updated module shape.
tests/unit/test_prepare_reference.py	New: tests for FASTA header normalization awk behavior.
tests/unit/test_lilace_seed_config.py	New: schema-level tests for `lilace_seed` and explicit absence of `rosace_seed`.
tests/unit/test_install_scripts_fail_loud.py	New contract test ensuring install scripts verify package load before writing marker file.
tests/unit/test_generate_variants.py	Adds tests for codon-preserving dedup and sequence-based variant identification (issue #27).
tests/unit/test_format_cosmos.py	New: unit tests for cosmos type derivation and wide-join formatting.
tests/unit/test_enrich_configs.py	Adds tests for Enrich2 object naming and predicted HDF5 store list.
tests/unit/test_bbtools_compression_config.py	New: schema + back-compat shim tests for bbtools_compression.
tests/r/testthat/test-resolve_lilace_seed.R	New: R tests for lilace seed resolver behavior and logging contract.
tests/r/testthat/test-build_counts.R	Adds R regression test for missing per-sample TSV error message.
tests/r/testthat/helper-lilace-functions.R	Adds helper copy of resolve_lilace_seed to keep R tests in lockstep with pipeline script.
tests/r/testthat/helper-functions.R	Mirrors missing per-sample TSV guard in helper implementation.
tests/integration/test_processing_pipeline.py	Formatting cleanup.
tests/integration/test_file_resolution.py	Formatting cleanup.
tests/integration/test_dag_construction.py	Adds dry-run coverage for Enrich2 temp-output wiring (keep_enrich_h5 on/off).
tests/integration/test_cosmos_integration.py	New: dry-run DAG wiring tests and optional real-cosmos load/run validations.
tests/integration/test_config_validation.py	Formatting cleanup.
tests/integration/test_cluster_profiles.py	New: validates profiles parse and heavy rules resolve mem_mb in dry-run.
tests/fixtures/mock_experiment_cosmos.csv	New fixture including phenotype slots for cosmos tests.
tests/conftest.py	Import ordering cleanup.
ruff.toml	New ruff configuration, incl. snakemake builtin and excludes.
requirements-dev.txt	Removed in favor of uv/pyproject-managed dev deps.
README.md	Documents container usage, new backends, cluster profiles, aligner choice, and new config knobs.
pyproject.toml	New uv-managed virtual project for dev/test tooling and optional cosmos extra.
dumpling_env.yaml	Pins python=3.13 and adds pip-only mavehgvs dependency.
Dockerfile	New: builds single “mega-env” container image and precompiles CmdStan.
CONTRIBUTING.md	New: contributor workflow using uv + ruff/snakefmt and optional cosmos extra.
config/example.yaml	Updates example config for new knobs (backend, deposit, cosmos, memory, compression).
cluster_env.yaml	New: thin login-node environment for SLURM submission.
.pre-commit-config.yaml	New: snakefmt + ruff pre-commit hooks.
.github/workflows/build-container.yml	New: GH Actions workflow to build/push GHCR container on branches/tags.
.devcontainer/postCreate.sh	New devcontainer bootstrap to build conda env, optional dev deps, and pre-warm scoring backends.
.devcontainer/devcontainer.json	Updates devcontainer system deps and switches postCreate to the new script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

odcambc · 2026-06-15T21:09:12Z

Adds #31 and should resolve #9

1. common.smk: normalized_reference_file now strips to Path(reference).name. If reference were absolute or directory-qualified, Path('ref') / <absolute> collapses to the original location and prepare_reference would overwrite the user's own FASTA. Basename pins the rewrite under the workflow-managed ref/ dir (mirrors reference_name's .stem). 2. Remove the manual-only prepare_sra rule + scripts/prepare_sra.py and its schema/get_input mentions. It was an untested, placeholder-shipping stub that rode in with the mavedb commit (ce5d2cd), never on the default target. Deferred to return as a properly-scoped feature (with anchored fastq matching that raises on ambiguous per-lane/prefix matches).

odcambc and others added 30 commits May 26, 2026 18:00

add initial mavedb and sra deposit/formatting

ce5d2cd

add rosace-aa to readme

d39e98a

Fix MaveDB formatting and add count outputs

963007b

typo

b663227

apply snakefmt

caa5f50

odcambc added 21 commits May 27, 2026 11:17

ruff format

72421ae

add mavedb format tests

2f14358

adds note about count/score merge behavior

1321917

Restore codon-based variant filtering

13006a3

Codon matching was dropped in 28dbec2, leaving a name-only match that accepted off-target codons (wrong_codon_counts stuck at 0). Match on (name, codon) again and split rejects into wrong_codon vs wrong_variant.

Support multiple codons per protein variant (#23)

a5a192b

Preserve distinct-codon designed variants through dedup, and sum codon-level counts back per hgvs in the Enrich2 TSV so scoring stays protein-level.

Add default resources

9d0af2c

Add cluster env

69f4c39

odcambc requested a review from Copilot June 15, 2026 21:01

Copilot started reviewing on behalf of odcambc June 15, 2026 21:02 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread workflow/rules/common.smk Outdated

Comment thread workflow/rules/scripts/prepare_sra.py Outdated

odcambc added 3 commits June 15, 2026 16:23

add catalog file for snakemake and change docker tagged version

460dfd1

Update readme, add config readme, update config defaults and schema

629ddcc

odcambc merged commit 0fc2b14 into main Jun 16, 2026
2 checks passed

odcambc deleted the dev branch June 16, 2026 22:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release: cosmos, rosace-aa, sequence-based variants, MaveDB deposit, cluster profiles, uv dev env#46

Release: cosmos, rosace-aa, sequence-based variants, MaveDB deposit, cluster profiles, uv dev env#46
odcambc merged 61 commits into
mainfrom
dev

odcambc commented Jun 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

odcambc commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

odcambc commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Scoring backends

Variant identification & filtering

Deposit & downstream analysis

Performance & infra

Dev experience

Closes

Test plan

Notes / follow-ups (not in this PR)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

odcambc commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

odcambc commented Jun 15, 2026 •

edited

Loading