Skip to content

Release: cosmos, rosace-aa, sequence-based variants, MaveDB deposit, cluster profiles, uv dev env#46

Merged
odcambc merged 61 commits into
mainfrom
dev
Jun 16, 2026
Merged

Release: cosmos, rosace-aa, sequence-based variants, MaveDB deposit, cluster profiles, uv dev env#46
odcambc merged 61 commits into
mainfrom
dev

Conversation

@odcambc

@odcambc odcambc commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Summary

Promotes the accumulated dev work to main57 commits, 79 files, +11.6k/−0.3k. This bundles several independent feature lines that have each been validated on dev (full test suite green with cosmos + snakemake installed). History is linear and dev is 0 commits behind main, so this can fast-forward.

What's included

Scoring backends

  • rosace-aa as a third scoring_backend — wraps pimentellab/rosace-aa, adding per-position + amino-acid substitution effect decomposition alongside rosace (default) and lilace. New install script, rules, driver, and config/schema wiring.
  • Determinism phase 1 — Stan seeds are now logged on every run; lilace's seed is configurable (lilace_seed) for bit-identical reruns. Documents why there is intentionally no rosace_seed knob.

Variant identification & filtering

Deposit & downstream analysis

  • MaveDB deposit formattingformat_mavedb (per-condition, MAVE-HGVS canonical form validated via mavehgvs); deposit_to_mavedb is on by default. (SRA submission formatting is intentionally not included — see follow-ups.)
  • cosmos integrationformat_cosmos builds cosmos's wide multi-condition input, and an opt-in run_cosmos stage actually invokes pimentellab/cosmos (enrich2-style: layered on top of the chosen scoring backend) to produce the per-position direct/indirect decomposition. Off by default; isolated in its own conda env. End-to-end validated (DAG wiring + real DMSData load against cosmos 1.0.2).

Performance & infra

  • Parallel pigz for BBTools compressed IO; persistent aligner indexes across runs (--notemp documented); minimap2 as an alternative aligner.
  • Enrich2 .h5 cleanup (closes clean up enrich h5s #16) — intermediate HDF5 stores declared as Snakemake temp() so Snakemake (not the pipeline) reclaims them; keep_enrich_h5 opts out.
  • Cluster profiles + per-rule resources (closes Cluster profiles #13) — SLURM profile, default profile, and per-rule mem_mb / threads / runtime; JVM heaps pinned a headroom below the cgroup allocation.
  • Prebuilt container + GHCR publish workflow; thin cluster_env.yaml driver.

Dev experience

  • uv-native test/lint layerpyproject.toml + uv.lock virtual project with a dev group and cosmos extra; CONTRIBUTING.md; devcontainer uses pip install --group dev.
  • ruff + snakefmt + pre-commit tooling.

Closes

Closes #13, #16, #19, #23, #27.

Test plan

  • Full unit + integration suite passes on dev with snakemake and cosmos installed (299 passed in the last run).
  • cosmos end-to-end validated against the real cosmos-dms 1.0.2 package (DAG construction + DMSData load with NA tolerance; slow real-run test guarded).
  • --dry-run green on the example config.

Notes / follow-ups (not in this PR)

  • Multi-scorer refactor (any combination of scorers, enrich-only runs) is designed and partially branched but intentionally out of scope here.
  • cosmos run is serial (~tens of seconds/position); scatter-gather parallelization is a tracked follow-up.
  • Cluster profile is wired but not yet validated on a live SLURM cluster.
  • An untested, manual-only prepare_sra stub was removed from this PR; SRA submission formatting will return as a properly-scoped feature (anchored fastq matching, real instrument/biosample config, tests).
  • The container CI smoke test (snakemake --dry-run in-image) was dropped — it validated against the gitignored example/data, so it never had inputs in CI. Re-enabling it is tied to the committed end-to-end fixture follow-up.

odcambc and others added 30 commits May 26, 2026 18:00
run_rosace.R's build_counts_for_replicate calls read_tsv without
checking that the per-sample file exists. When upstream GATK or
process_counts drops a sample, users see a raw "cannot open file"
R traceback with no hint of which sample failed.

Mirror the lilace guard from 0da8edc on the rosace path: file.exists
check before read_tsv, error names the missing file and the sample
that was supposed to produce it. New test in test-build_counts.R
locks the message in (helper-functions.R carries the sibling copy
of the function used by the test harness, updated in lockstep).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The prior `bbtools_use_bgzip` bool only toggled bgzip on or off; both
paths left BBTools' compressed IO effectively single-threaded for the
8 GB+ FASTQs DMS runs ship. Replace it with `bbtools_compression`
(`pigz | bgzip | none`, default `pigz`), and emit `usepigz=t unpigz=t
bgzip=f unbgzip=f` so BBTools spawns pigz for (de)compression.

Audit-time spec also called for `pigzthreads={threads} unpigzthreads=
{threads}`, but empirical probe on BBMerge 39.13 showed both flags are
rejected as unknown parameters:

  java.lang.RuntimeException: Unknown parameter pigzthreads=4

Dropped from the emitted flag string. pigz invoked without an explicit
thread count uses its own default (online CPU count), which still
gives the parallel-vs-serial win. Cluster oversubscription is bounded
by rule scheduling rather than per-pigz thread caps.

Back-compat shim in script_utils.translate_legacy_bbtools_compression
translates the deprecated bool byte-compatibly:
  bbtools_use_bgzip: true  -> bbtools_compression: bgzip
  bbtools_use_bgzip: false -> bbtools_compression: none
so existing user configs keep producing the same BBTools invocations.

Empirical concordance gate on tests/fixtures example:
- 12/14 processed_counts/*.csv byte-identical between pigz mode and
  bgzip mode (both fresh runs, --cores 4).
- 2/14 differ by exactly one row × ±1 count — sub-ppm noise from
  BBMerge multi-threaded scheduling, well within the inherent thread
  non-determinism the audit memory documents (~15-25 ppm).
Pigz vs bgzip itself produces identical counts.

New tests/unit/test_bbtools_compression_config.py mirrors the
test_aligner_config.py pattern: schema shape, enum enforcement,
snakemake.utils.validate default injection, and the shim's
byte-compatibility invariants. 14 new tests, 43/43 affected suite
passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Reference generation today silently assumes the FASTA's first `>` line
matches `Path(config["reference"]).stem`. When it doesn't — e.g. a
user-supplied `my_ref.fasta` whose header is `>some_other_contig` —
BBMap index build and GATK SAM/ref contig matching both fail with
opaque errors.

Add a `prepare_reference` rule that copies the user's reference into a
working path under `ref/` and rewrites the first `>` line to
`>{reference_name}`. Downstream consumers (prepare_dict_GATK,
prepare_index, prepare_bbmap_index, prepare_minimap2_index, gatk_ASM,
generate_variants) now read from the normalized path instead of the
raw user file. The user's original FASTA is left untouched.

awk rewrites only the FIRST `>` line; multi-contig FASTAs preserve
their secondary contig names so anyone relying on those isn't
silently renamed (DMS data is single-contig in practice, but the
safer default is to touch the minimum).

`reference_file` still points at the raw user file because
`file_digest(reference_file)` (common.smk:351) runs at parse time
before any rule fires; the normalized file doesn't exist yet then.
The `ref_digest` derived from the raw bytes is the right cache key
anyway — a changed raw reference should invalidate the bbmap index
regardless of normalization.

Empirical probe: ran `snakemake --until prepare_bbmap_index` against
a crafted FASTA with `>some_weird_contig_name_42` and filename
`example_ref.fasta`. Without this fix, BBMap fails on the contig name
mismatch (per Upstream #19's reporter). With it:

  Input header:  >some_weird_contig_name_42
  Output header: >example_ref
  BBMap index: built successfully

Happy case (header already matching filename) re-verified end-to-end
on the example fixture — full pipeline through process_sample
completes with 13/14 csvs byte-identical to the prior pigz baseline
(the 14th differs by 1 row × ±1 count, the same sub-ppm BBMerge
thread non-determinism the previous commit characterized).

New tests/unit/test_prepare_reference.py covers five awk-level
invariants: mismatched header rewritten, matching header idempotent,
multi-contig only rewrites line 1, sequence bytes preserved, special
characters in reference_name pass through literally. 176/176 full
unit+integration suite passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
DMS scoring runs use Stan MCMC and can drift visibly at the per-variant
score level when chain init isn't pinned. Surface the seed in the rule
log so users can reproduce a specific run, and make lilace's
configurable from config.

Two backends, two outcomes:

  - lilace exposes `seed=NULL` on lilace_fit_model (verified against
    pimentellab/lilace R/lilace.R). Add `lilace_seed: int | null` config
    knob (default null → auto-generate from Sys.time() salted with the
    condition name; always logged). User-set integer → bit-identical
    chain init across reruns.

  - rosace's RunRosace does NOT accept a seed argument (verified against
    pimentellab/rosace R/runROSACE.R). MCMCRunStan hardcodes seed=100
    internally, so rosace runs are already bit-identical across reruns
    by construction. Adding a `rosace_seed` knob would silently no-op
    and mislead users — intentionally absent from the schema. Run log
    surfaces "Stan seed: 100 (hardcoded upstream)" for visibility/parity
    with the lilace log line.

resolve_lilace_seed lives in run_lilace.R and a parallel testable copy
in helper-lilace-functions.R, matching the existing parse_stripped_hgvs
pattern. R-side tests (test-resolve_lilace_seed.R) lock in: configured
seed passes through unchanged, zero is a valid seed, auto-gen is
deterministic given fixed clock, distinct conditions get distinct seeds
at the same clock (salt is load-bearing for parallel jobs), seed stays
in Stan's valid uint32 range, log message includes a copy-pasteable
config snippet.

Caught and fixed a 32-bit integer overflow in the auto-gen path —
as.integer(Sys.time()) ≈ 1.7e9 and *17 overflows R's int range to NA.
Compute in double space, then modulo into int range before casting.

Schema-level tests (test_lilace_seed_config.py) mirror the
test_aligner_config / test_bbtools_compression_config pattern: 9 new
tests covering schema shape, enum enforcement (integer and null both
accepted, strings rejected, zero accepted), snakemake.utils.validate
default injection, and the deliberate absence of `rosace_seed`. 185/185
full unit+integration suite passing; R suite green including 8 new
resolve_lilace_seed tests.

MultiQC surfacing of the seed (custom_data sidecar or plugin section)
deferred to the `qc` branch — the rule log is sufficient for the Phase 1
"displayed reproducibility" goal.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Audit of the 8 temp() markers in the workflow (filter.smk, baseline_qc.smk,
map.smk, ref.smk) splits cleanly into two classes:

  - sample-derived intermediates (ec.clean.trim.fastq.gz pair, mapped.bam,
    generated multiqc config + baseline file list) → keep temp().
    Multi-GB-or-no-debug-value artifacts that regenerate every run.

  - reference-derived artifacts (bbmap index dir, minimap2 .mmi) → drop temp().
    Deterministic from ref+digest, keyed on a content hash so collisions
    are structurally impossible across reference changes. Persisting lets
    repeat runs against the same reference skip the ~5s rebuild that
    temp() forced on every invocation.

The deletion behavior was inherited from an earlier (pre-#37) layout where
the bbmap index lived at a fixed shared path and rebuild risked artifact
cross-pollution. Since #37 keyed on `ref_digest`, that risk evaporated —
a different reference always lands somewhere else by construction. temp()
on the index has been pure waste ever since.

For users who want to retain the sample-derived intermediates (e.g.
to inspect a bbduk-trimmed FASTQ or samtools view a mapped BAM),
Snakemake's built-in `--notemp` flag (alias `--no-temp`, `--nt`) covers
the use case with no config knob needed. README now documents the
recipe.

185/185 unit+integration suite passing; snakemake dry-run on the example
fixture schedules the bbmap_index rule normally (it re-fires once on this
commit because the rule definition changed, then caches on subsequent runs
as intended).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The initial mavedb scaffolding (d235265) inherited rosace's column names
for the lilace backend too — `variants`/`mean`/`sd`. Verified against
pimentellab/lilace R/lilace.R `.get_score_df()`: lilace actually emits
singular `variant`, `effect`, `effect_se`. So format_mavedb run on a
real lilace score CSV would have surfaced the existing "Expected columns
not found" error and required users to override every column via config.

Update the lilace defaults to match the emitted schema. Rosace defaults
were verified correct against results/example_experiment/rosace/
cond_A_scores.csv (`variants`/`mean`/`sd`) — left unchanged. A header
comment now records both audits with file references so the next
person to touch this doesn't have to re-derive the schema mapping.

New tests/unit/test_format_mavedb.py covers:
  - normalize_hgvs converts rosace's same-AA missense (`p.(A1A)`) into
    MaveDB-canonical synonymous notation (`p.(A1=)`), which is what
    mavehgvs needs to recognize the variant as synonymous (it doesn't
    auto-normalize — Phase 0 audit finding).
  - Both backends' _BACKEND_DEFAULTS pin the expected column names.
    Catches schema drift in either direction.
  - End-to-end format_scores on tiny in-memory fixtures for both
    backends, including the synonymous-normalization case.
  - Missing required columns produce a clear error that names the file
    path and lists available columns.
  - The optional `sd` column is omitted from output when absent rather
    than crashing or fabricating SD values.

Also adds the `if __name__ == "__main__"` guard that format_mavedb.py
and prepare_sra.py were missing. Without it, importing either module
from a test fires main() at import time and the argparse path crashes
on pytest's sys.argv. The Snakemake `script:` directive still works
because Snakemake invokes the script with `__name__ == "__main__"`
set.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The original format_mavedb.py shipped rosace/lilace's one-letter
parenthesized HGVS (`p.(M1L)`, `p.(A1del)`, `p.(A1*)`) directly into
the MaveDB score set. Empirical audit against the MAVE-HGVS spec
(github.com/VariantEffect/mavehgvs/docs/spec.rst) found that's
structurally invalid:

  - MAVE-HGVS rejects the parenthesized `p.(...)` form for everything
    except the population-level synonymous `p.(=)`. Every dumpling row
    today fails the regex.
  - MAVE-HGVS requires three-letter amino acid codes (`Met`, `Ala`,
    `Ter`). Dumpling produces one-letter codes (`M`, `A`, `*`) at every
    stage.
  - Stop codon must be `Ter`, not `*`. The spec explicitly calls this
    out as a rejection case.

Users would have uploaded a score CSV and watched every row bounce.

Add `to_mave_hgvs` that does the structural conversion: strip outer
parens (preserving `p.(=)`), expand one-letter to three-letter via an
inverted aa_3to1_dict (replacing dumpling's internal `Stp`/`X` with
canonical `Ter`), and recurse into the inserted-residue sequence for
insertions. Covers the six DMS variant classes dumpling actually emits:
missense, synonymous-at-known-position, single-residue deletion,
multi-residue deletion, insertion, nonsense. Unknown shapes raise
ValueError listing the supported set, so a new HGVS class can't slip
into scoring output and silently corrupt MaveDB deposits.

`normalize_hgvs` (the existing rosace `p.(A1A) → p.(A1=)` synonymous
canonicalizer) stays — it's still load-bearing. The pipeline now runs
both: `normalize_hgvs → to_mave_hgvs`.

Add `mavehgvs` to dumpling_env.yaml's pip block (verified not on
conda-forge or bioconda 2026-05-26). Import is wrapped: if the lib
isn't available (e.g. user ran without --use-conda), the converter
still runs but validation no-ops. Row-context error messages name
the offending row index when validation fires — so a 10k-row CSV
remains debuggable.

Test coverage:
  - to_mave_hgvs: each variant class pinned, unknown-shape error
    message asserted (so it stays actionable).
  - End-to-end rosace fixture exercises substitution + synonymous +
    deletions + multi-deletion + nonsense in one CSV.
  - The two pre-existing end-to-end rosace/lilace tests updated to
    expect the new MAVE-HGVS output shape (`p.Met1Leu` not `p.(M1L)`).
  - mavehgvs validation tests use pytest.importorskip — they skip on
    dev boxes without the lib, run in CI where the env builds with it.
  - No-op-when-library-missing path tested via monkeypatch, doesn't
    need the lib.

206 passed, 2 skipped (the two live-mavehgvs tests; will fire in CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
pimentellab/rosace-aa extends rosace with position + amino-acid
substitution effect decomposition. Verified the upstream R sources
(audit notes in tasks/tasks.md):

  - Entrypoint RunRosace.Rosace lives in R/runROSACE.R, signature
    additions over rosace: wt.col, mut.col, aa.code, stop.col,
    stop.name, pos.act, blosum.label. Score CSV columns (variants,
    mean, sd, lfsr.*, test.*, label) match rosace's — format_mavedb
    works unchanged.
  - Stan seed is hardcoded to 100 in MCMCRunStan (same as rosace,
    not configurable). No rosace_aa_seed knob needed.
  - Requires synonymous controls + wildtype/mutation metadata
    columns, so it's incompatible with noprocess=true (same wall as
    lilace, plus an additional one).

Schema changes:
  - scoring_backend enum becomes [rosace, lilace, rosace_aa].
    Default stays `rosace` so configs that don't mention the key
    keep their pre-rosace_aa behavior.
  - New rosace_aa_local: boolean (default false), mirroring
    rosace_local / lilace_local.
  - New mem_rosace_aa: integer (default 16000), mirroring
    mem_rosace / mem_lilace.

common.smk setdefault for the two new keys.

validate_scoring_backend_mode rejects scoring_backend='rosace_aa' +
noprocess=true with the same shape as the existing lilace gate, but
the error message names which backend tripped — so when a third or
fourth backend grows the same constraint, the message stays useful.

Tests:
  - test_scoring_backend_config.py: enum acceptance for rosace_aa,
    schema-shape + default-injection assertions for mem_rosace_aa
    and rosace_aa_local.
  - test_script_utils.py: noprocess+rosace_aa is rejected,
    rosace_aa+noprocess=false passes.
  - 192/192 unit+integration suite passing.

Rule + driver + install script land in the next commit (Phase 2);
dry-run with `--config scoring_backend=rosace_aa` correctly
surfaces a MissingInputException for the score CSV that rule will
produce.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Adds the rule + script scaffolding for the rosace_aa scoring_backend that
the previous commit accepted at the schema level. Mirrors rosace.smk's
shape:

  - workflow/rules/rosace_aa.smk: run_rosace_aa (per-condition fan-out)
    and install_rosace_aa, both gated on the rosace_aa_local config flag
    that decides between system R and the conda+renv path.
  - workflow/envs/rosace_aa.yaml: same R-base + cmake<3.25 + nloptr + zlib
    system layer as rosace.yaml. CmdStan compatibility constraint is
    identical.
  - workflow/rules/scripts/install_rosace_aa.R: bootstraps renv, restores
    the base R deps from the existing renv.lock (which already carries
    impute as a Bioconductor entry), then installs rosace-aa at a pinned
    SHA via renv::install("pimentellab/rosace-aa@<sha>"). The upstream
    repo has no tagged releases (verified 2026-05-26), so we pin
    c0fda386 — the current main-branch HEAD — and bump on validation.
  - workflow/rules/scripts/run_rosace_aa.R: copy of run_rosace.R with the
    rosace-aa-specific changes:
      * library("rosaceAA") instead of library("rosace") — rosace-aa is
        a fork that ships its own copies of CreateRosaceObject, FilterData,
        etc., so loading rosace alongside would clash on function names.
      * RunRosace call gains wt.col="wildtype", mut.col="mutation",
        aa.code="single", stop.col="type", stop.name="nonsense",
        pos.act=FALSE — verified against the upstream vignette's
        AssaySet Model 2 example.
      * Output directory results/{experiment}/rosace_aa/ instead of
        rosace/.
      * Seed log message references rosaceAA::MCMCRunStan; the hardcoded
        seed=100 is the same as rosace.
      * The noprocess branch is replaced with a hard error — the config
        validator rejects rosace_aa+noprocess upfront, so this script
        firing under noprocess would mean a validator bug.

Snakefile picks up the new rule file.

Code duplication between run_rosace.R and run_rosace_aa.R (~470 lines
of nearly-identical helper functions: build_counts_for_replicate,
build_rosace_object_for_condition, finalize_variants_in_rosace, main)
is the same pattern run_lilace.R established when lilace landed.
A shared-helpers extract is the obvious follow-up if the duplication
ages poorly; not the right call now since both backends could diverge
on the helper-function side over time.

Verified end-to-end at dry-run level:
  - Default scoring_backend (rosace) → install_rosace + run_rosace × 2
    conditions, unchanged from before.
  - scoring_backend=rosace_aa → install_rosace_aa + run_rosace_aa × 2
    conditions, plus 80 other pipeline jobs.
  - scoring_backend=rosace_aa + noprocess=True → ValueError from
    validate_scoring_backend_mode, naming both the backend and the
    config knob to flip.

192/192 unit+integration suite still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Picks up the rosace-aa wiring landed in the previous commits and
documents how users discover and pick it.

README "Installing Rosace and Lilace" section becomes
"Installing Rosace, Lilace, and rosace-aa":
  - mentions all three backends with linked upstream repos
  - explains rosace-aa's value-add (position + AA substitution effect
    decomposition, score CSV layout matches rosace's)
  - documents the scoring_backend config knob with all three options
    and the noprocess incompatibility for lilace + rosace_aa
  - adds `snakemake --cores 8 install_rosace_aa` to the manual-install
    snippet alongside the existing install_rosace / install_lilace lines
  - notes the SHA-pinned upstream install (rosace-aa has no tagged
    releases yet)
  - OSX gfortran section now references "all three backends" instead
    of just rosace + lilace.

config/example.yaml:
  - scoring_backend comment grows the third option with the noprocess
    constraint spelled out
  - rosace_aa_local: false comment + value line added next to
    rosace_local / lilace_local, mirroring the established pattern

Full suite still 192/192 passing; R suite green including the
8 resolve_lilace_seed tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two changes that together let a fresh devcontainer rebuild pick up
scoring_backend=rosace_aa (or lilace) with no per-run install wait:

1. postCreate.sh extracts the bootstrap into a shared script. Both
   .devcontainer/devcontainer.json (canonical) and the local-only
   claude-sandbox variant now call it. After the existing conda env
   + system-R packages steps, the script runs

     snakemake --configfile config/example.yaml \
       results/example_experiment/{rosace,lilace,rosace_aa}/*_installed.txt

   so the existing install rules' codepath (renv::restore + cmdstanr
   + install_github for rosace-aa) fires once at build time. Marker
   files persist; subsequent snakemake invocations see the install
   as a cache-hit.

   The warmup is non-fatal: if it fails (network, upstream SHA gone,
   disk pressure), the devcontainer is still usable; the user just
   hits the real ~5-10 min install on first scoring run. Script
   prints the manual-retry command if that happens.

   `--with-dev-deps` flag preserves the claude-sandbox variant's
   extra `pip install -r requirements-dev.txt` step (pytest +
   pytest-mock). Canonical doesn't pass it.

2. apt-packages list gains `libblas-dev` and `liblapack-dev`. Found
   while empirically testing install_rosace_aa on the existing
   devcontainer: several rosace/rosace-aa transitive deps (urca,
   robustbase, ...) need to link against BLAS/LAPACK. The runtime
   libs (libblas3 / liblapack3) were already there, but the
   unversioned `libblas.so` / `liblapack.so` symlinks only ship in
   the `-dev` packages. Without them, renv::restore() fails with

     /usr/bin/ld: cannot find -llapack: No such file or directory

   This was a pre-existing latent bug — install_rosace was silently
   hiding it because the marker file gets written unconditionally
   regardless of renv::restore success. install_rosace_aa surfaced
   it because the new script's final `require(rosaceAA)` check
   fails-loud. Tracked as a follow-up in tasks.md to harden
   install_rosace + install_lilace with the same fail-loud pattern.

Validation gap: the BLAS/LAPACK fix is structurally correct (adds
the missing `-dev` packages to the apt feature list) but couldn't be
empirically tested on this devcontainer because we don't have apt
sudo. A fresh devcontainer rebuild is the empirical gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
install_rosace.R's non-local branch was the last install script with
the silent-failure bug: it called renv::restore() then unconditionally
wrote the snakemake marker file. renv prints errors but doesn't raise
when a transitive dep fails to compile (e.g. missing system libs like
libblas-dev/liblapack-dev), so the marker got written and run_rosace
fired downstream — failing with a cryptic "could not find function
'CreateRosaceObject'" instead of pointing at the real install failure.

install_lilace.R already had `if (!require(lilace)) stop(...)` in both
its local and non-local branches (the lilace branch shipped it that
way originally). install_rosace_aa.R's equivalent check was what
surfaced the BLAS/LAPACK devcontainer gap in the first place. Bringing
install_rosace.R into line closes the silent-failure class across all
three scoring backends.

The corresponding `if (!require(rosace)) stop(...)` already existed
in install_rosace.R's *local* branch (rosace_local: true path) — only
the non-local renv branch was missing it.

New tests/unit/test_install_scripts_fail_loud.py parametrizes across
all three install scripts and regex-checks the
`if (!require(<pkg>)) { stop(...) }` shape. It's a coarse contract
test, but each script is small enough that a stronger AST-level check
would be more fragile than the contract itself. If you refactor an
install script, update this test in lockstep — the docstring spells
out the why.

195/195 unit+integration suite passing (3 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
User ran the rosace-aa branch end-to-end on the (now-validated) rebuilt
devcontainer and got:

  [1] "cond_A_ROSACE2"
  run_rosace_aa failed: Invalid 'scores' name.
  Error in ExtractScore(object, name) : Invalid 'scores' name.

rosace-aa internally tags the fitted Score with the model name picked
based on which knobs are set (verified by reading helperRunRosaceGrowth
in upstream R/runROSACE.R 2026-05-27):

  ROSACE0 — no position
  ROSACE1 — position-only
  ROSACE2 — position + blosum (default whenever pos.col is set and
            pos.act = FALSE — our config)
  ROSACE3 — position + blosum + activation (pos.act = TRUE)

So the Score is stored under "cond_A_ROSACE2" but our OutputScore call
was asking for "cond_A_ROSACE_AA" (a made-up suffix I picked when
mirroring run_rosace.R's "_ROSACE"). ExtractScore couldn't find it
and raised. Empirical failure mode that the dry-run + structural
verification gates couldn't have caught.

Switch the OutputScore name to paste0(expt_condition, "_ROSACE2") —
matching the upstream vignette's Model 2 pattern. Inline comment
spells out the mapping so the suffix can move in lockstep if/when
pos.act or other model-selection knobs become user-configurable.

Also flip pos.info = TRUE + blosum.info = TRUE on the OutputScore
call. These surface the rosace-aa-specific columns
(phi_mean/phi_sd/sigma2_mean/sigma2_sd from pos.info;
blosum_score/nu_mean/nu_sd from blosum.info) — which is the entire
point of using rosace-aa over rosace. The core variants/mean/sd
columns are unchanged, so format_mavedb still works against the
output.

Dry-run on the example fixture still schedules the rule normally;
195/195 unit+integration suite green. The empirical re-test is on
the user's box: retry the targeted-output invocation and confirm
cond_{A,B}_scores.csv get produced.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…m.info

Second empirical run on the rebuilt devcontainer landed past the
"Invalid scores name" error from the previous commit and hit:

  Error in (function (..., row.names = NULL, ...) :
    arguments imply differing number of rows: 1773, 116

Root cause is an upstream bug at the pinned SHA c0fda386. OutputScore.
Score's blosum branch does

  if ((blosum.info == TRUE) && (model_code >= 2)) {
    df <- cbind(df, [email protected] %>%
                  select(blosum_score, nu_mean, nu_sd))
  }

`df` is variant-level (1773 rows on the example fixture) but
`optional.score` is position-level (116 rows on the same fixture).
cbind raises. Same shape of bug applies to the pos.info branch
further down — both expand the output via cbind with a different-
sized frame.

Workaround: drop both flags. OutputScore returns a single variant-
level data.frame with the standard 9 columns (variants, mean, sd,
lfsr.*, test.*, label), which is what the existing rosace pathway
produces and what format_mavedb's _BACKEND_DEFAULTS["rosace"] is
wired against — so downstream tooling keeps working unchanged. The
per-variant `mean`/`sd` still come from the ROSACE2 model, so the
AA-aware shrinkage is baked into the variant scores — users don't
lose rosace-aa's modeling, just the optional per-position summary
columns.

Tracked as a follow-up in tasks/tasks.md: bump the SHA pin once
upstream fixes the row-mismatch (or once we file an issue and they
respond), then turn the flags back on and handle the list return
(write df_variant to the existing CSV + df_position to a sibling).

Inline comment in run_rosace_aa.R spells out the bug shape + the
ROSACE2 method-suffix mapping so the next person to revisit isn't
chasing the same chain of upstream reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The rocker-org/r-apt feature was set to "release" which on a fresh
build now resolves to R 4.6.0 (released 2026-04-24). Source compiles
of `rlang`, `S7`, `yaml`, and via cascade ~half the renv.lock fail
against the R 4.6 headers — empirically found during the rosace-aa
branch's devcontainer rebuild iteration.

Pin to 4.5.x to match `"R": { "Version": "4.5..." }` declared in
renv.lock. The pin moves in lockstep with the lockfile: bump both
together when the CRAN ecosystem catches up to 4.6 and the renv.lock
upgrade is safe.

The R 4.6 lockfile-update attempt (see commit log on the rosace-aa
branch for the full chain) hit the BLAS/LAPACK gap first, then a
processx/Rf_findVar fight, then the rlang/S7/yaml compilation
breakage. Reverting to 4.5 + adding the missing system libs is the
combination that unblocks fresh installs.

Inline comment in the JSON spells out the bump-with-lockfile contract
so the next person reading this doesn't try to "modernize" the pin
without thinking.

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…CK-renv

Each scoring-backend install rule (install_rosace.R, install_lilace.R,
install_rosace_aa.R) begins its non-local path with

  install.packages("renv", repos = c("https://cloud.r-project.org"))

targeting the shared system library at /usr/local/lib/R/site-library.
On a fresh build with --cores 4, Snakemake fires up to four rules
concurrently; the first one creates /usr/local/lib/R/site-library/
.renv/00LOCK-renv as install.packages's lock directory, and the other
three try to write into the same path. The loser ends up with a
partial renv install (.rdb file missing or truncated) and aborts mid-
restore.

Empirical failure mode found during the rosace-aa branch's
devcontainer rebuild iteration. Manifests as a confusing "internal
error" deep in renv when the parallel rules collide.

Force --cores 1 on the warmup snakemake invocation. The wall-clock
cost is small: renv install is fast (seconds), and the CmdStan
compile dominates the warmup time but only runs once anyway since
cmdstanr's install_cmdstan() is idempotent across the three rules.

Updates both the live invocation and the retry-hint message echoed
on failure so users who hit a transient failure don't copy-paste
the broken --cores 4 command.

The 00LOCK-renv race is the kind of thing a "should be safe to
parallelize, it's just three independent installs" intuition would
miss. Inline comment in postCreate.sh spells out the root cause for
the next person to question why a four-core warmup runs serially.

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The three scoring-backend rule envs (workflow/envs/rosace.yaml,
lilace.yaml, rosace_aa.yaml) shipped bare `r-base` with no version
spec. On a fresh `snakemake --software-deployment-method conda`
invocation, conda-forge would resolve to whatever's current — likely
4.6.x today — and the install rules running inside that env would
hit the same renv-lockfile-vs-R-4.6 wall the devcontainer hit
(rlang/S7/yaml compile failures via the upstream R 4.6 header
changes).

Pin all three to `r-base=4.5` to match renv.lock's declared R
version + the matching r-apt pin in .devcontainer/devcontainer.json
landed in commit 302aa38. Three pinned spots now move together:
  - workflow/envs/{rosace,lilace,rosace_aa}.yaml: r-base=4.5
  - .devcontainer/devcontainer.json: r-apt feature version "4.5"
  - renv.lock: "R": { "Version": "4.5..." }
Bump in lockstep when CRAN/Bioc catches up to 4.6 and the
lockfile-upgrade is safe.

The pin matters for any user running with `--software-deployment-method
conda` (i.e., not relying on system R from the devcontainer). That's
the common pattern on shared compute / production clusters, so the
pin is on the load-bearing path even if devcontainer users get away
without it.

Inline comment in each yaml spells out the same bump-in-lockstep
contract as devcontainer.json — so the next person doesn't peel a
pin off thinking it's stale.

195/195 unit+integration suite still passing (env-yaml changes don't
touch the python/snakemake test surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The conda env's python dep was bare `python`, letting the solver pick
whatever conda-forge resolved at solve time. Empirically found during
the rosace-aa devcontainer rebuild iteration that the floating choice
landed on a version where multiqc / pip-installed deps would not solve
cleanly with the rest of the env — solver would stall or pick an
incompatible pandas / pysam combination.

Pin to `python=3.13` to match the version the rest of the toolchain
(snakemake 9.21, pytest 9.0.3, pandas 2.3.3, biopython 1.87) was
validated against on v0.2.0. Subsequent solves are now deterministic.

Bump in lockstep with the rest of the env if/when v0.3 wants to move
to a newer Python.

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
cmdstanr::install_cmdstan() without a version argument fetches whatever
the cmdstanr release-fetch endpoint reports as latest at install time.
That's a moving target: subsequent runs on a different day can install
a different CmdStan version, breaking reproducibility of the Stan model
compilation step.

Pin to 2.39.0 — the version the rosace + lilace + rosace-aa upstream
sources were validated against (rosace's MCMCRunStan / lilace's
sampler / rosace-aa's MCMCRunStan all target this version) and the
version cmdstanr 0.9.0 settled on in the renv.lock chain.

Applied at all four install_cmdstan call sites across the three
scripts (both the local-managed-R branch and the renv-managed branch
in each script). Same pin everywhere keeps the three scoring backends
on a single CmdStan install at /home/vscode/.cmdstan/cmdstan-2.39.0.

Bump in lockstep with the rosace / lilace / rosace-aa SHA bumps if the
upstream maintainers move to a newer CmdStan.

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
13 commits landing the third scoring backend end-to-end. Schema +
config + validator + rule + driver + install script + docs + the full
install/env-management debugging chain surfaced during empirical
validation:

  d89fd50 Accept rosace_aa as a third scoring_backend (schema + config)
  1cba330 Wire rosace-aa rules, driver, and install script
  fc7e4c9 Document rosace-aa scoring backend in README and example config
  d39e98a add rosace-aa to readme
  d55cc11 Devcontainer pre-warms all three scoring backends + adds BLAS/LAPACK
  5be4c5c Fail loud when install_rosace's renv::restore silently drops packages
  b7985c4 Fix OutputScore name + surface AA-decomposition columns in run_rosace_aa
  e734799 Work around OutputScore.Score row mismatch by skipping pos.info/blosum.info
  302aa38 Pin devcontainer R to 4.5.x — R 4.6 breaks the renv lockfile
  4b542b2 Force --cores 1 on postCreate warmup — parallel installs race on 00LOCK-renv
  ecd3354 Pin per-rule conda envs to r-base=4.5 (mirroring devcontainer pin)
  c021733 Pin dumpling_env to python=3.13
  9159b96 Pin CmdStan to 2.39.0 in all three install scripts

Empirically validated 2026-05-27 on the rebuilt devcontainer:
concordance Pearson r = 0.999 (cond_A, n=1773) / 0.997 (cond_B,
n=1907) vs scoring_backend=rosace on the example fixture.

Known limitations carried into dev (tracked in tasks/tasks.md):
  - rosace-aa's pos.info/blosum.info OutputScore columns are
    currently suppressed pending an upstream fix at the pinned SHA
    c0fda386. The decomposition data is still persisted to disk in
    phi.tsv / sigma2.tsv / nu.tsv / beta.tsv / input_*.RData and can
    be joined manually via df_map until the upstream fix lands.
  - Full renv/snakemake-install-handling story is tracked as a
    follow-up: cold-start tests, cache-hit invariants, recovery
    invariants, concurrency revisit, env consolidation, CI
    automation, README first-run-user recipe.

195/195 unit+integration suite passing.

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Follow-up to 302aa38 (Pin devcontainer R to 4.5.x). The
rocker-org/r-apt feature's `version` option doesn't take a numeric
R version string — values like "4.5" silently fall through to the
default source (Debian testing), which currently ships R 4.6 and
re-triggers the renv-lockfile breakage that 302aa38 tried to avoid.

The correct knob is `useTesting: false`, which switches the apt
source from Debian testing → CRAN's official debian-cran40 repo. That
repo tracks the released R paired with each Debian codename — for
Debian Trixie this is R 4.5.x, matching renv.lock's declared version.

Same intent as 302aa38, working mechanism. Inline comment updated to
explain WHY useTesting=false rather than just naming the flag, so the
next person to revisit doesn't peel it off thinking it's a stylistic
choice.

Lands directly on dev (not a branch) — this is a one-line corrective
follow-up to a commit already in dev's history, not new feature work.

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Empirical e2e validation 2026-05-27: ran format_mavedb against
results/example_experiment/rosace/cond_A_scores.csv (1773 real rows)
and the conversion crashed with `KeyError: 'X'` inside to_mave_hgvs.

Root cause: the converter's _AA_1to3 mapping had `*` → `Ter` for
stop codons, matching standard one-letter HGVS notation. But dumpling
uses `X` for stops internally, per process_variants.aa_3to1_dict's
"Stp" → "X" entry. So real score CSVs contain `p.(S2X)`, `p.(E3X)`,
etc. — not `p.(S2*)`. The empirically-validated converter was wrong
about which notation it would encounter.

Add `"X": "Ter"` to _AA_1to3. Keep `"*": "Ter"` as a fallback for
robustness (if any future path emits standard HGVS stop notation).

Empirical concordance gate now passes on both example-fixture
conditions: cond_A (1773 rows, 78 nonsense via X) and cond_B (1907
rows, 81 nonsense via X) both convert cleanly and validate as
MAVE-HGVS via mavehgvs.Variant on every single row (3680 / 3680
valid). Same gate also confirms the existing normalize → to_mave_hgvs
chain handles the same-AA-substitution → synonymous path correctly
(89 + 92 synonymous rows across the two conditions, all emit
`p.<wt3>{pos}=`).

The unit test for nonsense conversion now covers the X notation
explicitly (dumpling's actual format) alongside the `*` fallback, so
the X-path can't silently regress. Comment in the test docstring
records the cond_A row-count finding for future readers.

208/208 unit+integration suite passing (was 195 on rosace-aa
branch's tests, +13 mavedb-specific format_mavedb tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Four commits landing the mavedb deposit-format work:

  ce5d2cd add initial mavedb and sra deposit/formatting
    (cherry-picked from pre-v0.2.0 d235265 — scaffolds deposit.smk,
    format_mavedb.py, prepare_sra.py)
  731e195 Fix lilace _BACKEND_DEFAULTS column names + cover with tests
    (rosace uses variants/mean/sd; lilace uses variant/effect/effect_se —
    the initial scaffolding had them identical, breaking lilace deposit)
  061ef80 Emit MAVE-HGVS canonical form + validate via mavehgvs
    (MAVE-HGVS requires three-letter codes, no parenthesized predicted
    form, Ter for stop; new to_mave_hgvs converter does the structural
    conversion; mavehgvs.Variant validates each row post-conversion)
  c604112 Map dumpling's `X` stop code to `Ter` for MAVE-HGVS
    (empirical e2e fix — dumpling's internal stop notation is `X`, not
    `*`, found when running format_mavedb against real rosace output)

Empirically validated 2026-05-27 against the example-fixture rosace
score CSVs: 3680 / 3680 rows across cond_A + cond_B convert cleanly
and validate as MAVE-HGVS via mavehgvs.Variant. Coverage includes 78 +
81 nonsense (X → Ter) and 89 + 92 synonymous (same-AA missense →
position-specific `=`) variants from the real data.

prepare_sra.py was deferred per user direction during the audit phase;
its scaffolding ships but the field validation against NCBI's current
template is tracked as a follow-up in tasks/tasks.md.

208/208 unit+integration suite passing on the mavedb branch tip;
expecting the merged dev suite to grow by the 13 mavedb-specific
tests (format_mavedb's TestNormalizeHgvs / TestBackendDefaults /
TestFormatScores / TestToMaveHgvs / TestFormatScoresMaveHgvs /
TestMaveHgvsValidation).

Co-Authored-By: Chris <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The original guard at the top of main() in both scripts was

  if "snakemake" in dir():
      ...snakemake path...
  ...argparse CLI path...

`dir()` called inside a function returns LOCAL names, not the
module's globals. Snakemake's `script:` directive injects `snakemake`
as a module-level global, so the guard always returned False and
main() silently fell through to the argparse path. When run from
the CLI with the right args this worked; when run via snakemake it
crashed:

  usage: tmp2i0otwl0.format_mavedb.py [-h] --scores SCORES --output OUTPUT ...
  format_mavedb.py: error: the following arguments are required:
    --scores, --output

Empirically surfaced 2026-05-27 while wiring `deposit_to_mavedb`
default-on into `get_input`. Previous testing exercised the script
via direct CLI invocation (python format_mavedb.py --scores ...
--output ...), which masked the bug. The original `deposit.smk`
rule was also never included in workflow/Snakefile (separate fix
in the follow-up commit), so the snakemake path had never actually
fired.

Switch both scripts to `"snakemake" in globals()`, which IS the
attribute the script directive injects.

New tests/unit/test_format_mavedb.py::TestMainSnakemakeDispatch
covers both paths:
  - With a SimpleNamespace `snakemake` global monkeypatched onto
    the module, main() takes the snakemake path and produces output
    at the snakemake.output[0] location.
  - With no `snakemake` global, main() falls through to argparse
    and SystemExits on missing required args (verifies the CLI
    path isn't accidentally broken too).

If someone "refactors" the dispatch back to dir() or removes the
mode detection, the first test fires.

220/220 unit+integration suite passing (was 218 → +2 dispatch tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The MaveDB-format deposit CSV emitter (format_mavedb rule, added on
the mavedb branch) was scaffolded as manual-run-only — the rule
existed but neither it nor prepare_sra was added to `get_input`, and
deposit.smk wasn't even included in workflow/Snakefile. Users had to
know the rule existed AND target the output path explicitly AND
manually include the smk file. Effectively invisible.

Empirical validation (3680/3680 example-fixture rows convert and
validate as MAVE-HGVS via mavehgvs.Variant, end-to-end on both
example-fixture conditions) shows the converter is robust. The post-
scoring work is also cheap (~seconds). Default it on.

Changes:
  - workflow/schemas/config.schema.yaml: add deposit_to_mavedb:
    boolean, default true. Schema description spells out what it
    produces, where, and the opt-out flag.
  - workflow/rules/common.smk: setdefault + extend get_input to add
    `results/{experiment}/deposit/mavedb/{cond}_mavedb.csv` per
    experimental condition when the flag is true.
  - workflow/Snakefile: actually include rules/deposit.smk (was
    never wired in originally; targeting the rule's output path
    via snakemake would have failed before this).
  - config/example.yaml: document deposit_to_mavedb in the config
    comments + add the explicit `deposit_to_mavedb: true` line so
    users see it in the canonical config.

`prepare_sra` stays manual-only regardless of this flag: its output
ships `FILL_IN: e.g. Illumina NovaSeq 6000` and `FILL_IN: e.g.
SAMN00000000` placeholders that the user has to fill in. Auto-firing
it on every run would produce a placeholder-laden SRA metadata TSV
that's misleading-by-default.

Empirically validated: `snakemake -s workflow/Snakefile --configfile
config/example.yaml --cores 4 results/example_experiment/deposit/
mavedb/cond_A_mavedb.csv results/example_experiment/deposit/mavedb/
cond_B_mavedb.csv` produces 1773 + 1907 valid MAVE-HGVS rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
odcambc added 21 commits May 27, 2026 11:17
Codon matching was dropped in 28dbec2, leaving a name-only match that
accepted off-target codons (wrong_codon_counts stuck at 0). Match on
(name, codon) again and split rejects into wrong_codon vs wrong_variant.
Preserve distinct-codon designed variants through dedup, and sum
codon-level counts back per hgvs in the Enrich2 TSV so scoring stays
protein-level.
28dbec2 also dropped the insdel_variant_counts and multi_variant_counts
increments, leaving both buckets stuck at 0. Affected reads were still
rejected (into wrong_variant_counts), but the QC breakdown was lost.
Classify rejections into their dedicated buckets again.
The insdel length cap was dropped in 28dbec2, leaving max_deletion_length
a no-op config knob. Re-thread it into process_insdel: a recoverable
insdel is accepted as an in-frame deletion only when within the limit,
else rejected as an insdel. A non-positive value disables the cap;
noprocess bypasses it. Default stays 3 (schema, applied by validate()).
The new benchmark tests read tests/fixtures/benchmark_variants.csv, which
was untracked — commit it so the suite passes on a fresh checkout. Also
run the dedup conflict test in a temp cwd so the debug duped.csv it
triggers no longer lands in the repo root.
The unrecoverable-duplicate debug dump went to "duped.csv" in the process
working directory. Route it to results/<experiment>/duped_variants.csv
via a dup_report_path arg (parent dirs created as needed), so it lands in
a predictable place instead of polluting the cwd.
Emit a single wide CSV joining each phenotype-assigned condition's scores
into cosmos's (pimentellab/cosmos) beta_hat_N/se_hat_N columns, for causal
direct-vs-indirect modeling across sequential phenotypes. Sibling of the
MaveDB deposit.

- format_cosmos.py: outer-join the per-condition score CSVs on variant,
  NA-pad missing phenotypes, coalesce group/type; backend-aware columns.
  derive_cosmos_type fuses indel length into the type label (insertion3),
  keyed off the I_/D_/ID_ mutation code; substitutions pass through.
- Conditions opt in via an optional `phenotype` slot column on the
  experiment CSV (presence = inclusion). Validation
  (get_cosmos_phenotype_conditions) lives in script_utils alongside the
  other experiment checks: per-condition agreement, no baseline slot,
  contiguous 1..N.
- deposit_to_cosmos config flag (off by default); rule format_cosmos.
- Tests for derive_cosmos_type, the multi-condition join, and slot
  validation.
Add sequence-based variant identification: anchor each oligo's gene
fragment to the reference by exact k-mer seeding (per-oligo, adapter-
agnostic), then classify the single localized edit as substitution,
synonymous, in-frame deletion, or insertion. Runs in designed_variants
sequence-first, falling back to the DIMPLE name regexes only when the
sequence can't be resolved (terminal codons, circular refs, multi-edit).

Anchored exact-match keeps it ~O(oligo length) per oligo (no alignment,
no new dependency); ~0.35 ms/oligo on a 1.5 kb gene. Indel positions are
snapped to codon boundaries and the reconstruction verified to handle
left-alignment ambiguity.
Validated against real CFTR DMS libraries (CFTR_1_oligos.csv vs the
construct reference): 94% of oligos resolved by sequence with zero wrong
calls; the rest fall back to names (gene-terminus codons, a divergent
chunk). Fixes found in the process:

- Resolve substitutions by walking CODONS in the ORF frame rather than
  base-level gap scanning, so a per-oligo adapter that coincidentally
  matches a base or two at the gene boundary is not misread as a second
  edit (this was silently dropping codon-2 and other near-boundary
  variants to the fallback path).
- Name stop codons 'X', matching the rest of the pipeline (not '*').
- Drop the now-unused _extend_match_left helper.
Enrich2 writes large .h5 store files into its output directory as a side
effect; dumpling only consumes the exported tsv scores, so the stores are
pure intermediates. Rather than have the pipeline shell out to rm/find,
run_enrich now declares the stores as Snakemake temp() outputs so Snakemake
itself deletes them (and only them) once scoring completes -- it is
structurally unable to remove anything it did not create as that rule's
output.

The store filenames are enumerated at parse time by
expected_enrich_h5_basenames, which applies the same T0/min-timepoint
filtering as generate_config and reuses shared selection/library label
helpers so the predicted set can't drift from what Enrich2 writes (pinned to
the committed results/example_experiment/enrich layout in tests). The
keep_enrich_h5 config flag (default false) opts out: when true the stores are
left undeclared and simply persist for debugging.
single-env image bundling snakemake + bioinformatics stack + R/Stan
with cmdstan pre-compiled. snakefile points users at the image via the
containerized directive; CI publishes to ghcr.io/odcambc/dumpling on
tag push and tracks dev/container branches for rolling builds. README
quickstart documents both docker and apptainer/singularity invocations.
Make every heavy rule declare resources: mem_mb so a scheduler requests the
right --mem, and add the SLURM profile that was missing (the cluster branch
referenced workflow/profiles/slurm but never committed it).

- Per-tool memory budgets unified to MB (matching mem_rosace/mem_lilace);
  wired into resources: mem_mb on trim_clean_correct (sum of its 3 concurrent
  JVMs), map_to_reference_bbmap/minimap2, gatk_ASM, prepare_bbmap_index,
  prepare_minimap2_index, process_sample.
- Java -Xmx heaps now derive a fixed headroom below the cgroup allocation via
  script_utils.java_heap_gb, so a JVM can't OOM at the --mem limit; GATK gets
  --java-options '-Xmx'. Replaces the global config[mem] -Xmx for these rules.
- workflow/profiles/slurm/config.yaml (executor, default-resources, restart/
  latency/rerun settings, apptainer deployment; account/partition left as
  documented placeholders) and workflow/profiles/default/config.yaml for local.
- Schema + example.yaml document the mem_* knobs; README gains a 'Running on a
  cluster' section.
- Tests: java_heap_gb headroom unit tests; profile-config validation; a
  snakemake dry-run asserting bbmap/gatk resolve mem_mb (no scheduler needed).
The cosmos export had only unit coverage. Add the two missing pieces:

- A 2-condition fixture (mock_experiment_cosmos.csv, phenotype slots 1/2) and a
  snakemake dry-run test that format_cosmos wires into the DAG with the
  phenotype-assigned conditions' scores as slot-ordered inputs.
- A skip-guarded test (runs only where the real cosmos package is importable)
  that a format_cosmos-emitted CSV loads into cosmos.DMSData, including the
  outer-join NA case (a variant scored in one phenotype only). Validated against
  cosmos 1.0.2: DMSData loads the NA-padded file and keeps the NA rows, so the
  outer-join policy ships as-is. Resolves the design-doc load-time open question.
The repo had no project venv, so 'uv run' silently fell back to the conda base
interpreter and dev/test deps (plus dimple's numpy pin) piled up in base.
requirements-dev.txt also only listed tooling, omitting the libraries the tests
import (pandas, biopython, regex, jsonschema, pyyaml, mavehgvs) and snakemake.

Declare the pure-Python test/lint layer in pyproject.toml as a uv-managed
virtual project (package = false, so uv never builds/installs dumpling and the
file is inert to the conda pipeline env):
- [dependency-groups] dev: the full test/lint layer. 'uv sync' + 'uv run pytest'
  runs the whole suite in an isolated .venv -- including the snakemake dry-run
  integration tests, with no PATH hack.
- [project.optional-dependencies] cosmos: opt-in extra for the skip-guarded
  real-cosmos test ('uv sync --extra cosmos'). Kept out of dev only for weight
  (numpy 2.x + arviz/xarray/scikit-learn); no version conflict, the .venv is
  isolated. matplotlib<3.11 dodges an arviz import of matplotlib.style.core.
- uv.lock committed for reproducibility; requires-python >=3.12 (cosmos floor).
- CONTRIBUTING.md documents uv sync / --extra cosmos and the test-vs-runtime
  split; the devcontainer installs the dev group via pip --group dev.

The pipeline runtime (dumpling_env.yaml conda env, per-rule envs, container) is
untouched. Suite: 296 passed (dev + cosmos extra); 294 passed / 2 skipped (dev
only, cosmos test skips).
cosmos is a score generator (per-position direct/indirect causal decomposition
across two sequential phenotypes), not a deposit target and not a base scorer.
It can't be a scoring_backend value (it consumes the base scorer's per-phenotype
effects rather than scoring counts), so it's modeled like enrich2: an optional
analysis layered on top of the chosen scoring backend.

- run_cosmos (config boolean, default false): when true, cosmos runs as part of
  the build, producing the per-position decomposition
  results/{exp}/cosmos/{exp}_cosmos_results.csv. Requires exactly two conditions
  assigned phenotype slots 1 and 2 (validated at parse time).
- Two rules: format_cosmos prepares the wide input table
  (results/{exp}/cosmos/{exp}_cosmos.csv) from the scoring backend's
  per-condition scores; run_cosmos (envs/cosmos.yaml: pip cosmos-dms +
  matplotlib<3.11, pure-Python no Stan) fits cosmos and writes the summary.
  cosmos lives at results/{exp}/cosmos/ (an analysis alongside rosace/lilace),
  NOT under deposit/ — it isn't a deposit.
- scripts/run_cosmos.py: DMSData -> PriorFactory -> ModelBuilder -> run_cosmos
  per position -> ModelAnalyzer.summary. API pinned against cosmos 1.0.2's
  vignette. CLI + Snakemake entry. mem_cosmos resource; optional cosmos settings
  block (include/exclude_type, x_gmm_n_components, min_num_variants_per_group).
- Tests: format + run DAG wiring, a slow real-run test (per-position summary),
  the two-phenotype guard. Suite 299 passed.

PERFORMANCE: cosmos fits per position (~tens of seconds each, serial), so
enabling run_cosmos can add hours on a large library. Parallelizing the fits
(scatter-gather) is deferred and tracked in tasks/tasks.md.
The in-image smoke test validated the DAG against config/example.yaml,
whose data_dir (example/data) is gitignored, so a fresh CI checkout has
no input data and validate_config hard-fails. It has failed on every dev
push since it was added. Drop the step (keep the build, which still
verifies the image compiles); re-enabling needs a committed CI fixture.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This release PR fast-forwards dev to main, bundling multiple validated feature lines into the dumpling Snakemake pipeline: new scoring backends and determinism knobs, improved variant identification/filtering, deposit/export utilities (MaveDB/SRA/cosmos), and substantial infra/dev-experience upgrades (container publishing, cluster profiles, uv-based dev tooling).

Changes:

  • Add rosace-aa backend, cosmos export + optional run stage, and MaveDB/SRA deposit formatting rules/scripts.
  • Improve variant processing correctness/performance (sequence-based designed-variant identification, codon-aware filtering, restored insdel/multi-variant accounting, Enrich2 HDF5 temp cleanup).
  • Add operational tooling: per-rule resources + cluster profiles, persistent aligner indexes, and a prebuilt GHCR container + build workflow; introduce uv/ruff/snakefmt/pre-commit dev layer.

Reviewed changes

Copilot reviewed 78 out of 79 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
workflow/Snakefile Adds container URI and includes new rulesets (rosace_aa, deposit).
workflow/schemas/experiments.schema.yaml Adds optional phenotype column for cosmos slot assignment.
workflow/schemas/config.schema.yaml Extends scoring backend enum; adds cosmos/deposit/enrich cleanup and per-rule memory knobs; replaces BBTools compression knob.
workflow/rules/scripts/script_utils.py Adds legacy BBTools-compression translation, scoring-mode validation update, JVM heap helper, and cosmos phenotype validation helper.
workflow/rules/scripts/run_rosace.R Improves failure messaging on missing per-sample count TSVs; logs Rosace seed determinism note.
workflow/rules/scripts/run_lilace.R Adds deterministic/traceable Stan seeding via lilace_seed with logging.
workflow/rules/scripts/run_cosmos.py New: runs cosmos model per position and writes summary CSV.
workflow/rules/scripts/remove_zeros.py Minor formatting cleanup.
workflow/rules/scripts/process_variants.py Codon-aware filtering, restored insdel length gate, performance improvements, and codon→protein collapse for scoring TSV seam.
workflow/rules/scripts/process_counts.py Wires codon normalization for designed variants and updates formatting.
workflow/rules/scripts/prepare_sra.py New: builds SRA metadata TSV + FASTQ file list from experiment CSV + data dir.
workflow/rules/scripts/install_rosace.R Pins CmdStan version and adds a loud post-install load check for rosace.
workflow/rules/scripts/install_rosace_aa.R New: installs rosace-aa at a pinned SHA with load verification.
workflow/rules/scripts/install_lilace.R Pins CmdStan version.
workflow/rules/scripts/generate_enrich_configs.py Centralizes Enrich2 object naming and predicts HDF5 store outputs for Snakemake-managed temp cleanup.
workflow/rules/scripts/generate_baseline_file_list.py Formatting/import ordering cleanup.
workflow/rules/scripts/generate_baseline_configs.py Formatting cleanup.
workflow/rules/scripts/format_cosmos.py New: formats per-condition scores into cosmos wide CSV (beta_hat_N/se_hat_N) with type derivation.
workflow/rules/rosace.smk Reorders directives and wires conda env consistently.
workflow/rules/rosace_aa.smk New rules for rosace-aa install + scoring.
workflow/rules/ref.smk Adds FASTA header normalization step and makes aligner indexes persistent; wires per-rule resources + JVM heap sizing.
workflow/rules/qc.smk Formatting/reordering; adds per-rule mem_mb for fastqc.
workflow/rules/process.smk Uses normalized reference for designed-variant generation; adds mem_mb for process_sample.
workflow/rules/map.smk Adds per-rule mem_mb and JVM heap sizing for bbmap; adds mem_mb for minimap2 path.
workflow/rules/filter.smk Adds per-tool heap sizing and aggregate mem_mb for concurrent BBTools JVMs.
workflow/rules/enrich.smk Declares Enrich2 HDF5 stores as temp() outputs (unless opted out) for automatic cleanup.
workflow/rules/deposit.smk New: MaveDB export, cosmos format+run rules, and manual-only SRA prep rule.
workflow/rules/common.smk Adds cosmos/deposit outputs to all, config defaults, BBTools compression flags helper, and normalized reference wiring.
workflow/rules/baseline_qc.smk Formatting cleanup.
workflow/rules/asm.smk Switches to normalized reference and pins GATK JVM heap below cgroup allocation; adds mem_mb.
workflow/profiles/slurm/config.yaml New SLURM execution profile with default resources and containerized execution.
workflow/profiles/default/config.yaml New local profile with mem_mb resource budget and mtime rerun-triggers.
workflow/envs/rosace.yaml Pins R 4.5 to match renv.lock and documents rationale.
workflow/envs/rosace_aa.yaml New env for rosace-aa system deps; pins R 4.5 and cmake<3.25.
workflow/envs/lilace.yaml Pins R 4.5 to match renv.lock and documents rationale.
workflow/envs/cosmos.yaml New pip-based cosmos env (python>=3.12) with matplotlib pin for arviz compatibility.
tests/unit/test_script_utils.py Adds unit tests for cosmos phenotype validation and JVM heap sizing; extends scoring-backend mode checks.
tests/unit/test_scoring_backend_config.py Updates schema tests for rosace_aa and new memory knobs.
tests/unit/test_remove_zeros.py Import reordering/formatting.
tests/unit/test_process_variants.py Adds extensive coverage for codon-aware filtering + restored insdel/multi-variant stats; updates signatures.
tests/unit/test_process_counts.py Import cleanup to match updated module shape.
tests/unit/test_prepare_reference.py New: tests for FASTA header normalization awk behavior.
tests/unit/test_lilace_seed_config.py New: schema-level tests for lilace_seed and explicit absence of rosace_seed.
tests/unit/test_install_scripts_fail_loud.py New contract test ensuring install scripts verify package load before writing marker file.
tests/unit/test_generate_variants.py Adds tests for codon-preserving dedup and sequence-based variant identification (issue #27).
tests/unit/test_format_cosmos.py New: unit tests for cosmos type derivation and wide-join formatting.
tests/unit/test_enrich_configs.py Adds tests for Enrich2 object naming and predicted HDF5 store list.
tests/unit/test_bbtools_compression_config.py New: schema + back-compat shim tests for bbtools_compression.
tests/r/testthat/test-resolve_lilace_seed.R New: R tests for lilace seed resolver behavior and logging contract.
tests/r/testthat/test-build_counts.R Adds R regression test for missing per-sample TSV error message.
tests/r/testthat/helper-lilace-functions.R Adds helper copy of resolve_lilace_seed to keep R tests in lockstep with pipeline script.
tests/r/testthat/helper-functions.R Mirrors missing per-sample TSV guard in helper implementation.
tests/integration/test_processing_pipeline.py Formatting cleanup.
tests/integration/test_file_resolution.py Formatting cleanup.
tests/integration/test_dag_construction.py Adds dry-run coverage for Enrich2 temp-output wiring (keep_enrich_h5 on/off).
tests/integration/test_cosmos_integration.py New: dry-run DAG wiring tests and optional real-cosmos load/run validations.
tests/integration/test_config_validation.py Formatting cleanup.
tests/integration/test_cluster_profiles.py New: validates profiles parse and heavy rules resolve mem_mb in dry-run.
tests/fixtures/mock_experiment_cosmos.csv New fixture including phenotype slots for cosmos tests.
tests/conftest.py Import ordering cleanup.
ruff.toml New ruff configuration, incl. snakemake builtin and excludes.
requirements-dev.txt Removed in favor of uv/pyproject-managed dev deps.
README.md Documents container usage, new backends, cluster profiles, aligner choice, and new config knobs.
pyproject.toml New uv-managed virtual project for dev/test tooling and optional cosmos extra.
dumpling_env.yaml Pins python=3.13 and adds pip-only mavehgvs dependency.
Dockerfile New: builds single “mega-env” container image and precompiles CmdStan.
CONTRIBUTING.md New: contributor workflow using uv + ruff/snakefmt and optional cosmos extra.
config/example.yaml Updates example config for new knobs (backend, deposit, cosmos, memory, compression).
cluster_env.yaml New: thin login-node environment for SLURM submission.
.pre-commit-config.yaml New: snakefmt + ruff pre-commit hooks.
.github/workflows/build-container.yml New: GH Actions workflow to build/push GHCR container on branches/tags.
.devcontainer/postCreate.sh New devcontainer bootstrap to build conda env, optional dev deps, and pre-warm scoring backends.
.devcontainer/devcontainer.json Updates devcontainer system deps and switches postCreate to the new script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread workflow/rules/common.smk Outdated
Comment thread workflow/rules/scripts/prepare_sra.py Outdated
@odcambc

odcambc commented Jun 15, 2026

Copy link
Copy Markdown
Owner Author

Adds #31 and should resolve #9

odcambc added 3 commits June 15, 2026 16:23
1. common.smk: normalized_reference_file now strips to Path(reference).name.
   If reference were absolute or directory-qualified, Path('ref') / <absolute>
   collapses to the original location and prepare_reference would overwrite the
   user's own FASTA. Basename pins the rewrite under the workflow-managed ref/
   dir (mirrors reference_name's .stem).

2. Remove the manual-only prepare_sra rule + scripts/prepare_sra.py and its
   schema/get_input mentions. It was an untested, placeholder-shipping stub that
   rode in with the mavedb commit (ce5d2cd), never on the default target.
   Deferred to return as a properly-scoped feature (with anchored fastq matching
   that raises on ambiguous per-lane/prefix matches).
@odcambc odcambc merged commit 0fc2b14 into main Jun 16, 2026
2 checks passed
@odcambc odcambc deleted the dev branch June 16, 2026 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants