peptide-property-ng: unified Depthcharge peptide property predictor by theGreatHerrLebert · Pull Request #404 · theGreatHerrLebert/rustims

theGreatHerrLebert · 2026-06-08T14:53:57Z

Draft — the peptide-property-ng research predictor: a clean-slate, Depthcharge-based unified model predicting peptide intensity / CCS-IM / RT / charge from one shared encoder, plus the new timsim peak-shape heads. 17 commits; not yet ready to merge (see "remaining" below).

What the feature adds

Unified multi-task model — shared encoder + per-task heads (intensity, ion-mobility, RT, charge).
Hybrid modification encoding — learnable per-mod token + atomic-composition feature (composition table from sagepy UNIMOD), so rare/unseen mods inherit chemistry instead of a cold embedding.
Fragment-indexed intensity head — one prediction per cleavage site (removes the Prosit 30-residue cap); pooled-head + per-PSM CE calibration variant.
Instrument / acquisition conditioning — wired from timstof_catalog, leak-free via the encoder global token; embedding headroom reserved for new instruments/modes.
timsim peak-shape supervision (latest commit) — RTSigmaHead (gradient-normalised RT EMG σ) + IM-σ supervision on the ion-mobility head's std, masked & additive. Lets the predictor configure timsim's per-peptide peak widths. (Design + POC: see the feat/sim-shape-predictor docs PR in claudius-proteomics.)
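The hybrid modification encoding in the list above can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the element order, the `TOKEN_TABLE` stand-in for a learnable embedding, and the helper names are hypothetical, and the real package uses PyTorch embeddings and sagepy's UNIMOD composition tables. The point it shows is that a modification with no trained token still receives a non-trivial chemistry vector.

```python
import random

# Hypothetical element order for the composition feature (an assumption).
ELEMENTS = ("C", "H", "N", "O", "S", "P")

# Atomic-composition deltas for a few well-known UNIMOD entries.
COMPOSITION = {
    "UNIMOD:4": {"C": 2, "H": 3, "N": 1, "O": 1},  # Carbamidomethyl
    "UNIMOD:35": {"O": 1},                         # Oxidation
    "UNIMOD:21": {"H": 1, "O": 3, "P": 1},         # Phospho
}

TOKEN_DIM = 4
_rng = random.Random(0)

# Stand-in for a learnable embedding table: only "seen" mods have a row.
TOKEN_TABLE = {
    mod: [_rng.uniform(-1.0, 1.0) for _ in range(TOKEN_DIM)]
    for mod in ("UNIMOD:4", "UNIMOD:35")
}
UNKNOWN_TOKEN = [0.0] * TOKEN_DIM  # cold embedding for unseen mods


def composition_vector(mod_id):
    """Element-count vector in ELEMENTS order; all zeros if the mod is unknown."""
    comp = COMPOSITION.get(mod_id, {})
    return [float(comp.get(element, 0)) for element in ELEMENTS]


def encode_mod(mod_id):
    """Concatenate the per-mod token with the chemistry feature."""
    token = TOKEN_TABLE.get(mod_id, UNKNOWN_TOKEN)
    return token + composition_vector(mod_id)


# Phospho has no trained token here, yet its encoding still carries chemistry.
unseen = encode_mod("UNIMOD:21")
```

Concatenation is only one possible fusion; the commits below also mention token-only and composition-only modes for ablation.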

Status / remaining before un-drafting

Shape-target dataset-loader join (sidecar σ labels → rt_sigma/im_sigma targets + ProForma string normalization) is not yet wired.
Prototype training + held-out metrics to be run on monster3 (full env lives there).
Unseen-modification eval split (the decisive test for the hybrid encoding) still deferred.

Opening as draft for visibility/review while training validation is pending.

New research-track package: a clean-slate, Depthcharge-based neural network predicting fragment intensity, ion mobility/CCS, retention time and precursor charge from one shared encoder. Key design points:
- Hybrid modification encoding: each residue carries both a learnable per-UNIMOD token embedding and an atomic-composition chemistry feature (from sagepy's UNIMOD tables), so rare/unseen modifications generalise and the scheme stays open-vocab (no re-mint when UNIMOD grows).
- Fragment-indexed intensity head: one prediction per cleavage site, removing Prosit's fixed 174-vector 30-residue cap.
- Instrument / acquisition-mode conditioning via the encoder global token; charge / m-z / collision energy conditioned at the heads so the charge head never sees its own label.
- CCS head ports the imspy sqrt(m/z) per-charge physics prior.
Trains on Sage search results. See docs/nn-architecture-exploration in the claudius-proteomics project for the design rationale.
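As a rough illustration of the sqrt(m/z) per-charge prior mentioned above, the sketch below assumes the prior has the form slope * sqrt(m/z) + intercept with one (slope, intercept) pair per charge state, and that the network contributes an additive correction. The coefficients are placeholders chosen for readability, not fitted values from imspy, and the function names are hypothetical.

```python
import math

# Hypothetical per-charge (slope, intercept) pairs; placeholders, not fitted values.
PRIOR = {
    1: (10.0, 100.0),
    2: (12.0, 150.0),
    3: (14.0, 250.0),
}


def ccs_prior(mz, charge):
    """Physics-motivated baseline: CCS grows roughly with sqrt(m/z) per charge state."""
    slope, intercept = PRIOR[charge]
    return slope * math.sqrt(mz) + intercept


def ccs_prediction(mz, charge, residual):
    """Prior plus a learned correction (here just a number passed in)."""
    return ccs_prior(mz, charge) + residual


baseline = ccs_prior(400.0, 2)            # 12 * 20 + 150
corrected = ccs_prediction(400.0, 2, 5.0)
```

With a prior of this kind the head only has to learn the deviation from the charge-specific trend, which is usually a much smaller signal than the raw CCS value.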

- losses/metrics: skip degenerate intensity targets (no observed peak) from the spectral-angle loss and metric -- they were a constant, zero-gradient term diluting both.
- encoder: assert composition-table dimensions match the model config, rather than silently mis-indexing chemistry vectors.
- train: save the final model as best.pt if no epoch improved val SA, so the post-training reload cannot crash.
- embedding: add token_only / composition_only fusion modes for the modification-encoding ablation; expose via `train --comp-fusion`.
- tests: synthetic fragment-target tests + loss / degenerate-handling tests (24 total).

…t 174-vector
Both data sources now reach the site-indexed intensity target through one proven encoder + one conversion, replacing the hand-rolled site encoder:
- Sage fragments -> 174-vector via imspy's observed_fragments_to_intensity_target (the encoding the rescoring pipeline already uses), then prosit174_to_sites.
- Wilhelmlab HF MS2 datasets store the 174-vector natively -> prosit174_to_sites.
prosit174_to_sites is the single ordinal->site remap; the hand-rolled build_intensity_target is removed. Adds data/hf_intensity.py for the timsTOF-ms2 / prospect-ptms-ms2 / Prosit-2025 intensity-pretraining datasets. Intensity peptides are capped at 30 aa (the 174-vector limit).

…>site test
Codex's conversion review noted the tests pinned y-channel-0 / b-channel-3 but not the charge order within each y/b triplet. This test sets distinct values across all six channels of one ordinal so a charge permutation would fail.
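The ordinal-to-site remap discussed in the two commits above can be sketched as follows. This is not the package's `prosit174_to_sites`: the function name `ordinal_to_sites`, the output layout (one row of six channels per cleavage site), and the assumed channel order per ordinal (y1+, y2+, y3+, b1+, b2+, b3+, consistent with "y-channel-0 / b-channel-3") are all assumptions for illustration.

```python
N_ORDINALS = 29  # fragment ordinals 1..29 in the 174-vector
N_CHANNELS = 6   # assumed order per ordinal: y1+, y2+, y3+, b1+, b2+, b3+


def ordinal_to_sites(vec174, peptide_len):
    """Remap ordinal-indexed intensities to cleavage-site-indexed rows.

    Site s (1-based) is the bond after residue s. Breaking it yields the
    b ion of ordinal s and the y ion of ordinal (peptide_len - s).
    Returns peptide_len - 1 rows, each [y1+, y2+, y3+, b1+, b2+, b3+].
    """
    assert len(vec174) == N_ORDINALS * N_CHANNELS
    assert 2 <= peptide_len <= N_ORDINALS + 1
    sites = []
    for s in range(1, peptide_len):
        b_ord = s
        y_ord = peptide_len - s
        y_start = (y_ord - 1) * N_CHANNELS       # y triplet of that ordinal
        b_start = (b_ord - 1) * N_CHANNELS + 3   # b triplet of that ordinal
        sites.append(vec174[y_start:y_start + 3] + vec174[b_start:b_start + 3])
    return sites


# Distinct value in every slot, so any channel or charge permutation is visible.
vec = [float(i) for i in range(N_ORDINALS * N_CHANNELS)]
sites = ordinal_to_sites(vec, peptide_len=4)
```

Filling the input with distinct values, as the test described above does for one ordinal, makes a swapped charge order within a y or b triplet show up as a concrete mismatch rather than passing silently.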

Adds the pretrain-then-fine-tune pipeline:
- data/chronologer_rt.py -- Chronologer DB (2.64M peptide-RT, harmonised HI)
- data/ccs_pretrain.py -- ionmob CCS parquets (UNIMOD-annotated)
- train/pretrain.py -- staged single-task curriculum: Orbitrap intensity -> timsTOF intensity -> CCS -> RT, accumulating in the shared encoder; saves a checkpoint train.py picks up via --init-from.
- config: add an `orbitrap` instrument id so the embedding can bridge the Orbitrap-HCD -> timsTOF-PASEF domain shift across pretraining stages.
CCS/RT pretrain on their native units (CCS, harmonised HI); the unit shift to Sage 1/K0 / aligned_rt is absorbed during the campaign fine-tune -- no error-prone physics conversion. Intensity uses the one proven 174-vector encoding throughout.

- ccs / rt loaders: guard finite, positive targets -- a NaN no longer produces a NaN loss (codex A1/A2).
- train --init-from: reject a preset mismatch with a clear error instead of a shape crash (codex A3).
- normalise the CCS / ion-mobility target to ~[0,1] with fixed bounds in both the ionmob (CCS) and Sage (1/K0) loaders, and re-init the physics prior for that range -- the CCS head now sees one consistent target scale across pretraining and fine-tune, removing the negative transfer codex flagged (B1); still no physics conversion.
- pretrain: reorder the curriculum (RT, CCS, then intensity last) so the shared encoder ends tuned for the priority task, and save per-stage checkpoints so the fine-tune handoff is selectable (B2); add the Prosit-2025 intensity stage (B3).
- hf_intensity: validate the precursor-charge one-hot before argmax (C3).
- pretrain: add --chronologer-db / --ccs-glob so the data paths are configurable (for running on monster3).
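The loader guard and fixed-bound normalisation from this commit might look roughly like the sketch below. The bounds `IM_LO` / `IM_HI` and the helper names are hypothetical; the package's actual bounds are not stated here.

```python
import math

# Hypothetical fixed bounds for a 1/K0 target; the real bounds are not given above.
IM_LO, IM_HI = 0.6, 1.6


def is_valid_target(x):
    """Keep only finite, strictly positive targets so a NaN never reaches the loss."""
    return x is not None and math.isfinite(x) and x > 0


def normalise(x, lo=IM_LO, hi=IM_HI):
    """Map [lo, hi] linearly onto [0, 1] using bounds that never change between runs."""
    return (x - lo) / (hi - lo)


def prepare_targets(raw_values):
    """Drop invalid targets, then normalise the rest."""
    return [normalise(x) for x in raw_values if is_valid_target(x)]


targets = prepare_targets([1.1, float("nan"), -0.5, None, 0.6, float("inf")])
```

Because the bounds are constants rather than statistics of the current dataset, pretraining and fine-tuning loaders produce targets on the same scale. Values outside the bounds would still fall outside [0, 1] in this sketch.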

- clip the normalised CCS / ion-mobility targets to [0,1] (codex: values beyond the fixed bounds would otherwise leave the [0,1] range).
- pretrain: peptide-level stage hold-out (a peptide no longer crosses train/val within a stage) instead of a random row split.
Final codex review: build verified sound for capped runs. Remaining note -- streaming/sharded stage loading for an uncapped full-scale run -- is a follow-up; a finite --cap avoids it.
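One way to obtain a peptide-level hold-out like the one described above is to assign each peptide string to a split deterministically, so that every row of the same peptide lands on the same side. The hash-based assignment below is a swapped-in technique for illustration; the commit does not say how the package performs the split, and the function names and row layout are hypothetical.

```python
import hashlib


def peptide_bucket(peptide, val_fraction=0.1):
    """Deterministically assign a peptide string to 'train' or 'val'."""
    digest = hashlib.sha256(peptide.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform-ish value in [0, 1)
    return "val" if u < val_fraction else "train"


def split_rows(rows, val_fraction=0.1):
    """Split rows so that all rows of one peptide share a split."""
    train, val = [], []
    for row in rows:
        if peptide_bucket(row["peptide"], val_fraction) == "val":
            val.append(row)
        else:
            train.append(row)
    return train, val


# Repeated peptides (e.g. several PSMs or charge states) never straddle the split.
peptides = ["PEPTIDEK", "ACDEFGHIK", "PEPTIDEK", "LMNPQR", "ACDEFGHIK"] * 20
rows = [{"peptide": p, "row_id": i} for i, p in enumerate(peptides)]
train_rows, val_rows = split_rows(rows, val_fraction=0.5)
```

A random row split would instead let the same sequence appear in both sets, which tends to inflate validation metrics for sequence-driven targets such as RT.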

…sion
The Chronologer RT loader and the Sage loader converted [+mass] peptide notation via sagepy_rescore._parse_sage_peptide, which imports sagepy_rescore.pipeline -- a heavy module that pulls the whole search/fdr stack and is version-fragile (its sagepy.core.fdr imports break against older sagepy installs, e.g. monster3's sagepy 0.4.4).
Vendor just the conversion into data/mass_to_unimod.py (_BRACKET_MOD, _mass_to_unimod + tables, parse_delta_mass_peptide) -- copied verbatim from sagepy_rescore; it needs only sagepy.core.unimod and sagepy_connector.py_unimod. Verified byte-identical to sagepy_rescore._parse_sage_peptide. Drops the sagepy_rescore dependency.
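For readers unfamiliar with the [+mass] to UNIMOD conversion, here is a minimal stand-alone sketch of the idea. It is not the vendored code: the three-entry mass table, the tolerance, and the function names are assumptions for illustration, whereas the real module looks masses up through sagepy's UNIMOD tables.

```python
import re

# Small illustrative table: monoisotopic delta mass -> UNIMOD accession number.
_MASS_TABLE = [
    (57.021464, 4),   # Carbamidomethyl
    (15.994915, 35),  # Oxidation
    (42.010565, 1),   # Acetyl
]

# Matches bracketed delta masses such as [+57.0215] or [-17.0265].
_BRACKET = re.compile(r"\[([+-]?\d+(?:\.\d+)?)\]")


def mass_to_unimod_id(delta, tol=0.01):
    """Return the UNIMOD id whose mass lies within tol Da of delta."""
    for mass, unimod_id in _MASS_TABLE:
        if abs(mass - delta) <= tol:
            return unimod_id
    raise KeyError(f"no UNIMOD entry within {tol} Da of {delta}")


def convert_delta_mass_peptide(peptide):
    """Rewrite every [+mass] annotation as [UNIMOD:id]."""
    def _replace(match):
        return f"[UNIMOD:{mass_to_unimod_id(float(match.group(1)))}]"
    return _BRACKET.sub(_replace, peptide)


converted = convert_delta_mass_peptide("AC[+57.0215]DM[+15.9949]K")
```

Raising on an unmatched mass, rather than passing the bracket through unchanged, is a design choice of this sketch; it keeps unknown modifications from silently entering a training set.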

…ents/modes
The instrument and acquisition-mode embeddings were sized exactly to the current vocabularies (9 / 7). Adding a new instrument later would change the embedding shape and break load_state_dict on every checkpoint.
Size the embeddings to a fixed capacity instead -- 32 instrument slots (9 used), 16 acquisition-mode slots (7 used). A new entry appended to INSTRUMENTS / ACQUISITION_MODES takes the next free id with the embedding shape unchanged: existing checkpoints stay loadable and the new id starts from an untrained row, ready to fine-tune. __post_init__ asserts the vocabularies stay within capacity.
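The fixed-capacity idea can be sketched without any deep-learning dependency. The class name `EmbeddingConfig`, its fields, and the three-entry instrument list are hypothetical; only the capacity of 32 and the `__post_init__` check come from the commit message above.

```python
from dataclasses import dataclass

# Illustrative subset only; the package's real INSTRUMENTS list has 9 entries.
INSTRUMENTS = ("unknown", "timsTOF", "timsTOF Pro")


@dataclass
class EmbeddingConfig:
    instrument_capacity: int = 32
    instruments: tuple = INSTRUMENTS

    def __post_init__(self):
        # The vocabulary may grow, but never past the reserved capacity.
        assert len(self.instruments) <= self.instrument_capacity, (
            "instrument vocabulary exceeds embedding capacity"
        )

    def instrument_id(self, name):
        """Ids are positions in the vocabulary, so appending never renumbers."""
        return self.instruments.index(name)

    def embedding_shape(self, dim=8):
        """Shape depends on capacity only, not on how many slots are used."""
        return (self.instrument_capacity, dim)


old = EmbeddingConfig()
new = EmbeddingConfig(instruments=INSTRUMENTS + ("timsTOF HT",))
```

Since `embedding_shape` is identical for `old` and `new`, a checkpoint saved under the old vocabulary would still match the parameter shape after the vocabulary grows, provided new names are only ever appended.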

…of_catalog
The Sage loader passed `instrument=unknown` for every campaign sample, which silently bypassed the encoder's metadata conditioning -- the very mechanism the multi-instrument pretraining curriculum was built to exercise.
Wire per-dataset `instrument_model` and `acquisition_mode` through a new `load_catalog_metadata(timstof_catalog.tsv)` -> per-accession dispatch in `build_split_datasets`; expose `--catalog` on `train.py`. Also extend ACQUISITION_MODES with `PASEF` and `single-cell` (the catalog's dominant values that the previous enum did not cover).
All 25 currently-processed campaign datasets resolve to real (instrument, acq) ids -- spanning Pro / Pro 2 / HT / SCP / fleX / plain timsTOF.

… for Sage examples
The fine-tune intensity SA capped at 0.66 because every campaign example was fed CE=0 -- squarely out-of-distribution for the head's CE FiLM, which was pretrained around the timsTOF-ms2 median normalised CE (~0.26). Default the Sage loader's CE to that in-distribution value; expose --default-ce on train.py so we can sweep it later.

…4 convention)
The intensity loss / eval masked target < 0 (impossible-channel only), which treated target == 0 as a real no-peak label. That is correct for densely annotated PROSPECT targets but wrong for Sage matched_fragments, where 0 means unmatched / unknown, not real zero. v4's finetune_dia_pasef_intensity.py canonical-Prosit loss masks target > 0; this aligns the package to that.
This explains a sizable chunk of the campaign fine-tune intensity-SA gap: re-evaluating the existing best.pt with the v4 metric already lifts SA from 0.673 to 0.733 with no retraining. The remaining ~0.10 gap to v4's 0.83 will need a retrain with the corrected training loss.
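The effect of the two masking conventions can be reproduced on a toy vector. The sketch assumes the usual normalised spectral angle, 1 - 2*arccos(cos)/pi, and a target where -1 marks impossible channels; the function signature and the example numbers are made up for illustration and are not taken from the package.

```python
import math


def spectral_angle(pred, target, keep):
    """Normalised spectral angle (1 = identical) over positions where keep(t) holds."""
    pairs = [(p, t) for p, t in zip(pred, target) if keep(t)]
    if not pairs:
        return None  # degenerate: nothing left to score
    norm_p = math.sqrt(sum(p * p for p, _ in pairs))
    norm_t = math.sqrt(sum(t * t for _, t in pairs))
    if norm_p == 0.0 or norm_t == 0.0:
        return None  # degenerate: a zero vector has no direction
    cos = sum(p * t for p, t in pairs) / (norm_p * norm_t)
    cos = max(-1.0, min(1.0, cos))
    return 1.0 - 2.0 * math.acos(cos) / math.pi


target = [0.0, 1.0, 0.0, 0.5, -1.0]  # -1 marks an impossible channel
pred = [0.4, 0.9, 0.3, 0.5, 0.0]

# Old convention: drop only target < 0, so zeros count as real "no peak" labels.
sa_zero_is_label = spectral_angle(pred, target, keep=lambda t: t >= 0)
# v4 convention: score only target > 0, so zeros are treated as unknown.
sa_zero_is_unknown = spectral_angle(pred, target, keep=lambda t: t > 0)
```

Under the first convention the model is penalised for predicting intensity at positions the search engine simply did not match; under the second those positions are ignored, which is why the same checkpoint scores higher without retraining.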

…tion
Two-prong fine-tune-time levers explored on the SMALL preset campaign data.
* Pooled (v4-style) intensity head: attention-pool residue latents, concat with charge/CE/instrument side-embeddings, MLP to Prosit-174 grid. Sits behind cfg.intensity_head = 'site' (default, local FiLM) or 'pooled' (new). Plumbed through train.py and pretrain.py via --intensity-head.
* Per-PSM CE calibration: new calibrate_ce.py runs an inference-time CE grid sweep against a frozen checkpoint and writes per-PSM optimal CEs to parquet keyed by (accession, psm_id). train.py and evaluate_optimal_nce.py accept --ce-calibration to use those calibrated CEs instead of the flat default; added load_ce_calibration to sage_dataset.
* Loss / metric: masked_spectral_angle now slices pred to the target's site dimension when pred is longer (pooled head outputs fixed (B,29,n_ion)).
* train.py: --tasks and --freeze-encoder for specialization sweeps; the eval helper also takes a tasks list. evaluate_split short-circuits skipped tasks.
Campaign results on the 25 timsTOF datasets:
    canonical baseline                          test SA 0.7597
    canonical + inference-time optimal-CE               0.7945
    round-1 calibrated fine-tune                        0.7867
    round-1 + inference-time optimal-CE                 0.8058
    round-2 calibrated fine-tune                        0.8090
    round-2 + inference-time optimal-CE                 0.8170
    pooled head + round-2 calibration (random           0.7510 (head-init disadvantage)
      pooled-head init, encoder warm-started)
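The per-PSM CE grid sweep described above reduces to an argmax over a grid for each PSM. In the sketch below the scoring functions are toy stand-ins for "spectral angle of the frozen model at this CE", and the grid range, step, and function names are assumptions; the real calibrate_ce.py runs model inference and writes its result to parquet.

```python
def best_ce(score_fn, grid):
    """Return the (ce, score) pair that maximises score_fn over the grid."""
    return max(((ce, score_fn(ce)) for ce in grid), key=lambda pair: pair[1])


def calibrate(psms, grid):
    """psms maps (accession, psm_id) -> score function of normalised CE."""
    return {key: best_ce(score_fn, grid)[0] for key, score_fn in psms.items()}


# Hypothetical grid of normalised collision energies: 0.10, 0.12, ..., 0.50.
grid = [round(0.10 + 0.02 * i, 2) for i in range(21)]

# Toy score curves, each peaking at a different CE.
psms = {
    ("ACC1", 1): lambda ce: 1.0 - (ce - 0.30) ** 2,
    ("ACC1", 2): lambda ce: 1.0 - (ce - 0.22) ** 2,
}

calibrated = calibrate(psms, grid)
```

Keying the result by (accession, psm_id), as the commit describes, lets a later training or evaluation run look up the calibrated CE for each example instead of using one flat default.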

…oint metadata
The real bug: the arccos clamp_eps was 1e-8, which let arccos's near-boundary derivative blow up to ~7000 on sparse intensity targets (Sage matched_fragments with 1 positive position normalize to one-hot vectors whose cos=1 exactly). Even with clamp absorbing the boundary in forward, the gradient chain through F.normalize on tiny-magnitude masked vectors compounded into NaN gradients.
This was the silent failure mode underneath three earlier negative results:
* RESEARCH preset pretraining at lr=3e-4 NaN'd intensity stages from epoch 1
* RESEARCH preset pretraining at lr=1e-4 NaN'd intensity (the cause, not lr)
* Pooled intensity head pretraining NaN'd intensity (same root cause)
Decoupled the two eps in masked_spectral_angle:
- norm_eps (1e-8): F.normalize denominator floor -- accurate for normal mags
- clamp_eps (1e-4): arccos boundary buffer -- derivative bounded at ~70 instead of ~7000
The fix unblocks scaling experiments. Empirically, with the fix in place:
* RESEARCH preset now pretrains cleanly through all 5 stages, with each stage reaching higher val SA than SMALL (timsTOF 0.8574 vs 0.8484).
* Pooled-head pretraining likewise runs to completion.
Recipe / infra additions:
* pretrain.py: --intensity-head flag; per-stage eval restricted to current stage's task (so the pooled head's fixed (B,29,n_ion) output doesn't shape-clash with Chronologer's long-peptide intensity placeholders during RT eval)
* train.py + pretrain.py: save intensity_head in checkpoint metadata
* calibrate_ce.py + evaluate_optimal_nce.py: auto-detect intensity_head from checkpoint (back-compat for legacy checkpoints inferring from state-dict keys)
* evaluate_optimal_nce.py: --ce-calibration to use the calibration parquet as the 'fixed' baseline (matches what a fine-tune that was trained with the same calibration actually saw)
* encoder.py: disable nested-tensor fast path on TransformerEncoder -- its SDPA backward had a real NaN risk on certain shape combinations
Campaign fine-tune results, ordered by test SA on the 7,724-PSM hold-out:
    config                                          test SA
    ---------------------------------------------   -------
    canonical baseline (flat CE 0.26)               0.7597
    + inference-time optimal-CE search              0.7945
    round-1 per-PSM CE calibration                  0.7867
    round-1 calibration + inference search          0.8058
    round-2 calibration                             0.8090
    round-2 calibration + inference search          0.8170
    per-(accession, charge) CE labels               0.7616  ABLATION
    pooled head, pretrained + r1 calibration        0.7673  ABLATION
    RESEARCH preset + r1 calibration                0.7868  ABLATION
    RESEARCH preset + r2 calibration                0.8073  ABLATION
    RESEARCH preset + r2 + inference search         0.8137  ABLATION
The +0.057 SA lift from baseline came from iterative per-PSM CE calibration + inference-time CE search alone. Pooled-head topology, RESEARCH capacity, and per-(run, charge) label smoothing each underperformed the SMALL site-head per-PSM recipe -- the per-PSM CE specificity is doing real work, not noise.
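The ~7000 versus ~70 figures quoted for the arccos boundary follow from the derivative d/dx arccos(x) = -1/sqrt(1 - x^2) evaluated at x = 1 - clamp_eps. A quick check, with a helper name that is purely illustrative:

```python
import math


def arccos_grad_magnitude(clamp_eps):
    """|d/dx arccos(x)| at the clamp boundary x = 1 - clamp_eps."""
    x = 1.0 - clamp_eps
    return 1.0 / math.sqrt(1.0 - x * x)


grad_old = arccos_grad_magnitude(1e-8)  # roughly 7.07e3
grad_new = arccos_grad_magnitude(1e-4)  # roughly 7.07e1
```

Near the boundary 1 - x^2 is approximately 2 * clamp_eps, so the gradient magnitude scales like 1/sqrt(2 * clamp_eps); raising clamp_eps by a factor of 10^4 therefore lowers the bound by a factor of 100.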

…apes
Add per-peptide peak-shape supervision so the predictor can configure timsim's elution/mobility peak widths (replacing its sampled/fixed defaults):
- RTSigmaHead: new head predicting the gradient-normalised RT EMG sigma from the encoder global token (apex stays Chronologer; lambda left to timsim sampling, as it is unidentifiable at timsTOF MS1 frame sampling).
- IM sigma: supervise the existing IonMobilityHead (mean, std) std output with an L1 term vs the measured Gaussian mobilogram width -- no model change.
- MultiTaskLoss: rt_sigma + im_sigma terms, both masked to peptides carrying a valid label and skipped when absent, so existing shape-free training is unchanged. Weights 0.5 each.
- tests/test_shape_heads.py covers the new head output, masked loss terms, and the backward-compatible no-target path.
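The "masked and skipped when absent" behaviour of the shape terms can be sketched in plain Python. The helper names are hypothetical, and using L1 for the RT sigma term is an assumption (the commit states L1 only for the IM sigma term); the 0.5 weight comes from the message above.

```python
def masked_l1(pred, target, mask):
    """Mean |pred - target| over labelled entries; None when nothing is labelled."""
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors) if errors else None


def total_loss(base_loss, shape_terms, weight=0.5):
    """Add each available shape term with the given weight; skip absent ones."""
    total = base_loss
    for term in shape_terms:
        if term is not None:
            total += weight * term
    return total


# Two of three peptides carry an RT-sigma label; none carries an IM-sigma label.
rt_sigma_term = masked_l1([0.02, 0.05, 0.03], [0.03, 0.0, 0.03], [True, False, True])
im_sigma_term = masked_l1([0.01], [0.0], [False])

loss_with_shape = total_loss(1.0, [rt_sigma_term, im_sigma_term])
loss_without_shape = total_loss(1.0, [None, None])
```

When no peptide in a batch has a shape label, both terms are skipped and the loss equals the base multi-task loss, which is what keeps shape-free training unchanged.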

Validate the published HF timsTOF corpus by training on it directly.
- data/hf_corpus_dataset.py: streams Tier-1 + Tier-3 parquets (pyarrow.dataset pushdown; the 391M-row Tier-3 can't be materialised) into the same example dicts the Sage path emits. Real per-PSM CE (volts/100); prefers Sage aligned_rt via --hf-rt-lookup, else per-accession-normalised rt_seconds.
- train.py: --hf-corpus / --hf-max-datasets / --hf-rt-lookup.
Validation (15 ds, cap 4k, from cap0-pretrained.pt): intensity SA 0.734, ccs_mae 0.0226, RT Pearson 0.880 (aligned), charge 0.804 -- in-neighborhood of the 56-dataset baseline; gap is data volume.

theGreatHerrLebert added 19 commits May 22, 2026 17:54

docs(peptide-property-ng): fix conditioning description in README

34ffa86

charge / m-z / CE are conditioned at the heads, not in the encoder (only instrument / acq-mode go through the encoder global token, which keeps the charge head leak-free). The README table said otherwise.

feat(peptide-property-ng): bump embedding headroom to 64 instruments …

9292187

…/ 32 acq modes Plenty of room to add new instrument models or acquisition modes later without resizing the embedding (so existing checkpoints keep loading).

peptide-property-ng: HF loader prefers native tier1.rt_aligned (v0.2)…

88b93e7

…; schema-aware (falls back for v0.1)

theGreatHerrLebert wants to merge 19 commits into main from feat/peptide-property-ng