peptide-property-ng: unified Depthcharge peptide property predictor#404
Draft
theGreatHerrLebert wants to merge 19 commits into
New research-track package: a clean-slate, Depthcharge-based neural network predicting fragment intensity, ion mobility/CCS, retention time and precursor charge from one shared encoder.

Key design points:
- Hybrid modification encoding: each residue carries both a learnable per-UNIMOD token embedding and an atomic-composition chemistry feature (from sagepy's UNIMOD tables), so rare/unseen modifications generalise and the scheme stays open-vocab (no re-mint when UNIMOD grows).
- Fragment-indexed intensity head: one prediction per cleavage site, removing Prosit's fixed 174-vector 30-residue cap.
- Instrument / acquisition-mode conditioning via the encoder global token; charge / m-z / collision energy conditioned at the heads so the charge head never sees its own label.
- CCS head ports the imspy sqrt(m/z) per-charge physics prior.

Trains on Sage search results. See docs/nn-architecture-exploration in the claudius-proteomics project for the design rationale.
- losses/metrics: skip degenerate intensity targets (no observed peak) from the spectral-angle loss and metric -- they were a constant, zero-gradient term diluting both.
- encoder: assert composition-table dimensions match the model config, rather than silently mis-indexing chemistry vectors.
- train: save the final model as best.pt if no epoch improved val SA, so the post-training reload cannot crash.
- embedding: add token_only / composition_only fusion modes for the modification-encoding ablation; expose via `train --comp-fusion`.
- tests: synthetic fragment-target tests + loss / degenerate-handling tests (24 total).
charge / m-z / CE are conditioned at the heads, not in the encoder (only instrument / acq-mode go through the encoder global token, which keeps the charge head leak-free). The README table said otherwise.
…t 174-vector Both data sources now reach the site-indexed intensity target through one proven encoder + one conversion, replacing the hand-rolled site encoder:
- Sage fragments -> 174-vector via imspy's observed_fragments_to_intensity_target (the encoding the rescoring pipeline already uses), then prosit174_to_sites.
- Wilhelmlab HF MS2 datasets store the 174-vector natively -> prosit174_to_sites.

prosit174_to_sites is the single ordinal->site remap; the hand-rolled build_intensity_target is removed. Adds data/hf_intensity.py for the timsTOF-ms2 / prospect-ptms-ms2 / Prosit-2025 intensity-pretraining datasets. Intensity peptides are capped at 30 aa (the 174-vector limit).
…>site test Codex's conversion review noted the tests pinned y-channel-0 / b-channel-3 but not the charge order within each y/b triplet. This test sets distinct values across all six channels of one ordinal so a charge permutation would fail.
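A minimal sketch of what an ordinal->site remap and the six-channel permutation check could look like. It is not the package's `prosit174_to_sites`: the helper name `remap_ordinals_to_sites`, the flat layout (29 ordinals x 6 channels, ordered y1+, y2+, y3+, b1+, b2+, b3+ within each ordinal) and the list-of-rows output are all assumptions made for illustration.

```python
N_ORDINALS = 29  # fragment ordinals 1..29, i.e. peptides of up to 30 residues
N_CHANNELS = 6   # ASSUMED order per ordinal: y1+, y2+, y3+, b1+, b2+, b3+


def remap_ordinals_to_sites(vec174, peptide_len):
    """Re-index a flat 174-vector by cleavage site (hypothetical helper).

    Site s (1-based) is the bond after residue s. The b ion with ordinal s and
    the y ion with ordinal (peptide_len - s) both come from that bond.
    Returns peptide_len - 1 rows of [y1+, y2+, y3+, b1+, b2+, b3+].
    """
    if len(vec174) != N_ORDINALS * N_CHANNELS:
        raise ValueError("expected a 174-element vector")
    if not 2 <= peptide_len <= N_ORDINALS + 1:
        raise ValueError("peptide length must be between 2 and 30")

    # Split the flat vector into one 6-channel row per fragment ordinal.
    grid = [vec174[i * N_CHANNELS:(i + 1) * N_CHANNELS] for i in range(N_ORDINALS)]

    sites = []
    for s in range(1, peptide_len):
        y_row = grid[peptide_len - s - 1]  # y ordinal = peptide_len - s
        b_row = grid[s - 1]                # b ordinal = s
        sites.append(list(y_row[:3]) + list(b_row[3:]))
    return sites


# Distinct values in all six channels of ordinal 2, so a swapped charge order
# inside the y or b triplet would change the result.
vec = [0.0] * (N_ORDINALS * N_CHANNELS)
vec[6:12] = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
sites = remap_ordinals_to_sites(vec, peptide_len=5)
```

For a 5-residue peptide, the b values of ordinal 2 land on site 2 and the y values of ordinal 2 land on site 3, each keeping its charge order.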
Adds the pretrain-then-fine-tune pipeline:
- data/chronologer_rt.py -- Chronologer DB (2.64M peptide-RT, harmonised HI)
- data/ccs_pretrain.py -- ionmob CCS parquets (UNIMOD-annotated)
- train/pretrain.py -- staged single-task curriculum: Orbitrap intensity -> timsTOF intensity -> CCS -> RT, accumulating in the shared encoder; saves a checkpoint train.py picks up via --init-from.
- config: add an `orbitrap` instrument id so the embedding can bridge the Orbitrap-HCD -> timsTOF-PASEF domain shift across pretraining stages.

CCS/RT pretrain on their native units (CCS, harmonised HI); the unit shift to Sage 1/K0 / aligned_rt is absorbed during the campaign fine-tune -- no error-prone physics conversion. Intensity uses the one proven 174-vector encoding throughout.
- ccs / rt loaders: guard finite, positive targets -- a NaN no longer produces a NaN loss (codex A1/A2).
- train --init-from: reject a preset mismatch with a clear error instead of a shape crash (codex A3).
- normalise the CCS / ion-mobility target to ~[0,1] with fixed bounds in both the ionmob (CCS) and Sage (1/K0) loaders, and re-init the physics prior for that range -- the CCS head now sees one consistent target scale across pretraining and fine-tune, removing the negative transfer codex flagged (B1); still no physics conversion.
- pretrain: reorder the curriculum (RT, CCS, then intensity last) so the shared encoder ends tuned for the priority task, and save per-stage checkpoints so the fine-tune handoff is selectable (B2); add the Prosit-2025 intensity stage (B3).
- hf_intensity: validate the precursor-charge one-hot before argmax (C3).
- pretrain: add --chronologer-db / --ccs-glob so the data paths are configurable (for running on monster3).
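
As an illustration of the loader guard and the fixed-bounds normalisation, here is a small sketch. The bounds `CCS_MIN` / `CCS_MAX` and the function name `normalise_target` are placeholders chosen for the example; the package's actual bounds are not stated in this message.

```python
import math

# HYPOTHETICAL fixed bounds, for illustration only.
CCS_MIN, CCS_MAX = 200.0, 1200.0


def normalise_target(value, lo=CCS_MIN, hi=CCS_MAX):
    """Map a raw target onto roughly [0, 1]; return None for unusable labels."""
    # Guard: drop missing, non-finite or non-positive targets so they never
    # reach the loss as NaN.
    if value is None or not math.isfinite(value) or value <= 0.0:
        return None
    return (value - lo) / (hi - lo)


raw = [450.0, float("nan"), -1.0, 700.0]
targets = [t for t in map(normalise_target, raw) if t is not None]
```

Because the bounds are constants rather than per-dataset statistics, the same raw value maps to the same normalised value in every loader that shares them.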
- clip the normalised CCS / ion-mobility targets to [0,1] (codex: values beyond the fixed bounds would otherwise leave the [0,1] range).
- pretrain: peptide-level stage hold-out (a peptide no longer crosses train/val within a stage) instead of a random row split.

Final codex review: build verified sound for capped runs. Remaining note -- streaming/sharded stage loading for an uncapped full-scale run -- is a follow-up; a finite --cap avoids it.
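
One way to get a peptide-level hold-out is to derive the split from a hash of the peptide string, so every row of a peptide lands on the same side. The sketch below assumes that approach; `assign_split` and the 10% validation fraction are illustrative, not the package's implementation.

```python
import hashlib


def assign_split(peptide, val_fraction=0.1):
    """Deterministically assign a peptide to 'train' or 'val' (hypothetical helper)."""
    digest = hashlib.sha256(peptide.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


# Several rows (e.g. different charge states) can share one peptide.
rows = [
    {"peptide": "PEPTIDEK", "charge": 2},
    {"peptide": "PEPTIDEK", "charge": 3},
    {"peptide": "ELVISLIVESK", "charge": 2},
    {"peptide": "ELVISLIVESK", "charge": 2},
]
splits = [assign_split(row["peptide"]) for row in rows]
```

A random row split could put the charge-2 and charge-3 rows of the same peptide on opposite sides; hashing the sequence rules that out by construction.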
…sion The Chronologer RT loader and the Sage loader converted [+mass] peptide notation via sagepy_rescore._parse_sage_peptide, which imports sagepy_rescore.pipeline -- a heavy module that pulls the whole search/fdr stack and is version-fragile (its sagepy.core.fdr imports break against older sagepy installs, e.g. monster3's sagepy 0.4.4). Vendor just the conversion into data/mass_to_unimod.py (_BRACKET_MOD, _mass_to_unimod + tables, parse_delta_mass_peptide) -- copied verbatim from sagepy_rescore; it needs only sagepy.core.unimod and sagepy_connector.py_unimod. Verified byte-identical to sagepy_rescore._parse_sage_peptide. Drops the sagepy_rescore dependency.
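The kind of conversion being vendored can be sketched as below. This is not the copied `sagepy_rescore` code: the names `delta_mass_to_unimod`, `_DELTA_MASS` and `_KNOWN_MODS`, the three-entry table and the 1e-3 Da tolerance are assumptions for illustration. The three monoisotopic masses are the standard UNIMOD values for Acetyl (1), Carbamidomethyl (4) and Oxidation (35).

```python
import re

# Matches bracketed delta masses such as [+57.0215].
_DELTA_MASS = re.compile(r"\[([+-]\d+(?:\.\d+)?)\]")

# Tiny illustrative table: monoisotopic delta mass -> UNIMOD accession.
_KNOWN_MODS = {
    42.010565: 1,   # Acetyl
    57.021464: 4,   # Carbamidomethyl
    15.994915: 35,  # Oxidation
}


def delta_mass_to_unimod(peptide, tol=1e-3):
    """Rewrite [+mass] annotations as [UNIMOD:n] (hypothetical helper)."""

    def replace(match):
        mass = float(match.group(1))
        for reference, accession in _KNOWN_MODS.items():
            if abs(mass - reference) <= tol:
                return f"[UNIMOD:{accession}]"
        raise ValueError(f"no UNIMOD entry within {tol} Da of {mass}")

    return _DELTA_MASS.sub(replace, peptide)


converted = delta_mass_to_unimod("AC[+57.0215]DM[+15.9949]K")
```

Raising on an unmatched mass, rather than passing it through, keeps unknown modifications from silently reaching the tokenizer.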
…ents/modes The instrument and acquisition-mode embeddings were sized exactly to the current vocabularies (9 / 7). Adding a new instrument later would change the embedding shape and break load_state_dict on every checkpoint. Size the embeddings to a fixed capacity instead -- 32 instrument slots (9 used), 16 acquisition-mode slots (7 used). A new entry appended to INSTRUMENTS / ACQUISITION_MODES takes the next free id with the embedding shape unchanged: existing checkpoints stay loadable and the new id starts from an untrained row, ready to fine-tune. __post_init__ asserts the vocabularies stay within capacity.
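The capacity idea can be sketched with a plain dataclass. The class name `ConditioningConfig`, its field names and the three-entry vocabularies are illustrative stand-ins, not the package's config; only the 32 / 16 capacities and the `__post_init__` check come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ConditioningConfig:
    """Hypothetical config: vocabularies may grow, embedding capacity stays fixed."""

    instruments: tuple = ("unknown", "timsTOF Pro", "orbitrap")
    acquisition_modes: tuple = ("unknown", "DDA", "DIA")
    instrument_capacity: int = 32
    acq_mode_capacity: int = 16

    def __post_init__(self):
        # Fail early if a vocabulary outgrows the reserved embedding rows.
        assert len(self.instruments) <= self.instrument_capacity, "too many instruments"
        assert len(self.acquisition_modes) <= self.acq_mode_capacity, "too many modes"

    def instrument_id(self, name):
        # Ids are positions in the vocabulary, so appending never renumbers.
        return self.instruments.index(name)

    def embedding_shape(self, dim):
        # The table is sized by capacity, not by the current vocabulary length.
        return (self.instrument_capacity, dim)


old = ConditioningConfig()
new = ConditioningConfig(instruments=old.instruments + ("timsTOF HT",))
```

Here `new` assigns the appended instrument the next free id while `embedding_shape` is identical for `old` and `new`, which is the property that keeps older checkpoints loadable.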
…/ 32 acq modes Plenty of room to add new instrument models or acquisition modes later without resizing the embedding (so existing checkpoints keep loading).
…of_catalog The Sage loader passed `instrument=unknown` for every campaign sample, which silently bypassed the encoder's metadata conditioning -- the very mechanism the multi-instrument pretraining curriculum was built to exercise. Wire per-dataset `instrument_model` and `acquisition_mode` through a new `load_catalog_metadata(timstof_catalog.tsv)` -> per-accession dispatch in `build_split_datasets`; expose `--catalog` on `train.py`. Also extend ACQUISITION_MODES with `PASEF` and `single-cell` (the catalog's dominant values that the previous enum did not cover). All 25 currently-processed campaign datasets resolve to real (instrument, acq) ids -- spanning Pro / Pro 2 / HT / SCP / fleX / plain timsTOF.
… for Sage examples The fine-tune intensity SA capped at 0.66 because every campaign example was fed CE=0 -- squarely out-of-distribution for the head's CE FiLM, which was pretrained around the timsTOF-ms2 median normalised CE (~0.26). Default the Sage loader's CE to that in-distribution value; expose --default-ce on train.py so we can sweep it later.
…4 convention) The intensity loss / eval masked target < 0 (impossible-channel only), which treated target == 0 as a real no-peak label. That is correct for densely annotated PROSPECT targets but wrong for Sage matched_fragments, where 0 means unmatched / unknown, not real zero. v4's finetune_dia_pasef_intensity.py canonical-Prosit loss masks target > 0; this aligns the package to that. This explains a sizable chunk of the campaign fine-tune intensity-SA gap: re-evaluating the existing best.pt with the v4 metric already lifts SA from 0.673 to 0.733 with no retraining. The remaining ~0.10 gap to v4's 0.83 will need a retrain with the corrected training loss.
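To make the effect of the two masks concrete, here is a pure-Python sketch of a masked spectral angle, using the usual definition SA = 1 - 2*arccos(cos)/pi. The function `spectral_angle` and the toy vectors are illustrative; the package's `masked_spectral_angle` operates on batched tensors.

```python
import math


def spectral_angle(pred, target, keep):
    """Spectral angle over the positions where keep is True (illustrative)."""
    p = [x for x, k in zip(pred, keep) if k]
    t = [x for x, k in zip(target, keep) if k]
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_t = math.sqrt(sum(x * x for x in t))
    if norm_p == 0.0 or norm_t == 0.0:
        return None  # degenerate: nothing to compare
    cos = sum(a * b for a, b in zip(p, t)) / (norm_p * norm_t)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return 1.0 - 2.0 * math.acos(cos) / math.pi


# -1 marks an impossible channel; 0 means "unmatched" for Sage-style targets.
target = [0.8, 0.0, 0.4, -1.0]
pred = [0.8, 0.5, 0.4, 0.3]

sa_nonneg = spectral_angle(pred, target, [t >= 0 for t in target])  # old mask
sa_pos = spectral_angle(pred, target, [t > 0 for t in target])      # target > 0 mask
```

With the old mask the prediction at the unmatched position is penalised as if a true zero had been observed; with the `target > 0` mask that position is ignored and the score depends only on the matched peaks.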
…tion
Two-prong fine-tune-time levers explored on the SMALL preset campaign data.
* Pooled (v4-style) intensity head: attention-pool residue latents, concat with
charge/CE/instrument side-embeddings, MLP to Prosit-174 grid. Sits behind
cfg.intensity_head = 'site' (default, local FiLM) or 'pooled' (new). Plumbed
through train.py and pretrain.py via --intensity-head.
* Per-PSM CE calibration: new calibrate_ce.py runs an inference-time CE grid
sweep against a frozen checkpoint and writes per-PSM optimal CEs to parquet
keyed by (accession, psm_id). train.py and evaluate_optimal_nce.py accept
--ce-calibration to use those calibrated CEs instead of the flat default;
added load_ce_calibration to sage_dataset.
* Loss / metric: masked_spectral_angle now slices pred to the target's site
dimension when pred is longer (pooled head outputs fixed (B,29,n_ion)).
* train.py: --tasks and --freeze-encoder for specialization sweeps; the eval
helper also takes a tasks list. evaluate_split short-circuits skipped tasks.
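
The per-PSM CE calibration step can be pictured as a grid search per PSM. In this sketch `calibrate_ce`, `toy_score`, the `true_ce` field, the accession names and the grid are all hypothetical; in the real script the score would come from running the frozen checkpoint at each CE and computing the spectral angle.

```python
def calibrate_ce(psms, score_fn, grid):
    """Return {(accession, psm_id): best_ce} by exhaustive search over grid."""
    best = {}
    for psm in psms:
        key = (psm["accession"], psm["psm_id"])
        # Pick the CE whose prediction scores highest for this PSM.
        best[key] = max(grid, key=lambda ce: score_fn(psm, ce))
    return best


def toy_score(psm, ce):
    # Stand-in for "spectral angle of the frozen model's prediction at this CE".
    return -(ce - psm["true_ce"]) ** 2


grid = [i / 20 for i in range(11)]  # 0.00, 0.05, ..., 0.50 (normalised CE)
psms = [
    {"accession": "ACC_A", "psm_id": 1, "true_ce": 0.31},
    {"accession": "ACC_A", "psm_id": 2, "true_ce": 0.12},
]
calibrated = calibrate_ce(psms, toy_score, grid)
```

The resulting mapping is keyed the same way as the calibration parquet described above, so a loader can look up a CE per example and fall back to the flat default when a key is missing.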
Campaign results on the 25 timsTOF datasets (test SA):
  canonical baseline                              0.7597
  canonical + inference-time optimal-CE           0.7945
  round-1 calibrated fine-tune                    0.7867
  round-1 + inference-time optimal-CE             0.8058
  round-2 calibrated fine-tune                    0.8090
  round-2 + inference-time optimal-CE             0.8170
  pooled head + round-2 calibration (random       0.7510 (head-init disadvantage)
    pooled-head init, encoder warm-started)
…oint metadata
The real bug: the arccos clamp_eps was 1e-8, which let arccos's near-boundary
derivative blow up to ~7000 on sparse intensity targets (Sage matched_fragments
with 1 positive position normalize to one-hot vectors whose cos=1 exactly).
Even with clamp absorbing the boundary in forward, the gradient chain through
F.normalize on tiny-magnitude masked vectors compounded into NaN gradients.
This was the silent failure mode underneath three earlier negative results:
* RESEARCH preset pretraining at lr=3e-4 NaN'd intensity stages from epoch 1
* RESEARCH preset pretraining at lr=1e-4 NaN'd intensity (the cause, not lr)
* Pooled intensity head pretraining NaN'd intensity (same root cause)
Decoupled the two eps in masked_spectral_angle:
- norm_eps (1e-8): F.normalize denominator floor — accurate for normal mags
- clamp_eps (1e-4): arccos boundary buffer — derivative bounded at ~70 instead of ~7000
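
The two derivative magnitudes quoted above follow from |d/dx arccos(x)| = 1/sqrt(1 - x^2), which at x = 1 - eps is approximately 1/sqrt(2*eps). A quick check, with `arccos_grad_bound` as an illustrative helper name:

```python
import math


def arccos_grad_bound(clamp_eps):
    """|d/dx arccos(x)| evaluated at the clamp boundary x = 1 - clamp_eps."""
    x = 1.0 - clamp_eps
    return 1.0 / math.sqrt(1.0 - x * x)


bound_old = arccos_grad_bound(1e-8)  # roughly 7.07e3
bound_new = arccos_grad_bound(1e-4)  # roughly 70.7
```

Raising the clamp from 1e-8 to 1e-4 therefore lowers the worst-case arccos gradient by about two orders of magnitude, at the cost of capping the reported angle slightly away from a perfect match.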
The fix unblocks scaling experiments. Empirically, with the fix in place:
* RESEARCH preset now pretrains cleanly through all 5 stages, with each
stage reaching higher val SA than SMALL (timsTOF 0.8574 vs 0.8484).
* Pooled-head pretraining likewise runs to completion.
Recipe / infra additions:
* pretrain.py: --intensity-head flag; per-stage eval restricted to current
stage's task (so the pooled head's fixed (B,29,n_ion) output doesn't shape-
clash with Chronologer's long-peptide intensity placeholders during RT eval)
* train.py + pretrain.py: save intensity_head in checkpoint metadata
* calibrate_ce.py + evaluate_optimal_nce.py: auto-detect intensity_head from
checkpoint (back-compat for legacy checkpoints inferring from state-dict keys)
* evaluate_optimal_nce.py: --ce-calibration to use the calibration parquet
as the 'fixed' baseline (matches what a fine-tune that was trained with
the same calibration actually saw)
* encoder.py: disable nested-tensor fast path on TransformerEncoder — its
SDPA backward had a real NaN risk on certain shape combinations
Campaign fine-tune results, ordered by test SA on the 7,724-PSM hold-out:
config                                          test SA
---------------------------------------------   -------
canonical baseline (flat CE 0.26)               0.7597
+ inference-time optimal-CE search              0.7945
round-1 per-PSM CE calibration                  0.7867
round-1 calibration + inference search          0.8058
round-2 calibration                             0.8090
round-2 calibration + inference search          0.8170
per-(accession, charge) CE labels               0.7616   ABLATION
pooled head, pretrained + r1 calibration        0.7673   ABLATION
RESEARCH preset + r1 calibration                0.7868   ABLATION
RESEARCH preset + r2 calibration                0.8073   ABLATION
RESEARCH preset + r2 + inference search         0.8137   ABLATION
The +0.057 SA lift from baseline came from iterative per-PSM CE calibration +
inference-time CE search alone. Pooled-head topology, RESEARCH capacity, and
per-(run, charge) label smoothing each underperformed the SMALL site-head
per-PSM recipe — the per-PSM CE specificity is doing real work, not noise.
…apes Add per-peptide peak-shape supervision so the predictor can configure timsim's elution/mobility peak widths (replacing its sampled/fixed defaults):
- RTSigmaHead: new head predicting the gradient-normalised RT EMG sigma from the encoder global token (apex stays Chronologer; lambda left to timsim sampling, as it is unidentifiable at timsTOF MS1 frame sampling).
- IM sigma: supervise the existing IonMobilityHead (mean, std) std output with an L1 term vs the measured Gaussian mobilogram width -- no model change.
- MultiTaskLoss: rt_sigma + im_sigma terms, both masked to peptides carrying a valid label and skipped when absent, so existing shape-free training is unchanged. Weights 0.5 each.
- tests/test_shape_heads.py covers the new head output, masked loss terms, and the backward-compatible no-target path.
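
A sketch of how a masked, optional loss term can leave shape-free training untouched. `masked_l1` and `total_loss` are illustrative names operating on plain lists, with `None` standing in for a missing label; the real `MultiTaskLoss` works on tensors and masks. The 0.5 weight is the one stated above.

```python
def masked_l1(pred, target):
    """Mean |pred - target| over labelled entries; None when no label exists."""
    pairs = [(p, t) for p, t in zip(pred, target) if t is not None]
    if not pairs:
        return None  # nothing to supervise in this batch
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


def total_loss(base_loss, rt_sigma_term, im_sigma_term, weight=0.5):
    """Add each shape term only when it is present (hypothetical combiner)."""
    total = base_loss
    for term in (rt_sigma_term, im_sigma_term):
        if term is not None:
            total += weight * term
    return total


im_term = masked_l1([0.2, 0.5, 0.9], [0.1, None, 1.0])  # two labelled peptides
no_term = masked_l1([0.2, 0.5], [None, None])           # batch without labels
with_shape = total_loss(1.0, None, im_term)
without_shape = total_loss(1.0, None, no_term)
```

When a batch carries no shape labels, both terms are skipped and the total equals the base loss.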
Validate the published HF timsTOF corpus by training on it directly.
- data/hf_corpus_dataset.py: streams Tier-1 + Tier-3 parquets (pyarrow.dataset pushdown; the 391M-row Tier-3 can't be materialised) into the same example dicts the Sage path emits. Real per-PSM CE (volts/100); prefers Sage aligned_rt via --hf-rt-lookup, else per-accession-normalised rt_seconds.
- train.py: --hf-corpus / --hf-max-datasets / --hf-rt-lookup.

Validation (15 ds, cap 4k, from cap0-pretrained.pt): intensity SA 0.734, ccs_mae 0.0226, RT Pearson 0.880 (aligned), charge 0.804. This is in-neighborhood of the 56-dataset baseline; gap is data volume.
…; schema-aware (falls back for v0.1)
Draft: the `peptide-property-ng` research predictor, a clean-slate, Depthcharge-based unified model predicting peptide intensity / CCS-IM / RT / charge from one shared encoder, plus the new timsim peak-shape heads. 17 commits; not yet ready to merge (see "remaining" below).

What the feature adds
- … `timstof_catalog`, leak-free via the encoder global token; embedding headroom reserved for new instruments/modes.
- `RTSigmaHead` (gradient-normalised RT EMG σ) + IM-σ supervision on the ion-mobility head's `std`, masked & additive. Lets the predictor configure timsim's per-peptide peak widths. (Design + POC: see the `feat/sim-shape-predictor` docs PR in claudius-proteomics.)

Status / remaining before un-drafting
- … `rt_sigma`/`im_sigma` targets + ProForma string normalization) is not yet wired.

Opening as draft for visibility/review while training validation is pending.