peptide-property-ng: unified Depthcharge peptide property predictor #404

Draft
theGreatHerrLebert wants to merge 19 commits into main from feat/peptide-property-ng

@theGreatHerrLebert

Owner

Draft — the peptide-property-ng research predictor: a clean-slate, Depthcharge-based unified model predicting peptide intensity / CCS-IM / RT / charge from one shared encoder, plus the new timsim peak-shape heads. 17 commits; not yet ready to merge (see "remaining" below).

What the feature adds

  • Unified multi-task model — shared encoder + per-task heads (intensity, ion-mobility, RT, charge).
  • Hybrid modification encoding — learnable per-mod token + atomic-composition feature (composition table from sagepy UNIMOD), so rare/unseen mods inherit chemistry instead of a cold embedding.
  • Fragment-indexed intensity head — one prediction per cleavage site (removes the Prosit 30-residue cap); pooled-head + per-PSM CE calibration variant.
  • Instrument / acquisition conditioning — wired from timstof_catalog, leak-free via the encoder global token; embedding headroom reserved for new instruments/modes.
  • timsim peak-shape supervision (latest commit): RTSigmaHead (gradient-normalised RT EMG σ) + IM-σ supervision on the ion-mobility head's std, masked & additive. Lets the predictor configure timsim's per-peptide peak widths. (Design + POC: see the feat/sim-shape-predictor docs PR in claudius-proteomics.)
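
A minimal sketch of the hybrid modification encoding idea from the list above, in plain Python rather than the package's own modules. The element order, the `UNK` fallback row, the toy token values and the concatenation fusion are assumptions for illustration; only the atomic compositions are standard UNIMOD chemistry.

```python
# Hypothetical illustration: fuse a per-mod token with an atomic-composition vector.
ELEMENTS = ("C", "H", "N", "O", "P", "S")  # assumed feature order

# Atomic composition deltas for a few UNIMOD entries.
COMPOSITION = {
    "UNIMOD:1": {"C": 2, "H": 2, "O": 1},          # Acetyl
    "UNIMOD:4": {"C": 2, "H": 3, "N": 1, "O": 1},  # Carbamidomethyl
    "UNIMOD:21": {"H": 1, "O": 3, "P": 1},         # Phospho
    "UNIMOD:35": {"O": 1},                         # Oxidation
}

# Stand-in for a learnable embedding table: only mods seen in training have a row.
TOKEN_TABLE = {
    "UNK": [0.0, 0.0],
    "UNIMOD:4": [0.3, -0.1],
    "UNIMOD:35": [-0.2, 0.5],
}


def composition_vector(mod):
    """Element counts of the modification, in ELEMENTS order."""
    counts = COMPOSITION.get(mod, {})
    return [float(counts.get(element, 0)) for element in ELEMENTS]


def encode_mod(mod):
    """Concatenate the token row (or the UNK row) with the chemistry feature."""
    token = TOKEN_TABLE.get(mod, TOKEN_TABLE["UNK"])
    return token + composition_vector(mod)


seen = encode_mod("UNIMOD:35")    # trained token + composition
unseen = encode_mod("UNIMOD:21")  # UNK token, but still carries H1 O3 P1
```

In this toy version an unseen phospho site still produces a non-trivial feature vector, which is the behaviour the hybrid encoding is meant to provide.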

Status / remaining before un-drafting

  • Shape-target dataset-loader join (sidecar σ labels → rt_sigma/im_sigma targets + ProForma string normalization) is not yet wired.
  • Prototype training + held-out metrics to be run on monster3 (full env lives there).
  • Unseen-modification eval split (the decisive test for the hybrid encoding) still deferred.

Opening as draft for visibility/review while training validation is pending.

New research-track package: a clean-slate, Depthcharge-based neural
network predicting fragment intensity, ion mobility/CCS, retention time
and precursor charge from one shared encoder.

Key design points:
- Hybrid modification encoding: each residue carries both a learnable
  per-UNIMOD token embedding and an atomic-composition chemistry feature
  (from sagepy's UNIMOD tables), so rare/unseen modifications generalise
  and the scheme stays open-vocab (no re-mint when UNIMOD grows).
- Fragment-indexed intensity head: one prediction per cleavage site,
  removing Prosit's fixed 174-vector 30-residue cap.
- Instrument / acquisition-mode conditioning via the encoder global
  token; charge / m-z / collision energy conditioned at the heads so the
  charge head never sees its own label.
- CCS head ports the imspy sqrt(m/z) per-charge physics prior.
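
A rough sketch of a sqrt(m/z) per-charge prior as mentioned in the last point, assuming a linear-in-sqrt(m/z) form with one slope and one intercept per charge state. The coefficients below are made-up placeholders, and `ccs_prior` / `ccs_prediction` are hypothetical names, not the imspy or package API.

```python
import math

# Placeholder (slope, intercept) per charge state; real values would be
# fitted or learned, these are for illustration only.
PRIOR = {1: (10.0, 100.0), 2: (13.0, 120.0), 3: (16.0, 150.0)}


def ccs_prior(mz, charge):
    """Physics-style prior: slope_z * sqrt(m/z) + intercept_z."""
    slope, intercept = PRIOR[charge]
    return slope * math.sqrt(mz) + intercept


def ccs_prediction(mz, charge, residual):
    """Prior plus a residual; the residual stands in for the network output."""
    return ccs_prior(mz, charge) + residual
```

With this split the network only has to learn the deviation from the charge-dependent trend rather than the trend itself.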

Trains on Sage search results. See docs/nn-architecture-exploration in
the claudius-proteomics project for the design rationale.
- losses/metrics: skip degenerate intensity targets (no observed peak)
  from the spectral-angle loss and metric -- they were a constant,
  zero-gradient term diluting both.
- encoder: assert composition-table dimensions match the model config,
  rather than silently mis-indexing chemistry vectors.
- train: save the final model as best.pt if no epoch improved val SA,
  so the post-training reload cannot crash.
- embedding: add token_only / composition_only fusion modes for the
  modification-encoding ablation; expose via `train --comp-fusion`.
- tests: synthetic fragment-target tests + loss / degenerate-handling
  tests (24 total).
charge / m-z / CE are conditioned at the heads, not in the encoder
(only instrument / acq-mode go through the encoder global token, which
keeps the charge head leak-free). The README table said otherwise.
…t 174-vector

Both data sources now reach the site-indexed intensity target through one
proven encoder + one conversion, replacing the hand-rolled site encoder:

- Sage fragments -> 174-vector via imspy's observed_fragments_to_intensity_target
  (the encoding the rescoring pipeline already uses), then prosit174_to_sites.
- Wilhelmlab HF MS2 datasets store the 174-vector natively -> prosit174_to_sites.

prosit174_to_sites is the single ordinal->site remap; the hand-rolled
build_intensity_target is removed. Adds data/hf_intensity.py for the
timsTOF-ms2 / prospect-ptms-ms2 / Prosit-2025 intensity-pretraining datasets.
Intensity peptides are capped at 30 aa (the 174-vector limit).
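
For illustration, a sketch of what an ordinal-to-site remap could look like. It assumes the usual Prosit layout of 29 ordinals x 6 channels, with channels ordered y+1, y+2, y+3, b+1, b+2, b+3 inside each ordinal; `prosit174_to_sites_sketch` and the per-site dict layout are hypothetical and not the package's actual `prosit174_to_sites`.

```python
N_ORDINALS, N_CHANNELS = 29, 6  # 29 * 6 = 174


def prosit174_to_sites_sketch(vec174, peptide_len):
    """Remap an ordinal-indexed 174-vector to one entry per cleavage site.

    Site s (0-based) is the bond after residue s + 1. The b ion with ordinal
    index o belongs to site o; the y ion with ordinal index o belongs to
    site (n_sites - 1 - o).
    """
    if len(vec174) != N_ORDINALS * N_CHANNELS:
        raise ValueError("expected a 174-element vector")
    if not 2 <= peptide_len <= N_ORDINALS + 1:
        raise ValueError("peptide length must be between 2 and 30")
    n_sites = peptide_len - 1
    sites = [{"b": [0.0] * 3, "y": [0.0] * 3} for _ in range(n_sites)]
    for o in range(n_sites):
        base = o * N_CHANNELS
        sites[n_sites - 1 - o]["y"] = list(vec174[base:base + 3])  # y, charge 1..3
        sites[o]["b"] = list(vec174[base + 3:base + 6])            # b, charge 1..3
    return sites


# A 4-residue peptide has 3 cleavage sites.
example_sites = prosit174_to_sites_sketch([float(i) for i in range(174)], 4)
```

The length check is where the 30 aa cap shows up: a 174-vector simply has no slots for a 30th cleavage site.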
…>site test

Codex's conversion review noted the tests pinned y-channel-0 / b-channel-3 but
not the charge order within each y/b triplet. This test sets distinct values
across all six channels of one ordinal so a charge permutation would fail.
Adds the pretrain-then-fine-tune pipeline:
- data/chronologer_rt.py  -- Chronologer DB (2.64M peptide-RT, harmonised HI)
- data/ccs_pretrain.py    -- ionmob CCS parquets (UNIMOD-annotated)
- train/pretrain.py       -- staged single-task curriculum: Orbitrap intensity
  -> timsTOF intensity -> CCS -> RT, accumulating in the shared encoder;
  saves a checkpoint train.py picks up via --init-from.
- config: add an `orbitrap` instrument id so the embedding can bridge the
  Orbitrap-HCD -> timsTOF-PASEF domain shift across pretraining stages.

CCS/RT pretrain on their native units (CCS, harmonised HI); the unit shift to
Sage 1/K0 / aligned_rt is absorbed during the campaign fine-tune -- no
error-prone physics conversion. Intensity uses the one proven 174-vector
encoding throughout.
- ccs / rt loaders: guard finite, positive targets -- a NaN no longer
  produces a NaN loss (codex A1/A2).
- train --init-from: reject a preset mismatch with a clear error instead
  of a shape crash (codex A3).
- normalise the CCS / ion-mobility target to ~[0,1] with fixed bounds in
  both the ionmob (CCS) and Sage (1/K0) loaders, and re-init the physics
  prior for that range -- the CCS head now sees one consistent target
  scale across pretraining and fine-tune, removing the negative transfer
  codex flagged (B1); still no physics conversion.
- pretrain: reorder the curriculum (RT, CCS, then intensity last) so the
  shared encoder ends tuned for the priority task, and save per-stage
  checkpoints so the fine-tune handoff is selectable (B2); add the
  Prosit-2025 intensity stage (B3).
- hf_intensity: validate the precursor-charge one-hot before argmax (C3).
- pretrain: add --chronologer-db / --ccs-glob so the data paths are
  configurable (for running on monster3).
- clip the normalised CCS / ion-mobility targets to [0,1] (codex: values
  beyond the fixed bounds would otherwise leave the [0,1] range).
- pretrain: peptide-level stage hold-out (a peptide no longer crosses
  train/val within a stage) instead of a random row split.
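
The target guard, fixed-bound normalisation and clipping described above can be sketched as follows. The bounds and the helper name `normalise_target` are assumptions for illustration; the real loaders define their own constants.

```python
import math

# Hypothetical fixed bounds, one pair per unit.
CCS_BOUNDS = (200.0, 1200.0)   # assumed CCS range
INV_K0_BOUNDS = (0.5, 2.0)     # assumed 1/K0 range


def normalise_target(value, bounds):
    """Map a raw target to [0, 1]; return None for unusable values."""
    lo, hi = bounds
    if not math.isfinite(value) or value <= 0.0:
        return None  # guard: NaN, inf and non-positive targets are dropped
    scaled = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))  # clip values beyond the fixed bounds
```

Because both loaders go through the same mapping, the head sees one target scale in pretraining and fine-tuning without any CCS to 1/K0 conversion.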

Final codex review: build verified sound for capped runs. Remaining note
-- streaming/sharded stage loading for an uncapped full-scale run -- is a
follow-up; a finite --cap avoids it.
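
One way to get the peptide-level hold-out mentioned above is to derive the split from a hash of the peptide string, so every row of a peptide lands on the same side. This is a sketch under that assumption; the package may implement the split differently, and `split_of` is a hypothetical helper.

```python
import hashlib


def split_of(peptide, val_fraction=0.1):
    """Deterministically assign a peptide (not a row) to train or val."""
    digest = hashlib.sha256(peptide.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "val" if bucket < val_fraction * 10_000 else "train"


# Two charge states of the same peptide always share a split.
rows = [("PEPTIDEK", 2), ("PEPTIDEK", 3), ("ELVISK", 2)]
assignment = {(peptide, charge): split_of(peptide) for peptide, charge in rows}
```

A hash-based split also stays stable when rows are added or reordered, which a random row split does not.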
…sion

The Chronologer RT loader and the Sage loader converted [+mass] peptide
notation via sagepy_rescore._parse_sage_peptide, which imports
sagepy_rescore.pipeline -- a heavy module that pulls the whole
search/fdr stack and is version-fragile (its sagepy.core.fdr imports
break against older sagepy installs, e.g. monster3's sagepy 0.4.4).

Vendor just the conversion into data/mass_to_unimod.py (_BRACKET_MOD,
_mass_to_unimod + tables, parse_delta_mass_peptide) -- copied verbatim
from sagepy_rescore; it needs only sagepy.core.unimod and
sagepy_connector.py_unimod. Verified byte-identical to
sagepy_rescore._parse_sage_peptide. Drops the sagepy_rescore dependency.
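
To make the conversion concrete, here is a self-contained toy version of a [+mass] to UNIMOD rewrite. It is not the vendored code: the four-entry mass table, the 0.01 Da tolerance and the function names are assumptions, and the real module resolves masses through sagepy's UNIMOD tables.

```python
import re

# Matches a signed delta mass in brackets, e.g. [+57.0215].
_DELTA_MASS = re.compile(r"\[([+-]\d+(?:\.\d+)?)\]")

# Tiny illustrative table: UNIMOD id -> monoisotopic mass delta.
KNOWN_MODS = {1: 42.010565, 4: 57.021464, 21: 79.966331, 35: 15.994915}


def mass_to_unimod_id(delta, tol=0.01):
    """Return the UNIMOD id whose mass is closest to delta, within tol."""
    best = min(KNOWN_MODS, key=lambda uid: abs(KNOWN_MODS[uid] - delta))
    if abs(KNOWN_MODS[best] - delta) > tol:
        raise ValueError(f"no UNIMOD match for delta mass {delta}")
    return best


def to_unimod_notation(peptide):
    """Rewrite every [+mass] annotation as [UNIMOD:id]."""
    return _DELTA_MASS.sub(
        lambda m: f"[UNIMOD:{mass_to_unimod_id(float(m.group(1)))}]", peptide
    )
```

For example, `to_unimod_notation("AC[+57.0215]DM[+15.9949]K")` yields `AC[UNIMOD:4]DM[UNIMOD:35]K` with this table.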
…ents/modes

The instrument and acquisition-mode embeddings were sized exactly to the
current vocabularies (9 / 7). Adding a new instrument later would change
the embedding shape and break load_state_dict on every checkpoint.

Size the embeddings to a fixed capacity instead -- 32 instrument slots
(9 used), 16 acquisition-mode slots (7 used). A new entry appended to
INSTRUMENTS / ACQUISITION_MODES takes the next free id with the embedding
shape unchanged: existing checkpoints stay loadable and the new id starts
from an untrained row, ready to fine-tune. __post_init__ asserts the
vocabularies stay within capacity.
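
A small sketch of the fixed-capacity idea, assuming a dataclass config. The class name, field names and the `timstof_pro` / `new_instrument` entries are hypothetical; the 32 / 16 capacities are the ones stated above.

```python
from dataclasses import dataclass


@dataclass
class ConditioningConfig:
    instruments: tuple = ("unknown", "orbitrap", "timstof_pro")
    acquisition_modes: tuple = ("unknown", "PASEF", "single-cell")
    instrument_capacity: int = 32
    acquisition_capacity: int = 16

    def __post_init__(self):
        # The vocabularies may grow, but never past the reserved capacity.
        assert len(self.instruments) <= self.instrument_capacity
        assert len(self.acquisition_modes) <= self.acquisition_capacity

    def embedding_shapes(self, dim):
        """Embedding shapes depend on capacity, not on vocabulary size."""
        return (self.instrument_capacity, dim), (self.acquisition_capacity, dim)


old = ConditioningConfig()
new = ConditioningConfig(instruments=old.instruments + ("new_instrument",))
```

Here `old.embedding_shapes(64)` and `new.embedding_shapes(64)` are identical, so a checkpoint saved under the old vocabulary would still match the tensor shapes of the new one.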
…/ 32 acq modes

Plenty of room to add new instrument models or acquisition modes later
without resizing the embedding (so existing checkpoints keep loading).
…of_catalog

The Sage loader passed `instrument=unknown` for every campaign sample,
which silently bypassed the encoder's metadata conditioning -- the very
mechanism the multi-instrument pretraining curriculum was built to
exercise. Wire per-dataset `instrument_model` and `acquisition_mode`
through a new `load_catalog_metadata(timstof_catalog.tsv)` -> per-accession
dispatch in `build_split_datasets`; expose `--catalog` on `train.py`.

Also extend ACQUISITION_MODES with `PASEF` and `single-cell` (the
catalog's dominant values that the previous enum did not cover).

All 25 currently-processed campaign datasets resolve to real (instrument,
acq) ids -- spanning Pro / Pro 2 / HT / SCP / fleX / plain timsTOF.
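
A minimal sketch of a catalog lookup of this kind, using only the standard library. The `instrument_model` and `acquisition_mode` columns come from the description above; the `accession` column name, the fallback to `unknown`, and both function names are assumptions, not the actual `load_catalog_metadata`.

```python
import csv
import io


def load_catalog_metadata_sketch(handle, accession_column="accession"):
    """Read a TSV catalog into {accession: (instrument_model, acquisition_mode)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row[accession_column]: (row["instrument_model"], row["acquisition_mode"])
        for row in reader
    }


def metadata_for(accession, catalog):
    """Per-accession dispatch; datasets missing from the catalog stay unknown."""
    return catalog.get(accession, ("unknown", "unknown"))


# Toy catalog with a made-up accession.
example = io.StringIO(
    "accession\tinstrument_model\tacquisition_mode\n"
    "PXD_EXAMPLE\ttimsTOF Pro\tPASEF\n"
)
catalog = load_catalog_metadata_sketch(example)
```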
… for Sage examples

The fine-tune intensity SA capped at 0.66 because every campaign example was
fed CE=0 -- squarely out-of-distribution for the head's CE FiLM, which was
pretrained around the timsTOF-ms2 median normalised CE (~0.26). Default the
Sage loader's CE to that in-distribution value; expose --default-ce on
train.py so we can sweep it later.
…4 convention)

The intensity loss / eval masked target < 0 (impossible-channel only), which
treated target == 0 as a real no-peak label. That is correct for densely
annotated PROSPECT targets but wrong for Sage matched_fragments, where 0 means
unmatched / unknown, not real zero. v4's finetune_dia_pasef_intensity.py
canonical-Prosit loss masks target > 0; this aligns the package to that.

This explains a sizable chunk of the campaign fine-tune intensity-SA gap:
re-evaluating the existing best.pt with the v4 metric already lifts SA from
0.673 to 0.733 with no retraining. The remaining ~0.10 gap to v4's 0.83 will
need a retrain with the corrected training loss.
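
The effect of the mask change can be seen on a toy vector. The sketch below is plain Python, not the package's `masked_spectral_angle`; the `mask_mode` labels are invented here, and the spectral angle is taken as 1 - 2*arccos(cos)/pi.

```python
import math


def spectral_angle(pred, target, mask_mode="positive"):
    """Spectral angle over the kept positions; None if nothing usable remains."""
    if mask_mode == "positive":        # keep target > 0 (the v4 convention)
        keep = [t > 0 for t in target]
    elif mask_mode == "nonnegative":   # keep target >= 0 (the previous behaviour)
        keep = [t >= 0 for t in target]
    else:
        raise ValueError(f"unknown mask_mode: {mask_mode}")
    p = [x for x, k in zip(pred, keep) if k]
    t = [x for x, k in zip(target, keep) if k]
    p_norm = math.sqrt(sum(x * x for x in p))
    t_norm = math.sqrt(sum(x * x for x in t))
    if p_norm == 0.0 or t_norm == 0.0:
        return None  # degenerate: no observed peak to compare against
    cos = sum(a * b for a, b in zip(p, t)) / (p_norm * t_norm)
    cos = max(-1.0, min(1.0, cos))
    return 1.0 - 2.0 * math.acos(cos) / math.pi


# One matched fragment, two unmatched (0), one impossible channel (-1).
target = [0.8, 0.0, 0.0, -1.0]
pred = [0.8, 0.3, 0.2, 0.1]
sa_old = spectral_angle(pred, target, "nonnegative")
sa_new = spectral_angle(pred, target, "positive")
```

Under the old mask the prediction is penalised for putting intensity on fragments that Sage merely did not match; under the new mask those positions are ignored, so `sa_new` is higher than `sa_old` for the same prediction.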
…tion

Two fine-tune-time levers explored on the SMALL preset campaign data (a pooled intensity head and per-PSM CE calibration), plus the supporting loss and train.py changes.

* Pooled (v4-style) intensity head: attention-pool residue latents, concat with
  charge/CE/instrument side-embeddings, MLP to Prosit-174 grid. Sits behind
  cfg.intensity_head = 'site' (default, local FiLM) or 'pooled' (new). Plumbed
  through train.py and pretrain.py via --intensity-head.

* Per-PSM CE calibration: new calibrate_ce.py runs an inference-time CE grid
  sweep against a frozen checkpoint and writes per-PSM optimal CEs to parquet
  keyed by (accession, psm_id). train.py and evaluate_optimal_nce.py accept
  --ce-calibration to use those calibrated CEs instead of the flat default;
  added load_ce_calibration to sage_dataset.

* Loss / metric: masked_spectral_angle now slices pred to the target's site
  dimension when pred is longer (pooled head outputs fixed (B,29,n_ion)).

* train.py: --tasks and --freeze-encoder for specialization sweeps; the eval
  helper also takes a tasks list. evaluate_split short-circuits skipped tasks.
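
The per-PSM CE sweep can be sketched as an argmax over a CE grid. Everything named here is hypothetical: `calibrate_ce_sketch`, the toy `score_fn` (which stands in for scoring a frozen checkpoint's prediction against the observed spectrum), and the grid itself.

```python
def calibrate_ce_sketch(psms, score_fn, ce_grid):
    """Return {(accession, psm_id): CE from the grid with the highest score}."""
    best = {}
    for psm in psms:
        key = (psm["accession"], psm["psm_id"])
        best[key] = max(ce_grid, key=lambda ce: score_fn(psm, ce))
    return best


# Normalised CE grid from 0.10 to 0.50 in steps of 0.02 (assumed range).
CE_GRID = [round(0.10 + 0.02 * i, 2) for i in range(21)]


def toy_score(psm, ce):
    """Stand-in score that peaks at a hidden per-PSM optimum."""
    return -abs(ce - psm["hidden_optimum"])


psms = [
    {"accession": "A", "psm_id": 1, "hidden_optimum": 0.305},
    {"accession": "A", "psm_id": 2, "hidden_optimum": 0.42},
]
calibrated = calibrate_ce_sketch(psms, toy_score, CE_GRID)
```

The resulting mapping has the same (accession, psm_id) key as the calibration parquet described above, so it could be written out and joined back onto the training examples.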

Campaign results on the 25 timsTOF datasets:

  config                                         test SA
  ---------------------------------------------  -------
  canonical baseline                             0.7597
  canonical + inference-time optimal-CE          0.7945
  round-1 calibrated fine-tune                   0.7867
  round-1 + inference-time optimal-CE            0.8058
  round-2 calibrated fine-tune                   0.8090
  round-2 + inference-time optimal-CE            0.8170
  pooled head + round-2 calibration              0.7510  (random pooled-head init, encoder warm-started; head-init disadvantage)
…oint metadata

The real bug: the arccos clamp_eps was 1e-8, which let arccos's near-boundary
derivative blow up to ~7000 on sparse intensity targets (Sage matched_fragments
with 1 positive position normalize to one-hot vectors whose cos=1 exactly).
Even with clamp absorbing the boundary in forward, the gradient chain through
F.normalize on tiny-magnitude masked vectors compounded into NaN gradients.

This was the silent failure mode underneath three earlier negative results:
  * RESEARCH preset pretraining at lr=3e-4 NaN'd intensity stages from epoch 1
  * RESEARCH preset pretraining at lr=1e-4 NaN'd intensity (the cause, not lr)
  * Pooled intensity head pretraining NaN'd intensity (same root cause)

Decoupled the two eps in masked_spectral_angle:
  - norm_eps (1e-8): F.normalize denominator floor — accurate for normal mags
  - clamp_eps (1e-4): arccos boundary buffer — derivative bounded at ~70 instead of ~7000
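
The ~7000 and ~70 figures follow from the derivative of arccos: |d/dx arccos(x)| = 1/sqrt(1 - x^2), and at the clamp boundary x = 1 - eps this equals 1/sqrt(eps * (2 - eps)). A quick double-precision check (the helper name is made up for this note):

```python
import math


def arccos_grad_at_clamp(clamp_eps):
    """|d/dx arccos(x)| evaluated at x = 1 - clamp_eps.

    Uses 1 - x**2 = clamp_eps * (2 - clamp_eps) to avoid cancellation.
    """
    return 1.0 / math.sqrt(clamp_eps * (2.0 - clamp_eps))


old_bound = arccos_grad_at_clamp(1e-8)  # roughly 7.07e3
new_bound = arccos_grad_at_clamp(1e-4)  # roughly 7.07e1
```

The bound scales as 1/sqrt(2 * eps), so raising clamp_eps by four orders of magnitude lowers the worst-case gradient by two.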

The fix unblocks scaling experiments. Empirically, with the fix in place:
  * RESEARCH preset now pretrains cleanly through all 5 stages, with each
    stage reaching higher val SA than SMALL (timsTOF 0.8574 vs 0.8484).
  * Pooled-head pretraining likewise runs to completion.

Recipe / infra additions:
  * pretrain.py: --intensity-head flag; per-stage eval restricted to current
    stage's task (so the pooled head's fixed (B,29,n_ion) output doesn't shape-
    clash with Chronologer's long-peptide intensity placeholders during RT eval)
  * train.py + pretrain.py: save intensity_head in checkpoint metadata
  * calibrate_ce.py + evaluate_optimal_nce.py: auto-detect intensity_head from
    checkpoint (back-compat for legacy checkpoints inferring from state-dict keys)
  * evaluate_optimal_nce.py: --ce-calibration to use the calibration parquet
    as the 'fixed' baseline (matches what a fine-tune that was trained with
    the same calibration actually saw)
  * encoder.py: disable nested-tensor fast path on TransformerEncoder — its
    SDPA backward had a real NaN risk on certain shape combinations

Campaign fine-tune results, ordered by test SA on the 7,724-PSM hold-out:

  config                                         test SA
  ---------------------------------------------  -------
  canonical baseline (flat CE 0.26)              0.7597
  + inference-time optimal-CE search             0.7945
  round-1 per-PSM CE calibration                 0.7867
  round-1 calibration + inference search         0.8058
  round-2 calibration                            0.8090
  round-2 calibration + inference search         0.8170
  per-(accession, charge) CE labels              0.7616  ABLATION
  pooled head, pretrained + r1 calibration       0.7673  ABLATION
  RESEARCH preset + r1 calibration               0.7868  ABLATION
  RESEARCH preset + r2 calibration               0.8073  ABLATION
  RESEARCH preset + r2 + inference search        0.8137  ABLATION

The +0.057 SA lift from baseline came from iterative per-PSM CE calibration +
inference-time CE search alone. Pooled-head topology, RESEARCH capacity, and
per-(accession, charge) CE labels each underperformed the SMALL site-head
per-PSM recipe: the per-PSM CE specificity is doing real work, not noise.
…apes

Add per-peptide peak-shape supervision so the predictor can configure timsim's
elution/mobility peak widths (replacing its sampled/fixed defaults):

- RTSigmaHead: new head predicting the gradient-normalised RT EMG sigma from the
  encoder global token (apex stays Chronologer; lambda left to timsim sampling,
  as it is unidentifiable at timsTOF MS1 frame sampling).
- IM sigma: supervise the existing IonMobilityHead (mean, std) std output with an
  L1 term vs the measured Gaussian mobilogram width -- no model change.
- MultiTaskLoss: rt_sigma + im_sigma terms, both masked to peptides carrying a
  valid label and skipped when absent, so existing shape-free training is
  unchanged. Weights 0.5 each.
- tests/test_shape_heads.py covers the new head output, masked loss terms, and
  the backward-compatible no-target path.
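
A plain-Python sketch of the masked, additive shape terms: each term averages an L1 error over labelled peptides only and is skipped when no peptide in the batch carries a label. The function names are hypothetical and the real MultiTaskLoss operates on tensors; the 0.5 weight is the one stated above.

```python
def masked_l1(pred, target, mask):
    """Mean absolute error over masked-in entries; None if nothing is labelled."""
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        return None
    return sum(errors) / len(errors)


def total_loss(base_loss, shape_terms, weight=0.5):
    """Add each available shape term with the given weight; skip absent ones."""
    total = base_loss
    for term in shape_terms:
        if term is not None:
            total += weight * term
    return total


# Only the first peptide carries an rt_sigma label; none carries im_sigma.
rt_sigma_term = masked_l1([1.0, 2.0], [1.5, 0.0], [True, False])
im_sigma_term = masked_l1([0.1, 0.2], [0.0, 0.0], [False, False])
loss = total_loss(1.0, [rt_sigma_term, im_sigma_term])
```

When both terms are absent the total equals the base loss, which is the backward-compatible no-target path the tests cover.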
Validate the published HF timsTOF corpus by training on it directly.

- data/hf_corpus_dataset.py: streams Tier-1 + Tier-3 parquets (pyarrow.dataset
  pushdown; the 391M-row Tier-3 can't be materialised) into the same example
  dicts the Sage path emits. Real per-PSM CE (volts/100); prefers Sage
  aligned_rt via --hf-rt-lookup, else per-accession-normalised rt_seconds.
- train.py: --hf-corpus / --hf-max-datasets / --hf-rt-lookup.

Validation (15 datasets, cap 4k, initialised from cap0-pretrained.pt): intensity
SA 0.734, ccs_mae 0.0226, RT Pearson 0.880 (aligned), charge 0.804. These are in
the neighborhood of the 56-dataset baseline; the gap is attributed to data volume.