perf(mscore): cut per-peak allocations in annotated frame building (A + D)#428
Merged
Conversation
Two low-risk wins for the annotated frame builder (slow + RAM-heavy on large
sims), with no change to the spectrum or the structured labels consumers read.
A) Stop eagerly building a per-peak `description` String. The annotated
generators formatted "{kind}_{ordinal}_{isotope}" + allocated a String for
every fragment peak in the whole simulation; this dominated construction
cost and RAM. No consumer reads `description` (charge/peptide_id/isotope_peak
carry the labels), and the field/PyO3 surface is unchanged — it just returns
None now. (b/y kind + ordinal can be promoted to structured fields later if
the identity is ever needed.)
D) MzSpectrumAnnotated::new now sorts by moving the annotations instead of
cloning every PeakAnnotation twice. sort_by stays stable, so equal-mz peak
order (and thus downstream merge/first-contribution semantics) is unchanged.
Tests: add mscore unit/parity tests asserting (A) the spectrum + integer labels
are preserved while description is None, and (D) new() sorts by mz keeping each
annotation attached to its peak, stably for equal mz.
…peak string
Follow-up to the previous commit, addressing the codex-review P2: dropping the
per-peak description String also dropped the b/y fragment identity, making the
peak-level ground truth incomplete. Fix by storing the identity structurally
instead of as a formatted string — strictly better on every axis.
- SignalAttributes: replace `description: Option<String>` with
`fragment_kind: Option<FragmentType>` + `fragment_ordinal: Option<i32>`
(small Copy fields, no heap alloc), and add a `description()` method that
derives "{kind}_{ordinal}_{isotope}" on demand — identical to the old string.
- Annotated generators set the structured fields (the data was already in hand);
precursor-ion peaks carry None (no fragment identity), as before.
- PyO3/Python: expose fragment_kind + fragment_ordinal getters and keep a derived
`description` getter; constructor now takes (kind, ordinal) instead of a string.
Net vs the original: no per-peak String/format!, ~6B inline instead of 24B +
heap, and the fragment identity is preserved in a queryable structured form.
Tests updated to assert the structured fields + that description() reproduces
the old format exactly. Validated end-to-end on the 125K HeLa blueprint.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First, low-risk step of the annotated-frame-builder optimization (the builder is slow + RAM-heavy on large sims). Two changes in
mscore, no change to the spectrum or to the labels consumers read.D — drop the double-clone in
MzSpectrumAnnotated::newSort by moving the annotations instead of cloning every
PeakAnnotationtwice.sort_bystays stable, so equal-mz ordering (and downstream first-contribution / merge semantics) is unchanged.A → A2-lean — stop paying for the per-peak
descriptionString, without losing informationThe annotated generators used to
format!("{kind}_{ordinal}_{isotope}")+ allocate aStringfor every fragment peak in the whole simulation — this dominated construction time and RAM. Initially I dropped it (description = None), but codex-review flagged (P2) that this also dropped the b/y fragment identity, making the peak-level ground truth incomplete. Fixed properly by storing the identity structurally:SignalAttributes:description: Option<String>→fragment_kind: Option<FragmentType>+fragment_ordinal: Option<i32>(smallCopyfields, no heap alloc).description()is now derived on demand and reproduces the old"{kind}_{ordinal}_{isotope}"string exactly.fragment_kind+fragment_ordinalgetters, keep a deriveddescriptiongetter; thePySignalAttributesconstructor now takes(kind, ordinal)instead of a string.Net effect vs the original
format!Option<String>24 B + heapStrictly better on every axis, and the fragment ground truth is now typed/queryable instead of a string to parse.
Tests / validation
mscore: new unit tests assert (A2) the spectrum + integer labels are preserved,fragment_kind/fragment_ordinalpopulate for fragment peaks (None for precursor peaks), anddescription()reproduces the old format; (D)new()sorts by mz keeping each annotation attached to its peak, stably for equal mz. Full suite: 53 + new green.rustdf+imspy_connectorbuild/type-check clean.Behavior change
SignalAttributes.descriptionis now derived (same value) rather than a stored field; thePySignalAttributes/PythonSignalAttributesconstructor takesfragment_kind+fragment_ordinalinstead ofdescription.Not in this PR
SmallVecfor the contribution vector) and C (lazy/per-batch annotation to lift the RAM ceiling).