Skip to content

perf(mscore): cut per-peak allocations in annotated frame building (A + D)#428

Merged
theGreatHerrLebert merged 2 commits into
mainfrom
perf/annotated-builder-a-d
Jun 26, 2026
Merged

perf(mscore): cut per-peak allocations in annotated frame building (A + D)#428
theGreatHerrLebert merged 2 commits into
mainfrom
perf/annotated-builder-a-d

Conversation

@theGreatHerrLebert

Copy link
Copy Markdown
Owner

Summary

First, low-risk step of the annotated-frame-builder optimization (the builder is slow + RAM-heavy on large sims). Two changes in mscore, no change to the spectrum or to the labels consumers read.

D — drop the double-clone in MzSpectrumAnnotated::new
Sort by moving the annotations instead of cloning every PeakAnnotation twice. sort_by stays stable, so equal-mz ordering (and downstream first-contribution / merge semantics) is unchanged.

A → A2-lean — stop paying for the per-peak description String, without losing information
The annotated generators used to format!("{kind}_{ordinal}_{isotope}") + allocate a String for every fragment peak in the whole simulation — this dominated construction time and RAM. Initially I dropped it (description = None), but codex-review flagged (P2) that this also dropped the b/y fragment identity, making the peak-level ground truth incomplete. Fixed properly by storing the identity structurally:

  • SignalAttributes: description: Option<String>fragment_kind: Option<FragmentType> + fragment_ordinal: Option<i32> (small Copy fields, no heap alloc).
  • description() is now derived on demand and reproduces the old "{kind}_{ordinal}_{isotope}" string exactly.
  • PyO3 / Python: expose fragment_kind + fragment_ordinal getters, keep a derived description getter; the PySignalAttributes constructor now takes (kind, ordinal) instead of a string.

Net effect vs the original

per-peak heap alloc per-peak format! annotation size b/y identity
original 1 (String) yes Option<String> 24 B + heap ✓ (string)
this PR 0 no a few bytes inline structured + queryable

Strictly better on every axis, and the fragment ground truth is now typed/queryable instead of a string to parse.

Tests / validation

  • mscore: new unit tests assert (A2) the spectrum + integer labels are preserved, fragment_kind/fragment_ordinal populate for fragment peaks (None for precursor peaks), and description() reproduces the old format; (D) new() sorts by mz keeping each annotation attached to its peak, stably for equal mz. Full suite: 53 + new green.
  • rustdf + imspy_connector build/type-check clean.
  • Python end-to-end on the 125K HeLa blueprint: annotated frame builds, dense label windows work, new getters return correctly.

Behavior change

SignalAttributes.description is now derived (same value) rather than a stored field; the PySignalAttributes/Python SignalAttributes constructor takes fragment_kind + fragment_ordinal instead of description.

Not in this PR

  • Formal before/after perf numbers (load time + peak RSS) — to be measured next.
  • The heavier follow-ups from the plan: B (SmallVec for the contribution vector) and C (lazy/per-batch annotation to lift the RAM ceiling).

Two low-risk wins for the annotated frame builder (slow + RAM-heavy on large
sims), with no change to the spectrum or the structured labels consumers read.

A) Stop eagerly building a per-peak `description` String. The annotated
   generators formatted "{kind}_{ordinal}_{isotope}" + allocated a String for
   every fragment peak in the whole simulation; this dominated construction
   cost and RAM. No consumer reads `description` (charge/peptide_id/isotope_peak
   carry the labels), and the field/PyO3 surface is unchanged — it just returns
   None now. (b/y kind + ordinal can be promoted to structured fields later if
   the identity is ever needed.)

D) MzSpectrumAnnotated::new now sorts by moving the annotations instead of
   cloning every PeakAnnotation twice. sort_by stays stable, so equal-mz peak
   order (and thus downstream merge/first-contribution semantics) is unchanged.

Tests: add mscore unit/parity tests asserting (A) the spectrum + integer labels
are preserved while description is None, and (D) new() sorts by mz keeping each
annotation attached to its peak, stably for equal mz.
…peak string

Follow-up to the previous commit, addressing the codex-review P2: dropping the
per-peak description String also dropped the b/y fragment identity, making the
peak-level ground truth incomplete. Fix by storing the identity structurally
instead of as a formatted string — strictly better on every axis.

- SignalAttributes: replace `description: Option<String>` with
  `fragment_kind: Option<FragmentType>` + `fragment_ordinal: Option<i32>`
  (small Copy fields, no heap alloc), and add a `description()` method that
  derives "{kind}_{ordinal}_{isotope}" on demand — identical to the old string.
- Annotated generators set the structured fields (the data was already in hand);
  precursor-ion peaks carry None (no fragment identity), as before.
- PyO3/Python: expose fragment_kind + fragment_ordinal getters and keep a derived
  `description` getter; constructor now takes (kind, ordinal) instead of a string.

Net vs the original: no per-peak String/format!, ~6B inline instead of 24B +
heap, and the fragment identity is preserved in a queryable structured form.
Tests updated to assert the structured fields + that description() reproduces
the old format exactly. Validated end-to-end on the 125K HeLa blueprint.
@theGreatHerrLebert theGreatHerrLebert merged commit b1d5c76 into main Jun 26, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant