Add SIRIUS annotations and mz_rt fallback to MZMine converter (three-tier ProteinName)#138

Merged
tonywu1999 merged 4 commits into devel from MSstatsConvert/work/20260603_mzmine_sirius_annotations on Jun 10, 2026
Conversation

@swaraj-neu

@swaraj-neu swaraj-neu commented Jun 4, 2026

Contributor
  • Add SIRIUS annotation support and mz_rt fallback to MZMine converter
  • Retain features lacking MZMine compound names, assign ProteinName via MZMine/SIRIUS/mz_rt tiers, restore required mz/RT metadata, and add validation, tests, docs, and tier-level logging.

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Motivation and Context

The PR adds support for SIRIUS structural annotations and implements a fallback mechanism for the MZMine converter in MSstatsConvert. Previously, the converter would drop features lacking MZMine spectral-library annotations (MSI Level 2). This enhancement introduces a three-tier ProteinName assignment strategy: (1) MZMine compound names (highest priority), (2) SIRIUS structure identifications when available (MSI Level 3, in-silico structure prediction), and (3) a synthesized m/z and retention time identifier as a final fallback. This approach retains all features while maintaining annotation precedence and improving metabolomics data coverage.
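
The precedence described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code merged in R/clean_MZMine.R (which, per the PR, uses data.table joins): the helper name assign_protein_name, the toy m/z value for feature 1, and the SIRIUS names "Theobromine" and "Ghost" rows are hypothetical inputs chosen to mirror the scenarios in the tests.

```r
# Hypothetical toy inputs; column names follow the PR description.
features <- data.frame(
  rowID = c("1", "4", "5"),
  rowmz = c(195.0877, 489.334, 555.447),
  rowretentiontime = c(3.2, 7.89, 9.1),
  stringsAsFactors = FALSE
)
mzmine_annotations <- data.frame(
  id = "1", compound_name = "Caffeine", score = 0.95,
  stringsAsFactors = FALSE
)
sirius_annotations <- data.frame(
  mappingFeatureId = c("1", "4", "5", "99"),
  name = c("Theobromine", "Caffeic acid", "", "Ghost"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of MSstatsConvert).
assign_protein_name <- function(features, mzmine, sirius = NULL) {
  ids <- as.character(features$rowID)

  # 1. MZMine compound name; sort by score so match() sees the best row first.
  mzmine <- mzmine[order(-mzmine$score), , drop = FALSE]
  protein <- mzmine$compound_name[match(ids, as.character(mzmine$id))]
  n_mzmine <- sum(!is.na(protein))

  # 2. SIRIUS name, non-empty names only, and only where still NA.
  n_sirius <- 0L
  if (!is.null(sirius)) {
    keep <- !is.na(sirius$name) & nzchar(sirius$name)
    sirius <- sirius[keep, , drop = FALSE]
    fill <- is.na(protein)
    hit <- match(ids[fill], as.character(sirius$mappingFeatureId))
    protein[fill] <- sirius$name[hit]
    n_sirius <- sum(!is.na(protein)) - n_mzmine
  }

  # 3. Synthesized m/z and retention-time identifier for whatever is left.
  fallback <- is.na(protein)
  protein[fallback] <- paste(round(features$rowmz[fallback], 4),
                             round(features$rowretentiontime[fallback], 2),
                             sep = "_")

  message(sprintf("MZMine compound: %d; SIRIUS name: %d; m/z-RT fallback: %d",
                  n_mzmine, n_sirius, sum(fallback)))
  protein
}

result <- assign_protein_name(features, mzmine_annotations, sirius_annotations)
baseline <- assign_protein_name(features, mzmine_annotations)
print(result)
print(baseline)
```

Because every input feature receives some name, the output has one entry per feature, and SIRIUS rows that match no feature (mappingFeatureId 99 here) never appear in it.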

Detailed Changes

Implementation Changes

  • R/clean_MZMine.R (.cleanRawMZMine function)

    • Added optional sirius_annotations parameter (default NULL)
    • Changed ProteinName assignment from a single inner join (which discarded unmatched features) to a three-tier left-join strategy:
      • Tier 1: Left join MZMine compound_name, matching rowID to id (features with MZMine annotations receive the highest-scoring compound name)
      • Tier 2: If sirius_annotations is provided, fill remaining NA values by matching mappingFeatureId to rowID and taking the SIRIUS name (non-empty names only)
      • Tier 3: For any remaining NA features, synthesize ProteinName from m/z and retention time (format: mz_rt, e.g., 489.334_7.89, constructed as round(mz, 4), an underscore, then round(rt, 2))
    • Added validation for required metadata columns: rowID, rowmz, rowretentiontime
    • Added per-tier logging to report counts of features assigned at each tier (MZMine, SIRIUS, m/z-RT fallback) with messages like "** MZMine ProteinName assignment: MZMine compound: N feature(s); SIRIUS name: N feature(s); m/z-RT fallback: N feature(s)."
    • Updated roxygen documentation to describe new parameter and three-tier assignment strategy
    • Lines changed: +54/-16
  • R/converters_MZMinetoMSstatsFormat.R (public wrapper function)

    • Added optional sirius_annotations parameter (default NULL) to function signature
    • Added validation to check that sirius_annotations (when non-NULL) contains required columns mappingFeatureId and name; stops with error message if missing required columns
    • Added early validation to ensure mzmine_annotations is provided (not NULL or missing); raises error with instruction message
    • Modified preprocessing call to pass sirius_annotations into MSstatsConvert::MSstatsClean (was previously called without it)
    • Changed remove_single_feature_proteins behavior: now always set to FALSE rather than driven by removeProtein_with1Feature flag
    • Extended roxygen examples to demonstrate reading SIRIUS structure_identifications.tsv and calling the converter with sirius_annotations
    • Updated \details section to describe the three-tier strategy, MSI level assignments, and explicit clarification that features are retained via m/z-RT fallback rather than filtered out
    • Included discussion of trade-offs: retaining all features improves normalization stability but increases the number of hypotheses tested; recommends that confirmatory users filter to MZMine-annotated features, while discovery users benefit from the additional annotation sources
    • Lines changed: +60/-15
  • Documentation Updates

    • man/MSstatsClean.Rd: Updated S4 method signature to accept sirius_annotations = NULL; clarified that SIRIUS name populates ProteinName for features without MZMine compound names; notes schema validated against SIRIUS 6 output (lines: +13/-6)
    • man/MZMinetoMSstatsFormat.Rd: Documented new sirius_annotations parameter; rewrote \details section to describe three-tier assignment with MSI level context (Level 2 for MZMine, Level 3 for SIRIUS); updated examples with SIRIUS demonstration; clarified features are retained rather than filtered; explained m/z-RT fallback format (lines: +64/-15)
    • man/dot-cleanRawMZMine.Rd: Documented sirius_annotations parameter; specified only mappingFeatureId and name columns are read; noted score/confidence columns are ignored; clarified schema validation against SIRIUS 6 with guidance for other versions; removed prior statement that features without matching annotations are dropped (lines: +13/-6)
    • vignettes/msstats_data_format.Rmd: Updated vignette section to explain three-tier strategy with precedence, SIRIUS schema expectations (mappingFeatureId matches to MZMine rowID), and replaced prior behavior (drop features; no fallback) with new behavior (retain via m/z-RT fallback); extended worked examples to show both baseline conversion and SIRIUS-enriched conversion followed by inspection of unique PeptideSequence/ProteinName pairs (lines: +52/-12)
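
The input checks listed for the public wrapper could look roughly like the following base R sketch. The function names validate_mzmine_annotations and validate_sirius_annotations and the exact message wording are assumptions made for illustration; only the required column names (mappingFeatureId, name) and the two error conditions come from the description above.

```r
# Illustrative validators (hypothetical names, not MSstatsConvert exports).
validate_mzmine_annotations <- function(mzmine_annotations) {
  # Treat both an omitted argument and an explicit NULL as an error.
  if (missing(mzmine_annotations) || is.null(mzmine_annotations)) {
    stop("mzmine_annotations is required: supply the MZMine ",
         "spectral-library annotation table.")
  }
  invisible(TRUE)
}

validate_sirius_annotations <- function(sirius_annotations = NULL) {
  # SIRIUS input is optional, so NULL passes straight through.
  if (is.null(sirius_annotations)) {
    return(invisible(TRUE))
  }
  required <- c("mappingFeatureId", "name")
  missing_cols <- setdiff(required, colnames(sirius_annotations))
  if (length(missing_cols) > 0) {
    stop("sirius_annotations is missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# A SIRIUS table without the `name` column should be rejected.
bad_sirius <- data.frame(mappingFeatureId = c("1", "4"))
sirius_error <- tryCatch(
  validate_sirius_annotations(bad_sirius),
  error = function(e) conditionMessage(e)
)
mzmine_error <- tryCatch(
  validate_mzmine_annotations(NULL),
  error = function(e) conditionMessage(e)
)
print(sirius_error)
print(mzmine_error)
```

Failing early in the wrapper keeps a malformed table from surfacing later as an obscure join error inside the cleaning step.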

Unit Tests Added/Modified

  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (lines: +50/-22)
    • Baseline MZMine annotation test:

      • Verified all 6 features retained (24 rows = 6 features × 4 runs) instead of dropping unannotated features
      • Confirmed features 4 and 5 receive m/z-RT fallback ProteinName values (489.334_7.89, 555.447_9.1) when absent from mzmine_annotations
      • Verified correct handling of feature 2 (highest-scoring MZMine annotation wins among duplicates; feature 2 with two annotation rows correctly assigned "GlucoseHigh")
      • Tested standard column presence and structure (11 columns, correct column names)
      • Tested data type handling (IsotopeLabelType set to "Light" for metabolomics, charge/fragment columns NA, Fraction set to 1)
      • Tested zero-intensity input cells converted to NA in output
      • Verified annotation merging with run metadata (Condition, BioReplicate correctly joined)
      • Tested intensity values traced back correctly to input data
    • SIRIUS annotation test:

      • Tested precedence: MZMine compound_name (Caffeine) beats SIRIUS for feature 1 (confirmed both sources available but MZMine wins)
      • Verified SIRIUS name fill: feature 4 (no MZMine match) assigned SIRIUS name "Caffeic acid"
      • Tested m/z-RT fallback when SIRIUS provides empty/invalid name: feature 5 with only empty-name SIRIUS row correctly falls back to "555.447_9.1"
      • Verified irrelevant SIRIUS rows (mappingFeatureId=99) do not introduce spurious new features or protein names ("Ghost")
    • Error handling tests:

      • Confirmed mzmine_annotations = NULL raises "mzmine_annotations is required" error with guidance message
      • Confirmed omitting mzmine_annotations argument (missing) raises same error
      • Confirmed malformed sirius_annotations (missing required name column) triggers stop() with "missing required column" error message
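
The duplicate-annotation case in the baseline test (feature 2 with two annotation rows resolving to "GlucoseHigh") comes down to keeping the highest-scoring row per feature. A minimal base R sketch of that selection, assuming made-up scores and a hypothetical competing name "GlucoseLow" (the actual fixture values are not shown in this PR):

```r
# Hypothetical annotation table; only the `id`, `compound_name`, `score`
# column names and the "GlucoseHigh" outcome come from the PR text.
annotations <- data.frame(
  id = c("1", "2", "2", "3"),
  compound_name = c("Caffeine", "GlucoseLow", "GlucoseHigh", "Alanine"),
  score = c(0.90, 0.40, 0.85, 0.70),
  stringsAsFactors = FALSE
)

# Sort by feature id, then by descending score, and keep the first row
# seen for each id.
sorted <- annotations[order(annotations$id, -annotations$score), ]
best <- sorted[!duplicated(sorted$id), ]
print(best)
```

With these toy scores, feature 2 keeps "GlucoseHigh" because 0.85 outranks 0.40.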

Coding Guidelines Compliance

No explicit violations of coding guidelines were identified in the provided changes. The implementation demonstrates:

  • Consistent roxygen documentation practices with proper parameter documentation, details sections, and examples
  • Proper use of data.table idioms (left joins, setorder, unique, set for column deletion) consistent with codebase patterns
  • Appropriate input validation with informative error messages for missing or malformed inputs
  • Informative logging at multiple tiers to document feature assignment decisions via getOption("MSstatsLog") and getOption("MSstatsMsg")
  • Comprehensive updates to all related documentation (roxygen, man pages, vignettes) in coordination with code changes
  • Well-structured test coverage for normal operation, edge cases, and error conditions

@swaraj-neu swaraj-neu requested a review from tonywu1999 June 4, 2026 15:51
@swaraj-neu swaraj-neu self-assigned this Jun 4, 2026
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 00180c97-bf44-4f50-91ba-2f668a436274

📥 Commits

Reviewing files that changed from the base of the PR and between 129ff6e and 7ddee25.

⛔ Files ignored due to path filters (1)
  • inst/tinytest/raw_data/MZMine/structure_identifications.tsv is excluded by !**/*.tsv
📒 Files selected for processing (7)
  • R/clean_MZMine.R
  • R/converters_MZMinetoMSstatsFormat.R
  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
  • man/MSstatsClean.Rd
  • man/MZMinetoMSstatsFormat.Rd
  • man/dot-cleanRawMZMine.Rd
  • vignettes/msstats_data_format.Rmd
✅ Files skipped from review due to trivial changes (2)
  • vignettes/msstats_data_format.Rmd
  • man/dot-cleanRawMZMine.Rd
🚧 Files skipped from review as they are similar to previous changes (4)
  • man/MSstatsClean.Rd
  • R/clean_MZMine.R
  • man/MZMinetoMSstatsFormat.Rd
  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R

📝 Walkthrough

Implements three-tier ProteinName assignment: MZMine compound_name (tier 1), optional SIRIUS name via sirius_annotations (tier 2), and synthesized mz_rt fallback for remaining features (tier 3); updates .cleanRawMZMine, MZMinetoMSstatsFormat, tests, and documentation.

Changes

Three-tier ProteinName assignment with SIRIUS enrichment

Layer / File(s): Summary

  • Three-tier ProteinName assignment core implementation (R/clean_MZMine.R): .cleanRawMZMine adds optional sirius_annotations, validates row m/z and row retention time, left-joins MZMine compound_name, conditionally fills from SIRIUS (mappingFeatureId matched to rowID, taking name), and synthesizes mz_rt fallback with per-tier logging.
  • Public API wrapper and SIRIUS parameter integration (R/converters_MZMinetoMSstatsFormat.R): MZMinetoMSstatsFormat adds sirius_annotations = NULL, documents it, validates required SIRIUS columns (mappingFeatureId, name) when provided, forwards it to MSstatsClean, updates examples, and sets remove_single_feature_proteins = FALSE.
  • Test coverage for behavior changes and new feature (inst/tinytest/test_converters_MZMinetoMSstatsFormat.R): Tinytests updated to expect retention of all base features, verify mz_rt fallback naming, and add tests for MZMine-over-SIRIUS precedence, SIRIUS fill, invalid-name fallback, irrelevant mapping exclusion, and missing-column error handling.
  • Roxygen, Rd, and vignette documentation updates (man/MSstatsClean.Rd, man/MZMinetoMSstatsFormat.Rd, man/dot-cleanRawMZMine.Rd, vignettes/msstats_data_format.Rmd): Method/function signatures, argument docs, details, and examples updated to expose sirius_annotations, describe the three-tier assignment and retention policy, and show worked examples with SIRIUS enrichment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • tonywu1999
  • mstaniak

Poem

🐰 I found a name that once was lost,
Tiered hops through data at no cost.
MZMine first, then SIRIUS peeks,
If both are shy, mz_rt speaks.
The rabbit cheers: no feature lost!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning)
    Explanation: The pull request description is largely incomplete: it includes a summary statement but lacks detailed motivation, context, specific bullet-point changes, and testing descriptions required by the template.
    Resolution: Complete the description by adding detailed sections explaining the motivation, context, a comprehensive bullet-point list of changes, testing approach, and ensuring the checklist items are properly addressed.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The pull request title directly and specifically describes the main change: adding SIRIUS annotations support and mz_rt fallback functionality to the MZMine converter with the three-tier ProteinName assignment strategy.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
R/clean_MZMine.R (1)

79-85: 💤 Low value

SIRIUS deduplication is order-dependent when duplicates exist.

Unlike tier 1 which sorts by id, -score to pick the highest-scoring compound, the SIRIUS deduplication sorts only by mappingFeatureId. If sirius_annotations contains multiple rows for the same feature (e.g., multiple structure candidates), the chosen name depends on the input row order, which may vary between runs or SIRIUS versions.

While the documentation notes that scores are "ignored in this release", consider adding a deterministic tiebreaker (e.g., alphabetical by name) to ensure reproducible results:

🔧 Suggested stabilization
     sirius_dt[, mappingFeatureId := as.character(mappingFeatureId)]
-    data.table::setorder(sirius_dt, mappingFeatureId)
+    data.table::setorder(sirius_dt, mappingFeatureId, name)
     sirius_dt <- unique(sirius_dt, by = "mappingFeatureId")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/clean_MZMine.R` around lines 79 - 85, sirius deduplication is
nondeterministic because sirius_dt is only ordered by mappingFeatureId before
calling unique(), so when multiple rows share a mappingFeatureId the kept row
depends on input order; to fix, make the ordering deterministic by additionally
ordering by name (with NA pushed last) before deduplication: in the sirius_dt
pipeline (working with sirius_annotations, sirius_dt, mappingFeatureId, name)
create a stable sort key or replace NA names with a sentinel that sorts after
real names, then call setorder(sirius_dt, mappingFeatureId, name_sort_key) and
finally unique(sirius_dt, by = "mappingFeatureId") so the chosen name is
reproducible (alphabetical tiebreaker).
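
A base R sketch of the deterministic tiebreaker this comment asks for, assuming a toy SIRIUS table: the names "Ferulic acid" and "Quinic acid" and the helper dedup_sirius are hypothetical, and the PR itself works on a data.table with setorder and unique rather than on a data.frame as done here.

```r
# Hypothetical SIRIUS candidates: two rows per feature, one with an NA name.
sirius <- data.frame(
  mappingFeatureId = c(4, 4, 5, 5),
  name = c("Ferulic acid", "Caffeic acid", NA, "Quinic acid"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of MSstatsConvert).
dedup_sirius <- function(sirius) {
  sirius$mappingFeatureId <- as.character(sirius$mappingFeatureId)
  # Sort by feature id, then real names before NA, then alphabetically.
  # method = "radix" sorts strings in the C locale, so the result does not
  # depend on the session's collation settings.
  ord <- order(sirius$mappingFeatureId, is.na(sirius$name), sirius$name,
               method = "radix")
  sirius <- sirius[ord, , drop = FALSE]
  # Keep the first row per feature after the deterministic sort.
  sirius[!duplicated(sirius$mappingFeatureId), , drop = FALSE]
}

forward <- dedup_sirius(sirius)
reversed <- dedup_sirius(sirius[nrow(sirius):1, ])
print(forward)
```

Running the helper on the same rows in reverse order gives the same chosen names, which is the reproducibility property the review is after.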
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@vignettes/msstats_data_format.Rmd`:
- Around line 381-385: Update the note about removeProtein_with1Feature and
tier-3 mz_rt fallback IDs to avoid asserting they are always singletons: clarify
that mz_rt fallback ProteinName values are derived from rounded m/z and RT so
collisions can occur and some fallback IDs may group multiple features, and
advise users that using removeProtein_with1Feature = TRUE can therefore remove
grouped features unexpectedly; reference the terms mz_rt,
removeProtein_with1Feature, ProteinName, and "tier-3" in the revised sentence.

---

Nitpick comments:
In `@R/clean_MZMine.R`:
- Around line 79-85: sirius deduplication is nondeterministic because sirius_dt
is only ordered by mappingFeatureId before calling unique(), so when multiple
rows share a mappingFeatureId the kept row depends on input order; to fix, make
the ordering deterministic by additionally ordering by name (with NA pushed
last) before deduplication: in the sirius_dt pipeline (working with
sirius_annotations, sirius_dt, mappingFeatureId, name) create a stable sort key
or replace NA names with a sentinel that sorts after real names, then call
setorder(sirius_dt, mappingFeatureId, name_sort_key) and finally
unique(sirius_dt, by = "mappingFeatureId") so the chosen name is reproducible
(alphabetical tiebreaker).
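
The inline vignette comment about mz_rt fallback identifiers not always being singletons is easy to demonstrate. In this small sketch the two m/z and RT pairs are invented values; only the rounding rule (4 decimals for m/z, 2 for retention time) comes from the PR description.

```r
# Two distinct (hypothetical) features that differ only past the rounding
# precision used for the fallback identifier.
mz <- c(489.33401, 489.33404)
rt <- c(7.891, 7.893)

fallback_ids <- paste(round(mz, 4), round(rt, 2), sep = "_")
print(fallback_ids)

# Count how many features share each synthesized ProteinName.
collisions <- table(fallback_ids)
print(collisions)
```

Both features end up under one ProteinName, so a fallback ID can group more than one feature and should not be assumed to be a single-feature protein.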
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3e37b941-7631-498d-ad89-2f525f3cea15

📥 Commits

Reviewing files that changed from the base of the PR and between 5ce2a62 and 129ff6e.

⛔ Files ignored due to path filters (1)
  • inst/tinytest/raw_data/MZMine/structure_identifications.tsv is excluded by !**/*.tsv
📒 Files selected for processing (7)
  • R/clean_MZMine.R
  • R/converters_MZMinetoMSstatsFormat.R
  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
  • man/MSstatsClean.Rd
  • man/MZMinetoMSstatsFormat.Rd
  • man/dot-cleanRawMZMine.Rd
  • vignettes/msstats_data_format.Rmd

- Retain features lacking MZMine compound names, assign ProteinName via MZMine/SIRIUS/mz_rt tiers, restore required mz/RT metadata, and add validation, tests, docs, and tier-level logging.
Inherit sirius_annotations docs via @inheritParams, replace tier terminology with MSI levels and plain-language source descriptions, and remove the removeProtein_with1Feature parameter (hard-coded FALSE internally). Switch the SIRIUS fill to in-place data.table updates with a deterministic dedup tiebreaker.
@swaraj-neu swaraj-neu force-pushed the MSstatsConvert/work/20260603_mzmine_sirius_annotations branch from f40d381 to 66b378c Compare June 10, 2026 04:07
@swaraj-neu swaraj-neu requested a review from tonywu1999 June 10, 2026 04:56
Comment thread R/converters_MZMinetoMSstatsFormat.R Outdated
Comment on lines +15 to +19
 #' @param mzmine_annotations `data.frame` of MZMine spectral-library
 #' annotations with columns `id`, `compound_name`, `score`. Required:
-#' the highest-scoring `compound_name` per feature is used as
-#' `ProteinName`, and features in the quant table with no matching
-#' annotation row are dropped from the output.
+#' the highest-scoring `compound_name` per feature (MSI Level 2
+#' putative identification via MS/MS spectral matching) is used as
+#' `ProteinName`.

Contributor

I think you can remove the mzmine_annotations entry from the docs here, since you're now using @inheritParams to pull it from .cleanRawMZMine.

…from .cleanRawMZMine alongside sirius_annotations
@swaraj-neu swaraj-neu requested a review from tonywu1999 June 10, 2026 16:46
@tonywu1999 tonywu1999 merged commit 700a25d into devel Jun 10, 2026
2 checks passed
@tonywu1999 tonywu1999 deleted the MSstatsConvert/work/20260603_mzmine_sirius_annotations branch June 10, 2026 17:07

2 participants