Skip to content

Add Compound proteinIdType and entity-agnostic grounding for metabolite networks#105

Open
swaraj-neu wants to merge 1 commit into
develfrom
MSstatsBioNet/work/20260528_compound-id-type
Open

Add Compound proteinIdType and entity-agnostic grounding for metabolite networks#105
swaraj-neu wants to merge 1 commit into
develfrom
MSstatsBioNet/work/20260528_compound-id-type

Conversation

@swaraj-neu

Copy link
Copy Markdown

Add Compound proteinIdType and entity-agnostic grounding columns:
Generalize the protein-only HgncId/HgncName contract into EntityNamespace/EntityId/EntityName grounded through Gilda, keeping multi-grounding as semicolon-joined aligned lists that fan out into the INDRA query. Gene-only annotations are skipped for compounds, and the new contract flows through annotateProteinInfoFromIndra, getSubnetworkFromIndra, and cytoscapeNetwork.

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • Ran styler::style_pkg(transformers = styler::tidyverse_style(indent_by = 4))
  • Ran devtools::document()

Generalize the protein-only HgncId/HgncName contract into EntityNamespace/EntityId/EntityName grounded through Gilda, keeping multi-grounding as semicolon-joined aligned lists that fan out into the INDRA query. Gene-only annotations are skipped for compounds, and the new contract flows through annotateProteinInfoFromIndra, getSubnetworkFromIndra, and cytoscapeNetwork.
@swaraj-neu swaraj-neu requested a review from tonywu1999 June 10, 2026 04:56
@swaraj-neu swaraj-neu self-assigned this Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@swaraj-neu, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 39 minutes and 58 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fbdf590d-d363-4556-8c30-1a03c1ac0a8c

📥 Commits

Reviewing files that changed from the base of the PR and between d1e6219 and 24fff4a.

⛔ Files ignored due to path filters (2)
  • inst/extdata/groupComparisonModel.csv is excluded by !**/*.csv
  • inst/extdata/groupComparisonModel_compound.csv is excluded by !**/*.csv
📒 Files selected for processing (30)
  • R/annotateProteinInfoFromIndra.R
  • R/cytoscapeNetwork.R
  • R/getSubnetworkFromIndra.R
  • R/utils_annotateProteinInfoFromIndra.R
  • R/utils_cytoscapeNetwork.R
  • R/utils_getSubnetworkFromIndra.R
  • man/annotateProteinInfoFromIndra.Rd
  • man/cytoscapeNetwork.Rd
  • man/dot-populateEntityIdsInDataFrame.Rd
  • man/dot-populateEntityNamesInDataFrame.Rd
  • man/dot-populateHgncIdsInDataFrame.Rd
  • man/dot-populateHgncNamesInDataFrame.Rd
  • man/dot-populateKinaseInfoInDataFrame.Rd
  • man/dot-populatePhophataseInfoInDataFrame.Rd
  • man/dot-populateTranscriptionFactorInfoInDataFrame.Rd
  • man/dot-populateUniprotIdsInDataFrame.Rd
  • man/dot-validateAnnotateProteinInfoFromIndraInput.Rd
  • man/exportNetworkToHTML.Rd
  • man/getSubnetworkFromIndra.Rd
  • man/previewNetworkInBrowser.Rd
  • tests/testthat/test-annotateProteinInfoFromIndra.R
  • tests/testthat/test-exportNetworkToHTML.R
  • tests/testthat/test-getSubnetworkFromIndra.R
  • tests/testthat/test-multi-grounding.R
  • tests/testthat/test-utils_annotateProteinInfoFromIndra.R.R
  • tests/testthat/test-utils_cytoscapeNetwork.R
  • tests/testthat/test-utils_getSubnetworkFromIndra.R
  • vignettes/Cytoscape-Visualization.Rmd
  • vignettes/MSstatsBioNet.Rmd
  • vignettes/PTM-Analysis.Rmd
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch MSstatsBioNet/work/20260528_compound-id-type

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.40506% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.13%. Comparing base (d1e6219) to head (24fff4a).

Files with missing lines Patch % Lines
R/utils_getSubnetworkFromIndra.R 85.71% 9 Missing ⚠️
R/utils_annotateProteinInfoFromIndra.R 90.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #105      +/-   ##
==========================================
+ Coverage   75.35%   77.13%   +1.77%     
==========================================
  Files           9        9              
  Lines        1047     1124      +77     
==========================================
+ Hits          789      867      +78     
+ Misses        258      257       -1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@swaraj-neu

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Comment on lines +3 to +5
#' This function annotates a data frame with entity (protein or compound)
#' grounding information from INDRA / Gilda, plus gene-only flags
#' (transcription factor / kinase / phosphatase) for the protein paths.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: rewrite the first part to:

"This function standardizes entity identifiers from protein, compound, or gene inputs to a unified namespace using ID conversion from INDRA cogex or Gilda grounding."

Comment on lines +19 to +20
#' \item{GlobalProtein}{Character. The input identifier with the
#' MSstats mnemonic suffix stripped, used as the grounding key.}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say this is the post translational modification suffix stripped, where this suffix is typically <amino acid><site number> e.g. _S148

if (!is.null(nameMapping[[hgncId]])) {
df$HgncName[df$HgncId == hgncId] <- nameMapping[[hgncId]]
#' @return A data frame with populated entity names.
.populateEntityNamesInDataFrame <- function(df) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name is interesting because it only populates HGNC Names, which makes it misleading.

I'd create a new function that combines both .populateEntityIdsInDataFrame and .populateEntityNamesInDataFrame into single function. The structure would look like:

  • .populateEntityInformationInDataFrame
    • if(uniprot || uniprot_mnemonic) --> .populateEntityInformationWithIndraCogex
    • else --> .populateEntityInformationWithGilda

if (proteinIdType == "Compound") {
return(df)
}
validNameMask <- !is.na(df$EntityName)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to handle it right now, but as a short term hack, could you also ensure there aren't any rows with semicolons as well with this mask? Similar comment for populating kinase and phosphatase information.

@@ -72,7 +75,7 @@ getSubnetworkFromIndra <- function(input,
direction = match.arg(direction)
input <- .filterGetSubnetworkFromIndraInput(input, pvalueCutoff, logfc_cutoff, force_include_other, include_infinite_fc, direction)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a differential abundance analysis results table here. It's labeled as data-2026-06-10.csv

I noticed that getSubnetworkFromIndra fails with this dataset, but after I filter out all of the rows that have NA in the EntityName/EntityId/EntityNamespace columns, the function works fine. Could you look into the root cause? One solution I thought of was to filter out NA EntityId rows in .filterGetSubnetworkFromIndraInput, but that'd be if the NAs are truly causing the problems

list(
text = hgnc_name,
text = text_input,
organisms = list("9606")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check if the results change (i.e. counting number of rows with NA entityName should be sufficient) if we remove this parameter for organisms (i.e. with the dataset linked in the google drive)? My thinking is that we might accidentally be losing out on chemicals from other organisms (e.g. bacteria).

Comment on lines +252 to +255
# `emitted_cpds` and `node_type = "compound"` below refer to Cytoscape
# grouping containers used to parent PTM satellite nodes around a protein.
# This Cytoscape "compound" concept is UNRELATED to the chemical
# `proteinIdType = "Compound"` analyte type in annotateProteinInfoFromIndra.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, let's use metabolite instead of compound as an enum for proteinIdType. Could you make this change? And then this comment could get removed.

#' Splits each row's semicolon-joined \code{EntityNamespace} / \code{EntityId}
#' positionally, fans out each pair into its own grounding node, then appends
#' any \code{force_include_other} entries (parsed as \code{"namespace:id"}),
#' returning the unique set. Extracted from \code{.callIndraCogexApi} to keep

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: you can remove the text Extracted from \code{.callIndraCogexApi} to keep #' the network-free portion unit-testable.

Comment on lines +51 to +68
ns_split <- strsplit(as.character(namespaces), ";")
id_split <- strsplit(as.character(ids), ";")
if (length(ns_split) != length(id_split)) {
stop("EntityNamespace and EntityId must have the same length")
}

pairs <- list()
for (i in seq_along(ns_split)) {
ns_i <- ns_split[[i]]
id_i <- id_split[[i]]
if (length(ns_i) != length(id_i)) {
stop("EntityNamespace and EntityId entries must be positionally aligned ",
"after splitting on ';' (mismatch at row ", i, ")")
}
for (k in seq_along(ns_i)) {
pairs <- c(pairs, list(list(ns_i[k], id_i[k])))
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w.r.t. readability, I'm a little confused about this loop.

It seems like you're splitting the whole character vector by ";" and then process each chunk. For more intuitive readability, would it be better to process each value in namespaces, and then split by ";" after?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants