Add Compound proteinIdType and entity-agnostic grounding for metabolite networks#105
swaraj-neu wants to merge 1 commit into
Generalize the protein-only HgncId/HgncName contract into EntityNamespace/EntityId/EntityName grounded through Gilda, keeping multi-grounding as semicolon-joined aligned lists that fan out into the INDRA query. Gene-only annotations are skipped for compounds, and the new contract flows through annotateProteinInfoFromIndra, getSubnetworkFromIndra, and cytoscapeNetwork.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #105      +/-   ##
==========================================
+ Coverage   75.35%   77.13%   +1.77%
==========================================
  Files           9        9
  Lines        1047     1124      +77
==========================================
+ Hits          789      867      +78
+ Misses        258      257       -1
#' This function annotates a data frame with entity (protein or compound)
#' grounding information from INDRA / Gilda, plus gene-only flags
#' (transcription factor / kinase / phosphatase) for the protein paths.
nitpick: rewrite the first part to:
"This function standardizes entity identifiers from protein, compound, or gene inputs to a unified namespace using ID conversion from INDRA cogex or Gilda grounding."
#' \item{GlobalProtein}{Character. The input identifier with the
#' MSstats mnemonic suffix stripped, used as the grounding key.}
I'd describe this as stripping the post-translational modification (PTM) suffix, which is typically _<amino acid><site number>, e.g. _S148.
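To make the suffix convention concrete, here is a minimal sketch of stripping a single trailing PTM site suffix. The helper name `stripPtmSuffix` is hypothetical, and the pattern assumes the suffix is exactly an underscore, one uppercase amino-acid letter, and a site number; it is not taken from the package.

```r
# Hypothetical helper (not part of the package): strip one trailing PTM
# site suffix of the assumed form "_<amino acid letter><site number>".
stripPtmSuffix <- function(ids) {
  sub("_[A-Z][0-9]+$", "", ids)
}

# Identifiers without such a suffix, including mnemonics like "TP53_HUMAN",
# are left unchanged because "_HUMAN" is not a letter followed by digits.
stripPtmSuffix(c("P04637_S148", "P04637", "TP53_HUMAN"))
```

Identifiers carrying several site suffixes would need a repeated or broader pattern, which this sketch does not attempt.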
if (!is.null(nameMapping[[hgncId]])) {
df$HgncName[df$HgncId == hgncId] <- nameMapping[[hgncId]]
#' @return A data frame with populated entity names.
.populateEntityNamesInDataFrame <- function(df) {
This function name is misleading because the function only populates HGNC names.
I'd create a new function that combines .populateEntityIdsInDataFrame and .populateEntityNamesInDataFrame into a single function. The structure would look like:
- .populateEntityInformationInDataFrame
- if(uniprot || uniprot_mnemonic) --> .populateEntityInformationWithIndraCogex
- else --> .populateEntityInformationWithGilda
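The proposed structure could be sketched roughly as follows. The three function names come from the outline above; the stub bodies, the `GroundingSource` column, and the `proteinIdType` values `"Uniprot"` and `"Uniprot_Mnemonic"` are assumptions made only so the dispatch is runnable.

```r
# Stub back ends: the real versions would query INDRA CoGEx or Gilda.
# The GroundingSource column is hypothetical and only marks which path ran.
.populateEntityInformationWithIndraCogex <- function(df) {
  df$GroundingSource <- "indra_cogex"
  df
}

.populateEntityInformationWithGilda <- function(df) {
  df$GroundingSource <- "gilda"
  df
}

# Single entry point that dispatches on the identifier type.
# The enum values checked here are assumed, not confirmed by the diff.
.populateEntityInformationInDataFrame <- function(df, proteinIdType) {
  if (proteinIdType %in% c("Uniprot", "Uniprot_Mnemonic")) {
    .populateEntityInformationWithIndraCogex(df)
  } else {
    .populateEntityInformationWithGilda(df)
  }
}
```

Keeping the branch in one place means callers never need to know which service grounded a given row.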
if (proteinIdType == "Compound") {
  return(df)
}
validNameMask <- !is.na(df$EntityName)
I'm not sure how best to handle this yet, but as a short-term hack, could you also make this mask exclude rows that contain semicolons? The same applies to populating the kinase and phosphatase information.
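One way the stricter mask might look, as a sketch on made-up rows: it assumes the semicolon check applies to the `EntityName` column and that multi-grounded rows should simply be skipped for the gene-only annotations.

```r
# Toy data for illustration only; the values are placeholders.
df <- data.frame(
  EntityName = c("TP53", NA, "MAPK1;MAPK3", "glucose"),
  stringsAsFactors = FALSE
)

# Keep rows that are grounded (non-NA) and single-valued (no ";").
# grepl() returns FALSE for NA input, so the is.na() check is still needed.
validNameMask <- !is.na(df$EntityName) &
  !grepl(";", df$EntityName, fixed = TRUE)

df[validNameMask, , drop = FALSE]
```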
@@ -72,7 +75,7 @@ getSubnetworkFromIndra <- function(input,
direction = match.arg(direction)
input <- .filterGetSubnetworkFromIndraInput(input, pvalueCutoff, logfc_cutoff, force_include_other, include_infinite_fc, direction)
I added a differential abundance analysis results table here, labeled data-2026-06-10.csv.
getSubnetworkFromIndra fails on this dataset, but it works once I filter out every row with NA in the EntityName/EntityId/EntityNamespace columns. Could you look into the root cause? One option is to drop rows with NA EntityId in .filterGetSubnetworkFromIndraInput, but only if the NAs really are what causes the failure.
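For reference, the workaround described above could be sketched like this. The helper name `dropUngroundedRows` is hypothetical, and whether such a filter belongs in .filterGetSubnetworkFromIndraInput depends on the root cause, which is still unconfirmed.

```r
# Hypothetical helper: drop rows with NA in any of the grounding columns.
dropUngroundedRows <- function(input,
                               cols = c("EntityNamespace", "EntityId", "EntityName")) {
  keep <- stats::complete.cases(input[, cols, drop = FALSE])
  input[keep, , drop = FALSE]
}

# Toy rows for illustration only; these are not from the shared dataset.
toy <- data.frame(
  EntityNamespace = c("HGNC", NA, "CHEBI"),
  EntityId        = c("11998", NA, NA),
  EntityName      = c("TP53", NA, "glucose"),
  stringsAsFactors = FALSE
)
dropUngroundedRows(toy)
```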
list(
  text = hgnc_name,
  text = text_input,
  organisms = list("9606")
Could you check whether the results change if we remove the organisms parameter, using the dataset linked in the Google Drive? Counting the rows with NA EntityName should be sufficient. My concern is that we might be accidentally losing chemicals from other organisms (e.g. bacteria).
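The comparison could be as small as the sketch below. The helper name `countUngroundedRows` is hypothetical; it would be called once on the annotated output produced with the organisms parameter and once on the output produced without it.

```r
# Hypothetical helper: count rows where grounding produced no EntityName.
countUngroundedRows <- function(df) {
  sum(is.na(df$EntityName))
}

# Placeholder rows to show the call shape only; these are not results.
placeholder <- data.frame(
  EntityName = c("glucose", NA, NA),
  stringsAsFactors = FALSE
)
countUngroundedRows(placeholder)
```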
# `emitted_cpds` and `node_type = "compound"` below refer to Cytoscape
# grouping containers used to parent PTM satellite nodes around a protein.
# This Cytoscape "compound" concept is UNRELATED to the chemical
# `proteinIdType = "Compound"` analyte type in annotateProteinInfoFromIndra.
For now, let's use metabolite instead of compound as an enum for proteinIdType. Could you make this change? And then this comment could get removed.
#' Splits each row's semicolon-joined \code{EntityNamespace} / \code{EntityId}
#' positionally, fans out each pair into its own grounding node, then appends
#' any \code{force_include_other} entries (parsed as \code{"namespace:id"}),
#' returning the unique set. Extracted from \code{.callIndraCogexApi} to keep
nitpick: you can remove the text "Extracted from \code{.callIndraCogexApi} to keep the network-free portion unit-testable."
ns_split <- strsplit(as.character(namespaces), ";")
id_split <- strsplit(as.character(ids), ";")
if (length(ns_split) != length(id_split)) {
  stop("EntityNamespace and EntityId must have the same length")
}

pairs <- list()
for (i in seq_along(ns_split)) {
  ns_i <- ns_split[[i]]
  id_i <- id_split[[i]]
  if (length(ns_i) != length(id_i)) {
    stop("EntityNamespace and EntityId entries must be positionally aligned ",
         "after splitting on ';' (mismatch at row ", i, ")")
  }
  for (k in seq_along(ns_i)) {
    pairs <- c(pairs, list(list(ns_i[k], id_i[k])))
  }
}
W.r.t. readability, I'm a little confused by this loop.
It seems like you're splitting the whole character vector on ";" up front and then processing each chunk. Would it read more intuitively to iterate over each value in namespaces and split it on ";" inside the loop?
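A row-first version of the loop might look like the sketch below. The function name `buildGroundingPairs` is hypothetical, the behavior is intended to match the snippet above (including its error conditions), and the final `unique()` reflects the "returning the unique set" wording in the roxygen comment. The identifiers in the example call are illustrative.

```r
# Hypothetical row-first variant: walk each row, then split that row on ";".
buildGroundingPairs <- function(namespaces, ids) {
  if (length(namespaces) != length(ids)) {
    stop("EntityNamespace and EntityId must have the same length")
  }
  pairs <- list()
  for (i in seq_along(namespaces)) {
    ns_i <- strsplit(as.character(namespaces[i]), ";", fixed = TRUE)[[1]]
    id_i <- strsplit(as.character(ids[i]), ";", fixed = TRUE)[[1]]
    if (length(ns_i) != length(id_i)) {
      stop("EntityNamespace and EntityId entries must be positionally aligned ",
           "after splitting on ';' (mismatch at row ", i, ")")
    }
    # Pair the k-th namespace with the k-th id of this row.
    pairs <- c(pairs, Map(function(ns, id) list(ns, id), ns_i, id_i))
  }
  # Map() names its result after the first argument; drop those names.
  unique(unname(pairs))
}

# One multi-grounded row fans out into two pairs, plus one single row.
buildGroundingPairs(c("HGNC;CHEBI", "HGNC"), c("11998;17234", "6871"))
```

Each iteration now touches exactly one input row, so the row index in the error message lines up directly with the loop variable.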
Add Compound proteinIdType and entity-agnostic grounding columns:
Generalize the protein-only HgncId/HgncName contract into EntityNamespace/EntityId/EntityName grounded through Gilda, keeping multi-grounding as semicolon-joined aligned lists that fan out into the INDRA query. Gene-only annotations are skipped for compounds, and the new contract flows through annotateProteinInfoFromIndra, getSubnetworkFromIndra, and cytoscapeNetwork.