feat(g2p): make g2p useful for variant-interpretation workflows + bug fixes #258
Open
lauraluebbert wants to merge 7 commits into
- gene is now optional and resolved from uniprot_id via the UniProt REST
entry endpoint (cached with lru_cache). A UniProt accession alone is
sufficient identification; the old API required both.
- Fix silent failure where G2P returns HTTP 200 with a JSON
{"status":"failure",...} body on unknown gene/UniProt pairs. The
response was being parsed as a single TSV column header and a 0-row
DataFrame returned with no error. Now detected, logged, and returns
None.
- All failure modes return None (previously a mix of None and empty
DataFrame).
- Retry transient failures (5xx, connection errors, timeouts) with
exponential backoff.
- URL-encode gene/uniprot_id/isoform path segments.
- New `out=` Python argument to write to an explicit CSV path (takes
precedence over `save`).
- Docs: list g2p in SUMMARY.md (was hidden from the published site);
advertise PFES, MaveDB, pocket and interaction columns; note that
variant overlays (gnomAD/ClinVar/HGMD) are portal-only.
Backward-compatible: existing CLI (`gget g2p BRCA1 -u P38398`) and
Python (`gget.g2p("BRCA1", uniprot_id="P38398")`) call sites continue
to work unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
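The silent-failure fix in this commit can be illustrated with a minimal sketch. It assumes the failure body is a JSON object with `status` and `message` keys, as quoted in the commit message; the helper name `parse_g2p_body` and its list-of-dicts return shape are hypothetical and not taken from the PR diff.

```python
import csv
import io
import json
import logging

logger = logging.getLogger(__name__)


def parse_g2p_body(text):
    """Return TSV rows as dicts, or None when the body is a JSON failure.

    Hypothetical helper for illustration only; the PR's real code returns
    a pandas DataFrame and may detect the failure differently.
    """
    stripped = text.strip()
    if stripped.startswith("{"):
        try:
            payload = json.loads(stripped)
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and payload.get("status") == "failure":
            # HTTP 200 can still carry an error: log the upstream message
            # instead of reading the JSON string as a TSV header.
            logger.error("G2P reported a failure: %s", payload.get("message"))
            return None
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)
```

Checking the body before handing it to a TSV parser is what keeps a failure from turning into a zero-row table with a single oddly named column.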
for more information, see https://pre-commit.ci
- Symmetric resolution: `--uniprot_id` is now optional too, resolved from `gene` via UniProt (canonical reviewed human Swiss-Prot entry) when omitted. Limitations are spelled out in a prominent log message (synonyms, paralogues, non-human, unreviewed, isoforms; pass uniprot_id to override). The resolved pair travels with the data both as df.attrs["gene"]/["uniprot_id"] and as leading `Resolved Gene` / `Resolved UniProt` columns whenever resolution happened, so CSV/JSON saved files also record what was queried.
- `residues=` argument filters `features`/`alignment` results to specific positions (int / list / tuple / range / set in Python; comma-separated list and/or inclusive ranges on the CLI: `--residues 100-200,300,400`).
- `map` results gain a parsed `PDB Ids List` column (list[str]) alongside the comma-joined `PDB Ids` string, ready to feed into `gget pdb`.
- `alignment` now requires `uniprot_id` explicitly (gene→UniProt returns the base accession and cannot disambiguate isoforms).
Backward-compatible.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
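As an illustration of the CLI syntax described in this commit, here is a minimal sketch of expanding a `--residues` value into positions. The function name `parse_residues` and the choice to reject descending ranges are assumptions made for the example, not the PR's actual parser.

```python
def parse_residues(spec):
    """Expand a spec such as "100-200,300,400" into a sorted list of ints.

    Hypothetical helper: ranges are treated as inclusive, matching the
    commit message; everything else here is an assumption.
    """
    positions = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {token!r}")
            positions.update(range(start, end + 1))  # end is inclusive
        else:
            positions.add(int(token))
    return sorted(positions)
```

Collecting into a set first means overlapping ranges and repeated positions collapse to a single entry before the filter is applied.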
The previous behavior added "Resolved Gene" / "Resolved UniProt" columns only when one of the identifiers was looked up. That meant the output schema differed depending on input mode, which is awkward for downstream code that should not care whether the caller supplied gene, uniprot_id, or both.

Now the canonical pair is *always* prepended as `gene_name` and `uniprot_id` columns, populated with whichever values were used for the query. The same keys are also set on `df.attrs`. Output schema is now identical across all three input modes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
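The invariant schema described in this commit can be sketched with plain pandas. The helper name `tag_with_identifiers` is hypothetical; only the `gene_name` / `uniprot_id` column names and the use of `df.attrs` come from the commit message.

```python
import pandas as pd


def tag_with_identifiers(df, gene, uniprot_id):
    """Return a copy of df that leads with gene_name and uniprot_id columns.

    Hypothetical helper for illustration; the PR's own code may differ.
    """
    out = df.copy()
    # Insert in reverse order so gene_name ends up first.
    out.insert(0, "uniprot_id", uniprot_id)
    out.insert(0, "gene_name", gene)
    # Mirror the pair on attrs for in-memory consumers.
    out.attrs["gene_name"] = gene
    out.attrs["uniprot_id"] = uniprot_id
    return out
```

Writing the pair as real columns, and not only on `attrs`, matters because `attrs` is in-memory metadata that a plain `to_csv` call does not write out.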
Hi @Elarwei001, thanks for the original `gget g2p` module. While reviewing it for biological relevance we found that the wrapper faithfully exposed the three public G2P REST endpoints but didn't quite match how scientists actually use the portal day-to-day, and had a couple of silent-failure bugs against the live API. This PR closes that gap. Tagging you so you can sanity-check the scientific framing and naming.

Why this PR exists (scientific motivation)
The G2P portal is fundamentally a variant interpretation resource — scientists land on it with a list of variants of interest (clinical screens, MAVE deep mutational scans, gnomAD outliers) and ask: "at residue Y of protein Z, is this position in a folded region? at a pocket? at a PTM? does PFES flag it as enriched for pathogenic variants?"
The previous `gget g2p` exposed the data needed to answer those questions but made the user do a lot of plumbing: look up the UniProt accession by hand, get all ~140 columns × N residues even when they only care about three positions, parse the comma-joined PDB string before feeding it into `gget pdb`, and figure out which of the 142 columns are actually the scientifically useful ones from the column headers alone. This PR is about closing that last mile so a working scientist can go from "I have a list of variants" to "I have an annotated table" in one call.

What this PR does
Scientific UX
- Either `gene` or `--uniprot_id` is now sufficient; the other is resolved via UniProt (cached). Gene → UniProt picks the canonical reviewed human Swiss-Prot entry; the resolution and its limitations (synonyms, paralogues, non-human, isoforms) are logged so users see what was chosen and how to override.
- Every result leads with `gene_name`/`uniprot_id` columns (and stored on `df.attrs`), so downstream code doesn't have to branch on which input mode the user used and the chosen identifiers survive in saved CSV/JSON files.
- A `residues=` filter restricts `features`/`alignment` results to specific positions: Python accepts `int`/`list`/`range`/`set`; the CLI accepts `--residues 185,1775,1812` or `100-200` or `1-50,185,300-310`. The whole point of fetching the features table is usually to score a small variant list; filtering client-side after fetch is one short line of pandas the user no longer has to write.
- `map` results get a parsed `PDB Ids List` column (`list[str]`) alongside the comma-joined `PDB Ids` string, so the output is directly chainable into `gget pdb` without `.split(",")` boilerplate.
- Docs advertise the PFES, MaveDB, pocket and interaction columns, with a pointer to the `g2p-bis` feature-description repo so readers can decode the cryptic MaveDB column names.
- Docs note that variant overlays (gnomAD/ClinVar/HGMD) are portal-only and not available through `gget g2p`. Setting honest expectations rather than implying parity.

Bug fixes
- Silent failure on unknown gene/UniProt pairs: G2P returns HTTP 200 with `{"status":"failure","message":"No data for this gene."}` over the TSV channel. The previous code parsed that JSON string as a single TSV column header and returned a 0-row DataFrame with no error. Now detected, the upstream `message` is logged, and `None` is returned.
- All failure modes now return `None`. Previously it was a mix of `None` and `pd.DataFrame()`, which forced callers to check both.
- Transient failures (5xx, connection errors, timeouts) are retried with exponential backoff, the pattern `gget bgee`/`gget opentargets` already use.
- URL-encode `gene`/`uniprot_id`/`isoform` path segments.
- `Accept` header (the server ignores it and always returns `text/plain` regardless).

API surface additions
- `residues=` Python argument (and `--residues` CLI flag).
- `out=` Python argument writes the result to an explicit CSV path; `save=True` continues to work and writes to the auto-named CSV in CWD; `out=` takes precedence when both are set.
- `alignment` now requires `uniprot_id` to be passed explicitly: gene→UniProt resolution returns the base accession and can't disambiguate isoforms.

Backward compatibility
All existing call sites continue to work unchanged:
- `gget g2p BRCA1 -u P38398 -r features` ✓
- `gget.g2p("BRCA1", uniprot_id="P38398", resource="features")` ✓
- `save=True` behavior unchanged.

The one schema change is the two new leading `gene_name`/`uniprot_id` columns on every result, which we think is the right trade-off for invariant output across input modes.
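For readers who want a concrete picture of the retry and URL-encoding items in the bug-fix list, here is a minimal standard-library sketch. The names `build_path` and `with_backoff`, the retry count, and the base delay are all assumptions for illustration; the sketch retries only on `ConnectionError` and `TimeoutError`, so a real client would also need to turn 5xx responses into a retryable error, and the actual G2P URL layout is not shown here.

```python
import time
from urllib.parse import quote


def build_path(*segments):
    """Join segments after percent-encoding each one.

    safe="" makes quote() encode "/" as well, so a segment can never
    introduce an extra path level. Hypothetical helper.
    """
    return "/".join(quote(str(segment), safe="") for segment in segments)


def with_backoff(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a transient error wait base_delay * 2**attempt and retry.

    Hypothetical helper: after `retries` failed retries the last error
    is re-raised so the caller can decide what to return.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets the delay schedule be checked without actually waiting.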