1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
@@ -31,6 +31,7 @@
* [gget search](en/search.md)
* [gget setup](en/setup.md)
* [gget seq](en/seq.md)
* [gget ucsc](en/ucsc.md)
* [gget virus](en/virus.md)

---
58 changes: 58 additions & 0 deletions docs/src/en/ucsc.md
@@ -0,0 +1,58 @@
[<kbd> View page source on GitHub </kbd>](https://github.com/scverse/gget/blob/main/docs/src/en/ucsc.md)

> Python arguments are equivalent to long-option arguments (`--arg`), unless otherwise specified. Flags are True/False arguments in Python. The manual for any gget tool can be called from the command-line using the `-h` `--help` flag.
# gget ucsc 🔎
Fetch [UCSC Genome Browser](https://genome.ucsc.edu/) IDs for a gene or term, similar to `gget search` for Ensembl.
`gget ucsc` searches the UCSC Genome Browser for a gene symbol, accession, or free-text term and returns the matching identifiers (e.g. UCSC known gene / transcript IDs) together with their genomic positions, grouped by the track they come from.
Return format: JSON (command-line) or data frame/CSV (Python).

**Positional argument**
`search_term`
Gene symbol, accession, or free-text term to search for, e.g. `BRCA2`.

**Optional arguments**
`-g` `--genome`
UCSC genome assembly to search, e.g. `hg38`, `hg19`, `mm39`. Default: `hg38`.

`-t` `--track`
Only return matches from tracks whose name contains this (case-insensitive) substring, e.g. `knownGene`. Default: None.

`-l` `--limit`
Maximum number of matches to return. Default: None (all matches).

`-o` `--out`
Path to the file the results will be saved in, e.g. path/to/directory/results.csv (or .json). Default: Standard out.
Python: `save=True` will save the output in the current working directory.

**Flags**
`-csv` `--csv`
Command-line only. Returns results in CSV format.
Python: Use `json=True` to return output in JSON format.

`-q` `--quiet`
Command-line only. Prevents progress information from being displayed.
Python: Use `verbose=False` to prevent progress information from being displayed.
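
The interaction between `--track` and `--limit` can be sketched in a few lines. This is an illustration under stated assumptions, not gget's own code: `filter_matches` and the sample rows are hypothetical. It assumes the track filter is a case-insensitive substring match that runs first, and that the limit then truncates whatever remains.

```python
from typing import Optional


def filter_matches(rows: list, track: Optional[str] = None, limit: Optional[int] = None) -> list:
    """Hypothetical helper: keep rows whose track contains `track`, then truncate to `limit`."""
    if track is not None:
        needle = track.lower()
        # Case-insensitive substring match on the track name; rows without a track are dropped
        rows = [r for r in rows if r.get("track") and needle in str(r["track"]).lower()]
    if limit is not None:
        # The limit is applied after the track filter
        rows = rows[:limit]
    return rows


# Made-up rows for illustration only (not real UCSC output)
sample_rows = [
    {"track": "knownGene", "ucsc_id": "id_1"},
    {"track": "ncbiRefSeq", "ucsc_id": "id_2"},
    {"track": "knownGene", "ucsc_id": "id_3"},
]

filtered = filter_matches(sample_rows, track="KNOWNGENE", limit=1)
print(filtered)
```

Under this reading, combining a track filter with `--limit 1` returns the first match from that track rather than the first match overall.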

### Example
```bash
gget ucsc BRCA2 --genome hg38 --track knownGene
```
```python
# Python
import gget

gget.ucsc("BRCA2", genome="hg38", track="knownGene")
```
&rarr; Returns the UCSC IDs matching the search term, with their genomic positions.

| track | ucsc_id | chrom | start | end | name | description |
| --- | --- | --- | --- | --- | --- | --- |
| knownGene | ENST00000380152.8 | chr13 | 32315508 | 32400268 | BRCA2 (ENST00000380152.8) | breast cancer type 2 susceptibility protein |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . |
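
The `chrom`, `start`, and `end` columns correspond to a UCSC position string such as `chr13:32315508-32400268`. If you have such a string from elsewhere, a minimal sketch for splitting it into the same three fields might look like this. `split_position` is a hypothetical helper written for this example, not part of gget, and it assumes the `chrom:start-end` layout with optional thousands separators.

```python
import re
from typing import Optional, Tuple

# chrom, then ':', then start and end as digits that may contain commas
_POSITION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d[\d,]*)-(?P<end>\d[\d,]*)$")


def split_position(position: str) -> Tuple[Optional[str], Optional[int], Optional[int]]:
    """Hypothetical helper: split 'chr13:32315508-32400268' into (chrom, start, end)."""
    match = _POSITION_RE.match(position.strip())
    if match is None:
        # Not in chrom:start-end form; report nothing rather than guess
        return None, None, None
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    return match.group("chrom"), start, end


chrom, start, end = split_position("chr13:32315508-32400268")
print(chrom, start, end)
```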

A UCSC ID (e.g. a known gene `ucsc_id`) can be inspected on the UCSC gene page, e.g. `https://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene={ucsc_id}&db=hg38`.
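
To build that link programmatically, a minimal sketch using only the standard library could look like the following. `gene_page_url` is a hypothetical helper name chosen for this example; the URL pattern is the one quoted above.

```python
from urllib.parse import urlencode

UCSC_GENE_PAGE = "https://genome.ucsc.edu/cgi-bin/hgGene"


def gene_page_url(ucsc_id: str, genome: str = "hg38") -> str:
    """Hypothetical helper: build the UCSC gene page URL for one ID."""
    # urlencode percent-escapes any characters that are not URL-safe
    query = urlencode({"hgg_gene": ucsc_id, "db": genome})
    return f"{UCSC_GENE_PAGE}?{query}"


url = gene_page_url("ENST00000380152.8")
print(url)
```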

# References
If you use `gget ucsc` in a publication, please cite the following articles:

- Luebbert, L., & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. [https://doi.org/10.1093/bioinformatics/btac836](https://doi.org/10.1093/bioinformatics/btac836)

- Kent WJ, Sugnet CW, Furey TS, et al. (2002). The human genome browser at UCSC. Genome Research. [https://doi.org/10.1101/gr.229102](https://doi.org/10.1101/gr.229102)
1 change: 1 addition & 0 deletions docs/src/en/updates.md
@@ -5,6 +5,7 @@
#### *gget* officially became part of [*scverse*](https://scverse.org/) on June 9, 2026. 🥳🥳🥳

**Version ≥ 0.30.8** (XXX XX, 2026):
- [`gget ucsc`](ucsc.md): **New module** to fetch [UCSC Genome Browser](https://genome.ucsc.edu/) IDs for a gene or term, analogous to `gget search` for Ensembl. Searches the UCSC Genome Browser for a symbol/accession/term and returns the matching identifiers (e.g. UCSC known gene / transcript IDs) with their genomic positions, grouped by track; supports filtering by `genome`, `track`, and `limit`. Available in the Python API and on the command line. Resolves [issue 18](https://github.com/scverse/gget/issues/18).
- [`gget pdb`](pdb.md): Added support for the PDBx/mmCIF structure format (fixes [issue 178](https://github.com/scverse/gget/issues/178) and [issue 177](https://github.com/scverse/gget/issues/177)).
  - New `resource="mmcif"` option downloads the structure in PDBx/mmCIF format (`.cif`).
  - The default `resource="pdb"` now automatically falls back to PDBx/mmCIF when the legacy PDB file is unavailable (e.g. for large structures), since the legacy PDB format is being phased out by RCSB. A warning is logged and saved files use the correct extension (`.cif`).
1 change: 1 addition & 0 deletions gget/__init__.py
@@ -26,6 +26,7 @@
from .gget_search import search
from .gget_seq import seq
from .gget_setup import setup
from .gget_ucsc import ucsc
from .gget_virus import virus

# Mute numexpr threads info
3 changes: 3 additions & 0 deletions gget/constants.py
@@ -7,6 +7,9 @@
# strategy avoid hanging indefinitely on slow upstreams.
DEFAULT_REQUESTS_TIMEOUT = (10, 60)

# UCSC Genome Browser REST API for gget ucsc
UCSC_API_URL = "https://api.genome.ucsc.edu"

# Ensembl REST API server for gget seq and info
ENSEMBL_REST_API = "http://rest.ensembl.org/"
ENSEMBL_FTP_URL = "http://ftp.ensembl.org/pub/"
180 changes: 180 additions & 0 deletions gget/gget_ucsc.py
@@ -0,0 +1,180 @@
from __future__ import annotations

import html
import json as json_package
from typing import Any, Literal, overload
from urllib.parse import unquote

import pandas as pd
import requests

from .constants import DEFAULT_REQUESTS_TIMEOUT, UCSC_API_URL
from .utils import set_up_logger

logger = set_up_logger()

_COLUMNS = [
    "track",
    "ucsc_id",
    "chrom",
    "start",
    "end",
    "name",
    "description",
]


def _parse_position(position: str | None) -> tuple[str | None, int | None, int | None]:
    """Parse a UCSC position string 'chr13:32315508-32400268' into (chrom, start, end)."""
    if not position or ":" not in position:
        return position, None, None
    chrom, _, span = position.partition(":")
    if "-" not in span:
        return chrom, None, None
    start_str, _, end_str = span.partition("-")
    start_str = start_str.replace(",", "").strip()
    end_str = end_str.replace(",", "").strip()
    start = int(start_str) if start_str.isdigit() else None
    end = int(end_str) if end_str.isdigit() else None
    return chrom, start, end


def _match_rows(group: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one UCSC positionMatches track group into rows."""
    track = group.get("trackName") or group.get("name")
    group_desc = group.get("description")
    rows = []
    for m in group.get("matches", []):
        chrom, start, end = _parse_position(m.get("position"))
        ucsc_id = m.get("hgFindMatches")
        if ucsc_id is not None:
            ucsc_id = unquote(str(ucsc_id))
        pos_name = m.get("posName")
        match_desc = m.get("description") or group_desc
        rows.append(
            {
                "track": track,
                "ucsc_id": ucsc_id,
                "chrom": chrom,
                "start": start,
                "end": end,
                "name": html.unescape(pos_name) if isinstance(pos_name, str) else pos_name,
                "description": html.unescape(match_desc) if isinstance(match_desc, str) else match_desc,
            }
        )
    return rows


@overload
def ucsc(
    search_term: str,
    genome: str = "hg38",
    track: str | None = None,
    limit: int | None = None,
    save: bool = False,
    verbose: bool = True,
    *,
    json: Literal[True],
) -> list[dict[str, Any]] | None: ...


@overload
def ucsc(
    search_term: str,
    genome: str = "hg38",
    track: str | None = None,
    limit: int | None = None,
    save: bool = False,
    verbose: bool = True,
    json: Literal[False] = False,
) -> pd.DataFrame | None: ...


def ucsc(
    search_term: str,
    genome: str = "hg38",
    track: str | None = None,
    limit: int | None = None,
    save: bool = False,
    verbose: bool = True,
    json: bool = False,
) -> pd.DataFrame | list[dict[str, Any]] | None:
    """Fetch UCSC Genome Browser IDs for a gene/term, similar to gget search.

    Searches the UCSC Genome Browser for a gene symbol, accession, or other term
    and returns the matching identifiers (e.g. UCSC known gene / transcript IDs)
    together with their genomic positions, grouped by the track they come from.

    Args:
    - search_term   Gene symbol, accession, or free-text term to search for, e.g. "BRCA2".
    - genome        UCSC genome assembly to search, e.g. "hg38", "hg19", "mm39". Default: "hg38".
    - track         If provided, only return matches from tracks whose name contains
                    this (case-insensitive) substring, e.g. "knownGene". Default: None.
    - limit         Maximum number of matches to return. Default: None (all matches).
    - save          If True, save the results table as csv/json in the working directory. Default: False.
    - verbose       True/False whether to print progress information. Default: True.
    - json          If True, returns results in json format instead of data frame. Default: False.

    Returns a data frame (or list of dicts if json=True) with one row per match,
    including the track, UCSC ID, chromosome, start, end, name, and description.
    Returns None if no matches are found.
    """
    if search_term is None or str(search_term).strip() == "":
        raise ValueError("Please provide a gene symbol or search term in 'search_term'.")

    term = str(search_term).strip()
    url = f"{UCSC_API_URL}/search"
    params = {"search": term, "genome": genome}

    if verbose:
        logger.info(f"Searching UCSC ({genome}) for '{term}'...")

    try:
        response = requests.get(
            url,
            params=params,
            headers={"Accept": "application/json"},
            timeout=DEFAULT_REQUESTS_TIMEOUT,
        )
    except requests.exceptions.RequestException as exc:
        raise RuntimeError(f"The UCSC server request failed: {exc}") from exc

    if not response.ok:
        raise RuntimeError(
            f"The UCSC server returned error status code {response.status_code}. Please try again later."
        )

    data = response.json()
    if isinstance(data, dict) and data.get("error"):
        raise ValueError(f"UCSC returned an error: {data['error']}")

    rows = []
    for group in data.get("positionMatches", []):
        rows.extend(_match_rows(group))

    # Optional track filter
    if track is not None:
        track_lower = str(track).lower()
        rows = [r for r in rows if r["track"] and track_lower in str(r["track"]).lower()]

    # Optional limit
    if limit is not None:
        rows = rows[: int(limit)]

    results_df = pd.DataFrame(rows, columns=_COLUMNS)

    if len(results_df) == 0:
        logger.warning(f"No UCSC matches found for '{term}' in genome '{genome}'.")
        return None

    if json:
        results_dict = json_package.loads(results_df.to_json(orient="records"))
        if save:
            with open("gget_ucsc_results.json", "w", encoding="utf-8") as f:
                json_package.dump(results_dict, f, ensure_ascii=False, indent=4)
        return results_dict

    if save:
        results_df.to_csv("gget_ucsc_results.csv", index=False)

    return results_df