131 changes: 131 additions & 0 deletions docs/extended-universe.rst
@@ -0,0 +1,131 @@
==================================
Extended universe (legacy mode)
==================================

The ``extended_universe`` URL parameter reproduces the cross-organism
term-enrichment behavior used in
:doc:`Functional Module Detection <modules>` and
:doc:`Tissue-specific Networks <functional-networks>` on GIANT
networks between February 2024 and April 2026. New analyses default
to a human-only annotation universe. MAGE networks have always used a
human-only annotation universe, so the flag does not apply to them.

When to use it
==============

Use ``extended_universe=true`` only when you need to reproduce a result
from a publication, figure, or saved link generated between February
2024 and April 2026.

What it changes
===============

Term enrichment in :doc:`Functional Module Detection <modules>` uses
a one-sided Fisher's exact test followed by Benjamini–Hochberg
correction. Two inputs to that test depend on which annotation
universe is in effect:

* **Term size (K).** The number of genes annotated to the term.
* **Background universe (N).** The total number of genes considered
available for annotation.

In the current default mode, both K and N are computed from
human annotations only. In extended-universe mode, K and N include
annotations from non-human organisms that were carried over from
the source databases: mouse (*Mus musculus*), zebrafish (*Danio
rerio*), fruit fly (*Drosophila melanogaster*), nematode worm
(*Caenorhabditis elegans*), and budding yeast (*Saccharomyces
cerevisiae*).
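The effect of K and N on the test can be sketched in a few lines of
standard-library Python. This is a minimal illustration, not the HumanBase
implementation: the symbols ``k`` (observed overlap) and ``n`` (module size),
the function names, and all numeric values are assumptions chosen for the
example.

```python
from math import comb


def upper_tail_pvalue(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) under a hypergeometric model.

    k: genes in the module annotated to the term (assumed symbol)
    n: module size (assumed symbol)
    K: term size, N: background universe (as defined above)
    """
    top = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted Q values in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so Q values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Illustrative numbers only: the same overlap scored against a smaller
# (human-only) and a larger (cross-organism) K and N.
p_human = upper_tail_pvalue(k=5, n=20, K=100, N=18000)
p_extended = upper_tail_pvalue(k=5, n=20, K=400, N=60000)
print(p_human, p_extended)
print(benjamini_hochberg([p_human, p_extended, 0.5]))
```

The same module and overlap produce different p-values, and therefore
different Q values, once K and N change.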

A note on Q values
------------------

Functional Module Detection now derives term-enrichment Q values from a
one-sided Fisher's exact test (the upper-tail probability
``hypergeom.sf(k - 1)``, where ``k`` is the observed overlap), as
described in Krishnan et al. (2016), "Genome-wide prediction and
functional characterization of the genetic basis of autism spectrum
disorder", *Nature Neuroscience*. From November 2017 through
April 2026, the calculation instead used the point probability of
observing exactly that overlap (``hypergeom.pmf(k)``).
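The difference between the two calculations can be seen with a small
standard-library sketch. The function names, the symbol ``n`` for module
size, and the numbers are assumptions made for illustration. In SciPy's
parameter order, the upper tail below corresponds to
``hypergeom.sf(k - 1, N, K, n)`` and the point probability to
``hypergeom.pmf(k, N, K, n)``.

```python
from math import comb


def hypergeom_pmf(k, n, K, N):
    """Point probability of observing exactly k annotated genes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_upper_tail(k, n, K, N):
    """Upper-tail probability of observing k or more annotated genes."""
    return sum(hypergeom_pmf(i, n, K, N) for i in range(k, min(n, K) + 1))


# Illustrative values, not taken from any HumanBase result.
k, n, K, N = 5, 20, 100, 18000
point = hypergeom_pmf(k, n, K, N)
tail = hypergeom_upper_tail(k, n, K, N)
print(f"point probability: {point:.3e}")
print(f"upper tail:        {tail:.3e}")
```

Because the upper tail sums the point probability at ``k`` with the
probabilities of all larger overlaps, it is never smaller than the point
probability for the same inputs.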

``extended_universe=true`` restores the point-probability calculation
along with the cross-organism universe, so any result produced between
February 2024 and April 2026 can be reproduced exactly. Results from
before February 2024 will have matching test statistics, but term-size
and universe definitions will differ from what this flag restores. If
you need to reproduce a result from before February 2024, please get
in touch and we will help you recover matching values.

A note on data versioning
-------------------------

The ``extended_universe`` flag restores the legacy *code path*, but
it cannot, on its own, restore the legacy *data state*. HumanBase
imports gene records and term annotations from external sources
(NCBI, Gene Ontology, MSigDB, MeSH, and others) on its own
schedule. Two quantities used by the hypergeometric test drift
between releases independently of any term-release version label:

* the **gene universe (N)**: the number of distinct genes with at
  least one annotation, summed across the loaded organisms; and
* each **term size (K)**: the number of distinct genes annotated
  to a given term.

In practice this means a community page rerun today with
``extended_universe=true`` will reproduce the exact statistical
calculation used in the legacy code path, but the inputs N and K
reflect today's annotation tables, not the tables present at the
time of the original run. Q values shift by an amount that depends on
how much the underlying data has drifted since.

This is a known limitation. Long-term reproducibility is on the
roadmap: the plan is to pin gene and term snapshots per HumanBase
release so that the data state itself is versioned.

Where it applies
================

The flag is supported only for **GIANT** networks (the original
human tissue and biological-process networks). MAGE network analyses
have used a human-only annotation pipeline from the start, so there
is no legacy cross-organism behavior to reproduce. Requests
that combine ``extended_universe=true`` with a MAGE network are
rejected by the API.

The flag applies to the GIANT version of:

* :doc:`Functional module detection <modules>` — term enrichment for
each detected community.
* :doc:`Tissue-specific networks <functional-networks>` — annotated
term tables on gene pages (Process and Tissue tabs).

It does not affect network edge weights, gene-prediction scores, or
any non-enrichment output.

How to use it
=============

Append ``extended_universe=true`` to the URL of a community page or
gene page. For example:

* Functional module detection result::

      https://humanbase.io/module/overview/?body_tag=<job_id>&extended_universe=true

* Gene page::

      https://humanbase.io/gene/3553/blood?extended_universe=true

The interface displays a banner whenever the flag is active, so it
is always visible whether a page is in extended-universe or
default mode. On a non-GIANT page the flag is ignored and removed
from the address bar, and a warning banner explains that the request
fell back to human-only data.

Programmatic access
===================

The same parameter is forwarded to the underlying API endpoints
(``/community/`` and ``/terms/annotated/``). Scripts replaying
historical analyses through the API should append
``&extended_universe=true`` to the query string.
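For scripts, appending the parameter can be done with Python's standard
``urllib.parse`` module. The helper below is a sketch: the function name is
hypothetical, and the ``body_tag`` value ``abc`` stands in for a real job id.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_extended_universe(url):
    """Return url with extended_universe=true set exactly once."""
    parts = urlsplit(url)
    # Drop any existing copy of the flag so the helper is idempotent.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "extended_universe"
    ]
    query.append(("extended_universe", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_extended_universe("https://humanbase.io/gene/3553/blood"))
# "abc" is a placeholder job id used only for this example.
print(with_extended_universe("https://humanbase.io/module/overview/?body_tag=abc"))
```

Rebuilding the query string, rather than concatenating text, avoids
duplicating the flag when a saved link already contains it.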
33 changes: 25 additions & 8 deletions docs/functional-networks.rst
Original file line number Diff line number Diff line change
@@ -1,24 +1,31 @@
Tissue-specific Networks
===========================
In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular tissue or process context.
In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, weighted and integrated, focused on a particular tissue, cell, or process context.

It is important to consider gene relationships within a tissue context as the precise actions of genes are frequently dependent on their tissue context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging.
It is important to consider gene relationships within a tissue or cell type as the precise actions of genes are frequently dependent on their context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging.

Tissue-specific network construction is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks <https://www.nature.com/articles/ng.3259>`_. Nature Genetics.

Method
---------------------------
Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, and an edge between two genes is probabilistically weighted based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner.
Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, where an edge between two genes is a probability based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner.

One naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set.
For GIANT, one naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set.

Parameter regularization is performed as described in `Steck and Jaakkola (2002) <https://proceedings.neurips.cc/paper_files/paper/2002/file/1819932ff5cf474f4f19e7c7024640c2-Paper.pdf>`_ using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness.
Parameter regularization is performed as described in Steck and Jaakkola (2002) using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness.

MAGE constructs networks in two stages.
In stage 1 (representation learning), each dataset is converted into a gene graph with edges derived from coexpression or protein/gene interactions. MAGE trains a masked graph autoencoder that hides a fraction of edges and learns to reconstruct them using information from neighboring genes in the graph. The decoder outputs a reconstruction probability for each gene pair, which serves as dataset-level evidence for functional relatedness.

In stage 2 (context-specific integration), MAGE learns a tissue- or cell-type-specific mapping from dataset-level evidence to a functional relationship probability. This supervised model is trained using a tissue- or cell-type-specific functional gold standard derived from Gene Ontology biological process relationships together with tissue expression patterns. The output is a tissue- or cell-type-specific functional network where each edge weight is the predicted probability that two genes participate in shared biological processes in that context.

Data integration
---------------------------
We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineagespecific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.
GIANT integrates 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage-specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.

* Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO). Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation.
MAGE integrates 7,463 genome-scale datasets representing more than 250,000 experiments across multiple data types. These include protein–protein interaction resources, transcription factor binding motif information, perturbation and microRNA target profiles, and large collections of gene expression studies. Each dataset is processed into a graph representation, and the full collection of dataset-level edge evidence is then integrated into 289 tissue and cell-type networks.

* Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO) for GIANT and refine.bio for MAGE. Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation.

* Protein-interaction: Interaction data are collected from BioGRID, IntAct, MINT, and MIPS.

Expand All @@ -29,6 +36,7 @@ We collected and integrated 987 genome-scale data sets encompassing approximatel

Evidence
---------------------------
For GIANT:
The "evidence" for an edge is measured as the contribution or "influence" of each dataset on the posterior classification probability. Each dataset contribution is calculated as the posterior probability of a functional relationship given only that dataset, minus the prior probability.

Contribution of dataset D to an edge functional relationship prediction (FR)::
Expand All @@ -37,11 +45,20 @@ Contribution of dataset D to an edge functional relationship prediction (FR)::

Note that the contributions will not sum to 1.0, as each contribution is measured separately. Generally, individual gene expression datasets will not contribute much to the posterior probability but cumulatively can make a significant contribution.
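The definition in words (posterior given only one dataset, minus the prior)
can be illustrated with Bayes' rule for a binary class. This is a sketch
under stated assumptions, not the GIANT code: the function names and the
probabilities are invented for the example, and the likelihoods would in
practice come from the trained classifier.

```python
def single_dataset_posterior(prior, p_obs_given_fr, p_obs_given_not_fr):
    """P(FR | observation in one dataset) via Bayes' rule."""
    numerator = p_obs_given_fr * prior
    denominator = numerator + p_obs_given_not_fr * (1.0 - prior)
    return numerator / denominator


def dataset_contribution(prior, p_obs_given_fr, p_obs_given_not_fr):
    """Posterior given only this dataset, minus the prior."""
    posterior = single_dataset_posterior(prior, p_obs_given_fr, p_obs_given_not_fr)
    return posterior - prior


# Illustrative values: two datasets scored separately against the same prior.
prior = 0.1
contributions = [
    dataset_contribution(prior, 0.6, 0.2),  # observation favors FR
    dataset_contribution(prior, 0.3, 0.3),  # uninformative observation
]
print(contributions)
```

Each contribution is computed against the prior on its own, which is why the
values are not expected to sum to 1.0.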

For MAGE:
In each tissue- or cell-type-specific MAGE network, an edge between genes *u* and *v* is assigned a single score produced by the stage 2 (context-specific integration) gradient-boosting integration model (XGBoost). Each gene pair is represented by a 7,463-dimensional feature vector (one feature per dataset) derived from the stage 1 (representation learning) masked-edge reconstruction probabilities, and the boosting model maps these features to a predicted score between 0 and 1, where the score represents the probability of a functional relationship in that context.

The final network edge weight is the predicted score:
edge_weight(u, v) ∈ [0, 1]

Higher values indicate a higher predicted probability that the two genes participate in a functional relationship in the selected tissue or cell type.


Example
---------------------------

IL1B in blood vessel
~~~~~~~~~~~~~~~~~~~~~~~~~
We examined and experimentally verified the tissue-specific molecular response of blood vessel cells to stimulation by IL-1β (IL1B), a pro-inflammatory cytokine. We anticipated that the genes most tightly connected to IL1B in the blood vessel network would be among those responding to IL-1β stimulation in blood vessel cells. We tested this hypothesis by profiling the gene expression of human aortic smooth muscle cells (HASMCs; the predominant cell type in blood vessels) stimulated with IL-1β.

Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10^-23). The blood vessel network was the most accurate tissue network in predicting this experimental outcome; none of the other 143 tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells.
Examination of the genes whose expression was significantly upregulated at 2 h after stimulation showed that 18 of the 20 IL1B network neighbors were among the top 500 most upregulated genes in the experiment (P = 2.07 × 10^-23). The blood vessel network was the most accurate GIANT tissue network in predicting this experimental outcome; none of the other 143 GIANT tissue-specific networks or the tissue-naive network performed as well when evaluated by each network's ability to predict the result of IL-1β stimulation on the cells.
Binary file added docs/img/use-cases/functional-module-3.png
Binary file added docs/img/use-cases/functional-module-4.png
1 change: 1 addition & 0 deletions docs/index.rst
Expand Up @@ -35,6 +35,7 @@ Help topics
use-cases
functional-networks
modules
extended-universe
netwas
deepsea
sei
2 changes: 1 addition & 1 deletion docs/modules.rst
Expand Up @@ -22,5 +22,5 @@ This approach has two key desirable characteristics:

We use a dynamic :code:`k = min(50, 0.2 * |V|)`, where :code:`|V|` is the number of query genes, to obtain the shared-nearest-neighbor tissue-specific network and apply the Louvain algorithm to cluster this network into distinct modules. Krishnan et al. (2016) showed that module node membership and cluster sizes are robust by testing a range of values for k from 10 to 100. To stabilize clustering across different runs of the Louvain algorithm, we run the algorithm 100 times and calculate a comembership score for each pair of genes, equal to the fraction of runs (out of 100) in which the pair was assigned to the same cluster. Genes are assigned to clusters where their comembership score is ≥ 0.9.
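The dynamic ``k`` and the comembership score can be sketched as follows. This
is an illustration only: the Louvain runs themselves are not reproduced, the
partitions are toy data, the function names are hypothetical, and whether
``k`` is rounded to an integer is not stated in the text, so the value is
returned unrounded.

```python
from itertools import combinations


def dynamic_k(num_query_genes):
    """k = min(50, 0.2 * |V|); any rounding is left to the caller."""
    return min(50, 0.2 * num_query_genes)


def comembership_scores(partitions):
    """Fraction of runs in which each gene pair shares a cluster.

    partitions: list of dicts mapping gene -> cluster label, one per run.
    """
    runs = len(partitions)
    genes = sorted(partitions[0])
    scores = {}
    for a, b in combinations(genes, 2):
        together = sum(1 for labels in partitions if labels[a] == labels[b])
        scores[(a, b)] = together / runs
    return scores


# Toy data standing in for repeated clustering runs (two runs, three genes).
toy_runs = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]
scores = comembership_scores(toy_runs)
stable_pairs = [pair for pair, score in scores.items() if score >= 0.9]
print(dynamic_k(100), scores, stable_pairs)
```

With 100 real runs in place of the toy list, the same threshold of 0.9 keeps
only the pairs that cluster together in at least 90 runs.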

Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms. Representative processes and pathways enriched within each cluster are presented alongside of the cluster with their resulting Q value. The Q value of each term associated to the modules is calculated using one-sided Fisher's exact tests and BenjaminiHochberg corrections to correct for multiple tests.
Resulting modules are then tested for functional enrichment using genes annotated to Gene Ontology biological process terms. GIANT networks use annotations from UniProt-GOA (experimental evidence codes), while MAGE networks use annotations from NCBI gene2go (all evidence codes, including computationally inferred). Enrichment is also performed against Disease Ontology and MSigDB gene sets. Representative processes and pathways enriched within each cluster are presented alongside the cluster with their resulting Q value. The Q value of each term associated with the modules is calculated using one-sided Fisher's exact tests and Benjamini-Hochberg corrections to correct for multiple tests.
