TransBioInfoLab/multiRF-cluster

MRF-cluster

Code repository for the multiRF clustering manuscript.

This repository is a code-only companion to the manuscript, hosted on the lab GitHub. Raw data, large intermediate files, and manuscript outputs are intentionally excluded. The top-level study folders contain the entry points most users should run; a small internal/ folder holds helpers used only for HNSC data preparation.

The analysis depends on the multiRF R package. load_package.R handles its installation.

Abstract

Multi-omics studies are widely used across many areas of biomedical research. In many diseases, some signals are shared across data types, while others are strongest in a single omics layer. Current multi-omics clustering methods often either merge all data types into a single representation, which can blur biology that is strong in one layer, or rely on linear structure that may miss more complex relationships across data types. We introduce multiRF, a random-forest-based method that handles complex data types and separates shared and modality-specific structure for multi-omics data. multiRF learns sample similarities across omics layers from multivariate random forests, combines them across data types, and uses the resulting weights to estimate the part of each omics layer that is predictable from the others. The remaining residual is treated as modality-specific signal, allowing shared and modality-specific similarities to be clustered separately. In simulations, multiRF recovered shared clusters as well as or better than established integrative methods while more reliably separating modality-specific signal under nonlinear data structures. In TCGA head and neck squamous cell carcinoma, the shared component aligned with the main subtype structure across established reference classifications, while gene- and miRNA-specific components revealed additional immune and developmental biology. In the ADNI cohort with matched blood DNA methylation and structural MRI, the shared cross-modal aging signal was associated with future conversion to mild cognitive impairment or Alzheimer's disease, and a DNAm-specific residual signal showed exploratory additional information. These results show that multiRF can recover a common disease axis while retaining biologically meaningful signals specific to one data type. multiRF is available as an open-source R package at https://github.com/novawz/multiRF.

Repository structure

MRF-cluster/
├── code/
│   ├── common/
│   │   └── config.R
│   ├── hnsc/
│   │   ├── 01_prepare_data.R
│   │   ├── 02_fit_models.R
│   │   ├── 03_subtype_crosswalk.R
│   │   ├── 04_immune_validation.R
│   │   ├── 05_hpv_negative_analysis.R
│   │   ├── 06_survival.R
│   │   ├── 07_nested_cox_models.R
│   │   ├── 08_stability.R
│   │   ├── 09_mirna_annotation.R
│   │   ├── 10_biological_annotation.R
│   │   ├── 11_methylation_annotation.R
│   │   ├── 12_marker_tests.R
│   │   ├── 13_make_main_figures.R
│   │   ├── 14_make_supplementary_figures.R
│   │   └── internal/
│   ├── adni/
│   │   ├── 01_fit_models.R
│   │   ├── 02_biological_age.R
│   │   ├── 03_feature_analysis.R
│   │   ├── 04_ablation.R
│   │   ├── 05_subtypes.R
│   │   ├── 06_make_figures.R
│   │   ├── 07_make_tables.R
│   │   ├── 08_component_sweep.R
│   │   ├── 09_component_ablation.R
│   │   └── 10_all_dnam_sensitivity.R
│   └── simulation/
│       ├── helpers.R
│       ├── run_intersim_benchmark.R
│       ├── run_nl_jive_benchmark.R
│       ├── simulate_nl_jive.R
│       ├── plot_benchmark_figures.R
│       └── ytry_sweep.R
├── load_package.R
└── README.md

Installation

Run:

source("load_package.R")

If you want to install the main dependency manually:

install.packages("remotes")
remotes::install_github("novawz/multiRF")
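
If you prefer not to reinstall packages that are already present, the manual commands above can be wrapped in a guard. The following is a minimal sketch; `install_if_missing` is a hypothetical helper written for this example and is not part of load_package.R.

```r
# Hypothetical helper (not part of load_package.R): run an installer
# only when the package cannot already be loaded.
install_if_missing <- function(pkg, installer) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  installer()
  invisible(TRUE)             # an installation was attempted
}

# Usage (commented out because it needs network access):
# install_if_missing("remotes", function() install.packages("remotes"))
# install_if_missing("multiRF", function() remotes::install_github("novawz/multiRF"))
```

The installer is passed as a function so that nothing is downloaded unless the package is actually missing.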

Local paths

The scripts use a shared path helper defined in code/common/config.R.

By default they look for local working directories under:

~/mrf-cluster-local/
├── data/
├── results/
├── cache/
└── external/
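
The layout above can be created ahead of time from R. This is a sketch only: the README does not say whether the scripts create these folders themselves, and `create_local_layout` is a hypothetical name used for illustration.

```r
# Hypothetical helper: create the default local layout described above.
# The directory names come from this README; the function name is illustrative.
create_local_layout <- function(root = file.path("~", "mrf-cluster-local")) {
  root <- path.expand(root)
  subdirs <- c("data", "results", "cache", "external")
  paths <- file.path(root, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates the root; existing folders are left alone
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Example: build the layout under a temporary directory instead of the home directory.
demo_root <- file.path(tempdir(), "mrf-cluster-local")
demo_paths <- create_local_layout(demo_root)
```

Calling `create_local_layout()` with no arguments would target `~/mrf-cluster-local/` instead.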

You can override these with environment variables:

  • MRF_CLUSTER_LOCAL_ROOT
  • MRF_CLUSTER_DATA_ROOT
  • MRF_CLUSTER_OUTPUT_ROOT
  • MRF_CLUSTER_CACHE_ROOT
  • MRF_CLUSTER_EXTERNAL_DATA_ROOT

ADNI scripts also support:

  • MRF_CLUSTER_ADNI_DATA_DIR
  • MRF_CLUSTER_ADNI_CPG_REFERENCE

Supplementary figure export also supports:

  • MRF_CLUSTER_DRAFT_FIG_DIR
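
As an illustration, the overrides can be set for the current R session with `Sys.setenv()` before any analysis script is sourced. The resolver below is a hypothetical sketch of the "environment variable first, default second" pattern. It is not the actual code in code/common/config.R, and the default subfolder assigned to each variable is an assumption.

```r
# Set overrides for the current R session. The paths are placeholders;
# substitute your own locations.
Sys.setenv(
  MRF_CLUSTER_DATA_ROOT   = file.path(tempdir(), "mrf-data"),
  MRF_CLUSTER_OUTPUT_ROOT = file.path(tempdir(), "mrf-results")
)

# Hypothetical resolver: use the environment variable when it is set and
# non-empty, otherwise fall back to the supplied default.
resolve_dir <- function(env_var, default) {
  value <- Sys.getenv(env_var, unset = "")
  if (nzchar(value)) value else default
}

# Assumed mapping of variables to default subfolders (not confirmed by the repo).
data_root  <- resolve_dir("MRF_CLUSTER_DATA_ROOT",  "~/mrf-cluster-local/data")
cache_root <- resolve_dir("MRF_CLUSTER_CACHE_ROOT", "~/mrf-cluster-local/cache")
```

For settings that should persist across sessions, the same variables can be placed in a user-level `.Renviron` file.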

Suggested workflow

HNSC

  1. Run code/hnsc/01_prepare_data.R to download and preprocess data.
  2. Run code/hnsc/02_fit_models.R.
  3. Run the downstream analysis scripts (03 through 12 in code/hnsc/) in numerical order as needed.
  4. Generate figures at the end with code/hnsc/13_make_main_figures.R and code/hnsc/14_make_supplementary_figures.R.
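
The numbered steps above can be driven from a single R session. The sketch below assumes the scripts can be sourced from the repository root and run one after another; both function names are hypothetical and not part of the repository.

```r
# Hypothetical helper: list numbered scripts (e.g. 01_prepare_data.R) in run order.
# list.files() is non-recursive here, so files under internal/ are not picked up.
list_numbered_scripts <- function(dir) {
  files <- list.files(dir, pattern = "^[0-9]+_.*\\.R$", full.names = TRUE)
  prefix <- as.integer(sub("_.*$", "", basename(files)))
  files[order(prefix)]
}

# Hypothetical runner: prints the order, and sources each script in its own
# environment only when dry_run = FALSE.
run_pipeline <- function(dir, dry_run = TRUE) {
  scripts <- list_numbered_scripts(dir)
  for (s in scripts) {
    message("Running ", basename(s))
    if (!dry_run) source(s, local = new.env())
  }
  invisible(scripts)
}

# run_pipeline("code/hnsc")                   # dry run: only prints the order
# run_pipeline("code/hnsc", dry_run = FALSE)  # actually runs every script
```

A dry run first is a cheap way to confirm the order before starting the longer model-fitting steps.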

ADNI

  1. Point the ADNI environment variables to your local files.
  2. Run code/adni/01_fit_models.R.
  3. Run biological age, subtype, and ablation analyses as needed.
  4. Run code/adni/10_all_dnam_sensitivity.R for the DNAm-only all-subject sensitivity analysis.
  5. Generate figures and tables with 06_make_figures.R and 07_make_tables.R.
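
Because step 1 depends on local configuration, a short pre-flight check can catch an unset or mistyped path before the model fit starts. This is a sketch only: `check_adni_env` is a hypothetical helper, and the README does not say whether each variable should point to a file or a directory, so the check only tests that the path exists.

```r
# Hypothetical pre-flight check; the variable names are taken from this README.
check_adni_env <- function(vars = c("MRF_CLUSTER_ADNI_DATA_DIR",
                                    "MRF_CLUSTER_ADNI_CPG_REFERENCE")) {
  values <- unname(Sys.getenv(vars, unset = ""))
  data.frame(
    variable    = vars,
    is_set      = nzchar(values),
    # file.exists() is TRUE for both files and directories
    path_exists = nzchar(values) & file.exists(values),
    stringsAsFactors = FALSE
  )
}

status <- check_adni_env()
print(status)
```

Any row with `is_set` or `path_exists` equal to `FALSE` points to a variable that should be fixed before running code/adni/01_fit_models.R.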

Simulation

  1. Run code/simulation/run_intersim_benchmark.R.
  2. Run code/simulation/run_nl_jive_benchmark.R.
  3. Summarize with code/simulation/plot_benchmark_figures.R.

Data access

  • TCGA data: downloaded through TCGAbiolinks, cBioPortalData, and related public resources.
  • ADNI data: available through the ADNI data portal under the project data-use agreement.

Notes

  • This repo does not include raw data or generated results.
  • Output files are expected to be written to your local analysis directories, not into the repository itself.
  • The HNSC internal/ folder contains helper scripts used by 01_prepare_data.R; most users do not need to run those files directly.
