Code repository for the multiRF clustering manuscript.
This repo is organized as a code-only companion hosted on the lab GitHub. Raw data, large intermediate files, and manuscript outputs are intentionally excluded. The top-level study folders contain the entry points most people should run, while a small internal/ folder is used only for HNSC data preparation helpers.
The analysis depends on the multiRF R package. Installation is included in load_package.R.
Multi-omics studies are widely used across many areas of biomedical research. In many diseases, some signals are shared across data types, while others are strongest in a single omics layer. Current multi-omics clustering methods often either merge all data types into a single representation, which can blur biology that is strong in one layer, or rely on linear structure that may miss more complex relationships across data types. We introduce multiRF, a random-forest-based method that handles complex data types and separates shared and modality-specific structure for multi-omics data. multiRF learns sample similarities across omics layers from multivariate random forests, combines them across data types, and uses the resulting weights to estimate the part of each omics layer that is predictable from the others. The remaining residual is treated as modality-specific signal, allowing shared and modality-specific similarities to be clustered separately. In simulations, multiRF recovered shared clusters as well as or better than established integrative methods while more reliably separating modality-specific signal under nonlinear data structures. In TCGA head and neck squamous cell carcinoma, the shared component aligned with the main subtype structure across established reference classifications, while gene- and miRNA-specific components revealed additional immune and developmental biology. In the ADNI cohort with matched blood DNA methylation and structural MRI, the shared cross-modal aging signal was associated with future conversion to mild cognitive impairment or Alzheimer's disease, and a DNAm-specific residual signal carried additional information in exploratory analyses. These results show that multiRF can recover a common disease axis while retaining biologically meaningful signals specific to one data type. multiRF is available as an open-source R package at https://github.com/novawz/multiRF.
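The split into shared and modality-specific signal described above can be illustrated with a toy sketch in base R. It does not call multiRF and is not the package's algorithm: a Gaussian kernel on one simulated layer stands in for the forest-derived similarity weights, only two layers are used, and all variable names are illustrative.

```r
set.seed(1)
n <- 60
z <- rnorm(n)  # latent signal shared by both layers
u <- rnorm(n)  # latent signal present only in layer A

layer_a <- cbind(z + rnorm(n, sd = 0.1), u)     # column 1 shared, column 2 specific
layer_b <- cbind(tanh(z) + rnorm(n, sd = 0.1))  # nonlinear view of the shared signal

# Stand-in similarity: Gaussian kernel on layer B (NOT random forest proximity).
d2 <- as.matrix(dist(layer_b))^2
w <- exp(-d2 / median(d2[upper.tri(d2)]))
diag(w) <- 0         # a sample should not predict itself
w <- w / rowSums(w)  # row-normalize so each row is a set of weights

shared_a   <- w %*% layer_a      # part of layer A predictable from layer B
specific_a <- layer_a - shared_a # residual, treated as layer-A-specific signal
```

In this toy setting the first column of `shared_a` should track `z`, while the second column of `specific_a` should stay close to `u`, because `u` cannot be predicted from layer B.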
MRF-cluster/
├── code/
│   ├── common/
│   │   └── config.R
│   ├── hnsc/
│   │   ├── 01_prepare_data.R
│   │   ├── 02_fit_models.R
│   │   ├── 03_subtype_crosswalk.R
│   │   ├── 04_immune_validation.R
│   │   ├── 05_hpv_negative_analysis.R
│   │   ├── 06_survival.R
│   │   ├── 07_nested_cox_models.R
│   │   ├── 08_stability.R
│   │   ├── 09_mirna_annotation.R
│   │   ├── 10_biological_annotation.R
│   │   ├── 11_methylation_annotation.R
│   │   ├── 12_marker_tests.R
│   │   ├── 13_make_main_figures.R
│   │   ├── 14_make_supplementary_figures.R
│   │   └── internal/
│   ├── adni/
│   │   ├── 01_fit_models.R
│   │   ├── 02_biological_age.R
│   │   ├── 03_feature_analysis.R
│   │   ├── 04_ablation.R
│   │   ├── 05_subtypes.R
│   │   ├── 06_make_figures.R
│   │   ├── 07_make_tables.R
│   │   ├── 08_component_sweep.R
│   │   ├── 09_component_ablation.R
│   │   └── 10_all_dnam_sensitivity.R
│   └── simulation/
│       ├── helpers.R
│       ├── run_intersim_benchmark.R
│       ├── run_nl_jive_benchmark.R
│       ├── simulate_nl_jive.R
│       ├── plot_benchmark_figures.R
│       └── ytry_sweep.R
├── load_package.R
└── README.md
Run:
source("load_package.R")
If you want to install the main dependency manually:
install.packages("remotes")
remotes::install_github("novawz/multiRF")
The scripts now use a shared path helper in code/common/config.R.
By default they look for local working directories under:
~/mrf-cluster-local/
├── data/
├── results/
├── cache/
└── external/
You can override these with environment variables:
- MRF_CLUSTER_LOCAL_ROOT
- MRF_CLUSTER_DATA_ROOT
- MRF_CLUSTER_OUTPUT_ROOT
- MRF_CLUSTER_CACHE_ROOT
- MRF_CLUSTER_EXTERNAL_DATA_ROOT
ADNI scripts also support:
- MRF_CLUSTER_ADNI_DATA_DIR
- MRF_CLUSTER_ADNI_CPG_REFERENCE
Supplementary figure export also supports:
MRF_CLUSTER_DRAFT_FIG_DIR
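
The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the contents of code/common/config.R: the helper name `resolve_root()` is hypothetical, and the idea that each specific root falls back to a subfolder of the local root (for example `data/` for `MRF_CLUSTER_DATA_ROOT`) is an assumption based on the default layout shown earlier.

```r
# Hypothetical helper; the real logic lives in code/common/config.R.
resolve_root <- function(env_var, subdir) {
  # An explicit override for this specific root wins if it is set and non-empty.
  override <- Sys.getenv(env_var, unset = "")
  if (nzchar(override)) {
    return(path.expand(override))
  }
  # Assumed fallback: a subfolder of the local root.
  local_root <- Sys.getenv("MRF_CLUSTER_LOCAL_ROOT", unset = "")
  if (!nzchar(local_root)) {
    local_root <- "~/mrf-cluster-local"
  }
  file.path(path.expand(local_root), subdir)
}

data_root   <- resolve_root("MRF_CLUSTER_DATA_ROOT", "data")
output_root <- resolve_root("MRF_CLUSTER_OUTPUT_ROOT", "results")
```

To redirect a single root for one R session, a call such as `Sys.setenv(MRF_CLUSTER_OUTPUT_ROOT = "/path/to/results")` before sourcing a script should be enough, assuming the variables are read when the script starts.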
HNSC workflow:
- Run code/hnsc/01_prepare_data.R to download and preprocess data.
- Run code/hnsc/02_fit_models.R.
- Run downstream analysis scripts in order as needed.
- Generate figures at the end with 13_make_main_figures.R and 14_make_supplementary_figures.R.
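
The numbered HNSC scripts can also be run in sequence with a small wrapper. The sketch below is not part of the repository: `list_numbered_scripts()` and `run_in_order()` are hypothetical names, and the wrapper assumes each script is self-contained and reads its inputs from disk rather than from objects left behind by the previous script.

```r
# Hypothetical convenience wrapper; not part of the repository.
list_numbered_scripts <- function(dir) {
  # Matches files such as 01_prepare_data.R; helpers and subfolders are skipped.
  scripts <- list.files(dir, pattern = "^[0-9]{2}_.*\\.R$", full.names = TRUE)
  scripts[order(basename(scripts))]
}

run_in_order <- function(dir, dry_run = TRUE) {
  scripts <- list_numbered_scripts(dir)
  for (script in scripts) {
    message("Running ", script)
    if (!dry_run) {
      # A fresh environment keeps objects from one script out of the next.
      source(script, local = new.env())
    }
  }
  invisible(scripts)
}

# Dry run from the repository root: only reports the planned order.
run_in_order("code/hnsc", dry_run = TRUE)
```

Because `list.files()` is not recursive by default, files under `internal/` are not picked up. Setting `dry_run = FALSE` executes every numbered script, so it is worth checking the printed order first.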
ADNI workflow:
- Point the ADNI environment variables to your local files.
- Run code/adni/01_fit_models.R.
- Run biological age, subtype, and ablation analyses as needed.
- Run code/adni/10_all_dnam_sensitivity.R for the DNAm-only all-subject sensitivity analysis.
- Generate figures and tables with 06_make_figures.R and 07_make_tables.R.
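
Before starting the ADNI scripts, it can help to confirm that the ADNI environment variables point at paths that exist. The check below is a hypothetical pre-flight helper, not something the repository provides. The variable names come from this README; whether both are strictly required by every ADNI script is not stated here, so treat a warning as a prompt to check rather than a hard failure.

```r
# Hypothetical pre-flight check; not part of the repository.
check_adni_paths <- function(vars = c("MRF_CLUSTER_ADNI_DATA_DIR",
                                      "MRF_CLUSTER_ADNI_CPG_REFERENCE")) {
  values <- Sys.getenv(vars, unset = "")
  # A variable passes only if it is non-empty and the path exists on disk.
  ok <- nzchar(values) & file.exists(values)
  for (i in seq_along(vars)) {
    if (!ok[i]) {
      message(vars[i], " is unset or points to a missing path")
    }
  }
  invisible(all(ok))
}

# Example (placeholder paths, replace with your own):
# Sys.setenv(MRF_CLUSTER_ADNI_DATA_DIR = "/path/to/adni",
#            MRF_CLUSTER_ADNI_CPG_REFERENCE = "/path/to/cpg_reference")
# check_adni_paths()
```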
Simulation workflow:
- Run code/simulation/run_intersim_benchmark.R.
- Run code/simulation/run_nl_jive_benchmark.R.
- Summarize with code/simulation/plot_benchmark_figures.R.
Data access:
- TCGA data: downloaded through TCGAbiolinks, cBioPortalData, and related public resources.
- ADNI data: available through the ADNI data portal under the project data-use agreement.
Notes:
- This repo does not include raw data or generated results.
- Output files are expected to be written to your local analysis directories, not into the repository itself.
- The HNSC internal/ folder contains helper scripts used by 01_prepare_data.R; most users do not need to run those files directly.