TDA-Medical/FindVar

FindVar

A TDA-Based Cancer Biomarker Discovery Pipeline

FindVar applies Topological Data Analysis (TDA) to TCGA-BRCA RNA-seq data to identify cancer-associated biomarker gene sets that conventional Euclidean statistical methods do not detect. Its main result is the H2C Gene Panel, a topology-driven biomarker signature with strong predictive power (AUC = 0.993).


Key Findings

| Finding | Description |
| --- | --- |
| H1 Loop Structures | Tumor samples exhibit 2.5× more H1 loops than normal samples (p < 0.001). |
| 0% Gene Overlap | The top 200 genes identified by TDA and the top 200 identified by Euclidean statistics are completely disjoint. |
| H2C Gene Panel | A set of 37 genes, all statistically non-significant under Euclidean analysis (p > 0.05), achieves AUC = 0.993. |
| Pathway Orthogonality | TDA highlights cell invasion and cytoskeletal pathways, whereas Euclidean methods identify metabolic and ion-channel pathways, with zero pathway overlap. |
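
The 0% overlap figure is a comparison of two top-k gene sets. A minimal sketch of that comparison follows, assuming each method yields a per-gene score where higher means more important. The function names and the toy scores are hypothetical and are not taken from the FindVar scripts.

```python
def top_k(scores: dict[str, float], k: int) -> set[str]:
    """Return the names of the k highest-scoring genes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def overlap_fraction(scores_a: dict[str, float],
                     scores_b: dict[str, float],
                     k: int) -> float:
    """Fraction of the top-k genes shared by the two rankings."""
    shared = top_k(scores_a, k) & top_k(scores_b, k)
    return len(shared) / k


# Toy scores, invented for illustration (not FindVar results).
tda_importance = {"G1": 0.9, "G2": 0.8, "G3": 0.1, "G4": 0.05}
euclidean_score = {"G1": 0.2, "G2": 0.1, "G3": 3.5, "G4": 2.9}

print(overlap_fraction(tda_importance, euclidean_score, k=2))  # 0.0
```

In the project itself, k would be 200 and the scores would presumably come from the Phase 3 ranking files.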

Project Structure

FindVar/
├── README.md                                ← This document
├── plan.md                                  ← Overall analysis plan
├── result.md                                ← Consolidated results (for manuscript preparation)
│
├── phase1_tda_setup/                        ← Phase 1: Exploratory TDA analysis
│   ├── verify_install.py                    │  Library installation verification
│   ├── explore_ph.py                        │  Persistent Homology exploration
│   ├── PHASE1_REPORT.md                     │  Analysis report
│   └── results/
│       ├── ph_comparison_summary.csv        │  PH comparison summary table
│       ├── ph_diagram_*.png                 │  Persistence diagrams (5 settings)
│       └── distance_comparison.png          │  Wasserstein/Bottleneck comparison
│
├── phase2_persistent_homology/              ← Phase 2: Statistical validation
│   ├── analyze_ph.py                        │  Permutation tests + bootstrap analysis
│   ├── PHASE2_REPORT.md                     │  Analysis report
│   └── results/
│       ├── permutation_test_results.csv     │  Permutation p-values
│       ├── h1_count_test_results.csv        │  H1 count test (key result)
│       ├── bootstrap_stability_results.csv  │  Bootstrap stability analysis
│       ├── permutation_null_distributions.png
│       ├── h1_count_comparison.png          │  ★ H1 count: Tumor vs Normal
│       ├── observed_vs_null_comparison.png
│       └── bootstrap_stability.png
│
├── phase3_gene_traceback/                   ← Phase 3: Gene attribution
│   ├── traceback_genes.py                   │  Decoder Jacobian-based gene tracing
│   ├── PHASE3_REPORT.md                     │  Analysis report
│   └── results/
│       ├── gene_importance_full.csv         │  Full ranking of 20,876 genes
│       ├── gene_importance_top100.csv       │  Detailed Top 100 genes
│       ├── tda_only_genes.csv               │  200 TDA-exclusive genes
│       ├── both_methods_genes.csv           │  Genes identified by both methods (0 genes)
│       ├── latent_dimension_analysis.csv    │  Analysis of 32 latent dimensions
│       ├── top30_genes.png                  │  Top 30 gene importance chart
│       ├── tda_vs_euclidean_rank.png        │  ★ TDA vs Euclidean scatter plot
│       ├── discovery_comparison.png         │  Gene discovery Venn diagram
│       ├── latent_dimension_importance.png
│       └── latent_pca.png
│
├── phase4_biological_interpretation/        ← Phase 4: Pathway analysis and validation
│   ├── pathway_and_validation.py            │  GO/KEGG enrichment + classification
│   ├── PHASE4_REPORT.md                     │  Analysis report
│   └── results/
│       ├── enrichment_tda_top200.csv        │  TDA pathway enrichment
│       ├── enrichment_euclidean_top200.csv  │  Euclidean pathway enrichment
│       ├── classification_results.csv       │  Classification performance
│       ├── pathway_overlap_summary.csv      │  Pathway overlap summary
│       ├── classification_comparison.png    │  ★ Classification comparison
│       └── pathway_comparison.png           │  ★ Pathway comparison
│
└── phase5_visualization_paper/              ← Phase 5: Publication-ready figures
    ├── generate_figures.py                  │  Figure generation script
    └── figures/
        ├── fig2_persistence_diagrams.pdf    │  Persistence diagrams
        ├── fig3_statistical_validation.pdf  │  Statistical validation
        ├── fig4_gene_discovery.pdf          │  Gene discovery
        ├── fig5_pathway_comparison.pdf      │  Pathway comparison
        ├── fig6_classification.pdf          │  Classification performance
        ├── fig7_latent_space.pdf            │  Latent space visualization
        ├── summary_figure.pdf               │  Overall summary figure
        └── *.png                            │  PNG versions

Analysis Pipeline

TCGA-BRCA RNA-seq (1,215 samples × 20,862 genes)
  │
  ├─ [Preprocessing] log1p → GPU ComBat → Gene Filtering
  │
  ├─ [TAE] Topological Autoencoder (32-dimensional cosine latent space)
  │
  ├─ [Phase 1] Persistent Homology Exploration
  │       → Detect topological differences between tumor and normal samples
  │
  ├─ [Phase 2] Size-Matched Permutation Testing
  │       → H1 loop enrichment (p < 0.001)
  │
  ├─ [Phase 3] Decoder Jacobian Analysis
  │       → Gene attribution and traceback
  │       → 0% overlap between TDA and Euclidean discoveries
  │
  ├─ [Phase 4] Pathway Enrichment + Classification Validation
  │       → H2C Panel achieves AUC = 0.993
  │
  └─ [Phase 5] Publication-Ready Figure Generation
          → Vectorized PDF figures
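
The Phase 2 step can be illustrated with a generic permutation test. The sketch below is an assumption about how a size-matched test might be structured, not the code in analyze_ph.py: the larger group is subsampled to the size of the smaller one, and group labels are then shuffled to build a null distribution. The `statistic` argument stands in for an H1 loop count computed from a persistence diagram; here it is simply the mean of invented per-sample values.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def size_match(larger: Sequence[float], n: int, rng: random.Random) -> list[float]:
    """Subsample the larger group without replacement down to n items."""
    return rng.sample(list(larger), n)


def permutation_p_value(group_a: Sequence[float],
                        group_b: Sequence[float],
                        statistic: Callable[[Sequence[float]], float],
                        n_perm: int = 999,
                        seed: int = 0) -> float:
    """One-sided p-value for statistic(group_a) - statistic(group_b)."""
    rng = random.Random(seed)
    observed = statistic(group_a) - statistic(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the two groups
        diff = statistic(pooled[:n_a]) - statistic(pooled[n_a:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy per-sample values, invented for illustration only.
rng = random.Random(42)
tumor = [10.0 + (i % 3) for i in range(30)]
normal = [float(i % 3) for i in range(10)]

tumor_matched = size_match(tumor, len(normal), rng)
p = permutation_p_value(tumor_matched, normal, statistic=mean)
print(f"p = {p:.3f}")
```

With this estimator the smallest attainable p-value is 1 / (n_perm + 1), so reporting p < 0.001 requires more than 999 permutations.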

H2C Gene Panel

The H2C Gene Panel consists of 37 genes that are non-significant under conventional Euclidean statistical testing (every gene has p > 0.05) yet rank as highly influential in the topological analysis.

Representative genes:

| Gene | TDA Rank | Euclidean p-value | Biological Function |
| --- | --- | --- | --- |
| EFCAB3 | 8 | 0.791 | Calcium-binding domain protein |
| PGC | 11 | 0.908 | Pepsinogen C |
| RPRM | 13 | 0.206 | p53 target involved in G2 checkpoint regulation |
| RPRML | 14 | 0.333 | Reprimo-like protein |
| HSPB9 | 18 | 0.924 | Small heat shock protein |

Complete gene list: phase3_gene_traceback/results/tda_only_genes.csv
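
The selection rule described in this section (high TDA rank, Euclidean p > 0.05) can be written as a simple filter. The sketch below applies it to the five representative genes listed in this section. The `GeneRecord` type, the `select_panel` function, and the rank cutoff of 200 are illustrative assumptions; the exact criteria used to arrive at the 37-gene panel are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    name: str
    tda_rank: int
    euclidean_p: float


# Values copied from the representative-gene list in this section.
records = [
    GeneRecord("EFCAB3", 8, 0.791),
    GeneRecord("PGC", 11, 0.908),
    GeneRecord("RPRM", 13, 0.206),
    GeneRecord("RPRML", 14, 0.333),
    GeneRecord("HSPB9", 18, 0.924),
]


def select_panel(records: list[GeneRecord],
                 max_tda_rank: int,
                 alpha: float = 0.05) -> list[str]:
    """Keep genes ranked highly by TDA but non-significant at level alpha."""
    kept = [r for r in records
            if r.tda_rank <= max_tda_rank and r.euclidean_p > alpha]
    return [r.name for r in sorted(kept, key=lambda r: r.tda_rank)]


# max_tda_rank=200 is an assumed cutoff, chosen to mirror the top-200 comparison.
print(select_panel(records, max_tda_rank=200))
```

All five representative genes pass the filter at alpha = 0.05, which is consistent with the p-values listed for them.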


Software Environment

| Component | Version |
| --- | --- |
| Python | 3.12.13 (conda: tda) |
| PyTorch | 2.11.0+cu126 |
| ripser | 0.6.14 |
| persim | 0.3.8 |
| gudhi | 3.12.0 |
| scikit-learn | 1.8.0 |
| gseapy | 1.1.13 |

Related Repositories

| Repository | Description |
| --- | --- |
| Data-preprocessing | Data preprocessing and Topological Autoencoder training |
| FindVar | TDA analysis and H2C biomarker discovery |
