FindVar

A TDA-Based Cancer Biomarker Discovery Pipeline

FindVar applies Topological Data Analysis (TDA) to TCGA-BRCA RNA-seq data to identify cancer-associated biomarker gene sets that conventional Euclidean statistical methods do not detect. Its main result is the H2C Gene Panel, a topology-driven biomarker signature with strong predictive power (AUC = 0.993).

Key Findings

Finding	Description
H1 Loop Structures	Tumor samples exhibit 2.5× more H1 loops than normal samples (p < 0.001).
0% Gene Overlap	The top 200 genes ranked by TDA and the top 200 ranked by Euclidean statistics have no genes in common.
H2C Gene Panel	A set of 37 genes, all statistically non-significant under Euclidean analysis (p > 0.05), achieves AUC = 0.993.
Pathway Orthogonality	TDA highlights cell invasion and cytoskeletal pathways, whereas Euclidean methods identify metabolic and ion-channel pathways, with zero pathway overlap.
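
The gene-overlap finding comes down to a set intersection between two top-k lists. The sketch below shows that comparison on toy data. The function name `top_k_overlap` and the gene identifiers are hypothetical and are not taken from the project scripts; the real rankings would presumably be loaded from the Phase 3 result files.

```python
def top_k_overlap(ranking_a, ranking_b, k=200):
    """Return (shared genes, overlap fraction) for the top-k of two rankings.

    Both rankings are lists of gene names ordered from most to least important.
    """
    if len(ranking_a) < k or len(ranking_b) < k:
        raise ValueError("both rankings must contain at least k genes")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    shared = top_a & top_b
    return shared, len(shared) / k


# Toy rankings with hypothetical gene identifiers (not real results).
tda_ranking = [f"GENE_T{i}" for i in range(300)]
euclidean_ranking = [f"GENE_E{i}" for i in range(300)]

shared, fraction = top_k_overlap(tda_ranking, euclidean_ranking, k=200)
print(len(shared), fraction)  # 0 0.0 for these disjoint toy lists
```

An overlap fraction of 0.0 corresponds to the "0% Gene Overlap" row above.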

Project Structure

FindVar/
├── README.md                                ← This document
├── plan.md                                  ← Overall analysis plan
├── result.md                                ← Consolidated results (for manuscript preparation)
│
├── phase1_tda_setup/                        ← Phase 1: Exploratory TDA analysis
│   ├── verify_install.py                    │  Library installation verification
│   ├── explore_ph.py                        │  Persistent Homology exploration
│   ├── PHASE1_REPORT.md                     │  Analysis report
│   └── results/
│       ├── ph_comparison_summary.csv        │  PH comparison summary table
│       ├── ph_diagram_*.png                 │  Persistence diagrams (5 settings)
│       └── distance_comparison.png          │  Wasserstein/Bottleneck comparison
│
├── phase2_persistent_homology/              ← Phase 2: Statistical validation
│   ├── analyze_ph.py                        │  Permutation tests + bootstrap analysis
│   ├── PHASE2_REPORT.md                     │  Analysis report
│   └── results/
│       ├── permutation_test_results.csv     │  Permutation p-values
│       ├── h1_count_test_results.csv        │  H1 count test (key result)
│       ├── bootstrap_stability_results.csv  │  Bootstrap stability analysis
│       ├── permutation_null_distributions.png
│       ├── h1_count_comparison.png          │  ★ H1 count: Tumor vs Normal
│       ├── observed_vs_null_comparison.png
│       └── bootstrap_stability.png
│
├── phase3_gene_traceback/                   ← Phase 3: Gene attribution
│   ├── traceback_genes.py                   │  Decoder Jacobian-based gene tracing
│   ├── PHASE3_REPORT.md                     │  Analysis report
│   └── results/
│       ├── gene_importance_full.csv         │  Full ranking of 20,876 genes
│       ├── gene_importance_top100.csv       │  Detailed Top 100 genes
│       ├── tda_only_genes.csv               │  200 TDA-exclusive genes
│       ├── both_methods_genes.csv           │  Genes identified by both methods (0 genes)
│       ├── latent_dimension_analysis.csv    │  Analysis of 32 latent dimensions
│       ├── top30_genes.png                  │  Top 30 gene importance chart
│       ├── tda_vs_euclidean_rank.png        │  ★ TDA vs Euclidean scatter plot
│       ├── discovery_comparison.png         │  Gene discovery Venn diagram
│       ├── latent_dimension_importance.png
│       └── latent_pca.png
│
├── phase4_biological_interpretation/        ← Phase 4: Pathway analysis and validation
│   ├── pathway_and_validation.py            │  GO/KEGG enrichment + classification
│   ├── PHASE4_REPORT.md                     │  Analysis report
│   └── results/
│       ├── enrichment_tda_top200.csv        │  TDA pathway enrichment
│       ├── enrichment_euclidean_top200.csv  │  Euclidean pathway enrichment
│       ├── classification_results.csv       │  Classification performance
│       ├── pathway_overlap_summary.csv      │  Pathway overlap summary
│       ├── classification_comparison.png    │  ★ Classification comparison
│       └── pathway_comparison.png           │  ★ Pathway comparison
│
└── phase5_visualization_paper/              ← Phase 5: Publication-ready figures
    ├── generate_figures.py                  │  Figure generation script
    └── figures/
        ├── fig2_persistence_diagrams.pdf    │  Persistence diagrams
        ├── fig3_statistical_validation.pdf  │  Statistical validation
        ├── fig4_gene_discovery.pdf          │  Gene discovery
        ├── fig5_pathway_comparison.pdf      │  Pathway comparison
        ├── fig6_classification.pdf          │  Classification performance
        ├── fig7_latent_space.pdf            │  Latent space visualization
        ├── summary_figure.pdf               │  Overall summary figure
        └── *.png                            │  PNG versions

Analysis Pipeline

TCGA-BRCA RNA-seq (1,215 samples × 20,862 genes)
  │
  ├─ [Preprocessing] log1p → GPU ComBat → Gene Filtering
  │
  ├─ [TAE] Topological Autoencoder (32-dimensional cosine latent space)
  │
  ├─ [Phase 1] Persistent Homology Exploration
  │       → Detect topological differences between tumor and normal samples
  │
  ├─ [Phase 2] Size-Matched Permutation Testing
  │       → H1 loop enrichment (p < 0.001)
  │
  ├─ [Phase 3] Decoder Jacobian Analysis
  │       → Gene attribution and traceback
  │       → 0% overlap between TDA and Euclidean discoveries
  │
  ├─ [Phase 4] Pathway Enrichment + Classification Validation
  │       → H2C Panel achieves AUC = 0.993
  │
  └─ [Phase 5] Publication-Ready Figure Generation
          → Vectorized PDF figures
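
The size-matched permutation test in Phase 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `analyze_ph.py`. The groups are first subsampled to the same size, since the number of topological features depends on how many points are in the cloud. The `statistic` argument stands in for "count H1 loops in this group's persistence diagram"; here it is replaced by a plain mean over toy scalars so that the example runs with the standard library only.

```python
import random


def size_matched_permutation_test(group_a, group_b, statistic, n_perm=1000, seed=0):
    """Two-sided permutation test on statistic(a) - statistic(b).

    Both groups are subsampled to the size of the smaller one before testing.
    """
    rng = random.Random(seed)
    n = min(len(group_a), len(group_b))
    a = rng.sample(group_a, n)  # size matching
    b = rng.sample(group_b, n)
    observed = statistic(a) - statistic(b)

    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of samples
        diff = statistic(pooled[:n]) - statistic(pooled[n:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


def mean(values):
    # Placeholder statistic (assumption); the pipeline would use an H1 count.
    return sum(values) / len(values)


# Toy data: two clearly separated groups of unequal size (not real results).
tumor_like = [10 + 0.01 * i for i in range(50)]
normal_like = [0.01 * i for i in range(20)]

observed, p_value = size_matched_permutation_test(tumor_like, normal_like, mean)
print(observed > 0, p_value < 0.01)
```

With 1,000 permutations the smallest attainable p-value is 1/1001, so reporting p < 0.001 requires that no permuted statistic reaches the observed one.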

H2C Gene Panel

The H2C Gene Panel consists of 37 genes that are completely non-significant under conventional Euclidean statistical testing (p > 0.05) but are identified as highly influential through topological analysis.

Representative genes:

Gene	TDA Rank	Euclidean p-value	Biological Function
EFCAB3	8	0.791	Calcium-binding domain protein
PGC	11	0.908	Pepsinogen C
RPRM	13	0.206	p53 target involved in G2 checkpoint regulation
RPRML	14	0.333	Reprimo-like protein
HSPB9	18	0.924	Small heat shock protein

Complete gene list: phase3_gene_traceback/results/tda_only_genes.csv
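
The panel criterion described above (highly ranked by TDA, non-significant under Euclidean testing) can be expressed as a simple filter. The sketch below uses the five representative genes from the table plus one clearly hypothetical entry that fails the criterion. The rank cutoff `max_tda_rank=200` is an assumption made for illustration; this README does not state the exact rule that yields the 37 genes.

```python
from typing import NamedTuple


class GeneRecord(NamedTuple):
    gene: str
    tda_rank: int
    euclidean_p: float


records = [
    GeneRecord("EFCAB3", 8, 0.791),
    GeneRecord("PGC", 11, 0.908),
    GeneRecord("RPRM", 13, 0.206),
    GeneRecord("RPRML", 14, 0.333),
    GeneRecord("HSPB9", 18, 0.924),
    # Hypothetical entry, not from the results: significant under Euclidean testing.
    GeneRecord("HYPOTHETICAL_GENE", 5, 0.001),
]


def select_panel(records, max_tda_rank=200, alpha=0.05):
    """Keep genes within the TDA rank cutoff whose Euclidean p-value exceeds alpha."""
    kept = [r for r in records if r.tda_rank <= max_tda_rank and r.euclidean_p > alpha]
    return sorted(kept, key=lambda r: r.tda_rank)


panel = select_panel(records)
print([r.gene for r in panel])
# ['EFCAB3', 'PGC', 'RPRM', 'RPRML', 'HSPB9']
```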

Software Environment

Component	Version
Python	3.12.13 (conda environment: tda)
PyTorch	2.11.0+cu126
ripser	0.6.14
persim	0.3.8
gudhi	3.12.0
scikit-learn	1.8.0
gseapy	1.1.13
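
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch using only the standard library and is not the project's `verify_install.py`. The keys are assumed to be the distribution names under which the packages are installed (for example `torch` for PyTorch), and the helper name `check_versions` is hypothetical.

```python
from importlib import metadata

# Versions copied from the table above; keys are assumed distribution names.
EXPECTED = {
    "torch": "2.11.0+cu126",
    "ripser": "0.6.14",
    "persim": "0.3.8",
    "gudhi": "3.12.0",
    "scikit-learn": "1.8.0",
    "gseapy": "1.1.13",
}


def check_versions(expected):
    """Map each package name to 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report


for name, status in check_versions(EXPECTED).items():
    print(f"{name}: {status}")
```

The comparison is an exact string match, so a different CUDA build suffix on PyTorch would be reported as a mismatch.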

Related Repositories

Repository	Description
Data-preprocessing	Data preprocessing and Topological Autoencoder training
FindVar	TDA analysis and H2C biomarker discovery
