A TDA-Based Cancer Biomarker Discovery Pipeline
FindVar applies Topological Data Analysis (TDA) to TCGA-BRCA RNA-seq data to identify cancer-associated biomarker gene sets that conventional Euclidean statistical methods miss. Its main result is the H2C Gene Panel, a topology-driven biomarker signature of 37 genes that reaches AUC = 0.993 in classification validation.
| Finding | Description |
|---|---|
| H1 Loop Structures | Tumor samples exhibit 2.5× more H1 loops than normal samples (p < 0.001). |
| 0% Gene Overlap | The top 200 genes ranked by TDA and the top 200 ranked by Euclidean statistics share no genes. |
| H2C Gene Panel | A set of 37 genes, all statistically non-significant under Euclidean analysis (p > 0.05), achieves AUC = 0.993. |
| Pathway Orthogonality | TDA highlights cell invasion and cytoskeletal pathways, whereas Euclidean methods identify metabolic and ion-channel pathways, with zero pathway overlap. |
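
The overlap figures in the table above come down to a set intersection between two top-k lists. The following is a minimal sketch of that calculation, not the project's code: the function name `top_k_overlap`, the score dictionaries, and the gene labels `G1` to `G4` are all hypothetical placeholders, and it assumes that a higher score means a higher rank.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Return (shared items, overlap fraction) for the top-k items of two rankings.

    scores_a / scores_b map an item (gene or pathway) to a score;
    higher scores rank first (an assumption of this sketch).
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    shared = top_a & top_b
    return shared, len(shared) / k


# Toy scores for four placeholder genes (not real data).
tda_scores = {"G1": 0.90, "G2": 0.80, "G3": 0.10, "G4": 0.05}
euclid_scores = {"G1": 0.10, "G2": 0.20, "G3": 0.90, "G4": 0.80}

shared, fraction = top_k_overlap(tda_scores, euclid_scores, k=2)
print(shared, fraction)
```

With real rankings, `k=200` would correspond to the top-200 comparison reported above, and the same function applies to pathway lists.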
FindVar/
├── README.md ← This document
├── plan.md ← Overall analysis plan
├── result.md ← Consolidated results (for manuscript preparation)
│
├── phase1_tda_setup/ ← Phase 1: Exploratory TDA analysis
│ ├── verify_install.py │ Library installation verification
│ ├── explore_ph.py │ Persistent Homology exploration
│ ├── PHASE1_REPORT.md │ Analysis report
│ └── results/
│ ├── ph_comparison_summary.csv │ PH comparison summary table
│ ├── ph_diagram_*.png │ Persistence diagrams (5 settings)
│ └── distance_comparison.png │ Wasserstein/Bottleneck comparison
│
├── phase2_persistent_homology/ ← Phase 2: Statistical validation
│ ├── analyze_ph.py │ Permutation tests + bootstrap analysis
│ ├── PHASE2_REPORT.md │ Analysis report
│ └── results/
│ ├── permutation_test_results.csv │ Permutation p-values
│ ├── h1_count_test_results.csv │ H1 count test (key result)
│ ├── bootstrap_stability_results.csv │ Bootstrap stability analysis
│ ├── permutation_null_distributions.png
│ ├── h1_count_comparison.png │ ★ H1 count: Tumor vs Normal
│ ├── observed_vs_null_comparison.png
│ └── bootstrap_stability.png
│
├── phase3_gene_traceback/ ← Phase 3: Gene attribution
│ ├── traceback_genes.py │ Decoder Jacobian-based gene tracing
│ ├── PHASE3_REPORT.md │ Analysis report
│ └── results/
│ ├── gene_importance_full.csv │ Full ranking of 20,876 genes
│ ├── gene_importance_top100.csv │ Detailed Top 100 genes
│ ├── tda_only_genes.csv │ 200 TDA-exclusive genes
│ ├── both_methods_genes.csv │ Genes identified by both methods (0 genes)
│ ├── latent_dimension_analysis.csv │ Analysis of 32 latent dimensions
│ ├── top30_genes.png │ Top 30 gene importance chart
│ ├── tda_vs_euclidean_rank.png │ ★ TDA vs Euclidean scatter plot
│ ├── discovery_comparison.png │ Gene discovery Venn diagram
│ ├── latent_dimension_importance.png
│ └── latent_pca.png
│
├── phase4_biological_interpretation/ ← Phase 4: Pathway analysis and validation
│ ├── pathway_and_validation.py │ GO/KEGG enrichment + classification
│ ├── PHASE4_REPORT.md │ Analysis report
│ └── results/
│ ├── enrichment_tda_top200.csv │ TDA pathway enrichment
│ ├── enrichment_euclidean_top200.csv │ Euclidean pathway enrichment
│ ├── classification_results.csv │ Classification performance
│ ├── pathway_overlap_summary.csv │ Pathway overlap summary
│ ├── classification_comparison.png │ ★ Classification comparison
│ └── pathway_comparison.png │ ★ Pathway comparison
│
└── phase5_visualization_paper/ ← Phase 5: Publication-ready figures
├── generate_figures.py │ Figure generation script
└── figures/
├── fig2_persistence_diagrams.pdf │ Persistence diagrams
├── fig3_statistical_validation.pdf │ Statistical validation
├── fig4_gene_discovery.pdf │ Gene discovery
├── fig5_pathway_comparison.pdf │ Pathway comparison
├── fig6_classification.pdf │ Classification performance
├── fig7_latent_space.pdf │ Latent space visualization
├── summary_figure.pdf │ Overall summary figure
└── *.png │ PNG versions
TCGA-BRCA RNA-seq (1,215 samples × 20,862 genes)
│
├─ [Preprocessing] log1p → GPU ComBat → Gene Filtering
│
├─ [TAE] Topological Autoencoder (32-dimensional cosine latent space)
│
├─ [Phase 1] Persistent Homology Exploration
│ → Detect topological differences between tumor and normal samples
│
├─ [Phase 2] Size-Matched Permutation Testing
│ → H1 loop enrichment (p < 0.001)
│
├─ [Phase 3] Decoder Jacobian Analysis
│ → Gene attribution and traceback
│ → 0% overlap between TDA and Euclidean discoveries
│
├─ [Phase 4] Pathway Enrichment + Classification Validation
│ → H2C Panel achieves AUC = 0.993
│
└─ [Phase 5] Publication-Ready Figure Generation
→ Vectorized PDF figures
The H2C Gene Panel consists of 37 genes that are non-significant under conventional Euclidean statistical testing (p > 0.05) yet rank among the most influential genes in the topological analysis.
Representative genes:
| Gene | TDA Rank | Euclidean p-value | Biological Function |
|---|---|---|---|
| EFCAB3 | 8 | 0.791 | Calcium-binding domain protein |
| PGC | 11 | 0.908 | Pepsinogen C |
| RPRM | 13 | 0.206 | p53 target involved in G2 checkpoint regulation |
| RPRML | 14 | 0.333 | Reprimo-like protein |
| HSPB9 | 18 | 0.924 | Small heat shock protein |
Complete gene list: phase3_gene_traceback/results/tda_only_genes.csv
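
The panel's defining criterion, as described above, combines a high TDA rank with a non-significant Euclidean p-value. The sketch below applies that filter to the five representative genes from the table. The field names (`gene`, `tda_rank`, `euclid_p`), the function `select_panel`, and the `max_rank` cutoff are hypothetical; the actual column names in the CSV and the rank cutoff used for the 37-gene panel are not stated here.

```python
# Representative genes copied from the table above.
rows = [
    {"gene": "EFCAB3", "tda_rank": 8, "euclid_p": 0.791},
    {"gene": "PGC", "tda_rank": 11, "euclid_p": 0.908},
    {"gene": "RPRM", "tda_rank": 13, "euclid_p": 0.206},
    {"gene": "RPRML", "tda_rank": 14, "euclid_p": 0.333},
    {"gene": "HSPB9", "tda_rank": 18, "euclid_p": 0.924},
]


def select_panel(rows, max_rank, alpha=0.05):
    """Keep genes ranked within max_rank by TDA whose Euclidean p-value exceeds alpha."""
    ordered = sorted(rows, key=lambda r: r["tda_rank"])
    return [
        r["gene"]
        for r in ordered
        if r["tda_rank"] <= max_rank and r["euclid_p"] > alpha
    ]


# max_rank=15 is an arbitrary illustrative cutoff.
print(select_panel(rows, max_rank=15))  # ['EFCAB3', 'PGC', 'RPRM', 'RPRML']
```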
| Component | Version |
|---|---|
| Python | 3.12.13 (conda environment: tda) |
| PyTorch | 2.11.0+cu126 |
| ripser | 0.6.14 |
| persim | 0.3.8 |
| gudhi | 3.12.0 |
| scikit-learn | 1.8.0 |
| gseapy | 1.1.13 |
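
To compare a local environment against the versions listed above, a small check based on the standard library's `importlib.metadata` is enough. This is a sketch, not the contents of `verify_install.py`. The distribution names are assumptions based on the usual PyPI names (for example `torch` for PyTorch), and the comparison uses a prefix match so that a local build suffix such as `+cu126` does not count as a mismatch.

```python
from importlib import metadata

# Expected versions from the table above, keyed by assumed PyPI distribution name.
EXPECTED = {
    "torch": "2.11.0",
    "ripser": "0.6.14",
    "persim": "0.3.8",
    "gudhi": "3.12.0",
    "scikit-learn": "1.8.0",
    "gseapy": "1.1.13",
}


def check_versions(expected):
    """Return {name: (expected, installed or None, matches)} for each package."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        matches = have is not None and have.startswith(want)
        report[name] = (want, have, matches)
    return report


for name, (want, have, ok) in check_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH" if have else "MISSING"
    print(f"{name:14s} expected {want:8s} installed {have or '-':16s} {status}")
```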
| Repository | Description |
|---|---|
| Data-preprocessing | Data preprocessing and Topological Autoencoder training |
| FindVar | TDA analysis and H2C biomarker discovery |