Skip to content

markusdrag/MethylSense

Repository files navigation

MethylSense Logo

High-accuracy epigenetic diagnostics via Nanopore methylation sequencing

License: AFL-3.0 R Version


Overview

MethylSense is a sort-of "semi-automated" R-based bioinformatics pipeline for training, testing, and cross-validation of machine learning models to predict the presence of clinical infections or inflammatory processes using (drum roll) epigenetics! After feeding Nanopore or Illumina methylation data and running the main analysis script, it will compute associated DMRs with your condition and find the best predictive ML models for diagnosing new cases and samples. You will also gain insights into the epigenome and generate publication-ready figures in short time - and lots more!

So what does MethylSense actually do?

  • Detects differentially methylated regions (DMRs): Identifies custom sized genomic regions with statistically significant methylation differences between multiple groups (e.g., Control, Infected, Suspected) using the methylKit R package (Akalin et al., 2012) with false discovery rate (FDR) correction. Also supports covariates.

  • Trains machine learning classifiers: Builds diagnostic models using nine machine learning algorithms via the caret framework (Kuhn, 2008), including Random Forest, SVM, XGBoost, and Elastic Net. Key features include Monte Carlo cross-validation and nested cross-validation for robust, unbiased model evaluation.

  • Diagnoses new samples: Applies trained models to classify patient samples with probability scores and confidence estimates to support clinical decision-making.

  • Generates publication-ready reports: Creates comprehensive evaluation reports with ROC curves, confusion matrices, Random Forest decision-tree graphics (and other model-specific plots), and performance metrics suitable for publication. You will have a lot of figures ready!

Originally developed for avian Aspergillus fumigatus infection diagnosis using host cell-free DNA methylation, achieving ~92% accuracy with Monte Carlo cross-validation.


Table of Contents

  1. Prerequisites
  2. Installation
  3. Sample Sheet Format
  4. Quick Start
  5. Example Output
  6. Pipeline Components
  7. Output Structure
  8. Command-Line Reference
  9. Cross-Validation Methodology
  10. Clinical Applications
  11. Troubleshooting
  12. Citation
  13. Acknowledgements
  14. License and Authors

Prerequisites

Upstream Pipelines

MethylSense requires processed Nanopore methylation data. You must run these pipelines first to prepare your sequencing data:

NanoporeToBED-Pipeline (GitHub)

This pipeline converts raw Nanopore sequencing output (modBAM files from Guppy or Dorado basecallers) into standardised CpG BED files. It extracts 5-methylcytosine (5mC) modification calls at CpG dinucleotide sites, preserving methylation frequencies and read coverage information. Run this pipeline first before MethylSense.

GenomeToWindows (GitHub)

This utility generates genomic window BED files (typically 1kb, 5kb, 10kb, or 25kb) from a reference genome. These windows define the regions for DMR detection. Aggregating CpG-level data into larger genomic windows reduces noise from individual CpG site variation and provides sufficient statistical power for differential methylation testing.

System Requirements

  • R version: 4.0 or higher (4.2+ recommended)
  • Operating system: Linux, macOS, or Windows

Running MethylSense from the terminal:

Platform How to run
Linux Use the native Terminal application
macOS Use the native Terminal app (Applications → Utilities → Terminal)
Windows Use MobaXterm (recommended) or WSL2 with Ubuntu

Note for Windows users: MobaXterm provides a Unix-like terminal environment with built-in X11 forwarding for R plots. Install R within MobaXterm's environment or use WSL2 with a native Linux R installation for best compatibility.


Installation

Quick Install

git clone https://github.com/markusdrag/MethylSense.git
cd MethylSense

Rscript MethylSense_installer.R

chmod +x MethylSense_*.R

The installer script automatically installs all required packages from both CRAN and Bioconductor. Installation typically takes 15-45 minutes depending on your system and existing packages. The script will report success or failure for each package and provide troubleshooting guidance if needed.

Manual Installation

If you prefer manual installation, MethylSense depends on packages from both CRAN and Bioconductor:

# CRAN packages
install.packages(c(
  "optparse", "qs", "caret", "pROC", "ggplot2", "hrbrthemes", "dplyr", "tidyr",
  "randomForest", "e1071", "glmnet", "nnet", "class", "MASS",
  "klaR", "ranger", "xgboost", "pheatmap", "RColorBrewer",
  "data.table", "readxl", "reshape2", "viridis", "corrplot",
  "factoextra", "ggridges", "gridExtra", "ggrepel", "naivebayes",
  "tree", "rpart", "rpart.plot", "cluster", "jsonlite", "MLmetrics",
  "doParallel", "foreach", "pls", "PRROC", "igraph", "umap",
  "dendextend", "lme4", "nlme", "MuMIn", "ez", "fossil"
))

# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("methylKit", "GenomicRanges", "genomation", "regioneR", "IRanges"))

Verify Installation

Rscript -e "library(methylKit); library(caret); library(xgboost)"
Rscript MethylSense_load_data.R --help

Sample Sheet Format

MethylSense requires a sample metadata file (Excel .xlsx or comma-separated .csv) that describes your samples, their group assignments, and file locations. This information is essential for the pipeline to correctly load, process, and analyse your data.

Required columns

These columns must be present in your sample sheet:

Column Description Example
IDUnique sample identifier (must match BED filenames)sample_001
SpeciesSpecies name used for filtering samplesGallus_gallus
bedFileOrgOriginal BED filename from NanoporeToBEDsample_001.CpG.bed
bedFileOutput filename for converted BED (user-specified)sample_001_8col.bed
treatMethylkitNumeric group code encoding each sample's treatment or infection status for methylKit. Each unique integer represents a distinct group (e.g., 0 = Control, 1 = Suspected, 2 = Infected). These codes are mapped to human-readable names via --group_names in ascending numeric order, or explicitly with --treatment_mapping (e.g., 0=Control,1=Suspected,2=Infected). Plot colours are then auto-assigned from the MethylSense palette based on group name keywords.0, 1, 2

Optional columns

These columns provide additional metadata for reporting and stratification:

Column Description Example
InfectionHuman-readable group labelControl, Infected, Suspected
StudyStudy or cohort identifier for stratified analysisStudy_A
BatchBatch identifier for covariate adjustment (use with --covariate_cols)Batch_1, Batch_2
Pre-analytic QC (numeric)Optional technical covariates: reviewer auto-detects columns whose names match keywords such as storage, dna, hemolysis, time, concentration, extraction, quality, qc (see script). Use several variables for meaningful histograms.storage_time_days, cfDNA_concentration_ng_uL, hemolysis_index, …
Animal_IDIndividual animal identifierBird_42
Lab_IDLaboratory sample identifierLAB001

Example sample sheet

ID            Species        bedFileOrg            bedFile                treatMethylkit  Infection
control_1     Gallus_gallus  control_1.CpG.bed     control_1_8col.bed     0               Control
control_2     Gallus_gallus  control_2.CpG.bed     control_2_8col.bed     0               Control
suspected_1   Gallus_gallus  suspected_1.CpG.bed   suspected_1_8col.bed   1               Suspected
suspected_2   Gallus_gallus  suspected_2.CpG.bed   suspected_2_8col.bed   1               Suspected
infected_1    Gallus_gallus  infected_1.CpG.bed    infected_1_8col.bed    2               Infected
infected_2    Gallus_gallus  infected_2.CpG.bed    infected_2_8col.bed    2               Infected

See the example_data folder for a complete working example: sample_metadata_30.csv (and matching BED files) provides 30 samples (10 per class). You can also use sample_metadata.xlsx with the same column layout.


Quick Start

MethylSense Workflow

Step 1: Load and preprocess data

Runtime: 5-10 minutes

This step converts raw CpG BED files into a unified methylKit object (qs format). The pipeline normalises coverage across samples, filters low-quality CpG sites, and prepares the data for downstream analysis.

Rscript MethylSense_load_data.R \
  --species "Gallus_gallus" \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --bed_dir ./example_data \
  --output_dir ./preprocessed

What happens: Each sample's CpG methylation data is loaded, filtered by minimum coverage thresholds, and combined into a single methylRawList object for efficient analysis.

Output: YYYYMMDD_Species_nN_methylRaw.qs

Step 2: Train diagnostic models

Runtime: 1-2 hours (with 10-fold cross-validation)

This is the core analysis step. MethylSense identifies differentially methylated regions (DMRs) using the methylKit package, then trains machine learning classifiers to distinguish between your experimental groups.

Rscript MethylSense_analysis.R \
  --qs_file ./preprocessed/*_methylRaw.qs \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --output_dir ./training \
  --window_file ./example_data/windows/windows_5kb.bed \
  --models rf,svm,xgboost \
  --group_names Control,Suspected,Infected \
  --group_colors '#1B5E20,#006064,#4A148C' \
  --positive_class Infected \
  --cv_repeats 10 \
  --nested_cv \
  --nested_cv_repeats 5

Tip — Group colours: The --group_colors parameter accepts comma-separated hex codes or colour names, one per group in the same order as --group_names. If omitted, colours are auto-assigned from the MethylSense palette based on group name keywords (e.g., "Control" → green, "Infected" → purple). Override with any valid R colour, for example --group_colors 'darkgreen,orange,darkred'.

What happens:

  1. Methylation levels are calculated within genomic windows
  2. Statistical testing identifies DMRs with FDR correction (Benjamini-Hochberg)
  3. DMR methylation values become features for machine learning
  4. Multiple models are trained with cross-validation for robust performance estimates
  5. Nested cross-validation provides unbiased accuracy estimates suitable for publication

Output: Trained models (rds files), DMR coordinates, cross-validation results, performance metrics

Optional: Data quality assessment and visualisation

Runtime: 5-15 minutes

This step is optional but recommended for exploring your DMR results and generating publication-ready visualisations. Run this after Step 2 to understand methylation patterns, assess data quality, and create comprehensive figures for presentations or publications.

Rscript MethylSense_general_data_overview.R \
  --analysis_dir ./training/training_* \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --region_sizes "5000" \
  --output_dir ./data_overview \
  --plot_format png,pdf \
  --infection_col Infection \
  --sample_id_col ID \
  --study_col Study

What happens: The script analyses your DMR results and generates publication-ready figures including:

  • DMR counts across different window sizes
  • Volcano plots showing effect sizes and significance
  • Coverage distribution analysis (quality control)
  • Chromosomal distribution of DMRs
  • Effect size characterisation (hypermethylated vs hypomethylated)
  • Sample-level methylation heatmaps for top DMRs

Output: Comprehensive report with figures and statistics in data_overview/

When to use: Run this after Step 2 to visualise DMR results, compare different window sizes, and generate exploratory figures. This is particularly useful for understanding your data before selecting the best model for clinical diagnosis.

Step 3: Model evaluation and selection

Runtime: 5-10 minutes

Before using a model for clinical diagnosis, evaluate its performance to select the best model. This step generates comprehensive reports with publication-ready figures to help you choose the optimal diagnostic model.

# Find a trained model directory (pick one based on marker count)
Rscript MethylSense_reviewer.R \
  --model_dir ./training/training_*/windows_5kb/rf/markers_5 \
  --qs_file ./preprocessed/*_methylRaw.qs \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --sample_id_column ID \
  --treatment_column Infection \
  --output_subdir ./report

What happens: The reviewer analyses model performance on the training data, generates ROC curves, confusion matrices, and detailed reports. Compare performance across different models (rf, svm, xgboost) to select the best one for your diagnostic application.

Output: MODEL_EVALUATION_REPORT.md, MODEL_EVALUATION_REPORT.html, figures, tables

Step 4: Diagnose new samples

Runtime: 5-10 minutes (including preprocessing)

Apply your trained model to classify new unknown samples. This is the real-world clinical workflow where you have new patient samples that need diagnosis.

Step 4a: Generate unknown example samples (Tutorial only)

For this tutorial, we provide a script to generate 3 "unknown" samples with hidden labels:

Rscript generate_unknown_samples.R

This creates example_data/unknown_samples/ with 3 BED files and a metadata sheet.

Step 4b: Preprocess new samples

Preprocess the unknown samples just like in Step 1. For unknown samples, the treatMethylkit and Infection columns should be NA:

Rscript MethylSense_load_data.R \
  --species "Gallus_gallus" \
  --sample_sheet ./example_data/unknown_samples/unknown_samples_metadata.xlsx \
  --bed_dir ./example_data/unknown_samples \
  --output_dir ./unknown_preprocessed

Step 4c: Run predictions

Rscript MethylSense_predict.R \
  --model_dir ./training/training_*/windows_5kb/rf/markers_5 \
  --qs_file ./unknown_preprocessed/*_methylRaw.qs \
  --output_dir ./unknown_predictions \
  --plots

What happens:

  1. The prediction script loads your trained model and its DMR coordinates
  2. It extracts methylation values from the new samples at those specific DMR regions
  3. The model predicts each sample's infection status with a probability score
  4. Samples with confidence ≥0.7 are classified; lower confidence samples are flagged for review

Output:

  • *_predictions_detailed.csv - Full predictions with probabilities for each class
  • *_predictions_clinical_report.csv - Simplified report for clinical use
  • *_prediction_summary.txt - Overview of prediction results
  • Diagnostic plots (if --plots flag used)

Tutorial Checkpoint: Check your predictions! The 3 unknown samples have hidden true labels (Control, Infected, Suspected). Compare with your model's predictions to see accuracy.

Example prediction output

MethylSense produces detailed predictions with class probabilities for each sample:

Sample_ID True_Status Predicted_Class Confidence Prob_Control Prob_Suspected Prob_Infected
unknown_1ControlControl0.948 (High)0.9480.0520.000
unknown_2InfectedInfected0.850 (High)0.0060.1440.850
unknown_3SuspectedSuspected0.820 (High)0.0020.8200.178

All 3 samples correctly classified with high confidence (≥0.7). The model correctly identified one sample from each class.


Example Output

MethylSense generates publication-ready visualisations across its pipeline scripts. The examples below were generated from a 3-class analysis (Control, Suspected, Infected) using 30 simulated samples with 10-fold nested cross-validation. Many additional plots are generated by each script—see the output directories for the full suite.

From MethylSense_analysis.R

Plots generated during model training (use --cpg_report for CpG-level plots).

Confusion matrix

Confusion Matrix

DMR barplot

DMR Barplot

Group methylation summary

Group Methylation

Methylation heatmap

Heatmap

Prediction probabilities

Predictions

Random Forest — example decision tree

RF example decision tree

Illustrative classification tree for the selected DMR features

Genome-wide DMR Manhattan plot

Manhattan Plot

CpG-level profile (DMR #4)

CpG Profile

From MethylSense_reviewer.R

Model evaluation plots generated when reviewing a trained model.

UMAP dimensionality reduction

UMAP

Sample clustering based on methylation biomarkers

Multiclass ROC curves

ROC Curves

One-vs-rest AUC for each class

DMR marker stability

Bootstrap Stability

Bootstrap selection frequency for each DMR marker

Marker effect size vs stability

Effect Size vs Stability

DMR selection consistency against methylation difference

Preanalytic QC - technical covariate associations

LMM Associations Summary

Linear model analysis of technical covariates vs DMR methylation with FDR correction

CpG-level Manhattan plot

CpG Manhattan

Genome-wide significance of individual CpG sites within DMRs

From MethylSense_general_data_overview.R

Exploratory plots providing a landscape view of DMR results.

Volcano plot

Volcano Plot

Effect size vs significance with top DMRs labeled

Methylation by infection status

Ridge Methylation

Ridge plot of methylation distribution by group


Pipeline Components

MethylSense Architecture

Core scripts

Script Purpose Key Output
MethylSense_load_data.RPreprocess BED files into methylKit formatqs methylRawList object
MethylSense_analysis.RDMR detection and ML model trainingTrained models, DMR coordinates
MethylSense_predict.RClinical diagnosis of new samplesPredictions with confidence scores
MethylSense_reviewer.RClinical interpretation and reportingEvaluation reports and figures

Optional scripts

Script Purpose
MethylSense_general_data_overview.RData quality assessment and exploratory visualisation
MethylSense_installer.RComplete package installer with verification

Available machine learning models

MethylSense supports nine machine learning algorithms, implemented through the caret framework (Kuhn, 2008):

Model Code Speed Best Use Case
Random ForestrfMediumGeneral purpose, feature importance analysis
Support Vector MachinesvmMediumHigh accuracy classification
XGBoostxgboostMediumBest overall accuracy
Elastic NetglmnetFastBiomarker discovery, interpretable models
Neural NetworknnetMediumComplex non-linear patterns
k-Nearest NeighboursknnFastSimple datasets, baseline comparisons
Linear Discriminant AnalysisldaFastHighly interpretable results
Naive BayesnbFastBaseline comparisons
Fast Random ForestrangerFastLarge datasets

Model shortcuts: --models all, --models fast, --models biomarker


Output Structure

After MethylSense_analysis.R

output_directory/
├── analysis_summary.csv          # Complete results for all models
├── analysis_settings.txt         # Reproducibility settings
├── ANALYSIS_REPORT.md            # Summary report (Markdown)
├── ANALYSIS_REPORT.html          # Summary report (HTML)
├── run_log.txt                   # Complete execution log
├── ml_training_logs/             # Detailed ML training diagnostics
├── train_test_splits/            # Data partition information
├── precomputed_data/             # Cached intermediate results
├── complete_dmr_datasets/        # DMR statistics per window size
├── cv_analysis_plots/            # Cross-validation visualisations
└── models/                       # Trained model files
    ├── rf_model/
    │   ├── model.rds             # Serialised model object
    │   ├── dmr_coords.csv        # DMR coordinates used
    │   └── model_summary.txt     # Model performance summary
    ├── svm_model/
    └── xgboost_model/

After MethylSense_reviewer.R

report_directory/
├── MODEL_EVALUATION_REPORT.md    # Comprehensive clinical report
├── MODEL_EVALUATION_REPORT.html  # Interactive HTML version
├── MODEL_EVALUATION_REPORT.pdf   # Publication-ready PDF
├── figures/                      # All visualisations
│   ├── roc_curves/
│   ├── confusion_matrices/
│   ├── feature_importance/
│   └── methylation_heatmaps/
└── tables/                       # Summary statistics

Command-Line Reference

All MethylSense scripts support --help to display full parameter lists. This section documents parameters for publication reporting and advanced usage.

MethylSense_load_data.R

Preprocesses raw BED files into methylKit objects for analysis.

Parameter Type Default Description
--speciesstringrequiredSpecies name (must match sample sheet)
--sample_sheetfilerequiredPath to sample metadata (Excel or CSV)
--bed_dirdirrequiredDirectory containing CpG BED files
--output_dirdirrequiredOutput directory for processed .qs files
--assemblystringautoGenome assembly name
--coresint20Number of CPU cores for parallel processing
--min_coverageint1Minimum read coverage per CpG site
--force_convertflagfalseForce re-conversion of existing files
--no_parallelflagfalseDisable parallel processing

Example:

Rscript MethylSense_load_data.R \
  --species "Gallus_gallus" \
  --sample_sheet ./samples.xlsx \
  --bed_dir ./bed_files \
  --output_dir ./preprocessed \
  --cores 8

MethylSense_analysis.R

Main analysis script for DMR detection and ML classifier training.

Core parameters

Parameter Type Default Description
--qs_filefilerequiredPath to preprocessed methylRawList .qs file
--sample_sheetfileSample metadata (Excel/CSV) for covariate adjustment
--output_dirdirrequiredOutput directory for results
--window_filefilePath to a single genomic window BED file (e.g., windows_5kb.bed)
--bed_filesstringComma-separated BED files for genomic windows (alternative to --window_file)
--region_dirdirDirectory containing multiple window BED files (alternative to --window_file)
--modelsstringrf,svm,glmnet,nnet,knn,lda,nbML models to train
--coresint4CPU cores for parallel model training

Treatment group options

Parameter Type Default Description
--group_namesstringautoCustom group names, comma-separated
--group_colorsstringautoCustom colours matching group_names
--positive_classstringName of positive class for binary metrics
--treatment_mappingstringMap numeric codes to names (e.g., 0=Control,1=Infected)
--min_group_sizeint4Minimum samples per group
--allow_small_groupsflagfalseAllow groups with fewer samples (risky)

Cross-validation & performance

Parameter Type Default Description
--cv_foldsint10Cross-validation folds
--cv_repeatsint0Number of CV repetitions (5-10 recommended for publication)
--nested_cvflagfalseEnable nested cross-validation (gold standard for publication)
--nested_cv_repeatsint1Number of times to repeat the entire nested k-fold CV with different seeds. For publication: use 5. Total evaluations = cv_repeats × nested_cv_repeats
--test_ratiofloat0.3Proportion of data for test set (70:30 train:test split)
--train_test_filefileExcel file with predefined train/test splits
--min_accuracyfloat0.7Minimum accuracy to save trained models
--reproducibleflagfalseFixed seeds for reproducibility

DMR detection & filtering

Parameter Type Default Description
--qvalue_thresholdfloat0.05FDR threshold for DMR significance
--min_coverageint20Minimum read coverage per CpG
--hyper_thresholdfloat5.0Minimum % for hypermethylated DMRs
--hypo_thresholdfloat-2.0Maximum % for hypomethylated DMRs
--disable_meth_filterflagfalseDisable methylation filtering (q-value only)
--covariate_colsstringComma-separated sample sheet columns for DMR covariate adjustment (e.g., 'Batch,Study')
--exclude_sex_chromsflagfalseExclude X, Y, Z, W chromosomes from markers
--filter_standard_chromsflagfalseFilter to standard chromosomes only (1-99 + sex)
--min_samples_per_regionfloat1.0Minimum sample fraction per region

Marker set size

Parameter Type Default Description
--max_markersint100Maximum number of DMRs to use as features
--min_markersint8Minimum number of DMRs to start training
--marker_stepint1Step size for marker iteration
--marker_rangestringAlternative: specify range as min:max (e.g., 2:50)

Plot output

Parameter Type Default Description
--create_plotsflagtrueGenerate visualisations
--no_plotsflagfalseDisable plot generation
--plot_formatstringpngOutput format: png, jpg, svg, pdf, all
--plot_dpiint300Resolution (300 standard, 600 high-res, 150 draft)

Other options

Parameter Type Default Description
--force_mlflagfalseForce re-run even with cached results
--verboseflagfalseEnable verbose output
--debugflagfalseEnable debug mode

Model shortcuts: --models all (all 9), --models fast (glmnet,ranger,lda,nb), --models biomarker (glmnet,rf)

Example — Full research analysis (Drag et al. 2025 setup):

Rscript MethylSense_analysis.R \
  --qs_file ./preprocessed/Chicken_n124.qs \
  --output_dir ./results/Chicken_Aspergillosis \
  --region_dir ./windowed_genomes \
  --group_names 'Control,Aspergillus_Infected' \
  --group_colors 'darkgreen,darkred' \
  --positive_class 'Aspergillus_Infected' \
  --treatment_mapping '0=Control,1=Aspergillus_Infected' \
  --min_accuracy 0.8 \
  --marker_range '2:50' \
  --hyper_threshold 5.0 --hypo_threshold -2.0 \
  --min_coverage 10 \
  --cv_repeats 10 --nested_cv --nested_cv_repeats 5 \
  --test_ratio 0.4 \
  --exclude_sex_chroms --filter_standard_chroms \
  --models 'nnet,glmnet,lda,knn,rf,nb,svm' \
  --cores 8 \
  --plot_format 'all' --plot_dpi 300 \
  --force_ml

Example — Quick test run:

Rscript MethylSense_analysis.R \
  --qs_file ./data.qs \
  --output_dir ./quick_test \
  --group_names Control,Disease \
  --models fast \
  --max_markers 20

MethylSense_predict.R

Applies trained models to classify new samples.

Parameter Type Default Description
--qs_filefilerequiredPath to preprocessed samples (.qs)
--model_dirdirrequiredDirectory containing trained model
--output_dirdirrequiredOutput directory for predictions
--sample_idsstringallComma-separated sample IDs to predict
--min_coverageint10Minimum coverage per CpG
--confidence_thresholdfloat0.7Confidence threshold for classification
--plotsflagfalseGenerate diagnostic plots
--batch_modeflagfalseBatch mode (suppress individual plots)
--verboseflagfalseEnable verbose output

Example:

Rscript MethylSense_predict.R \
  --qs_file ./new_samples/samples.qs \
  --model_dir ./results/rf/markers_10 \
  --output_dir ./predictions \
  --plots --confidence_threshold 0.8

MethylSense_reviewer.R

Generates comprehensive model evaluation reports for publication.

Core parameters

Parameter Type Default Description
--model_dirdirrequiredPath to trained model directory
--qs_filefilerequiredPath to methylRawList .qs file
--sample_sheetfilerequiredPath to sample metadata
--output_subdirstringmodel_reviewOutput subdirectory name
--model_namestringautoCustom model name for report title
--n_coresint4Number of CPU cores
--verboseflagfalseEnable verbose output

Statistical analysis

Parameter Type Default Description
--bootstrap_iterationsint500Bootstrap iterations for CI estimation
--stability_thresholdfloat0.70Feature stability threshold
--effect_size_thresholdfloat2.0Effect size threshold for significance
--min_effect_sizefloat5.0Minimum effect size (methylation %)
--multiple_testingstringfdrCorrection: fdr, bonferroni, holm, BY, none
--significance_thresholdfloat0.05P-value significance threshold
--control_groupstringautoReference group for logistic regression

Batch & time effects

Parameter Type Default Description
--batch_columnstringColumn for batch/platform variable
--storage_columnstringColumn for storage duration variable
--date_columnstringColumn for collection/sampling date
--time_analysisstringColumn for timepoint/longitudinal data
--time_categoricalflagfalseTreat time as categorical (vs continuous)
--covariate_colsstringComma-separated column names from sample sheet to adjust for confounders during DMR calling (e.g., Batch,Storage_Days)

Network & annotation

Parameter Type Default Description
--enable_wgcnaflagfalseEnable WGCNA co-methylation network analysis
--enable_gene_annotationflagfalseEnable biomaRt gene annotation
--speciesstringSpecies for annotation (e.g., 'Gallus gallus')
--enable_feature_annotationflagfalseEnable genomic feature annotation
--gtf_filefileGTF/GFF file for gene features
--regulatory_gfffileRegulatory features GFF (enhancers, CTCF)
--emar_gfffileEMAR regions GFF file
--open_chromatin_bedfileOpen chromatin BED file
--functional_enrichmentflagfalseEnable functional enrichment analysis

PCR validation

Parameter Type Default Description
--prepare_pcrflagfalseGenerate PCR primer design links
--pcr_n_dmrsint20Number of top DMRs for PCR validation

Report output

Parameter Type Default Description
--generate_markdownflagtrueGenerate markdown report
--generate_pdfflagfalseGenerate PDF report
--markdown_titlestringautoCustom report title
--markdown_authorstringAuthor name(s) for report
--figure_formatstringpngOutput format: png, jpg, svg, all
--figure_resolutionint300Figure resolution (DPI)
--group_colorsstringautoCustom colors for PCA/UMAP plots
--blind_pathsflagfalseHide full paths in reports for privacy

Example — Full research evaluation (Drag et al. 2025 setup):

Rscript MethylSense_reviewer.R \
  --model_dir ./results/svm/markers_35 \
  --qs_file ./preprocessed/Chicken_n124.qs \
  --sample_sheet ./AsperMerged_v3.xlsx \
  --model_name "Fast Model Diagnostic Test" \
  --covariate_cols "SEQUENCER,cfDNA_conc_ng_uL,number_of_reads,mapping_percentage,mean_coverage" \
  --time_analysis Sample_Week \
  --enable_wgcna \
  --bootstrap_iterations 1000 \
  --stability_threshold 0.75 \
  --effect_size_threshold 2.0 \
  --min_effect_size 5.0 \
  --multiple_testing fdr \
  --n_cores 8 \
  --prepare_pcr \
  --functional_enrichment \
  --enable_feature_annotation \
  --gtf_file ./genomes/Gallus_gallus.GRCg7b.108.gtf \
  --regulatory_gff ./genomes/Gallus_gallus.GRCg7b.regulatory_features.gff3 \
  --emar_gff ./genomes/Gallus_gallus.GRCg7b.EMARs.gff \
  --open_chromatin_bed ./genomes/openchrom-union.bed \
  --figure_format "jpg,svg" \
  --blind_paths \
  --generate_pdf \
  --verbose

Example — Quick batch effect check:

Rscript MethylSense_reviewer.R \
  --model_dir ./results/rf/markers_20 \
  --qs_file ./data.qs \
  --sample_sheet ./metadata.xlsx \
  --batch_column Platform \
  --storage_column StorageDays

MethylSense_general_data_overview.R

Creates visualisations of the methylation data landscape.

Parameter Type Default Description
--analysis_dirdirrequiredPath to analysis directory
--sample_sheetfilerequiredPath to sample metadata
--region_sizesstring1000,5000,10000,15000,20000,25000Region sizes (bp)
--plot_formatstringpngOutput format: png, jpg, svg, pdf
--plot_widthfloat12Plot width in inches
--plot_heightfloat8Plot height in inches
--plot_dpiint300Resolution for raster formats
--volcano_meth_thresholdfloat5Methylation difference threshold (%)

Example — High-resolution SVG for publication:

Rscript MethylSense_general_data_overview.R \
  --analysis_dir ./results \
  --sample_sheet ./metadata.xlsx \
  --region_sizes "5000,10000,25000" \
  --plot_format svg --plot_dpi 600 \
  --plot_width 14 --plot_height 10

Plot Output Specifications

For publication-quality figures:

Script Default format Default DPI Recommended for print
MethylSense_analysis.R PNG 300 --plot_format svg --plot_dpi 600
MethylSense_reviewer.R PNG 300 --figure_format 'png,svg' --figure_resolution 600 --generate_pdf
MethylSense_general_data_overview.R PNG 300 --plot_format svg --plot_dpi 600

Common journal requirements:

  • Nature/Science: 300-600 DPI, prefer vector formats (SVG/PDF)
  • PLoS/BMC: 300 DPI minimum, TIFF or PNG acceptable
  • Elsevier: 300 DPI for halftones, 1000 DPI for line art

All MethylSense scripts generate figures at 300 DPI by default, suitable for most journal requirements.


Cross-Validation Methodology

MethylSense implements two complementary cross-validation strategies for model evaluation. Both are controlled through command-line parameters and generate separate output files.

Standard Monte Carlo Cross-Validation (MCCV)

Enabled with --cv_repeats N (where N > 0). This is the default and faster method.

Each repeat draws a new random 60/40 stratified split. The model is trained on 60% of the data with internal k-fold CV for hyperparameter tuning, and evaluated on the held-out 40%. Metrics are averaged across all N repeats with 95% confidence intervals.

# Standard MCCV with 10 repeats
Rscript MethylSense_analysis.R \
  --qs_file ./data.qs \
  --output_dir ./results \
  --cv_repeats 10

Output: cv_detailed_*.csv, cv_summary_*.csv

Nested Cross-Validation (Gold Standard)

Enabled by adding --nested_cv. This is the gold standard for unbiased performance estimation suitable for publication (Varma & Simon, 2006).

Nested CV uses two layers of cross-validation:

  • Outer loop: K stratified, non-overlapping folds (set by --cv_repeats). Each fold is held out as a test set exactly once.
  • Inner loop: Separate k-fold CV within the outer training set for hyperparameter tuning. The outer test fold is never seen during tuning.

The entire k-fold procedure can be repeated N times with different random seeds (set by --nested_cv_repeats), producing K × N total evaluations for more robust estimates.

# Publication-quality nested CV: 10 folds × 5 repetitions = 50 evaluations
Rscript MethylSense_analysis.R \
  --qs_file ./data.qs \
  --output_dir ./results \
  --cv_repeats 10 \
  --nested_cv \
  --nested_cv_repeats 5

Output: nested_cv_detailed_*.csv, nested_cv_summary_*.csv, nested_cv_hyperparameters_*.csv, plus nested CV visualisation plots

Comparison

Aspect Standard MCCV Nested CV
Outer splitRandom 60/40 (overlapping across repeats)K-fold, non-overlapping
Hyperparameter tuningInternal CV on train splitSeparate inner CV loop on outer train only
Test data isolationGood (not used for tuning)Complete (fully isolated from all tuning)
BiasPotentially mildly optimisticUnbiased (gold standard)
SpeedFast5–10× slower
Use caseDevelopment and explorationPublication-quality results

When --nested_cv is enabled, MethylSense runs both methods automatically and calculates the optimism bias (the difference between standard and nested CV accuracy). If the bias exceeds 5%, a warning is raised suggesting the nested CV results should be used for publication.


Clinical Applications

This section provides comprehensive guidance for veterinarians, diagnosticians, and clinicians using MethylSense for disease diagnosis. The pipeline translates complex epigenetic data into actionable clinical information.

Understanding cfDNA-Based Diagnosis

Cell-free DNA (cfDNA) is released into the bloodstream when cells die, carrying the methylation patterns of their tissue of origin. During infection, pathogen-induced host cell damage releases cfDNA with characteristic methylation signatures that differ from healthy individuals. MethylSense detects these disease-associated methylation patterns to classify samples as infected or healthy.

The diagnostic approach is fundamentally different from pathogen detection methods (PCR, culture): rather than detecting the pathogen directly, MethylSense detects the host's epigenetic response to infection. This provides complementary diagnostic information and may detect infection earlier or in cases where pathogen load is below detection limits of conventional methods.

Sample Types and Collection

MethylSense can analyse cfDNA from both serum and plasma. Both sample types contain host cfDNA suitable for methylation analysis:

Serum: Blood allowed to clot before centrifugation. May contain slightly higher cfDNA concentrations due to cell lysis during clotting. Used in the original validation study (Drag et al., 2025).

Plasma: Blood collected in anticoagulant tubes (EDTA, citrate, or heparin) and centrifuged before clotting. May have lower background from lysed blood cells. Either sample type is suitable, but consistency within a study is recommended.

Sample Processing Specifications

For avian aspergillosis diagnosis, the following specifications are based on Drag et al. (2025):

Parameter Specification
Sample typeSerum or plasma (cell-free DNA)
Volume200 µL minimum (serum or plasma)
Pre-extractionCentrifugation 2 min at 400 × g to remove residual cells
DNA extractionNorgen Biotek cfDNA Micro Kit (cat. 55500) or equivalent cfDNA kit
Quality controlBioanalyzer 2100, High Sensitivity DNA kit (confirm cfDNA fragment profile)
Storage-80°C (extracted cfDNA stable for months)
Library preparationNative Barcoding Ligation Kit (SQK-NBD114.24), 11 µL cfDNA input
SequencerPromethION P2 Solo or MinION (any ONT platform)
Flow cellR10.4.1 (version 14 chemistry)
BasecallingGuppy v6.5.7 or Dorado, High Accuracy mode
Methylation modeldna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg
Methylation callerONT modkit v0.1.2 or later, pileup mode with cpg combine mods
Reference genomeGallus gallus (bGalGal1.mat.broiler.GRCg7b) or appropriate species

Note: During library preparation, bead clean-up steps should be multiplied by 2.5× to improve recovery of short cfDNA fragments, as recommended by Martignano et al. (2023).

For complete methodology details, see Drag et al. (2025).

Clinical Workflow

The complete diagnostic workflow from sample collection to clinical decision:

  1. Sample collection — Collect 200 µL serum or plasma from patient. For serum, allow blood to clot at room temperature for 30-60 minutes before centrifugation. For plasma, collect in anticoagulant tube and centrifuge promptly.

  2. Sample processing — Centrifuge to separate serum/plasma from cells. Perform a second centrifugation at 400 × g for 2 minutes to remove residual cells that could contaminate cfDNA.

  3. DNA extraction — Extract cell-free DNA using a kit designed for cfDNA recovery (low input, short fragments). Standard genomic DNA kits are not suitable as they lose short cfDNA fragments.

  4. Quality control — Verify cfDNA profile on Bioanalyzer. Successful extraction shows characteristic cfDNA peak at approximately 160-180 bp (mononucleosomal fragment).

  5. Sequencing — Prepare Oxford Nanopore library with methylation-aware protocol. Sequence on MinION, GridION, or PromethION with R10.4.1 flow cells.

  6. Data processing — Run basecalling with methylation model enabled. Process with NanoporeToBED-Pipeline to generate CpG BED files suitable for MethylSense.

  7. Diagnosis — Run MethylSense_predict.R with your trained model. The script outputs a diagnostic classification with confidence score for each sample.

  8. Interpretation — Review prediction report. High confidence predictions (>80%) can inform clinical decisions. Medium or low confidence results warrant additional confirmatory testing.

Interpreting Diagnostic Results

The prediction output includes a probability score for each sample indicating how strongly the methylation profile matches each class. The confidence score represents the maximum predicted probability across all classes:

Confidence Level Probability Range Clinical Interpretation
High>80%Strong methylation signal consistent with predicted class. Prediction is reliable and can inform clinical decisions.
Medium60-80%Moderate methylation signal. Consider confirmatory testing (culture, PCR, endoscopy, imaging) before treatment decisions.
Low<60%Weak or ambiguous methylation signal. Do not use for clinical decisions. Investigate sample quality or consider resequencing.

Causes of Low Confidence Predictions

Low confidence predictions warrant investigation before interpretation:

  • Sample quality issues — Degraded cfDNA, insufficient DNA input, or contamination with genomic DNA from lysed blood cells
  • Technical factors — Low sequencing coverage, library preparation failure, or basecalling errors
  • Biological factors — Early-stage infection before host response develops, atypical disease presentation, or immunocompromised patients with altered epigenetic responses
  • Species mismatch — Sample from a species different to the training data population

Pre-trained Models

Pre-trained diagnostic models are available for download:

Avian Aspergillosis (Chicken, n=124)

Additional species and disease models: Contact for availability at [email protected]

Regulatory Notice

MethylSense is a research tool and has not been approved by FDA, EMA, or other regulatory bodies for clinical diagnostic use. Diagnostic decisions should be made in conjunction with other clinical findings and established diagnostic methods.

Models trained on one species may not generalise to other species without proper validation. Cross-species application requires separate validation studies before clinical use.


Troubleshooting

Common Issues

Package not found

Rscript MethylSense_installer.R

Not enough samples per group

Rscript MethylSense_analysis.R ... --allow_small_groups

Small groups (<10 samples) may produce unreliable or overfitted results. We recommend a minimum of 20 samples per group for robust classifier training.

Memory error

# Use larger windows (fewer features):
--window_file windows_10kb.bed

# Use fewer models:
--models rf,svm

Training too slow

--models fast
--cv_repeats 3

Performance Benchmarks

Step Runtime Memory
load_data5-10 min2-4 GB
general_overview5-15 min2-6 GB
train (no CV)20-30 min4-8 GB
train (10× CV)60-90 min4-8 GB
train (nested CV)2-3 hours4-8 GB
predict1-2 min1-2 GB
reviewer5-10 min2-4 GB

Benchmarks for 30-50 samples, 2 groups, 8 CPU cores.


Citation

If you use MethylSense in your research, please cite:

Drag MH, Hvilsom C, Poulsen LL, Jensen HE, Tahas SA, Leineweber C, Cray C, Bertelsen MF, Bojesen AM. MethylSense: high accuracy machine learning-based diagnostics for Aspergillus fumigatus infection in chickens using host cell-free DNA methylation and Nanopore sequencing. J Clin Microbiol 0:e01054-25. https://doi.org/10.1128/jcm.01054-25


Acknowledgements

MethylSense builds upon excellent work from the R and Bioconductor communities.

Core Dependencies

methylKit (Akalin et al., 2012) — The foundation for DMR detection and methylation analysis in MethylSense. This Bioconductor package provides comprehensive tools for genome-wide DNA methylation analysis from bisulfite sequencing data.

Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, Mason CE (2012). methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology, 13, R87. https://doi.org/10.1186/gb-2012-13-10-r87

caret (Kuhn, 2008) — The machine learning framework powering model training, cross-validation, and hyperparameter tuning in MethylSense.

Kuhn M (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1-26. https://doi.org/10.18637/jss.v028.i05

pROC (Robin et al., 2011) — ROC curve analysis and AUC calculations.

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Additional R Packages

MethylSense also uses: GenomicRanges (Lawrence et al., 2013), ggplot2 (Wickham, 2016), randomForest (Liaw & Wiener, 2002), xgboost (Chen & Guestrin, 2016), glmnet (Friedman et al., 2010), e1071, ranger, pheatmap, and RColorBrewer.

References

Expand full reference list

DNA Methylation

  • Bird A (2002). DNA methylation patterns and epigenetic memory. Genes & Development, 16(1), 6-21.
  • Jones PA (2012). Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics, 13(7), 484-492.

Nanopore Methylation

  • Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nature Methods, 14(4), 407-410.

Statistical Methods

  • Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
  • Varma S, Simon R (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.

Related Pipelines


License and Authors

License: Academic Free License 3.0 (AFL-3.0). See LICENSE.

Lead Developer: Markus Hodal Drag

Co-authors: Christina Hvilsom, Louise Ladefoged Poulsen, Henrik Elvang Jensen, Stamatios Alan Tahas, Christoph Leineweber, Carolyn Cray, Mads Frost Bertelsen, Anders Miki Bojesen

Contact: [email protected] | GitHub Issues


Developed for Oxford Nanopore methylation analysis

About

MethylSense is a powerful automated workflow to develop new diagnostic test based on ML-analysis of methylation data

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages