GitHub - markusdrag/MethylSense: MethylSense is a powerful automated workflow to develop new diagnostic test based on ML-analysis of methylation data

High-accuracy epigenetic diagnostics via Nanopore methylation sequencing

Overview

MethylSense is a sort-of "semi-automated" R-based bioinformatics pipeline for training, testing, and cross-validation of machine learning models to predict the presence of clinical infections or inflammatory processes using (drum roll) epigenetics! After feeding Nanopore or Illumina methylation data and running the main analysis script, it will compute associated DMRs with your condition and find the best predictive ML models for diagnosing new cases and samples. You will also gain insights into the epigenome and generate publication-ready figures in short time - and lots more!

So what does MethylSense actually do?

Detects differentially methylated regions (DMRs): Identifies custom sized genomic regions with statistically significant methylation differences between multiple groups (e.g., Control, Infected, Suspected) using the methylKit R package (Akalin et al., 2012) with false discovery rate (FDR) correction. Also supports covariates.
Trains machine learning classifiers: Builds diagnostic models using nine machine learning algorithms via the caret framework (Kuhn, 2008), including Random Forest, SVM, XGBoost, and Elastic Net. Key features include Monte Carlo cross-validation and nested cross-validation for robust, unbiased model evaluation.
Diagnoses new samples: Applies trained models to classify patient samples with probability scores and confidence estimates to support clinical decision-making.
Generates publication-ready reports: Creates comprehensive evaluation reports with ROC curves, confusion matrices, Random Forest decision-tree graphics (and other model-specific plots), and performance metrics suitable for publication. You will have a lot of figures ready!

Originally developed for avian Aspergillus fumigatus infection diagnosis using host cell-free DNA methylation, achieving ~92% accuracy with Monte Carlo cross-validation.

Prerequisites

Upstream Pipelines

MethylSense requires processed Nanopore methylation data. You must run these pipelines first to prepare your sequencing data:

NanoporeToBED-Pipeline (GitHub)

This pipeline converts raw Nanopore sequencing output (modBAM files from Guppy or Dorado basecallers) into standardised CpG BED files. It extracts 5-methylcytosine (5mC) modification calls at CpG dinucleotide sites, preserving methylation frequencies and read coverage information. Run this pipeline first before MethylSense.

GenomeToWindows (GitHub)

This utility generates genomic window BED files (typically 1kb, 5kb, 10kb, or 25kb) from a reference genome. These windows define the regions for DMR detection. Aggregating CpG-level data into larger genomic windows reduces noise from individual CpG site variation and provides sufficient statistical power for differential methylation testing.

System Requirements

R version: 4.0 or higher (4.2+ recommended)
Operating system: Linux, macOS, or Windows

Running MethylSense from the terminal:

Platform	How to run
Linux	Use the native Terminal application
macOS	Use the native Terminal app (Applications → Utilities → Terminal)
Windows	Use MobaXterm (recommended) or WSL2 with Ubuntu

Note for Windows users: MobaXterm provides a Unix-like terminal environment with built-in X11 forwarding for R plots. Install R within MobaXterm's environment or use WSL2 with a native Linux R installation for best compatibility.

Installation

Quick Install

git clone https://github.com/markusdrag/MethylSense.git
cd MethylSense

Rscript MethylSense_installer.R

chmod +x MethylSense_*.R

The installer script automatically installs all required packages from both CRAN and Bioconductor. Installation typically takes 15-45 minutes depending on your system and existing packages. The script will report success or failure for each package and provide troubleshooting guidance if needed.

Manual Installation

If you prefer manual installation, MethylSense depends on packages from both CRAN and Bioconductor:

# CRAN packages
install.packages(c(
  "optparse", "qs", "caret", "pROC", "ggplot2", "hrbrthemes", "dplyr", "tidyr",
  "randomForest", "e1071", "glmnet", "nnet", "class", "MASS",
  "klaR", "ranger", "xgboost", "pheatmap", "RColorBrewer",
  "data.table", "readxl", "reshape2", "viridis", "corrplot",
  "factoextra", "ggridges", "gridExtra", "ggrepel", "naivebayes",
  "tree", "rpart", "rpart.plot", "cluster", "jsonlite", "MLmetrics",
  "doParallel", "foreach", "pls", "PRROC", "igraph", "umap",
  "dendextend", "lme4", "nlme", "MuMIn", "ez", "fossil"
))

# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("methylKit", "GenomicRanges", "genomation", "regioneR", "IRanges"))

Verify Installation

Rscript -e "library(methylKit); library(caret); library(xgboost)"
Rscript MethylSense_load_data.R --help

Sample Sheet Format

MethylSense requires a sample metadata file (Excel .xlsx or comma-separated .csv) that describes your samples, their group assignments, and file locations. This information is essential for the pipeline to correctly load, process, and analyse your data.

Required columns

These columns must be present in your sample sheet:

Column	Description	Example
ID	Unique sample identifier (must match BED filenames)	sample_001
Species	Species name used for filtering samples	Gallus_gallus
bedFileOrg	Original BED filename from NanoporeToBED	sample_001.CpG.bed
bedFile	Output filename for converted BED (user-specified)	sample_001_8col.bed
treatMethylkit	Numeric group code encoding each sample's treatment or infection status for methylKit. Each unique integer represents a distinct group (e.g., 0 = Control, 1 = Suspected, 2 = Infected). These codes are mapped to human-readable names via `--group_names` in ascending numeric order, or explicitly with `--treatment_mapping` (e.g., `0=Control,1=Suspected,2=Infected`). Plot colours are then auto-assigned from the MethylSense palette based on group name keywords.	0, 1, 2

Optional columns

These columns provide additional metadata for reporting and stratification:

Column	Description	Example
Infection	Human-readable group label	Control, Infected, Suspected
Study	Study or cohort identifier for stratified analysis	Study_A
Batch	Batch identifier for covariate adjustment (use with --covariate_cols)	Batch_1, Batch_2
Pre-analytic QC (numeric)	Optional technical covariates: reviewer auto-detects columns whose names match keywords such as `storage`, `dna`, `hemolysis`, `time`, `concentration`, `extraction`, `quality`, `qc` (see script). Use several variables for meaningful histograms.	`storage_time_days`, `cfDNA_concentration_ng_uL`, `hemolysis_index`, …
Animal_ID	Individual animal identifier	Bird_42
Lab_ID	Laboratory sample identifier	LAB001

Example sample sheet

ID            Species        bedFileOrg            bedFile                treatMethylkit  Infection
control_1     Gallus_gallus  control_1.CpG.bed     control_1_8col.bed     0               Control
control_2     Gallus_gallus  control_2.CpG.bed     control_2_8col.bed     0               Control
suspected_1   Gallus_gallus  suspected_1.CpG.bed   suspected_1_8col.bed   1               Suspected
suspected_2   Gallus_gallus  suspected_2.CpG.bed   suspected_2_8col.bed   1               Suspected
infected_1    Gallus_gallus  infected_1.CpG.bed    infected_1_8col.bed    2               Infected
infected_2    Gallus_gallus  infected_2.CpG.bed    infected_2_8col.bed    2               Infected

See the example_data folder for a complete working example: sample_metadata_30.csv (and matching BED files) provides 30 samples (10 per class). You can also use sample_metadata.xlsx with the same column layout.

Quick Start

Step 1: Load and preprocess data

Runtime: 5-10 minutes

This step converts raw CpG BED files into a unified methylKit object (qs format). The pipeline normalises coverage across samples, filters low-quality CpG sites, and prepares the data for downstream analysis.

Rscript MethylSense_load_data.R \
  --species "Gallus_gallus" \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --bed_dir ./example_data \
  --output_dir ./preprocessed

What happens: Each sample's CpG methylation data is loaded, filtered by minimum coverage thresholds, and combined into a single methylRawList object for efficient analysis.

Output: YYYYMMDD_Species_nN_methylRaw.qs

Step 2: Train diagnostic models

Runtime: 1-2 hours (with 10-fold cross-validation)

This is the core analysis step. MethylSense identifies differentially methylated regions (DMRs) using the methylKit package, then trains machine learning classifiers to distinguish between your experimental groups.

Rscript MethylSense_analysis.R \
  --qs_file ./preprocessed/*_methylRaw.qs \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --output_dir ./training \
  --window_file ./example_data/windows/windows_5kb.bed \
  --models rf,svm,xgboost \
  --group_names Control,Suspected,Infected \
  --group_colors '#1B5E20,#006064,#4A148C' \
  --positive_class Infected \
  --cv_repeats 10 \
  --nested_cv \
  --nested_cv_repeats 5

Tip — Group colours: The --group_colors parameter accepts comma-separated hex codes or colour names, one per group in the same order as --group_names. If omitted, colours are auto-assigned from the MethylSense palette based on group name keywords (e.g., "Control" → green, "Infected" → purple). Override with any valid R colour, for example --group_colors 'darkgreen,orange,darkred'.

What happens:

Methylation levels are calculated within genomic windows
Statistical testing identifies DMRs with FDR correction (Benjamini-Hochberg)
DMR methylation values become features for machine learning
Multiple models are trained with cross-validation for robust performance estimates
Nested cross-validation provides unbiased accuracy estimates suitable for publication

Output: Trained models (rds files), DMR coordinates, cross-validation results, performance metrics

Optional: Data quality assessment and visualisation

Runtime: 5-15 minutes

This step is optional but recommended for exploring your DMR results and generating publication-ready visualisations. Run this after Step 2 to understand methylation patterns, assess data quality, and create comprehensive figures for presentations or publications.

Rscript MethylSense_general_data_overview.R \
  --analysis_dir ./training/training_* \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --region_sizes "5000" \
  --output_dir ./data_overview \
  --plot_format png,pdf \
  --infection_col Infection \
  --sample_id_col ID \
  --study_col Study

What happens: The script analyses your DMR results and generates publication-ready figures including:

DMR counts across different window sizes
Volcano plots showing effect sizes and significance
Coverage distribution analysis (quality control)
Chromosomal distribution of DMRs
Effect size characterisation (hypermethylated vs hypomethylated)
Sample-level methylation heatmaps for top DMRs

Output: Comprehensive report with figures and statistics in data_overview/

When to use: Run this after Step 2 to visualise DMR results, compare different window sizes, and generate exploratory figures. This is particularly useful for understanding your data before selecting the best model for clinical diagnosis.

Step 3: Model evaluation and selection

Runtime: 5-10 minutes

Before using a model for clinical diagnosis, evaluate its performance to select the best model. This step generates comprehensive reports with publication-ready figures to help you choose the optimal diagnostic model.

# Find a trained model directory (pick one based on marker count)
Rscript MethylSense_reviewer.R \
  --model_dir ./training/training_*/windows_5kb/rf/markers_5 \
  --qs_file ./preprocessed/*_methylRaw.qs \
  --sample_sheet ./example_data/sample_metadata_30.csv \
  --sample_id_column ID \
  --treatment_column Infection \
  --output_subdir ./report

What happens: The reviewer analyses model performance on the training data, generates ROC curves, confusion matrices, and detailed reports. Compare performance across different models (rf, svm, xgboost) to select the best one for your diagnostic application.

Output: MODEL_EVALUATION_REPORT.md, MODEL_EVALUATION_REPORT.html, figures, tables

Step 4: Diagnose new samples

Runtime: 5-10 minutes (including preprocessing)

Apply your trained model to classify new unknown samples. This is the real-world clinical workflow where you have new patient samples that need diagnosis.

Step 4a: Generate unknown example samples (Tutorial only)

For this tutorial, we provide a script to generate 3 "unknown" samples with hidden labels:

Rscript generate_unknown_samples.R

This creates example_data/unknown_samples/ with 3 BED files and a metadata sheet.

Step 4b: Preprocess new samples

Preprocess the unknown samples just like in Step 1. For unknown samples, the treatMethylkit and Infection columns should be NA:

Rscript MethylSense_load_data.R \
  --species "Gallus_gallus" \
  --sample_sheet ./example_data/unknown_samples/unknown_samples_metadata.xlsx \
  --bed_dir ./example_data/unknown_samples \
  --output_dir ./unknown_preprocessed

Step 4c: Run predictions

Rscript MethylSense_predict.R \
  --model_dir ./training/training_*/windows_5kb/rf/markers_5 \
  --qs_file ./unknown_preprocessed/*_methylRaw.qs \
  --output_dir ./unknown_predictions \
  --plots

What happens:

The prediction script loads your trained model and its DMR coordinates
It extracts methylation values from the new samples at those specific DMR regions
The model predicts each sample's infection status with a probability score
Samples with confidence ≥0.7 are classified; lower confidence samples are flagged for review

Output:

*_predictions_detailed.csv - Full predictions with probabilities for each class
*_predictions_clinical_report.csv - Simplified report for clinical use
*_prediction_summary.txt - Overview of prediction results
Diagnostic plots (if --plots flag used)

Tutorial Checkpoint: Check your predictions! The 3 unknown samples have hidden true labels (Control, Infected, Suspected). Compare with your model's predictions to see accuracy.

Example prediction output

MethylSense produces detailed predictions with class probabilities for each sample:

Sample_ID	True_Status	Predicted_Class	Confidence	Prob_Control	Prob_Suspected	Prob_Infected
unknown_1	Control	Control	0.948 (High)	0.948	0.052	0.000
unknown_2	Infected	Infected	0.850 (High)	0.006	0.144	0.850
unknown_3	Suspected	Suspected	0.820 (High)	0.002	0.820	0.178

All 3 samples correctly classified with high confidence (≥0.7). The model correctly identified one sample from each class.

Example Output

MethylSense generates publication-ready visualisations across its pipeline scripts. The examples below were generated from a 3-class analysis (Control, Suspected, Infected) using 30 simulated samples with 10-fold nested cross-validation. Many additional plots are generated by each script—see the output directories for the full suite.

From MethylSense_analysis.R

Plots generated during model training (use --cpg_report for CpG-level plots).

Confusion matrix	DMR barplot	Group methylation summary
Methylation heatmap	Prediction probabilities	Random Forest — example decision tree Illustrative classification tree for the selected DMR features
Genome-wide DMR Manhattan plot
CpG-level profile (DMR #4)

From MethylSense_reviewer.R

Model evaluation plots generated when reviewing a trained model.

UMAP dimensionality reduction

Sample clustering based on methylation biomarkers

Multiclass ROC curves

One-vs-rest AUC for each class

DMR marker stability

Bootstrap selection frequency for each DMR marker

Marker effect size vs stability

DMR selection consistency against methylation difference

Preanalytic QC - technical covariate associations

Linear model analysis of technical covariates vs DMR methylation with FDR correction

CpG-level Manhattan plot

Genome-wide significance of individual CpG sites within DMRs

From MethylSense_general_data_overview.R

Exploratory plots providing a landscape view of DMR results.

Volcano plot

Effect size vs significance with top DMRs labeled

Methylation by infection status

Ridge plot of methylation distribution by group

Pipeline Components

Core scripts

Script	Purpose	Key Output
MethylSense_load_data.R	Preprocess BED files into methylKit format	qs methylRawList object
MethylSense_analysis.R	DMR detection and ML model training	Trained models, DMR coordinates
MethylSense_predict.R	Clinical diagnosis of new samples	Predictions with confidence scores
MethylSense_reviewer.R	Clinical interpretation and reporting	Evaluation reports and figures

Optional scripts

Script	Purpose
MethylSense_general_data_overview.R	Data quality assessment and exploratory visualisation
MethylSense_installer.R	Complete package installer with verification

Available machine learning models

MethylSense supports nine machine learning algorithms, implemented through the caret framework (Kuhn, 2008):

Model	Code	Speed	Best Use Case
Random Forest	rf	Medium	General purpose, feature importance analysis
Support Vector Machine	svm	Medium	High accuracy classification
XGBoost	xgboost	Medium	Best overall accuracy
Elastic Net	glmnet	Fast	Biomarker discovery, interpretable models
Neural Network	nnet	Medium	Complex non-linear patterns
k-Nearest Neighbours	knn	Fast	Simple datasets, baseline comparisons
Linear Discriminant Analysis	lda	Fast	Highly interpretable results
Naive Bayes	nb	Fast	Baseline comparisons
Fast Random Forest	ranger	Fast	Large datasets

Model shortcuts: --models all, --models fast, --models biomarker

Output Structure

After MethylSense_analysis.R

output_directory/
├── analysis_summary.csv          # Complete results for all models
├── analysis_settings.txt         # Reproducibility settings
├── ANALYSIS_REPORT.md            # Summary report (Markdown)
├── ANALYSIS_REPORT.html          # Summary report (HTML)
├── run_log.txt                   # Complete execution log
├── ml_training_logs/             # Detailed ML training diagnostics
├── train_test_splits/            # Data partition information
├── precomputed_data/             # Cached intermediate results
├── complete_dmr_datasets/        # DMR statistics per window size
├── cv_analysis_plots/            # Cross-validation visualisations
└── models/                       # Trained model files
    ├── rf_model/
    │   ├── model.rds             # Serialised model object
    │   ├── dmr_coords.csv        # DMR coordinates used
    │   └── model_summary.txt     # Model performance summary
    ├── svm_model/
    └── xgboost_model/

After MethylSense_reviewer.R

report_directory/
├── MODEL_EVALUATION_REPORT.md    # Comprehensive clinical report
├── MODEL_EVALUATION_REPORT.html  # Interactive HTML version
├── MODEL_EVALUATION_REPORT.pdf   # Publication-ready PDF
├── figures/                      # All visualisations
│   ├── roc_curves/
│   ├── confusion_matrices/
│   ├── feature_importance/
│   └── methylation_heatmaps/
└── tables/                       # Summary statistics

Command-Line Reference

All MethylSense scripts support --help to display full parameter lists. This section documents parameters for publication reporting and advanced usage.

MethylSense_load_data.R

Preprocesses raw BED files into methylKit objects for analysis.

Parameter	Type	Default	Description
--species	string	required	Species name (must match sample sheet)
--sample_sheet	file	required	Path to sample metadata (Excel or CSV)
--bed_dir	dir	required	Directory containing CpG BED files
--output_dir	dir	required	Output directory for processed .qs files
--assembly	string	auto	Genome assembly name
--cores	int	20	Number of CPU cores for parallel processing
--min_coverage	int	1	Minimum read coverage per CpG site
--force_convert	flag	false	Force re-conversion of existing files
--no_parallel	flag	false	Disable parallel processing

Example:

Rscript MethylSense_load_data.R \
  --species "Gallus_gallus" \
  --sample_sheet ./samples.xlsx \
  --bed_dir ./bed_files \
  --output_dir ./preprocessed \
  --cores 8

MethylSense_analysis.R

Main analysis script for DMR detection and ML classifier training.

Core parameters

Parameter	Type	Default	Description
--qs_file	file	required	Path to preprocessed methylRawList .qs file
--sample_sheet	file	—	Sample metadata (Excel/CSV) for covariate adjustment
--output_dir	dir	required	Output directory for results
--window_file	file	—	Path to a single genomic window BED file (e.g., windows_5kb.bed)
--bed_files	string	—	Comma-separated BED files for genomic windows (alternative to --window_file)
--region_dir	dir	—	Directory containing multiple window BED files (alternative to --window_file)
--models	string	rf,svm,glmnet,nnet,knn,lda,nb	ML models to train
--cores	int	4	CPU cores for parallel model training

Treatment group options

Parameter	Type	Default	Description
--group_names	string	auto	Custom group names, comma-separated
--group_colors	string	auto	Custom colours matching group_names
--positive_class	string	—	Name of positive class for binary metrics
--treatment_mapping	string	—	Map numeric codes to names (e.g., 0=Control,1=Infected)
--min_group_size	int	4	Minimum samples per group
--allow_small_groups	flag	false	Allow groups with fewer samples (risky)

Cross-validation & performance

Parameter	Type	Default	Description
--cv_folds	int	10	Cross-validation folds
--cv_repeats	int	0	Number of CV repetitions (5-10 recommended for publication)
--nested_cv	flag	false	Enable nested cross-validation (gold standard for publication)
--nested_cv_repeats	int	1	Number of times to repeat the entire nested k-fold CV with different seeds. For publication: use 5. Total evaluations = cv_repeats × nested_cv_repeats
--test_ratio	float	0.3	Proportion of data for test set (70:30 train:test split)
--train_test_file	file	—	Excel file with predefined train/test splits
--min_accuracy	float	0.7	Minimum accuracy to save trained models
--reproducible	flag	false	Fixed seeds for reproducibility

DMR detection & filtering

Parameter	Type	Default	Description
--qvalue_threshold	float	0.05	FDR threshold for DMR significance
--min_coverage	int	20	Minimum read coverage per CpG
--hyper_threshold	float	5.0	Minimum % for hypermethylated DMRs
--hypo_threshold	float	-2.0	Maximum % for hypomethylated DMRs
--disable_meth_filter	flag	false	Disable methylation filtering (q-value only)
--covariate_cols	string	—	Comma-separated sample sheet columns for DMR covariate adjustment (e.g., 'Batch,Study')
--exclude_sex_chroms	flag	false	Exclude X, Y, Z, W chromosomes from markers
--filter_standard_chroms	flag	false	Filter to standard chromosomes only (1-99 + sex)
--min_samples_per_region	float	1.0	Minimum sample fraction per region

Marker set size

Parameter	Type	Default	Description
--max_markers	int	100	Maximum number of DMRs to use as features
--min_markers	int	8	Minimum number of DMRs to start training
--marker_step	int	1	Step size for marker iteration
--marker_range	string	—	Alternative: specify range as min:max (e.g., 2:50)

Plot output

Parameter	Type	Default	Description
--create_plots	flag	true	Generate visualisations
--no_plots	flag	false	Disable plot generation
--plot_format	string	png	Output format: png, jpg, svg, pdf, all
--plot_dpi	int	300	Resolution (300 standard, 600 high-res, 150 draft)

Other options

Parameter	Type	Default	Description
--force_ml	flag	false	Force re-run even with cached results
--verbose	flag	false	Enable verbose output
--debug	flag	false	Enable debug mode

Model shortcuts: --models all (all 9), --models fast (glmnet,ranger,lda,nb), --models biomarker (glmnet,rf)

Example — Full research analysis (Drag et al. 2025 setup):

Rscript MethylSense_analysis.R \
  --qs_file ./preprocessed/Chicken_n124.qs \
  --output_dir ./results/Chicken_Aspergillosis \
  --region_dir ./windowed_genomes \
  --group_names 'Control,Aspergillus_Infected' \
  --group_colors 'darkgreen,darkred' \
  --positive_class 'Aspergillus_Infected' \
  --treatment_mapping '0=Control,1=Aspergillus_Infected' \
  --min_accuracy 0.8 \
  --marker_range '2:50' \
  --hyper_threshold 5.0 --hypo_threshold -2.0 \
  --min_coverage 10 \
  --cv_repeats 10 --nested_cv --nested_cv_repeats 5 \
  --test_ratio 0.4 \
  --exclude_sex_chroms --filter_standard_chroms \
  --models 'nnet,glmnet,lda,knn,rf,nb,svm' \
  --cores 8 \
  --plot_format 'all' --plot_dpi 300 \
  --force_ml

Example — Quick test run:

Rscript MethylSense_analysis.R \
  --qs_file ./data.qs \
  --output_dir ./quick_test \
  --group_names Control,Disease \
  --models fast \
  --max_markers 20

MethylSense_predict.R

Applies trained models to classify new samples.

Parameter	Type	Default	Description
--qs_file	file	required	Path to preprocessed samples (.qs)
--model_dir	dir	required	Directory containing trained model
--output_dir	dir	required	Output directory for predictions
--sample_ids	string	all	Comma-separated sample IDs to predict
--min_coverage	int	10	Minimum coverage per CpG
--confidence_threshold	float	0.7	Confidence threshold for classification
--plots	flag	false	Generate diagnostic plots
--batch_mode	flag	false	Batch mode (suppress individual plots)
--verbose	flag	false	Enable verbose output

Example:

Rscript MethylSense_predict.R \
  --qs_file ./new_samples/samples.qs \
  --model_dir ./results/rf/markers_10 \
  --output_dir ./predictions \
  --plots --confidence_threshold 0.8

MethylSense_reviewer.R

Generates comprehensive model evaluation reports for publication.

Core parameters

Parameter	Type	Default	Description
--model_dir	dir	required	Path to trained model directory
--qs_file	file	required	Path to methylRawList .qs file
--sample_sheet	file	required	Path to sample metadata
--output_subdir	string	model_review	Output subdirectory name
--model_name	string	auto	Custom model name for report title
--n_cores	int	4	Number of CPU cores
--verbose	flag	false	Enable verbose output

Statistical analysis

Parameter	Type	Default	Description
--bootstrap_iterations	int	500	Bootstrap iterations for CI estimation
--stability_threshold	float	0.70	Feature stability threshold
--effect_size_threshold	float	2.0	Effect size threshold for significance
--min_effect_size	float	5.0	Minimum effect size (methylation %)
--multiple_testing	string	fdr	Correction: fdr, bonferroni, holm, BY, none
--significance_threshold	float	0.05	P-value significance threshold
--control_group	string	auto	Reference group for logistic regression

Batch & time effects

Parameter	Type	Default	Description
--batch_column	string	—	Column for batch/platform variable
--storage_column	string	—	Column for storage duration variable
--date_column	string	—	Column for collection/sampling date
--time_analysis	string	—	Column for timepoint/longitudinal data
--time_categorical	flag	false	Treat time as categorical (vs continuous)
--covariate_cols	string	—	Comma-separated column names from sample sheet to adjust for confounders during DMR calling (e.g., Batch,Storage_Days)

Network & annotation

Parameter	Type	Default	Description
--enable_wgcna	flag	false	Enable WGCNA co-methylation network analysis
--enable_gene_annotation	flag	false	Enable biomaRt gene annotation
--species	string	—	Species for annotation (e.g., 'Gallus gallus')
--enable_feature_annotation	flag	false	Enable genomic feature annotation
--gtf_file	file	—	GTF/GFF file for gene features
--regulatory_gff	file	—	Regulatory features GFF (enhancers, CTCF)
--emar_gff	file	—	EMAR regions GFF file
--open_chromatin_bed	file	—	Open chromatin BED file
--functional_enrichment	flag	false	Enable functional enrichment analysis

PCR validation

Parameter	Type	Default	Description
--prepare_pcr	flag	false	Generate PCR primer design links
--pcr_n_dmrs	int	20	Number of top DMRs for PCR validation

Report output

Parameter	Type	Default	Description
--generate_markdown	flag	true	Generate markdown report
--generate_pdf	flag	false	Generate PDF report
--markdown_title	string	auto	Custom report title
--markdown_author	string	—	Author name(s) for report
--figure_format	string	png	Output format: png, jpg, svg, all
--figure_resolution	int	300	Figure resolution (DPI)
--group_colors	string	auto	Custom colors for PCA/UMAP plots
--blind_paths	flag	false	Hide full paths in reports for privacy

Example — Full research evaluation (Drag et al. 2025 setup):

Rscript MethylSense_reviewer.R \
  --model_dir ./results/svm/markers_35 \
  --qs_file ./preprocessed/Chicken_n124.qs \
  --sample_sheet ./AsperMerged_v3.xlsx \
  --model_name "Fast Model Diagnostic Test" \
  --covariate_cols "SEQUENCER,cfDNA_conc_ng_uL,number_of_reads,mapping_percentage,mean_coverage" \
  --time_analysis Sample_Week \
  --enable_wgcna \
  --bootstrap_iterations 1000 \
  --stability_threshold 0.75 \
  --effect_size_threshold 2.0 \
  --min_effect_size 5.0 \
  --multiple_testing fdr \
  --n_cores 8 \
  --prepare_pcr \
  --functional_enrichment \
  --enable_feature_annotation \
  --gtf_file ./genomes/Gallus_gallus.GRCg7b.108.gtf \
  --regulatory_gff ./genomes/Gallus_gallus.GRCg7b.regulatory_features.gff3 \
  --emar_gff ./genomes/Gallus_gallus.GRCg7b.EMARs.gff \
  --open_chromatin_bed ./genomes/openchrom-union.bed \
  --figure_format "jpg,svg" \
  --blind_paths \
  --generate_pdf \
  --verbose

Example — Quick batch effect check:

Rscript MethylSense_reviewer.R \
  --model_dir ./results/rf/markers_20 \
  --qs_file ./data.qs \
  --sample_sheet ./metadata.xlsx \
  --batch_column Platform \
  --storage_column StorageDays

MethylSense_general_data_overview.R

Creates visualisations of the methylation data landscape.

Parameter	Type	Default	Description
--analysis_dir	dir	required	Path to analysis directory
--sample_sheet	file	required	Path to sample metadata
--region_sizes	string	1000,5000,10000,15000,20000,25000	Region sizes (bp)
--plot_format	string	png	Output format: png, jpg, svg, pdf
--plot_width	float	12	Plot width in inches
--plot_height	float	8	Plot height in inches
--plot_dpi	int	300	Resolution for raster formats
--volcano_meth_threshold	float	5	Methylation difference threshold (%)

Example — High-resolution SVG for publication:

Rscript MethylSense_general_data_overview.R \
  --analysis_dir ./results \
  --sample_sheet ./metadata.xlsx \
  --region_sizes "5000,10000,25000" \
  --plot_format svg --plot_dpi 600 \
  --plot_width 14 --plot_height 10

Plot Output Specifications

For publication-quality figures:

Script	Default format	Default DPI	Recommended for print
MethylSense_analysis.R	PNG	300	--plot_format svg --plot_dpi 600
MethylSense_reviewer.R	PNG	300	--figure_format 'png,svg' --figure_resolution 600 --generate_pdf
MethylSense_general_data_overview.R	PNG	300	--plot_format svg --plot_dpi 600

Common journal requirements:

Nature/Science: 300-600 DPI, prefer vector formats (SVG/PDF)
PLoS/BMC: 300 DPI minimum, TIFF or PNG acceptable
Elsevier: 300 DPI for halftones, 1000 DPI for line art

All MethylSense scripts generate figures at 300 DPI by default, suitable for most journal requirements.

Cross-Validation Methodology

MethylSense implements two complementary cross-validation strategies for model evaluation. Both are controlled through command-line parameters and generate separate output files.

Standard Monte Carlo Cross-Validation (MCCV)

Enabled with --cv_repeats N (where N > 0). This is the default and faster method.

Each repeat draws a new random 60/40 stratified split. The model is trained on 60% of the data with internal k-fold CV for hyperparameter tuning, and evaluated on the held-out 40%. Metrics are averaged across all N repeats with 95% confidence intervals.

# Standard MCCV with 10 repeats
Rscript MethylSense_analysis.R \
  --qs_file ./data.qs \
  --output_dir ./results \
  --cv_repeats 10

Output: cv_detailed_*.csv, cv_summary_*.csv

Nested Cross-Validation (Gold Standard)

Enabled by adding --nested_cv. This is the gold standard for unbiased performance estimation suitable for publication (Varma & Simon, 2006).

Nested CV uses two layers of cross-validation:

Outer loop: K stratified, non-overlapping folds (set by --cv_repeats). Each fold is held out as a test set exactly once.
Inner loop: Separate k-fold CV within the outer training set for hyperparameter tuning. The outer test fold is never seen during tuning.

The entire k-fold procedure can be repeated N times with different random seeds (set by --nested_cv_repeats), producing K × N total evaluations for more robust estimates.

# Publication-quality nested CV: 10 folds × 5 repetitions = 50 evaluations
Rscript MethylSense_analysis.R \
  --qs_file ./data.qs \
  --output_dir ./results \
  --cv_repeats 10 \
  --nested_cv \
  --nested_cv_repeats 5

Output: nested_cv_detailed_*.csv, nested_cv_summary_*.csv, nested_cv_hyperparameters_*.csv, plus nested CV visualisation plots

Comparison

Aspect	Standard MCCV	Nested CV
Outer split	Random 60/40 (overlapping across repeats)	K-fold, non-overlapping
Hyperparameter tuning	Internal CV on train split	Separate inner CV loop on outer train only
Test data isolation	Good (not used for tuning)	Complete (fully isolated from all tuning)
Bias	Potentially mildly optimistic	Unbiased (gold standard)
Speed	Fast	5–10× slower
Use case	Development and exploration	Publication-quality results

When --nested_cv is enabled, MethylSense runs both methods automatically and calculates the optimism bias (the difference between standard and nested CV accuracy). If the bias exceeds 5%, a warning is raised suggesting the nested CV results should be used for publication.

Clinical Applications

This section provides comprehensive guidance for veterinarians, diagnosticians, and clinicians using MethylSense for disease diagnosis. The pipeline translates complex epigenetic data into actionable clinical information.

Understanding cfDNA-Based Diagnosis

Cell-free DNA (cfDNA) is released into the bloodstream when cells die, carrying the methylation patterns of their tissue of origin. During infection, pathogen-induced host cell damage releases cfDNA with characteristic methylation signatures that differ from healthy individuals. MethylSense detects these disease-associated methylation patterns to classify samples as infected or healthy.

The diagnostic approach is fundamentally different from pathogen detection methods (PCR, culture): rather than detecting the pathogen directly, MethylSense detects the host's epigenetic response to infection. This provides complementary diagnostic information and may detect infection earlier or in cases where pathogen load is below detection limits of conventional methods.

Sample Types and Collection

MethylSense can analyse cfDNA from both serum and plasma. Both sample types contain host cfDNA suitable for methylation analysis:

Serum: Blood allowed to clot before centrifugation. May contain slightly higher cfDNA concentrations due to cell lysis during clotting. Used in the original validation study (Drag et al., 2025).

Plasma: Blood collected in anticoagulant tubes (EDTA, citrate, or heparin) and centrifuged before clotting. May have lower background from lysed blood cells. Either sample type is suitable, but consistency within a study is recommended.

Sample Processing Specifications

For avian aspergillosis diagnosis, the following specifications are based on Drag et al. (2025):

Parameter	Specification
Sample type	Serum or plasma (cell-free DNA)
Volume	200 µL minimum (serum or plasma)
Pre-extraction	Centrifugation 2 min at 400 × g to remove residual cells
DNA extraction	Norgen Biotek cfDNA Micro Kit (cat. 55500) or equivalent cfDNA kit
Quality control	Bioanalyzer 2100, High Sensitivity DNA kit (confirm cfDNA fragment profile)
Storage	-80°C (extracted cfDNA stable for months)
Library preparation	Native Barcoding Ligation Kit (SQK-NBD114.24), 11 µL cfDNA input
Sequencer	PromethION P2 Solo or MinION (any ONT platform)
Flow cell	R10.4.1 (version 14 chemistry)
Basecalling	Guppy v6.5.7 or Dorado, High Accuracy mode
Methylation model	dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg
Methylation caller	ONT modkit v0.1.2 or later, pileup mode with cpg combine mods
Reference genome	Gallus gallus (bGalGal1.mat.broiler.GRCg7b) or appropriate species

Note: During library preparation, bead clean-up steps should be multiplied by 2.5× to improve recovery of short cfDNA fragments, as recommended by Martignano et al. (2023).

For complete methodology details, see Drag et al. (2025).

Clinical Workflow

The complete diagnostic workflow from sample collection to clinical decision:

Sample collection — Collect 200 µL serum or plasma from patient. For serum, allow blood to clot at room temperature for 30-60 minutes before centrifugation. For plasma, collect in anticoagulant tube and centrifuge promptly.
Sample processing — Centrifuge to separate serum/plasma from cells. Perform a second centrifugation at 400 × g for 2 minutes to remove residual cells that could contaminate cfDNA.
DNA extraction — Extract cell-free DNA using a kit designed for cfDNA recovery (low input, short fragments). Standard genomic DNA kits are not suitable as they lose short cfDNA fragments.
Quality control — Verify cfDNA profile on Bioanalyzer. Successful extraction shows characteristic cfDNA peak at approximately 160-180 bp (mononucleosomal fragment).
Sequencing — Prepare Oxford Nanopore library with methylation-aware protocol. Sequence on MinION, GridION, or PromethION with R10.4.1 flow cells.
Data processing — Run basecalling with methylation model enabled. Process with NanoporeToBED-Pipeline to generate CpG BED files suitable for MethylSense.
Diagnosis — Run MethylSense_predict.R with your trained model. The script outputs a diagnostic classification with confidence score for each sample.
Interpretation — Review prediction report. High confidence predictions (>80%) can inform clinical decisions. Medium or low confidence results warrant additional confirmatory testing.

Interpreting Diagnostic Results

The prediction output includes a probability score for each sample indicating how strongly the methylation profile matches each class. The confidence score represents the maximum predicted probability across all classes:

Confidence Level	Probability Range	Clinical Interpretation
High	>80%	Strong methylation signal consistent with predicted class. Prediction is reliable and can inform clinical decisions.
Medium	60-80%	Moderate methylation signal. Consider confirmatory testing (culture, PCR, endoscopy, imaging) before treatment decisions.
Low	<60%	Weak or ambiguous methylation signal. Do not use for clinical decisions. Investigate sample quality or consider resequencing.

Causes of Low Confidence Predictions

Low confidence predictions warrant investigation before interpretation:

Sample quality issues — Degraded cfDNA, insufficient DNA input, or contamination with genomic DNA from lysed blood cells
Technical factors — Low sequencing coverage, library preparation failure, or basecalling errors
Biological factors — Early-stage infection before host response develops, atypical disease presentation, or immunocompromised patients with altered epigenetic responses
Species mismatch — Sample from a species different to the training data population

Pre-trained Models

Pre-trained diagnostic models are available for download:

Avian Aspergillosis (Chicken, n=124)

Accuracy: >95% with nested cross-validation
Download: Zenodo Repository
DOI: 10.5281/zenodo.15194046
Includes: Trained models, DMR coordinates, and validation results

Additional species and disease models: Contact for availability at [email protected]

Regulatory Notice

MethylSense is a research tool and has not been approved by FDA, EMA, or other regulatory bodies for clinical diagnostic use. Diagnostic decisions should be made in conjunction with other clinical findings and established diagnostic methods.

Models trained on one species may not generalise to other species without proper validation. Cross-species application requires separate validation studies before clinical use.

Troubleshooting

Common Issues

Package not found

Rscript MethylSense_installer.R

Not enough samples per group

Rscript MethylSense_analysis.R ... --allow_small_groups

Small groups (<10 samples) may produce unreliable or overfitted results. We recommend a minimum of 20 samples per group for robust classifier training.

Memory error

# Use larger windows (fewer features):
--window_file windows_10kb.bed

# Use fewer models:
--models rf,svm

Training too slow

--models fast
--cv_repeats 3

Performance Benchmarks

Step	Runtime	Memory
load_data	5-10 min	2-4 GB
general_overview	5-15 min	2-6 GB
train (no CV)	20-30 min	4-8 GB
train (10× CV)	60-90 min	4-8 GB
train (nested CV)	2-3 hours	4-8 GB
predict	1-2 min	1-2 GB
reviewer	5-10 min	2-4 GB

Benchmarks for 30-50 samples, 2 groups, 8 CPU cores.

Citation

If you use MethylSense in your research, please cite:

Drag MH, Hvilsom C, Poulsen LL, Jensen HE, Tahas SA, Leineweber C, Cray C, Bertelsen MF, Bojesen AM. MethylSense: high accuracy machine learning-based diagnostics for Aspergillus fumigatus infection in chickens using host cell-free DNA methylation and Nanopore sequencing. J Clin Microbiol 0:e01054-25. https://doi.org/10.1128/jcm.01054-25

Acknowledgements

MethylSense builds upon excellent work from the R and Bioconductor communities.

Core Dependencies

methylKit (Akalin et al., 2012) — The foundation for DMR detection and methylation analysis in MethylSense. This Bioconductor package provides comprehensive tools for genome-wide DNA methylation analysis from bisulfite sequencing data.

Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, Mason CE (2012). methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology, 13, R87. https://doi.org/10.1186/gb-2012-13-10-r87

caret (Kuhn, 2008) — The machine learning framework powering model training, cross-validation, and hyperparameter tuning in MethylSense.

Kuhn M (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1-26. https://doi.org/10.18637/jss.v028.i05

pROC (Robin et al., 2011) — ROC curve analysis and AUC calculations.

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Additional R Packages

MethylSense also uses: GenomicRanges (Lawrence et al., 2013), ggplot2 (Wickham, 2016), randomForest (Liaw & Wiener, 2002), xgboost (Chen & Guestrin, 2016), glmnet (Friedman et al., 2010), e1071, ranger, pheatmap, and RColorBrewer.

References

Expand full reference list

DNA Methylation

Bird A (2002). DNA methylation patterns and epigenetic memory. Genes & Development, 16(1), 6-21.
Jones PA (2012). Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics, 13(7), 484-492.

Nanopore Methylation

Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nature Methods, 14(4), 407-410.

Statistical Methods

Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
Varma S, Simon R (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.

Related Pipelines

License and Authors

License: Academic Free License 3.0 (AFL-3.0). See LICENSE.

Lead Developer: Markus Hodal Drag

Co-authors: Christina Hvilsom, Louise Ladefoged Poulsen, Henrik Elvang Jensen, Stamatios Alan Tahas, Christoph Leineweber, Carolyn Cray, Mads Frost Bertelsen, Anders Miki Bojesen

Contact: [email protected] | GitHub Issues

Developed for Oxford Nanopore methylation analysis

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
example_data		example_data
logos		logos
screenshots		screenshots
.gitignore		.gitignore
LICENSE		LICENSE
MethylSense_analysis.R		MethylSense_analysis.R
MethylSense_general_data_overview.R		MethylSense_general_data_overview.R
MethylSense_installer.R		MethylSense_installer.R
MethylSense_load_data.R		MethylSense_load_data.R
MethylSense_predict.R		MethylSense_predict.R
MethylSense_reviewer.R		MethylSense_reviewer.R
MethylSense_theme.R		MethylSense_theme.R
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Overview

So what does MethylSense actually do?

Table of Contents

Prerequisites

Upstream Pipelines

System Requirements

Installation

Quick Install

Manual Installation

Verify Installation

Sample Sheet Format

Required columns

Optional columns

Example sample sheet

Quick Start

Step 1: Load and preprocess data

Step 2: Train diagnostic models

Optional: Data quality assessment and visualisation

Step 3: Model evaluation and selection

Step 4: Diagnose new samples

Example Output

From MethylSense_analysis.R

From MethylSense_reviewer.R

From MethylSense_general_data_overview.R

Pipeline Components

Core scripts

Optional scripts

Available machine learning models

Output Structure

After MethylSense_analysis.R

After MethylSense_reviewer.R

Command-Line Reference

MethylSense_load_data.R

MethylSense_analysis.R

MethylSense_predict.R

MethylSense_reviewer.R

MethylSense_general_data_overview.R

Plot Output Specifications

Cross-Validation Methodology

Standard Monte Carlo Cross-Validation (MCCV)

Nested Cross-Validation (Gold Standard)

Comparison

Clinical Applications

Understanding cfDNA-Based Diagnosis

Sample Types and Collection

Sample Processing Specifications

Clinical Workflow

Interpreting Diagnostic Results

Causes of Low Confidence Predictions

Pre-trained Models

Regulatory Notice

Troubleshooting

Common Issues

Performance Benchmarks

Citation

Acknowledgements

Core Dependencies

Additional R Packages

References

License and Authors

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages