MethylSense is a sort-of "semi-automated" R-based bioinformatics pipeline for training, testing, and cross-validation of machine learning models to predict the presence of clinical infections or inflammatory processes using (drum roll) epigenetics! After feeding Nanopore or Illumina methylation data and running the main analysis script, it will compute associated DMRs with your condition and find the best predictive ML models for diagnosing new cases and samples. You will also gain insights into the epigenome and generate publication-ready figures in short time - and lots more!
-
Detects differentially methylated regions (DMRs): Identifies custom sized genomic regions with statistically significant methylation differences between multiple groups (e.g., Control, Infected, Suspected) using the methylKit R package (Akalin et al., 2012) with false discovery rate (FDR) correction. Also supports covariates.
-
Trains machine learning classifiers: Builds diagnostic models using nine machine learning algorithms via the caret framework (Kuhn, 2008), including Random Forest, SVM, XGBoost, and Elastic Net. Key features include Monte Carlo cross-validation and nested cross-validation for robust, unbiased model evaluation.
-
Diagnoses new samples: Applies trained models to classify patient samples with probability scores and confidence estimates to support clinical decision-making.
-
Generates publication-ready reports: Creates comprehensive evaluation reports with ROC curves, confusion matrices, Random Forest decision-tree graphics (and other model-specific plots), and performance metrics suitable for publication. You will have a lot of figures ready!
Originally developed for avian Aspergillus fumigatus infection diagnosis using host cell-free DNA methylation, achieving ~92% accuracy with Monte Carlo cross-validation.
- Prerequisites
- Installation
- Sample Sheet Format
- Quick Start
- Example Output
- Pipeline Components
- Output Structure
- Command-Line Reference
- Cross-Validation Methodology
- Clinical Applications
- Troubleshooting
- Citation
- Acknowledgements
- License and Authors
MethylSense requires processed Nanopore methylation data. You must run these pipelines first to prepare your sequencing data:
NanoporeToBED-Pipeline (GitHub)
This pipeline converts raw Nanopore sequencing output (modBAM files from Guppy or Dorado basecallers) into standardised CpG BED files. It extracts 5-methylcytosine (5mC) modification calls at CpG dinucleotide sites, preserving methylation frequencies and read coverage information. Run this pipeline first before MethylSense.
GenomeToWindows (GitHub)
This utility generates genomic window BED files (typically 1kb, 5kb, 10kb, or 25kb) from a reference genome. These windows define the regions for DMR detection. Aggregating CpG-level data into larger genomic windows reduces noise from individual CpG site variation and provides sufficient statistical power for differential methylation testing.
- R version: 4.0 or higher (4.2+ recommended)
- Operating system: Linux, macOS, or Windows
Running MethylSense from the terminal:
| Platform | How to run |
|---|---|
| Linux | Use the native Terminal application |
| macOS | Use the native Terminal app (Applications → Utilities → Terminal) |
| Windows | Use MobaXterm (recommended) or WSL2 with Ubuntu |
Note for Windows users: MobaXterm provides a Unix-like terminal environment with built-in X11 forwarding for R plots. Install R within MobaXterm's environment or use WSL2 with a native Linux R installation for best compatibility.
git clone https://github.com/markusdrag/MethylSense.git
cd MethylSense
Rscript MethylSense_installer.R
chmod +x MethylSense_*.RThe installer script automatically installs all required packages from both CRAN and Bioconductor. Installation typically takes 15-45 minutes depending on your system and existing packages. The script will report success or failure for each package and provide troubleshooting guidance if needed.
If you prefer manual installation, MethylSense depends on packages from both CRAN and Bioconductor:
# CRAN packages
install.packages(c(
"optparse", "qs", "caret", "pROC", "ggplot2", "hrbrthemes", "dplyr", "tidyr",
"randomForest", "e1071", "glmnet", "nnet", "class", "MASS",
"klaR", "ranger", "xgboost", "pheatmap", "RColorBrewer",
"data.table", "readxl", "reshape2", "viridis", "corrplot",
"factoextra", "ggridges", "gridExtra", "ggrepel", "naivebayes",
"tree", "rpart", "rpart.plot", "cluster", "jsonlite", "MLmetrics",
"doParallel", "foreach", "pls", "PRROC", "igraph", "umap",
"dendextend", "lme4", "nlme", "MuMIn", "ez", "fossil"
))
# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("methylKit", "GenomicRanges", "genomation", "regioneR", "IRanges"))Rscript -e "library(methylKit); library(caret); library(xgboost)"
Rscript MethylSense_load_data.R --helpMethylSense requires a sample metadata file (Excel .xlsx or comma-separated .csv) that describes your samples, their group assignments, and file locations. This information is essential for the pipeline to correctly load, process, and analyse your data.
These columns must be present in your sample sheet:
| Column | Description | Example |
|---|---|---|
| ID | Unique sample identifier (must match BED filenames) | sample_001 |
| Species | Species name used for filtering samples | Gallus_gallus |
| bedFileOrg | Original BED filename from NanoporeToBED | sample_001.CpG.bed |
| bedFile | Output filename for converted BED (user-specified) | sample_001_8col.bed |
| treatMethylkit | Numeric group code encoding each sample's treatment or infection status for methylKit. Each unique integer represents a distinct group (e.g., 0 = Control, 1 = Suspected, 2 = Infected). These codes are mapped to human-readable names via --group_names in ascending numeric order, or explicitly with --treatment_mapping (e.g., 0=Control,1=Suspected,2=Infected). Plot colours are then auto-assigned from the MethylSense palette based on group name keywords. | 0, 1, 2 |
These columns provide additional metadata for reporting and stratification:
| Column | Description | Example |
|---|---|---|
| Infection | Human-readable group label | Control, Infected, Suspected |
| Study | Study or cohort identifier for stratified analysis | Study_A |
| Batch | Batch identifier for covariate adjustment (use with --covariate_cols) | Batch_1, Batch_2 |
| Pre-analytic QC (numeric) | Optional technical covariates: reviewer auto-detects columns whose names match keywords such as storage, dna, hemolysis, time, concentration, extraction, quality, qc (see script). Use several variables for meaningful histograms. | storage_time_days, cfDNA_concentration_ng_uL, hemolysis_index, … |
| Animal_ID | Individual animal identifier | Bird_42 |
| Lab_ID | Laboratory sample identifier | LAB001 |
ID Species bedFileOrg bedFile treatMethylkit Infection
control_1 Gallus_gallus control_1.CpG.bed control_1_8col.bed 0 Control
control_2 Gallus_gallus control_2.CpG.bed control_2_8col.bed 0 Control
suspected_1 Gallus_gallus suspected_1.CpG.bed suspected_1_8col.bed 1 Suspected
suspected_2 Gallus_gallus suspected_2.CpG.bed suspected_2_8col.bed 1 Suspected
infected_1 Gallus_gallus infected_1.CpG.bed infected_1_8col.bed 2 Infected
infected_2 Gallus_gallus infected_2.CpG.bed infected_2_8col.bed 2 Infected
See the example_data folder for a complete working example: sample_metadata_30.csv (and matching BED files) provides 30 samples (10 per class). You can also use sample_metadata.xlsx with the same column layout.
Runtime: 5-10 minutes
This step converts raw CpG BED files into a unified methylKit object (qs format). The pipeline normalises coverage across samples, filters low-quality CpG sites, and prepares the data for downstream analysis.
Rscript MethylSense_load_data.R \
--species "Gallus_gallus" \
--sample_sheet ./example_data/sample_metadata_30.csv \
--bed_dir ./example_data \
--output_dir ./preprocessedWhat happens: Each sample's CpG methylation data is loaded, filtered by minimum coverage thresholds, and combined into a single methylRawList object for efficient analysis.
Output: YYYYMMDD_Species_nN_methylRaw.qs
Runtime: 1-2 hours (with 10-fold cross-validation)
This is the core analysis step. MethylSense identifies differentially methylated regions (DMRs) using the methylKit package, then trains machine learning classifiers to distinguish between your experimental groups.
Rscript MethylSense_analysis.R \
--qs_file ./preprocessed/*_methylRaw.qs \
--sample_sheet ./example_data/sample_metadata_30.csv \
--output_dir ./training \
--window_file ./example_data/windows/windows_5kb.bed \
--models rf,svm,xgboost \
--group_names Control,Suspected,Infected \
--group_colors '#1B5E20,#006064,#4A148C' \
--positive_class Infected \
--cv_repeats 10 \
--nested_cv \
--nested_cv_repeats 5Tip — Group colours: The
--group_colorsparameter accepts comma-separated hex codes or colour names, one per group in the same order as--group_names. If omitted, colours are auto-assigned from the MethylSense palette based on group name keywords (e.g., "Control" → green, "Infected" → purple). Override with any valid R colour, for example--group_colors 'darkgreen,orange,darkred'.
What happens:
- Methylation levels are calculated within genomic windows
- Statistical testing identifies DMRs with FDR correction (Benjamini-Hochberg)
- DMR methylation values become features for machine learning
- Multiple models are trained with cross-validation for robust performance estimates
- Nested cross-validation provides unbiased accuracy estimates suitable for publication
Output: Trained models (rds files), DMR coordinates, cross-validation results, performance metrics
Runtime: 5-15 minutes
This step is optional but recommended for exploring your DMR results and generating publication-ready visualisations. Run this after Step 2 to understand methylation patterns, assess data quality, and create comprehensive figures for presentations or publications.
Rscript MethylSense_general_data_overview.R \
--analysis_dir ./training/training_* \
--sample_sheet ./example_data/sample_metadata_30.csv \
--region_sizes "5000" \
--output_dir ./data_overview \
--plot_format png,pdf \
--infection_col Infection \
--sample_id_col ID \
--study_col StudyWhat happens: The script analyses your DMR results and generates publication-ready figures including:
- DMR counts across different window sizes
- Volcano plots showing effect sizes and significance
- Coverage distribution analysis (quality control)
- Chromosomal distribution of DMRs
- Effect size characterisation (hypermethylated vs hypomethylated)
- Sample-level methylation heatmaps for top DMRs
Output: Comprehensive report with figures and statistics in data_overview/
When to use: Run this after Step 2 to visualise DMR results, compare different window sizes, and generate exploratory figures. This is particularly useful for understanding your data before selecting the best model for clinical diagnosis.
Runtime: 5-10 minutes
Before using a model for clinical diagnosis, evaluate its performance to select the best model. This step generates comprehensive reports with publication-ready figures to help you choose the optimal diagnostic model.
# Find a trained model directory (pick one based on marker count)
Rscript MethylSense_reviewer.R \
--model_dir ./training/training_*/windows_5kb/rf/markers_5 \
--qs_file ./preprocessed/*_methylRaw.qs \
--sample_sheet ./example_data/sample_metadata_30.csv \
--sample_id_column ID \
--treatment_column Infection \
--output_subdir ./reportWhat happens: The reviewer analyses model performance on the training data, generates ROC curves, confusion matrices, and detailed reports. Compare performance across different models (rf, svm, xgboost) to select the best one for your diagnostic application.
Output: MODEL_EVALUATION_REPORT.md, MODEL_EVALUATION_REPORT.html, figures, tables
Runtime: 5-10 minutes (including preprocessing)
Apply your trained model to classify new unknown samples. This is the real-world clinical workflow where you have new patient samples that need diagnosis.
Step 4a: Generate unknown example samples (Tutorial only)
For this tutorial, we provide a script to generate 3 "unknown" samples with hidden labels:
Rscript generate_unknown_samples.RThis creates example_data/unknown_samples/ with 3 BED files and a metadata sheet.
Step 4b: Preprocess new samples
Preprocess the unknown samples just like in Step 1. For unknown samples, the treatMethylkit and Infection columns should be NA:
Rscript MethylSense_load_data.R \
--species "Gallus_gallus" \
--sample_sheet ./example_data/unknown_samples/unknown_samples_metadata.xlsx \
--bed_dir ./example_data/unknown_samples \
--output_dir ./unknown_preprocessedStep 4c: Run predictions
Rscript MethylSense_predict.R \
--model_dir ./training/training_*/windows_5kb/rf/markers_5 \
--qs_file ./unknown_preprocessed/*_methylRaw.qs \
--output_dir ./unknown_predictions \
--plotsWhat happens:
- The prediction script loads your trained model and its DMR coordinates
- It extracts methylation values from the new samples at those specific DMR regions
- The model predicts each sample's infection status with a probability score
- Samples with confidence ≥0.7 are classified; lower confidence samples are flagged for review
Output:
- *_predictions_detailed.csv - Full predictions with probabilities for each class
- *_predictions_clinical_report.csv - Simplified report for clinical use
- *_prediction_summary.txt - Overview of prediction results
- Diagnostic plots (if --plots flag used)
Tutorial Checkpoint: Check your predictions! The 3 unknown samples have hidden true labels (Control, Infected, Suspected). Compare with your model's predictions to see accuracy.
Example prediction output
MethylSense produces detailed predictions with class probabilities for each sample:
| Sample_ID | True_Status | Predicted_Class | Confidence | Prob_Control | Prob_Suspected | Prob_Infected |
|---|---|---|---|---|---|---|
| unknown_1 | Control | Control | 0.948 (High) | 0.948 | 0.052 | 0.000 |
| unknown_2 | Infected | Infected | 0.850 (High) | 0.006 | 0.144 | 0.850 |
| unknown_3 | Suspected | Suspected | 0.820 (High) | 0.002 | 0.820 | 0.178 |
All 3 samples correctly classified with high confidence (≥0.7). The model correctly identified one sample from each class.
MethylSense generates publication-ready visualisations across its pipeline scripts. The examples below were generated from a 3-class analysis (Control, Suspected, Infected) using 30 simulated samples with 10-fold nested cross-validation. Many additional plots are generated by each script—see the output directories for the full suite.
Plots generated during model training (use --cpg_report for CpG-level plots).
Model evaluation plots generated when reviewing a trained model.
Exploratory plots providing a landscape view of DMR results.
|
Volcano plot
Effect size vs significance with top DMRs labeled |
Methylation by infection status
Ridge plot of methylation distribution by group |
| Script | Purpose | Key Output |
|---|---|---|
| MethylSense_load_data.R | Preprocess BED files into methylKit format | qs methylRawList object |
| MethylSense_analysis.R | DMR detection and ML model training | Trained models, DMR coordinates |
| MethylSense_predict.R | Clinical diagnosis of new samples | Predictions with confidence scores |
| MethylSense_reviewer.R | Clinical interpretation and reporting | Evaluation reports and figures |
| Script | Purpose |
|---|---|
| MethylSense_general_data_overview.R | Data quality assessment and exploratory visualisation |
| MethylSense_installer.R | Complete package installer with verification |
MethylSense supports nine machine learning algorithms, implemented through the caret framework (Kuhn, 2008):
| Model | Code | Speed | Best Use Case |
|---|---|---|---|
| Random Forest | rf | Medium | General purpose, feature importance analysis |
| Support Vector Machine | svm | Medium | High accuracy classification |
| XGBoost | xgboost | Medium | Best overall accuracy |
| Elastic Net | glmnet | Fast | Biomarker discovery, interpretable models |
| Neural Network | nnet | Medium | Complex non-linear patterns |
| k-Nearest Neighbours | knn | Fast | Simple datasets, baseline comparisons |
| Linear Discriminant Analysis | lda | Fast | Highly interpretable results |
| Naive Bayes | nb | Fast | Baseline comparisons |
| Fast Random Forest | ranger | Fast | Large datasets |
Model shortcuts: --models all, --models fast, --models biomarker
output_directory/
├── analysis_summary.csv # Complete results for all models
├── analysis_settings.txt # Reproducibility settings
├── ANALYSIS_REPORT.md # Summary report (Markdown)
├── ANALYSIS_REPORT.html # Summary report (HTML)
├── run_log.txt # Complete execution log
├── ml_training_logs/ # Detailed ML training diagnostics
├── train_test_splits/ # Data partition information
├── precomputed_data/ # Cached intermediate results
├── complete_dmr_datasets/ # DMR statistics per window size
├── cv_analysis_plots/ # Cross-validation visualisations
└── models/ # Trained model files
├── rf_model/
│ ├── model.rds # Serialised model object
│ ├── dmr_coords.csv # DMR coordinates used
│ └── model_summary.txt # Model performance summary
├── svm_model/
└── xgboost_model/
report_directory/
├── MODEL_EVALUATION_REPORT.md # Comprehensive clinical report
├── MODEL_EVALUATION_REPORT.html # Interactive HTML version
├── MODEL_EVALUATION_REPORT.pdf # Publication-ready PDF
├── figures/ # All visualisations
│ ├── roc_curves/
│ ├── confusion_matrices/
│ ├── feature_importance/
│ └── methylation_heatmaps/
└── tables/ # Summary statistics
All MethylSense scripts support --help to display full parameter lists. This section documents parameters for publication reporting and advanced usage.
Preprocesses raw BED files into methylKit objects for analysis.
| Parameter | Type | Default | Description |
|---|---|---|---|
| --species | string | required | Species name (must match sample sheet) |
| --sample_sheet | file | required | Path to sample metadata (Excel or CSV) |
| --bed_dir | dir | required | Directory containing CpG BED files |
| --output_dir | dir | required | Output directory for processed .qs files |
| --assembly | string | auto | Genome assembly name |
| --cores | int | 20 | Number of CPU cores for parallel processing |
| --min_coverage | int | 1 | Minimum read coverage per CpG site |
| --force_convert | flag | false | Force re-conversion of existing files |
| --no_parallel | flag | false | Disable parallel processing |
Example:
Rscript MethylSense_load_data.R \
--species "Gallus_gallus" \
--sample_sheet ./samples.xlsx \
--bed_dir ./bed_files \
--output_dir ./preprocessed \
--cores 8Main analysis script for DMR detection and ML classifier training.
Core parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --qs_file | file | required | Path to preprocessed methylRawList .qs file |
| --sample_sheet | file | — | Sample metadata (Excel/CSV) for covariate adjustment |
| --output_dir | dir | required | Output directory for results |
| --window_file | file | — | Path to a single genomic window BED file (e.g., windows_5kb.bed) |
| --bed_files | string | — | Comma-separated BED files for genomic windows (alternative to --window_file) |
| --region_dir | dir | — | Directory containing multiple window BED files (alternative to --window_file) |
| --models | string | rf,svm,glmnet,nnet,knn,lda,nb | ML models to train |
| --cores | int | 4 | CPU cores for parallel model training |
Treatment group options
| Parameter | Type | Default | Description |
|---|---|---|---|
| --group_names | string | auto | Custom group names, comma-separated |
| --group_colors | string | auto | Custom colours matching group_names |
| --positive_class | string | — | Name of positive class for binary metrics |
| --treatment_mapping | string | — | Map numeric codes to names (e.g., 0=Control,1=Infected) |
| --min_group_size | int | 4 | Minimum samples per group |
| --allow_small_groups | flag | false | Allow groups with fewer samples (risky) |
Cross-validation & performance
| Parameter | Type | Default | Description |
|---|---|---|---|
| --cv_folds | int | 10 | Cross-validation folds |
| --cv_repeats | int | 0 | Number of CV repetitions (5-10 recommended for publication) |
| --nested_cv | flag | false | Enable nested cross-validation (gold standard for publication) |
| --nested_cv_repeats | int | 1 | Number of times to repeat the entire nested k-fold CV with different seeds. For publication: use 5. Total evaluations = cv_repeats × nested_cv_repeats |
| --test_ratio | float | 0.3 | Proportion of data for test set (70:30 train:test split) |
| --train_test_file | file | — | Excel file with predefined train/test splits |
| --min_accuracy | float | 0.7 | Minimum accuracy to save trained models |
| --reproducible | flag | false | Fixed seeds for reproducibility |
DMR detection & filtering
| Parameter | Type | Default | Description |
|---|---|---|---|
| --qvalue_threshold | float | 0.05 | FDR threshold for DMR significance |
| --min_coverage | int | 20 | Minimum read coverage per CpG |
| --hyper_threshold | float | 5.0 | Minimum % for hypermethylated DMRs |
| --hypo_threshold | float | -2.0 | Maximum % for hypomethylated DMRs |
| --disable_meth_filter | flag | false | Disable methylation filtering (q-value only) |
| --covariate_cols | string | — | Comma-separated sample sheet columns for DMR covariate adjustment (e.g., 'Batch,Study') |
| --exclude_sex_chroms | flag | false | Exclude X, Y, Z, W chromosomes from markers |
| --filter_standard_chroms | flag | false | Filter to standard chromosomes only (1-99 + sex) |
| --min_samples_per_region | float | 1.0 | Minimum sample fraction per region |
Marker set size
| Parameter | Type | Default | Description |
|---|---|---|---|
| --max_markers | int | 100 | Maximum number of DMRs to use as features |
| --min_markers | int | 8 | Minimum number of DMRs to start training |
| --marker_step | int | 1 | Step size for marker iteration |
| --marker_range | string | — | Alternative: specify range as min:max (e.g., 2:50) |
Plot output
| Parameter | Type | Default | Description |
|---|---|---|---|
| --create_plots | flag | true | Generate visualisations |
| --no_plots | flag | false | Disable plot generation |
| --plot_format | string | png | Output format: png, jpg, svg, pdf, all |
| --plot_dpi | int | 300 | Resolution (300 standard, 600 high-res, 150 draft) |
Other options
| Parameter | Type | Default | Description |
|---|---|---|---|
| --force_ml | flag | false | Force re-run even with cached results |
| --verbose | flag | false | Enable verbose output |
| --debug | flag | false | Enable debug mode |
Model shortcuts: --models all (all 9), --models fast (glmnet,ranger,lda,nb), --models biomarker (glmnet,rf)
Example — Full research analysis (Drag et al. 2025 setup):
Rscript MethylSense_analysis.R \
--qs_file ./preprocessed/Chicken_n124.qs \
--output_dir ./results/Chicken_Aspergillosis \
--region_dir ./windowed_genomes \
--group_names 'Control,Aspergillus_Infected' \
--group_colors 'darkgreen,darkred' \
--positive_class 'Aspergillus_Infected' \
--treatment_mapping '0=Control,1=Aspergillus_Infected' \
--min_accuracy 0.8 \
--marker_range '2:50' \
--hyper_threshold 5.0 --hypo_threshold -2.0 \
--min_coverage 10 \
--cv_repeats 10 --nested_cv --nested_cv_repeats 5 \
--test_ratio 0.4 \
--exclude_sex_chroms --filter_standard_chroms \
--models 'nnet,glmnet,lda,knn,rf,nb,svm' \
--cores 8 \
--plot_format 'all' --plot_dpi 300 \
--force_mlExample — Quick test run:
Rscript MethylSense_analysis.R \
--qs_file ./data.qs \
--output_dir ./quick_test \
--group_names Control,Disease \
--models fast \
--max_markers 20Applies trained models to classify new samples.
| Parameter | Type | Default | Description |
|---|---|---|---|
| --qs_file | file | required | Path to preprocessed samples (.qs) |
| --model_dir | dir | required | Directory containing trained model |
| --output_dir | dir | required | Output directory for predictions |
| --sample_ids | string | all | Comma-separated sample IDs to predict |
| --min_coverage | int | 10 | Minimum coverage per CpG |
| --confidence_threshold | float | 0.7 | Confidence threshold for classification |
| --plots | flag | false | Generate diagnostic plots |
| --batch_mode | flag | false | Batch mode (suppress individual plots) |
| --verbose | flag | false | Enable verbose output |
Example:
Rscript MethylSense_predict.R \
--qs_file ./new_samples/samples.qs \
--model_dir ./results/rf/markers_10 \
--output_dir ./predictions \
--plots --confidence_threshold 0.8Generates comprehensive model evaluation reports for publication.
Core parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --model_dir | dir | required | Path to trained model directory |
| --qs_file | file | required | Path to methylRawList .qs file |
| --sample_sheet | file | required | Path to sample metadata |
| --output_subdir | string | model_review | Output subdirectory name |
| --model_name | string | auto | Custom model name for report title |
| --n_cores | int | 4 | Number of CPU cores |
| --verbose | flag | false | Enable verbose output |
Statistical analysis
| Parameter | Type | Default | Description |
|---|---|---|---|
| --bootstrap_iterations | int | 500 | Bootstrap iterations for CI estimation |
| --stability_threshold | float | 0.70 | Feature stability threshold |
| --effect_size_threshold | float | 2.0 | Effect size threshold for significance |
| --min_effect_size | float | 5.0 | Minimum effect size (methylation %) |
| --multiple_testing | string | fdr | Correction: fdr, bonferroni, holm, BY, none |
| --significance_threshold | float | 0.05 | P-value significance threshold |
| --control_group | string | auto | Reference group for logistic regression |
Batch & time effects
| Parameter | Type | Default | Description |
|---|---|---|---|
| --batch_column | string | — | Column for batch/platform variable |
| --storage_column | string | — | Column for storage duration variable |
| --date_column | string | — | Column for collection/sampling date |
| --time_analysis | string | — | Column for timepoint/longitudinal data |
| --time_categorical | flag | false | Treat time as categorical (vs continuous) |
| --covariate_cols | string | — | Comma-separated column names from sample sheet to adjust for confounders during DMR calling (e.g., Batch,Storage_Days) |
Network & annotation
| Parameter | Type | Default | Description |
|---|---|---|---|
| --enable_wgcna | flag | false | Enable WGCNA co-methylation network analysis |
| --enable_gene_annotation | flag | false | Enable biomaRt gene annotation |
| --species | string | — | Species for annotation (e.g., 'Gallus gallus') |
| --enable_feature_annotation | flag | false | Enable genomic feature annotation |
| --gtf_file | file | — | GTF/GFF file for gene features |
| --regulatory_gff | file | — | Regulatory features GFF (enhancers, CTCF) |
| --emar_gff | file | — | EMAR regions GFF file |
| --open_chromatin_bed | file | — | Open chromatin BED file |
| --functional_enrichment | flag | false | Enable functional enrichment analysis |
PCR validation
| Parameter | Type | Default | Description |
|---|---|---|---|
| --prepare_pcr | flag | false | Generate PCR primer design links |
| --pcr_n_dmrs | int | 20 | Number of top DMRs for PCR validation |
Report output
| Parameter | Type | Default | Description |
|---|---|---|---|
| --generate_markdown | flag | true | Generate markdown report |
| --generate_pdf | flag | false | Generate PDF report |
| --markdown_title | string | auto | Custom report title |
| --markdown_author | string | — | Author name(s) for report |
| --figure_format | string | png | Output format: png, jpg, svg, all |
| --figure_resolution | int | 300 | Figure resolution (DPI) |
| --group_colors | string | auto | Custom colors for PCA/UMAP plots |
| --blind_paths | flag | false | Hide full paths in reports for privacy |
Example — Full research evaluation (Drag et al. 2025 setup):
Rscript MethylSense_reviewer.R \
--model_dir ./results/svm/markers_35 \
--qs_file ./preprocessed/Chicken_n124.qs \
--sample_sheet ./AsperMerged_v3.xlsx \
--model_name "Fast Model Diagnostic Test" \
--covariate_cols "SEQUENCER,cfDNA_conc_ng_uL,number_of_reads,mapping_percentage,mean_coverage" \
--time_analysis Sample_Week \
--enable_wgcna \
--bootstrap_iterations 1000 \
--stability_threshold 0.75 \
--effect_size_threshold 2.0 \
--min_effect_size 5.0 \
--multiple_testing fdr \
--n_cores 8 \
--prepare_pcr \
--functional_enrichment \
--enable_feature_annotation \
--gtf_file ./genomes/Gallus_gallus.GRCg7b.108.gtf \
--regulatory_gff ./genomes/Gallus_gallus.GRCg7b.regulatory_features.gff3 \
--emar_gff ./genomes/Gallus_gallus.GRCg7b.EMARs.gff \
--open_chromatin_bed ./genomes/openchrom-union.bed \
--figure_format "jpg,svg" \
--blind_paths \
--generate_pdf \
--verboseExample — Quick batch effect check:
Rscript MethylSense_reviewer.R \
--model_dir ./results/rf/markers_20 \
--qs_file ./data.qs \
--sample_sheet ./metadata.xlsx \
--batch_column Platform \
--storage_column StorageDaysCreates visualisations of the methylation data landscape.
| Parameter | Type | Default | Description |
|---|---|---|---|
| --analysis_dir | dir | required | Path to analysis directory |
| --sample_sheet | file | required | Path to sample metadata |
| --region_sizes | string | 1000,5000,10000,15000,20000,25000 | Region sizes (bp) |
| --plot_format | string | png | Output format: png, jpg, svg, pdf |
| --plot_width | float | 12 | Plot width in inches |
| --plot_height | float | 8 | Plot height in inches |
| --plot_dpi | int | 300 | Resolution for raster formats |
| --volcano_meth_threshold | float | 5 | Methylation difference threshold (%) |
Example — High-resolution SVG for publication:
Rscript MethylSense_general_data_overview.R \
--analysis_dir ./results \
--sample_sheet ./metadata.xlsx \
--region_sizes "5000,10000,25000" \
--plot_format svg --plot_dpi 600 \
--plot_width 14 --plot_height 10For publication-quality figures:
| Script | Default format | Default DPI | Recommended for print |
|---|---|---|---|
| MethylSense_analysis.R | PNG | 300 | --plot_format svg --plot_dpi 600 |
| MethylSense_reviewer.R | PNG | 300 | --figure_format 'png,svg' --figure_resolution 600 --generate_pdf |
| MethylSense_general_data_overview.R | PNG | 300 | --plot_format svg --plot_dpi 600 |
Common journal requirements:
- Nature/Science: 300-600 DPI, prefer vector formats (SVG/PDF)
- PLoS/BMC: 300 DPI minimum, TIFF or PNG acceptable
- Elsevier: 300 DPI for halftones, 1000 DPI for line art
All MethylSense scripts generate figures at 300 DPI by default, suitable for most journal requirements.
MethylSense implements two complementary cross-validation strategies for model evaluation. Both are controlled through command-line parameters and generate separate output files.
Enabled with --cv_repeats N (where N > 0). This is the default and faster method.
Each repeat draws a new random 60/40 stratified split. The model is trained on 60% of the data with internal k-fold CV for hyperparameter tuning, and evaluated on the held-out 40%. Metrics are averaged across all N repeats with 95% confidence intervals.
# Standard MCCV with 10 repeats
Rscript MethylSense_analysis.R \
--qs_file ./data.qs \
--output_dir ./results \
--cv_repeats 10Output: cv_detailed_*.csv, cv_summary_*.csv
Enabled by adding --nested_cv. This is the gold standard for unbiased performance estimation suitable for publication (Varma & Simon, 2006).
Nested CV uses two layers of cross-validation:
- Outer loop: K stratified, non-overlapping folds (set by
--cv_repeats). Each fold is held out as a test set exactly once. - Inner loop: Separate k-fold CV within the outer training set for hyperparameter tuning. The outer test fold is never seen during tuning.
The entire k-fold procedure can be repeated N times with different random seeds (set by --nested_cv_repeats), producing K × N total evaluations for more robust estimates.
# Publication-quality nested CV: 10 folds × 5 repetitions = 50 evaluations
Rscript MethylSense_analysis.R \
--qs_file ./data.qs \
--output_dir ./results \
--cv_repeats 10 \
--nested_cv \
--nested_cv_repeats 5Output: nested_cv_detailed_*.csv, nested_cv_summary_*.csv, nested_cv_hyperparameters_*.csv, plus nested CV visualisation plots
| Aspect | Standard MCCV | Nested CV |
|---|---|---|
| Outer split | Random 60/40 (overlapping across repeats) | K-fold, non-overlapping |
| Hyperparameter tuning | Internal CV on train split | Separate inner CV loop on outer train only |
| Test data isolation | Good (not used for tuning) | Complete (fully isolated from all tuning) |
| Bias | Potentially mildly optimistic | Unbiased (gold standard) |
| Speed | Fast | 5–10× slower |
| Use case | Development and exploration | Publication-quality results |
When --nested_cv is enabled, MethylSense runs both methods automatically and calculates the optimism bias (the difference between standard and nested CV accuracy). If the bias exceeds 5%, a warning is raised suggesting the nested CV results should be used for publication.
This section provides comprehensive guidance for veterinarians, diagnosticians, and clinicians using MethylSense for disease diagnosis. The pipeline translates complex epigenetic data into actionable clinical information.
Cell-free DNA (cfDNA) is released into the bloodstream when cells die, carrying the methylation patterns of their tissue of origin. During infection, pathogen-induced host cell damage releases cfDNA with characteristic methylation signatures that differ from healthy individuals. MethylSense detects these disease-associated methylation patterns to classify samples as infected or healthy.
The diagnostic approach is fundamentally different from pathogen detection methods (PCR, culture): rather than detecting the pathogen directly, MethylSense detects the host's epigenetic response to infection. This provides complementary diagnostic information and may detect infection earlier or in cases where pathogen load is below detection limits of conventional methods.
MethylSense can analyse cfDNA from both serum and plasma. Both sample types contain host cfDNA suitable for methylation analysis:
Serum: Blood allowed to clot before centrifugation. May contain slightly higher cfDNA concentrations due to cell lysis during clotting. Used in the original validation study (Drag et al., 2025).
Plasma: Blood collected in anticoagulant tubes (EDTA, citrate, or heparin) and centrifuged before clotting. May have lower background from lysed blood cells. Either sample type is suitable, but consistency within a study is recommended.
For avian aspergillosis diagnosis, the following specifications are based on Drag et al. (2025):
| Parameter | Specification |
|---|---|
| Sample type | Serum or plasma (cell-free DNA) |
| Volume | 200 µL minimum (serum or plasma) |
| Pre-extraction | Centrifugation 2 min at 400 × g to remove residual cells |
| DNA extraction | Norgen Biotek cfDNA Micro Kit (cat. 55500) or equivalent cfDNA kit |
| Quality control | Bioanalyzer 2100, High Sensitivity DNA kit (confirm cfDNA fragment profile) |
| Storage | -80°C (extracted cfDNA stable for months) |
| Library preparation | Native Barcoding Ligation Kit (SQK-NBD114.24), 11 µL cfDNA input |
| Sequencer | PromethION P2 Solo or MinION (any ONT platform) |
| Flow cell | R10.4.1 (version 14 chemistry) |
| Basecalling | Guppy v6.5.7 or Dorado, High Accuracy mode |
| Methylation model | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg |
| Methylation caller | ONT modkit v0.1.2 or later, pileup mode with cpg combine mods |
| Reference genome | Gallus gallus (bGalGal1.mat.broiler.GRCg7b) or appropriate species |
Note: During library preparation, bead clean-up steps should be multiplied by 2.5× to improve recovery of short cfDNA fragments, as recommended by Martignano et al. (2023).
For complete methodology details, see Drag et al. (2025).
The complete diagnostic workflow from sample collection to clinical decision:
-
Sample collection — Collect 200 µL serum or plasma from patient. For serum, allow blood to clot at room temperature for 30-60 minutes before centrifugation. For plasma, collect in anticoagulant tube and centrifuge promptly.
-
Sample processing — Centrifuge to separate serum/plasma from cells. Perform a second centrifugation at 400 × g for 2 minutes to remove residual cells that could contaminate cfDNA.
-
DNA extraction — Extract cell-free DNA using a kit designed for cfDNA recovery (low input, short fragments). Standard genomic DNA kits are not suitable as they lose short cfDNA fragments.
-
Quality control — Verify cfDNA profile on Bioanalyzer. Successful extraction shows characteristic cfDNA peak at approximately 160-180 bp (mononucleosomal fragment).
-
Sequencing — Prepare Oxford Nanopore library with methylation-aware protocol. Sequence on MinION, GridION, or PromethION with R10.4.1 flow cells.
-
Data processing — Run basecalling with methylation model enabled. Process with NanoporeToBED-Pipeline to generate CpG BED files suitable for MethylSense.
-
Diagnosis — Run MethylSense_predict.R with your trained model. The script outputs a diagnostic classification with confidence score for each sample.
-
Interpretation — Review prediction report. High confidence predictions (>80%) can inform clinical decisions. Medium or low confidence results warrant additional confirmatory testing.
The prediction output includes a probability score for each sample indicating how strongly the methylation profile matches each class. The confidence score represents the maximum predicted probability across all classes:
| Confidence Level | Probability Range | Clinical Interpretation |
|---|---|---|
| High | >80% | Strong methylation signal consistent with predicted class. Prediction is reliable and can inform clinical decisions. |
| Medium | 60-80% | Moderate methylation signal. Consider confirmatory testing (culture, PCR, endoscopy, imaging) before treatment decisions. |
| Low | <60% | Weak or ambiguous methylation signal. Do not use for clinical decisions. Investigate sample quality or consider resequencing. |
Low confidence predictions warrant investigation before interpretation:
- Sample quality issues — Degraded cfDNA, insufficient DNA input, or contamination with genomic DNA from lysed blood cells
- Technical factors — Low sequencing coverage, library preparation failure, or basecalling errors
- Biological factors — Early-stage infection before host response develops, atypical disease presentation, or immunocompromised patients with altered epigenetic responses
- Species mismatch — Sample from a species different to the training data population
Pre-trained diagnostic models are available for download:
Avian Aspergillosis (Chicken, n=124)
- Accuracy: >95% with nested cross-validation
- Download: Zenodo Repository
- DOI: 10.5281/zenodo.15194046
- Includes: Trained models, DMR coordinates, and validation results
Additional species and disease models: Contact for availability at [email protected]
MethylSense is a research tool and has not been approved by FDA, EMA, or other regulatory bodies for clinical diagnostic use. Diagnostic decisions should be made in conjunction with other clinical findings and established diagnostic methods.
Models trained on one species may not generalise to other species without proper validation. Cross-species application requires separate validation studies before clinical use.
Package not found
Rscript MethylSense_installer.RNot enough samples per group
Rscript MethylSense_analysis.R ... --allow_small_groupsSmall groups (<10 samples) may produce unreliable or overfitted results. We recommend a minimum of 20 samples per group for robust classifier training.
Memory error
# Use larger windows (fewer features):
--window_file windows_10kb.bed
# Use fewer models:
--models rf,svmTraining too slow
--models fast
--cv_repeats 3| Step | Runtime | Memory |
|---|---|---|
| load_data | 5-10 min | 2-4 GB |
| general_overview | 5-15 min | 2-6 GB |
| train (no CV) | 20-30 min | 4-8 GB |
| train (10× CV) | 60-90 min | 4-8 GB |
| train (nested CV) | 2-3 hours | 4-8 GB |
| predict | 1-2 min | 1-2 GB |
| reviewer | 5-10 min | 2-4 GB |
Benchmarks for 30-50 samples, 2 groups, 8 CPU cores.
If you use MethylSense in your research, please cite:
Drag MH, Hvilsom C, Poulsen LL, Jensen HE, Tahas SA, Leineweber C, Cray C, Bertelsen MF, Bojesen AM. MethylSense: high accuracy machine learning-based diagnostics for Aspergillus fumigatus infection in chickens using host cell-free DNA methylation and Nanopore sequencing. J Clin Microbiol 0:e01054-25. https://doi.org/10.1128/jcm.01054-25
MethylSense builds upon excellent work from the R and Bioconductor communities.
methylKit (Akalin et al., 2012) — The foundation for DMR detection and methylation analysis in MethylSense. This Bioconductor package provides comprehensive tools for genome-wide DNA methylation analysis from bisulfite sequencing data.
Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, Mason CE (2012). methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology, 13, R87. https://doi.org/10.1186/gb-2012-13-10-r87
caret (Kuhn, 2008) — The machine learning framework powering model training, cross-validation, and hyperparameter tuning in MethylSense.
Kuhn M (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1-26. https://doi.org/10.18637/jss.v028.i05
pROC (Robin et al., 2011) — ROC curve analysis and AUC calculations.
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77
MethylSense also uses: GenomicRanges (Lawrence et al., 2013), ggplot2 (Wickham, 2016), randomForest (Liaw & Wiener, 2002), xgboost (Chen & Guestrin, 2016), glmnet (Friedman et al., 2010), e1071, ranger, pheatmap, and RColorBrewer.
Expand full reference list
DNA Methylation
- Bird A (2002). DNA methylation patterns and epigenetic memory. Genes & Development, 16(1), 6-21.
- Jones PA (2012). Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics, 13(7), 484-492.
Nanopore Methylation
- Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W (2017). Detecting DNA cytosine methylation using nanopore sequencing. Nature Methods, 14(4), 407-410.
Statistical Methods
- Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
- Varma S, Simon R (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
Related Pipelines
License: Academic Free License 3.0 (AFL-3.0). See LICENSE.
Lead Developer: Markus Hodal Drag
Co-authors: Christina Hvilsom, Louise Ladefoged Poulsen, Henrik Elvang Jensen, Stamatios Alan Tahas, Christoph Leineweber, Carolyn Cray, Mads Frost Bertelsen, Anders Miki Bojesen
Contact: [email protected] | GitHub Issues
Developed for Oxford Nanopore methylation analysis















