ML-Model-Monitoring

A pure-Python ML monitoring stack for detecting data drift, model decay, and label shift in production: five detectors, three monitoring layers, and a detector core that needs nothing beyond NumPy and SciPy (scikit-learn is used only for F1 scoring and the benchmark model).

Most monitoring tutorials stop at: log predictions, compute accuracy, check if it drops. This library handles what comes before that signal arrives — watching the input feature distributions, prediction confidence, and class priors in real time, so you know something is wrong before the accuracy metric has moved.

Read the full write-up on EmiTechLogic → ML Model Monitoring: How to Detect Data Drift and Model Decay in Production

What It Does

Features → FeatureMonitor → PredictionMonitor → PerformanceMonitor → Alert
               (KS, PSI, MMD)     (confidence,        (Accuracy, F1,
                                   label shift)         ECE + ADWIN)

Three monitoring layers, five detectors, one entry point:

Component	Job
`KolmogorovSmirnovDetector`	Per-feature marginal distribution test with p-value
`PopulationStabilityIndex`	Binned magnitude divergence — industry-standard PSI
`MaximumMeanDiscrepancy`	Joint distribution test via RBF kernel + permutation test
`ADWIN`	Adaptive sequential detector — no pre-specified window size needed
`DDM`	Error-rate sequential detector — lagging ground-truth confirmation signal
`FeatureMonitor`	Wraps KS + PSI + MMD per batch against a fixed reference window
`PredictionMonitor`	Class distribution PSI, confidence KS, chi-squared label shift
`PerformanceMonitor`	Accuracy / F1 / ECE decay tracking with ADWIN on the accuracy stream
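
As a rough illustration of the per-feature KS check listed in the table, the sketch below runs `scipy.stats.ks_2samp` on each column of a reference window and a current batch. The function name `ks_drifted_features` and its `alpha` default are assumptions made for this example; they are not the library's API, and the real `KolmogorovSmirnovDetector` may differ.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drifted_features(reference: np.ndarray, current: np.ndarray,
                        alpha: float = 0.01) -> list[int]:
    """Return indices of columns whose two-sample KS p-value is below alpha."""
    drifted = []
    for j in range(reference.shape[1]):
        # Compare the marginal distribution of feature j in both windows.
        result = ks_2samp(reference[:, j], current[:, j])
        if result.pvalue < alpha:
            drifted.append(j)
    return drifted


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=(1000, 3))
current = rng.normal(0.0, 1.0, size=(200, 3))
current[:, 1] += 2.0  # inject a mean shift into feature 1 only

print(ks_drifted_features(reference, current))
```

Because each column is tested on its own, this catches marginal shifts but not changes in the correlation between features, which is the gap the joint MMD test is meant to cover.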

Installation

git clone https://github.com/Emmimal/ML-Model-Monitoring.git
cd ML-Model-Monitoring
pip install -r requirements.txt

numpy>=1.24.0
scipy>=1.10.0
scikit-learn>=1.3.0

No other dependencies. All five detectors, FeatureMonitor, and PredictionMonitor run on the Python standard library + NumPy + SciPy. scikit-learn is used only for F1 computation in PerformanceMonitor and for the NearestCentroid model in the benchmark.

Quick Start

from monitors.feature_monitor     import FeatureMonitor
from monitors.prediction_monitor  import PredictionMonitor
from monitors.performance_monitor import PerformanceMonitor

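# Build monitors from your training distribution.
# training_X, model, champion_acc, champion_f1 and production_stream
# are placeholders for your own objects.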
feat_monitor = FeatureMonitor(reference=training_X, ks_alpha=0.01)
pred_monitor = PredictionMonitor(n_classes=3,
                                  reference_preds=model.predict(training_X),
                                  reference_scores=model.predict_proba(training_X))
perf_monitor = PerformanceMonitor(n_classes=3,
                                   reference_acc=champion_acc,
                                   reference_f1=champion_f1)

# Call on each incoming production batch
for batch in production_stream:
    preds  = model.predict(batch.X)
    scores = model.predict_proba(batch.X)

    feat_report = feat_monitor.update(batch.X)
    pred_report = pred_monitor.update(preds, scores)

    if feat_report.any_drift:
        print(f"Feature drift: KS={feat_report.drifted_features_ks}, "
              f"PSI={feat_report.drifted_features_psi}")

    if pred_report.label_shift_drift:
        print("Label shift: class priors have changed")

    # When ground truth labels arrive (may be delayed).
    # Assumption: batch.y_true is None until labels are available.
    if batch.y_true is not None:
        perf_report = perf_monitor.update(batch.y_true, preds, scores)
        if perf_report.any_decay:
            print(f"Performance decay: acc={perf_report.accuracy:.4f} "
                  f"(Δ{perf_report.acc_delta:+.4f})")

Run the Benchmark

python run_benchmark.py

Runs all four scenarios and prints the full terminal report. Seed: 42. Device: CPU. No external downloads. Total runtime: ~47 seconds.

Benchmark Results

All numbers from real runs. Seed: 42. Python 3.12. CPU only. 15 batches × 200 samples per scenario. Model: Nearest-Centroid (frozen — intentionally naive, the article is about monitoring not model choice).

Performance Summary — Final Batch Metrics

Scenario	Acc	ΔAcc	F1	ΔF1	ECE	FeatDrift (% of batches flagged)	Time
A — Stable Stream (no drift)	1.0000	+0.0000	1.0000	+0.0000	0.1268	73.3%	12.5s
B — Abrupt Drift (shift=2.0)	0.5800	−0.4200	0.5199	−0.4801	0.3194	93.3%	11.1s
C — Gradual Drift (batches 4–9)	0.6900	−0.3100	0.5663	−0.4337	0.3144	100.0%	11.2s
D — Label Shift (33%→70%)	1.0000	+0.0000	1.0000	+0.0000	0.1301	93.3%	12.4s

Detector Comparison — First Detection Batch

Scenario	True CP	KS	PSI	MMD	ADWIN	DDM
A — Stable (no drift)	none	batch 15	batch 1	batch 11	missed	missed
B — Abrupt Drift	batch 5	batch 6	batch 1	batch 6	batch 6	missed
C — Gradual Drift	batch 4	batch 1	batch 1	batch 5	batch 7	missed
D — Label Shift	batch 5	batch 6	batch 1	batch 6	missed	missed

Reading the Scenario D result honestly: Accuracy stays at 1.0 and F1 stays at 1.0 after the label shift. ADWIN and DDM produce zero detections. Feature drift rate is 93.3% — but that is a false signal from the shifted class composition. Only prediction distribution monitoring (chi-squared on class counts) catches this correctly. This is the most important finding in the benchmark: feature monitoring alone is not enough.
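
The chi-squared check on class counts mentioned above can be sketched with `scipy.stats.chisquare`. This is a minimal version under stated assumptions: the function name `label_shift_pvalue` is hypothetical, every class is assumed to appear at least once in the reference predictions (otherwise an expected count is zero), and the counts mimic Scenario D's 33% to 70% shift rather than reproducing the benchmark data.

```python
import numpy as np
from scipy.stats import chisquare


def label_shift_pvalue(reference_preds: np.ndarray, current_preds: np.ndarray,
                       n_classes: int) -> float:
    """Chi-squared goodness-of-fit of current class counts vs reference priors."""
    ref_counts = np.bincount(reference_preds, minlength=n_classes)
    cur_counts = np.bincount(current_preds, minlength=n_classes)
    # Expected counts: reference class proportions scaled to the batch size.
    expected = ref_counts / ref_counts.sum() * cur_counts.sum()
    return float(chisquare(f_obs=cur_counts, f_exp=expected).pvalue)


reference_preds = np.repeat([0, 1, 2], 100)           # balanced priors
stable_batch = np.repeat([0, 1, 2], [67, 67, 66])     # still about 33% each
shifted_batch = np.repeat([0, 1, 2], [140, 30, 30])   # class 0 now at 70%

p_stable = label_shift_pvalue(reference_preds, stable_batch, n_classes=3)
p_shifted = label_shift_pvalue(reference_preds, shifted_batch, n_classes=3)
print(f"stable p={p_stable:.3f}, shifted p={p_shifted:.2e}")
```

The test looks only at predicted class counts, so it can fire even when accuracy and F1 have not moved.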

Reading the Scenario A result honestly: PSI fires on batch 1 of a stable stream. That is PSI's known behavior — it is a magnitude measure, not a significance test. In production, PSI > 0.10 opens an investigation. KS confirmation (which shows no false positives until batch 15) is what drives the action decision.

PSI From Scratch

The core PSI implementation used in all monitors:

import numpy as np

def psi(reference: np.ndarray, current: np.ndarray,
        n_bins: int = 10, epsilon: float = 1e-4) -> float:
    """
    Population Stability Index for a 1-D feature array.

    Returns
    -------
    float  —  < 0.10: stable | 0.10–0.20: investigate | > 0.20: critical
    """
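    # Equal-width bin edges over the pooled range of both samples.
    combined = np.concatenate([reference, current])
    bins = np.linspace(combined.min(), combined.max(), n_bins + 1)
    # Open the outer bins so no value can fall outside the histogram.
    bins[0], bins[-1] = -np.inf, np.inf

    # Bin proportions; epsilon keeps empty bins from producing log(0) or a zero division.
    ref_pct = np.histogram(reference, bins=bins)[0] / len(reference) + epsilon
    cur_pct = np.histogram(current,   bins=bins)[0] / len(current)   + epsilon

    # Renormalize so each distribution sums to 1 after the epsilon shift.
    ref_pct /= ref_pct.sum()
    cur_pct /= cur_pct.sum()

    # PSI = sum over bins of (current - reference) * ln(current / reference).
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))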
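
To make the formula in the final line concrete, here is a small worked example on hand-picked bin proportions. The three-bin values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical bin proportions for one feature (each sums to 1).
ref_pct = np.array([0.25, 0.50, 0.25])
cur_pct = np.array([0.20, 0.50, 0.30])

# Per-bin contribution: (current - reference) * ln(current / reference).
terms = (cur_pct - ref_pct) * np.log(cur_pct / ref_pct)
psi_value = float(terms.sum())

for r, c, t in zip(ref_pct, cur_pct, terms):
    print(f"ref={r:.2f} cur={c:.2f} term={t:.4f}")
print(f"PSI={psi_value:.4f}")
```

Every term is non-negative because the difference and the log ratio always share a sign. Moving five percentage points of mass from the first bin to the last gives a PSI of about 0.02, well under the 0.10 "investigate" threshold.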

Threshold Reference

KS test p-value
  p ≥ 0.05          →  Stable. No action.
  0.01 ≤ p < 0.05   →  Warning. Check PSI.
  p < 0.01          →  Critical. Confirm with MMD. Route to diagnosis.

PSI per feature
  PSI < 0.10        →  Stable.
  PSI 0.10–0.20     →  Investigate. PSI alone is not sufficient to act.
  PSI 0.20–0.25     →  Significant. Act if KS also fires.
  PSI > 0.25        →  Emergency. Immediate investigation.

ECE (Expected Calibration Error)
  ECE < 0.05        →  Well calibrated.
  ECE 0.05–0.10     →  Acceptable for most use cases.
  ECE 0.10–0.20     →  Confidence scores unreliable for threshold decisions.
  ECE > 0.20        →  Severely miscalibrated. Do not use confidence outputs.

ADWIN delta
  0.002             →  Conservative. Low false positives. Higher detection delay.
  0.005             →  Balanced. Recommended starting point.
  0.010             →  Aggressive. Fast detection. Higher false positive rate.
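
For readers unfamiliar with ECE, the sketch below shows one common way to compute it: group predictions into equal-width confidence bins and take the sample-weighted mean gap between accuracy and mean confidence. The function name `expected_calibration_error` and the 10-bin default are assumptions for this example; PerformanceMonitor's actual implementation may bin differently.

```python
import numpy as np


def expected_calibration_error(y_true: np.ndarray, scores: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    confidence = scores.max(axis=1)
    correct = (scores.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1.
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


# Two over-confident predictions (one wrong) and two under-confident correct ones.
y_true = np.array([0, 1, 0, 0])
scores = np.array([[0.95, 0.05], [0.95, 0.05], [0.65, 0.35], [0.65, 0.35]])
ece = expected_calibration_error(y_true, scores)
print(f"ECE={ece:.2f}")
```

In this toy case the 0.95-confidence bin is only 50% accurate and the 0.65-confidence bin is 100% accurate, giving an ECE of 0.40, far past the 0.20 line in the table above.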

Project Structure

ML-Model-Monitoring/
├── data/
│   └── generators.py          # 5 stream generators (stable, abrupt, gradual,
│                              #   seasonal, label shift)
├── detectors/
│   ├── statistical.py         # KS test, PSI, MMD drift detectors
│   ├── adwin.py               # ADWIN + MultiMetricADWIN
│   └── ddm.py                 # DDM + EDDM sequential detectors
├── monitors/
│   ├── feature_monitor.py     # Per-feature drift tracking with reference window
│   ├── prediction_monitor.py  # Prediction distribution + label shift monitoring
│   └── performance_monitor.py # Accuracy / F1 / ECE decay tracking with ADWIN
├── evaluation/
│   ├── metrics.py             # MonitoringMetrics, DetectorComparisonResult
│   └── benchmark.py           # Four-scenario benchmark runner
├── dashboard/
│   └── report.py              # Terminal drift report (no external viz deps)
├── tests/
│   └── test_monitoring.py     # 32-test unit suite
├── run_benchmark.py           # Single entry point — runs all four scenarios
└── requirements.txt

Running the Tests

pip install pytest
python -m pytest tests/test_monitoring.py -v

32 tests. All pass. Coverage spans all five detectors, three monitors, five stream generators, and a full pipeline integration smoke test.

When to Use What

You have	Use
Reference feature window + incoming batches	`FeatureMonitor` (KS + PSI + MMD)
Softmax scores from your model	`PredictionMonitor` (confidence + label shift)
Ground truth labels (real-time or delayed)	`PerformanceMonitor` (ADWIN on accuracy)
Streaming per-prediction accuracy	`ADWIN` directly
Per-prediction error labels	`DDM` as lagging confirmation
Suspected correlated drift across features	`MaximumMeanDiscrepancy` (joint test)
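
To show why DDM needs a stream of per-prediction error labels, here is a minimal sketch of the classic DDM rule: track the running error rate p and its standard deviation s, remember the lowest p + s seen, and flag a warning or drift once the current level exceeds that baseline by two or three baseline standard deviations. `SimpleDDM` is a hypothetical class written for this example, not the code in detectors/ddm.py; it uses strict inequalities so an all-correct stream does not trigger, and it omits the reset that usually follows a drift.

```python
import math


class SimpleDDM:
    """Minimal Drift Detection Method over a stream of 0/1 error flags."""

    def __init__(self, min_samples: int = 30):
        self.min_samples = min_samples
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error: int) -> str:
        """Feed one error flag (1 = wrong prediction); return the current state."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the lowest error level seen so far as the baseline.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + 3.0 * self.s_min:
            return "drift"
        if p + s > self.p_min + 2.0 * self.s_min:
            return "warning"
        return "stable"


flags = [1 if i % 10 == 9 else 0 for i in range(200)]   # about 10% errors
flags += [1 if i % 2 == 0 else 0 for i in range(100)]   # about 50% errors
ddm = SimpleDDM()
first_drift = None
for i, flag in enumerate(flags):
    if ddm.update(flag) == "drift":
        first_drift = i
        break
print(first_drift)
```

The detector only reacts after enough extra errors have pushed the running rate past the baseline, which is why it lags the change point.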

Skip the full stack for one-off (single-turn) use, for a small fixed dataset, or when you have a hard latency requirement under 50 ms. MMD permutation tests are the dominant cost, at roughly 10 s per scenario with 150-sample subsets and 100 permutations.
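
The cost profile above follows from how an MMD permutation test works: every permutation recomputes kernel averages over all pairs of points. The sketch below is one way to write it, assuming a biased squared-MMD estimate and a bandwidth of `gamma = 1 / n_features`. Both choices, and the function names, are assumptions for illustration and may not match `MaximumMeanDiscrepancy`.

```python
import numpy as np


def rbf_mmd2(x: np.ndarray, y: np.ndarray, gamma: float) -> float:
    """Biased estimate of squared MMD with kernel exp(-gamma * ||a - b||^2)."""
    z = np.vstack([x, y])
    sq_norms = (z ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    k = np.exp(-gamma * np.clip(sq_dists, 0.0, None))
    n = len(x)
    return float(k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean())


def mmd_permutation_test(x: np.ndarray, y: np.ndarray,
                         n_permutations: int = 100, seed: int = 0):
    """Return (observed MMD^2, permutation p-value)."""
    rng = np.random.default_rng(seed)
    gamma = 1.0 / x.shape[1]  # assumed bandwidth heuristic, not the library's
    observed = rbf_mmd2(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffle the pooled sample and re-split to simulate "no drift".
        idx = rng.permutation(len(pooled))
        if rbf_mmd2(pooled[idx[:n]], pooled[idx[n:]], gamma) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=(150, 4))
drifted = rng.normal(1.0, 1.0, size=(150, 4))  # mean shift on every feature
mmd2, p_value = mmd_permutation_test(reference, drifted)
print(f"MMD^2={mmd2:.4f}, p={p_value:.4f}")
```

With 100 permutations the smallest achievable p-value is 1/101, so a stricter significance level needs more permutations and proportionally more runtime.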

Known Limitations

PSI uses equal-width binning, which misfires on heavy-tailed distributions. Switch to quantile-based bins for financial data.
MMD runtime scales quadratically with subsample size. Default is 150 samples — increase for sensitivity, decrease for speed. Embedding-based distances can replace the RBF kernel for text features.
DDM depends on ground-truth labels, so it cannot fire until enough errors have accumulated. In the benchmark (15 batches), DDM missed every injection. In production, where labels arrive days or weeks late, it works as a confirmation signal rather than an early warning.
ADWIN is univariate — one instance per metric. MultiMetricADWIN wraps this for convenience but runs independent detectors per metric, not a joint test.
Reference window is fixed by default — set rolling=True in FeatureMonitor if you expect slow ongoing drift and only care about sudden regime changes.
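
The quantile-binning workaround for heavy-tailed features could look like the sketch below. It places the bin edges at quantiles of the reference sample, so each reference bin holds roughly the same share of data regardless of outliers. `psi_quantile` is a hypothetical helper written for this example and is not part of the repository.

```python
import numpy as np


def psi_quantile(reference: np.ndarray, current: np.ndarray,
                 n_bins: int = 10, epsilon: float = 1e-4) -> float:
    """PSI with bin edges at reference quantiles instead of equal widths."""
    # Interior edges at evenly spaced reference quantiles (deciles for n_bins=10).
    inner = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    inner = np.unique(inner)  # ties in the data can produce duplicate edges
    # Outer bins are open-ended; searchsorted maps each value to its bin index.
    ref_counts = np.bincount(np.searchsorted(inner, reference, side="right"),
                             minlength=len(inner) + 1)
    cur_counts = np.bincount(np.searchsorted(inner, current, side="right"),
                             minlength=len(inner) + 1)
    ref_pct = ref_counts / len(reference) + epsilon
    cur_pct = cur_counts / len(current) + epsilon
    ref_pct /= ref_pct.sum()
    cur_pct /= cur_pct.sum()
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
heavy_tailed = rng.standard_cauchy(5000)  # extreme outliers stretch the range
print(f"same data: {psi_quantile(heavy_tailed, heavy_tailed):.4f}")
print(f"shifted:   {psi_quantile(heavy_tailed, heavy_tailed + 2.0):.4f}")
```

With equal-width bins, a few extreme values stretch the range so that nearly all samples land in one or two bins. Quantile edges keep the resolution where the data actually is.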

License

MIT
