A pure-Python ML monitoring stack for detecting data drift, model decay, and label shift in production — five detectors, three monitoring layers, zero external dependencies beyond NumPy and SciPy.
Most monitoring tutorials stop at: log predictions, compute accuracy, check if it drops. This library handles what comes before that signal arrives — watching the input feature distributions, prediction confidence, and class priors in real time, so you know something is wrong before the accuracy metric has moved.
Read the full write-up on EmiTechLogic → ML Model Monitoring: How to Detect Data Drift and Model Decay in Production
```
Features → FeatureMonitor → PredictionMonitor → PerformanceMonitor → Alert
           (KS, PSI, MMD)   (confidence,        (Accuracy, F1,
                             label shift)        ECE + ADWIN)
```
Three monitoring layers, five detectors, one entry point:
| Component | Job |
|---|---|
| `KolmogorovSmirnovDetector` | Per-feature marginal distribution test with p-value |
| `PopulationStabilityIndex` | Binned magnitude divergence (industry-standard PSI) |
| `MaximumMeanDiscrepancy` | Joint distribution test via RBF kernel + permutation test |
| `ADWIN` | Adaptive sequential detector; no pre-specified window size needed |
| `DDM` | Error-rate sequential detector; lagging ground-truth confirmation signal |
| `FeatureMonitor` | Wraps KS + PSI + MMD per batch against a fixed reference window |
| `PredictionMonitor` | Class distribution PSI, confidence KS, chi-squared label shift |
| `PerformanceMonitor` | Accuracy / F1 / ECE decay tracking with ADWIN on the accuracy stream |
```bash
git clone https://github.com/Emmimal/ML-Model-Monitoring.git
cd ML-Model-Monitoring
pip install -r requirements.txt
```

```
numpy>=1.24.0
scipy>=1.10.0
scikit-learn>=1.3.0
```
No other dependencies. All five detectors and three monitors run on the Python standard library + NumPy + SciPy. scikit-learn is used only for F1 computation in PerformanceMonitor and the NearestCentroid model in the benchmark.
```python
from monitors.feature_monitor import FeatureMonitor
from monitors.prediction_monitor import PredictionMonitor
from monitors.performance_monitor import PerformanceMonitor

# Build monitors from your training distribution
feat_monitor = FeatureMonitor(reference=training_X, ks_alpha=0.01)
pred_monitor = PredictionMonitor(n_classes=3,
                                 reference_preds=model.predict(training_X),
                                 reference_scores=model.predict_proba(training_X))
perf_monitor = PerformanceMonitor(n_classes=3,
                                  reference_acc=champion_acc,
                                  reference_f1=champion_f1)

# Call on each incoming production batch
for batch in production_stream:
    preds = model.predict(batch.X)
    scores = model.predict_proba(batch.X)

    feat_report = feat_monitor.update(batch.X)
    pred_report = pred_monitor.update(preds, scores)

    if feat_report.any_drift:
        print(f"Feature drift: KS={feat_report.drifted_features_ks}, "
              f"PSI={feat_report.drifted_features_psi}")
    if pred_report.label_shift_drift:
        print("Label shift: class priors have changed")

    # When ground truth labels arrive (may be delayed)
    perf_report = perf_monitor.update(batch.y_true, preds, scores)
    if perf_report.any_decay:
        print(f"Performance decay: acc={perf_report.accuracy:.4f} "
              f"(Δ{perf_report.acc_delta:+.4f})")
```

```bash
python run_benchmark.py
```

Runs all four scenarios and prints the full terminal report. Seed: 42. Device: CPU. No external downloads. Total runtime: ~47 seconds.
All numbers from real runs. Seed: 42. Python 3.12. CPU only. 15 batches × 200 samples per scenario. Model: Nearest-Centroid (frozen — intentionally naive, the article is about monitoring not model choice).
| Scenario | Acc | ΔAcc | F1 | ΔF1 | ECE | FeatDrift | Time |
|---|---|---|---|---|---|---|---|
| A — Stable Stream (no drift) | 1.0000 | +0.0000 | 1.0000 | +0.0000 | 0.1268 | 73.3% | 12.5s |
| B — Abrupt Drift (shift=2.0) | 0.5800 | −0.4200 | 0.5199 | −0.4801 | 0.3194 | 93.3% | 11.1s |
| C — Gradual Drift (batches 4–9) | 0.6900 | −0.3100 | 0.5663 | −0.4337 | 0.3144 | 100.0% | 11.2s |
| D — Label Shift (33%→70%) | 1.0000 | +0.0000 | 1.0000 | +0.0000 | 0.1301 | 93.3% | 12.4s |
| Scenario | True CP | KS | PSI | MMD | ADWIN | DDM |
|---|---|---|---|---|---|---|
| A — Stable (no drift) | none | batch 15 | batch 1 | batch 11 | missed | missed |
| B — Abrupt Drift | batch 5 | batch 6 | batch 1 | batch 6 | batch 6 | missed |
| C — Gradual Drift | batch 4 | batch 1 | batch 1 | batch 5 | batch 7 | missed |
| D — Label Shift | batch 5 | batch 6 | batch 1 | batch 6 | missed | missed |
Reading the Scenario D result honestly: Accuracy stays at 1.0 and F1 stays at 1.0 after the label shift. ADWIN and DDM produce zero detections. Feature drift rate is 93.3% — but that is a false signal from the shifted class composition. Only prediction distribution monitoring (chi-squared on class counts) catches this correctly. This is the most important finding in the benchmark: feature monitoring alone is not enough.
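A minimal sketch of the kind of chi-squared check on class counts that catches this case. It assumes `scipy.stats.chisquare` as the test and uses a hypothetical helper name, `label_shift_pvalue`; the repository's `PredictionMonitor` may implement the check differently.

```python
import numpy as np
from scipy.stats import chisquare


def label_shift_pvalue(reference_preds, current_preds, n_classes, epsilon=1e-6):
    """Chi-squared goodness-of-fit of current class counts vs reference priors.

    Hypothetical helper for illustration; not the library's API.
    """
    ref_counts = np.bincount(reference_preds, minlength=n_classes).astype(float)
    cur_counts = np.bincount(current_preds, minlength=n_classes).astype(float)
    # Smooth so a class absent from the reference never yields a zero expected count
    ref_priors = (ref_counts + epsilon) / (ref_counts + epsilon).sum()
    expected = ref_priors * cur_counts.sum()
    _, p_value = chisquare(f_obs=cur_counts, f_exp=expected)
    return float(p_value)


rng = np.random.default_rng(42)
reference = rng.choice(3, size=2000, p=[1 / 3, 1 / 3, 1 / 3])
stable = rng.choice(3, size=200, p=[1 / 3, 1 / 3, 1 / 3])
shifted = rng.choice(3, size=200, p=[0.70, 0.15, 0.15])  # class 0 prior moves 33% -> 70%

p_stable = label_shift_pvalue(reference, stable, n_classes=3)
p_shifted = label_shift_pvalue(reference, shifted, n_classes=3)
print(f"stable p={p_stable:.3f}, shifted p={p_shifted:.2e}")
```

The test only looks at predicted class counts, so it needs no ground-truth labels and is unaffected by accuracy staying flat.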
Reading the Scenario A result honestly: PSI fires on batch 1 of a stable stream. That is PSI's known behavior — it is a magnitude measure, not a significance test. In production, PSI > 0.10 opens an investigation. KS confirmation (which shows no false positives until batch 15) is what drives the action decision.
The core PSI implementation used in all monitors:
```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray,
        n_bins: int = 10, epsilon: float = 1e-4) -> float:
    """
    Population Stability Index for a 1-D feature array.

    Returns
    -------
    float
        < 0.10: stable | 0.10–0.20: investigate | > 0.20: critical
    """
    # Equal-width bins over the combined range, open-ended at both extremes
    combined = np.concatenate([reference, current])
    bins = np.linspace(combined.min(), combined.max(), n_bins + 1)
    bins[0], bins[-1] = -np.inf, np.inf

    # Bin proportions, smoothed with epsilon so empty bins do not produce log(0)
    ref_pct = np.histogram(reference, bins=bins)[0] / len(reference) + epsilon
    cur_pct = np.histogram(current, bins=bins)[0] / len(current) + epsilon
    ref_pct /= ref_pct.sum()
    cur_pct /= cur_pct.sum()

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

KS test p-value
p ≥ 0.05 → Stable. No action.
0.01 ≤ p < 0.05 → Warning. Check PSI.
p < 0.01 → Critical. Confirm with MMD. Route to diagnosis.
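As an illustration, the three KS tiers can be expressed as a small function around `scipy.stats.ks_2samp`. The helper name `ks_severity` and the synthetic data are assumptions made for this sketch, not part of the library.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_severity(reference, current, warn_alpha=0.05, critical_alpha=0.01):
    """Map a two-sample KS p-value onto the stable / warning / critical tiers."""
    p = float(ks_2samp(reference, current).pvalue)
    if p < critical_alpha:
        return "critical", p
    if p < warn_alpha:
        return "warning", p
    return "stable", p


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=1000)
shifted = rng.normal(2.0, 1.0, size=200)  # illustrative mean shift of 2.0

level, p_value = ks_severity(reference, shifted)
print(level, f"{p_value:.2e}")
```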
PSI per feature
PSI < 0.10 → Stable.
PSI 0.10–0.20 → Investigate. PSI alone is not sufficient to act.
PSI 0.20–0.25 → Significant. Act if KS also fires.
PSI > 0.25 → Emergency. Immediate investigation.
ECE (Expected Calibration Error)
ECE < 0.05 → Well calibrated.
ECE 0.05–0.10 → Acceptable for most use cases.
ECE 0.10–0.20 → Confidence scores unreliable for threshold decisions.
ECE > 0.20 → Severely miscalibrated. Do not use confidence outputs.
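For reference, a minimal ECE sketch, assuming top-label calibration with 10 equal-width confidence bins; the function name is hypothetical and the binning in `PerformanceMonitor` may differ. It also shows how a model can be 100% accurate and still exceed the 0.10 threshold: it only has to be underconfident.

```python
import numpy as np


def expected_calibration_error(y_true, scores, n_bins=10):
    """Top-label ECE: weighted mean of |accuracy - confidence| per confidence bin."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    confidences = scores.max(axis=1)
    correct = (scores.argmax(axis=1) == y_true).astype(float)

    # Interior edges 0.1 ... 0.9; right-closed bins give indices 0 .. n_bins - 1
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(confidences, interior_edges, right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


# Always correct, but only 80% confident: accuracy is 1.0, ECE is about 0.20
labels = np.zeros(100, dtype=int)
underconfident = np.tile([0.8, 0.1, 0.1], (100, 1))
ece = expected_calibration_error(labels, underconfident)
print(f"ECE = {ece:.4f}")
```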
ADWIN delta
0.002 → Conservative. Low false positives. Higher detection delay.
0.005 → Balanced. Recommended starting point.
0.010 → Aggressive. Fast detection. Higher false positive rate.
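To see what delta controls: in the original ADWIN formulation, the window is cut when the means of two sub-windows differ by more than a Hoeffding-style threshold that depends on delta. The sketch below computes that threshold for the three settings above. `adwin_epsilon_cut` is a hypothetical helper, and the repository's `adwin.py` may use a different bound (for example a variance-based one).

```python
import math


def adwin_epsilon_cut(n0: int, n1: int, delta: float) -> float:
    """Hoeffding-based cut threshold for sub-windows of length n0 and n1."""
    n = n0 + n1
    m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic-mean term of the two lengths
    delta_prime = delta / n           # correction for testing many split points
    return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))


for delta in (0.002, 0.005, 0.010):
    eps = adwin_epsilon_cut(n0=1000, n1=200, delta=delta)
    print(f"delta={delta:.3f} -> cut when |mean0 - mean1| > {eps:.4f}")
```

A smaller delta raises the threshold, so the mean has to move further (or more samples have to accumulate) before a cut, which matches the trade-off between false positives and detection delay listed above.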
```
ML-Model-Monitoring/
├── data/
│   └── generators.py            # 5 stream generators (stable, abrupt, gradual,
│                                #   seasonal, label shift)
├── detectors/
│   ├── statistical.py           # KS test, PSI, MMD drift detectors
│   ├── adwin.py                 # ADWIN + MultiMetricADWIN
│   └── ddm.py                   # DDM + EDDM sequential detectors
├── monitors/
│   ├── feature_monitor.py       # Per-feature drift tracking with reference window
│   ├── prediction_monitor.py    # Prediction distribution + label shift monitoring
│   └── performance_monitor.py   # Accuracy / F1 / ECE decay tracking with ADWIN
├── evaluation/
│   ├── metrics.py               # MonitoringMetrics, DetectorComparisonResult
│   └── benchmark.py             # Four-scenario benchmark runner
├── dashboard/
│   └── report.py                # Terminal drift report (no external viz deps)
├── tests/
│   └── test_monitoring.py       # 32-test unit suite
├── run_benchmark.py             # Single entry point: runs all four scenarios
└── requirements.txt
```
```bash
pip install pytest
python -m pytest tests/test_monitoring.py -v
```

32 tests. All pass. Coverage spans all five detectors, three monitors, five stream generators, and a full pipeline integration smoke test.
| You have | Use |
|---|---|
| Labeled reference window + incoming batches | FeatureMonitor (KS + PSI + MMD) |
| Softmax scores from your model | PredictionMonitor (confidence + label shift) |
| Ground truth labels (real-time or delayed) | PerformanceMonitor (ADWIN on accuracy) |
| Streaming per-prediction accuracy | ADWIN directly |
| Per-prediction error labels | DDM as lagging confirmation |
| Suspected correlated drift across features | MaximumMeanDiscrepancy (joint test) |
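For the DDM row, a rough sketch of how an error-rate detector consumes per-prediction error labels may help. This follows the usual Drift Detection Method rule (compare the running error rate plus its standard deviation against the best level seen so far); `SimpleDDM` is a hypothetical class written for illustration and is not the code in `detectors/ddm.py`.

```python
import math


class SimpleDDM:
    """Minimal Drift Detection Method sketch (illustrative, not the library's class)."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # its standard deviation

    def update(self, error: int) -> str:
        """Feed 1 for a wrong prediction, 0 for a correct one."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        # Strict comparisons so a stream with zero errors never triggers
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# 200 predictions at a 5% error rate, then 200 at a 50% error rate
stable_errors = [1 if (i + 1) % 20 == 0 else 0 for i in range(200)]
drifted_errors = [i % 2 for i in range(200)]

ddm = SimpleDDM()
first_drift = None
for i, err in enumerate(stable_errors + drifted_errors):
    if ddm.update(err) == "drift" and first_drift is None:
        first_drift = i
print("first drift signal at sample index", first_drift)
```

The detector needs a run of labeled errors after the change before it crosses the drift level, which is why it works as a lagging confirmation rather than an early warning.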
Skip the full stack when you have a single-turn workload, a small fixed dataset, or hard latency requirements under 50 ms. MMD permutation tests are the dominant cost at ~10s per scenario with 150-sample subsets and 100 permutations.
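To make the cost concrete, here is a sketch of an RBF-kernel MMD permutation test. It assumes a biased MMD² estimate and a median-heuristic bandwidth, both choices made for this example; `mmd_permutation_test` is a hypothetical name and the repository's `MaximumMeanDiscrepancy` may differ.

```python
import numpy as np


def mmd_permutation_test(x, y, n_permutations=100, seed=0):
    """Return (MMD^2, p-value) for two samples of shape (n, d) and (m, d)."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n = len(x)

    # Pairwise squared distances over the pooled sample: O((n + m)^2) memory and time
    norms = (z ** 2).sum(axis=1)
    sq_dists = np.maximum(norms[:, None] + norms[None, :] - 2.0 * z @ z.T, 0.0)

    # Median heuristic for the RBF bandwidth (an assumption of this sketch)
    median_sq = np.median(sq_dists[np.triu_indices_from(sq_dists, k=1)])
    kernel = np.exp(-sq_dists / max(median_sq, 1e-12))

    def mmd2(idx_x, idx_y):
        k_xx = kernel[np.ix_(idx_x, idx_x)].mean()
        k_yy = kernel[np.ix_(idx_y, idx_y)].mean()
        k_xy = kernel[np.ix_(idx_x, idx_y)].mean()
        return k_xx + k_yy - 2.0 * k_xy

    idx = np.arange(len(z))
    observed = mmd2(idx[:n], idx[n:])

    # Re-split the pooled sample at random and count splits at least as extreme
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(idx)
        if mmd2(perm[:n], perm[n:]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return float(observed), p_value


rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=(150, 4))
y = rng.normal(1.0, 1.0, size=(150, 4))  # every feature shifted by 1.0
mmd2_value, p_value = mmd_permutation_test(x, y)
print(f"MMD^2={mmd2_value:.4f}, p={p_value:.4f}")
```

The kernel matrix is built once and reused for every permutation, but each permutation still touches all (n + m)² entries, so runtime grows with both the subsample size squared and the number of permutations.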
- PSI uses equal-width binning, which misfires on heavy-tailed distributions. Switch to quantile-based bins for financial data.
- MMD runtime scales quadratically with subsample size. The default is 150 samples: increase it for sensitivity, decrease it for speed. Embedding-based distances can replace the RBF kernel for text features.
- DDM depends on ground-truth labels and cannot fire until enough errors accumulate. In the benchmark (15 batches), DDM missed every injection. In production with days or weeks of delayed labels, it functions correctly as a confirmation signal.
- ADWIN is univariate: one instance per metric. `MultiMetricADWIN` wraps this for convenience but runs independent detectors per metric, not a joint test.
- The reference window is fixed by default. Set `rolling=True` in `FeatureMonitor` if you expect slow ongoing drift and only care about sudden regime changes.
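For the equal-width binning limitation, one possible quantile-based variant is sketched below. It mirrors the `psi` function shown earlier but places bin edges at reference quantiles. `psi_quantile` is a hypothetical name, not a function in the repository, and it assumes the reference has at least two distinct values.

```python
import numpy as np


def psi_quantile(reference, current, n_bins=10, epsilon=1e-4):
    """PSI with bin edges at reference quantiles instead of equal-width bins."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    edges = np.unique(edges)               # merge duplicate edges caused by tied values
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + epsilon
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + epsilon
    ref_pct /= ref_pct.sum()
    cur_pct /= cur_pct.sum()
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(42)
reference = rng.lognormal(mean=0.0, sigma=1.0, size=2000)  # heavy right tail
same = rng.lognormal(mean=0.0, sigma=1.0, size=200)
scaled = 3.0 * rng.lognormal(mean=0.0, sigma=1.0, size=200)

psi_same = psi_quantile(reference, same)
psi_scaled = psi_quantile(reference, scaled)
print(f"same distribution: {psi_same:.3f}, scaled by 3: {psi_scaled:.3f}")
```

With quantile edges every bin holds roughly the same share of the reference, so a handful of extreme values no longer stretches the bins until most of the mass sits in one of them.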
Same series — production layers for ML systems:
Retrain vs Fine-Tune vs Train from Scratch: A Decision Framework for ML Engineers — Article 08. The decision framework this monitoring stack feeds into. When a drift alert fires, this is where it routes.
Continual Learning in PyTorch: A Practical Guide for ML Engineers — Article 07. Three continual learning scenarios and when each applies. Distinguishes the cases where monitoring output routes to EWC vs replay vs retraining.
How to Prevent Catastrophic Forgetting in PyTorch — Article 05. The failure mode that looks identical to concept drift from the outside. Monitoring alone cannot distinguish them — this article explains why.
Production ML Engineering Guide — Article 01. Series hub. All 15 articles mapped.
MIT