Emmimal/ML-Model-Monitoring

ML-Model-Monitoring

A pure-Python ML monitoring stack for detecting data drift, model decay, and label shift in production: five detectors, three monitoring layers, and no core dependencies beyond NumPy and SciPy (scikit-learn is needed only for F1 scoring and the benchmark model).

Most monitoring tutorials stop at: log predictions, compute accuracy, check if it drops. This library handles what comes before that signal arrives — watching the input feature distributions, prediction confidence, and class priors in real time, so you know something is wrong before the accuracy metric has moved.

Read the full write-up on EmiTechLogic → ML Model Monitoring: How to Detect Data Drift and Model Decay in Production


What It Does

Features → FeatureMonitor → PredictionMonitor → PerformanceMonitor → Alert
               (KS, PSI, MMD)     (confidence,        (Accuracy, F1,
                                   label shift)         ECE + ADWIN)

Three monitoring layers, five detectors, one entry point:

| Component | Job |
| --- | --- |
| KolmogorovSmirnovDetector | Per-feature marginal distribution test with p-value |
| PopulationStabilityIndex | Binned magnitude divergence (industry-standard PSI) |
| MaximumMeanDiscrepancy | Joint distribution test via RBF kernel + permutation test |
| ADWIN | Adaptive sequential detector; no pre-specified window size needed |
| DDM | Error-rate sequential detector; lagging ground-truth confirmation signal |
| FeatureMonitor | Wraps KS + PSI + MMD per batch against a fixed reference window |
| PredictionMonitor | Class distribution PSI, confidence KS, chi-squared label shift |
| PerformanceMonitor | Accuracy / F1 / ECE decay tracking with ADWIN on the accuracy stream |
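To make the per-feature KS row concrete, here is a minimal sketch using `scipy.stats.ks_2samp`. The function name `ks_drifted_features` and the synthetic data are hypothetical and only illustrate the idea; this is not the `KolmogorovSmirnovDetector` implementation from the repo.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drifted_features(reference, current, alpha=0.01):
    """Return the column indices whose two-sample KS p-value is below alpha."""
    drifted = []
    for j in range(reference.shape[1]):
        # Each feature is tested on its own (marginal test, no joint structure).
        result = ks_2samp(reference[:, j], current[:, j])
        if result.pvalue < alpha:
            drifted.append(j)
    return drifted

# Synthetic illustration: shift only feature 1 by two standard deviations.
rng = np.random.default_rng(42)
reference = rng.normal(size=(1000, 3))
current = rng.normal(size=(200, 3))
current[:, 1] += 2.0

drifted = ks_drifted_features(reference, current)
print("Drifted feature indices:", drifted)
```

Because every feature is tested independently, a check like this cannot see drift that only shows up in the correlation between features, which is the gap the MMD row covers.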

Installation

git clone https://github.com/Emmimal/ML-Model-Monitoring.git
cd ML-Model-Monitoring
pip install -r requirements.txt

Contents of requirements.txt:

numpy>=1.24.0
scipy>=1.10.0
scikit-learn>=1.3.0

No other dependencies. The five detectors and the feature and prediction monitors run on the Python standard library plus NumPy and SciPy. scikit-learn is used only for F1 computation in PerformanceMonitor and for the NearestCentroid model in the benchmark.


Quick Start

from monitors.feature_monitor     import FeatureMonitor
from monitors.prediction_monitor  import PredictionMonitor
from monitors.performance_monitor import PerformanceMonitor

# Build monitors from your training distribution
feat_monitor = FeatureMonitor(reference=training_X, ks_alpha=0.01)
pred_monitor = PredictionMonitor(n_classes=3,
                                  reference_preds=model.predict(training_X),
                                  reference_scores=model.predict_proba(training_X))
perf_monitor = PerformanceMonitor(n_classes=3,
                                   reference_acc=champion_acc,
                                   reference_f1=champion_f1)

# Call on each incoming production batch
for batch in production_stream:
    preds  = model.predict(batch.X)
    scores = model.predict_proba(batch.X)

    feat_report = feat_monitor.update(batch.X)
    pred_report = pred_monitor.update(preds, scores)

    if feat_report.any_drift:
        print(f"Feature drift: KS={feat_report.drifted_features_ks}, "
              f"PSI={feat_report.drifted_features_psi}")

    if pred_report.label_shift_drift:
        print("Label shift: class priors have changed")

    # When ground truth labels arrive (may be delayed)
    perf_report = perf_monitor.update(batch.y_true, preds, scores)
    if perf_report.any_decay:
        print(f"Performance decay: acc={perf_report.accuracy:.4f} "
              f"(Δ{perf_report.acc_delta:+.4f})")

Run the Benchmark

python run_benchmark.py

Runs all four scenarios and prints the full terminal report. Seed: 42. Device: CPU. No external downloads. Total runtime: ~47 seconds.


Benchmark Results

All numbers from real runs. Seed: 42. Python 3.12. CPU only. 15 batches × 200 samples per scenario. Model: Nearest-Centroid (frozen — intentionally naive, the article is about monitoring not model choice).

Performance Summary — Final Batch Metrics

| Scenario | Acc | ΔAcc | F1 | ΔF1 | ECE | FeatDrift | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A: Stable Stream (no drift) | 1.0000 | +0.0000 | 1.0000 | +0.0000 | 0.1268 | 73.3% | 12.5s |
| B: Abrupt Drift (shift=2.0) | 0.5800 | −0.4200 | 0.5199 | −0.4801 | 0.3194 | 93.3% | 11.1s |
| C: Gradual Drift (batches 4–9) | 0.6900 | −0.3100 | 0.5663 | −0.4337 | 0.3144 | 100.0% | 11.2s |
| D: Label Shift (33%→70%) | 1.0000 | +0.0000 | 1.0000 | +0.0000 | 0.1301 | 93.3% | 12.4s |

Detector Comparison — First Detection Batch

| Scenario | True CP | KS | PSI | MMD | ADWIN | DDM |
| --- | --- | --- | --- | --- | --- | --- |
| A: Stable (no drift) | none | batch 15 | batch 1 | batch 11 | missed | missed |
| B: Abrupt Drift | batch 5 | batch 6 | batch 1 | batch 6 | batch 6 | missed |
| C: Gradual Drift | batch 4 | batch 1 | batch 1 | batch 5 | batch 7 | missed |
| D: Label Shift | batch 5 | batch 6 | batch 1 | batch 6 | missed | missed |

Reading the Scenario D result honestly: Accuracy stays at 1.0 and F1 stays at 1.0 after the label shift. ADWIN and DDM produce zero detections. Feature drift rate is 93.3% — but that is a false signal from the shifted class composition. Only prediction distribution monitoring (chi-squared on class counts) catches this correctly. This is the most important finding in the benchmark: feature monitoring alone is not enough.
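The chi-squared check on class counts can be sketched as follows. This is an illustration under assumptions, not the PredictionMonitor code: `label_shift_pvalue` is a hypothetical helper, the Laplace smoothing is a choice made for this sketch, and the class counts are made up to mirror the 33%→70% shift in Scenario D.

```python
import numpy as np
from scipy.stats import chisquare

def label_shift_pvalue(reference_preds, current_preds, n_classes):
    """Chi-squared goodness-of-fit of current class counts vs. reference priors."""
    ref_counts = np.bincount(reference_preds, minlength=n_classes)
    cur_counts = np.bincount(current_preds, minlength=n_classes)
    # Laplace smoothing (an assumption of this sketch) avoids zero expected counts.
    ref_priors = (ref_counts + 1) / (ref_counts.sum() + n_classes)
    # Expected counts if the current batch followed the reference priors.
    expected = ref_priors * cur_counts.sum()
    return float(chisquare(cur_counts, f_exp=expected).pvalue)

# Balanced reference predictions, then a batch where class 0 rises to 70%.
reference_preds = np.repeat([0, 1, 2], [334, 333, 333])
stable_batch = np.repeat([0, 1, 2], [67, 67, 66])
shifted_batch = np.repeat([0, 1, 2], [140, 30, 30])

p_stable = label_shift_pvalue(reference_preds, stable_batch, n_classes=3)
p_shifted = label_shift_pvalue(reference_preds, shifted_batch, n_classes=3)
print(f"stable p={p_stable:.3f}, shifted p={p_shifted:.2e}")
```

The test only looks at predicted class counts, so it reacts to a change in class priors even when accuracy and F1 stay flat.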

Reading the Scenario A result honestly: PSI fires on batch 1 of a stable stream. That is PSI's known behavior — it is a magnitude measure, not a significance test. In production, PSI > 0.10 opens an investigation. KS confirmation (which shows no false positives until batch 15) is what drives the action decision.
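One way to encode the "PSI opens an investigation, KS drives the action" rule is a small triage function. The name `triage`, its return labels, and the choice to treat a lone KS hit as "investigate" are assumptions of this sketch; the 0.10 and 0.01 defaults come from the thresholds stated in this README.

```python
def triage(psi_value, ks_pvalue, psi_investigate=0.10, ks_alpha=0.01):
    """Combine a PSI magnitude and a KS p-value into a single status label."""
    psi_fired = psi_value > psi_investigate
    ks_fired = ks_pvalue < ks_alpha
    if psi_fired and ks_fired:
        # Magnitude and significance agree: treat as actionable drift.
        return "act"
    if psi_fired or ks_fired:
        # Only one signal fired: look closer, but do not act on it alone.
        return "investigate"
    return "stable"

print(triage(0.15, 0.50))   # PSI fired without KS confirmation
print(triage(0.15, 0.001))  # both fired
print(triage(0.02, 0.50))   # neither fired
```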


PSI From Scratch

The core PSI implementation used in all monitors:

import numpy as np

def psi(reference: np.ndarray, current: np.ndarray,
        n_bins: int = 10, epsilon: float = 1e-4) -> float:
    """
    Population Stability Index for a 1-D feature array.

    Returns
    -------
    float  —  < 0.10: stable | 0.10–0.20: investigate | > 0.20: critical
    """
    combined = np.concatenate([reference, current])
    bins = np.linspace(combined.min(), combined.max(), n_bins + 1)
    bins[0], bins[-1] = -np.inf, np.inf

    ref_pct = np.histogram(reference, bins=bins)[0] / len(reference) + epsilon
    cur_pct = np.histogram(current,   bins=bins)[0] / len(current)   + epsilon

    ref_pct /= ref_pct.sum()
    cur_pct /= cur_pct.sum()

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

Threshold Reference

KS test p-value
  p ≥ 0.05          →  Stable. No action.
  0.01 ≤ p < 0.05   →  Warning. Check PSI.
  p < 0.01          →  Critical. Confirm with MMD. Route to diagnosis.

PSI per feature
  PSI < 0.10        →  Stable.
  PSI 0.10–0.20     →  Investigate. PSI alone is not sufficient to act.
  PSI 0.20–0.25     →  Significant. Act if KS also fires.
  PSI > 0.25        →  Emergency. Immediate investigation.

ECE (Expected Calibration Error)
  ECE < 0.05        →  Well calibrated.
  ECE 0.05–0.10     →  Acceptable for most use cases.
  ECE 0.10–0.20     →  Confidence scores unreliable for threshold decisions.
  ECE > 0.20        →  Severely miscalibrated. Do not use confidence outputs.

ADWIN delta
  0.002             →  Conservative. Low false positives. Higher detection delay.
  0.005             →  Balanced. Recommended starting point.
  0.010             →  Aggressive. Fast detection. Higher false positive rate.
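For readers unfamiliar with the ECE thresholds above, here is a minimal sketch of the standard equal-width-bin ECE computation. The function name, the 10-bin default, and the toy inputs are assumptions; the repo's PerformanceMonitor may bin differently.

```python
import numpy as np

def expected_calibration_error(y_true, scores, n_bins=10):
    """ECE: sample-weighted mean |accuracy - confidence| over confidence bins."""
    confidences = scores.max(axis=1)      # top-class probability per sample
    predictions = scores.argmax(axis=1)
    correct = (predictions == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by the fraction of samples in the bin
    return float(ece)

# Toy case: every prediction is correct, but the model reports only 0.9 confidence.
y_true = np.array([0, 0, 1, 1])
scores = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
ece = expected_calibration_error(y_true, scores)
print(f"ECE = {ece:.4f}")
```

The toy case shows that ECE can be nonzero at perfect accuracy: an underconfident model is still miscalibrated.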

Project Structure

ML-Model-Monitoring/
├── data/
│   └── generators.py          # 5 stream generators (stable, abrupt, gradual,
│                              #   seasonal, label shift)
├── detectors/
│   ├── statistical.py         # KS test, PSI, MMD drift detectors
│   ├── adwin.py               # ADWIN + MultiMetricADWIN
│   └── ddm.py                 # DDM + EDDM sequential detectors
├── monitors/
│   ├── feature_monitor.py     # Per-feature drift tracking with reference window
│   ├── prediction_monitor.py  # Prediction distribution + label shift monitoring
│   └── performance_monitor.py # Accuracy / F1 / ECE decay tracking with ADWIN
├── evaluation/
│   ├── metrics.py             # MonitoringMetrics, DetectorComparisonResult
│   └── benchmark.py           # Four-scenario benchmark runner
├── dashboard/
│   └── report.py              # Terminal drift report (no external viz deps)
├── tests/
│   └── test_monitoring.py     # 32-test unit suite
├── run_benchmark.py           # Single entry point — runs all four scenarios
└── requirements.txt

Running the Tests

pip install pytest
python -m pytest tests/test_monitoring.py -v

32 tests. All pass. Coverage spans all five detectors, three monitors, five stream generators, and a full pipeline integration smoke test.


When to Use What

| You have | Use |
| --- | --- |
| Labeled reference window + incoming batches | FeatureMonitor (KS + PSI + MMD) |
| Softmax scores from your model | PredictionMonitor (confidence + label shift) |
| Ground truth labels (real-time or delayed) | PerformanceMonitor (ADWIN on accuracy) |
| Streaming per-prediction accuracy | ADWIN directly |
| Per-prediction error labels | DDM as lagging confirmation |
| Suspected correlated drift across features | MaximumMeanDiscrepancy (joint test) |

Skip the full stack when you are scoring a small fixed dataset once, or when you have a hard latency requirement under 50 ms. MMD permutation tests are the dominant cost, at about 10 s per scenario with 150-sample subsets and 100 permutations.
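To show where the MMD cost comes from, here is a minimal sketch of an RBF-kernel MMD with a permutation test. It is not the repo's MaximumMeanDiscrepancy class: the function names, the `gamma = 1 / n_features` bandwidth, and the biased estimator are assumptions of this sketch. The kernel matrix is built once over the pooled sample (quadratic in sample size) and then re-indexed for each permutation.

```python
import numpy as np

def rbf_kernel_matrix(points, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a - b||^2) over all rows of points."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mmd_squared(kernel, n_x):
    """Biased MMD^2 estimate; the first n_x rows/columns belong to sample X."""
    k_xx = kernel[:n_x, :n_x].mean()
    k_yy = kernel[n_x:, n_x:].mean()
    k_xy = kernel[:n_x, n_x:].mean()
    return k_xx + k_yy - 2.0 * k_xy

def mmd_permutation_test(x, y, n_permutations=100, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    gamma = 1.0 / pooled.shape[1]          # simple bandwidth choice (assumption)
    kernel = rbf_kernel_matrix(pooled, gamma)
    observed = mmd_squared(kernel, len(x))
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        # Shuffle which rows count as X and which as Y, reusing the same kernel.
        if mmd_squared(kernel[np.ix_(perm, perm)], len(x)) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value

rng = np.random.default_rng(42)
x = rng.normal(size=(150, 3))
y = rng.normal(size=(150, 3)) + 2.0        # shift every feature
observed, p_value = mmd_permutation_test(x, y)
print(f"MMD^2={observed:.4f}, p={p_value:.4f}")
```

With 100 permutations the smallest attainable p-value is 1/101, so raising the permutation count buys resolution at a linear cost, while raising the subsample size costs quadratically.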


Known Limitations

  • PSI uses equal-width binning, which misfires on heavy-tailed distributions. Switch to quantile-based bins for financial data.
  • MMD runtime scales quadratically with subsample size. Default is 150 samples — increase for sensitivity, decrease for speed. Embedding-based distances can replace the RBF kernel for text features.
  • DDM depends on ground-truth labels and cannot fire until enough errors accumulate. In the benchmark (15 batches), DDM missed every injected drift. In production, where labels arrive days or weeks late, it works as a lagging confirmation signal.
  • ADWIN is univariate — one instance per metric. MultiMetricADWIN wraps this for convenience but runs independent detectors per metric, not a joint test.
  • Reference window is fixed by default — set rolling=True in FeatureMonitor if you expect slow ongoing drift and only care about sudden regime changes.
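For the heavy-tailed case in the first limitation, a quantile-binned PSI might look like the sketch below. `psi_quantile` is a hypothetical variant, not a function in this repo; it takes bin edges from the reference quantiles so each reference bin holds roughly equal mass, and it assumes the reference has at least two distinct values.

```python
import numpy as np

def psi_quantile(reference, current, n_bins=10, epsilon=1e-4):
    """PSI with bin edges at reference quantiles instead of equal-width bins."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    edges = np.unique(edges)               # collapse duplicate edges caused by ties
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + epsilon
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + epsilon
    ref_pct /= ref_pct.sum()
    cur_pct /= cur_pct.sum()
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Heavy-tailed synthetic data: lognormal reference vs. same and shifted samples.
rng = np.random.default_rng(42)
reference = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
same = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
shifted = rng.lognormal(mean=1.0, sigma=1.0, size=5000)

psi_same = psi_quantile(reference, same)
psi_shifted = psi_quantile(reference, shifted)
print(f"same={psi_same:.4f}, shifted={psi_shifted:.4f}")
```

With equal-width bins, a few extreme values stretch the range and push most of the mass into one or two bins; quantile edges keep every bin populated in the reference.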

Related

Same series — production layers for ML systems:

Retrain vs Fine-Tune vs Train from Scratch: A Decision Framework for ML Engineers — Article 08. The decision framework this monitoring stack feeds into. When a drift alert fires, this is where it routes.

Continual Learning in PyTorch: A Practical Guide for ML Engineers — Article 07. Three continual learning scenarios and when each applies. Distinguishes the cases where monitoring output routes to EWC vs replay vs retraining.

How to Prevent Catastrophic Forgetting in PyTorch — Article 05. The failure mode that looks identical to concept drift from the outside. Monitoring alone cannot distinguish them — this article explains why.

Production ML Engineering Guide — Article 01. Series hub. All 15 articles mapped.


License

MIT

About

ML model monitoring is not just drift detection. This repo benchmarks PSI, KS, MMD, and ADWIN across real failure scenarios like data drift, label shift, and model decay in production.
