MLB Prediction System

Research-grade MLB prediction system for pre-game win probability, standings, and related analysis — regular season and spring training, 2000–2026.

Models

The system trains six models on 136 pre-game features using an expanding-window protocol: each season N is evaluated using a model trained exclusively on seasons before N, so all reported metrics are fully out-of-sample. Every model goes through probability calibration (isotonic for tree models, Platt sigmoid for linear/neural) and time-weighted training (exponential decay rate 0.12 per season: with weight = exp(−0.12 × seasons before 2024), 2024 weight = 1.0, 2020 weight ≈ 0.62, 2015 weight ≈ 0.34). Features are dynamically selected based on availability across seasons.
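The time weighting described above can be sketched in a few lines. This is a minimal illustration, assuming the weight is exp(−rate × seasons back) with 2024 as the most recent training season; the function name `season_weight` is hypothetical and not taken from the codebase.

```python
import math

DECAY_RATE = 0.12  # per-season decay rate quoted in this README


def season_weight(season: int, latest_season: int, rate: float = DECAY_RATE) -> float:
    """Sample weight for a training season, assuming w = exp(-rate * seasons_back)."""
    seasons_back = latest_season - season
    if seasons_back < 0:
        raise ValueError("season must not be later than latest_season")
    return math.exp(-rate * seasons_back)


# Weights for a model whose most recent training season is 2024.
weights = {s: round(season_weight(s, 2024), 2) for s in (2024, 2020, 2015)}
print(weights)
```

Under this assumption a season nine years back still contributes roughly a third of the weight of the latest one, so older seasons are damped rather than discarded.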

Logistic Regression

A regularised linear model that serves as the interpretable baseline. All 136 features are z-score standardised before fitting. Because the decision boundary is a hyperplane, the model captures additive effects on the log-odds scale: for example, a larger Elo differential raises the home-win log-odds by a fixed amount regardless of the other features. Its simplicity makes it fast, stable, and easy to audit.

Regularisation: L2 (ridge), C=1.0
Solver: L-BFGS with up to 1 000 iterations
Interpretability: SHAP attributions are computed directly from coef × z-score — no approximate explainer needed
When to use: Speed-critical inference, auditing individual predictions, baseline comparison
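The coef × z-score attribution mentioned above can be illustrated as follows. This is a sketch with a hypothetical two-feature model; the coefficients, intercept, means, and standard deviations are invented for the example and do not come from a trained artifact.

```python
import math


def linear_attributions(coefs, means, stds, x):
    """Contribution of each feature to the logit: coef * z-score."""
    return [c * (xi - m) / s for c, m, s, xi in zip(coefs, means, stds, x)]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical model with two features (Elo differential, SP ERA differential).
coefs, intercept = [0.8, -0.3], 0.15
means, stds = [0.0, 0.0], [50.0, 0.6]
game = [75.0, -0.6]

contributions = linear_attributions(coefs, means, stds, game)
logit = intercept + sum(contributions)  # attributions add up to the logit
print(contributions, round(sigmoid(logit), 3))
```

The contributions sum exactly to the logit minus the intercept, which is why no approximate explainer is needed for the linear model.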

LightGBM

Microsoft's LightGBM grows an ensemble of shallow decision trees in sequence, where each tree corrects the residual errors of the ones before it. Unlike logistic regression, it captures non-linear interactions — for example, a high Elo differential combined with a good home/away split may carry a larger joint effect than either feature alone.

Hyperparameters: 60-trial Optuna Bayesian search minimising out-of-sample Brier score (typical result: num_leaves≈63, learning_rate≈0.05, n_estimators≈500)
Interpretability: Tree-based SHAP values via shap.TreeExplainer
When to use: Fast batch inference at scale; often competitive with XGBoost

XGBoost

DMLC's XGBoost is the other dominant gradient-boosted tree library. Its regularisation scheme (min_child_weight, separate L1/L2 penalties on leaf weights) and histogram-based split finding produce probability estimates that are complementary to LightGBM — they tend to disagree most on uncertain games near 50%, which makes them useful ensemble partners. XGBoost typically achieves the best single-model Brier score in this system.

Hyperparameters: 60-trial Optuna Bayesian search (typical result: max_depth≈6, learning_rate≈0.05, n_estimators≈500)
Interpretability: Tree-based SHAP values via shap.TreeExplainer
When to use: Highest standalone accuracy; default choice when not ensembling

CatBoost

Yandex's CatBoost uses ordered boosting and symmetric (oblivious) decision trees. Ordered boosting reduces prediction shift, and symmetric trees tend to generalise well on tabular data. It acts as a third complementary tree model in the stacked ensemble, providing low-variance predictions that differ structurally from LightGBM and XGBoost.

Regularisation: L2 leaf regularisation, learning rate decay
Architecture: Symmetric (oblivious) trees with ordered boosting
When to use: Robustness-focused inference; low-variance ensemble partner

Neural Network (MLP)

A multi-layer perceptron classifier with three hidden layers (128, 64, 32 units) and ReLU activations. It captures nonlinear feature interactions that tree models may miss. Features are z-score normalised before training. It adds model diversity for stacking because it is a continuous function of its inputs, whereas tree-based learners produce piecewise-constant outputs.

Architecture: 128 → 64 → 32 → 1, Adam optimiser
Regularisation: L2 weight decay (alpha)
When to use: Ensemble diversity; capturing non-tree-like nonlinearities

Stacked Ensemble (default production model)

The stacked ensemble never sees raw features. Instead, it takes the calibrated probability outputs of all five base models as its five inputs and trains a logistic-regression meta-learner to find the optimal blend. Because each base model makes different errors, the meta-learner learns one weight per base model and up-weights the models whose probabilities are most informative on the calibration split.

 Logistic prob  ─┐
 LightGBM prob  ─┤
 XGBoost prob   ─┼──▶  Logistic meta-learner  ──▶  P(home win)
 CatBoost prob  ─┤
 MLP prob       ─┘

Meta-learner: LogisticRegression(C=0.5) — slight regularisation prevents over-fitting to the calibration set
Training: The meta-learner is fit on the same held-out calibration split used for Platt scaling, so base-model probabilities are out-of-sample relative to the meta-learner
When to use: Always — this is the default and achieves the best Brier score and calibration
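The blend performed by a logistic meta-learner can be sketched like this. The weights and intercept below are hypothetical placeholders chosen so that five inputs of 0.5 map to 0.5; the real values are whatever `LogisticRegression(C=0.5)` learns on the calibration split.

```python
import math

BASE_MODELS = ("logistic", "lightgbm", "xgboost", "catboost", "mlp")


def blend(base_probs: dict, weights: dict, intercept: float) -> float:
    """Logistic meta-learner: sigmoid(intercept + sum of weight * base probability)."""
    z = intercept + sum(weights[m] * base_probs[m] for m in BASE_MODELS)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
weights = {m: 1.0 for m in BASE_MODELS}
intercept = -2.5

toss_up = blend({m: 0.5 for m in BASE_MODELS}, weights, intercept)
favourite = blend(
    {"logistic": 0.60, "lightgbm": 0.64, "xgboost": 0.66, "catboost": 0.62, "mlp": 0.58},
    weights,
    intercept,
)
print(toss_up, favourite)
```

Because the meta-learner is a single linear layer over five probabilities, it has only six parameters, which keeps the risk of over-fitting the calibration split low.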

Training tiers

The system supports two training tiers, each producing distinct model artifacts:

Tier	Version tag	Directory	Training scope	When used
Quick (`--tier quick`)	`v4q`	`data/models/quick/`	Skip HPO, skip CV, skip Stage 1 player model	Initial bootstrap (first cold-start), emergency model recovery
Full (`--tier full`)	`v4`	`data/models/full/`	Complete pipeline with Stage 1 (CV skipped for dashboard retrains to fit in container memory)	Production retraining, daily scheduled retraining

At startup, the system prefers full models over quick, with legacy (pre-tier) models as final fallback. Users can switch between available tiers on the admin dashboard.

When retraining, old model artifacts are archived (moved to data/models/archive/) rather than deleted, preserving them for drift analysis. Each archive entry is timestamped so multiple training runs can be compared.

4-way startup logic

On startup, the application checks for existing processed data and trained models, then takes the minimal action needed:

Data exists?	Models exist?	Action taken
Yes	Yes	Normal startup — load models and serve immediately
Yes	No	Quick-train only — train bootstrap models (v4q), skip data ingestion
No	Yes	Data ingest only — ingest data, preserve existing models
No	No	Full bootstrap — ingest all data, then quick-train models

When data and models are both present, the server still starts listening immediately: model and feature loading run in the background while / shows the initializing page until loading finishes. That avoids connection errors in the browser during long stacked-model + Stage 1 startup.
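The four-way decision table above reduces to a small pure function. This is a sketch only; the enum and function names are hypothetical, and the real entrypoint may detect data and models differently.

```python
from enum import Enum


class StartupAction(Enum):
    NORMAL = "load models and serve immediately"
    QUICK_TRAIN = "train bootstrap models (v4q), skip data ingestion"
    INGEST_ONLY = "ingest data, preserve existing models"
    FULL_BOOTSTRAP = "ingest all data, then quick-train models"


def choose_startup_action(data_exists: bool, models_exist: bool) -> StartupAction:
    """Map the two existence checks to the minimal startup action."""
    if data_exists and models_exist:
        return StartupAction.NORMAL
    if data_exists:
        return StartupAction.QUICK_TRAIN
    if models_exist:
        return StartupAction.INGEST_ONLY
    return StartupAction.FULL_BOOTSTRAP


for data in (True, False):
    for models in (True, False):
        print(data, models, choose_startup_action(data, models).name)
```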

Training techniques

Technique	What it does
Probability calibration	Isotonic calibration (non-parametric monotonic mapping) for tree models (LightGBM, XGBoost, CatBoost); Platt calibration (sigmoid) for logistic and MLP. Both use a held-out calibration set so predicted 65% games actually win ~65% of the time.
Time-weighted training	Exponential decay (`rate=0.12` per season) gives recent seasons more influence. This adapts the model to baseball rule changes — the 2023 shift ban, pitch clock, and larger bases shift team-level stats in ways that older seasons do not reflect.
Optuna HPO	Bayesian hyperparameter search (60 trials per model type) over a 5-season expanding-window objective. Searches `learning_rate`, tree depth, `n_estimators`, `subsample`, `colsample_bytree`, L1/L2 regularisation, and calibration method. Supports LightGBM, XGBoost, and CatBoost.
Expanding-window CV	For evaluation season N, the model is trained on all seasons before N. No future data ever leaks into training or calibration.
Dynamic feature selection	The pipeline automatically detects the intersection of available features across all season DataFrames and trains using only those features, ensuring robustness to missing columns in older seasons.
Spring training weighting	Spring training games are down-weighted via `--spring-weight` (default 0.25) so regular-season performance drives the model more strongly.
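Two of the techniques in the table, expanding-window CV and dynamic feature selection, can be sketched without any ML library. The helper names and the example column names are hypothetical; the sketch only shows the splitting and intersection logic, not the project's actual training loop.

```python
def expanding_window_folds(seasons, min_train=1):
    """Yield (train_seasons, eval_season) pairs; training uses only earlier seasons."""
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


def common_features(columns_by_season):
    """Features present in every season's frame (intersection of column sets)."""
    column_sets = [set(cols) for cols in columns_by_season.values()]
    return sorted(set.intersection(*column_sets)) if column_sets else []


folds = list(expanding_window_folds([2000, 2001, 2002, 2003]))
print(folds)

usable = common_features({
    2000: ["elo_diff", "rest_days"],
    2024: ["elo_diff", "rest_days", "xwoba_diff"],
})
print(usable)
```

Each evaluation season appears exactly once, and its training set never contains that season or any later one, which is what keeps the reported metrics out-of-sample.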

Features (136 total, v4)

Team performance (27 features)

Elo rating (home, away, diff) — sequential cross-season rating with regression-to-mean at each season start; accounts for opponent quality
Multi-window rolling (7 / 14 / 15 / 30 / 60 games, cross-season warm-start): win%, run differential, Pythagorean expectation
EWMA rolling (span=20): exponentially-weighted recent-form metrics
Home/away performance splits: team win% and Pythagorean computed separately in home games vs. road games
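For readers unfamiliar with Elo, the sequential rating with season-start regression can be sketched as below. The K-factor, home advantage, league mean, and regression fraction are all hypothetical values picked for illustration; this README does not state the constants the project uses.

```python
def expected_home_win(home_elo: float, away_elo: float, home_advantage: float = 0.0) -> float:
    """Standard Elo expectation for the home team."""
    return 1.0 / (1.0 + 10 ** ((away_elo - home_elo - home_advantage) / 400.0))


def update_elo(home_elo, away_elo, home_won, k=4.0, home_advantage=0.0):
    """Zero-sum rating update after one game (k is a hypothetical K-factor)."""
    expected = expected_home_win(home_elo, away_elo, home_advantage)
    delta = k * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(elo: float, mean: float = 1500.0, fraction: float = 0.25) -> float:
    """Pull a rating part of the way back to the league mean at season start."""
    return mean + (1.0 - fraction) * (elo - mean)


home, away = update_elo(1500.0, 1500.0, home_won=True)
print(home, away, regress_to_mean(1560.0))
```

Because a win over a stronger opponent moves the rating more than a win over a weaker one, the rating accounts for opponent quality without any explicit schedule adjustment.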

Run distribution (4 features)

Scoring variance (30-game window): run standard deviation for each team
One-run game win% (30-game window): close-game resilience metric

Context & fatigue (7 features)

Streak: current win (+) or loss (−) streak for each team
Rest days: calendar days since last game (capped at 10)
Season progress: 0 = opener, 1 = final day
Day/night: 1 = day game, 0 = night game
Interleague: 1 = interleague matchup
Day of week: 0 (Monday) – 6 (Sunday)
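The streak and rest-day features above are simple to compute from a team's game history. A minimal sketch, assuming results are ordered oldest to newest and that rest days are the plain calendar-day difference between consecutive games; both function names are hypothetical.

```python
from datetime import date


def current_streak(results):
    """Signed streak from a list of booleans (True = win), oldest to newest."""
    streak = 0
    for won in reversed(results):
        if streak == 0:
            streak = 1 if won else -1
        elif (streak > 0) == won:
            streak += 1 if won else -1
        else:
            break
    return streak


def rest_days(previous_game: date, game: date, cap: int = 10) -> int:
    """Calendar days since the last game, capped (the cap of 10 is from the list above)."""
    return min((game - previous_game).days, cap)


print(current_streak([True, False, False]))           # two straight losses
print(rest_days(date(2023, 10, 1), date(2024, 3, 28)))  # off-season gap hits the cap
```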

Pitcher quality (8 features)

Prior-season SP ERA, K/9, BB/9, WHIP from the MLB Stats API — one row per pitcher per season, joined by name

Statcast individual player features (6 features)

Lineup-weighted batter xwOBA (home, away) — prior-season Statcast expected wOBA averaged across the 9-man lineup; uses Chadwick Register for Retrosheet → MLBAM ID mapping
Lineup-weighted barrel% (home, away) — prior-season barrel rate averaged across the lineup
Starting pitcher expected wOBA allowed (home, away) — prior-season Statcast xwOBA for the opposing starter

Advanced team metrics (FanGraphs, prior season, 20 features)

Batting: wOBA, Barrel%, Hard Hit%, ISO, BABIP, xwOBA
Pitching: FIP, xFIP, K%, BB%, HR/FB, WHIP

Bullpen (8 features)

Bullpen usage (15 / 30 game window): rolling average of relief innings pitched
Bullpen ERA proxy (15 / 30 game window): rolling average of earned runs allowed by the bullpen

Lineup (2 features)

Lineup continuity (home, away) — fraction of the prior game's lineup retained
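Lineup continuity as defined above is a set overlap. The sketch below assumes lineups are lists of player IDs and returns 0.0 when there is no prior game; that fallback and the function name are assumptions, not documented behaviour.

```python
def lineup_continuity(previous_lineup, current_lineup) -> float:
    """Fraction of the prior game's lineup that appears in the current lineup."""
    previous = set(previous_lineup)
    if not previous:
        return 0.0  # assumption: no prior game means no continuity signal
    return len(previous & set(current_lineup)) / len(previous)


yesterday = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"]
today = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p10", "p11"]
print(lineup_continuity(yesterday, today))
```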

Park & venue (1 feature)

Park run factor — historical runs per game at the venue vs. league average

Game type (1 feature)

is_spring — binary: 1.0 for spring training, 0.0 for regular season

Vegas odds (2 features)

Implied home win probability — converted from money-line odds (defaults to 0.5 when unavailable)
Line movement — change from opening to closing implied probability
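Converting American money lines to implied probabilities uses a standard formula. The sketch below also normalises the two sides so they sum to 1 (removing the bookmaker margin); whether the project applies that normalisation is an assumption, and the function names are hypothetical. The 0.5 fallback mirrors the default stated above.

```python
def implied_prob(moneyline: float) -> float:
    """Raw implied probability from American odds (still includes the margin)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100.0)
    return 100.0 / (moneyline + 100.0)


def home_implied_prob(home_ml, away_ml) -> float:
    """Margin-free home probability; 0.5 when odds are unavailable."""
    if home_ml is None or away_ml is None:
        return 0.5
    home, away = implied_prob(home_ml), implied_prob(away_ml)
    return home / (home + away)


def line_movement(open_prob: float, close_prob: float) -> float:
    """Change from opening to closing implied probability."""
    return close_prob - open_prob


print(implied_prob(-150), implied_prob(130), home_implied_prob(-150, 130))
```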

Weather (3 features)

Game temperature (°F), wind speed (mph), humidity (%) — fetched from Open-Meteo historical API using park geo-coordinates

Differential features (9 features)

Pythagorean diff, EWMA Pythagorean diff, home/road split diff, SP ERA diff, wOBA diff, FIP diff, xwOBA diff, WHIP diff, ISO diff

Stage 1 player model features (17 features)

Lineup strength (home, away) — neural lineup quality score from PyTorch player embedding model
Top-3 / bottom-3 quality (home, away) — average player quality for batters 1–3 and 7–9
Lineup variance (home, away) — standard deviation of player quality across the 9-man lineup
Platoon advantage (home, away) — learned platoon interaction vs. opposing SP handedness
SP quality (home, away) — neural starting pitcher quality from EWMA rolling stats and learned embeddings
Lineup vs SP (home, away) — learned interaction between lineup strength and opposing SP quality
Differentials — lineup strength diff, SP quality diff, matchup advantage diff

Quick start

Install

git clone <repo>
cd mlb-predict
pip install -e .

Full data ingestion (first run)

# 1. Fetch MLB schedules (2000–2026) — includes preseason by default; use --no-preseason to opt out
python scripts/ingest_schedule.py --seasons $(seq 2000 2026)

# 2. Fetch Retrosheet gamelogs (historical + current season)
python scripts/ingest_retrosheet_gamelogs.py --seasons $(seq 2000 2025)

# 3. Build Retrosheet ↔ MLB crosswalk
python scripts/build_crosswalk.py --seasons $(seq 2000 2025)

# 4. Fetch individual pitcher season stats
python scripts/ingest_pitcher_stats.py --seasons $(seq 2000 2025)

# 5. Fetch FanGraphs team advanced metrics
python scripts/ingest_fangraphs.py --seasons $(seq 2002 2025)

Build features

# Historical seasons (2000–2025)
python scripts/build_features.py --seasons $(seq 2000 2025)

# Spring training features (schedule scores + prior-season team state)
python scripts/build_spring_features.py --seasons $(seq 2000 2026)

# 2026 pre-season predictions (uses 2025 end-of-season team strength)
python scripts/build_features_2026.py

Ingest external data (optional — Vegas odds and weather)

# Vegas odds (requires a CSV of historical money lines)
python scripts/ingest_vegas.py --input odds.csv

# Weather data (backfills from Open-Meteo API based on gamelogs)
python scripts/ingest_weather.py

Train models (with Optuna HPO)

python scripts/train_model.py --hpo --hpo-trials 60

Skip HPO if you just want to re-train with existing hyperparameters:

# Train all 6 models: logistic, lightgbm, xgboost, catboost, mlp, stacked
# --spring-weight 0.25 (default) down-weights spring training games
python scripts/train_model.py

# Train a subset
python scripts/train_model.py --models logistic xgboost stacked

Training tiers

# Full training (default) — complete pipeline with HPO, CV, Stage 1
python scripts/train_model.py --tier full

# Quick training — fast bootstrap, skips HPO/CV/Stage 1
python scripts/train_model.py --tier quick

Quick tier automatically sets --skip-cv, --no-stage1, and disables HPO. Models are saved with a v4q version tag under data/models/quick/, keeping them separate from full-pipeline models in data/models/full/.

Launch the web dashboard

python scripts/serve.py                   # default: stacked ensemble, http://localhost:30087
python scripts/serve.py --model xgboost   # use XGBoost model
python scripts/serve.py --model catboost  # use CatBoost model
python scripts/serve.py --model mlp       # use MLP (neural network) model

Open:

http://localhost:30087 — all-seasons games browser
http://localhost:30087/season/2026 — 2026 schedule and predictions
http://localhost:30087/standings — predicted vs actual standings with team stats
http://localhost:30087/dashboard — admin dashboard (update season, full reingest, retrain, system status)
http://localhost:30087/sitemap — complete page and API index

CLI query tool

# Game detail with SHAP attribution
python scripts/query_game.py --game-pk 745444

# Dodgers vs. Padres on opening day 2024
python scripts/query_game.py --home SDP --away LAD --season 2024 --date 2024-03-20

# All 2024 Dodgers home games (compact)
python scripts/query_game.py --home LAD --season 2024 --show-schedule

# Biggest upsets of 2024
python scripts/query_game.py --season 2024 --show-upsets --top-n 10

# Brief one-line output
python scripts/query_game.py --home NYY --season 2025 --brief

Server management

Start in foreground (development)

python scripts/serve.py                   # stacked ensemble (default)
python scripts/serve.py --model xgboost   # explicit model selection

Start in background (production)

mkdir -p logs
nohup python scripts/serve.py >> logs/server.log 2>&1 &
echo $! > server.pid

The server PID is saved to server.pid so it can be stopped cleanly later.

Stop the server

# Graceful stop using saved PID
kill $(cat server.pid)

# Force stop using saved PID (if graceful stop hangs)
kill -9 $(cat server.pid)

# Stop by port number (no PID file needed)
kill $(lsof -ti:30087)

# Force stop by port number
kill -9 $(lsof -ti:30087)

# Stop every process whose command line contains serve.py
pkill -f "serve.py"

Restart the server

kill $(lsof -ti:30087) 2>/dev/null; sleep 2
nohup python scripts/serve.py >> logs/server.log 2>&1 &
echo $! > server.pid

Check server status

# Is the server running?
lsof -i:30087

# Tail the server log
tail -f logs/server.log

# Check the PID file
cat server.pid && kill -0 $(cat server.pid) && echo "running" || echo "not running"

Daily update (cron job)

The scripts/update_daily.sh script refreshes game results, rebuilds features, and restarts the server. It is designed to run at 01:00 each night after Retrosheet publishes the previous day's results.

What the script does

Step	Action
1	Refresh the current-season MLB schedule (picks up postponements and rescheduled games)
2	Refresh the current-season Retrosheet gamelogs (yesterday's results)
3	Rebuild the Retrosheet ↔ MLB crosswalk for the current season
4	Rebuild the 136-feature matrix for the current season (incl. Statcast, Vegas, weather)
5	Build spring training features for the current season
6	Rebuild 2026 pre-season predictions from the updated team state
7	Kill the running server and start a fresh instance to load the new data

All output is appended to logs/cron.log; the server log goes to logs/server.log.

One-time setup

# 1. Make the script executable
chmod +x scripts/update_daily.sh

# 2. Create the logs directory
mkdir -p logs

# 3. Test it manually first
scripts/update_daily.sh

Install the cron job

crontab -e

Add this line (replace the path with your actual project root):

0 1 * * * /path/to/mlb-predict/scripts/update_daily.sh >> /path/to/mlb-predict/logs/cron.log 2>&1

The format is minute hour day month weekday command:

Value	Field	Meaning
`0`	minute	at the top of the hour
`1`	hour	1 AM local time
`*`	day	every day
`*`	month	every month
`*`	weekday	every day of the week

Verify the cron job is registered

crontab -l

Remove the cron job

crontab -e
# Delete the update_daily.sh line, save and exit

Override environment variables

The script respects the following environment variables, which can be set inline:

# Use a different Python or model
PYTHON=/usr/local/bin/python3 MODEL=stacked scripts/update_daily.sh

# Run for a specific season only (useful for backfilling)
# Edit update_daily.sh YEAR variable or export:
YEAR=2025 scripts/update_daily.sh

Docker

The entire workflow — data ingestion, model training, web server, and scheduled re-runs — can be run as a single Docker container with the data volume mounted on the host.

Prerequisites

Docker ≥ 24
Docker Compose v2 (bundled with Docker Desktop)
Minimum 4 GB RAM allocated to the container (8 GB recommended for training). The ML stack — pandas, LightGBM, XGBoost, CatBoost, scikit-learn MLP, SHAP — requires ~2 GB at startup; training all 6 models concurrently can peak higher.
Supported platforms: linux/amd64 and linux/arm64 (Synology/QNAP NAS, AWS Graviton, Oracle Ampere, Apple Silicon)

Quick start

# 1. Build the image and start the container
docker compose up --build

# 2. Open the dashboard (once the bootstrap is complete)
open http://localhost:30087

First-run notice: On a cold start (no data/ directory on the host) the container runs the full bootstrap pipeline — ingesting 25+ years of historical data and training all 6 models. This can take several hours. Subsequent starts are fast because the data volume persists on the host.

Detached / daemon mode

docker compose up --build -d        # start in background
docker compose logs -f mlb-predict      # follow all logs
docker compose logs -f mlb-predict | grep '\[server\]'   # server logs only

Stop / restart

docker compose down                  # stop and remove container (data is preserved)
docker compose restart mlb-predict       # restart without rebuilding
docker compose up -d                 # start again

Environment overrides

Variable	Default	Description
`MODEL`	`stacked`	Model served: `logistic \| lightgbm \| xgboost \| catboost \| mlp \| stacked`
`PORT`	`30087`	Host port the dashboard is exposed on
`MEM_LIMIT`	`1536m`	Container memory limit (use `2g` for training/bootstrap)
`CPUS`	`2`	Container CPU limit
`MLB_PREDICT_LIVE_API`	`1`	Set to `0` to disable live MLB Stats API and Odds API calls at runtime (minimize network)
`TORCH_SOURCE`	`0`	Build arg: set to `1` to compile PyTorch from source for SSE4.2 CPUs (see PyTorch source build)

# Serve on port 9000 with the XGBoost model
PORT=9000 MODEL=xgboost docker compose up -d

Minimize network footprint

To reduce outbound traffic when the app is running:

Disable live API calls at runtime — set MLB_PREDICT_LIVE_API=0. The dashboard will serve predicted standings and game lists from precomputed data only; play-by-play, live standings, league leaders, team stats, and odds refresh will not call external APIs. Standings pages show predicted standings without live actuals; other live endpoints return a “live data disabled” message.
Reduce scheduled network use — the container runs cron at 01:00 UTC (ingest) and 20:00 UTC (retrain). To avoid that traffic, you can comment out the two entries in docker/crontab and rebuild the image, or run the container without supercronic (custom entrypoint). Data will then only change when you run ingest/train manually or via the admin pipeline.

Live Odds (optional)

To enable live game and futures odds from The Odds API, set an API key in one of these ways:

Environment: ODDS_API_KEY=your-key
Admin dashboard: Dashboard → “Live Odds API Key” → enter key → Save (writes data/processed/odds/config.json)
Config file: Create data/processed/odds/config.json with {"api_key": "your-key"} (the data/ directory is git-ignored)

Without a key, the app runs normally; odds features are simply unavailable. See the Wiki → Data Sources → Live Odds for full instructions.

Force a full re-bootstrap

Deleting the model artifacts directory causes the entrypoint to retrain on the next start. The 4-way startup logic determines the minimal action needed (quick-train if processed data is still present, full bootstrap otherwise):

docker compose down
rm -rf data/models/          # remove all trained model artifacts
docker compose up -d         # triggers quick-train (data preserved) or full bootstrap (no data)

To force a complete re-bootstrap from scratch (ingest + train):

docker compose down
rm -rf data/models/ data/processed/    # remove both models and processed data
docker compose up -d                   # triggers full bootstrap (ingest + quick-train)

Scheduled jobs (inside the container)

The container runs two cron jobs via supercronic:

Schedule	Script	What it does
01:00 UTC	`docker/ingest_daily.sh`	Refresh current-season schedule and gamelogs (incl. spring training), rebuild 136-feature matrix and spring features, restart server
20:00 UTC	`docker/retrain_daily.sh`	Retrain all 6 models on fresh data, restart server

Logs are written to ./logs/ingest_daily.log and ./logs/retrain_daily.log on the host.

Inspect or restart processes inside the container

docker exec -it mlb-predict supervisorctl status
docker exec -it mlb-predict supervisorctl restart mlb-predict-server
docker exec -it mlb-predict supervisorctl tail -f mlb-predict-server

Data volume layout

Both ./data and ./logs on the host are bind-mounted into /app/data and /app/logs inside the container.

./data/          ←→  /app/data     (raw + processed data, trained models)
./logs/          ←→  /app/logs     (server, cron, bootstrap, supervisord logs)

All data is accessible on the host machine at all times. The container itself is stateless — removing and recreating it leaves all data intact.

Build the image without Compose

docker build -t mlb-predict .
docker run -p 30087:30087 \
    -v "$(pwd)/data:/app/data" \
    -v "$(pwd)/logs:/app/logs" \
    -e MODEL=stacked \
    mlb-predict

GitHub Container Registry (GHCR)

The CI pipeline automatically builds and publishes the production image to GHCR on every push to main and on version tags:

# Pull the latest image from GHCR
docker pull ghcr.io/sv4u/mlb-predict:main

# Run directly from GHCR (no local build needed)
docker run -p 30087:30087 \
    -v "$(pwd)/data:/app/data" \
    -v "$(pwd)/logs:/app/logs" \
    ghcr.io/sv4u/mlb-predict:main

# Or use the image-only Compose file (same env/volumes as docker-compose.yml, no build)
docker compose -f docker-compose.image.yml pull
docker compose -f docker-compose.image.yml up -d

Git event	Image tag(s) published
Push to `main`	`:main`, `:sha-<short>`
Tag `v1.2.3`	`:1.2.3`, `:1.2`, `:sha-<short>`
Pull request	Image is built but not pushed

Docker image stages

The Dockerfile uses a multi-stage build:

Stage	Built by	Platforms	Contents
`pytorch-builder`	CI (when `TORCH_SOURCE=1`)	amd64	PyTorch compiled from source for SSE4.2 CPUs
`gitlog`	both	all	Git commit history extraction
`base`	both	amd64 + arm64	System deps, supercronic (arch-aware), editable Python package
`test`	CI only	amd64 only	`base` + dev deps (`ruff`, `mypy`, `pytest`) + `tests/`
`production`	CI + local	amd64 + arm64	`base` + `scripts/`, `docker/` helpers, entrypoint

supercronic is downloaded for the correct architecture at build time using Docker BuildKit's TARGETARCH built-in — no manual configuration needed.

To build only the test stage locally:

docker build --target test -t mlb-predict:test .
docker run --rm --entrypoint python mlb-predict:test -m pytest tests/ -v

PyTorch source build (`TORCH_SOURCE=1`)

Pre-built PyTorch wheels from download.pytorch.org/whl/cpu are compiled with AVX2 instructions. CPUs that lack AVX2 (e.g. Intel Celeron J4125 in TrueNAS devices) crash with SIGILL (Illegal Instruction) when running Stage 1 player embeddings.

Setting TORCH_SOURCE=1 compiles PyTorch from source inside the pytorch-builder Docker stage, targeting only SSE4.2 instructions. This produces a wheel that runs on any x86-64 CPU with SSE4.2 support, including those without AVX2.

# Build locally with source-compiled PyTorch (slow first time, ~30-90 min)
TORCH_SOURCE=1 docker compose up --build

# Or build the image directly
docker build --build-arg TORCH_SOURCE=1 -t mlb-predict:sse42 .

Build arg	Default	Description
`TORCH_SOURCE`	`0`	`1` = build PyTorch from source for SSE4.2; `0` = use pre-built wheels
`PYTORCH_VERSION`	`2.6.0`	PyTorch version tag to build (only used when `TORCH_SOURCE=1`)
`PYTORCH_BUILD_JOBS`	`2`	Max parallel compilation jobs (lower = less RAM; raise on CI runners)

CI behavior: The GitHub Actions workflow automatically sets TORCH_SOURCE=1 for non-PR builds (pushes to main and version tags). The pytorch-builder layer is cached by GHA cache, so only the first build (or PYTORCH_VERSION bumps) pays the full compile cost. PR builds use pre-built wheels for fast validation.

NAS deployment: Pull the CI-built image from GHCR via docker-compose.image.yml — no local source build needed.

Data pipeline

MLB Stats API     Retrosheet gamelogs   FanGraphs      Statcast (pybaseball)
      │                  │                  │                  │
      ▼                  ▼                  ▼                  ▼
 schedule/          retrosheet/        fangraphs/        statcast_player/
 games_YYYY         gamelogs_YYYY      fangraphs_YYYY    batter/pitcher stats
      │                  │
      └──── crosswalk ───┘
             game_id_map_YYYY

 Open-Meteo API      Vegas odds CSV
       │                   │
       ▼                   ▼
   weather/             vegas/
 by_park_date         vegas_YYYY

                    │
                    ▼
              features/
     features_YYYY.parquet      ←── 136 features per game (build_features.py)
     features_spring_YYYY.parquet ←── spring training (build_spring_features.py)
     features_2026.parquet      ←── pre-season 2026 (from build_features_2026.py)
                    │
                    ▼
               models/
  quick/                        ←── bootstrap models (v4q, --tier quick)
    logistic_v4q_train2026/     lightgbm_v4q_train2026/  ...
  full/                         ←── production models (v4, --tier full)
    logistic_v4_train2026/      lightgbm_v4_train2026/   ...
  archive/                      ←── archived models (for drift analysis)
    logistic_v4_train2025_20260315T.../  ...

Data locations

Path	Contents
`data/raw/mlb_api/schedule/`	Raw MLB Stats API JSON responses (schedule endpoint)
`data/raw/mlb_api/stats/`	Raw MLB Stats API JSON responses (pitcher stats endpoint)
`data/raw/mlb_api/teams/`	Raw MLB Stats API JSON responses (teams endpoint)
`data/raw/retrosheet/gamelogs/`	Raw Retrosheet GL text files (`GL<YYYY>.TXT`)
`data/processed/schedule/`	`games_YYYY.parquet` + CSV + checksums
`data/processed/retrosheet/`	`gamelogs_YYYY.parquet` + CSV + checksums
`data/processed/crosswalk/`	`game_id_map_YYYY.parquet`, coverage report, failed lists
`data/processed/teams/`	`teams_YYYY.parquet` (MLB team roster metadata)
`data/processed/pitcher_stats/`	`pitchers_YYYY.parquet` (MLB API individual pitcher stats)
`data/processed/fangraphs/`	`fangraphs_YYYY.parquet` (FanGraphs team advanced metrics)
`data/processed/statcast_player/`	Statcast individual batter and pitcher stats (via pybaseball)
`data/processed/vegas/`	`vegas_YYYY.parquet` (implied probabilities from money lines)
`data/processed/weather/`	`by_park_date.parquet` (historical temp, wind, humidity per game)
`data/processed/features/`	`features_YYYY.parquet` (136-feature matrix per season), `features_spring_YYYY.parquet` (spring training)
`data/models/quick/`	Quick-trained (bootstrap) model artifacts (`v4q`)
`data/models/full/`	Full-pipeline model artifacts (`v4`)
`data/models/archive/`	Archived model artifacts (timestamped, for drift analysis)
`data/processed/predictions/`	Immutable prediction snapshots (Parquet, by season)
`data/processed/drift/`	Drift monitoring logs (`run_metrics_YYYY.parquet`, global)
`logs/server.log`	Web server stdout/stderr
`logs/cron.log`	Daily cron job output
`server.pid`	PID of the running server process

Querying predictions (Python)

from pathlib import Path

import pandas as pd

from mlb_predict.model.artifacts import latest_artifact, load_model
from mlb_predict.model.train import _predict_proba

# Load every feature file; the glob also matches features_spring_*.parquet
feature_dir = Path("data/processed/features")
frames = [pd.read_parquet(f) for f in sorted(feature_dir.glob("features_*.parquet"))]
df = pd.concat(frames, ignore_index=True)

# Backward compat: older feature files lack is_spring, so treat missing as 0.0
if "is_spring" not in df.columns:
    df["is_spring"] = 0.0
df["is_spring"] = df["is_spring"].fillna(0.0)

# Score every game with the latest v4 logistic model
model, meta = load_model(latest_artifact("logistic", version="v4"))
df["prob"] = _predict_proba(model, df[meta.feature_cols].fillna(0.5))

# 2024 games with high home-team probability
df24 = df[df["season"] == 2024].sort_values("prob", ascending=False)
print(df24[["date", "home_retro", "away_retro", "prob", "home_win"]].head(10))

# 2026 pre-season predictions
df26 = df[df["season"] == 2026].sort_values("date")
print(df26[["date", "home_retro", "away_retro", "prob"]].head(10))

# Accuracy by favourite probability bucket (historical seasons only)
dfh = df[df["home_win"].notna()].copy()  # copy so new columns are not written to a view
# The favourite's probability is prob for home favourites and 1 - prob for away favourites
dfh["fav_prob"] = dfh["prob"].where(dfh["prob"] >= 0.5, 1 - dfh["prob"]).clip(0.5, 0.99)
dfh["fav_won"] = ((dfh["prob"] >= 0.5) == (dfh["home_win"] == 1)).astype(float)
print(dfh.groupby(pd.cut(dfh["fav_prob"], 5), observed=True)["fav_won"].mean())
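The Brier score referenced throughout this README is the mean squared difference between predicted probability and outcome. A dependency-free sketch follows; the function name is hypothetical, and with the DataFrame above the same quantity could be computed from the `prob` and `home_win` columns.

```python
def brier_score(probs, outcomes) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# A coin-flip forecaster scores 0.25; lower is better.
print(brier_score([0.5, 0.5], [1, 0]))
print(brier_score([0.7, 0.4], [1, 0]))
```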

Web dashboard

Start the dashboard with python scripts/serve.py, then open http://localhost:30087.

Pages

URL	Description
`http://localhost:30087/`	All-seasons games browser (2000–2026)
`http://localhost:30087/season/2026`	2026 schedule, pre-season predictions, standings summary, and Elo power rankings
`http://localhost:30087/standings`	Predicted vs actual divisional standings, league leaders, team batting and pitching stats
`http://localhost:30087/game/{game_pk}`	Individual game detail with SHAP feature attribution and embedded EV calculator
`http://localhost:30087/leaders`	League leaders by stat category (AL and NL)
`http://localhost:30087/players`	Full player statistics browser with filtering
`http://localhost:30087/odds`	Live moneyline odds and EV opportunities (requires Odds API key)
`http://localhost:30087/tools/ev-calculator`	Expected value calculator for sports bets (American, decimal, fractional odds; edge, ROI, Kelly criterion)
`http://localhost:30087/wiki`	Technical wiki: models, data sources, features, training pipeline
`http://localhost:30087/dashboard`	Admin dashboard: update season, full reingest, retrain models, system status
`http://localhost:30087/sitemap`	Complete index of all pages and API endpoints
`http://localhost:30087/sitemap.xml`	XML sitemap for search engine crawlers

Features

Games browser — filter by season, home team, away team, or date; paginated; links to game detail
2026 season page — full 2,430-game schedule with pre-season win probabilities, countdown, favourite/toss-up badges, a sticky Elo power rankings sidebar, and a standings summary with predicted league leaders and divisional standings
Standings page — predicted vs actual divisional standings for all 6 divisions, predicted and actual league leaders (AL/NL), and tabbed team batting and pitching statistics fetched live from the MLB Stats API
Game detail — probability bars, SHAP factor attribution chart, key stats comparison
Biggest upsets — all-time or by season, filterable by home/away team and minimum favourite probability
CV accuracy chart — out-of-sample accuracy trend across all 6 model types
Models explained — collapsible cards describing each model with live Brier/Accuracy from CV data
Technical wiki — comprehensive documentation of all models, baseball statistics, data sources, feature engineering, training pipeline, calibration, evaluation metrics, prediction snapshots, drift monitoring, error handling, and system architecture
Admin dashboard — three pipeline controls: "Update Season" (non-destructive current-year refresh), "Full Reingest" (clears all processed data and re-ingests every season from scratch), and "Retrain Models" (archives existing models of the target tier and retrains; supports quick and full tiers). Destructive actions require confirmation. All pipelines run async with real-time log streaming, status badges, tiered model inventory, CV performance table, and data coverage stats. Pipelines auto-reload the server on completion.
EV Calculator — expected value calculator for sports bets. Supports American, decimal, and fractional odds with real-time computation of EV, implied probability, edge, ROI, break-even probability, and Kelly criterion with adjustable fraction slider. Available as a standalone page (/tools/ev-calculator) and as an embedded widget on each game detail page with auto-populated model probabilities and home/away team toggle
Sitemap — complete index of all pages and API endpoints with descriptions; also available as XML (/sitemap.xml) for search engine crawlers
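
The quantities reported by the EV calculator follow from standard betting arithmetic. The sketch below is a minimal illustration of that arithmetic for American odds only, not the dashboard's actual code; the function names and the returned dictionary keys are hypothetical. For a single price with no vig removal, implied probability and break-even probability are the same number.

```python
def american_to_decimal(odds: int) -> float:
    """Convert American moneyline odds to decimal odds (total return per unit staked)."""
    if odds == 0:
        raise ValueError("American odds cannot be zero")
    return 1 + odds / 100 if odds > 0 else 1 + 100 / abs(odds)


def ev_summary(model_prob: float, odds: int, kelly_fraction: float = 1.0) -> dict:
    """Summarise a bet from a model win probability and American odds.

    Hypothetical helper: names and keys are illustrative only.
    """
    if not 0.0 <= model_prob <= 1.0:
        raise ValueError("model_prob must be in [0, 1]")
    decimal = american_to_decimal(odds)
    implied = 1 / decimal            # also the break-even win probability
    ev = model_prob * decimal - 1    # expected profit per unit staked (the ROI)
    # Kelly stake as a fraction of bankroll: (b*p - q) / b with b = decimal - 1,
    # floored at zero (no bet when the edge is negative), then scaled.
    kelly = max(0.0, ev / (decimal - 1)) * kelly_fraction
    return {
        "decimal_odds": decimal,
        "implied_prob": implied,
        "edge": model_prob - implied,
        "ev_per_unit": ev,
        "kelly_stake": kelly,
    }
```

Calling `ev_summary(0.45, 150, kelly_fraction=0.5)` would evaluate a +150 underdog at a 45% model probability using half-Kelly staking.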

API endpoints

Endpoint	Description
`GET /api/version`	Application version and git commit hash
`GET /api/seasons`	List available seasons
`GET /api/teams`	List all teams (Retrosheet codes + names)
`GET /api/games?season=&home=&away=&date=`	Paginated game list with predictions
`GET /api/games/{game_pk}`	Full detail + SHAP attribution for one game
`GET /api/upsets?season=&home=&away=&min_prob=`	Biggest upsets, filterable by team
`GET /api/cv-summary`	Model CV results by season
`GET /api/standings?season=`	Predicted vs actual standings by division with league leaders
`GET /api/team-stats?season=`	Team batting and pitching statistics from MLB Stats API
`GET /api/admin/status`	Full system status (data, models, pipelines)
`POST /api/admin/ingest`	Full re-ingestion: clear all data + re-ingest all seasons (async)
`POST /api/admin/update`	Update current season only — non-destructive (async)
`POST /api/admin/retrain`	Archive models + retrain (async). Body: `{"training_tier": "quick"|"full"}`
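
As a rough illustration of calling these endpoints from Python, the sketch below builds a filtered `/api/games` URL and a retrain request using only the standard library. The helper names are hypothetical, the port is the one given in the dashboard section, and the response schema is not documented here, so the example decodes JSON without assuming its shape. Whether `home` and `away` expect Retrosheet codes is an assumption based on `/api/teams`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:30087"  # port taken from the dashboard section


def games_url(base: str = BASE_URL, **filters) -> str:
    """Build a /api/games URL, dropping filters that are unset (None)."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{base}/api/games" + (f"?{query}" if query else "")


def retrain_request(tier: str, base: str = BASE_URL) -> Request:
    """Build (but do not send) the POST request that triggers a retrain."""
    if tier not in ("quick", "full"):
        raise ValueError("training_tier must be 'quick' or 'full'")
    body = json.dumps({"training_tier": tier}).encode("utf-8")
    return Request(
        f"{base}/api/admin/retrain",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(url_or_request):
    """Send the request and decode the JSON response (needs a running server)."""
    with urlopen(url_or_request, timeout=30) as resp:
        return json.load(resp)
```

With the server running, `fetch_json(games_url(season=2024, home="NYA"))` would request one team's home games, and `fetch_json(retrain_request("quick"))` would start a quick-tier retrain. Note that the retrain endpoint archives existing models, so it is not something to call casually.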

Project structure

mlb-predict/
├── src/mlb_predict/
│   ├── mlbapi/          # MLB Stats API client (async, rate-limited)
│   │   ├── client.py
│   │   ├── schedule.py
│   │   ├── teams.py
│   │   ├── pitcher_stats.py
│   │   └── standings.py     # Live standings, team batting/pitching stats
│   ├── statcast/        # Statcast / FanGraphs advanced metrics
│   │   ├── fangraphs.py     # FanGraphs team-level stats (via pybaseball)
│   │   └── player_stats.py  # Statcast individual batter/pitcher stats + ID mapping
│   ├── external/        # External data sources
│   │   ├── vegas.py         # Money-line → implied probability conversion
│   │   └── weather.py       # Open-Meteo historical weather API client + cache
│   ├── features/        # Feature engineering pipeline
│   │   ├── elo.py           # Sequential Elo rating
│   │   ├── team_stats.py    # Multi-window (7/14/15/30/60), EWMA, home/away splits
│   │   ├── pitcher_stats.py # Gamelog-based pitcher ERA
│   │   ├── park_factors.py
│   │   ├── bullpen.py       # Bullpen usage and ERA proxy features
│   │   ├── lineup.py        # Lineup continuity features
│   │   └── builder.py       # Assembles 136-feature matrix (v4)
│   ├── model/           # Model training and evaluation
│   │   ├── train.py     # LR + LightGBM + XGBoost + CatBoost + MLP + stacked
│   │   ├── evaluate.py
│   │   └── artifacts.py # Save / load model artifacts
│   ├── predict/         # Prediction snapshots
│   │   └── snapshot.py
│   ├── drift/           # Drift monitoring
│   │   └── compute.py
│   ├── standings.py     # Division mappings, predicted standings, merge logic
│   ├── errors.py        # Structured error taxonomy (WinProbError hierarchy)
│   └── app/             # FastAPI web dashboard
│       ├── main.py          # Routes and API endpoints
│       ├── data_cache.py    # In-memory feature and model cache
│       ├── admin.py         # Background pipeline runner + system status
│       └── templates/
│           ├── index.html        # All-seasons games browser
│           ├── game.html         # Individual game detail + SHAP
│           ├── season_2026.html  # 2026 season schedule + predictions + standings summary
│           ├── standings.html    # Predicted vs actual standings + team stats
│           ├── wiki.html         # Technical wiki (models, data, training)
│           ├── dashboard.html    # Admin dashboard (update, reingest, retrain, status)
│           └── sitemap.html      # Complete page and API endpoint index
├── scripts/
│   ├── ingest_schedule.py              # MLB Stats API schedule ingestion
│   ├── ingest_retrosheet_gamelogs.py   # Retrosheet game log ingestion
│   ├── build_crosswalk.py              # Retrosheet ↔ MLB ID crosswalk
│   ├── ingest_pitcher_stats.py         # MLB Stats API individual pitcher stats
│   ├── ingest_fangraphs.py             # FanGraphs team advanced metrics
│   ├── ingest_vegas.py                 # Vegas money-line odds → implied probabilities
│   ├── ingest_weather.py               # Open-Meteo historical weather backfill
│   ├── ingest_all.py                   # Orchestrate all ingestion steps
│   ├── build_features.py               # Build 136-feature matrices (historical)
│   ├── build_spring_features.py        # Build spring training feature matrices
│   ├── build_features_2026.py          # Build 2026 pre-season feature matrix
│   ├── train_model.py                  # Optuna HPO + expanding-window CV + 6 production models
│   ├── feature_importance.py           # SHAP-based feature importance analysis
│   ├── run_predictions.py              # Snapshot predictions
│   ├── compute_drift.py                # Drift monitoring
│   ├── query_game.py                   # Human-centric CLI query tool
│   ├── serve.py                        # Launch FastAPI dashboard
│   └── update_daily.sh                 # Daily cron: refresh data + restart server (host)
├── docker/
│   ├── entrypoint.sh                   # Container startup: bootstrap check + supervisord
│   ├── supervisord.conf                # Process manager config (server + cron)
│   ├── crontab                         # supercronic schedule (1am ingest, 11pm retrain)
│   ├── ingest_daily.sh                 # Daily 1am data refresh
│   └── retrain_daily.sh                # Daily 11pm model retrain (all 6 models)
├── Dockerfile                          # Multi-stage image (base → test → production)
├── docker-compose.yml                  # Compose config (volumes, ports, env vars)
├── .dockerignore                       # Excludes data/, .git/, .venv/, caches from build context
├── data/
│   ├── raw/
│   ├── processed/
│   │   ├── schedule/
│   │   ├── retrosheet/
│   │   ├── crosswalk/
│   │   ├── pitcher_stats/
│   │   ├── fangraphs/
│   │   ├── statcast_player/    # Statcast batter/pitcher individual stats
│   │   ├── vegas/              # Implied probabilities from money lines
│   │   ├── weather/            # Open-Meteo historical weather cache
│   │   ├── features/        # features_YYYY.parquet, features_spring_YYYY.parquet
│   │   ├── predictions/
│   │   └── drift/
│   └── models/
│       ├── quick/          # Bootstrap models (v4q)
│       ├── full/           # Production models (v4)
│       └── archive/        # Archived models (timestamped)
├── logs/
│   ├── server.log          # Web server output
│   ├── cron.log            # Daily cron output (host-based cron)
│   ├── ingest_daily.log    # Docker daily ingest output
│   ├── retrain_daily.log   # Docker daily retrain output
│   ├── bootstrap.log       # Docker first-run bootstrap output
│   └── supervisord.log     # Docker process manager output
├── server.pid       # PID of the running server (host-based only)
├── pyproject.toml
└── README.md

Changelog

Documentation — March 2026

mlb-predict-pipeline.Rmd: Synced team_stats rolling-window description with implementation (7/14/15/30/60); documented deferred model load + initializing.html, nested MCP lifespan, game_detail_cache / response_cache / timing.py, admin odds/betting endpoints and JSON bodies for ingest/update/retrain, bootstrap-progress, and git/CHANGELOG.txt behavior in data_cache.
README: Clarified that the server accepts HTTP connections immediately while models load in the background.

v4.2 — PyTorch Source Build for NAS Hardware

PyTorch source build: New TORCH_SOURCE=1 Docker build arg compiles PyTorch from source targeting SSE4.2, enabling Stage 1 player embeddings on CPUs without AVX2 (e.g. Intel Celeron J4125 in TrueNAS).
Build flags: Disables MKL-DNN, FBGEMM, and all CUDA/distributed features; uses Eigen BLAS and -march=nehalem for maximum x86-64 compatibility.
CI integration: Non-PR builds automatically use TORCH_SOURCE=1; the pytorch-builder layer is stored in the GitHub Actions (GHA) cache, so only the first build is slow.
Full-tier Stage 1 re-enabled: Dashboard full-tier retrains now include Stage 1 player embeddings (previously skipped due to SIGILL on non-AVX CPUs). Quick-tier continues to skip Stage 1.
Configurable build args: PYTORCH_VERSION (default 2.6.0) and PYTORCH_BUILD_JOBS (default 2) allow version and parallelism control.

v4.1 — Tiered Training & Smart Startup

4-way startup logic: Application startup now independently checks for processed data and trained models, performing only the minimal action needed (normal startup, quick-train only, ingest only, or full bootstrap).
Training tiers: Two distinct tiers — quick (bootstrap, v4q) and full (production pipeline, v4) — each with separate storage directories (data/models/quick/, data/models/full/).
Model archiving: Old models are moved to data/models/archive/ with timestamps instead of being deleted, enabling drift analysis across training runs.
Tier-aware API: POST /api/admin/retrain accepts training_tier parameter ("quick" or "full") to control which tier is retrained.
Model preference: Full-tier models are preferred over quick, with legacy (pre-tier) as final fallback; users can switch between tiers on the dashboard.
CLI --tier flag: scripts/train_model.py --tier quick|full controls training scope.
65 new tests: Comprehensive test coverage for tier storage, archiving, 4-way startup logic, admin tier functions, and retrain API tier support.
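
The 4-way startup check and the tier preference order can be pictured as two small decision functions. This is a sketch under assumptions, not the application's code: the names are hypothetical, and the pairing of each (data, models) state with an action is inferred from the four actions listed above.

```python
from enum import Enum
from typing import Iterable, Optional


class StartupAction(Enum):
    NORMAL = "normal startup"
    QUICK_TRAIN = "quick-train only"
    INGEST = "ingest only"
    FULL_BOOTSTRAP = "full bootstrap"


def startup_action(has_data: bool, has_models: bool) -> StartupAction:
    """Pick the minimal startup action from two independent checks.

    The mapping is an assumption inferred from the changelog wording.
    """
    if has_data and has_models:
        return StartupAction.NORMAL
    if has_data:
        return StartupAction.QUICK_TRAIN   # data present, models missing
    if has_models:
        return StartupAction.INGEST        # models present, data missing
    return StartupAction.FULL_BOOTSTRAP    # nothing present yet


def preferred_tier(available: Iterable[str]) -> Optional[str]:
    """Return the best available tier: full, then quick, then legacy."""
    found = set(available)
    for tier in ("full", "quick", "legacy"):
        if tier in found:
            return tier
    return None
```

Keeping the two checks independent means a container that already has processed data but no models does not need to re-ingest anything before training.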

v4 — Two-Stage Player Model

Stage 1 player embedding model: PyTorch neural model with learned player ID embeddings, per-player EWMA rolling stats, and biographical features. Produces 17 game-level player features.
Feature schema bump: 119 → 136 features (119 team-level + 17 Stage 1 player features).
Per-pitcher game logs: high-fidelity pitcher stats from MLB Stats API game log endpoint.
Expanded player data pipeline: FanGraphs player-level stats, expanded Statcast batter/pitcher stats, player biographical data.
MCP tools wired: find_ev_bets and get_team_stats tools now return live data instead of placeholder messages.
BettingService gRPC removed: unused proto and generated stubs cleaned up.

v3.1 — Code Review Fixes

Fixes from the comprehensive code review (22 new tests added, 228 total passing):

Critical

team_stats.py: away_win_pct_away_only now correctly computes away-only rolling stats (was using home-only stats for the away side)
lineup.py: Lineup continuity now tracks across home/away venue transitions (was fragmenting tracking by venue)
train.py: Stacked meta-learner uses disjoint calibration split — first half calibrates base models, second half trains the meta-learner (fixes data leakage)
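
The disjoint split behind the stacked meta-learner fix can be illustrated with a few lines of Python. This is a simplified sketch, not the code in train.py: the function name is hypothetical, and splitting by row position at the midpoint is an assumption about how "first half" and "second half" are defined.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_for_stacking(held_out: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Split held-out rows into two disjoint halves.

    The first half is for calibrating the base models; the second half is
    for training the meta-learner on the calibrated base-model outputs.
    Because no row appears in both halves, the meta-learner never sees
    rows that the calibrators were fitted on.
    """
    if len(held_out) < 2:
        raise ValueError("need at least two rows to form two halves")
    mid = len(held_out) // 2
    return list(held_out[:mid]), list(held_out[mid:])


# Example with row indices standing in for real feature rows
calibration_rows, meta_rows = split_for_stacking(range(10))
```

If the meta-learner were trained on the same rows used for calibration, its inputs would already be tuned to those rows' labels, which is the leakage the fix removes.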

High

train.py: Production model now computes real eval_brier instead of hardcoded 0.0; metadata field renamed from train_brier to eval_brier (legacy auto-migration on load)
compute.py: Global drift dedup now includes model_version; baseline metrics are now persisted; empty diffs produce zero metrics instead of NaN
client.py: Non-retryable 4xx errors (400, 401, 403) raise immediately instead of retrying; TokenBucket rejects requests exceeding capacity
bullpen.py: Bullpen fatigue now tracks across home and away games (was fragmenting by venue); ER fill value corrected to game-level scale (2.5, not ERA-scale 4.5)
weather.py: Game-hour estimation now uses longitude-based timezone offset instead of hardcoded UTC hour 19
data_cache.py: Added threading lock for thread-safe cache reads/writes during hot reloads
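
For the weather.py change, a longitude-based offset is usually approximated as one hour per 15 degrees of longitude. The sketch below shows that approximation only; it is not the project's implementation, the function names are hypothetical, and the default 19:00 local first pitch is an assumption made for illustration. The approximation ignores daylight saving time and political time-zone boundaries.

```python
def utc_offset_from_longitude(longitude_deg: float) -> int:
    """Approximate the UTC offset in whole hours (15 degrees per hour).

    Western longitudes are negative, so they yield negative offsets.
    """
    if not -180.0 <= longitude_deg <= 180.0:
        raise ValueError("longitude must be between -180 and 180 degrees")
    return round(longitude_deg / 15)


def estimated_game_hour_utc(longitude_deg: float, local_start_hour: int = 19) -> int:
    """Convert an assumed local start hour to a UTC hour in the range 0-23."""
    offset = utc_offset_from_longitude(longitude_deg)
    return (local_start_hour - offset) % 24
```

A ballpark near longitude -74 gets an offset of -5 hours, while one near -118 gets -8 hours, so the same local start time maps to different UTC hours when looking up hourly weather.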

Medium

standings.py: Null rank values no longer crash int() conversion
teams.py: NaN abbreviations are filtered out; empty DataFrames handled gracefully
builder.py: Feature hash handles inf/-inf values; crosswalk join drop count is logged
snapshot.py: Collision detection prevents overwriting immutable snapshots; deduplicates hashing with util/hashing.py
errors.py: APIError now defined and exported; MLBAPIError inherits from it
artifacts.py: Legacy train_brier metadata migrated to eval_brier on load
run_predictions.py: Uses model's trained feature_cols from metadata (not the global FEATURE_COLS); adds catboost and mlp to --model-type choices

Low

hashing.py: Uses POSIX paths for cross-platform determinism; added docstrings
ingest_all.py: Replaced deprecated datetime.utcnow() and get_event_loop(); uses sys.executable for subprocess calls
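
The replacements named in the ingest_all.py entry commonly look like the following. This is a generic sketch of the standard-library idioms, not a copy of the script; the helper names are hypothetical, and the changelog does not say which exact replacements were chosen.

```python
import asyncio
import subprocess
import sys
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware alternative to the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


async def _ingest_step() -> str:
    """Stand-in for an async ingestion step."""
    await asyncio.sleep(0)
    return "done"


def run_step() -> str:
    """asyncio.run() creates and closes the event loop itself,
    so no call to asyncio.get_event_loop() is needed."""
    return asyncio.run(_ingest_step())


def run_python(code: str) -> str:
    """Run code in a child process with the same interpreter as the parent.

    Using sys.executable avoids picking up a different `python` from PATH,
    for example one outside the active virtual environment.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```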

Attribution

Game log data from Retrosheet (retrosheet.org).

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.

Advanced metrics from FanGraphs (fangraphs.com) via the pybaseball library. Statcast individual player data from Baseball Savant (baseballsavant.mlb.com) via pybaseball. Schedule and player data from the MLB Stats API (statsapi.mlb.com). Historical weather data from the Open-Meteo API (open-meteo.com). Player ID mapping via the Chadwick Baseball Bureau register.
