An open, free, self-hosting fare tracker for New Zealand ↔ Sri Lanka & India — the Christchurch ↔ Colombo and Auckland ↔ Colombo / Delhi / Mumbai corridors out of the box (and any route you add). It scrapes the full fare list several times a day, builds an intraday price history, forecasts where the fare is heading with a calibrated model, and shows a buy / wait decision with a confidence rate — plus deal scores, a predicted booking curve, a cheapest-day heatmap and free push alerts — all running for $0 on GitHub's free tier.
- Data: Google Flights, scraped with real headless browsers (Playwright) — no API key, no token, no paid proxy. Shared Chromium and Firefox browsers run every itinerary concurrently (a lightweight context + rotating fingerprint/engine each — a genuine Chrome/Firefox mix, not just spoofed UA strings), harvesting every offer per scan to build a large open dataset
- Coverage: a rolling ~3-month grid — every day a departure × every trip length 20–40 days (~1,900 itineraries, auto-generated relative to today) — sharded across the day's scan slots so the whole grid is covered daily without any single job blowing the CI limit, on top of a dense fixed set for deep intraday curves
- Cadence: multiple scans/day; rows carry an hourly
scan_slotso intraday moves accumulate instead of overwriting - Intelligence (100% offline): quantile gradient-boosting + split-conformal bands, seasonality features, a walk-forward backtest, deal scores, anomaly detection, best-time-to-book and templated natural-language summaries — no LLM, no API cost
- Alerts: optional free Telegram / email pushes for new lows, price drops and BUY calls
- Automation: GitHub Actions (free & unlimited for public repos) · Hosting: GitHub Pages (free)
- Dataset: every observation is committed as plain CSV in
data/— open for anyone
Fares are informational. Always confirm the live price before booking.
Experimental — Singapore Airlines' own site. Besides Google Flights, an opt-in source (
flightwatch/provider_sq.py) scrapes Singapore Airlines' public Best Fare Finder for SQ's own end-to-end fares — the genuine "SQ flying the whole trip" price. It is off by default (the scheduled scan only uses Google Flights, which stays rock-solid) because singaporeair.com sits behind heavy bot protection and needs selector tuning against a live page. Try it withpython -m flightwatch sq-diag CHC CMB 2026-09-05 2026-09-26, which prints what it found and saves a screenshot + HTML todebug/.
Singapore Airlines on the dashboard. The command-center spotlights a configured carrier (
featured_airline, default Singapore Airlines). Because SQ often flies these corridors as a connection (from Christchurch it's CHC→SIN→CMB with Air New Zealand), the dashboard distinguishes SQ operating one leg from SQ flying the whole trip on its own metal (e.g. AKL→SIN→CMB all on SQ) and shows the cheapest whole-trip SQ fare separately. The trip finder lets you filter for any single carrier — even one that only flies a leg of a connection.
Why scrape Google Flights? FlightWatch first used the Amadeus Self-Service API (dropped its free tier), then the Travelpayouts Data API (free, but a cache of Aviasales searches that simply has no data for a thin route like CHC↔CMB — it returned zero fares). Google Flights actually has the fares, and driving a real browser needs no key and stays free — so the tracker can be fully autonomous.
Heads up: scraping is inherently more fragile than an official API. Google can change its markup or rate-limit a runner. FlightWatch is built to degrade gracefully (it records
no_resultsinstead of crashing and saves debug screenshots), and the scraper is isolated in one file so it's easy to repair.
config.yaml ──► flightwatch/collect.py ──► data/flights_YYYY_MM.csv (append-only, intraday slots)
(async, shared browser) │
flightwatch/predict.py ───► model + backtest ┐
flightwatch/analytics.py ─► offline AI layer ┤
flightwatch/dashboard.py ─► docs/index.html │ (GitHub Pages)
flightwatch/alerts.py ────► Telegram / email ┘ (optional)
A GitHub Actions cron runs the collector several times a day — and again on every
merge to main, so the published dashboard refreshes the moment a change lands.
Shared Chromium + Firefox browsers scrape every itinerary concurrently (each context
using a different fingerprint/engine to dodge soft-blocks), appends every fare
found as its own CSV row stamped with an
hourly scan_slot, retrains the forecast model, regenerates the dashboard, sends any
fresh alerts, and commits everything back to the repo. Over weeks, the per-itinerary
history becomes an intraday booking curve — which is what makes a real "should I book
now?" prediction (and an honest backtest of it) possible.
flightwatch/ # the Python package (importable, no path hacks)
__init__.py # shared paths (repo root, data/, docs/, config.yaml)
provider.py # fare source — Google Flights scraper, sync + async (swap providers here)
collect.py # async shared-browser scan -> appends to data/
storage.py # append-only monthly CSVs, intraday scan_slot de-duplication
predict.py # forecast + buy/wait/watch with conformal bands, seasonality, backtest
analytics.py # offline AI: deal scores, anomalies, heatmap, digest, narratives
dashboard.py # renders docs/index.html
alerts.py # optional free Telegram / email pushes
__main__.py # CLI: python -m flightwatch [collect|build|all|alert|backtest|diag]
config.yaml # what to track + concurrency
data/ # the open dataset (monthly CSVs)
docs/ # the published dashboard (GitHub Pages)
.github/workflows/ # the scan cron
tests/ # pytest suite (storage, model, analytics, alerts)
Everything runs through the package, so the provider lives in exactly one file
(flightwatch/provider.py). To use a different fare source later, keep its
search_flight_offers / search_flight_offers_async shape and the rest is unchanged.
The scraper needs no token or secret, so setup is just GitHub settings.
- Make a public GitHub repository (public = free Actions + Pages).
- Upload these files (or
git pushthem).
Repo → Settings → Actions → General → Workflow permissions → select Read and write permissions → Save. (The workflow commits the daily data back.)
Repo → Settings → Pages → Source: Deploy from a branch → Branch: main,
Folder: /docs → Save. Your dashboard will be at
https://<your-username>.github.io/<repo-name>/.
Repo → Actions tab → scan → Run workflow. The workflow installs a
headless Chromium, scrapes each route, and commits the first data point to data/.
From then on it runs automatically several times a day. If a run scrapes 0 fares,
download the scrape-debug artifact from that run to see exactly what Google returned.
pip install -r requirements.txt
python -m playwright install --with-deps chromium firefox # one-time browser download (both engines)
python -m flightwatch collect # one scrape -> appends to data/
python -m flightwatch build # regenerate docs/index.html
python -m flightwatch all # both at once
python -m flightwatch alert # push fresh signals (dry-prints if no secrets set)
python -m flightwatch backtest # how the engine's past calls actually fared
python -m flightwatch diag # scrape one route verbosely + save debug screenshot
python -m flightwatch sq-diag # EXPERIMENTAL: scrape Singapore Airlines' own site
open docs/index.htmlRun the tests with pytest (they use synthetic data — no network or browser needed).
Set FLIGHTWATCH_HEADFUL=1 (the CI default, via xvfb-run) to drive non-headless
browsers, which Google is less likely to soft-block than pure headless.
Edit config.yaml. The fixed itineraries: list is the dense core — daily
depth on a few date-pairs that builds tight intraday curves:
itineraries:
- {origin: CHC, destination: CMB, depart_date: 2026-09-01, return_date: 2026-09-22}
- {origin: AKL, destination: CMB, depart_date: 2026-09-01, return_date: 2026-09-22}For breadth, the auto_generate: block sweeps a rolling window of departures
across the next ~3 months for each route — every day a departure, paired with
every trip length in a range (e.g. 20–40 days). It's recomputed from today
every run, so it always covers the next 3 months. Each route can override any
grid setting (depart_step_days, horizon_days, min_days_out,
trip_length_*), so a dense core corridor and a lighter new route can share one
grid without the additions blowing the CI budget:
auto_generate:
enabled: true
horizon_days: 90 # ~3 months out (block default)
min_days_out: 1
depart_step_days: 1 # every day is a departure date (block default)
trip_length_min: 20 # trip duration 20..40 days inclusive
trip_length_max: 40
trip_length_step: 1
routes:
- {origin: CHC, destination: CMB} # dense: every day
- {origin: AKL, destination: CMB, depart_step_days: 4} # lighter sweep
- {origin: AKL, destination: DEL, depart_step_days: 14}
- {origin: AKL, destination: BOM, depart_step_days: 14}
shard_across_slots: true # see below
shards: 4That full grid is ~1,900 itineraries — far too many for one run. So the
generated grid is sharded across the day's scan slots: each 6-hourly run
scrapes one shard (~1/4, a few hundred itineraries, ~1–1.5 h), and across the day
every date/length is covered exactly once. Sharding is stable per itinerary, so
the same date/length lands in the same daily slot each day (a clean per-slot
history). The fixed itineraries: are never sharded — they run every scan for
tight intraday curves. Set shards to the number of daily cron slots. Prefer a
denser-but-smaller grid? Just raise depart_step_days / trip_length_step or drop
shard_across_slots.
Tune concurrency (how many bots scrape at once) / jitter_seconds here too. Change the scan frequency in
.github/workflows/daily-scan.yml (the cron, in UTC — currently every 6 hours).
Scheduled runs only fire on the default branch, so a new cadence takes effect once
it's merged there.
flightwatch/predict.py is the decision engine. While history is thin it uses a
transparent heuristic (today's price vs the route's own min/median and days left)
and honestly reports low confidence. Once there are enough cheapest-per-day
observations (~120) it trains gradient-boosting models over the booking curve with:
- Honest validation — error is time-series cross-validated (
TimeSeriesSplit, forward-chaining), never a shuffled split that would leak the future and overstate skill. - Calibrated bands — a split-conformal interval whose width is learned from real held-out residuals, so an "80% band" actually covers ~80% of outcomes.
- Seasonality features — cyclical day-of-week / month / week-of-year encodings, peak-season flags, lead-time buckets and a per-route price level.
Each recommendation exposes its full reasoning — signal (BUY / WAIT / WATCH), confidence (0–100%), the whole predicted forward curve with bands, the expected forecast low, expected savings from waiting and the drop probability.
Offline AI layer (flightwatch/analytics.py, no LLM or API) adds, on the dashboard:
- Deal score (0–100) for each route's current cheapest fare
- Predicted booking curve fan chart + best time to book
- Cheapest day to fly heatmap (which departure/return weekday is cheapest)
- What changed since the last scan, and anomaly drop/spike detection
- Model accuracy — a walk-forward backtest of whether past BUY/WAIT calls paid off
- A per-route natural-language summary generated purely from the numbers
…alongside the existing market insights and latest-offers tables.
flightwatch/alerts.py turns the freshest signals into Telegram / email notifications,
de-duplicated via data/alert_state.json. It is a no-op until you add secrets, so
the project stays free and runnable by anyone:
- Telegram — message
@BotFatherto create a bot, then set repo Settings → Secrets and variables → Actions secretsTELEGRAM_BOT_TOKENandTELEGRAM_CHAT_ID. - Email (optional) — set
SMTP_HOST,SMTP_USER,SMTP_PASS,ALERT_EMAIL_TO(e.g. a Gmail address + app password).
Alerts fire on a fresh all-time low, an unusual price drop, a new BUY call, or a "book now — window closing" nudge.
Scraping Google Flights is the trade-off for being keyless and free. If a run
records no_results, it's almost always one of:
- Google soft-blocked the runner — datacenter IPs are fingerprinted. The
workflow already runs a non-headless Chromium under
xvfb(viaFLIGHTWATCH_HEADFUL=1) to look more like a real browser, which is usually enough for a once-a-day, few-route scrape. A later daily run typically succeeds. - Google changed its markup — the extractor keys off the stable
"… round trip total"ARIA label rather than CSS class names, but if Google reshapes the page the selectors inprovider.pymay need a tweak.
To see exactly what Google returned, scrape one route locally and inspect the saved screenshot + HTML:
python -m flightwatch diag
ls debug/ # diag-CHC-CMB.png / .htmlEvery failed CI scrape also uploads those files as the scrape-debug artifact on the run, so you can debug a remote run without reproducing it locally.
If Google blocking ever becomes persistent on GitHub's shared IPs, the most reliable free fix is to run the same
python -m flightwatch allon any small always-on machine (a home server / Raspberry Pi) on a cron and push the commits — the code is identical; only the runner's IP changes.
| Piece | Service | Cost |
|---|---|---|
| Automation (several runs/day) | GitHub Actions (public repo, standard runner) | Free, unlimited |
| Dashboard hosting | GitHub Pages | Free |
| Data storage | CSV in the repo | Free |
| Fare data | Google Flights scrape (headless Chromium + Firefox) | Free, no key |
| Alerts | Telegram Bot API / SMTP | Free |
| Forecasts & AI features | scikit-learn, offline (no LLM) | Free |
Each run takes a few Actions minutes (browser install + scraping the rolling ~3-month grid across Chromium + Firefox); several runs a day plus per-merge refreshes stay comfortably inside the free unlimited allowance for public repos.
Code: MIT. Data in data/: open — attribution appreciated (CC-BY).