Warning
Early development — not ready for general use yet. Tested only by the maintainer so far; expect rough edges.
Backups run every night. When was one last actually restored?
Redrill is a self-hosted daemon that proves backups are restorable by actually restoring them. On a schedule it pulls data out of the backups other tools already make (Borg and restic repos and Postgres dumps, for now), restores them into a throwaway sandbox, and for databases boots a real instance to check the loaded data. Each dataset gets a line like:
last proven restore: 3 days ago ✅
Every backup tool checks its own integrity, but none of them checks whether the data is actually usable. The real-world failures lie elsewhere:
- A `pg_dump` cron writes an empty file after a password rotation.
- A newly introduced exclude pattern drops directories from the archive.
- Expired API tokens result in empty dumps.
- A volume quietly unmounts.
- A version misconfiguration between `pg_dump` and Postgres.
- etc.
Redrill is read-only and never modifies the backups it reads. Each drill is configurable: use the backup tool's own integrity check, or fully restore the data and run custom checks. Runs are kept with history, and failures raise an alert.
| Layer | What it does | IO |
|---|---|---|
| L1: Integrity | Borg & restic: native `borg check`/`restic check`, snapshot freshness, size-anomaly detection.<br>pg_dump: minimum file size, `gzip -t`/`zstd -t` compression test, file freshness (mtime). | Low |
| L2: Restorability | Borg & restic: restores a sample of files into scratch, asserts `path_exists`, newest-file freshness, file-count tolerance vs. the last good run.<br>pg_dump: copies the dump into scratch, but for a single dump L1+L3 do the real work. | Moderate (scales with sample size) |
| L3: Usability | Borg & restic: extracts the dump at `extract_path` from the archive, then boots a sandbox for it the same way as for pg_dump.<br>pg_dump: boots an ephemeral, network-isolated Postgres, loads the dump, runs the configured `sql` assertions (`select count(*) from users` → `> 0`, `age < 8d`, …). | High (full restore + DB boot) |
Layers always run in order, so if L3 is selected in the config, a failing L2 stops the drill before L3 starts.
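The ordering rule can be sketched in a few lines of Python. This is a hypothetical illustration, not Redrill's actual code: `run_layers` and the `"not run"` label are invented for the example, and each layer is reduced to a callable that returns `True` or `False`.

```python
from typing import Callable, Dict


def run_layers(layers: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run the configured layers in order and stop at the first failure.

    Hypothetical sketch: layers that come after a failing one are
    recorded as "not run" and their callables are never invoked.
    """
    results: Dict[str, str] = {}
    failed = False
    for name in sorted(layers):  # "l1" < "l2" < "l3"
        if failed:
            results[name] = "not run"
            continue
        ok = layers[name]()
        results[name] = "pass" if ok else "fail"
        failed = not ok
    return results
```

Calling `run_layers({"l1": lambda: True, "l2": lambda: False, "l3": lambda: True})` returns `{"l1": "pass", "l2": "fail", "l3": "not run"}`: the expensive L3 restore is skipped once L2 has already shown a problem.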
- `fail`: a check returned false. The backup is the problem, and data is at risk.
- `error`: the check couldn't be completed (repo unreachable, scratch full, no container runtime), reported with the reason. Redrill is the problem, not the backup, and never a silent pass.
- `stale`: a dataset hasn't been proven within its `max_proof_age`, for any reason, including the daemon having been down.
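The `stale` rule is simple enough to sketch. The snippet below is an assumption-laden illustration rather than Redrill's implementation: `parse_age` and `is_stale` are hypothetical names, and only the `h` and `d` suffixes that appear in this README are handled.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: only the "h" and "d" suffixes seen in this README are supported.
_UNITS = {"h": timedelta(hours=1), "d": timedelta(days=1)}


def parse_age(text: str) -> timedelta:
    """Turn a duration such as "36h" or "10d" into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(value) * _UNITS[unit]


def is_stale(last_proven_at: Optional[datetime], max_proof_age: str, now: datetime) -> bool:
    """A dataset is stale if it was never proven or its newest proof is too old."""
    if last_proven_at is None:
        return True
    return now - last_proven_at > parse_age(max_proof_age)
```

Because the comparison only looks at the newest proof and the clock, a drill becomes stale no matter why proofs stopped arriving, which matches the "for any reason" wording above.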
L3 boots database sandboxes, so it needs a container runtime (Docker or podman). Without one, L1/L2 still run and L3 reports skipped rather than passing.
```sh
git clone https://github.com/redrillhq/redrill
cd redrill/deploy/compose

# 1. Point the config at the backups and tune the checks.
$EDITOR config.example.yaml

# 2. In compose.yaml, mount the backup dir read-only and (for L3) the docker socket.
$EDITOR compose.yaml

# 3. Go.
docker compose up -d
docker compose logs -f redrill
```

Prefer not to use Docker? Build the single static binary with `go build ./cmd/redrill` (Go 1.26). It needs the `borg` or `restic` binary on the host for those sources and a container runtime for L3. Run `redrill doctor` to see exactly what's missing.
Auditing a directory of pg_dump files:
```yaml
version: 1
data_dir: /var/lib/redrill
scratch: { dir: /var/lib/redrill/scratch, max_bytes: 40GiB }

notify:
  urls: ["ntfy://ntfy.example.com/redrill"]   # any shoutrrr URL: ntfy/telegram/discord/email/webhook
  events: [fail, error, recover, stale]

sources:
  - name: pg-dumps
    type: dumpdir
    path: /backups/pg            # mount this read-only
    pattern: "*.sql.gz"
    pick: newest

drills:
  - name: app-db
    source: pg-dumps
    schedule: "Sun 05:00"          # cron or human shorthand ("Sun 05:00", "04:10")
    max_proof_age: 10d             # stale alert if no proof newer than this
    retention: { max_count: 50 }   # keep the newest 50 runs of history
    levels:
      l1: { file_min_bytes: 1MiB, compression_test: true, max_age: 36h }
      l3:
        sandbox: { image: postgres:16, network: none, memory: 1GiB }
        load: auto
        checks:
          - sql: { query: "select count(*) from users", expect: "> 0" }
          - sql: { query: "select max(created_at) from events", expect: "age < 8d" }
```

Auditing a Borg repo:
```yaml
sources:
  - name: nextcloud-borg
    type: borg
    repo: "ssh://[email protected]/./borg/nextcloud-aio"
    passphrase_file: /etc/redrill/secrets/borg-pass
    ssh_key_file: /etc/redrill/secrets/borg-readonly-key

drills:
  - name: nextcloud-files
    source: nextcloud-borg
    schedule: "Sun 04:10"
    max_proof_age: 10d
    levels:
      l1: { native_check: true, snapshot_max_age: 36h, size_anomaly_pct: 40 }
      l2:
        restore: { scope: sample, sample: { files: 200, newest: 50 }, include_paths: ["config/", "data/"] }
        checks:
          - path_exists: "config/config.php"
          - newest_file_max_age: 8d
          - file_count_tolerance_pct: 15
```

```sh
redrill validate        # strictly check the config (exit 3 on any problem)
redrill doctor          # preflight: engines, container runtime, scratch space, repo reachability
redrill run [NAME]      # run a drill now: a NAME, --all, or pick interactively (--level l1|l2|l3)
redrill status          # table: each drill's last run, proof age, next run, SLA state
redrill history NAME    # past runs with verdicts and durations (-n 20)
redrill serve           # the daemon: scheduler + notifications
redrill version
```
Every command takes `--json` and resolves its config from `-c`, else `$REDRILL_CONFIG`, else `/etc/redrill/config.yaml`. Exit codes are stable: 0 ok, 1 a drill failed, 2 infra error, 3 bad config. Run `redrill status` in a terminal for the whole picture:
```
DRILL            LAST RUN       PROVEN   NEXT RUN    SLA
app-db           pass 2h ago    2h ago   Sun 05:00   ok
nextcloud-files  fail 1d ago    6d ago   Sun 04:10   STALE

1 of 2 drills proven within SLA
```
(PROVEN shows the proof age of the drill's highest configured level.)
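For scripting around the CLI, the documented lookup order and exit codes can be mirrored in a small helper. This is a sketch under assumptions: `resolve_config` and `describe_exit` are hypothetical names for a wrapper script, not part of Redrill, and treating an empty `$REDRILL_CONFIG` as unset is a guess.

```python
from typing import Mapping, Optional

DEFAULT_CONFIG = "/etc/redrill/config.yaml"

# Exit codes as documented above.
EXIT_MEANINGS = {
    0: "ok",
    1: "a drill failed",
    2: "infra error",
    3: "bad config",
}


def resolve_config(cli_path: Optional[str], env: Mapping[str, str]) -> str:
    """Mirror the documented order: -c, then $REDRILL_CONFIG, then the default.

    Assumption: an empty environment variable is treated as unset.
    """
    if cli_path:
        return cli_path
    return env.get("REDRILL_CONFIG") or DEFAULT_CONFIG


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning listed in this README."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")
```

A cron wrapper could pass the return code of `redrill run --all` to `describe_exit` and page someone only on 1 (the backup is at risk), while routing 2 and 3 to whoever maintains the drill setup.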
- Sources: where backups live and how to read them. Today: `borg`, `restic`, and `dumpdir` (a directory of dump files).
- Drills: a scheduled audit of one source, with `schedule`, `max_proof_age` (the proof SLA), optional `jitter`/`timeout`/`retention`, and one or more `levels`.
- Checks: typed assertions producing evidence (expected vs. actual): `path_exists`, `file_count_tolerance_pct`, `newest_file_max_age`, `sql`, `sql_no_error`. The `sql` `expect` grammar covers `> N`, `>= N`, `== N`, `!= N`, `between A B`, `age < 8d`/`age > 8d`, `matches REGEX`, `nonempty`.
- Notifications: via shoutrrr: ntfy, Telegram, Discord, email, webhooks, and more. Messages lead with the diagnosis and the last-good date, not a stack trace.
- Retention: prune each drill's run history by `max_age` and/or `max_count`. The proof timeline (`last_proven_at`) is kept forever.
The full annotated schema lives in deploy/compose/config.example.yaml.
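To make the `expect` grammar concrete, here is a minimal evaluator sketch. It is not Redrill's parser: `check_expect` is a hypothetical function, `between` is assumed to be inclusive, `matches` is assumed to use search (not full-match) semantics, and durations only understand the `h` and `d` suffixes shown in this README.

```python
import operator
import re
from datetime import datetime, timedelta, timezone

_COMPARE = {">": operator.gt, ">=": operator.ge, "==": operator.eq, "!=": operator.ne}
_UNITS = {"h": timedelta(hours=1), "d": timedelta(days=1)}


def _duration(text: str) -> timedelta:
    """Parse "8d" or "36h" (assumed suffixes) into a timedelta."""
    return int(text[:-1]) * _UNITS[text[-1]]


def check_expect(expect: str, actual, now: datetime) -> bool:
    """Evaluate one `expect` string against the value a query returned."""
    op, _, arg = expect.strip().partition(" ")
    arg = arg.strip()
    if op == "nonempty":
        return actual is not None and str(actual) != ""
    if op == "matches":
        return re.search(arg, str(actual)) is not None
    if op == "between":  # assumed inclusive on both ends
        low, high = (float(x) for x in arg.split())
        return low <= float(actual) <= high
    if op == "age":  # `actual` is a timestamp; compare its age to the limit
        cmp, limit = arg.split()
        if cmp not in ("<", ">"):
            raise ValueError(f"unsupported age comparison: {cmp!r}")
        age = now - actual
        return age < _duration(limit) if cmp == "<" else age > _duration(limit)
    if op in _COMPARE:
        return _COMPARE[op](float(actual), float(arg))
    raise ValueError(f"unsupported expectation: {expect!r}")
```

With `now` fixed, `check_expect("> 0", 42, now)` is true and `check_expect("age < 8d", now - timedelta(days=9), now)` is false, which is the case the `events` check in the earlier config is meant to catch.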
- Read-only by construction: the drivers have no write, prune, or delete code paths.
- Secrets are referenced by `*_file`/`*_env` only and redacted from stored output.
- L3 sandboxes run with `network=none`, memory limits, and guaranteed cleanup.
- The Docker socket is needed only for L3; drop it to keep L1/L2.
A verifier that can't be trusted is worse than none. On every change the test suite runs, including real-engine drills in throwaway containers. Some of those tests feed Redrill deliberately broken backups that are byte-perfect but semantically dead (for example, an empty-but-valid gzip, a dump of the wrong database, or a stale snapshot), and the build fails unless it flags each one.
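The empty-but-valid gzip case is easy to reproduce. The sketch below is an illustration only, not code from Redrill's test suite: `gzip_is_intact` and `l1_dump_ok` are hypothetical names, and the size threshold is assumed to apply to the compressed file as stored on disk.

```python
import gzip
import io
import zlib


def gzip_is_intact(data: bytes) -> bool:
    """Roughly what `gzip -t` does: decompress everything and discard the output."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
            while fh.read(1 << 16):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def l1_dump_ok(data: bytes, file_min_bytes: int) -> bool:
    """Combine the size floor with the compression test.

    Assumption: the threshold is compared against the compressed size.
    """
    return len(data) >= file_min_bytes and gzip_is_intact(data)


# A dump that is a perfectly valid gzip stream containing zero bytes of SQL.
empty_dump = gzip.compress(b"")
```

Here `gzip_is_intact(empty_dump)` is true, so a compression test alone would wave the file through, while `l1_dump_ok(empty_dump, 1024 * 1024)` is false. That gap is why the integrity layer pairs the compression test with a minimum size, and why L3 still loads the data.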
Redrill is under active development. Bug reports and ideas are welcome through issues, especially which backup tools should be supported next.