Redrill

Warning

Early development — not ready for general use yet. Tested only by the maintainer so far; expect rough edges.

Backups run every night. When was one last actually restored?

Redrill is a self-hosted daemon that proves backups are restorable by actually restoring them. On a schedule it pulls data out of the backups other tools already make (Borg and restic repos and Postgres dumps, for now), restores them into a throwaway sandbox, and for databases boots a real instance to check the loaded data. Each dataset gets a line like:

last proven restore: 3 days ago ✅

The problem

Backup tools check their own integrity, but none of them checks whether the data is actually usable once restored. The failures seen in practice lie elsewhere:

A pg_dump cron writes an empty file after a password rotation.
A newly introduced exclude pattern drops directories from the archive.
Expired API tokens result in empty dumps.
A volume quietly unmounts.
A version mismatch between pg_dump and the Postgres server.
etc.

Redrill in essence

Redrill is read-only and never modifies the backups it reads. Each drill is configurable: use the backup tool's own integrity check, or fully restore the data and run custom checks. Runs are kept with history, and failures raise an alert.

Verification layers

Layer	What it does	IO
L1 — Integrity	Borg & restic: native `borg`/`restic check`, snapshot freshness, size-anomaly detection. pg_dump: minimum file size, `gzip -t`/`zstd -t` compression test, file freshness (mtime).	Low
L2 — Restorability	Borg & restic: restores a sample of files into scratch, asserts `path_exists`, newest-file freshness, file-count tolerance vs. the last good run. pg_dump: copies the dump into scratch, but for a single dump L1+L3 do the real work.	Moderate (scales with sample size)
L3 — Usability	pg_dump: boots an ephemeral, network-isolated Postgres, loads the dump, and runs the configured `sql` assertions (`select count(*) from users` → `> 0`, `select max(created_at) from events` → `age < 8d`, …). Borg & restic: extracts the dump at `extract_path` from the archive, then loads and checks it the same way.	High (full restore + DB boot)
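The L1 checks for a pg_dump file in the table above can be sketched in a few lines of Python. This is an illustration only, not Redrill's implementation: the function name `l1_check_dump` and its parameters are hypothetical, only gzip is covered (zstd would need a third-party library), and the limits are passed in as plain bytes and seconds.

```python
import gzip
import os
import time
import zlib


def l1_check_dump(path, min_bytes, max_age_seconds, now=None):
    """Return a list of failure reasons; an empty list means L1 passed."""
    now = time.time() if now is None else now
    failures = []
    st = os.stat(path)

    # Minimum file size: catches dumps truncated to (almost) nothing.
    if st.st_size < min_bytes:
        failures.append(f"file is {st.st_size} bytes, expected at least {min_bytes}")

    # Freshness: the dump's mtime must not be older than the allowed age.
    age = now - st.st_mtime
    if age > max_age_seconds:
        failures.append(f"file is {age:.0f}s old, limit is {max_age_seconds}s")

    # Compression test: stream the whole file and discard the output,
    # which is roughly what `gzip -t` does.
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        failures.append(f"gzip stream is corrupt: {exc}")

    return failures
```

A call such as `l1_check_dump("/backups/pg/app.sql.gz", 1024 * 1024, 36 * 3600)` would mirror `file_min_bytes: 1MiB` and `max_age: 36h` from the config example; the path here is made up.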

The configured layers always run in order, and a failing layer stops the drill: if L3 is selected in the config and L2 fails, L3 never runs.
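As a rough sketch of that ordering rule, the loop below runs whichever layers are configured in L1, L2, L3 order and stops executing after the first failure. The names `run_layers` and `LAYER_ORDER` and the "not run" label are assumptions made for illustration; Redrill's own reporting may differ.

```python
from typing import Callable, Dict, List, Tuple

LAYER_ORDER = ["l1", "l2", "l3"]


def run_layers(configured: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run the configured layers in order; stop executing after the first failure."""
    results = []
    failed = False
    for name in LAYER_ORDER:
        if name not in configured:
            continue  # layer not selected in the config
        if failed:
            # An earlier layer failed, so this one is recorded but never executed.
            results.append((name, "not run"))
            continue
        ok = configured[name]()
        results.append((name, "pass" if ok else "fail"))
        failed = not ok
    return results
```

With `l1` passing and `l2` failing, the `l3` callable is never invoked, which keeps the expensive full restore and database boot off the critical path of an already failed drill.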

Drill results

fail — a check returned false. The backup is the problem, and data is at risk.
error — the check couldn't be completed (repo unreachable, scratch full, no container runtime) and is reported with the reason. Redrill or its environment is the problem here, not the backup, and an error is never reported as a silent pass.
stale — a dataset hasn't been proven within its max_proof_age, for any reason, including the daemon having been down.
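The distinction between these states can be illustrated with a small sketch. It assumes a hypothetical `InfraError` exception for "the check could not be completed" and a `last_proven_at` timestamp per drill; neither the exception nor the function names come from Redrill's code.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Tuple


class InfraError(Exception):
    """Hypothetical: the check could not be completed (repo unreachable, scratch full, ...)."""


def run_check(check: Callable[[], bool]) -> Tuple[str, str]:
    """Map one check to a (verdict, reason) pair."""
    try:
        ok = check()
    except InfraError as exc:
        # The check never produced an answer, so this is neither pass nor fail.
        return ("error", str(exc))
    return ("pass", "") if ok else ("fail", "check returned false")


def is_stale(last_proven_at: Optional[datetime],
             max_proof_age: timedelta,
             now: datetime) -> bool:
    """A dataset is stale when no proof exists within max_proof_age, whatever the cause."""
    if last_proven_at is None:
        return True
    return now - last_proven_at > max_proof_age
```

Staleness is computed from the clock and the last proof alone, so it also fires when no drill ran at all, for example because the daemon was down.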

Quickstart

Installation

L3 boots database sandboxes, so it needs a container runtime (Docker or Podman). Without one, L1 and L2 still run, and L3 reports skipped rather than passing.

git clone https://github.com/redrillhq/redrill
cd redrill/deploy/compose

# 1. Point the config at the backups and tune the checks.
$EDITOR config.example.yaml

# 2. In compose.yaml, mount the backup dir read-only and (for L3) the docker socket.
$EDITOR compose.yaml

# 3. Go.
docker compose up -d
docker compose logs -f redrill

Prefer not to use Docker? Build the single static binary with go build ./cmd/redrill (Go 1.26). It needs the borg or restic binary on the host for those sources and a container runtime for L3. Run redrill doctor to see exactly what's missing.

Config example

Auditing a directory of pg_dump files:

version: 1
data_dir: /var/lib/redrill
scratch: { dir: /var/lib/redrill/scratch, max_bytes: 40GiB }

notify:
  urls: ["ntfy://ntfy.example.com/redrill"]   # any shoutrrr URL: ntfy/telegram/discord/email/webhook
  events: [fail, error, recover, stale]

sources:
  - name: pg-dumps
    type: dumpdir
    path: /backups/pg            # mount this read-only
    pattern: "*.sql.gz"
    pick: newest

drills:
  - name: app-db
    source: pg-dumps
    schedule: "Sun 05:00"        # cron or human shorthand ("Sun 05:00", "04:10")
    max_proof_age: 10d           # stale alert if no proof newer than this
    retention: { max_count: 50 } # keep the newest 50 runs of history
    levels:
      l1: { file_min_bytes: 1MiB, compression_test: true, max_age: 36h }
      l3:
        sandbox: { image: postgres:16, network: none, memory: 1GiB }
        load: auto
        checks:
          - sql: { query: "select count(*) from users", expect: "> 0" }
          - sql: { query: "select max(created_at) from events", expect: "age < 8d" }

Auditing a Borg repo:

sources:
  - name: nextcloud-borg
    type: borg
    repo: "ssh://[email protected]/./borg/nextcloud-aio"
    passphrase_file: /etc/redrill/secrets/borg-pass
    ssh_key_file: /etc/redrill/secrets/borg-readonly-key

drills:
  - name: nextcloud-files
    source: nextcloud-borg
    schedule: "Sun 04:10"
    max_proof_age: 10d
    levels:
      l1: { native_check: true, snapshot_max_age: 36h, size_anomaly_pct: 40 }
      l2:
        restore: { scope: sample, sample: { files: 200, newest: 50 }, include_paths: ["config/", "data/"] }
        checks:
          - path_exists: "config/config.php"
          - newest_file_max_age: 8d
          - file_count_tolerance_pct: 15
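Percentage settings such as `file_count_tolerance_pct: 15` and `size_anomaly_pct: 40` compare the current run against a baseline. A minimal sketch of that comparison follows, assuming the baseline is the last good run and that deviations in either direction count; both points are assumptions, and the function names are hypothetical.

```python
def percent_change(current: float, baseline: float) -> float:
    """Absolute deviation of current from baseline, in percent."""
    if baseline == 0:
        # No meaningful baseline: only an unchanged zero counts as no deviation.
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / baseline * 100.0


def within_tolerance(current: float, baseline: float, tolerance_pct: float) -> bool:
    """True when the deviation from the baseline stays inside the tolerance."""
    return percent_change(current, baseline) <= tolerance_pct
```

For example, a restore that yields 900 files against a last good count of 1000 deviates by 10% and stays inside a 15% tolerance, while 800 files (20%) does not.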

Available CLI commands

redrill validate          # strictly check the config (exit 3 on any problem)
redrill doctor            # preflight: engines, container runtime, scratch space, repo reachability
redrill run [NAME]        # run a drill now: a NAME, --all, or pick interactively  (--level l1|l2|l3)
redrill status            # table: each drill's last run, proof age, next run, SLA state
redrill history NAME      # past runs with verdicts and durations      (-n 20)
redrill serve             # the daemon: scheduler + notifications
redrill version

Every command takes --json and resolves its config from -c, else $REDRILL_CONFIG, else /etc/redrill/config.yaml. Exit codes are stable: 0 ok, 1 a drill failed, 2 infra error, 3 bad config. Run redrill status in a terminal for the whole picture:

DRILL             LAST RUN      PROVEN     NEXT RUN     SLA
app-db            pass 2h ago   2h ago     Sun 05:00    ok
nextcloud-files   fail 1d ago   12d ago    Sun 04:10    STALE

1 of 2 drills proven within SLA

(PROVEN shows the proof age of the drill's highest configured level.)
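For scripting around the CLI, the config lookup order and the exit codes described above can be mirrored in a small wrapper. This is a sketch under stated assumptions: the helper names are hypothetical, and treating an empty `REDRILL_CONFIG` as unset is a guess rather than documented behaviour.

```python
import os

# Exit codes as documented: 0 ok, 1 a drill failed, 2 infra error, 3 bad config.
EXIT_MEANINGS = {0: "ok", 1: "a drill failed", 2: "infra error", 3: "bad config"}
DEFAULT_CONFIG = "/etc/redrill/config.yaml"


def resolve_config(cli_value, environ=os.environ):
    """Mirror the lookup order: -c, else $REDRILL_CONFIG, else the default path."""
    if cli_value:
        return cli_value
    # Assumption: an empty REDRILL_CONFIG is treated the same as an unset one.
    return environ.get("REDRILL_CONFIG") or DEFAULT_CONFIG


def describe_exit(code):
    """Translate a redrill exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")
```

A cron wrapper could pass `subprocess.run(["redrill", "run", "--all"]).returncode` to `describe_exit` and page only on 1, while routing 2 and 3 to whoever maintains the Redrill host.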

Configuration glossary

Sources — where backups live and how to read them. Today: borg, restic, and dumpdir (a directory of dump files).
Drills — a scheduled audit of one source: schedule, max_proof_age (the proof SLA), optional jitter/timeout/retention, and one or more levels.
Checks — typed assertions producing evidence (expected vs. actual): path_exists, file_count_tolerance_pct, newest_file_max_age, sql, sql_no_error. The sql expect grammar covers > N, >= N, == N, != N, between A B, age < 8d / age > 8d, matches REGEX, nonempty.
Notifications — via shoutrrr: ntfy, Telegram, Discord, email, webhooks, and more. Messages lead with the diagnosis and the last-good date, not a stack trace.
Retention — prune each drill's run history by max_age and/or max_count. The proof timeline (last_proven_at) is kept forever.
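The retention rule can be pictured as two filters applied to a drill's run history. The sketch below assumes that a run is pruned as soon as it violates either limit when both are set; that reading of "and/or" is an assumption, and `prune` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def prune(run_times: List[datetime],
          now: datetime,
          max_age: Optional[timedelta] = None,
          max_count: Optional[int] = None) -> List[datetime]:
    """Return the run timestamps to keep, newest first."""
    kept = sorted(run_times, reverse=True)
    if max_age is not None:
        # Drop runs older than max_age.
        kept = [t for t in kept if now - t <= max_age]
    if max_count is not None:
        # Of what remains, keep only the newest max_count runs.
        kept = kept[:max_count]
    return kept
```

The proof timeline is not part of this list: `last_proven_at` would live in a separate record that pruning never touches.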

The full annotated schema lives in deploy/compose/config.example.yaml.
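To make the `sql` expect grammar from the Checks entry concrete, here is a small evaluator sketch. It is not Redrill's parser: it assumes `between` is inclusive, `matches` searches anywhere in the value, numbers compare as floats, and durations use a single `s`/`m`/`h`/`d` suffix (only `h` and `d` appear in this README). The names `evaluate` and `parse_duration` are hypothetical.

```python
import operator
import re
from datetime import datetime, timedelta, timezone

_OPS = {">": operator.gt, ">=": operator.ge, "==": operator.eq, "!=": operator.ne}
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Parse shorthand such as '8d' or '36h' into a timedelta."""
    m = re.fullmatch(r"(\d+)([smhd])", text)
    if not m:
        raise ValueError(f"bad duration: {text!r}")
    return timedelta(**{_UNITS[m.group(2)]: int(m.group(1))})


def evaluate(expect, actual, now=None):
    """Check one query result against an expectation string."""
    parts = expect.split()
    if not parts:
        raise ValueError("empty expectation")
    head = parts[0]
    if head == "nonempty" and len(parts) == 1:
        return actual is not None and str(actual) != ""
    if head == "matches" and len(parts) >= 2:
        pattern = expect.split(None, 1)[1]
        return re.search(pattern, str(actual)) is not None
    if head == "between" and len(parts) == 3:
        return float(parts[1]) <= float(actual) <= float(parts[2])
    if head == "age" and len(parts) == 3 and parts[1] in ("<", ">"):
        # actual must be a timezone-aware datetime, e.g. max(created_at).
        now = now or datetime.now(timezone.utc)
        age = now - actual
        limit = parse_duration(parts[2])
        return age < limit if parts[1] == "<" else age > limit
    if head in _OPS and len(parts) == 2:
        return _OPS[head](float(actual), float(parts[1]))
    raise ValueError(f"unsupported expectation: {expect!r}")
```

With this shape, the two checks from the config example reduce to `evaluate("> 0", row_count)` and `evaluate("age < 8d", newest_event_time)`, where both arguments are whatever the query returned.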

Safety measures

Read-only by construction: the drivers have no write, prune, or delete code paths.
Secrets are referenced by *_file/*_env only and redacted from stored output.
L3 sandboxes run with network=none, memory limits, and guaranteed cleanup.
The Docker socket is needed only for L3; drop it to keep L1/L2.
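For a sense of what the L3 sandbox constraints look like at the container level, the sketch below builds `docker run` and `docker rm` command lines with the network disabled and a memory cap. It illustrates the flags only and is not how Redrill invokes the runtime; the helper names and the container name are hypothetical, and `POSTGRES_HOST_AUTH_METHOD=trust` is added purely so the official Postgres image starts without a password inside an isolated sandbox.

```python
import shlex


def sandbox_argv(image, memory, name):
    """Build a `docker run` command for an isolated, memory-capped sandbox."""
    return [
        "docker", "run",
        "--rm",                 # remove the container once it stops
        "--detach",
        "--name", name,
        "--network", "none",    # no network access at all
        "--memory", memory,     # hard memory limit, e.g. "1g"
        "--env", "POSTGRES_HOST_AUTH_METHOD=trust",
        image,
    ]


def cleanup_argv(name):
    """Build the command that force-removes the sandbox, running or not."""
    return ["docker", "rm", "--force", name]


if __name__ == "__main__":
    print(shlex.join(sandbox_argv("postgres:16", "1g", "redrill-l3-demo")))
    print(shlex.join(cleanup_argv("redrill-l3-demo")))
```

Running the cleanup command from a `finally` block, in addition to `--rm`, is one way to approximate guaranteed cleanup even when a drill is interrupted midway.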

Trusting the verifier

A verifier that can't be trusted is worse than none. On every change the test suite runs, including real-engine drills in throwaway containers. Some of those tests feed Redrill deliberately broken backups that are byte-perfect but semantically dead (for example, an empty-but-valid gzip, a dump of the wrong database, or a stale snapshot), and the build fails unless it flags each one.
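The "empty-but-valid gzip" case is easy to reproduce, which makes it a useful fixture. The sketch below is an illustration, not one of Redrill's actual tests: it builds such a file in memory and shows that a pure decompression test accepts it even though it carries no data. The function names are hypothetical.

```python
import gzip


def make_empty_gzip() -> bytes:
    """A structurally valid gzip stream whose payload is zero bytes."""
    return gzip.compress(b"")


def looks_valid_but_dead(blob: bytes) -> bool:
    """True when the stream decompresses cleanly yet contains nothing."""
    payload = gzip.decompress(blob)  # raises if the stream is corrupt
    return len(payload) == 0
```

A verifier that stops at the compression test would pass this file; only a size floor or an actual load into a database exposes it.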

Feedback

Redrill is under active development. Bug reports and ideas are welcome through issues, especially which backup tools should be supported next.
