Skip to content

fix(tra-898): cloudflared survives power-loss boot race#491

Merged
mikestankavich merged 1 commit into
mainfrom
fix/tra-898-cloudflared-powerloss
Jun 13, 2026
Merged

fix(tra-898): cloudflared survives power-loss boot race#491
mikestankavich merged 1 commit into
mainfrom
fix/tra-898-cloudflared-powerloss

Conversation

@mikestankavich

Copy link
Copy Markdown
Contributor

Problem

After a power loss at the demo box (observed live at Tim's site today as Cloudflare error 1033), the Cloudflare tunnel never came back on its own.

Root cause: on a cold boot, cloudflared can start before upstream DNS/internet is reachable. Edge discovery fails:

ERR edge discovery: error looking up Cloudflare edge IPs: the DNS query failed
    error="lookup _v2-origintunneld._tcp.argotunnel.com on 10.89.0.1:53: no such host"

systemd then restarts it fast enough to exhaust the default 5-in-10s start-rate limit (Start request repeated too quickly), parking the unit in failed — so it never retries even once the network is up. Every other container recovers fine because only cloudflared needs the external edge.

Fix

In the quadlet (deploy/edge/quadlets/cloudflared.container):

  • [Unit] StartLimitIntervalSec=0 — disable the start-rate limiter so Restart=always never gives up.
  • [Service] RestartSec=10 — wait between attempts so retries loop patiently through a slow network-up instead of burning the burst.

Verification

Applied live on the demo box: daemon-reload, confirmed StartLimitIntervalUSec=0 / RestartUSec=10s on the generated unit, tunnel reconnected (4 QUIC edge connections, app.demo.trakrf.id -> https://traefik:443), 1033 cleared. Manual recovery for an already-failed unit remains systemctl --user reset-failed cloudflared.service && systemctl --user start cloudflared.service.

🤖 Generated with Claude Code

On a cold start after power loss, cloudflared can launch before upstream
DNS/internet is reachable. Edge discovery fails, and the default systemd
5-in-10s restart burst is exhausted, parking the unit in 'failed' so it
never retries even once the network comes up (observed as Cloudflare 1033).

Disable the start-rate limiter (StartLimitIntervalSec=0) and add RestartSec=10
so Restart=always loops patiently until the edge resolves and self-heals.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@github-actions

Copy link
Copy Markdown

🚀 Preview Deployment Update

✅ This PR has been successfully merged into the preview branch.

The preview environment will update shortly at: https://app.preview.trakrf.id

@mikestankavich mikestankavich merged commit f47bd39 into main Jun 13, 2026
14 checks passed
@mikestankavich mikestankavich deleted the fix/tra-898-cloudflared-powerloss branch June 13, 2026 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant