PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel #620

Open
eddiebarth1 wants to merge 1 commit into main from plt-2119/stabilize-slo-tag-agnostic-psl

eddiebarth1 commented Apr 30, 2026

Closes PLT-2119

Problem

Per-deploy PrometheusServiceLevel objects ({app}-{target}-{tag}-servicelevels) caused Sloth to regenerate burn-rate recording rules on every deploy, resetting 1h/6h windows and producing duplicate alert series. The result: chronic SLO violations fired N times (once per revision) and burn-rate trends were untrustworthy.

What changed

  • One shared PSL per app/target ({app}-{target}-servicelevels) — stable across deploys.
  • Tag-preserving SLI queries: SLI events use sum by (tag) (rate(...)) instead of sum(rate(...)). Sloth wraps these without additional aggregation and its burn-rate alert template (max(...) without (sloth_window)) preserves the tag label through to the alert series. Per-revision rollback granularity is preserved at the Prometheus query layer rather than via per-deploy Kubernetes objects.
  • Single writer: ResourceSyncer.syncServiceLevels upserts the PSL once per reconcile from the leading releasable revision (mirroring the syncSLORules pattern). This avoids per-incarnation churn and multi-writer oscillation when SLO config differs between concurrent revisions.
  • Rolling migration: cleanupLegacyServiceLevels runs on each Releasing/Canarying reconcile, scrubbing any legacy per-tag PSL for the current revision. Self-limiting (no-op once cleaned); handles infrequently-deployed services without coordinated rollout.
  • Retirement cleanup: DeleteServiceLevels wired into Deleting/Failing to remove the shared PSL when an app/target is fully retired.
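
The naming change and the self-limiting cleanup can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `pslStore`, `upsertShared`, and `cleanupLegacy` are hypothetical stand-ins that use an in-memory map where the real controller talks to the Kubernetes API, and the app, target, and tag values are made up. Only the two name formats come from the description above.

```go
package main

import "fmt"

// legacyPSLName follows the per-deploy scheme described under Problem:
// {app}-{target}-{tag}-servicelevels.
func legacyPSLName(app, target, tag string) string {
	return fmt.Sprintf("%s-%s-%s-servicelevels", app, target, tag)
}

// sharedPSLName follows the new tag-agnostic scheme:
// {app}-{target}-servicelevels.
func sharedPSLName(app, target string) string {
	return fmt.Sprintf("%s-%s-servicelevels", app, target)
}

// pslStore is a hypothetical in-memory stand-in for the cluster,
// mapping object name to SLO spec.
type pslStore map[string]string

// upsertShared writes the single shared object, overwriting any earlier spec.
func (s pslStore) upsertShared(app, target, spec string) {
	s[sharedPSLName(app, target)] = spec
}

// cleanupLegacy removes the legacy per-tag object for one revision and
// reports whether anything was deleted. Calling it again is a no-op.
func (s pslStore) cleanupLegacy(app, target, tag string) bool {
	name := legacyPSLName(app, target, tag)
	if _, ok := s[name]; !ok {
		return false
	}
	delete(s, name)
	return true
}

func main() {
	store := pslStore{legacyPSLName("echo", "prod", "v41"): "old spec"}
	store.upsertShared("echo", "prod", "spec from leading revision")

	fmt.Println(store.cleanupLegacy("echo", "prod", "v41")) // first reconcile deletes the legacy object
	fmt.Println(store.cleanupLegacy("echo", "prod", "v41")) // later reconciles find nothing to delete
	fmt.Println(len(store), sharedPSLName("echo", "prod"))
}
```

Because the shared name does not contain the tag, repeated deploys keep writing to the same object, which is what lets the burn-rate windows accumulate instead of restarting.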

Canary/rollback behavior (unchanged from main)

IsRevisionTriggered matches alert samples by sample.Metric["tag"]. Because the new SLI queries preserve tag through Sloth's recording rules and burn-rate alerts, per-revision SLO-driven rollback continues to work exactly as it did before this change. No changes to prometheus/api.go.
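
As a rough sketch of the matching described above: the `sample` type and `isRevisionTriggered` function here are simplified, hypothetical stand-ins, not the actual code in prometheus/api.go. The only detail taken from this PR is that alert samples are matched on their `tag` label.

```go
package main

import "fmt"

// sample is a hypothetical, simplified alert sample: just a label set.
type sample struct {
	Metric map[string]string
}

// isRevisionTriggered reports whether any firing alert sample carries the
// given revision tag. A sample without a "tag" label never matches a
// non-empty tag, which is why the label must survive Sloth's rules.
func isRevisionTriggered(samples []sample, tag string) bool {
	for _, s := range samples {
		if s.Metric["tag"] == tag {
			return true
		}
	}
	return false
}

func main() {
	firing := []sample{
		{Metric: map[string]string{"alertname": "SLOBurnRate", "tag": "v42"}},
	}
	fmt.Println(isRevisionTriggered(firing, "v42")) // the canary revision is implicated
	fmt.Println(isRevisionTriggered(firing, "v41")) // the stable revision is not
}
```

If the SLI queries aggregated with a plain `sum(...)`, the alert samples would have no `tag` label and a check of this shape could not attribute a violation to any one revision.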

Validation

  • tag is the universal convention in mono SLI queries — confirmed across rito, lite, medium2, tutu, echo, bevy
  • Sloth wraps SLI as (error)/(total) with no additional aggregation (slok/sloth/internal/plugin/slo/core/sli_rules_v1/plugin.go)
  • Sloth burn-rate alert template max(...) without (sloth_window) preserves all labels except window (slok/sloth/internal/plugin/slo/core/alert_rules_v1/plugin.go)
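
To illustrate why the label survives, here is a small sketch of the `(error)/(total)` wrapping. `renderSLI` is a hypothetical helper, not Sloth's implementation, and the metric name `http_requests_total` is an assumed example; the sketch only substitutes the `{{.window}}` placeholder and joins the two queries.

```go
package main

import (
	"fmt"
	"strings"
)

// renderSLI substitutes the window into both queries and joins them as
// (error) / (total), the shape the validation notes attribute to Sloth.
// No aggregation is added around either side.
func renderSLI(errorQuery, totalQuery, window string) string {
	e := strings.ReplaceAll(errorQuery, "{{.window}}", window)
	t := strings.ReplaceAll(totalQuery, "{{.window}}", window)
	return "(" + e + ") / (" + t + ")"
}

func main() {
	// Assumed example queries; the metric name is illustrative only.
	errQ := `sum by (tag) (rate(http_requests_total{code=~"5.."}[{{.window}}]))`
	totQ := `sum by (tag) (rate(http_requests_total[{{.window}}]))`
	fmt.Println(renderSLI(errQ, totQ, "5m"))
}
```

Both sides of the division carry exactly one label, `tag`, so PromQL's one-to-one vector matching pairs them per revision and the resulting ratio series keeps that label.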

Test plan

  • go test ./...
  • go vet ./..., make lint
  • Staging: confirm one PSL per app/target, recording rules include sum by (tag), alerts carry tag=<revision>
  • Verify burn-rate windows survive a deploy without resetting
  • Force a canary violation, confirm rollback fires for the offending tag

🤖 Generated with Claude Code

- One PrometheusServiceLevel per app/target (`{app}-{target}-servicelevels`),
  upserted by ResourceSyncer once per reconcile from the leading releasable
  revision. Single writer; no per-incarnation churn.
- SLI queries use `sum by (tag) (rate(...))` so Sloth's recording rules and
  burn-rate alerts preserve the `tag` label. Per-revision rollback continues
  to match by `sample.Metric["tag"]` without changes to prometheus/api.go.
- Legacy per-tag PSLs are scrubbed by cleanupLegacyServiceLevels on each
  Releasing/Canarying reconcile (self-limiting).
- DeleteServiceLevels wired into Deleting/Failing for retirement cleanup of
  the shared PSL.

Validated against Sloth: SLI wrapping is `(error)/(total)` with no extra
aggregation; burn-rate alerts use `max(...) without (sloth_window)` which
preserves `tag` through to alert series.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@eddiebarth1 eddiebarth1 force-pushed the plt-2119/stabilize-slo-tag-agnostic-psl branch from 257e2a9 to fd86181 Compare May 19, 2026 20:57