PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel #620
Open
eddiebarth1 wants to merge 1 commit into
- One PrometheusServiceLevel per app/target (`{app}-{target}-servicelevels`),
upserted by ResourceSyncer once per reconcile from the leading releasable
revision. Single writer; no per-incarnation churn.
- SLI queries use `sum by (tag) (rate(...))` so Sloth's recording rules and
burn-rate alerts preserve the `tag` label. Per-revision rollback continues
to match by `sample.Metric["tag"]` without changes to prometheus/api.go.
- Legacy per-tag PSLs are scrubbed by cleanupLegacyServiceLevels on each
Releasing/Canarying reconcile (self-limiting).
- DeleteServiceLevels wired into Deleting/Failing for retirement cleanup of
the shared PSL.
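The naming scheme and query shape in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names (`sharedPSLName`, `legacyPSLName`, `tagPreservingSLI`), the example app/target/tag values, and the metric selector are all hypothetical. The `{{.window}}` placeholder is the template variable Sloth substitutes into SLI queries.

```go
package main

import "fmt"

// sharedPSLName returns the tag-agnostic object name, one per app/target.
// Hypothetical helper; the name format is taken from the description above.
func sharedPSLName(app, target string) string {
	return fmt.Sprintf("%s-%s-servicelevels", app, target)
}

// legacyPSLName returns the old per-tag object name that cleanup scrubs.
// Hypothetical helper for illustration only.
func legacyPSLName(app, target, tag string) string {
	return fmt.Sprintf("%s-%s-%s-servicelevels", app, target, tag)
}

// tagPreservingSLI wraps a metric selector so the aggregated result keeps
// the tag label instead of collapsing all revisions into one series.
func tagPreservingSLI(selector string) string {
	return fmt.Sprintf("sum by (tag) (rate(%s[{{.window}}]))", selector)
}

func main() {
	fmt.Println(sharedPSLName("rito", "production"))
	fmt.Println(legacyPSLName("rito", "production", "abc123"))
	// The metric name and label matchers below are assumptions.
	fmt.Println(tagPreservingSLI(`http_requests_total{app="rito",code=~"5.."}`))
}
```

Because the shared name carries no tag, a new deploy updates the same object rather than creating a new one, which is what keeps Sloth from regenerating rules per revision.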
Validated against Sloth: SLI wrapping is `(error)/(total)` with no extra
aggregation; burn-rate alerts use `max(...) without (sloth_window)` which
preserves `tag` through to alert series.
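To see why `max(...) without (sloth_window)` keeps per-revision granularity, the sketch below mimics that aggregation over in-memory series and then matches the result by `tag`. It is a simplified model under stated assumptions: the `series` type, `maxWithout`, and `anyForTag` are hypothetical names, the values are made up, and real PromQL evaluation (and the PR's actual rollback check) is more involved.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// series is a hypothetical stand-in for one labelled sample.
type series struct {
	labels map[string]string
	value  float64
}

// maxWithout mimics PromQL `max(...) without (drop)`: it removes one label,
// groups series by the remaining label set, and keeps the max per group.
func maxWithout(in []series, drop string) []series {
	groups := map[string]*series{}
	var order []string // first-seen order, for deterministic output
	for _, s := range in {
		kept := map[string]string{}
		names := make([]string, 0, len(s.labels))
		for k, v := range s.labels {
			if k == drop {
				continue
			}
			kept[k] = v
			names = append(names, k)
		}
		sort.Strings(names)
		parts := make([]string, 0, len(names))
		for _, k := range names {
			parts = append(parts, k+"="+kept[k])
		}
		key := strings.Join(parts, ",")
		if g, ok := groups[key]; ok {
			if s.value > g.value {
				g.value = s.value
			}
			continue
		}
		groups[key] = &series{labels: kept, value: s.value}
		order = append(order, key)
	}
	out := make([]series, 0, len(order))
	for _, k := range order {
		out = append(out, *groups[k])
	}
	return out
}

// anyForTag reports whether any series carries the given tag label,
// the same kind of match the rollback path performs on alert samples.
func anyForTag(alerts []series, tag string) bool {
	for _, a := range alerts {
		if a.labels["tag"] == tag {
			return true
		}
	}
	return false
}

func main() {
	in := []series{
		{labels: map[string]string{"tag": "v1", "sloth_window": "1h"}, value: 0.2},
		{labels: map[string]string{"tag": "v1", "sloth_window": "6h"}, value: 0.5},
		{labels: map[string]string{"tag": "v2", "sloth_window": "1h"}, value: 0.9},
	}
	for _, s := range maxWithout(in, "sloth_window") {
		fmt.Println(s.labels, s.value)
	}
}
```

Only `sloth_window` is dropped, so each revision still yields its own series and a match on `tag` can single out one revision.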
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
257e2a9 to fd86181
Closes PLT-2119
Problem
Per-deploy `PrometheusServiceLevel` objects (`{app}-{target}-{tag}-servicelevels`) caused Sloth to regenerate burn-rate recording rules on every deploy, resetting 1h/6h windows and producing duplicate alert series. The result: chronic SLO violations fired N times (once per revision) and burn-rate trends were untrustworthy.
What changed
- One `PrometheusServiceLevel` per app/target (`{app}-{target}-servicelevels`), stable across deploys.
- SLI queries use `sum by (tag) (rate(...))` instead of `sum(rate(...))`. Sloth wraps these without additional aggregation, and its burn-rate alert template (`max(...) without (sloth_window)`) preserves the `tag` label through to the alert series. Per-revision rollback granularity is preserved at the Prometheus query layer rather than via per-deploy Kubernetes objects.
- `ResourceSyncer.syncServiceLevels` upserts the PSL once per reconcile from the leading releasable revision (mirrors the `syncSLORules` pattern). No per-incarnation churn, and no multi-writer oscillation when SLO config differs between concurrent revisions.
- `cleanupLegacyServiceLevels` runs on each `Releasing`/`Canarying` reconcile, scrubbing any legacy per-tag PSL for the current revision. Self-limiting (no-op once cleaned); handles infrequently-deployed services without coordinated rollout.
- `DeleteServiceLevels` is wired into `Deleting`/`Failing` to remove the shared PSL when an app/target is fully retired.
Canary/rollback behavior (unchanged from main)
`IsRevisionTriggered` matches alert samples by `sample.Metric["tag"]`. Because the new SLI queries preserve `tag` through Sloth's recording rules and burn-rate alerts, per-revision SLO-driven rollback continues to work exactly as it did before this change. No changes to `prometheus/api.go`.
Validation
- `tag` is the universal convention in mono SLI queries; confirmed across rito, lite, medium2, tutu, echo, bevy.
- Sloth's SLI wrapping is `(error)/(total)` with no additional aggregation (`slok/sloth/internal/plugin/slo/core/sli_rules_v1/plugin.go`).
- Sloth's burn-rate alerts use `max(...) without (sloth_window)`, which preserves all labels except window (`slok/sloth/internal/plugin/slo/core/alert_rules_v1/plugin.go`).
Test plan
- `go test ./...`
- `go vet ./...`, `make lint`
- `sum by (tag)`, alerts carry `tag=<revision>`
🤖 Generated with Claude Code