CNV-87529: management: add alert rule classification system #976
Open: sradco wants to merge 1 commit into openshift:main-alerts-management-api from sradco:alert-mgmt-restructured-04-classification
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,253 @@ | ||
| # Alert Rule Classification - Design and Usage | ||
|
|
||
| ## Overview | ||
| The backend classifies Prometheus alerting rules into a "component" and an "impact layer". It: | ||
| - Computes an `openshift_io_alert_rule_id` per alerting rule. | ||
| - Determines component/layer based on matcher logic and rule labels. | ||
| - Allows operator-managed classification overrides via AlertRelabelConfigs (ARCs) for platform | ||
| rules. Operator-managed classification overrides of user-defined workload rules require the `ENABLE_USER_WORKLOAD_ARCS` feature flag. | ||
| - Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`. | ||
|
|
||
| This document explains how it works, how to override, and how to test it. | ||
|
|
||
|
|
||
| ## Terminology | ||
| - openshift_io_alert_rule_id: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` + base64url(nopad(sha256(payload))). Independent of `PrometheusRule` name. | ||
| - component: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, a namespace, etc.). | ||
| - layer: Impact scope. Allowed values: | ||
| - `cluster` | ||
| - `namespace` | ||
|
|
||
| Notes: | ||
| - **Stability**: | ||
| - The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change. | ||
| - For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition. | ||
| - For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id. | ||
| - Layer values are validated as `cluster` or `namespace` when set. To remove an override, set the field to `null` via the API; empty/invalid values are ignored at read time. | ||
|
|
||
| ## Rule ID computation (openshift_io_alert_rule_id) | ||
| Location: `pkg/alert_rule/alert_rule.go` | ||
|
|
||
| The backend computes a specHash-like value from: | ||
| - `kind`/`name`: `alert` plus the `alert:` name, or `record` plus the `record:` name | ||
| - `expr`: trimmed with consecutive whitespace collapsed | ||
| - `for`: trimmed (duration string as written in the rule) | ||
| - `labels`: only non-system labels | ||
| - excludes labels with `openshift_io_` prefix and the `alertname` label | ||
| - drops empty values | ||
| - keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`) | ||
| - sorted by key and joined as `key=value` lines | ||
|
|
||
| Annotations are intentionally ignored to reduce id churn on documentation-only changes. | ||
|
|
||
| ## Classification Logic (How component/layer are determined) | ||
| Location: `pkg/alertcomponent/matcher.go` | ||
|
|
||
| 1) The code adapts `cluster-health-analyzer` matchers: | ||
| - CVO-related alerts (update/upgrade) → component/layer based on known patterns | ||
| - Compute / node-related alerts | ||
| - Core control plane components (layer set to `cluster`) | ||
| - Workload/namespace-level alerts (layer set to `namespace`) | ||
|
|
||
| 2) Fallback: | ||
| - If the computed component is empty or "Others", we set: | ||
| - `component = other` | ||
| - `layer` derived from source: | ||
| - `openshift_io_alert_source=platform` → `cluster` | ||
| - `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster` | ||
| - `prometheus` label starting with `openshift-monitoring/` → `cluster` | ||
| - otherwise → `namespace` | ||
|
|
||
| 3) Result: | ||
| - Each alerting rule is assigned a `(component, layer)` tuple following the above logic. | ||
|
|
||
| ## Developer Overrides via Rule Labels (Recommended) | ||
| If you want explicit component/layer values and do not want to rely on the matcher, set | ||
| these labels on each rule in your `PrometheusRule`: | ||
| - `openshift_io_alert_rule_component` | ||
| - `openshift_io_alert_rule_layer` | ||
|
|
||
| Both are validated the same way as API overrides: | ||
| - `component`: 1-253 characters; alphanumeric plus `.`, `_`, and `-`; must start and end with an alphanumeric character | ||
| - `layer`: `cluster` or `namespace` | ||
|
|
||
| When these labels are present and valid, they override matcher-derived values. | ||
|
|
||
| ## Classification Override Storage | ||
|
|
||
| Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go` | ||
|
|
||
| Classification overrides are stored differently depending on the rule type: | ||
|
|
||
| ### Platform rules → AlertRelabelConfig (ARC) | ||
|
|
||
| For operator-managed platform rules (rules whose `PrometheusRule` is registered as a | ||
| platform resource), overrides are stored in an `AlertRelabelConfig` (ARC) CR in the | ||
| `openshift-monitoring` namespace. | ||
|
|
||
| - **ARC naming**: `arc-<sanitized-pr-name>-<short-hash-of-rule-id>` | ||
| (generated by `k8s.GetAlertRelabelConfigName`) | ||
| - **ARC namespace**: `openshift-monitoring` | ||
| - **Shared ARC**: classification labels are written into the same ARC that the platform | ||
| alert management path uses for other label changes (severity, Drop/Restore). This avoids | ||
| creating separate CRs per concern. | ||
| - **Labels on the ARC**: | ||
| - `monitoring.openshift.io/prometheus-rule-name`: name of the source `PrometheusRule` | ||
| - `monitoring.openshift.io/alert-name`: alert name | ||
| - **Annotation on the ARC**: | ||
| - `monitoring.openshift.io/alert-rule-id`: the `openshift_io_alert_rule_id` | ||
|
|
||
| The ARC contains `RelabelConfig` entries that: | ||
| 1. Match the rule by its original labels (alert name + all non-namespace labels) and | ||
| stamp `openshift_io_alert_rule_id` via a `Replace` action. | ||
| 2. Apply each classification label as a `Replace` action keyed on `openshift_io_alert_rule_id`. | ||
|
|
||
| When all overrides are removed, the ARC is deleted. | ||
|
|
||
| **AlertingRule CR distinction:** Some platform alerts are defined via `AlertingRule` CRs, | ||
| which the cluster-monitoring-operator reconciles into `PrometheusRule` resources. | ||
| Classification overrides always use the ARC path regardless of `AlertingRule` | ||
| management status — this endpoint never writes directly to `AlertingRule` CRs. | ||
| (Other management endpoints, such as severity updates, may write to unmanaged | ||
| `AlertingRule` CRs directly, but classification is ARC-only.) | ||
|
|
||
| ### User-defined workload rules → blocked by default, ARC when enabled | ||
|
|
||
| Classification updates for operator-managed user-defined workload rules are **not | ||
| allowed by default**. The API returns a `NotAllowedError` when the feature flag is | ||
| disabled. | ||
|
|
||
| ### Feature flag: `ENABLE_USER_WORKLOAD_ARCS` | ||
|
|
||
| Setting the environment variable `ENABLE_USER_WORKLOAD_ARCS=true` enables full | ||
| alert management for operator-managed user-defined workload rules, including | ||
| classification overrides, label updates, and rule disable/enable (Drop/Restore). | ||
| When enabled, these rules use the same ARC-based path as platform rules, with | ||
| ARCs stored in the `openshift-user-workload-monitoring` namespace. | ||
|
|
||
| ### Dynamic classification (`_from` labels) | ||
|
|
||
| Two special labels allow deriving component/layer dynamically from the alert itself | ||
| at query time: | ||
| - `openshift_io_alert_rule_component_from`: name of an alert label whose value | ||
| becomes the component (e.g., `"name"` → use the alert's `name` label). | ||
| - `openshift_io_alert_rule_layer_from`: same pattern for layer. | ||
|
|
||
| These `_from` labels are stored in the ARC alongside static classification labels. | ||
| At read time, `ApplyDynamicClassification` resolves them against the alert's labels. | ||
|
|
||
| ### Read path | ||
|
|
||
| The read path is unified regardless of storage mechanism: | ||
| 1. The relabeled rules cache (`k8s.RelabeledRules().Get`) returns each rule with all | ||
| ARC relabel configs already applied. This means classification labels (whether set | ||
| via ARC or directly on the `PrometheusRule`) are available as rule labels. | ||
| 2. `ApplyDynamicClassification` checks for `_from` labels on the relabeled rule and | ||
| resolves them against the alert's own labels to produce the final component/layer. | ||
|
|
||
| Notes: | ||
| - `_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`). | ||
| - If a `_from` label is present but the alert does not carry that label or the derived | ||
| value is invalid, the backend falls back to static values (if present) or defaults. | ||
| - If all overrides are removed, the ARC is deleted. | ||
|
|
||
|
|
||
| ## Alerts API Enrichment | ||
| Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go` | ||
|
|
||
| - Endpoint: `GET /api/v1/alerting/alerts` (Prometheus-compatible schema) | ||
| - The backend fetches active alerts and enriches each alert with: | ||
| - `openshift_io_alert_rule_id` | ||
| - `openshift_io_alert_component` | ||
| - `openshift_io_alert_layer` | ||
| - `prometheusRuleName`: name of the PrometheusRule resource the alert originates from | ||
| - `prometheusRuleNamespace`: namespace of that PrometheusRule resource | ||
| - `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR) | ||
| - Prometheus compatibility: | ||
| - Base response matches Prometheus `/api/v1/alerts`. | ||
| - Additional fields are additive and safe for clients like Perses. | ||
|
|
||
| ## Prometheus/Thanos Sources | ||
| Location: `pkg/k8s/prometheus_alerts.go` | ||
|
|
||
| - Order of candidates: | ||
| 1) Thanos Route `thanos-querier` at `/api/v1/alerts` (built as `/api` + `/v1/alerts`; oauth-proxied) | ||
| 2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts` | ||
| 3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts` | ||
| 4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback) | ||
| 5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts` | ||
|
|
||
| - TLS and Auth: | ||
| - Bearer token: service account token from in-cluster config. | ||
| - CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`. | ||
|
|
||
| RBAC: | ||
| - Read routes in `openshift-monitoring`. | ||
| - Access `prometheuses/api` as needed for oauth-proxied endpoints. | ||
|
|
||
| ## Updating Rules Classification | ||
| APIs: | ||
| - Single update: | ||
| - Method: `PATCH /api/v1/alerting/rules/{ruleId}` | ||
|
PeterYurkovich marked this conversation as resolved.
|
||
| - Request body: | ||
| ```json | ||
| { | ||
| "classification": { | ||
| "openshift_io_alert_rule_component": "team-x", | ||
| "openshift_io_alert_rule_layer": "namespace", | ||
| "openshift_io_alert_rule_component_from": "name", | ||
| "openshift_io_alert_rule_layer_from": "layer" | ||
| } | ||
| } | ||
| ``` | ||
| - `openshift_io_alert_rule_layer`: `cluster` or `namespace` | ||
| - To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`). | ||
| - Response: | ||
| - 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success. | ||
| - Standard error body on failure (400 validation, 404 not found, etc.) | ||
| - Bulk update: | ||
| - Method: `PATCH /api/v1/alerting/rules` | ||
| - Request body: | ||
| ```json | ||
| { | ||
| "ruleIds": ["<id-a>", "<id-b>"], | ||
| "classification": { | ||
| "openshift_io_alert_rule_component": "etcd", | ||
| "openshift_io_alert_rule_layer": "cluster" | ||
| } | ||
| } | ||
| ``` | ||
| - Response: | ||
| - 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures. | ||
|
|
||
| Direct K8s (supported for power users/GitOps): | ||
| - For platform rules: create or update the `AlertRelabelConfig` CR in `openshift-monitoring` | ||
| with the appropriate relabel configs (respect `resourceVersion` for optimistic concurrency). | ||
| - For user-defined rules (requires `ENABLE_USER_WORKLOAD_ARCS=true`): create or update the | ||
| `AlertRelabelConfig` CR in `openshift-user-workload-monitoring`. | ||
| - The UI should check update permissions with a SelfSubjectAccessReview before showing an editor. | ||
|
|
||
| Notes: | ||
| - These endpoints are intended for updating **classification only** (component/layer overrides), | ||
| with permissions enforced based on the rule's ownership (platform, user workload, operator-managed, | ||
| GitOps-managed). | ||
| - To update other rule fields (expr/labels/annotations/etc.), use `PATCH /api/v1/alerting/rules/{ruleId}`. | ||
| Clients that need to update both should issue two requests. The combined operation is not atomic. | ||
|
|
||
| ## Security Notes | ||
| - Classification overrides are stored in AlertRelabelConfig CRs (`openshift-monitoring` | ||
| for platform rules, `openshift-user-workload-monitoring` for user-defined rules when | ||
| enabled), subject to standard Kubernetes RBAC. | ||
| - No secrets or sensitive data are persisted in classification metadata. | ||
|
|
||
| ## Testing and Ops | ||
| Unit tests: | ||
| - `pkg/management/update_classification_test.go` | ||
| - ARC-based classification for platform rules, blocked-by-default for user-defined | ||
| rules, ARC in user-workload namespace when flag enabled, dynamic `_from` label resolution. | ||
| - `pkg/management/get_alerts_test.go` | ||
| - Alert enrichment with classification labels, `_from` label behavior, fallback behavior. | ||
|
|
||
| ## Future Work | ||
| - Optional composite update API if we need to update rule fields and classification atomically. | ||
| - De-duplication/merge logic when aggregating alerts across sources. | ||
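As a rough illustration of the "Rule ID computation" section of the document above, the following Go sketch canonicalizes a rule and encodes it as `rid_` + base64url (no padding) of a SHA-256 digest. The helper name `computeRuleID` and the exact payload layout (`kind=`, `name=`, `expr=`, `for=` lines followed by sorted `key=value` label lines) are assumptions made for this example; the real implementation in `pkg/alert_rule/alert_rule.go` is not shown in this diff and may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// validLabelName mirrors the Prometheus label-name pattern quoted in the doc.
var validLabelName = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// computeRuleID is a hypothetical helper. The payload layout is an assumption;
// only the filtering rules and the rid_ + base64url(sha256) encoding come from the doc.
func computeRuleID(kind, name, expr, forDur string, labels map[string]string) string {
	lines := []string{
		"kind=" + kind,
		"name=" + name,
		// strings.Fields trims the expression and collapses runs of whitespace.
		"expr=" + strings.Join(strings.Fields(expr), " "),
		"for=" + strings.TrimSpace(forDur),
	}

	keys := make([]string, 0, len(labels))
	for k, v := range labels {
		if strings.HasPrefix(k, "openshift_io_") || k == "alertname" {
			continue // system labels do not influence the id
		}
		if v == "" || !validLabelName.MatchString(k) {
			continue // drop empty values and invalid label names
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		lines = append(lines, k+"="+labels[k])
	}

	sum := sha256.Sum256([]byte(strings.Join(lines, "\n")))
	// RawURLEncoding is the URL-safe alphabet without '=' padding.
	return "rid_" + base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	id := computeRuleID("alert", "HighLatency", "rate(x[5m])  >  1", "10m",
		map[string]string{"severity": "warning", "alertname": "HighLatency"})
	fmt.Println(id)
}
```

Because annotations never enter the payload, a documentation-only change leaves the id untouched in this sketch, while any change to the expression, duration, or business labels yields a different digest.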
65 changes: 65 additions & 0 deletions
internal/managementrouter/alert_rule_classification_patch.go
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| package managementrouter | ||
|
|
||
| import "encoding/json" | ||
|
|
||
| // AlertRuleClassificationPatch represents a partial update ("patch") payload for | ||
| // alert rule classification labels. | ||
| // | ||
| // This type supports a three-state contract per field: | ||
| // - omitted: leave unchanged | ||
| // - null: clear the override | ||
| // - string: set the override | ||
| // | ||
| // Note: Go's encoding/json cannot represent "explicit null" vs "omitted" using **string | ||
| // (both decode to nil), so we custom-unmarshal and track key presence with *Set flags. | ||
| type AlertRuleClassificationPatch struct { | ||
| Component *string `json:"openshift_io_alert_rule_component,omitempty"` | ||
| ComponentSet bool `json:"-"` | ||
| Layer *string `json:"openshift_io_alert_rule_layer,omitempty"` | ||
| LayerSet bool `json:"-"` | ||
| ComponentFrom *string `json:"openshift_io_alert_rule_component_from,omitempty"` | ||
| ComponentFromSet bool `json:"-"` | ||
| LayerFrom *string `json:"openshift_io_alert_rule_layer_from,omitempty"` | ||
| LayerFromSet bool `json:"-"` | ||
| } | ||
|
|
||
| func (p *AlertRuleClassificationPatch) UnmarshalJSON(b []byte) error { | ||
| var m map[string]json.RawMessage | ||
| if err := json.Unmarshal(b, &m); err != nil { | ||
| return err | ||
| } | ||
|
|
||
| decodeNullableString := func(key string) (set bool, v *string, err error) { | ||
| raw, ok := m[key] | ||
| if !ok { | ||
| return false, nil, nil | ||
| } | ||
| if len(raw) == 0 || string(raw) == "null" { | ||
| return true, nil, nil | ||
| } | ||
| var s string | ||
| if err := json.Unmarshal(raw, &s); err != nil { | ||
| return true, nil, err | ||
| } | ||
| return true, &s, nil | ||
| } | ||
|
|
||
| var err error | ||
| p.ComponentSet, p.Component, err = decodeNullableString("openshift_io_alert_rule_component") | ||
| if err != nil { | ||
| return err | ||
| } | ||
| p.LayerSet, p.Layer, err = decodeNullableString("openshift_io_alert_rule_layer") | ||
| if err != nil { | ||
| return err | ||
| } | ||
| p.ComponentFromSet, p.ComponentFrom, err = decodeNullableString("openshift_io_alert_rule_component_from") | ||
| if err != nil { | ||
| return err | ||
| } | ||
| p.LayerFromSet, p.LayerFrom, err = decodeNullableString("openshift_io_alert_rule_layer_from") | ||
| if err != nil { | ||
| return err | ||
| } | ||
| return nil | ||
| } |
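The comment on `AlertRuleClassificationPatch` explains why the type tracks key presence separately. A minimal, standalone sketch of that problem follows: with a plain pointer field, an omitted key and an explicit `null` both decode to `nil`, whereas decoding into `map[string]json.RawMessage` (the approach used in `UnmarshalJSON` above) preserves the distinction. The names `naivePatch`, `decodeNaive`, and `keyPresent` are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// naivePatch is a hypothetical type that relies on a pointer alone.
type naivePatch struct {
	Component *string `json:"openshift_io_alert_rule_component"`
}

// decodeNaive decodes body into naivePatch and returns the pointer field.
func decodeNaive(body string) (*string, error) {
	var p naivePatch
	err := json.Unmarshal([]byte(body), &p)
	return p.Component, err
}

// keyPresent reports whether key appears in the JSON object at all,
// regardless of whether its value is null.
func keyPresent(body, key string) (bool, error) {
	var m map[string]json.RawMessage
	if err := json.Unmarshal([]byte(body), &m); err != nil {
		return false, err
	}
	_, ok := m[key]
	return ok, nil
}

func main() {
	const key = "openshift_io_alert_rule_component"
	bodies := []string{`{}`, `{"openshift_io_alert_rule_component":null}`}
	for _, body := range bodies {
		v, _ := decodeNaive(body)
		present, _ := keyPresent(body, key)
		fmt.Printf("%s: pointer is nil=%v, key present=%v\n", body, v == nil, present)
	}
}
```

Both bodies leave the pointer `nil`, so only the presence check can tell "leave unchanged" apart from "clear the override".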
50 changes: 50 additions & 0 deletions
internal/managementrouter/alert_rule_classification_patch_test.go
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| package managementrouter_test | ||
|
|
||
| import ( | ||
| "encoding/json" | ||
| "testing" | ||
|
|
||
| "github.com/openshift/monitoring-plugin/internal/managementrouter" | ||
| ) | ||
|
|
||
| func TestAlertRuleClassificationPatch_FieldOmitted(t *testing.T) { | ||
| var p managementrouter.AlertRuleClassificationPatch | ||
| if err := json.Unmarshal([]byte(`{}`), &p); err != nil { | ||
| t.Fatalf("unexpected error: %v", err) | ||
| } | ||
| if p.ComponentSet { | ||
| t.Error("expected ComponentSet to be false when field is omitted") | ||
| } | ||
| if p.Component != nil { | ||
| t.Error("expected Component to be nil when field is omitted") | ||
| } | ||
| } | ||
|
|
||
| func TestAlertRuleClassificationPatch_FieldExplicitNull(t *testing.T) { | ||
| var p managementrouter.AlertRuleClassificationPatch | ||
| if err := json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":null}`), &p); err != nil { | ||
| t.Fatalf("unexpected error: %v", err) | ||
| } | ||
| if !p.ComponentSet { | ||
| t.Error("expected ComponentSet to be true when field is explicitly null") | ||
| } | ||
| if p.Component != nil { | ||
| t.Error("expected Component to be nil when field is explicitly null") | ||
| } | ||
| } | ||
|
|
||
| func TestAlertRuleClassificationPatch_FieldString(t *testing.T) { | ||
| var p managementrouter.AlertRuleClassificationPatch | ||
| if err := json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":"team-x"}`), &p); err != nil { | ||
| t.Fatalf("unexpected error: %v", err) | ||
| } | ||
| if !p.ComponentSet { | ||
| t.Error("expected ComponentSet to be true when field is a string") | ||
| } | ||
| if p.Component == nil { | ||
| t.Fatal("expected Component to be non-nil when field is a string") | ||
| } | ||
| if *p.Component != "team-x" { | ||
| t.Errorf("expected Component %q, got %q", "team-x", *p.Component) | ||
| } | ||
| } |
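The three tests above pin down the omitted / null / string contract for a single field. As a sketch of how a caller might act on that contract, the hypothetical helper `applyField` below takes a `*Set` flag and value pair and applies it to a plain map of override labels. Storing overrides in a `map[string]string` is an assumption for illustration; the PR persists them in `AlertRelabelConfig` resources, and that code is not part of this diff.

```go
package main

import "fmt"

// applyField is a hypothetical helper that applies one patch field to a map
// of override labels using the three-state contract:
//   - set == false: key omitted, leave the current value unchanged
//   - set == true, value == nil: explicit null, clear the override
//   - set == true, value != nil: store the new override
func applyField(overrides map[string]string, key string, set bool, value *string) {
	switch {
	case !set:
		// Omitted: nothing to do.
	case value == nil:
		delete(overrides, key)
	default:
		overrides[key] = *value
	}
}

func main() {
	overrides := map[string]string{
		"openshift_io_alert_rule_component": "etcd",
		"openshift_io_alert_rule_layer":     "cluster",
	}
	teamX := "team-x"

	applyField(overrides, "openshift_io_alert_rule_component", true, &teamX)    // set
	applyField(overrides, "openshift_io_alert_rule_layer", true, nil)           // clear
	applyField(overrides, "openshift_io_alert_rule_component_from", false, nil) // untouched

	fmt.Println(overrides)
}
```

With the real type, the arguments would come from pairs such as `p.ComponentSet` and `p.Component` after unmarshalling.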