CNV-87529: management: add alert rule classification system by sradco · Pull Request #976 · openshift/monitoring-plugin

sradco · 2026-06-08T14:22:09Z

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]

Made with Cursor

openshift-ci-robot · 2026-06-08T14:22:19Z

@sradco: This pull request references CNV-80608 which is a valid jira issue.

Details

In response to this:

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-08T14:23:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sradco
Once this PR has been reviewed and has the lgtm label, please assign peteryurkovich for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-06-08T14:27:11Z

@sradco: This pull request references CNV-87529 which is a valid jira issue.

Details

In response to this:

Add classification labels (component,
layer) for alert rules with validation,
ARC-based persistence, and bulk update
support.

Signed-off-by: Shirly Radco [email protected]
Signed-off-by: João Vilaça [email protected]
Signed-off-by: Aviv Litman [email protected]
Co-authored-by: AI Assistant [email protected]

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

sradco · 2026-06-09T11:36:38Z

/retest

openshift-ci · 2026-06-09T14:30:00Z

@sradco: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

PeterYurkovich · 2026-06-15T19:49:34Z

+   - Core control plane components (renamed to layer `cluster`)
+   - Workload/namespace-level alerts (renamed to layer `namespace`)


Avoid using rename since alerts/rules have names:

Suggested change

- Core control plane components (renamed to layer `cluster`)

- Workload/namespace-level alerts (renamed to layer `namespace`)

- Core control plane components (layer set to `cluster`)

- Workload/namespace-level alerts (layer set to `namespace`)

Good catch, I applied the suggestion. Updated to "layer set to" to avoid confusion with alert/rule names.

PeterYurkovich · 2026-06-15T20:14:03Z

+
+// ValidateLayer returns true if the provided layer is one of the allowed values.
+func ValidateLayer(layer string) bool {
+	_, ok := allowedLayers[strings.ToLower(strings.TrimSpace(layer))]


the allowed layers calls ToLower in a number of places, is there a reason that we should accept "Cluster" as a layer value instead of requiring "cluster"

+1. I removed the ToLower/TrimSpace from ValidateLayer , it now requires exact lowercase (cluster or namespace).
The normalization in buildClassificationLabels was also removed since validation already ensures correctness.
This makes the API strict at the boundary.

PeterYurkovich · 2026-06-15T20:17:24Z

+	if req.Layer != nil && !classification.ValidateLayer(*req.Layer) {
+		return &ValidationError{Message: fmt.Sprintf("invalid layer %q (allowed: cluster, namespace)", *req.Layer)}
+	}
+	if req.ComponentFrom != nil {


If req.Component is a whitespace only string then the validation error's. If req.ComponentFrom is a whitespace only string then the validation does NOT error. Is there a reason for the different handling of the 2 fields?

No good reason for the difference, fixed.
ComponentFrom and LayerFrom now reject whitespace-only values with a clear error message ("must not be empty or whitespace-only; set to null to remove"), consistent with how Component rejects empty strings via ValidateComponent.

PeterYurkovich · 2026-06-15T20:38:45Z

+
+	if len(nextChanges) == 0 {
+		if found {
+			if err := c.k8sClient.AlertRelabelConfigs().Delete(ctx, arcNamespace, arcName); err != nil {


Since the len(nextChanges) is called without regard to existingRuleDrops, this code will always attempt to delete an ARC if there are no changes after applying the ARC, which may not be desired. An ARC that was manually created with a drop command on a specific label which does not currently exist on the PrometheusRule will be deleted even if the user wanted the ARC to exist to gate against that label being added in the future

Good catch. Fixed. The ARC deletion check now also considers existingRuleDrops.
The condition is len(nextChanges) == 0 && len(existingRuleDrops) == 0 before deleting.
If there are existing Drop configs on the ARC (e.g., a disabled rule), the ARC is preserved even when classification produces no label changes.

PeterYurkovich · 2026-06-15T20:45:48Z

+// layer_from label, the corresponding alert label value is used instead of the
+// static default. Unresolvable or empty lookups fall back to the supplied
+// defaults.
+func ApplyDynamicClassification(ruleLabels, alertLabels map[string]string, defaultComponent, defaultLayer string) (string, string) {


I assume this will be used in a future PR for get AlertRules, just want to double check

Correct. ApplyDynamicClassification is used in the get-alerts read path (get_alerts.go) to resolve _from labels at query time.
It's called from the alerts enrichment flow when building the API response.
The get-alerts PR is the next one in the stack.

PeterYurkovich · 2026-06-15T21:01:21Z

+Annotations are intentionally ignored to reduce id churn on documentation-only changes.
+
+## Classification Logic (How component/layer are determined)
+Location: `pkg/alertcomponent/matcher.go`


This code isn't included in this PR, which is fine, but I'm wondering how this will be implemented once it is. Will there be some kind of reconcile loop which watches for PrometheusRules, or will it only trigger an update when one of these API endpoints are called?

No reconcile loop. Classification is triggered only through the API endpoints (PATCH single/bulk).
The matcher (pkg/alertcomponent/matcher.go) runs at read time during alert enrichment.
When GET /api/v1/alerting/alerts is called, the backend computes component/layer from rule labels and the matcher fallback logic, then applies any ARC-based overrides.
The matcher code itself is included in the base branch (main-alerts-management-api) and was introduced in an earlier PR in the stack.

PeterYurkovich · 2026-06-15T21:05:01Z

+			arc.Annotations = map[string]string{}
+		}
+		arc.Annotations[managementlabels.ARCAnnotationAlertRuleIDKey] = alertRuleId
+		if err := k8sClient.AlertRelabelConfigs().Update(ctx, *arc); err != nil {


Not needed for this PR, but is there any desire to have retry functionality occur when trying to update, or should that be something that is strictly driven by the frontend UI (ie. user needs to resubmit a form)

Good question. For now, retry is driven by the frontend/client.
If an ARC update fails (e.g., conflict on resourceVersion), the client should re-fetch and resubmit. Backend retry with conflict resolution is worth considering for bulk operations at scale, but for MVP the client-driven approach is simpler and matches how other management endpoints (severity, drop/restore) work today.

PeterYurkovich · 2026-06-15T21:09:01Z

+the owning `AlertingRule` CR is operator-managed (has operator owner references), the
+backend cannot modify it directly (the operator would reconcile the change back). In
+this case, label updates are applied through an ARC instead. When the `AlertingRule` CR
+is not externally managed, label updates are written directly into the CR. Classification


I'm unclear on why we would want to swap between directly updating the AlertingRule CR and creating/updating an ARC through a single backend endpoint instead of always just always updating through the ARC when this endpoint is called.

I can image a scenario where the frontend has created a user-created and managed AR which has these component and layer labels, and then the user wants to update these labels. It is nice to have these validation layers for these labels specifically, but if a user manually updates the labels for the alert rather than using this endpoint then the validation will be lost.

It seems like it would make more sense to have this "update the ARC vs update the AR" decision happen on the frontend rather than the backend, and then have the update AR logic always perform the validation on these labels if they exist. That way the PATCH requests with the classification will only ever update ARC's and an update to an AR directly will still receive the validation

Good feedback. I've clarified the doc.
The classification endpoint in this PR always uses the ARC path, regardless of whether the AlertingRule CR is operator-managed or not. It never writes directly to AlertingRule CRs.

The paragraph you flagged was describing the behavior of other management endpoints (severity, drop/restore) which may write to unmanaged AlertingRule CRs directly. I've rewritten it to make this distinction clear:

Classification overrides always use the ARC path regardless of AlertingRule management status — this endpoint never writes directly to AlertingRule CRs. (Other management endpoints, such as severity updates, may write to unmanaged AlertingRule CRs directly, but classification is ARC-only.)

This is intentional because classification labels are additive metadata that should not interfere with the rule definition itself, and the ARC path provides a consistent storage model for both platform and user-defined rules.

Add classification labels (component, layer) for alert rules with validation, ARC-based persistence, and bulk update support. Signed-off-by: Shirly Radco <[email protected]> Signed-off-by: João Vilaça <[email protected]> Signed-off-by: Aviv Litman <[email protected]> Co-authored-by: AI Assistant <[email protected]> Co-authored-by: Cursor <[email protected]>

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2026

openshift-ci Bot requested review from PeterYurkovich and zhuje June 8, 2026 14:23

sradco changed the title ~~CNV-80608: management: add alert rule classification system~~ CNV-87529: management: add alert rule classification system Jun 8, 2026

PeterYurkovich reviewed Jun 15, 2026

View reviewed changes

sradco force-pushed the alert-mgmt-restructured-04-classification branch from 8f0202e to 6268b33 Compare June 16, 2026 12:02

		- Core control plane components (renamed to layer `cluster`)
		- Workload/namespace-level alerts (renamed to layer `namespace`)

Conversation

sradco commented Jun 8, 2026

Uh oh!

openshift-ci-robot commented Jun 8, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

openshift-ci-robot commented Jun 8, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sradco commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sradco Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

openshift-ci-robot commented Jun 8, 2026 •

edited by openshift-ci Bot

Loading

openshift-ci-robot commented Jun 8, 2026 •

edited by openshift-ci Bot

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading

sradco Jun 16, 2026 •

edited

Loading