OCPBUGS-86506: Fix lso_lvset_provisioned_PV_count metric #629

Open
tsmetana wants to merge 1 commit into openshift:main from tsmetana:86506-lvset-metrics

tsmetana (Member) commented Jun 5, 2026

The lso_lvset_provisioned_PV_count Prometheus metric always reports 0 after the initial provisioning reconcile. The root cause is that the provisioned device count was only computed inside processNewSymlink(), which is never called once all devices already have symlinks.

The fix is to just count all the symlinked devices.
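
To make the failure mode concrete, here is a minimal Go sketch. It is a simplified in-memory model, not the operator's code: the `device` type and the functions `countInNewSymlinkPath` and `countAllSymlinked` are hypothetical names invented for illustration. The first function counts only while it is creating a symlink, so it yields 0 once every device is already symlinked; the second counts every symlinked device on each pass.

```go
package main

import "fmt"

// device is a hypothetical stand-in for a discovered block device.
type device struct {
	name      string
	symlinked bool // true once a symlink for the device exists
}

// countInNewSymlinkPath models the buggy shape: the counter only advances
// while a device is being symlinked, so a steady-state pass yields 0.
func countInNewSymlinkPath(devs []device) int {
	total := 0
	for i := range devs {
		if devs[i].symlinked {
			continue // already provisioned: the counting path is skipped
		}
		devs[i].symlinked = true
		total++
	}
	return total
}

// countAllSymlinked models the fixed shape: count every symlinked device,
// whether or not it was provisioned during this pass.
func countAllSymlinked(devs []device) int {
	total := 0
	for _, d := range devs {
		if d.symlinked {
			total++
		}
	}
	return total
}

func main() {
	devs := []device{{name: "sda"}, {name: "sdb"}}
	fmt.Println("first pass, in-loop count:", countInNewSymlinkPath(devs))
	fmt.Println("second pass, in-loop count:", countInNewSymlinkPath(devs))
	fmt.Println("second pass, full count:", countAllSymlinked(devs))
}
```

In this model the in-loop count is 2 on the first pass and 0 on the second, while the full count stays at 2, which mirrors the metric dropping to 0 after the initial reconcile.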

Summary by CodeRabbit

  • Refactor
    • Streamlined reconciliation logic in the disk provisioning controller for improved efficiency. The system now uses a consolidated method to track provisioned symlinks and identify stale entries, simplifying internal processing without affecting functionality.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 5, 2026
openshift-ci-robot (Contributor) commented

@tsmetana: This pull request references Jira Issue OCPBUGS-86506, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai Bot commented Jun 5, 2026

Walkthrough

The reconciler refactors how it counts already-provisioned symlinks and identifies stale entries. Instead of accumulating counts during the validDevices loop from processNewSymlink results, it consolidates counting into a single getAlreadySymlinked pass after the loop, simplifying the result struct and reducing distributed state tracking.

Changes

Symlink Count Consolidation

| Layer / File(s) | Summary |
|---|---|
| Symlink counting consolidation (pkg/diskmaker/controllers/lvset/reconcile.go) | processNewSymlinkResult is simplified to return only fastRequeue and maxCountReached; accumulators for totalProvisionedPVs and noMatch are removed from the validation loop; and provisioned symlink counts and stale entry paths are computed in a single getAlreadySymlinked call after the loop instead of being accumulated incrementally. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| No-Sensitive-Data-In-Logs | ❌ Error | Line 329-330 logs file system paths (noMatch) containing symlink names with embedded disk IDs and internal device identifiers, which exposes internal infrastructure details. | Remove or redact sensitive path information from logs: instead of logging full paths, log only the count or generic identifiers that don't expose disk IDs and system infrastructure. |
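
One way to act on this resolution is sketched below in Go. It is an illustration only: `redactPaths` is a hypothetical helper, the example path is made up, and replacing each path with a truncated SHA-256 fingerprint is my own assumption about what a "generic identifier" could be. A truncated hash of a guessable path can be brute-forced, so logging only the count is the more conservative choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// redactPaths is a hypothetical helper. It returns the number of paths and,
// for each path, an 8-hex-character fingerprint that lets log lines be
// correlated without printing the path or the disk ID embedded in it.
func redactPaths(paths []string) (int, []string) {
	ids := make([]string, 0, len(paths))
	for _, p := range paths {
		sum := sha256.Sum256([]byte(p))
		ids = append(ids, hex.EncodeToString(sum[:4]))
	}
	return len(paths), ids
}

func main() {
	// Made-up example path; real symlink names are not taken from the PR.
	noMatch := []string{"/mnt/local-storage/example-sc/wwn-0x0000000000000001"}
	count, ids := redactPaths(noMatch)
	fmt.Printf("stale symlinks found: count=%d ids=%v\n", count, ids)
}
```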
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically references the bug fix (OCPBUGS-86506) and the specific metric being fixed (lso_lvset_provisioned_PV_count), which aligns with the core change described in the PR objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test code was added or modified in this PR. The repository uses standard Go testing, not Ginkgo. Only production code in reconcile.go was changed. |
| Test Structure And Quality | ✅ Passed | No Ginkgo tests exist in the PR scope. The modified package uses standard Go testing.T tests only, making the Ginkgo test quality check not applicable to this PR. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. All tests use Go's standard testing framework, not Ginkgo (no It/Describe/Context/When declarations found). |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies only pkg/diskmaker/controllers/lvset/reconcile.go, a production controller file with no Ginkgo e2e tests. No new test code was added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces no topology-breaking scheduling constraints. DaemonSets use standard patterns without restrictive nodeSelector, affinity, or topology constraints. Works on all topologies. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only controller-level code (Reconcile method), not process-level code. OTE check applies only to main(), TestMain(), suite setup—not controller methods. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR contains no new Ginkgo e2e tests. Single modified file (reconcile.go) is a controller implementation file, not a test file, so IPv6/disconnected network check is not applicable. |
| No-Weak-Crypto | ✅ Passed | No weak cryptography detected. The PR modifies metric counting logic in a reconciler, using only standard Go and Kubernetes libraries with no crypto operations. |
| Container-Privileges | ✅ Passed | PR only modifies Go reconciliation code (reconcile.go), not container manifests. No privilege-escalation settings were added or modified. |

@openshift-ci openshift-ci Bot requested review from gnufied and rhrmo June 5, 2026 07:45
openshift-ci Bot (Contributor) commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tsmetana

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2026
tsmetana (Member, Author) commented Jun 5, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 5, 2026
openshift-ci-robot (Contributor) commented

@tsmetana: This pull request references Jira Issue OCPBUGS-86506, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci-robot (Contributor) commented

@tsmetana: This pull request references Jira Issue OCPBUGS-86506, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/diskmaker/controllers/lvset/reconcile.go`:
- Around line 303-311: The call to getAlreadySymlinked can fail and currently we
log the error but still publish totalProvisionedPVs (which may be an invalid
zero); change the flow so that after calling getAlreadySymlinked you check err
and, if non-nil, log the error and skip updating the metric (do not call
localmetrics.SetLVSProvisionedPVMetric) or return/propagate the error
instead—update the block around getAlreadySymlinked/totalProvisionedPVs to only
call localmetrics.SetLVSProvisionedPVMetric(nodeName, storageClassName,
totalProvisionedPVs) when err == nil.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6215edd6-8842-41fc-860e-971dd1bd77ba

📥 Commits

Reviewing files that changed from the base of the PR and between 2cb6d5a and 9956526.

📒 Files selected for processing (1)
  • pkg/diskmaker/controllers/lvset/reconcile.go

Comment on lines +303 to 311
totalProvisionedPVs, noMatch, err := getAlreadySymlinked(symLinkDir, blockDevices)
if err != nil {
    klog.ErrorS(err, "error counting provisioned symlinks")
}

klog.InfoS("total devices provisioned", "storageClass", storageClassName, "count", totalProvisionedPVs)

// update metrics for total persistent volumes provisioned
localmetrics.SetLVSProvisionedPVMetric(nodeName, storageClassName, totalProvisionedPVs)

coderabbitai Bot commented Jun 5, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle symlink-count failures before updating metrics.

On Line 303, getAlreadySymlinked errors are only logged, but Line 311 still publishes totalProvisionedPVs (which can be zero from the failure path). This can reintroduce false 0 values for lso_lvset_provisioned_PV_count.

Suggested fix
  totalProvisionedPVs, noMatch, err := getAlreadySymlinked(symLinkDir, blockDevices)
  if err != nil {
-     klog.ErrorS(err, "error counting provisioned symlinks")
+     return ctrl.Result{}, fmt.Errorf("error counting provisioned symlinks: %w", err)
  }

  klog.InfoS("total devices provisioned", "storageClass", storageClassName, "count", totalProvisionedPVs)

  // update metrics for total persistent volumes provisioned
  localmetrics.SetLVSProvisionedPVMetric(nodeName, storageClassName, totalProvisionedPVs)

As per coding guidelines "Never ignore error returns".

tsmetana (Member, Author) commented Jun 8, 2026

If the getAlreadySymlinked function returns an error, the value of totalProvisionedPVs cannot be trusted anyway, and neither can the metric itself. In fact, setting it to zero would make any problem more visible and increase the chance that somebody looks into it.

@tsmetana That's a fair point — a metric that drops to 0 is arguably a better observable signal than one that silently retains a stale value, and it's consistent with an "alert on anomaly" philosophy. The error is still surfaced via klog.ErrorS, so it isn't truly ignored. I accept the reasoning and withdraw the suggestion.

openshift-ci Bot (Contributor) commented Jun 5, 2026

@tsmetana: all tests passed!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rhrmo (Contributor) commented Jun 17, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026