fix: Improve DPU Extension Service image pull error handling#3096
fix: Improve DPU Extension Service image pull error handling#3096hanyux-nv wants to merge 1 commit into
Conversation
Summary by CodeRabbit
WalkthroughThis PR enhances Kubernetes pod status aggregation to detect image-pull failures at the sandbox level. It introduces marker strings and a matching helper, propagates the sandbox status message through pod inspection, aggregation, and error formatting, and returns ChangesImage-pull failure detection
Estimated code review effort: 2 (Simple) | ~15 minutes Sequence Diagram(s)sequenceDiagram
participant Handler as get_pod_status
participant Sandbox as get_pod_sandbox_status
participant Aggregator as aggregate_status
participant ErrorBuilder as build_service_error_message
Handler->>Sandbox: inspect pod sandbox
Sandbox-->>Handler: state, pod_message
Handler->>Aggregator: aggregate_status(statuses, state, pod_message)
alt no containers, SANDBOX_NOTREADY, image-pull failure message
Aggregator-->>Handler: DpuExtensionServiceError
Handler->>ErrorBuilder: build_service_error_message(pod_message)
ErrorBuilder-->>Handler: formatted error string
else no containers, SANDBOX_NOTREADY, no matching message
Aggregator-->>Handler: Pending
end
Related PRs: None identified. Suggested labels: bug, k8s, extension-services Suggested reviewers: None specified. 🐰 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/agent/src/extension_services/k8s_pod_handler.rs (1)
1581-1676: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider table-driven tests for
aggregate_statuscoverage.The new
test_k8s_pod_handler_aggregate_status_image_pull_failuretest (and the adjacent updated tests) enumerates several input/expected-output rows via repeated direct calls and asserts. As per coding guidelines, "Prefer table-driven tests using thecarbide-test-supportcrate withscenarios!for fallible operations andvalue_scenarios!for total operations, implementing theOutcomeenum pattern." Sinceaggregate_statusis a total function returning a plain enum value,value_scenarios!would consolidate these cases with per-case labels, making failures easier to pinpoint and additions cheaper.Not blocking for this PR since it follows the existing style in this file, but worth consolidating this cohort of
aggregate_statustests going forward.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/agent/src/extension_services/k8s_pod_handler.rs` around lines 1581 - 1676, The `aggregate_status` tests are written as repeated direct assertions instead of a table-driven scenario set, making the coverage harder to maintain. Consolidate the cases in `test_k8s_pod_handler_aggregate_status_all_running`, `test_k8s_pod_handler_aggregate_status_no_containers`, `test_k8s_pod_handler_aggregate_status_exited_zero_is_running`, and `test_k8s_pod_handler_aggregate_status_image_pull_failure` into a `value_scenarios!`-based table-driven test for `KubernetesPodServicesHandler::aggregate_status`, using per-case labels and expected `rpc::DpuExtensionServiceDeploymentStatus` values so new cases are easier to add and failures are easier to identify.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/agent/src/extension_services/k8s_pod_handler.rs`:
- Around line 63-71: The IMAGE_PULL_ERROR_MARKERS list in k8s_pod_handler.rs is
too broad because the generic “not found” entry can misclassify unrelated
sandbox failures as image-pull terminal errors. Narrow that marker in
IMAGE_PULL_ERROR_MARKERS to a more image-specific string, and keep the matching
logic in the pod sandbox error classification path aligned with the intended
image resolution cases so SANDBOX_NOTREADY messages still fall through to
Pending unless they clearly indicate an image pull failure.
---
Nitpick comments:
In `@crates/agent/src/extension_services/k8s_pod_handler.rs`:
- Around line 1581-1676: The `aggregate_status` tests are written as repeated
direct assertions instead of a table-driven scenario set, making the coverage
harder to maintain. Consolidate the cases in
`test_k8s_pod_handler_aggregate_status_all_running`,
`test_k8s_pod_handler_aggregate_status_no_containers`,
`test_k8s_pod_handler_aggregate_status_exited_zero_is_running`, and
`test_k8s_pod_handler_aggregate_status_image_pull_failure` into a
`value_scenarios!`-based table-driven test for
`KubernetesPodServicesHandler::aggregate_status`, using per-case labels and
expected `rpc::DpuExtensionServiceDeploymentStatus` values so new cases are
easier to add and failures are easier to identify.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 12f996b7-f875-4a4e-b2ce-b5da921a762c
📒 Files selected for processing (1)
crates/agent/src/extension_services/k8s_pod_handler.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
| rpc::DpuExtensionServiceDeploymentStatus::DpuExtensionServicePending | ||
| // A SANDBOX_NOTREADY pod whose message indicates an image-pull failure will never | ||
| // produce containers - it must be reported as Error rather than left in Pending. | ||
| if pod_status == "SANDBOX_NOTREADY" |
There was a problem hiding this comment.
these should be constant strings if we're matching on them
| "failed to pull image", | ||
| "failed to resolve reference", | ||
| "manifest unknown", | ||
| "not found", |
There was a problem hiding this comment.
Shouldn't be here image not found?
When a DPU Extension Service deployment used an invalid or unreachable image reference, the deployment status would get stuck in
Pendinginstead of transitioning toError. This is because the image pull failures occur at the sandbox level before any containers are created, and theaggregate_statusfunction returnsPendingwhen no containers are created. This fix updatesget_pod_sandbox_statusto also return the sandbox message, soaggregate_statuswill treatSANDBOX_NOTREADYwith any non-empty message asError. The sandbox message is also surfaced in the deployment status detail so the failure reason is visible to the user.Related issues
nvbug https://nvbugspro.nvidia.com/bug/5953587
Type of Change
Breaking Changes
Testing
Additional Notes