fix: Improve DPU Extension Service image pull error handling by hanyux-nv · Pull Request #3096 · NVIDIA/infra-controller

hanyux-nv · 2026-07-02T17:53:13Z

When a DPU Extension Service deployment uses an invalid or unreachable image reference, the deployment status gets stuck in Pending instead of transitioning to Error. Image pull failures occur at the sandbox level, before any containers are created, and `aggregate_status` returns Pending whenever no containers exist. This fix updates `get_pod_sandbox_status` to also return the sandbox message, so that `aggregate_status` treats SANDBOX_NOTREADY with any non-empty message as Error. The sandbox message is also surfaced in the deployment status detail so the failure reason is visible to the user.

Related issues

nvbug https://nvbugspro.nvidia.com/bug/5953587

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

coderabbitai · 2026-07-02T17:54:56Z

Summary by CodeRabbit

Bug Fixes
- Image pull failures during pod startup are now detected more reliably and surfaced as an error instead of leaving services stuck in pending state.
- Status summaries now include relevant pod error details when available, making deployment issues easier to understand.
- Status handling has been updated so failing sandbox-level pod conditions are reflected correctly in overall service health.

Walkthrough

This PR enhances Kubernetes pod status aggregation to detect image-pull failures at the sandbox level. It introduces marker strings and a matching helper, propagates the sandbox status message through pod inspection, aggregation, and error formatting, and returns DpuExtensionServiceError instead of Pending when an image-pull failure is detected.

Changes

Image-pull failure detection

Layer / File(s)	Summary
Image-pull failure markers and matcher `crates/agent/src/extension_services/k8s_pod_handler.rs`	Adds a constant list of image-pull failure message markers and an `is_image_pull_failure_message` helper for case-insensitive substring matching; `get_pod_sandbox_status` now also returns the pod's `status.message`.
Aggregate status handling of sandbox message `crates/agent/src/extension_services/k8s_pod_handler.rs`	`aggregate_status` accepts an optional pod message and returns `DpuExtensionServiceError` for `SANDBOX_NOTREADY` with an image-pull failure message instead of `Pending`; documentation updated accordingly.
Pod status pipeline wiring `crates/agent/src/extension_services/k8s_pod_handler.rs`	`get_pod_status` captures the optional pod message and forwards it to `aggregate_status` and `build_service_error_message` call sites, including the message in formatted error output.
Unit test updates and new coverage `crates/agent/src/extension_services/k8s_pod_handler.rs`	Existing tests updated for the new `pod_message` parameter; a new test validates `DpuExtensionServiceError`, `Pending`, and `Terminating` outcomes based on message content and deploy expectation.
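To make the matcher in the first row concrete, here is a minimal sketch of case-insensitive substring matching against a marker list. The names `IMAGE_PULL_ERROR_MARKERS` and `is_image_pull_failure_message` come from this PR, but the signature is an assumption and the list is partial: only the four entries quoted in the review thread are shown, and the actual constant may contain more.

```rust
/// Partial, illustrative marker list. Only these four entries are quoted in the
/// review thread; the constant in the PR may hold additional markers.
const IMAGE_PULL_ERROR_MARKERS: &[&str] = &[
    "failed to pull image",
    "failed to resolve reference",
    "manifest unknown",
    "not found",
];

/// Returns true when `message` contains any marker, ignoring case.
/// The signature is assumed; the PR's helper may differ.
fn is_image_pull_failure_message(message: &str) -> bool {
    // Markers are stored in lowercase, so lowering the message once is enough.
    let lowered = message.to_lowercase();
    IMAGE_PULL_ERROR_MARKERS
        .iter()
        .any(|marker| lowered.contains(*marker))
}
```

Lowercasing the message once and keeping the markers lowercase avoids a per-marker allocation; an empty message matches nothing, so it never counts as a pull failure in this sketch.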

Estimated code review effort: 2 (Simple) | ~15 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Handler as get_pod_status
  participant Sandbox as get_pod_sandbox_status
  participant Aggregator as aggregate_status
  participant ErrorBuilder as build_service_error_message

  Handler->>Sandbox: inspect pod sandbox
  Sandbox-->>Handler: state, pod_message
  Handler->>Aggregator: aggregate_status(statuses, state, pod_message)
  alt no containers, SANDBOX_NOTREADY, image-pull failure message
    Aggregator-->>Handler: DpuExtensionServiceError
    Handler->>ErrorBuilder: build_service_error_message(pod_message)
    ErrorBuilder-->>Handler: formatted error string
  else no containers, SANDBOX_NOTREADY, no matching message
    Aggregator-->>Handler: Pending
  end
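The two branches of the diagram can be sketched roughly as follows. This is a simplified stand-in, not the code from the PR: `Status` replaces `rpc::DpuExtensionServiceDeploymentStatus`, the function names and signatures are hypothetical, only the "no containers" path is modeled, and the matcher checks a single marker for brevity.

```rust
/// Simplified stand-in for rpc::DpuExtensionServiceDeploymentStatus.
#[derive(Debug, PartialEq)]
enum Status {
    Pending,
    Error,
}

/// Reduced matcher with a single marker; the PR uses a longer list.
fn looks_like_image_pull_failure(message: &str) -> bool {
    message.to_lowercase().contains("failed to pull image")
}

/// Models only the "no containers yet" branch of the diagram.
/// Hypothetical name and signature.
fn aggregate_without_containers(sandbox_state: &str, pod_message: Option<&str>) -> Status {
    let image_pull_failed = sandbox_state == "SANDBOX_NOTREADY"
        && pod_message.map_or(false, looks_like_image_pull_failure);
    if image_pull_failed {
        Status::Error
    } else {
        Status::Pending
    }
}

/// Includes the sandbox message in the error text when one is present.
/// The wording of the output is invented for illustration.
fn build_error_message(pod_message: Option<&str>) -> String {
    match pod_message {
        Some(msg) if !msg.is_empty() => format!("pod sandbox not ready: {msg}"),
        _ => "pod sandbox not ready".to_string(),
    }
}
```

A sandbox that is not ready but carries no matching message still maps to `Pending`, which matches the `else` branch above.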

Related PRs: None identified.

Suggested labels: bug, k8s, extension-services

Suggested reviewers: None specified.

🐰

A pod that stalls with image woes,
No longer just "Pending", it shows,
A marker, a match,
An ERROR to catch,
While tests confirm how the logic flows.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: improved image pull error handling for DPU Extension Services.
Description check	✅ Passed	The description accurately matches the change set and explains the Pending-to-Error fix, tests, and bug reference.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

crates/agent/src/extension_services/k8s_pod_handler.rs (1)
1581-1676: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider table-driven tests for aggregate_status coverage.

The new test_k8s_pod_handler_aggregate_status_image_pull_failure test (and the adjacent updated tests) enumerates several input/expected-output rows via repeated direct calls and asserts. As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate with scenarios! for fallible operations and value_scenarios! for total operations, implementing the Outcome enum pattern." Since aggregate_status is a total function returning a plain enum value, value_scenarios! would consolidate these cases with per-case labels, making failures easier to pinpoint and additions cheaper.

Not blocking for this PR since it follows the existing style in this file, but worth consolidating this cohort of aggregate_status tests going forward.
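As a rough illustration of the table-driven shape being suggested, the sketch below uses a plain array of labeled cases. It does not use `value_scenarios!` from `carbide-test-support`, whose syntax is not shown in this thread, and `aggregate`, `Status`, and `Case` are hypothetical stand-ins rather than the types in `k8s_pod_handler.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Error,
}

/// Hypothetical stand-in for the function under test, far simpler than the real one.
fn aggregate(sandbox_state: &str, pod_message: Option<&str>) -> Status {
    let pull_failed =
        pod_message.map_or(false, |m| m.to_lowercase().contains("failed to pull image"));
    if sandbox_state == "SANDBOX_NOTREADY" && pull_failed {
        Status::Error
    } else {
        Status::Pending
    }
}

/// One row of the table: a label, the inputs, and the expected value.
struct Case {
    label: &'static str,
    sandbox_state: &'static str,
    pod_message: Option<&'static str>,
    expected: Status,
}

const CASES: &[Case] = &[
    Case { label: "not ready with pull failure", sandbox_state: "SANDBOX_NOTREADY",
           pod_message: Some("Failed to pull image \"example/bad:tag\""), expected: Status::Error },
    Case { label: "not ready without message", sandbox_state: "SANDBOX_NOTREADY",
           pod_message: None, expected: Status::Pending },
    Case { label: "not ready with unrelated message", sandbox_state: "SANDBOX_NOTREADY",
           pod_message: Some("network plugin is not ready"), expected: Status::Pending },
    Case { label: "ready sandbox ignores message", sandbox_state: "SANDBOX_READY",
           pod_message: Some("failed to pull image"), expected: Status::Pending },
];

/// Returns the labels of every failing case so one run reports all mismatches.
fn failing_cases() -> Vec<&'static str> {
    CASES
        .iter()
        .filter(|c| aggregate(c.sandbox_state, c.pod_message) != c.expected)
        .map(|c| c.label)
        .collect()
}
```

Adding a case is a one-row change, and a failure is reported by its label rather than by a line number inside a long test body.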
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/agent/src/extension_services/k8s_pod_handler.rs` around lines 1581 -
1676, The `aggregate_status` tests are written as repeated direct assertions
instead of a table-driven scenario set, making the coverage harder to maintain.
Consolidate the cases in `test_k8s_pod_handler_aggregate_status_all_running`,
`test_k8s_pod_handler_aggregate_status_no_containers`,
`test_k8s_pod_handler_aggregate_status_exited_zero_is_running`, and
`test_k8s_pod_handler_aggregate_status_image_pull_failure` into a
`value_scenarios!`-based table-driven test for
`KubernetesPodServicesHandler::aggregate_status`, using per-case labels and
expected `rpc::DpuExtensionServiceDeploymentStatus` values so new cases are
easier to add and failures are easier to identify.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/agent/src/extension_services/k8s_pod_handler.rs`:
- Around line 63-71: The IMAGE_PULL_ERROR_MARKERS list in k8s_pod_handler.rs is
too broad because the generic “not found” entry can misclassify unrelated
sandbox failures as image-pull terminal errors. Narrow that marker in
IMAGE_PULL_ERROR_MARKERS to a more image-specific string, and keep the matching
logic in the pod sandbox error classification path aligned with the intended
image resolution cases so SANDBOX_NOTREADY messages still fall through to
Pending unless they clearly indicate an image pull failure.
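The concern about the generic "not found" entry can be seen with a small comparison. In this sketch the two marker lists, the `matches_any` helper, and both sample messages are invented for illustration; "image not found" is one possible narrower marker, not necessarily the string the PR should adopt.

```rust
/// Marker set containing the generic entry flagged in the comment above.
const BROAD_MARKERS: &[&str] = &["failed to resolve reference", "not found"];

/// Same set with a hypothetical, more image-specific replacement.
const NARROW_MARKERS: &[&str] = &["failed to resolve reference", "image not found"];

/// Case-insensitive substring match against any marker in `markers`.
fn matches_any(message: &str, markers: &[&str]) -> bool {
    let lowered = message.to_lowercase();
    markers.iter().any(|marker| lowered.contains(*marker))
}
```

With an unrelated sandbox message such as "failed to set up sandbox: cni config file not found", the broad list matches and would turn the status into a terminal error, while the narrow list does not. A pull failure that mentions "failed to resolve reference" is still caught by the narrow list through the other marker.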
ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12f996b7-f875-4a4e-b2ce-b5da921a762c

📥 Commits

Reviewing files that changed from the base of the PR and between 8dc207a and 5a4a14a.

📒 Files selected for processing (1)

crates/agent/src/extension_services/k8s_pod_handler.rs

github-actions · 2026-07-02T19:21:04Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	272	5	27	89	7	144
machine-validation-runner	769	25	209	278	36	221
machine_validation	769	25	209	278	36	221
machine_validation-aarch64	769	25	209	278	36	221
nvmetal-carbide	769	25	209	278	36	221
TOTAL	3354	105	863	1207	151	1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

rwthompsonii · 2026-07-02T19:56:45Z

-                rpc::DpuExtensionServiceDeploymentStatus::DpuExtensionServicePending
+                // A SANDBOX_NOTREADY pod whose message indicates an image-pull failure will never
+                // produce containers - it must be reported as Error rather than left in Pending.
+                if pod_status == "SANDBOX_NOTREADY"


these should be constant strings if we're matching on them
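One way to act on this suggestion is sketched below. The constant and helper names are hypothetical; the two values are the sandbox state names from the CRI `PodSandboxState` enum, of which only `SANDBOX_NOTREADY` appears in the diff above.

```rust
/// Sandbox state names as reported by the container runtime (CRI PodSandboxState).
/// Constant names are hypothetical; the PR may choose different ones.
const SANDBOX_STATE_READY: &str = "SANDBOX_READY";
const SANDBOX_STATE_NOTREADY: &str = "SANDBOX_NOTREADY";

/// Centralizes the comparison so call sites never repeat the literal.
fn is_sandbox_not_ready(pod_status: &str) -> bool {
    pod_status == SANDBOX_STATE_NOTREADY
}
```

With the literal defined once, a typo in a comparison becomes a compile error on an unknown identifier instead of a silent mismatch at runtime.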

hwadekar-nv · 2026-07-02T20:17:14Z

+    "failed to pull image",
+    "failed to resolve reference",
+    "manifest unknown",
+    "not found",


Shouldn't this be "image not found"?

fix: Improve DPU Extension Service image pull error handling

5a4a14a

hanyux-nv requested a review from a team as a code owner July 2, 2026 17:53

coderabbitai Bot reviewed Jul 2, 2026

rwthompsonii reviewed Jul 2, 2026

hwadekar-nv reviewed Jul 2, 2026

fix: Improve DPU Extension Service image pull error handling #3096
hanyux-nv wants to merge 1 commit into NVIDIA:main from hanyux-nv:es_image_pull_fix

3 participants
