fix: Improve DPU Extension Service image pull error handling #3096

Open
hanyux-nv wants to merge 1 commit into NVIDIA:main from hanyux-nv:es_image_pull_fix

Conversation

@hanyux-nv

Contributor

When a DPU Extension Service deployment used an invalid or unreachable image reference, the deployment status got stuck in Pending instead of transitioning to Error. Image pull failures occur at the sandbox level, before any containers are created, and aggregate_status returns Pending when no containers exist. This fix updates get_pod_sandbox_status to also return the sandbox message, so that aggregate_status treats SANDBOX_NOTREADY with any non-empty message as Error. The sandbox message is also surfaced in the deployment status detail so that the failure reason is visible to the user.
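To make the described behavior concrete, here is a minimal sketch of the decision as this description states it. It is not the code in k8s_pod_handler.rs: the `DeploymentStatus` enum is a simplified stand-in for `rpc::DpuExtensionServiceDeploymentStatus`, the function signature is assumed, and per-container aggregation is collapsed to a count. The review walkthrough further down describes a narrower, marker-based check, which this sketch does not model.

```rust
/// Simplified stand-in for `rpc::DpuExtensionServiceDeploymentStatus` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeploymentStatus {
    Pending,
    Running,
    Error,
}

/// Hypothetical, reduced `aggregate_status`. The real function inspects
/// per-container state; here that is collapsed to a container count.
pub fn aggregate_status(
    container_count: usize,
    sandbox_state: &str,
    sandbox_message: Option<&str>,
) -> DeploymentStatus {
    if container_count > 0 {
        // Per-container aggregation is elided in this sketch.
        return DeploymentStatus::Running;
    }
    // No containers exist yet. Before the fix, this case always meant Pending.
    let has_message = sandbox_message.map_or(false, |m| !m.trim().is_empty());
    if sandbox_state == "SANDBOX_NOTREADY" && has_message {
        DeploymentStatus::Error
    } else {
        DeploymentStatus::Pending
    }
}
```

With no containers, a `SANDBOX_NOTREADY` sandbox that carries a message maps to `Error`, while the same state without a message stays `Pending`.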

Related issues

nvbug https://nvbugspro.nvidia.com/bug/5953587

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hanyux-nv hanyux-nv requested a review from a team as a code owner July 2, 2026 17:53
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Contributor

Summary by CodeRabbit

  • Bug Fixes
    • Image pull failures during pod startup are now detected more reliably and surfaced as an error instead of leaving services stuck in pending state.
    • Status summaries now include relevant pod error details when available, making deployment issues easier to understand.
    • Status handling has been updated so failing sandbox-level pod conditions are reflected correctly in overall service health.

Walkthrough

This PR enhances Kubernetes pod status aggregation to detect image-pull failures at the sandbox level. It introduces marker strings and a matching helper, propagates the sandbox status message through pod inspection, aggregation, and error formatting, and returns DpuExtensionServiceError instead of Pending when an image-pull failure is detected.

Changes

Image-pull failure detection

| Layer / File(s) | Summary |
|---|---|
| Image-pull failure markers and matcher (`crates/agent/src/extension_services/k8s_pod_handler.rs`) | Adds a constant list of image-pull failure message markers and an `is_image_pull_failure_message` helper for case-insensitive substring matching; `get_pod_sandbox_status` now also returns the pod's `status.message`. |
| Aggregate status handling of sandbox message (`crates/agent/src/extension_services/k8s_pod_handler.rs`) | `aggregate_status` accepts an optional pod message and returns `DpuExtensionServiceError` for `SANDBOX_NOTREADY` with an image-pull failure message instead of `Pending`; documentation updated accordingly. |
| Pod status pipeline wiring (`crates/agent/src/extension_services/k8s_pod_handler.rs`) | `get_pod_status` captures the optional pod message and forwards it to the `aggregate_status` and `build_service_error_message` call sites, including the message in formatted error output. |
| Unit test updates and new coverage (`crates/agent/src/extension_services/k8s_pod_handler.rs`) | Existing tests updated for the new `pod_message` parameter; a new test validates `DpuExtensionServiceError`, `Pending`, and `Terminating` outcomes based on message content and deploy expectation. |
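The marker list and matcher summarized above could look roughly like the following sketch. The names `IMAGE_PULL_ERROR_MARKERS` and `is_image_pull_failure_message` come from this page; the four markers are the ones quoted in the review thread and may not be the complete list, and the function body is an assumption about how case-insensitive substring matching might be done.

```rust
/// Markers quoted in the review thread; the real constant may contain more entries.
const IMAGE_PULL_ERROR_MARKERS: &[&str] = &[
    "failed to pull image",
    "failed to resolve reference",
    "manifest unknown",
    "not found",
];

/// Assumed implementation: lowercase the message once, then look for any
/// marker as a substring. Markers are stored in lowercase for this to work.
fn is_image_pull_failure_message(message: &str) -> bool {
    let lowered = message.to_lowercase();
    IMAGE_PULL_ERROR_MARKERS
        .iter()
        .any(|marker| lowered.contains(*marker))
}
```

Lowercasing the input rather than each marker keeps the per-call cost to a single allocation, at the price of requiring every marker to be written in lowercase.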

Estimated code review effort: 2 (Simple) | ~15 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Handler as get_pod_status
  participant Sandbox as get_pod_sandbox_status
  participant Aggregator as aggregate_status
  participant ErrorBuilder as build_service_error_message

  Handler->>Sandbox: inspect pod sandbox
  Sandbox-->>Handler: state, pod_message
  Handler->>Aggregator: aggregate_status(statuses, state, pod_message)
  alt no containers, SANDBOX_NOTREADY, image-pull failure message
    Aggregator-->>Handler: DpuExtensionServiceError
    Handler->>ErrorBuilder: build_service_error_message(pod_message)
    ErrorBuilder-->>Handler: formatted error string
  else no containers, SANDBOX_NOTREADY, no matching message
    Aggregator-->>Handler: Pending
  end
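The flow in the diagram can be sketched in Rust as follows. Only the function names are taken from this page; every signature, the `SandboxStatus` struct, the reduced `DeploymentStatus` enum, the two-marker list, and the error string format are assumptions made for illustration.

```rust
/// Simplified stand-in for `rpc::DpuExtensionServiceDeploymentStatus` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeploymentStatus {
    Pending,
    Running,
    Error,
}

/// Hypothetical shape of what `get_pod_sandbox_status` returns.
struct SandboxStatus {
    state: String,
    message: Option<String>,
}

/// Reduced matcher with an illustrative two-entry marker list.
fn is_image_pull_failure_message(message: &str) -> bool {
    let lowered = message.to_lowercase();
    ["failed to pull image", "manifest unknown"]
        .iter()
        .any(|marker| lowered.contains(*marker))
}

fn aggregate_status(
    container_states: &[&str],
    sandbox_state: &str,
    pod_message: Option<&str>,
) -> DeploymentStatus {
    if !container_states.is_empty() {
        // Per-container aggregation is elided in this sketch.
        return DeploymentStatus::Running;
    }
    let pull_failed = sandbox_state == "SANDBOX_NOTREADY"
        && pod_message.map_or(false, is_image_pull_failure_message);
    if pull_failed {
        DeploymentStatus::Error
    } else {
        DeploymentStatus::Pending
    }
}

/// Assumed format; the real function's output is not shown on this page.
fn build_service_error_message(service: &str, pod_message: Option<&str>) -> String {
    match pod_message {
        Some(msg) => format!("service {service} failed: {msg}"),
        None => format!("service {service} failed"),
    }
}

/// Wires the pieces together: sandbox message in, status plus optional detail out.
fn get_pod_status(
    service: &str,
    sandbox: &SandboxStatus,
    container_states: &[&str],
) -> (DeploymentStatus, Option<String>) {
    let pod_message = sandbox.message.as_deref();
    let status = aggregate_status(container_states, &sandbox.state, pod_message);
    let detail = (status == DeploymentStatus::Error)
        .then(|| build_service_error_message(service, pod_message));
    (status, detail)
}
```

The sandbox message is read once and passed by reference to both the aggregator and the error builder, which mirrors the two arrows leaving `get_pod_status` in the diagram.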

Related PRs: None identified.

Suggested labels: bug, k8s, extension-services

Suggested reviewers: None specified.

🐰

A pod that stalls with image woes,
No longer just "Pending", it shows,
A marker, a match,
An ERROR to catch,
While tests confirm how the logic flows.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: improved image pull error handling for DPU Extension Services. |
| Description check | ✅ Passed | The description accurately matches the change set and explains the Pending-to-Error fix, tests, and bug reference. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/agent/src/extension_services/k8s_pod_handler.rs (1)

1581-1676: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider table-driven tests for aggregate_status coverage.

The new test_k8s_pod_handler_aggregate_status_image_pull_failure test (and the adjacent updated tests) enumerates several input/expected-output rows via repeated direct calls and asserts. As per coding guidelines, "Prefer table-driven tests using the carbide-test-support crate with scenarios! for fallible operations and value_scenarios! for total operations, implementing the Outcome enum pattern." Since aggregate_status is a total function returning a plain enum value, value_scenarios! would consolidate these cases with per-case labels, making failures easier to pinpoint and additions cheaper.

Not blocking for this PR since it follows the existing style in this file, but worth consolidating this cohort of aggregate_status tests going forward.
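As a rough illustration of the table-driven shape suggested here, the sketch below uses a plain slice of cases rather than the `value_scenarios!` macro from carbide-test-support, whose syntax is not shown on this page. The reduced `aggregate_status`, the `DeploymentStatus` enum, the `Case` struct, and the case labels are all hypothetical.

```rust
/// Simplified stand-in for `rpc::DpuExtensionServiceDeploymentStatus` (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeploymentStatus {
    Pending,
    Running,
    Error,
}

/// Reduced, hypothetical version of the function under test.
fn aggregate_status(
    container_states: &[&str],
    sandbox_state: &str,
    pod_message: Option<&str>,
) -> DeploymentStatus {
    if !container_states.is_empty() {
        return DeploymentStatus::Running;
    }
    let pull_failed = sandbox_state == "SANDBOX_NOTREADY"
        && pod_message.map_or(false, |m| m.to_lowercase().contains("failed to pull image"));
    if pull_failed {
        DeploymentStatus::Error
    } else {
        DeploymentStatus::Pending
    }
}

/// One labeled row of the table: inputs plus the expected status.
struct Case {
    label: &'static str,
    containers: &'static [&'static str],
    sandbox_state: &'static str,
    pod_message: Option<&'static str>,
    expected: DeploymentStatus,
}

const CASES: &[Case] = &[
    Case { label: "all running", containers: &["CONTAINER_RUNNING"], sandbox_state: "SANDBOX_READY", pod_message: None, expected: DeploymentStatus::Running },
    Case { label: "no containers, no message", containers: &[], sandbox_state: "SANDBOX_NOTREADY", pod_message: None, expected: DeploymentStatus::Pending },
    Case { label: "no containers, unrelated message", containers: &[], sandbox_state: "SANDBOX_NOTREADY", pod_message: Some("waiting for network"), expected: DeploymentStatus::Pending },
    Case { label: "no containers, image pull failure", containers: &[], sandbox_state: "SANDBOX_NOTREADY", pod_message: Some("Failed to pull image \"bad:tag\""), expected: DeploymentStatus::Error },
];

/// Returns the labels of every case whose actual status differs from the expected one.
fn failing_cases() -> Vec<&'static str> {
    CASES
        .iter()
        .filter(|c| aggregate_status(c.containers, c.sandbox_state, c.pod_message) != c.expected)
        .map(|c| c.label)
        .collect()
}
```

Inside a `#[test]` function, asserting that `failing_cases()` is empty reports every failing label at once, and adding a scenario is a one-line change to `CASES`.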

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/agent/src/extension_services/k8s_pod_handler.rs` around lines 1581 -
1676, The `aggregate_status` tests are written as repeated direct assertions
instead of a table-driven scenario set, making the coverage harder to maintain.
Consolidate the cases in `test_k8s_pod_handler_aggregate_status_all_running`,
`test_k8s_pod_handler_aggregate_status_no_containers`,
`test_k8s_pod_handler_aggregate_status_exited_zero_is_running`, and
`test_k8s_pod_handler_aggregate_status_image_pull_failure` into a
`value_scenarios!`-based table-driven test for
`KubernetesPodServicesHandler::aggregate_status`, using per-case labels and
expected `rpc::DpuExtensionServiceDeploymentStatus` values so new cases are
easier to add and failures are easier to identify.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/agent/src/extension_services/k8s_pod_handler.rs`:
- Around line 63-71: The IMAGE_PULL_ERROR_MARKERS list in k8s_pod_handler.rs is
too broad because the generic “not found” entry can misclassify unrelated
sandbox failures as image-pull terminal errors. Narrow that marker in
IMAGE_PULL_ERROR_MARKERS to a more image-specific string, and keep the matching
logic in the pod sandbox error classification path aligned with the intended
image resolution cases so SANDBOX_NOTREADY messages still fall through to
Pending unless they clearly indicate an image pull failure.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12f996b7-f875-4a4e-b2ce-b5da921a762c

📥 Commits

Reviewing files that changed from the base of the PR and between 8dc207a and 5a4a14a.

📒 Files selected for processing (1)
  • crates/agent/src/extension_services/k8s_pod_handler.rs

Comment thread crates/agent/src/extension_services/k8s_pod_handler.rs
@github-actions

github-actions Bot commented Jul 2, 2026


🔍 Container Scan Summary

| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 272 | 5 | 27 | 89 | 7 | 144 |
| machine-validation-runner | 769 | 25 | 209 | 278 | 36 | 221 |
| machine_validation | 769 | 25 | 209 | 278 | 36 | 221 |
| machine_validation-aarch64 | 769 | 25 | 209 | 278 | 36 | 221 |
| nvmetal-carbide | 769 | 25 | 209 | 278 | 36 | 221 |
| TOTAL | 3354 | 105 | 863 | 1207 | 151 | 1028 |

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

rpc::DpuExtensionServiceDeploymentStatus::DpuExtensionServicePending
// A SANDBOX_NOTREADY pod whose message indicates an image-pull failure will never
// produce containers - it must be reported as Error rather than left in Pending.
if pod_status == "SANDBOX_NOTREADY"


these should be constant strings if we're matching on them
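One way to act on this suggestion, sketched under assumptions: the constant name and the helper function below are hypothetical, and only the `"SANDBOX_NOTREADY"` literal itself comes from the diff excerpt.

```rust
/// Sandbox state string compared against in the diff excerpt.
/// The constant name is an assumption for this sketch.
const SANDBOX_NOTREADY: &str = "SANDBOX_NOTREADY";

/// Hypothetical helper so that call sites never repeat the literal.
fn is_sandbox_not_ready(pod_status: &str) -> bool {
    pod_status == SANDBOX_NOTREADY
}
```

Naming the string once means a typo in the state name becomes a compile error at the call site instead of a comparison that silently never matches.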

"failed to pull image",
"failed to resolve reference",
"manifest unknown",
"not found",


Shouldn't this be "image not found"?
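The concern with the bare "not found" marker can be shown with a small comparison. Everything here is illustrative: both sample messages are made up for the sketch rather than taken from real runtime logs, the constant and function names are hypothetical, and whether "image not found" is the right replacement depends on what the container runtime actually emits.

```rust
/// Marker list as quoted in the review thread, including the generic entry.
const BROAD_MARKERS: &[&str] = &[
    "failed to pull image",
    "failed to resolve reference",
    "manifest unknown",
    "not found",
];

/// Same list with the generic entry narrowed as the reviewer proposes.
const NARROW_MARKERS: &[&str] = &[
    "failed to pull image",
    "failed to resolve reference",
    "manifest unknown",
    "image not found",
];

/// Made-up sandbox message that has nothing to do with image pulls.
const UNRELATED_MESSAGE: &str = "network namespace for sandbox not found";

/// Made-up image resolution failure message.
const PULL_MESSAGE: &str = "failed to resolve reference \"registry.example/app:v1\": not found";

/// Case-insensitive substring match against a marker list.
fn matches_any(message: &str, markers: &[&str]) -> bool {
    let lowered = message.to_lowercase();
    markers.iter().any(|marker| lowered.contains(*marker))
}
```

With the broad list, `UNRELATED_MESSAGE` is classified as an image-pull failure; with the narrow list it is not, while `PULL_MESSAGE` still matches through the "failed to resolve reference" marker.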


3 participants