feat: Dpf: Support for multiple DpuDeployments to handle BF3, BF4 and other deployment strategy #3084
feat: Dpf: Support for multiple DpuDeployments to handle BF3, BF4 and other deployment strategy #3084abvarshney-nv wants to merge 3 commits into
Conversation
Adds config structure and initialization support for multiple named DPUDeployments. Each deployment (e.g. bf4_generic) has its own BFB, DPUFlavor, DPUDeployment CR, node selector label, and mandatory services. - Add DpfDeploymentConfig + DpfDeploymentsConfig to DpfConfig; the existing top-level fields remain as the BF3 (default) deployment - Add node_labels_for_deployment(name) to ResourceLabeler trait so build_deployment looks up per-deployment node selector labels - Extend CarbideDPFLabeler with a deployment_node_labels registry (BTreeMap<deployment_name, labels>) and with_deployment_node_labels() builder method - setup.rs builds the labeler with labels for all configured deployments and calls create_initialization_objects for bf4_generic if present State machine changes (setting deployment_name on DpuNodeInfo/ DpuDeviceInfo at registration time) are a follow-up. Closes NVIDIA#3056 Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…r deployment models
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (10)
🚧 Files skipped from review as they are similar to previous changes (10)
Summary by CodeRabbit
WalkthroughThis PR adds deployment-type-aware DPF configuration, SDK initialization, and label verification. ChangesDeployment-type-aware DPF provisioning and labeling
Estimated code review effort: 4 (Complex) | ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
crates/machine-controller/src/handler/dpf.rs (1)
205-232: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winUse one host-level deployment type consistently.
Line 232 registers the host’s
DpuNodeInfowith the primary DPU’s deployment type, but Lines 494-496 verify that same host-level node using whicheverdpu_snapshotis currently being reconciled. If a host has mixed or transiently inconsistent DPU classifications, a correctly labeled DPUNode can be marked stale. Resolve and validate a single deployment type for the host, then reuse it for both registration and label verification.As per path instructions, controller logic should be reviewed for reconciliation correctness, idempotency, timeout/cancellation behavior, state-machine transitions, and safe recovery from partial failures.
Also applies to: 494-496
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler/dpf.rs` around lines 205 - 232, The host-level DpuNodeInfo registration and label verification are using different DPU snapshots, which can cause a correctly labeled node to be treated as stale. In dpf.rs, resolve a single host deployment type once from the primary DPU (or a validated host-wide source) in the reconciliation flow around DpuNodeInfo creation, then reuse that same value both when registering the node and when verifying labels in the later check path near the stale-label logic. Make the deployment type selection explicit and consistent across the relevant dpf_sdk and state reconciliation code paths.Source: Path instructions
crates/dpf/src/sdk.rs (2)
2282-2303: 🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
TestLabelerwasn't updated for the newnode_labels_for_deployment_typehook — this breaks two tests.
TestLabeleroverridesdevice_labels,node_labels, andnode_context_labels, but not the newnode_labels_for_deployment_type. It therefore falls back to the trait default, which returnsOk(BTreeMap::new()).This has two concrete downstream effects:
test_dpu_node_labels(Line 2371-2395):register_dpu_nodenow sources labels vianode_labels_for_deployment_type(empty forTestLabeler) +node_context_labels(also empty), sometadata.labelsbecomesNone. The test'snode.metadata.labels.as_ref().unwrap()will panic.verify_node_labels_against_seeded_node(Line 3155-3292):required_labelsis computed via the same empty default, sorequired_labels.iter().all(...)is vacuouslytruefor every row — theYields(false)cases ("stale labels", "no labels field at all", "wrong value") no longer exercise the real comparison logic; they'd pass for the wrong reason (or the earlier scenario collides and fails, depending on evaluation order).Both are fixed by giving
TestLabelera matching override.🐛 Proposed fix
impl ResourceLabeler for TestLabeler { fn device_labels(&self, info: &DpuDeviceInfo) -> BTreeMap<String, String> { BTreeMap::from([ ("test/device".to_string(), "true".to_string()), ("test/host-bmc-ip".to_string(), info.host_bmc_ip.to_string()), ( "test/dpu-machine-id".to_string(), info.dpu_machine_id.clone(), ), ]) } fn node_labels(&self) -> BTreeMap<String, String> { BTreeMap::from([("test/node".to_string(), "true".to_string())]) } + fn node_labels_for_deployment_type( + &self, + _deployment_type: DpuDeploymentType, + ) -> Result<BTreeMap<String, String>, DpfError> { + Ok(self.node_labels()) + } + fn node_context_labels(&self, _info: &DpuNodeInfo) -> BTreeMap<String, String> { BTreeMap::new() } }Also applies to: 2371-2395, 3155-3292
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/dpf/src/sdk.rs` around lines 2282 - 2303, Update TestLabeler to override node_labels_for_deployment_type with the same test node label it previously provided via node_labels, so register_dpu_node and the label verification tests keep seeing non-empty required labels. Keep the existing device_labels and node_context_labels behavior unchanged, and make the new override return the deployment-type-independent node label expected by test_dpu_node_labels and verify_node_labels_against_seeded_node.
1265-1404: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftRemove deployment-specific node labels during deletion In
crates/dpf/src/sdk.rs,node_label_removal_patch()still clears onlynode_labels(), while registration now appliesnode_labels_for_deployment_type(info.deployment_type). For labelers likeCarbideDPFLabeler, BF4 nodes can keep their deployment selector label afterdelete_dpu_node()/force_delete_host(). Threaddeployment_typeinto the delete path, or derive it before patching, so cleanup matches creation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/dpf/src/sdk.rs` around lines 1265 - 1404, The delete path in DpfSdk::delete_dpu_node is only removing the generic node labels, but register_dpu_node now applies deployment-specific labels via node_labels_for_deployment_type(info.deployment_type). Update the cleanup flow so it knows the DpuDeploymentType before building the patch, and use that when calling node_label_removal_patch (or equivalent) so labels like the BF4 selector are removed consistently. Make sure any higher-level delete helper such as force_delete_host passes the deployment type through to this path.
🧹 Nitpick comments (2)
crates/machine-controller/src/dpf.rs (1)
177-192: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winSeed the BF3 labels in
new()to keep the labeler hard to misuse.
CarbideDPFLabeler::new(...)now constructs a labeler wherenode_labels_for_deployment_type(DpuDeploymentType::Bf3)fails unless every caller rememberswith_deployment_type_labels(...). Preserve the single-deployment default in the constructor and let the builder override/add configured deployments.Suggested refactor
pub fn new(node_label_key: String) -> Self { + let bf3_labels = BTreeMap::from([ + (node_label_key.clone(), "true".to_string()), + ( + "feature.node.kubernetes.io/dpu-enabled".to_string(), + "true".to_string(), + ), + ]); + Self { node_label_key, - deployment_type_labels: BTreeMap::new(), + deployment_type_labels: BTreeMap::from([(DpuDeploymentType::Bf3, bf3_labels)]), } }As per coding guidelines, prefer designs that are hard to misuse.
Also applies to: 225-237
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/dpf.rs` around lines 177 - 192, `CarbideDPFLabeler::new` currently leaves `deployment_type_labels` empty, which makes `node_labels_for_deployment_type(DpuDeploymentType::Bf3)` fail unless every caller remembers to call `with_deployment_type_labels`. Seed the BF3 default labels in the constructor so the labeler works out of the box, and keep `with_deployment_type_labels` as an override path that can replace or extend the configured deployment labels.Source: Coding guidelines
crates/api-core/src/cfg/file.rs (1)
1166-1189: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winConsider validating deployment-specific config to prevent silent misrouting.
DpfDeploymentConfighas no validation ensuringnode_label_key(ordeployment_name/flavor_name) is distinct from the top-levelDpfConfigvalues or non-empty. Per the referencedbuild_deployment_type_labelsinsetup.rs, abf4_generic.node_label_keyequal to the top-leveldpf.node_label_keywould produce identical BF3/BF4Generic label sets, silently defeating deployment-type routing rather than failing fast at startup.Consider adding a small validation step (e.g., in config loading or
initialize_dpf_sdk) that rejects overlapping label keys/deployment names across configured deployments.🛡️ Example validation sketch
impl DpfConfig { fn validate_deployments(&self) -> Result<(), String> { if let Some(bf4) = &self.deployments.bf4_generic { if bf4.node_label_key == self.node_label_key { return Err(format!( "dpf.deployments.bf4_generic.node_label_key must differ from dpf.node_label_key ({})", self.node_label_key )); } } Ok(()) } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/cfg/file.rs` around lines 1166 - 1189, Add validation for deployment-specific DPF config so overlapping values fail fast instead of silently reusing routing labels. In the DpfConfig/DpfDeploymentsConfig path, validate DpfDeploymentConfig fields like node_label_key, deployment_name, and flavor_name against the top-level DpfConfig equivalents and reject empty or duplicate values. Wire this into the config loading or initialize_dpf_sdk flow, and use the existing build_deployment_type_labels/setup.rs routing logic as the place to confirm the BF3/BF4Generic label sets cannot collide.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/machine-controller/src/dpf.rs`:
- Around line 89-90: The DPU deployment resolver currently treats missing or
unknown hardware as BF4 by falling through to DpuDeploymentType::Bf4Generic in
deployment_type_for_dpu, which can misclassify unreported BF3/unknown devices.
Update the resolver around the hardware_info/dpu_info/part_number handling to
return an error or unknown state when the part number is absent or unrecognized,
and only select Bf4Generic after an explicit BF4 check. Make the logic in
deployment_type_for_dpu and its callers fail safely instead of defaulting
missing hardware to BF4.
---
Outside diff comments:
In `@crates/dpf/src/sdk.rs`:
- Around line 2282-2303: Update TestLabeler to override
node_labels_for_deployment_type with the same test node label it previously
provided via node_labels, so register_dpu_node and the label verification tests
keep seeing non-empty required labels. Keep the existing device_labels and
node_context_labels behavior unchanged, and make the new override return the
deployment-type-independent node label expected by test_dpu_node_labels and
verify_node_labels_against_seeded_node.
- Around line 1265-1404: The delete path in DpfSdk::delete_dpu_node is only
removing the generic node labels, but register_dpu_node now applies
deployment-specific labels via
node_labels_for_deployment_type(info.deployment_type). Update the cleanup flow
so it knows the DpuDeploymentType before building the patch, and use that when
calling node_label_removal_patch (or equivalent) so labels like the BF4 selector
are removed consistently. Make sure any higher-level delete helper such as
force_delete_host passes the deployment type through to this path.
In `@crates/machine-controller/src/handler/dpf.rs`:
- Around line 205-232: The host-level DpuNodeInfo registration and label
verification are using different DPU snapshots, which can cause a correctly
labeled node to be treated as stale. In dpf.rs, resolve a single host deployment
type once from the primary DPU (or a validated host-wide source) in the
reconciliation flow around DpuNodeInfo creation, then reuse that same value both
when registering the node and when verifying labels in the later check path near
the stale-label logic. Make the deployment type selection explicit and
consistent across the relevant dpf_sdk and state reconciliation code paths.
---
Nitpick comments:
In `@crates/api-core/src/cfg/file.rs`:
- Around line 1166-1189: Add validation for deployment-specific DPF config so
overlapping values fail fast instead of silently reusing routing labels. In the
DpfConfig/DpfDeploymentsConfig path, validate DpfDeploymentConfig fields like
node_label_key, deployment_name, and flavor_name against the top-level DpfConfig
equivalents and reject empty or duplicate values. Wire this into the config
loading or initialize_dpf_sdk flow, and use the existing
build_deployment_type_labels/setup.rs routing logic as the place to confirm the
BF3/BF4Generic label sets cannot collide.
In `@crates/machine-controller/src/dpf.rs`:
- Around line 177-192: `CarbideDPFLabeler::new` currently leaves
`deployment_type_labels` empty, which makes
`node_labels_for_deployment_type(DpuDeploymentType::Bf3)` fail unless every
caller remembers to call `with_deployment_type_labels`. Seed the BF3 default
labels in the constructor so the labeler works out of the box, and keep
`with_deployment_type_labels` as an override path that can replace or extend the
configured deployment labels.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c0230885-2a65-46e7-87af-5c77f06166aa
📒 Files selected for processing (18)
crates/api-core/src/cfg/file.rscrates/api-core/src/dpf_services.rscrates/api-core/src/setup.rscrates/api-core/src/tests/dpf/duplicate_events.rscrates/api-core/src/tests/dpf/happy_path.rscrates/api-core/src/tests/dpf/reprovisioning.rscrates/api-core/src/tests/dpf/stale_labels.rscrates/api-core/src/tests/dpf/waiting_for_ready.rscrates/api-core/src/tests/machine_admin_force_delete.rscrates/dpf/src/bin/api_harness.rscrates/dpf/src/flavor.rscrates/dpf/src/lib.rscrates/dpf/src/sdk.rscrates/dpf/src/test/sdk_device_registration.rscrates/dpf/src/test/sdk_reboot_annotation.rscrates/dpf/src/types.rscrates/machine-controller/src/dpf.rscrates/machine-controller/src/handler/dpf.rs
| #[derive(Clone, Debug, Serialize, Deserialize)] | ||
| pub struct DpfDeploymentConfig { | ||
| /// URL to the BlueField firmware bundle (BFB) for DPU provisioning. | ||
| pub bfb_url: String, |
There was a problem hiding this comment.
BF4 uses a completely separate BFB?
There was a problem hiding this comment.
It can, for testing or any reason. I just want to make it configurable. but yeah, let me set a sane default value to it.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| // This should be used to create a separate flavor for BF3, BF4Generic, BF4Astra | ||
| _deployment_type: DpuDeploymentType, | ||
| ) -> Result<DPUFlavor, crate::error::DpfError> { | ||
| default_flavor(namespace, proxy) |
There was a problem hiding this comment.
once the BF4 is implemented, I'm not sure "default flavor" should imply BF3. I don't think there would be a reasonable default
There was a problem hiding this comment.
This part is not implemented as I have not received a working DPUflavor from DPF team yet.
| #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)] | ||
| pub enum DpuDeploymentType { | ||
| Bf3, | ||
| Bf4Generic, |
There was a problem hiding this comment.
will there be multiple BF4 types?
There was a problem hiding this comment.
and if so, wouldn't there be multiple BF3 types (or at least a rename of the Bf3 to Bf3Generic)?
There was a problem hiding this comment.
yes, BF4Generic and BF4Astra afaik.
There was a problem hiding this comment.
Will category or configuration like BF4NicMode or BF4DPUMode fall under Bf4Generic?
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/api-core/src/cfg/file.rs (1)
1152-1167: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winAllow BF3 partial overrides to inherit defaults
[dpf.deployments.bf3]is documented as allowing per-field overrides, but once the table is present it still requires all four keys. That breaks thebfb_url-only migration path. Make BF3 merge supplied values ontoDpfDeploymentConfig::default(), and add a deserialization test for a partial BF3 block.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/cfg/file.rs` around lines 1152 - 1167, The BF3 deployment config currently requires all fields whenever the [dpf.deployments.bf3] table is present, so partial overrides cannot inherit the default BF3 values. Update DpfDeploymentConfig deserialization so BF3 values merge onto DpfDeploymentConfig::default() instead of treating every field as mandatory, while keeping the existing behavior for other deployment blocks like bf4_generic. Add a deserialization test around DpfDeploymentsConfig/DpfDeploymentConfig that covers a partial bf3 block (for example only bfb_url) and verifies the missing fields come from the default config.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-core/src/setup.rs`:
- Around line 690-693: The BF4Generic initialization path currently reuses BF3
identifiers, which can create overlapping resources or selectors. In the setup
flow around carbide_config.dpf.deployments.bf4_generic and the
make_init_config/create_initialization_objects call, add a preflight validation
that checks deployment_name, flavor_name, and node_label_key against the BF3
configuration and fails with a clear config error if any values overlap. Ensure
the guard runs before any resource creation so the bf4_generic path never
proceeds with conflicting names or label keys.
---
Outside diff comments:
In `@crates/api-core/src/cfg/file.rs`:
- Around line 1152-1167: The BF3 deployment config currently requires all fields
whenever the [dpf.deployments.bf3] table is present, so partial overrides cannot
inherit the default BF3 values. Update DpfDeploymentConfig deserialization so
BF3 values merge onto DpfDeploymentConfig::default() instead of treating every
field as mandatory, while keeping the existing behavior for other deployment
blocks like bf4_generic. Add a deserialization test around
DpfDeploymentsConfig/DpfDeploymentConfig that covers a partial bf3 block (for
example only bfb_url) and verifies the missing fields come from the default
config.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c45efd5c-da95-40dd-8348-59f621271866
📒 Files selected for processing (10)
crates/api-core/src/cfg/file.rscrates/api-core/src/setup.rscrates/api-core/src/tests/dpf/duplicate_events.rscrates/api-core/src/tests/dpf/happy_path.rscrates/api-core/src/tests/dpf/reprovisioning.rscrates/api-core/src/tests/dpf/stale_labels.rscrates/api-core/src/tests/dpf/waiting_for_ready.rscrates/api-core/src/tests/machine_admin_force_delete.rscrates/machine-controller/src/dpf.rscrates/machine-controller/src/handler/dpf.rs
🚧 Files skipped from review as they are similar to previous changes (4)
- crates/machine-controller/src/handler/dpf.rs
- crates/api-core/src/tests/machine_admin_force_delete.rs
- crates/api-core/src/tests/dpf/reprovisioning.rs
- crates/api-core/src/tests/dpf/duplicate_events.rs
3fa1b6d to
c88a5f1
Compare
Adds support for multiple DPUDeployments in DPF, enabling BF4 DPUs to be provisionedalongside BF3 using a separate deployment configuration. BF3 and BF4 DPUs are now automatically routed to their respective deployments based on hardware part number.
Attempting to register a BF4 DPU when no BF4 deployment is configured fails with a clear error.
For the completeness, a BF4Generic deployment is added, but Dpuflavor is not implemented. This will be done in another PR when the information is available.
This PR also restructures the
BF3config intodeploymentssection. The BF3 section is default, thus will be configured even if not present in the toml.The updated tool looks like:
Related issues
Type of Change
Breaking Changes
Testing
Additional Notes