feat(api): skip DPU-tied steps for zero-DPU hosts #1980

Open
s3rj1k wants to merge 2 commits into NVIDIA:main from s3rj1k:feat/zero-dpu-state-machine
@s3rj1k s3rj1k commented May 28, 2026

Description

When allow_zero_dpu_hosts is true, zero-DPU hosts skip: is_bios_setup verification, machine_setup (short-circuited before the Redfish call), is_boot_order_setup, and set_host_boot_order's SetBootOrder arm (returns Done — no DPU-first boot ordering to configure). The toggle is surfaced in the local nico-core values (default false).
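
A minimal sketch of the gating described above, not the PR's actual code. `SetupStep`, `HostContext`, `dpu_count`, and `should_skip` are hypothetical names introduced for illustration; only the `allow_zero_dpu_hosts` toggle and the four step names come from the description.

```rust
/// The four steps the description lists as DPU-tied.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SetupStep {
    IsBiosSetup,
    MachineSetup,
    IsBootOrderSetup,
    SetBootOrder,
}

/// Hypothetical view of the inputs; the real state machine carries
/// far more context than this.
#[derive(Debug, Clone, Copy)]
pub struct HostContext {
    /// Mirrors the `allow_zero_dpu_hosts` toggle (default false).
    pub allow_zero_dpu_hosts: bool,
    /// Number of DPUs discovered on the host (assumed field).
    pub dpu_count: usize,
}

/// Returns true when `step` would be skipped for this host.
/// A step is skipped only if the toggle is on AND the host has no DPU.
pub fn should_skip(ctx: &HostContext, step: SetupStep) -> bool {
    let zero_dpu_host = ctx.allow_zero_dpu_hosts && ctx.dpu_count == 0;
    match step {
        // All four steps follow the same rule in this sketch.
        SetupStep::IsBiosSetup
        | SetupStep::MachineSetup
        | SetupStep::IsBootOrderSetup
        | SetupStep::SetBootOrder => zero_dpu_host,
    }
}
```

With the toggle left at its default (false), a host with no DPU still runs every step; hosts that do have DPUs are unaffected either way.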

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…u_hosts

When allow_zero_dpu_hosts is true, zero-DPU hosts skip: is_bios_setup
verification, machine_setup (short-circuited before the Redfish call),
is_boot_order_setup, and set_host_boot_order's SetBootOrder arm (returns
Done — no DPU-first boot ordering to configure). The toggle is surfaced
in the local nico-core values (default false).

Signed-off-by: s3rj1k <evasive.gyron@gmail.com>
@s3rj1k s3rj1k force-pushed the feat/zero-dpu-state-machine branch from a4fb21b to bb43f98 Compare May 28, 2026 10:08
Collaborator

ajf commented May 28, 2026

@kensimon @chet please take a look

@ajf ajf requested review from chet and kensimon May 28, 2026 20:39
Collaborator

ajf commented May 28, 2026

/ok to test bb43f98

Author

s3rj1k commented May 28, 2026

/ok to test bb43f98

I think I need to do a follow-up on this. I can do it either as a separate PR or as a second commit; right now I'm testing a small change in my environment.


Added followup 9840809

`set_boot_order_dpu_first` is DPU-targeting and returns
`NotSupported` on vendors without a custom impl. For hosts with no
DPU under `allow_zero_dpu_hosts`, PATCH BootSourceOverride directly
via `boot_first` — try UefiHttp first, fall back to Pxe for BMCs
that don't accept UefiHttp. The downstream SetBootOrder substates
already handle `jid = None`, so the host still progresses through
reboot + verify.

Signed-off-by: s3rj1k <evasive.gyron@gmail.com>
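
The try-then-fall-back behaviour in the commit message above could look roughly like the sketch below. It is an illustration under assumptions, not the commit's code: the `BootOverride` trait, `BootTarget`, `BmcError`, and `FakeBmc` are hypothetical stand-ins for the Redfish client, and falling back on any error (rather than on one specific error kind) is an assumption.

```rust
/// Boot targets named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BootTarget {
    UefiHttp,
    Pxe,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum BmcError {
    NotSupported,
    Other(String),
}

/// Hypothetical stand-in for the client that PATCHes BootSourceOverride.
pub trait BootOverride {
    fn boot_first(&mut self, target: BootTarget) -> Result<(), BmcError>;
}

/// Try UefiHttp first; if the BMC rejects it, retry with Pxe.
/// Returns the target that was accepted, or the Pxe error if both fail.
/// Assumption: any UefiHttp error triggers the fallback.
pub fn boot_first_with_fallback<B: BootOverride>(bmc: &mut B) -> Result<BootTarget, BmcError> {
    match bmc.boot_first(BootTarget::UefiHttp) {
        Ok(()) => Ok(BootTarget::UefiHttp),
        Err(_) => bmc.boot_first(BootTarget::Pxe).map(|()| BootTarget::Pxe),
    }
}

/// In-memory fake used to exercise the fallback path.
pub struct FakeBmc {
    pub accepts_uefi_http: bool,
    pub calls: Vec<BootTarget>,
}

impl BootOverride for FakeBmc {
    fn boot_first(&mut self, target: BootTarget) -> Result<(), BmcError> {
        // Record every attempt so callers can inspect the order.
        self.calls.push(target);
        match target {
            BootTarget::UefiHttp if !self.accepts_uefi_http => Err(BmcError::NotSupported),
            _ => Ok(()),
        }
    }
}
```

Returning the accepted target lets the caller log which override was applied before the host moves on to the reboot and verify substates.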
@s3rj1k s3rj1k force-pushed the feat/zero-dpu-state-machine branch from 954bd43 to 9840809 Compare May 28, 2026 21:34
@s3rj1k s3rj1k marked this pull request as ready for review May 28, 2026 21:35
@s3rj1k s3rj1k requested review from a team and shayan1995 as code owners May 28, 2026 21:35
@kensimon
Contributor

/ok to test 9840809

Contributor

@kensimon kensimon left a comment


I'm not sure we want to do this... these steps are still important even without DPUs... we want boot order setup so that we can boot to the scout image, which is still a thing even in the zero-DPU world. Plus all the other things that machine_setup does.

tracing::info!(
    "Skipping machine_setup: zero-DPU host (allow_zero_dpu_hosts=true); BIOS profile is DPU-tied."
);
return Ok(None);
Contributor


Hmm, I'm not sure we want to skip the setup just because there are zero DPUs. machine_setup does other things, like enabling virtualization, clearing the TPM, setting up the serial console, etc.

libredfish only sends a NoDpu error for Dell systems today: https://github.com/NVIDIA/libredfish/blob/dd2152ac5642c5256b893e396647e159003d0071/src/dell.rs#L363 ... and even then it still ends up applying the rest of the config (although it doesn't return a job ID, which is bad).

libredfish only tries to detect the DPU so it can determine what interface MAC address to configure for network boot (the DPU becomes the boot device), so the correct fix is to send it the non-DPU primary NIC as the MAC address as part of the setup call (see my TODO line above which we never got around to fixing.)

For now though, skipping the setup call altogether if there are zero DPUs is not what we want.
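
The alternative suggested in this comment (give the setup call the non-DPU primary NIC's MAC when no DPU exists) might be sketched as follows. This is a hedged illustration only: the `Nic` struct, its fields, and `boot_interface_mac` are hypothetical and do not reflect libredfish's real types or signatures.

```rust
/// Hypothetical NIC record; field names are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Nic {
    pub mac: String,
    /// True when this interface belongs to a DPU.
    pub is_dpu: bool,
    /// True when this is the host's primary (non-DPU) NIC.
    pub is_primary: bool,
}

/// Pick the MAC address to configure for network boot.
/// Prefer a DPU interface when one exists (the DPU becomes the boot
/// device); otherwise fall back to the primary NIC. Returns None when
/// neither is present, so the caller can surface an error instead of
/// silently skipping setup.
pub fn boot_interface_mac(nics: &[Nic]) -> Option<&str> {
    nics.iter()
        .find(|nic| nic.is_dpu)
        .or_else(|| nics.iter().find(|nic| nic.is_primary))
        .map(|nic| nic.mac.as_str())
}
```

Under this approach the rest of the setup call (virtualization, TPM, serial console) would still run on zero-DPU hosts; only the boot interface selection changes.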

Author


> machine_setup does other things, like enabling virtualization, clearing the TPM, setting up the serial console, etc.

Not all machines even support this kind of configuration. If no_dpu_flag is not enough, I suggest introducing another flag that explicitly disables all this configuration magic and assumes that the operator will be in charge of setting up the BIOS config.

Contributor


Indeed, this sounds more like a BIOS- or BMC-profile-related setting and feature than anything directly related to DPUs.

Author


The logic here is simple: if there is no DPU on the server, there is little point in configuring NIC boot device ordering (at this point we don't care whether the server boots from a specific NIC, only that it boots from some NIC) or in other BIOS enforcements. I do agree that this is more related to server settings; it might be worth having another dedicated flag for this.
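
One way the dedicated flag discussed in this thread could be separated from the DPU toggle is sketched below. The flag name `operator_managed_bios`, the `SetupPolicy` struct, and `bios_action` are all hypothetical; nothing like them is confirmed to exist in the codebase.

```rust
/// Hypothetical policy struct for this sketch.
#[derive(Debug, Clone, Copy, Default)]
pub struct SetupPolicy {
    /// Existing toggle from this PR: permits hosts without DPUs.
    pub allow_zero_dpu_hosts: bool,
    /// Hypothetical new flag: the operator owns BIOS/BMC configuration,
    /// so automated BIOS setup is not attempted.
    pub operator_managed_bios: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BiosAction {
    Configure,
    SkipOperatorManaged,
}

/// Decide whether automated BIOS setup runs.
/// The DPU count is accepted but deliberately ignored: in this design
/// the decision depends only on the dedicated flag, never on whether
/// the host has DPUs.
pub fn bios_action(policy: &SetupPolicy, _dpu_count: usize) -> BiosAction {
    if policy.operator_managed_bios {
        BiosAction::SkipOperatorManaged
    } else {
        BiosAction::Configure
    }
}
```

Keeping the two flags independent means a zero-DPU host still gets the non-DPU parts of setup unless the operator explicitly opts out.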
