Skip to content

docs: explain phoneHomeEnabled flag and the phone-home flow#3007

Open
CTOmari360 wants to merge 2 commits into
NVIDIA:mainfrom
CTOmari360:docs/explain-phone-home-enabled
Open

docs: explain phoneHomeEnabled flag and the phone-home flow#3007
CTOmari360 wants to merge 2 commits into
NVIDIA:mainfrom
CTOmari360:docs/explain-phone-home-enabled

Conversation

@CTOmari360

Copy link
Copy Markdown
Contributor

Documents the phoneHomeEnabled flag, which was previously a one-liner ("Enable phone-home on first boot") across the operating-system and instance docs with no explanation of what it does, what NICo injects, or how the callback works.

The explanation is traced from the code, not inferred:

  • Readiness gate. When enabled, NICo holds the instance in a provisioning state and does not report it Ready until the booted OS phones home. This is the carbide tenant-state gate in crates/rpc/src/model/instance/status/tenant.rs (phone_home_pending = phone_home_enrolled && phone_home_last_contact.is_none()), fed by instance_config.os.phone_home_enabled in crates/rpc/src/model/instance/status.rs. The proto note on InstanceOperatingSystemConfig.phone_home_enabled says the same: "the instance will not transition to a Ready state until InstancePhoneHomeLastContact is updated."
  • NICo injects nothing. User-data is served verbatim by the metadata endpoint (USER_DATA_CATEGORY => metadata.user_data.clone() in crates/agent/src/instance_metadata_endpoint.rs); no code mutates user-data based on the flag. The OS must supply a cloud-init phone_home module that POSTs to the link-local metadata endpoint http://169.254.169.254:7777/latest/meta-data/phone_home (the canonical example appears in rest-api/openapi/spec.yaml). The update_phone_home_last_contact handler (crates/api-core/src/handlers/instance.rs) authorizes the calling host/DPU and records the contact timestamp (UPDATE instances SET phone_home_last_contact=now()), which releases the gate.
  • Guidance + caveats. Directly answers the issue's ISV use case (delay Ready until cloud-init finishes applying SSH keys/passwords), and documents the gotchas: enabling the flag without adding a phone_home module leaves the instance never-ready, and a one-time custom-iPXE reboot re-arms the gate (clear_phone_home_last_contact when boot_with_custom_ipxe && phone_home_enabled).

Adds a Phone-home section to docs/configuration/tenant_management.md (the page that already covers the instance lifecycle) and links the operating-system create / operating-system update CLI reference pages to it.

Related issues

Fixes #2790

Type of Change

  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • No testing required (docs, internal refactor, etc.)

Docs-only change. The added relative link (../../../../configuration/tenant_management.md#phone-home) resolves to the existing page and anchor.

Additional Notes

Scoped to documentation only. The flag descriptions in the two CLI reference pages now point at one authoritative explanation rather than duplicating it.

@CTOmari360 CTOmari360 requested a review from a team as a code owner June 30, 2026 03:33
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ab3cfed6-fb23-4694-be02-4409d6b737c0

📥 Commits

Reviewing files that changed from the base of the PR and between 6951cde and 186d596.

📒 Files selected for processing (3)
  • docs/configuration/tenant_management.md
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.md
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md
✅ Files skipped from review due to trivial changes (2)
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.md

Summary by CodeRabbit

  • Documentation
    • Added a new “Phone-home” subsection under “Verifying an Instance Is Running,” detailing how phone-home gates readiness/status, required callback behavior, and operational caveats (link-local endpoint reachability, stuck provisioning if callbacks are missing/fail, and reboot re-arming).
    • Updated guidance to note that NICo injects a cloud-init phone_home block into provided cloud-init YAML (including handling when userData is absent).
    • Expanded operating-system create and operating-system update --phone-home-enabled option descriptions, including the requirement for valid cloud-init YAML and the endpoint/override behavior.

Walkthrough

Three documentation pages are updated to describe phoneHomeEnabled / --phone-home-enabled. The tenant management guide adds a new Phone-home section, and the two operating-system CLI references expand the option text with provisioning, cloud-init, and metadata endpoint behavior.

Changes

Phone-home documentation

Layer / File(s) Summary
Phone-home reference and CLI option docs
docs/configuration/tenant_management.md, docs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.md, docs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md
Adds a Phone-home section covering readiness gating, injected cloud-init phone_home behavior, link-local metadata callback flow, and operational caveats. Expands both CLI option descriptions to explain the gating semantics, YAML requirement, and documentation cross-reference.

Estimated code review effort: 1 (Trivial) | ~5 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the docs update about phoneHomeEnabled and the phone-home flow.
Description check ✅ Passed The description is clearly related to the documented phone-home behavior and the affected CLI and configuration pages.
Linked Issues check ✅ Passed The docs now explain the flag's behavior, injected cloud-init block, readiness gating, and user guidance, satisfying #2790.
Out of Scope Changes check ✅ Passed The change set is documentation-only and stays within the three referenced docs pages.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

The phoneHomeEnabled flag was documented as a one-liner ("Enable
phone-home on first boot") with no explanation of what it does, what (if
anything) NICo injects, or how the phone-home callback works. Document
the mechanism, sourced from the code:

- It gates instance readiness: when enabled, the instance is held in a
  provisioning state and not reported Ready until the booted OS contacts
  NICo's metadata service (the carbide tenant-state gate in
  crates/rpc/src/model/instance/status/tenant.rs; proto note on
  InstanceOperatingSystemConfig.phone_home_enabled).
- NICo injects nothing: user-data is served verbatim
  (crates/agent/src/instance_metadata_endpoint.rs). The OS must include a
  cloud-init phone_home module posting to the link-local metadata
  endpoint http://169.254.169.254:7777/latest/meta-data/phone_home, which
  records the contact timestamp and releases the gate.
- Add usage guidance (delay readiness until cloud-init applies SSH
  keys/passwords) and caveats (no module => never ready; custom-iPXE
  reboot re-arms the gate).

Adds a Phone-home section to docs/configuration/tenant_management.md and
links the operating-system create/update CLI reference pages to it.

Fixes NVIDIA#2790

Signed-off-by: Omar Refai <[email protected]>
@CTOmari360 CTOmari360 force-pushed the docs/explain-phone-home-enabled branch from 07feed4 to 6951cde Compare June 30, 2026 03:37

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/configuration/tenant_management.md`:
- Around line 594-617: Clarify the precedence of phoneHomeEnabled in the tenant
management docs: update the phone-home section to explicitly state that an
instance-level setting overrides the operating-system default, and that the
OS-level value is only inherited when the instance does not specify one. Make
this change in the phoneHomeEnabled description and keep the wording aligned
with the existing instance create/update and operating-system create/update
references.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2677088f-c21c-4d49-86c0-a82267d33700

📥 Commits

Reviewing files that changed from the base of the PR and between 07feed4 and 6951cde.

📒 Files selected for processing (3)
  • docs/configuration/tenant_management.md
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.md
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md
✅ Files skipped from review due to trivial changes (1)
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-create.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/manuals/nico-admin-cli/commands/operating-system/operating-system-update.md

Comment on lines +594 to +617
`phoneHomeEnabled` (`--phone-home-enabled`) controls whether NICo waits for the booted operating system to call back -- to "phone home" -- before it reports the instance as ready. It is a property of the operating-system definition (`operating-system create` / `operating-system update`) and can also be set per instance (`instance create` / `instance update`).

**What it does.** When phone-home is enabled, NICo does not consider the instance fully provisioned until the booted OS contacts NICo's metadata service from inside the guest. The instance is held in a provisioning state -- the transition to `Ready` is gated -- until that callback arrives. When it is disabled, NICo reports the instance ready as soon as provisioning and config sync finish, without waiting for any signal from the OS.

**What NICo injects: nothing.** Enabling the flag does *not* add anything to your `userData`. NICo serves your cloud-init user-data to the booting host verbatim. To make the host actually phone home, your user-data must include a cloud-init [`phone_home`](https://cloudinit.readthedocs.io/en/latest/reference/modules.html#phone-home) module that POSTs to NICo's link-local metadata endpoint:

```yaml
#cloud-config
# ... your first-boot setup (SSH keys, passwords, packages, ...) ...
phone_home:
url: http://169.254.169.254:7777/latest/meta-data/phone_home
post: all
```

cloud-init runs the `phone_home` module in its final stage, after the rest of your configuration has been applied, so the callback fires only once your setup has completed. When NICo receives the POST it records the contact and releases the readiness gate, allowing the instance to become `Ready`.

**When to use it.** Enable phone-home when "ready" must mean "the guest has finished its own first-boot setup" -- for example, to delay readiness until cloud-init has applied SSH keys and passwords, so that automation watching for `Ready` does not connect before the host is fully configured. Leave it disabled when NICo finishing provisioning is a sufficient ready signal and you do not need a callback from inside the guest.

**Notes and caveats.**

- Phone-home only gates *status reporting*. It does not change how the host is provisioned or booted -- the reboot into the provisioned OS happens identically whether or not phone-home is enabled.
- If you enable phone-home but never include a `phone_home` module in your user-data, the instance stays in its provisioning state indefinitely and never reports ready.
- The metadata endpoint (`169.254.169.254:7777`) is reachable only from the provisioned host over its link-local metadata link; it is not a tenant-facing API.
- Rebooting the instance with a one-time custom iPXE override (`instance update --reboot-with-custom-ipxe=true`) re-arms the gate when phone-home is enabled: NICo clears the recorded contact, so the OS must phone home again before the instance is reported ready.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C 3 --hidden -g '!**/target/**' 'phoneHomeEnabled|phone-home-enabled' .

Repository: NVIDIA/infra-controller

Length of output: 50381


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the doc section around the reviewed lines.
sed -n '594,617p' docs/configuration/tenant_management.md

echo
echo '---'
echo

# Show the merge logic that resolves phoneHomeEnabled for instance creation/update.
sed -n '2068,2095p' rest-api/api/pkg/api/handler/instance.go

echo
echo '---'
echo

# Show the update-path precedence comment for phoneHomeEnabled.
sed -n '1228,1245p' rest-api/api/pkg/api/model/instance.go

Repository: NVIDIA/infra-controller

Length of output: 4532


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '594,617p' docs/configuration/tenant_management.md
echo '---'
sed -n '2068,2095p' rest-api/api/pkg/api/handler/instance.go
echo '---'
sed -n '1228,1245p' rest-api/api/pkg/api/model/instance.go

Repository: NVIDIA/infra-controller

Length of output: 4528


State the precedence for phoneHomeEnabled. The text says the flag can be set both on the operating system and per instance, but it should explicitly say that the instance value overrides the OS default, and that the OS value is inherited only when the instance omits it.

🧰 Tools
🪛 LanguageTool

[style] ~614-~614: ‘whether or not’ might be wordy. Consider a shorter alternative.
Context: ... the provisioned OS happens identically whether or not phone-home is enabled. - If you enable ...

(EN_WORDINESS_PREMIUM_WHETHER_OR_NOT)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/configuration/tenant_management.md` around lines 594 - 617, Clarify the
precedence of phoneHomeEnabled in the tenant management docs: update the
phone-home section to explicitly state that an instance-level setting overrides
the operating-system default, and that the OS-level value is only inherited when
the instance does not specify one. Make this change in the phoneHomeEnabled
description and keep the wording aligned with the existing instance
create/update and operating-system create/update references.

Source: Path instructions

Comment thread docs/configuration/tenant_management.md Outdated

**What it does.** When phone-home is enabled, NICo does not consider the instance fully provisioned until the booted OS contacts NICo's metadata service from inside the guest. The instance is held in a provisioning state -- the transition to `Ready` is gated -- until that callback arrives. When it is disabled, NICo reports the instance ready as soon as provisioning and config sync finish, without waiting for any signal from the OS.

**What NICo injects: nothing.** Enabling the flag does *not* add anything to your `userData`. NICo serves your cloud-init user-data to the booting host verbatim. To make the host actually phone home, your user-data must include a cloud-init [`phone_home`](https://cloudinit.readthedocs.io/en/latest/reference/modules.html#phone-home) module that POSTs to NICo's link-local metadata endpoint:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thossain-nv I thought we were injecting phone_home into user data?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICo injects the cloud-init phone_home block into user-data itself
when phoneHomeEnabled is set (util.InsertPhoneHomeIntoUserData, from
both the OS and instance paths); it does not serve user-data verbatim.
Document the injected site.phoneHomeUrl endpoint and the valid-YAML
requirement, and fix the two CLI reference blurbs that repeated the
same wrong claim.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@CTOmari360 CTOmari360 requested a review from polarweasel as a code owner July 2, 2026 18:31
@CTOmari360

Copy link
Copy Markdown
Contributor Author

Good catch, thank you — you're right, and this was a factual error in my draft. NICo does inject the block: InsertPhoneHomeIntoUserData (rest-api/api/pkg/api/model/util/util.go) is called from both the OS path (ValidateAndSetUserData) and the instance create/update/batch paths, using site.phoneHomeUrl (default http://169.254.169.254:7777/latest/meta-data/phone_home) exactly as you linked. It strips any existing phone_home block, inserts its own, and requires the user-data to be valid cloud-init YAML — the user never adds it themselves.

Fixed in the latest push: rewrote the "What NICo injects" section to describe the real behavior (injects/removes for you, site-configured URL, valid-YAML requirement) and corrected the two CLI reference blurbs that repeated the same wrong claim.

On the override — you're right that site.phoneHomeUrl isn't surfaced in the Helm values or docs today; it's viper config only. That's a separate gap from this correctness fix, so I'd rather keep this PR focused and take the Helm/override documentation as a follow-up. Happy to own that PR — let me know if you'd prefer I fold it in here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs: phoneHomeEnabled flag lacks explanation of what it injects and how it works

3 participants