
feat: NIC auto-heal scheduled task for chained-clone Windows guests#1

Merged
CMGS merged 6 commits into main from feat/nic-autoheal-task
May 7, 2026


@CMGS CMGS commented May 7, 2026

Problem

Chained Windows clones (vm clone → vm clone → vm clone, generation ≥ 3) hit a state where the post-restore hot-swapped NIC binds at the OS layer but never transmits. Get-PnpDevice -Class Net reports Status='OK' and Windows shows the adapter as up, but no DHCPDISCOVER ever leaves the guest. cocoon-net never grants a lease and the VM stays unreachable indefinitely (tested with an 8-minute budget: 0/5 self-recovered at gen3+).

Tracked at cocoonstack/cocoon#28.

Fix

Bake an in-guest auto-heal scheduled task into the image. Every minute, SYSTEM runs C:\CocoonNicAutoHeal.ps1, which iterates over Get-PnpDevice -Class Net and runs Disable-PnpDevice followed by Enable-PnpDevice on each entry unconditionally. The unconditional cycle is essential: the user-suggested ? Status -EQ 'Error' filter misses the failure mode (Windows reports OK while the NIC is dead at the wire).
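
To make the filter argument concrete, here is a small Python illustration (not part of the PR; the device record is hypothetical and only mirrors the Status='OK' symptom described above). It shows that an error-only filter selects nothing on a stuck clone, while the cycle-all approach still selects the adapter.

```python
# Hypothetical snapshot of what Get-PnpDevice -Class Net might report on a
# stuck gen3+ clone: the adapter claims to be healthy even though nothing
# leaves the guest. The instance id is an illustrative placeholder.
devices = [
    {"instance_id": "EXAMPLE-NIC-0", "status": "OK"},
]


def select_error_only(devs):
    """Mimic a Status -eq 'Error' filter: cycle only devices Windows flags."""
    return [d for d in devs if d["status"] == "Error"]


def select_all(devs):
    """Mimic the unconditional cycle: every Net-class device is selected."""
    return list(devs)


if __name__ == "__main__":
    print("error-only filter would cycle:", select_error_only(devices))
    print("cycle-all would cycle:", select_all(devices))
```

With the filter, the dead NIC is never touched because Windows never marks it as failed, which is the reason the task cycles everything.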

Validation

5-generation chained clone test on testbed-1 (cocoon master be35341, CH v51.0.0, virtio-win 0.1.285):

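| Gen | baseline (no task) | this PR (cycle-all task) |
|-----|--------------------|--------------------------|
| 1   | T+13s ✅           | T+107s ✅                |
| 2   | T+180s ✅          | T+18s ✅                 |
| 3   | 8min ❌            | T+169s ✅                |
| 4   | 8min ❌            | T+19s ✅                 |
| 5   | 8min ❌            | T+292s ✅                |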

Task confirmed propagating across snapshots: schtasks /query /tn CocoonNicAutoHeal on every clone shows Status: Ready / Running. Recovery overhead bounded by the 1-minute trigger interval.

Implementation notes

  • New file: scripts/cocoon-nic-autoheal.ps1 — the script body (also documented for manual install).
  • autounattend.xml: insert Order=52 SynchronousCommand running base64-UTF16LE PowerShell that writes C:\CocoonNicAutoHeal.ps1 + registers the task via schtasks /create /sc minute /mo 1 /ru SYSTEM /rl HIGHEST. Existing Order 52/53 (QuickEdit restore + install marker) renumber to 53/54.
  • Used schtasks /create instead of Register-ScheduledTask because Win11's New-ScheduledTaskTrigger rejects sub-minute intervals; schtasks accepts /sc minute /mo 1 cleanly.
  • Used base64-encoded PS because the inline write-script-then-schtasks one-liner had nested cmd/PS/XML quoting that was fragile.
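
For readers regenerating the payload, the following is a minimal Python sketch of the encoding that powershell.exe -EncodedCommand expects: the script text as UTF-16LE bytes, then standard base64. The helper names are assumptions for illustration and are not taken from the repository.

```python
import base64


def encode_powershell(script: str) -> str:
    """Encode script text for powershell.exe -EncodedCommand.

    PowerShell expects base64 over the UTF-16LE bytes of the command.
    """
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")


def decode_powershell(encoded: str) -> str:
    """Inverse of encode_powershell, handy for inspecting an existing payload."""
    return base64.b64decode(encoded).decode("utf-16-le")


if __name__ == "__main__":
    sample = "Write-Output 'hello'\n"  # placeholder script, not the real task body
    payload = encode_powershell(sample)
    print(payload)
    print(decode_powershell(payload) == sample)
```

Because the payload is opaque base64, none of the nested cmd/PowerShell/XML quoting survives into the CommandLine element, which is the fragility the note above refers to.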

Limitations / follow-ups

  • 1-minute trigger is the floor for schtasks /sc minute. Worst-case recovery is ~60s + DHCP renegotiation. If faster recovery is needed, switch to a long-running PS service or add a DHCP-failure event trigger (Microsoft-Windows-Dhcp-Client event ID 1003) as a secondary trigger.
  • The task fires unconditionally, so a healthy guest experiences a brief NIC interruption every minute. Empirically this is negligible (~1s blip), but if it causes issues, a Status check plus a lightweight DHCP probe could gate the cycle.
  • This is the cocoon-side workaround for the chained-clone NDIS-state bug. The root cause is in CH/Windows hot-plug interaction during vm.restore; if upstream CH provides a clean MAC-update API or in-place virtio-net config-change interrupt, this task can be retired.

CMGS added 2 commits May 8, 2026 00:42
Chained Windows clones (vm clone of vm clone of clone, gen3+) hit a state
where the post-restore hot-swapped NIC binds at the OS layer but never
transmits — Get-PnpDevice reports Status='OK' so a Status='Error' filter
misses it. cocoon-net never sees a DHCPDISCOVER from the new MAC and the
guest stays unreachable indefinitely.

Add a SYSTEM scheduled task that cycles every Net PnP device every minute
(Disable-PnpDevice + Enable-PnpDevice). Validated on testbed by chaining
gen1..gen5: without the task gen3-5 time out at 8 min each; with the task all
five recover (T+19s..T+292s). Task and script are baked into the image at
firstboot via autounattend, so every clone descendant carries them.

Implementation: Order=52 SynchronousCommand runs a base64-encoded PS that
writes C:\CocoonNicAutoHeal.ps1 and registers CocoonNicAutoHeal via
schtasks /sc minute /mo 1 /ru SYSTEM /rl HIGHEST. Existing Order 52/53
(QuickEdit restore + install marker) renumber to 53/54.

`virtio-win-guest-tools.exe /S` (Order 21) installs viostor / NetKvm /
balloon but skips the viosock driver, so AF_VSOCK in the guest has no
WSP provider and cocoon-agent cannot bind. Run pnputil directly against
the virtio-win ISO at firstboot to register the driver.

Verified post-install:

  netsh winsock show catalog | findstr /i vsock
    Description: Virtio Vsock STREAM
    Provider Path: %SystemRoot%\System32\viosocklib.dll
    Address Family: 40

End-to-end vsock proof on a fresh Win11 VM (this PR's image will produce
the same state out of the box):

  host  $ python3 -c 'CONNECT 1024 ...'  → CH banner: OK 1073741836
  guest → cocoon-agent receives MsgExec, runs cmd.exe /c whoami
        → MsgStarted{pid:7748}, MsgStdout{"nt authority\\system"}, MsgExit
  host  ← {type:exit, exit_code:0}
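
The host-side step in the transcript above can be sketched in Python roughly as follows. This assumes the host socket speaks the text handshake suggested by the transcript (send "CONNECT <port>", read back an "OK <number>" banner line); the function name is hypothetical, and the caller is assumed to have already connected the socket to the VM's vsock endpoint on the host.

```python
import socket


def hybrid_vsock_handshake(sock: socket.socket, port: int) -> int:
    """Ask the host-side vsock endpoint to connect to a guest port.

    Sends "CONNECT <port>\n" and parses the "OK <n>\n" banner, returning <n>.
    Raises ConnectionError on any other reply.
    """
    sock.sendall(f"CONNECT {port}\n".encode("ascii"))

    # Read one byte at a time so nothing past the banner line is consumed.
    banner = b""
    while not banner.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("socket closed before banner was received")
        banner += chunk

    parts = banner.decode("ascii").split()
    if len(parts) != 2 or parts[0] != "OK":
        raise ConnectionError(f"unexpected banner: {banner!r}")
    return int(parts[1])
```

After a successful handshake the same socket carries the agent protocol (MsgExec and friends in the transcript); that framing is not shown here.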

NIC auto-heal task renumbers from Order 52 → 53; QuickEdit restore and
install marker likewise (53 → 54, 54 → 55). FirstLogonCommands count in
the README table updated from 54 to 55.

CMGS commented May 7, 2026

Pushed eebee5f: viosock driver install at firstboot (Order 52, before NIC auto-heal task).

virtio-win-guest-tools.exe /S (Order 21) installs viostor / NetKvm / balloon but skips the viosock driver — without it, AF_VSOCK has no WSP provider and netsh winsock show catalog does not list Virtio Vsock STREAM, so cocoon-agent's AF_VSOCK listener cannot bind. Verified on testbed-1 against a fresh ghcr.io image: viosock missing → installed manually via pnputil /add-driver → registered immediately, end-to-end vsock + cocoon-agent exec works.

Driver path search mirrors the existing <DriverPaths> pattern (D:/E:, standard + attestation layouts):

  • D:\viosock\w11\amd64\viosock.inf (standard layout)
  • E:\viosock\w11\amd64\viosock.inf
  • D:\Win11\amd64\viosock\viosock.inf (attestation layout)
  • E:\Win11\amd64\viosock\viosock.inf
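
The search order above (standard layout on D: then E:, then attestation layout on D: then E:) can be expressed as a short Python sketch. The real step is a cmd if-exist chain in autounattend.xml; the function name and the `roots` parameter are assumptions added so the logic can be exercised outside a Windows guest.

```python
from pathlib import Path
from typing import Iterable, Optional

# Relative locations listed above: standard layout first, then attestation.
LAYOUTS = (
    "viosock/w11/amd64/viosock.inf",
    "Win11/amd64/viosock/viosock.inf",
)


def find_viosock_inf(roots: Iterable[str] = ("D:/", "E:/")) -> Optional[Path]:
    """Return the first existing viosock.inf, or None if no candidate exists.

    Candidates are tried layout by layout, and within a layout drive by drive,
    which matches the D/E standard then D/E attestation order in the list.
    """
    root_list = list(roots)
    for layout in LAYOUTS:
        for root in root_list:
            candidate = Path(root) / layout
            if candidate.is_file():
                return candidate
    return None


if __name__ == "__main__":
    print(find_viosock_inf())
```

Returning None rather than raising mirrors the cmd chain, which simply does nothing when no path matches.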

Post-install state on the fresh build:

netsh winsock show catalog | findstr /i vsock
  Description:    Virtio Vsock STREAM
  Provider Path:  %SystemRoot%\System32\viosocklib.dll
  Address Family: 40

Get-PnpDevice -Class System | Where InstanceId -like "*VEN_1AF4*DEV_1053*"
  FriendlyName         Status   Class
  VirtIO Socket Driver OK       System

Follow-up (separate PR): bake cocoon-agent.exe + install-cocoon-agent.ps1 from cocoonstack/cocoon-agent. With viosock + agent in the image, cocoon vm exec <vmid> -- powershell -Command "..." from the host works end-to-end (currently blocked by info.Config.Windows guard in cocoon/cmd/vm/exec.go:56).

The previous two commits added two firstboot actions (viosock driver
install, CocoonNicAutoHeal scheduled task) but the build pipeline's
verify.ps1 did not assert them and remediate.ps1 did not re-apply them
on the post-install reboot. Brings parity with every other autounattend
FirstLogonCommands entry.

verify.ps1:
- viosock device bound (Get-PnpDevice VirtIO Socket Driver, Status=OK)
- Virtio Vsock STREAM provider registered (netsh winsock show catalog)
- CocoonNicAutoHeal task registered (schtasks /query)
- C:\CocoonNicAutoHeal.ps1 file present

remediate.ps1:
- pnputil /add-driver viosock.inf if catalog missing the provider, with
  the same D:/E: + standard/attestation path search as autounattend
- schtasks /create CocoonNicAutoHeal if missing, recreating the script
  body inline (same content as the base64-encoded autounattend version)
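
As a rough illustration of how the verify checks map onto the remediation steps listed above, here is a Python sketch. The actual verify.ps1 and remediate.ps1 are PowerShell and are not shown in this PR text; the function name, its parameters, and the returned action labels are all assumptions made for the example.

```python
from typing import List


def plan_remediation(
    winsock_catalog: str,
    task_registered: bool,
    script_present: bool,
) -> List[str]:
    """Decide which remediation steps would be needed.

    winsock_catalog is the captured text of `netsh winsock show catalog`;
    the two booleans stand in for the schtasks /query and file-exists checks.
    The returned strings are human-readable labels, not exact command lines.
    """
    actions: List[str] = []
    if "virtio vsock stream" not in winsock_catalog.lower():
        actions.append("install viosock driver via pnputil /add-driver")
    if not (task_registered and script_present):
        actions.append("rewrite C:\\CocoonNicAutoHeal.ps1 and re-create the CocoonNicAutoHeal task")
    return actions


if __name__ == "__main__":
    healthy = "Description: Virtio Vsock STREAM\nAddress Family: 40\n"
    print(plan_remediation(healthy, True, True))
    print(plan_remediation("", False, True))
```

A healthy image yields an empty plan, so re-running the remediation pass on it changes nothing.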

Copilot AI left a comment


Pull request overview

This PR adds an in-guest mitigation for a chained-clone Windows networking failure by installing a scheduled task that periodically cycles Windows Net-class PnP devices, and updates the image setup/docs to include this behavior.

Changes:

  • Add CocoonNicAutoHeal scheduled task creation during first boot to run a NIC-cycling PowerShell script every minute as SYSTEM.
  • Add a viosock (virtio-vsock) driver install step during first boot.
  • Document the new script and update autounattend.xml FirstLogonCommands table/count.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| scripts/cocoon-nic-autoheal.ps1 | Adds the reference PowerShell implementation that disables/enables Net PnP devices. |
| autounattend.xml | Adds FirstLogonCommands to install viosock driver and to write/register the NIC auto-heal scheduled task. |
| README.md | Documents the new script and updates the FirstLogonCommands table/count with the new steps. |


Comment thread autounattend.xml
D: and E:, mirroring the windowsPE DriverPaths. Verified post-
install: "Virtio Vsock STREAM" Provider Path %SystemRoot%\System32
\viosocklib.dll, Address Family 40 in `netsh winsock show catalog`. -->
<SynchronousCommand wcm:action="add"><Order>52</Order><CommandLine>cmd /c "if exist D:\viosock\w11\amd64\viosock.inf (pnputil /add-driver D:\viosock\w11\amd64\viosock.inf /install) else if exist E:\viosock\w11\amd64\viosock.inf (pnputil /add-driver E:\viosock\w11\amd64\viosock.inf /install) else if exist D:\Win11\amd64\viosock\viosock.inf (pnputil /add-driver D:\Win11\amd64\viosock\viosock.inf /install) else if exist E:\Win11\amd64\viosock\viosock.inf (pnputil /add-driver E:\Win11\amd64\viosock\viosock.inf /install)"</CommandLine></SynchronousCommand>
Comment thread autounattend.xml Outdated
Comment on lines +171 to +173
filter would miss it). Base64-encoded UTF-16LE PowerShell that writes
C:\CocoonNicAutoHeal.ps1 and registers CocoonNicAutoHeal via schtasks. -->
<SynchronousCommand wcm:action="add"><Order>53</Order><CommandLine>powershell.exe -NoProfile -EncodedCommand JABjAG8AbgB0AGUAbgB0ACAAPQAgAEAAJwAKACQARQByAHIAbwByAEEAYwB0AGkAbwBuAFAAcgBlAGYAZQByAGUAbgBjAGUAIAA9ACAAIgBTAGkAbABlAG4AdABsAHkAQwBvAG4AdABpAG4AdQBlACIACgBmAG8AcgBlAGEAYwBoACAAKAAkAGQAIABpAG4AIAAoAEcAZQB0AC0AUABuAHAARABlAHYAaQBjAGUAIAAtAEMAbABhAHMAcwAgAE4AZQB0ACkAKQAgAHsACgAgACAAIAAgAEQAaQBzAGEAYgBsAGUALQBQAG4AcABEAGUAdgBpAGMAZQAgAC0ASQBuAHMAdABhAG4AYwBlAEkAZAAgACQAZAAuAEkAbgBzAHQAYQBuAGMAZQBJAGQAIAAtAEMAbwBuAGYAaQByAG0AOgAkAGYAYQBsAHMAZQAKACAAIAAgACAAUwB0AGEAcgB0AC0AUwBsAGUAZQBwACAALQBTAGUAYwBvAG4AZABzACAAMgAKACAAIAAgACAARQBuAGEAYgBsAGUALQBQAG4AcABEAGUAdgBpAGMAZQAgAC0ASQBuAHMAdABhAG4AYwBlAEkAZAAgACQAZAAuAEkAbgBzAHQAYQBuAGMAZQBJAGQAIAAtAEMAbwBuAGYAaQByAG0AOgAkAGYAYQBsAHMAZQAKAH0ACgAnAEAACgBTAGUAdAAtAEMAbwBuAHQAZQBuAHQAIAAtAFAAYQB0AGgAIABDADoAXABDAG8AYwBvAG8AbgBOAGkAYwBBAHUAdABvAEgAZQBhAGwALgBwAHMAMQAgAC0AVgBhAGwAdQBlACAAJABjAG8AbgB0AGUAbgB0ACAALQBFAG4AYwBvAGQAaQBuAGcAIABBAFMAQwBJAEkAIAAtAEYAbwByAGMAZQAKAHMAYwBoAHQAYQBzAGsAcwAgAC8AYwByAGUAYQB0AGUAIAAvAHQAbgAgAEMAbwBjAG8AbwBuAE4AaQBjAEEAdQB0AG8ASABlAGEAbAAgAC8AdAByACAAIgBwAG8AdwBlAHIAcwBoAGUAbABsAC4AZQB4AGUAIAAtAE4AbwBQAHIAbwBmAGkAbABlACAALQBFAHgAZQBjAHUAdABpAG8AbgBQAG8AbABpAGMAeQAgAEIAeQBwAGEAcwBzACAALQBGAGkAbABlACAAQwA6AFwAQwBvAGMAbwBvAG4ATgBpAGMAQQB1AHQAbwBIAGUAYQBsAC4AcABzADEAIgAgAC8AcwBjACAAbQBpAG4AdQB0AGUAIAAvAG0AbwAgADEAIAAvAHIAdQAgAFMAWQBTAFQARQBNACAALwByAGwAIABIAEkARwBIAEUAUwBUACAALwBmAAoA</CommandLine></SynchronousCommand>
CMGS added 3 commits May 8, 2026 01:58
Adds Order 53 to FirstLogonCommands: download the pinned cocoon-agent
Windows release zip from GitHub, verify SHA256, run the bundled
install-cocoon-agent.ps1 (registers cocoon-agent Windows service —
LocalSystem, auto-start, restart-on-crash, vsock port 1024).
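
The "verify SHA256" step can be sketched in Python as below. The real bootstrap is PowerShell (scripts/install-cocoon-agent-bootstrap.ps1) and its pinned URL and digest are not reproduced here; the function names are hypothetical and the expected digest is supplied by the caller.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, expected_hex: str) -> bool:
    """Compare a downloaded file against a pinned digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

An installer following this pattern would refuse to run the bundled script when verify_pinned returns False, so a tampered or truncated download never executes.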

The bootstrap script lives at scripts/install-cocoon-agent-bootstrap.ps1
as the canonical source. autounattend Order 53 is a base64-encoded
wrapper that drops the same content to C:\Scripts\install-cocoon-agent-
bootstrap.ps1 and invokes it; remediate.ps1 re-invokes the on-disk copy
when verify.ps1 reports the service missing or stopped.

Order renumbering: 52 viosock (unchanged), 53 NEW cocoon-agent install,
54 NIC auto-heal (was 53), 55 QuickEdit restore (was 54), 56 install
marker (was 55).
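
The renumbering is a plain "insert and shift" over the Order values. A small Python sketch of that bookkeeping follows, using the step names from this commit message; the helper itself is illustrative and not part of the repository.

```python
from typing import Dict


def insert_step(steps: Dict[int, str], order: int, name: str) -> Dict[int, str]:
    """Insert a step at `order`, pushing every step at or after it down by one."""
    shifted = {(o + 1 if o >= order else o): n for o, n in steps.items()}
    shifted[order] = name
    return dict(sorted(shifted.items()))


if __name__ == "__main__":
    before = {
        52: "viosock",
        53: "NIC auto-heal",
        54: "QuickEdit restore",
        55: "install marker",
    }
    for order, name in insert_step(before, 53, "cocoon-agent install").items():
        print(order, name)
```

Running it on the pre-commit layout reproduces the mapping stated above: viosock stays at 52 and the three later steps each move down by one.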

Drift: the autounattend base64 must be regenerated from the standalone
.ps1 whenever the pinned version is bumped (URL + SHA256). This is
called out in the bootstrap header and the autounattend comment block;
the standalone-vs-base64 match is verifiable by base64-decoding Order 53
and comparing the inner here-string against the .ps1 file. Image
build-time CI catches drift via verify.ps1's "service running" check.

Verify.ps1 adds: cocoon-agent service Running, bootstrap present at
C:\Scripts\, version printout from `cocoon-agent --version`.

Remediate.ps1 adds: re-run C:\Scripts\install-cocoon-agent-bootstrap.ps1
if service is missing or stopped (idempotent re-install).

The autounattend.xml Orders 53 (cocoon-agent bootstrap) and 54 (NIC
auto-heal) embed PowerShell as base64 -EncodedCommand. The decoded
wrapper holds the same body as the matching scripts/*.ps1 file in a
here-string. Without a check, editing the standalone .ps1 silently
ships the old version inside the image.

scripts/check-base64-drift.py decodes each Order's base64, extracts the
here-string body, and asserts byte-for-byte equality with the standalone
.ps1 (modulo trailing whitespace from CRLF). Runs in <1s.
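
A minimal sketch of the comparison described above, written independently of the actual scripts/check-base64-drift.py (whose source is not shown here). It assumes the wrapper holds the body in a single-quoted here-string (@' ... '@) and treats CRLF versus LF and trailing whitespace as insignificant; the function names are hypothetical.

```python
import base64
import re

# Matches the body of the first single-quoted PowerShell here-string.
HERE_STRING = re.compile(r"@'\r?\n(.*?)\r?\n'@", re.DOTALL)


def embedded_body(encoded_command: str) -> str:
    """Decode a -EncodedCommand payload and return its here-string body."""
    wrapper = base64.b64decode(encoded_command).decode("utf-16-le")
    match = HERE_STRING.search(wrapper)
    if match is None:
        raise ValueError("no single-quoted here-string found in wrapper")
    return match.group(1)


def normalize(text: str) -> str:
    """Normalize line endings and strip trailing whitespace on every line."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).rstrip()


def has_drifted(encoded_command: str, standalone_text: str) -> bool:
    """True when the embedded body no longer matches the standalone script."""
    return normalize(embedded_body(encoded_command)) != normalize(standalone_text)
```

A CI job built on this idea would exit non-zero when has_drifted returns True for any Order/script pair, which is what turns a silent stale payload into a failed check.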

.github/workflows/check.yml runs the script on every push to main and
every pull_request — fast feedback (under a 2-minute timeout) decoupled
from the slow build.yml workflow that only runs on workflow_dispatch.

Order 54 (NIC auto-heal) base64 was regenerated to include the standalone
.ps1's comment header so the drift check passes from day one. The
deployed C:\CocoonNicAutoHeal.ps1 now carries the explanatory comment
too — same runtime behavior, slightly more readable on inspection.

Verified: python3 scripts/check-base64-drift.py exits 0 with both pairs
matching; xmllint accepts the regenerated autounattend.xml.
@CMGS CMGS merged commit a557ba5 into main May 7, 2026
1 check passed
@CMGS CMGS deleted the feat/nic-autoheal-task branch May 7, 2026 18:33