feat: NIC auto-heal scheduled task for chained-clone Windows guests #1
Chained Windows clones (vm clone of vm clone of vm clone, gen3+) hit a state where the post-restore hot-swapped NIC binds at the OS layer but never transmits. Get-PnpDevice reports Status='OK', so a Status='Error' filter misses it. cocoon-net never sees a DHCPDISCOVER from the new MAC and the guest stays unreachable indefinitely.

Add a SYSTEM scheduled task that cycles every Net PnP device every minute (Disable-PnpDevice + Enable-PnpDevice). Validated on testbed by chaining gen1..gen5: without the task, gen3-5 time out after 8 min each; with the task, all five recover (T+19s..T+292s). Task and script are baked into the image at firstboot via autounattend, so every clone descendant carries them.

Implementation: Order=52 SynchronousCommand runs a base64-encoded PS that writes C:\CocoonNicAutoHeal.ps1 and registers CocoonNicAutoHeal via schtasks /sc minute /mo 1 /ru SYSTEM /rl HIGHEST. Existing Order 52/53 (QuickEdit restore + install marker) renumber to 53/54.
`virtio-win-guest-tools.exe /S` (Order 21) installs viostor / NetKvm /
balloon but skips the viosock driver, so AF_VSOCK in the guest has no
WSP provider and cocoon-agent cannot bind. Run pnputil directly against
the virtio-win ISO at firstboot to register the driver.
Verified post-install:
netsh winsock show catalog | findstr /i vsock
Description: Virtio Vsock STREAM
Provider Path: %SystemRoot%\System32\viosocklib.dll
Address Family: 40
End-to-end vsock proof on a fresh Win11 VM (this PR's image will produce
the same state out of the box):
host $ python3 -c 'CONNECT 1024 ...' → CH banner: OK 1073741836
guest → cocoon-agent receives MsgExec, runs cmd.exe /c whoami
→ MsgStarted{pid:7748}, MsgStdout{"nt authority\\system"}, MsgExit
host ← {type:exit, exit_code:0}
NIC auto-heal task renumbers from Order 52 → 53; QuickEdit restore and
install marker likewise (53 → 54, 54 → 55). FirstLogonCommands count in
the README table updated from 54 to 55.
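
The post-install check above (a `Virtio Vsock STREAM` entry with Address Family 40 in `netsh winsock show catalog`) can also be expressed as a small parser. The following is a minimal sketch, not part of this PR: the function names are hypothetical, and it assumes the catalog output consists of `Key: Value` lines where a repeated key marks the start of the next entry.

```python
def parse_winsock_catalog(text: str) -> list[dict[str, str]]:
    """Split `netsh winsock show catalog` text into one dict per entry.

    Assumption: every field is a "Key: Value" line, and seeing a key that
    the current entry already holds means a new entry has started.
    """
    entries: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # header or separator line, no field on it
        key, value = key.strip(), value.strip()
        if key in current:
            entries.append(current)
            current = {}
        current[key] = value
    if current:
        entries.append(current)
    return entries


def has_vsock_provider(text: str) -> bool:
    """True if the catalog lists the Virtio Vsock STREAM provider (family 40)."""
    return any(
        entry.get("Description") == "Virtio Vsock STREAM"
        and entry.get("Address Family") == "40"
        for entry in parse_winsock_catalog(text)
    )


# Field values copied from the verification output quoted in the commit message.
SAMPLE = (
    "Description: Virtio Vsock STREAM\n"
    "Provider Path: %SystemRoot%\\System32\\viosocklib.dll\n"
    "Address Family: 40\n"
)

if __name__ == "__main__":
    print(has_vsock_provider(SAMPLE))
```

Matching on both the description and the address family is a design choice of this sketch; a stricter check could also compare the provider path.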
Pushed
Driver path search mirrors the existing
Post-install state on the fresh build: Follow-up (separate PR): bake |
The previous two commits added two firstboot actions (viosock driver install, CocoonNicAutoHeal scheduled task) but the build pipeline's verify.ps1 did not assert them and remediate.ps1 did not re-apply them on the post-install reboot. Brings parity with every other autounattend FirstLogonCommands entry.

verify.ps1:
- viosock device bound (Get-PnpDevice VirtIO Socket Driver, Status=OK)
- Virtio Vsock STREAM provider registered (netsh winsock show catalog)
- CocoonNicAutoHeal task registered (schtasks /query)
- C:\CocoonNicAutoHeal.ps1 file present

remediate.ps1:
- pnputil /add-driver viosock.inf if the catalog is missing the provider, with the same D:/E: + standard/attestation path search as autounattend
- schtasks /create CocoonNicAutoHeal if missing, recreating the script body inline (same content as the base64-encoded autounattend version)
Pull request overview
This PR adds an in-guest mitigation for a chained-clone Windows networking failure by installing a scheduled task that periodically cycles Windows Net-class PnP devices, and updates the image setup/docs to include this behavior.
Changes:
- Add `CocoonNicAutoHeal` scheduled task creation during first boot to run a NIC-cycling PowerShell script every minute as SYSTEM.
- Add a viosock (`virtio-vsock`) driver install step during first boot.
- Document the new script and update the `autounattend.xml` FirstLogonCommands table/count.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/cocoon-nic-autoheal.ps1 | Adds the reference PowerShell implementation that disables/enables Net PnP devices. |
| autounattend.xml | Adds FirstLogonCommands to install viosock driver and to write/register the NIC auto-heal scheduled task. |
| README.md | Documents the new script and updates the FirstLogonCommands table/count with the new steps. |
| D: and E:, mirroring the windowsPE DriverPaths. Verified post- | ||
| install: "Virtio Vsock STREAM" Provider Path %SystemRoot%\System32 | ||
| \viosocklib.dll, Address Family 40 in `netsh winsock show catalog`. --> | ||
| <SynchronousCommand wcm:action="add"><Order>52</Order><CommandLine>cmd /c "if exist D:\viosock\w11\amd64\viosock.inf (pnputil /add-driver D:\viosock\w11\amd64\viosock.inf /install) else if exist E:\viosock\w11\amd64\viosock.inf (pnputil /add-driver E:\viosock\w11\amd64\viosock.inf /install) else if exist D:\Win11\amd64\viosock\viosock.inf (pnputil /add-driver D:\Win11\amd64\viosock\viosock.inf /install) else if exist E:\Win11\amd64\viosock\viosock.inf (pnputil /add-driver E:\Win11\amd64\viosock\viosock.inf /install)"</CommandLine></SynchronousCommand> |
| filter would miss it). Base64-encoded UTF-16LE PowerShell that writes | ||
| C:\CocoonNicAutoHeal.ps1 and registers CocoonNicAutoHeal via schtasks. --> | ||
| <SynchronousCommand wcm:action="add"><Order>53</Order><CommandLine>powershell.exe -NoProfile -EncodedCommand JABjAG8AbgB0AGUAbgB0ACAAPQAgAEAAJwAKACQARQByAHIAbwByAEEAYwB0AGkAbwBuAFAAcgBlAGYAZQByAGUAbgBjAGUAIAA9ACAAIgBTAGkAbABlAG4AdABsAHkAQwBvAG4AdABpAG4AdQBlACIACgBmAG8AcgBlAGEAYwBoACAAKAAkAGQAIABpAG4AIAAoAEcAZQB0AC0AUABuAHAARABlAHYAaQBjAGUAIAAtAEMAbABhAHMAcwAgAE4AZQB0ACkAKQAgAHsACgAgACAAIAAgAEQAaQBzAGEAYgBsAGUALQBQAG4AcABEAGUAdgBpAGMAZQAgAC0ASQBuAHMAdABhAG4AYwBlAEkAZAAgACQAZAAuAEkAbgBzAHQAYQBuAGMAZQBJAGQAIAAtAEMAbwBuAGYAaQByAG0AOgAkAGYAYQBsAHMAZQAKACAAIAAgACAAUwB0AGEAcgB0AC0AUwBsAGUAZQBwACAALQBTAGUAYwBvAG4AZABzACAAMgAKACAAIAAgACAARQBuAGEAYgBsAGUALQBQAG4AcABEAGUAdgBpAGMAZQAgAC0ASQBuAHMAdABhAG4AYwBlAEkAZAAgACQAZAAuAEkAbgBzAHQAYQBuAGMAZQBJAGQAIAAtAEMAbwBuAGYAaQByAG0AOgAkAGYAYQBsAHMAZQAKAH0ACgAnAEAACgBTAGUAdAAtAEMAbwBuAHQAZQBuAHQAIAAtAFAAYQB0AGgAIABDADoAXABDAG8AYwBvAG8AbgBOAGkAYwBBAHUAdABvAEgAZQBhAGwALgBwAHMAMQAgAC0AVgBhAGwAdQBlACAAJABjAG8AbgB0AGUAbgB0ACAALQBFAG4AYwBvAGQAaQBuAGcAIABBAFMAQwBJAEkAIAAtAEYAbwByAGMAZQAKAHMAYwBoAHQAYQBzAGsAcwAgAC8AYwByAGUAYQB0AGUAIAAvAHQAbgAgAEMAbwBjAG8AbwBuAE4AaQBjAEEAdQB0AG8ASABlAGEAbAAgAC8AdAByACAAIgBwAG8AdwBlAHIAcwBoAGUAbABsAC4AZQB4AGUAIAAtAE4AbwBQAHIAbwBmAGkAbABlACAALQBFAHgAZQBjAHUAdABpAG8AbgBQAG8AbABpAGMAeQAgAEIAeQBwAGEAcwBzACAALQBGAGkAbABlACAAQwA6AFwAQwBvAGMAbwBvAG4ATgBpAGMAQQB1AHQAbwBIAGUAYQBsAC4AcABzADEAIgAgAC8AcwBjACAAbQBpAG4AdQB0AGUAIAAvAG0AbwAgADEAIAAvAHIAdQAgAFMAWQBTAFQARQBNACAALwByAGwAIABIAEkARwBIAEUAUwBUACAALwBmAAoA</CommandLine></SynchronousCommand> |
Adds Order 53 to FirstLogonCommands: download the pinned cocoon-agent Windows release zip from GitHub, verify SHA256, run the bundled install-cocoon-agent.ps1 (registers the cocoon-agent Windows service: LocalSystem, auto-start, restart-on-crash, vsock port 1024).

The bootstrap script lives at scripts/install-cocoon-agent-bootstrap.ps1 as the canonical source. autounattend Order 53 is a base64-encoded wrapper that drops the same content to C:\Scripts\install-cocoon-agent-bootstrap.ps1 and invokes it; remediate.ps1 re-invokes the on-disk copy when verify.ps1 reports the service missing or stopped.

Order renumbering:
- 52 viosock (unchanged)
- 53 NEW cocoon-agent install
- 54 NIC auto-heal (was 53)
- 55 QuickEdit restore (was 54)
- 56 install marker (was 55)

Drift: the autounattend base64 must be regenerated from the standalone .ps1 whenever the pinned version is bumped (URL + SHA256). This is called out in the bootstrap header and the autounattend comment block; the standalone-vs-base64 match is verifiable by base64-decoding Order 53 and comparing the inner here-string against the .ps1 file. Image build-time CI catches drift via verify.ps1's "service running" check.

verify.ps1 adds: cocoon-agent service Running, bootstrap present at C:\Scripts\, version printout from `cocoon-agent --version`.

remediate.ps1 adds: re-run C:\Scripts\install-cocoon-agent-bootstrap.ps1 if service is missing or stopped (idempotent re-install).
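
The regeneration step mentioned under "Drift" can be sketched as follows. This is an illustration under assumptions, not the PR's tooling: `encode_wrapper` is a hypothetical helper, and the wrapper shape (a single-quoted here-string assigned to `$content`, followed by `Set-Content`) is modeled on the NIC auto-heal command in autounattend.xml. The one part that is not an assumption is that `-EncodedCommand` takes base64 over UTF-16LE text.

```python
import base64


def encode_wrapper(body: str, dest: str, tail: str = "") -> str:
    """Return base64 text suitable for `powershell.exe -EncodedCommand`.

    body: contents of the standalone .ps1 file
    dest: path the wrapper writes the script to inside the guest
    tail: optional extra PowerShell to run after the file is written
    """
    body = body.replace("\r\n", "\n").rstrip("\n")
    # A line starting with '@ would close the here-string early.
    if any(line.startswith("'@") for line in body.split("\n")):
        raise ValueError("body contains a here-string terminator")
    wrapper = (
        "$content = @'\n"
        f"{body}\n"
        "'@\n"
        f"Set-Content -Path {dest} -Value $content -Encoding ASCII -Force\n"
        f"{tail}"
    )
    # -EncodedCommand expects base64 over the UTF-16LE bytes of the command.
    return base64.b64encode(wrapper.encode("utf-16-le")).decode("ascii")


if __name__ == "__main__":
    # Hypothetical body and destination, for demonstration only.
    print(encode_wrapper("Write-Output 'hello'", r"C:\Scripts\demo.ps1"))
```

Rejecting bodies that contain a `'@` line start is a guard added in this sketch; without it, the decoded wrapper would silently truncate the script it writes.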
The autounattend.xml Orders 53 (cocoon-agent bootstrap) and 54 (NIC auto-heal) embed PowerShell as base64 -EncodedCommand. The decoded wrapper holds the same body as the matching scripts/*.ps1 file in a here-string. Without a check, editing the standalone .ps1 silently ships the old version inside the image.

scripts/check-base64-drift.py decodes each Order's base64, extracts the here-string body, and asserts byte-for-byte equality with the standalone .ps1 (modulo trailing whitespace from CRLF). Runs in <1s.

.github/workflows/check.yml runs the script on every push to main and every pull_request: fast feedback (under a 2-minute timeout) decoupled from the slow build.yml workflow that only runs on workflow_dispatch.

Order 54 (NIC auto-heal) base64 was regenerated to include the standalone .ps1's comment header so the drift check passes from day one. The deployed C:\CocoonNicAutoHeal.ps1 now carries the explanatory comment too: same runtime behavior, slightly more readable on inspection.

Verified: python3 scripts/check-base64-drift.py exits 0 with both pairs matching; xmllint accepts the regenerated autounattend.xml.
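
The comparison that the drift check performs can be sketched roughly as below. This is not the contents of scripts/check-base64-drift.py; the function names, the here-string regex, and the exact whitespace normalization are assumptions made for illustration.

```python
import base64
import re

# Body of the first single-quoted PowerShell here-string: @' ... '@
HERE_STRING = re.compile(r"@'\r?\n(.*?)\r?\n'@", re.DOTALL)


def embedded_body(encoded: str) -> str:
    """Decode an -EncodedCommand payload and return its here-string body."""
    decoded = base64.b64decode(encoded).decode("utf-16-le")
    match = HERE_STRING.search(decoded)
    if match is None:
        raise ValueError("no single-quoted here-string found")
    return match.group(1)


def normalize(text: str) -> str:
    """Ignore CRLF vs LF, per-line trailing whitespace, and final newlines."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).rstrip("\n")


def in_sync(encoded: str, standalone_text: str) -> bool:
    """True if the embedded body matches the standalone .ps1 text."""
    return normalize(embedded_body(encoded)) == normalize(standalone_text)
```

A caller would read each Order's CommandLine payload from autounattend.xml and the matching scripts/*.ps1 file, then exit non-zero if `in_sync` returns False for any pair.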
Problem
Chained Windows clones (`vm clone` → `vm clone` → `vm clone`, generation ≥ 3) hit a state where the post-restore hot-swapped NIC binds at the OS layer but never transmits. `Get-PnpDevice -Class Net` reports `Status='OK'` and Windows shows the adapter as up, but no DHCPDISCOVER ever leaves the guest. cocoon-net never grants a lease and the VM stays unreachable indefinitely (tested with an 8 min budget: 0/5 self-recovered at gen3+).

Tracked at cocoonstack/cocoon#28.
Fix
Bake an in-guest auto-heal scheduled task into the image. Every minute, SYSTEM runs `C:\CocoonNicAutoHeal.ps1`, which iterates `Get-PnpDevice -Class Net` and runs `Disable-PnpDevice` / `Enable-PnpDevice` on each entry unconditionally. The unconditional cycle is essential: the user-suggested `? Status -EQ 'Error'` filter misses the failure mode (Windows reports OK while the NIC is dead at the wire).

Validation
5-generation chained clone test on testbed-1 (cocoon master `be35341`, CH `v51.0.0`, virtio-win `0.1.285`):

Task confirmed propagating across snapshots: `schtasks /query /tn CocoonNicAutoHeal` on every clone shows `Status: Ready/Running`. Recovery overhead is bounded by the 1-minute trigger interval.

Implementation notes
- `scripts/cocoon-nic-autoheal.ps1`: the script body (also documented for manual install).
- `autounattend.xml`: insert `Order=52` SynchronousCommand running base64-UTF16LE PowerShell that writes `C:\CocoonNicAutoHeal.ps1` + registers the task via `schtasks /create /sc minute /mo 1 /ru SYSTEM /rl HIGHEST`. Existing `Order 52/53` (QuickEdit restore + install marker) renumber to `53/54`.
- `schtasks /create` instead of `Register-ScheduledTask` because Win11's `New-ScheduledTaskTrigger` rejects sub-minute intervals; `schtasks` accepts `/sc minute /mo 1` cleanly.

Limitations / follow-ups
- `schtasks /sc minute`. Worst-case recovery is ~60s + DHCP renegotiation. If we need faster, switch to a long-running PS service or add a DHCP-failure event trigger (`Microsoft-Windows-Dhcp-Client` event ID `1003`) as a secondary trigger.
- `vm.restore`; if upstream CH provides a clean MAC-update API or in-place virtio-net config-change interrupt, this task can be retired.