fix(e2e): unblock Windows sysprep when VMAgentDisabler.dll load stalls #8544
Merged
r2k1 merged 8 commits on May 28, 2026
Pull request overview
This PR updates the e2e harness to mitigate a recurring Windows2022 VHDCaching flake where Sysprep /generalize can hang when the SysPrepExternal\Generalize registry points at VMAgentDisabler.dll, and modernizes the test harness to use the VMSS RunCommand v2 API surface for script execution.
Changes:
- Introduces a VMSS RunCommand v2 wrapper that uses `VirtualMachineRunCommand` (v2) and fetches the `instanceView` for stdout/stderr.
- Adds a Windows sysprep script that removes `SysPrepExternal\Generalize` entries referencing `VMAgentDisabler.dll` and polls `ImageState` until generalize completion.
- Refactors Linux SSH-related validators to consume `stdout`/`stderr` directly from the new RunCommand wrapper instead of marshaling full JSON.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| e2e/test_helpers.go | Adds RunCommand v2 wrapper and a Windows sysprep script with registry cleanup + ImageState polling; updates CreateImage to use it. |
| e2e/validators.go | Refactors validator RunCommand call sites to use the new wrapper and parse stdout/stderr directly. |
Comments suppressed due to low confidence (1)
e2e/test_helpers.go:733
- CreateImage only checks `err` from RunCommand; with RunCommand v2 the ARM operation can succeed even when the guest script fails (non-zero exit code / error output). Please fail fast here based on the RunCommand instance view result (exit code / execution state); otherwise the test may proceed to capture a non-generalized disk and produce confusing downstream failures.
if stderr != "" {
s.T.Logf("Sysprep stderr: %s", stderr)
}
require.NoErrorf(s.T, err, "failed to run sysprep on Windows VM for image creation")
}
Windows2022 VHDCaching scenarios have been failing at the Sysprep /generalize step in PR check-in runs since ~May 9 2026. The Sysprep RunCommand never completes within the test's vmssCtx budget (TestTimeoutVMSS - prepareAKSNode time, ~14m), and the validation step fails with 'context deadline exceeded'.

Root cause: VMAgentDisabler.dll is a Sysprep provider shipped by the Windows Azure Guest Agent. The agent self-updates from Azure fabric on every boot, and in Jan 2026 added a WDAC catalog file install feature (msazure ADO PR 14499782) for the DLL. The feature had bugs (hotfixes 14889344 / 14901019) and rolled out unevenly Feb-May 2026. On hosts where the catalog install failed, Code Integrity cannot validate the DLL and LoadLibrary stalls long enough to exhaust our test timeout. This matches a 2020 incident (ICM 210726081); the existing vhdbuilder/packer/windows/sysprep.ps1 already has the same workaround during VHD bake.

Causal proof: on a healthy Win2022 host where sysprep normally completes in ~10s, renaming VMAgentDisabler.dll while leaving the SysPrepExternal\Generalize registry entry intact reproduces the stall.

Fix (e2e/test_helpers.go):
- New windowsSysprepScript that removes any SysPrepExternal\Generalize registry value pointing at VMAgentDisabler.dll before invoking Sysprep, then polls ImageState until generalization completes.
- Replaces the inline sysprep invocation in CreateImage; reads res.Output / res.Error instead of marshaling JSON.

Migrate RunCommand from v1 (VMSSVM.BeginRunCommand) to v2 (VMSSVMRunCommands.BeginCreateOrUpdate). v2 is the supported path going forward and matches the migration done in aks-rp PR 15721814 to avoid the 'Keyset does not exist' failure mode of the v1 extension on newer Windows hosts. Two call sites in validators.go refactored to use the new wrapper.

Verified: Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes end-to-end in ~9m36s with sysprep completing in ~1m, vs hanging out the full vmssCtx on broken hosts before this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the poll wrote a line every 10s for up to 10 min (~60 lines). Log only when ImageState changes (typically 2-3 lines for a normal sysprep run) to stay well under RunCommand's stdout cap and keep the test log readable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
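The log-on-change idea can be illustrated with a small Go sketch. This is not the PowerShell poll from the PR. The function name `waitForImageState`, its parameters, and the injected `read` / `logf` callbacks are all hypothetical, chosen only so the loop runs without a VM.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// generalizedState is the ImageState value the commit messages wait for.
const generalizedState = "IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE"

// waitForImageState polls read until it returns want or maxPolls is reached.
// It calls logf only when the observed state differs from the previous one,
// so a long wait produces a handful of lines instead of one per poll.
// (Hypothetical helper for illustration; not part of the PR.)
func waitForImageState(read func() string, want string, interval time.Duration, maxPolls int, logf func(string)) error {
	last := ""
	for i := 0; i < maxPolls; i++ {
		state := read()
		if state != last {
			logf(fmt.Sprintf("ImageState: %s", state))
			last = state
		}
		if state == want {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for ImageState " + want)
}

func main() {
	// Simulated sequence of registry reads (made-up data).
	states := []string{"IMAGE_STATE_COMPLETE", "IMAGE_STATE_COMPLETE", "IMAGE_STATE_UNDEPLOYABLE", generalizedState}
	i := 0
	read := func() string { s := states[i]; i++; return s }
	err := waitForImageState(read, generalizedState, 0, len(states), func(m string) { fmt.Println(m) })
	fmt.Println("err:", err)
}
```

With the simulated sequence in `main`, four polls produce three log lines because the repeated state is suppressed.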
The ARM CreateOrUpdate operation reports success when the RunCommand extension successfully runs the script, regardless of whether the script itself succeeded. A non-zero exit, PowerShell throw, or timeout inside the script only shows up in InstanceView.ExecutionState / ExitCode (per https://learn.microsoft.com/en-us/azure/virtual-machines/windows/run-command-managed).

Without this check the helper returns nil err on a failed script, and callers like CreateImage proceed to capture a non-generalized VM, which is the exact silent-failure mode our sysprep poll throw was designed to catch. Return a descriptive error including ExecutionState / ExitCode / stdout / stderr so require.NoError fails with actionable info.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
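A minimal Go sketch of such a check follows. It is not the code from the PR. The struct `runCommandResult` is a hypothetical, trimmed-down stand-in for the instance view fields named above, and treating `"Succeeded"` with exit code 0 as the only passing combination is an assumption about how the helper decides.

```go
package main

import (
	"errors"
	"fmt"
)

// runCommandResult is a hypothetical stand-in for the instance view fields
// the commit message mentions; the real SDK type is not reproduced here.
type runCommandResult struct {
	ExecutionState string // e.g. "Succeeded", "Failed", "TimedOut"
	ExitCode       int32
	Output         string // script stdout
	Error          string // script stderr
}

// checkRunCommandResult returns nil only when the guest script itself
// succeeded. A successful ARM operation alone is not enough.
func checkRunCommandResult(res *runCommandResult) error {
	if res == nil {
		return errors.New("run command returned no instance view")
	}
	if res.ExecutionState == "Succeeded" && res.ExitCode == 0 {
		return nil
	}
	return fmt.Errorf("run command script failed: executionState=%q exitCode=%d stdout=%q stderr=%q",
		res.ExecutionState, res.ExitCode, res.Output, res.Error)
}

func main() {
	ok := &runCommandResult{ExecutionState: "Succeeded"}
	bad := &runCommandResult{ExecutionState: "Failed", ExitCode: 1, Error: "sysprep did not generalize"}
	fmt.Println(checkRunCommandResult(ok))
	fmt.Println(checkRunCommandResult(bad))
}
```

Putting the state, exit code, and both streams into the error string means a `require.NoError` failure shows the guest-side cause directly.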
VirtualMachineRunCommand resources persist on the VM after CreateOrUpdate, so the previous code accumulated them across e2e calls. Add a best-effort BeginDelete in defer with a fresh 2-minute context so cleanup runs even if the caller's ctx is cancelled.

Also stop logging the full script body: multi-line scripts (like windowsSysprepScript) flooded the log and could surface secrets if a future caller embedded them. Log a quoted first line plus a byte count instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
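Both ideas in this commit can be sketched in a few lines of Go. The helper names `summarizeScript` and `cleanupWithFreshContext` are hypothetical, and the delete callback stands in for the SDK's delete call. The PR's actual code may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// summarizeScript returns a quoted first line plus the total byte count,
// so logs identify the script without printing its full body.
func summarizeScript(script string) string {
	firstLine, _, _ := strings.Cut(strings.TrimSpace(script), "\n")
	firstLine = strings.TrimRight(firstLine, "\r")
	return fmt.Sprintf("%q (%d bytes)", firstLine, len(script))
}

// cleanupWithFreshContext runs del with a context derived from
// context.Background(), so cleanup still gets a live context when the
// caller's own context has already been cancelled or has expired.
func cleanupWithFreshContext(timeout time.Duration, del func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return del(ctx)
}

func main() {
	fmt.Println(summarizeScript("$ErrorActionPreference = 'Stop'\nWrite-Output 'hi'\n"))

	// The callback here only reports whether its context is still usable;
	// a real caller would issue the delete request instead.
	err := cleanupWithFreshContext(2*time.Minute, func(ctx context.Context) error {
		return ctx.Err()
	})
	fmt.Println("cleanup err:", err)
}
```

In a helper this would typically sit inside a `defer`, with any cleanup error logged rather than returned, which matches the "best-effort" wording above.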
Sync windowsSysprepScript with the docs-grounded version applied to the
aks-rp PIS bake path:
- Add C:\Windows\Panther cleanup (per Azure generalize doc: stale
Panther logs can cause Sysprep to fail).
- Use -like instead of .Contains (case-insensitive, REG_MULTI_SZ-safe).
- exit $LASTEXITCODE so RunCommand surfaces sysprep failures.
- Drop ImageState poll: /quit waits for sysprep to finish per Microsoft's
Sysprep command-line options doc ("Closes the Sysprep tool without
rebooting or shutting down the computer after Sysprep runs the
specified commands").
Refs:
- https://learn.microsoft.com/en-us/azure/virtual-machines/generalize
- https://learn.microsoft.com/en-us/windows-hardware/manufacture/desktop/sysprep-command-line-options
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
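The matching semantics behind the `-like` change (case-insensitive, and applied to each string of a multi-string value) can be sketched in Go for readers less familiar with PowerShell operators. The function `referencesDLL` and the sample values are hypothetical, and this is not a translation of the actual script.

```go
package main

import (
	"fmt"
	"strings"
)

// referencesDLL reports whether any string in a (possibly multi-string)
// registry value mentions dll, ignoring case.
func referencesDLL(values []string, dll string) bool {
	needle := strings.ToLower(dll)
	for _, v := range values {
		if strings.Contains(strings.ToLower(v), needle) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample data: one entry differs from the DLL name only by case.
	multi := []string{`C:\Windows\System32\other.dll`, `C:\WINDOWS\SYSTEM32\VMAGENTDISABLER.DLL`}
	fmt.Println(referencesDLL(multi, "VMAgentDisabler.dll"))
	fmt.Println(referencesDLL([]string{`C:\Windows\System32\other.dll`}, "VMAgentDisabler.dll"))
}
```

A case-sensitive substring test on a single string would miss the second sample entry, which is the gap the commit message describes.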
Two issues raised in rubber-duck review:

1. ImageState poll dropped from previous revision was unsafe. Sysprep /quit on Server 2022 can return before the background SetupHost.exe finishes generalizing; deallocating before the registry transitions to IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE races sysprep and can capture a partially-generalized disk. The same poll has lived in vhdbuilder/packer/windows/sysprep.ps1 since 2020 (PR #429).
2. $ErrorActionPreference = 'Stop' is set at the top, but the registry cleanup loop uses Get-Item / Remove-ItemProperty without per-cmdlet overrides. A transient access error there would have terminated the script before Sysprep.exe ever ran. Wrap the cleanup in try/catch and log a warning so a registry hiccup doesn't block the bake.

Also tightened the exit-code check (throw instead of "exit $LASTEXITCODE") so a sysprep non-zero exit fails the RunCommand v2 instance view.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extends sysprep coverage to Windows Server 2025. Previously only Windows 2022 exercised the sysprep path via VHDCaching, leaving the newer OS untested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
timmy-wright approved these changes on May 28, 2026
What this PR does / why we need it:
`Test_Windows2022_VHDCaching` intermittently flakes when the Sysprep RunCommand on the test VM stalls for ~14 minutes (observed in build 164698615 on PR #8535) and the test fails with `context deadline exceeded`.

Root cause: a broken `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Setup\SysPrepExternal\Generalize` provider entry points at `C:\Windows\system32\VMAgentDisabler.dll`; the Windows Azure Guest Agent that ships the DLL is missing on the test images, so Sysprep blocks waiting on the load.

`vhdbuilder/packer/windows/sysprep.ps1` has stripped that registry entry since 2020 (PR #429) for the production VHD-bake path. The e2e `CreateImage` helper was added later (PR #4631) and never inherited the workaround; it invokes Sysprep directly via RunCommand. This PR brings the e2e path to parity.

Also migrates RunCommand from v1 (`VMSSVM.BeginRunCommand`) to v2 (`VMSSVMRunCommands.BeginCreateOrUpdate`) to avoid the v1 extension's `Keyset does not exist` failure on newer Windows hosts. Two call sites in `validators.go` updated.