fix(e2e): unblock Windows sysprep when VMAgentDisabler.dll load stalls #8544

Merged
r2k1 merged 8 commits into main from akhantimirov/fix-windows-sysprep-vmagentdisabler-flake on May 28, 2026

@r2k1 r2k1 commented May 20, 2026

What this PR does / why we need it:

Test_Windows2022_VHDCaching intermittently flakes when the Sysprep RunCommand on the test VM stalls for ~14 minutes (observed in build 164698615 on PR #8535) and the test fails with context deadline exceeded. Root cause: a broken HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Setup\SysPrepExternal\Generalize provider entry points at C:\Windows\system32\VMAgentDisabler.dll — the Windows Azure Guest Agent that ships the DLL is missing on the test images, so Sysprep blocks waiting on the load.

vhdbuilder/packer/windows/sysprep.ps1 has stripped that registry entry since 2020 (PR #429) for the production VHD-bake path. The e2e CreateImage helper was added later (PR #4631) and never inherited the workaround — it invokes Sysprep directly via RunCommand. This PR brings the e2e path to parity.

This PR also migrates RunCommand from v1 (VMSSVM.BeginRunCommand) to v2 (VMSSVMRunCommands.BeginCreateOrUpdate) to avoid the v1 extension's 'Keyset does not exist' failure on newer Windows hosts. The two call sites in validators.go are updated to match.

Copilot AI left a comment

Pull request overview

This PR updates the e2e harness to mitigate a recurring Windows2022 VHDCaching flake where Sysprep /generalize can hang when the SysPrepExternal\Generalize registry points at VMAgentDisabler.dll, and modernizes the test harness to use the VMSS RunCommand v2 API surface for script execution.

Changes:

  • Introduces a VMSS RunCommand v2 wrapper that uses VirtualMachineRunCommand (v2) and fetches the instanceView for stdout/stderr.
  • Adds a Windows sysprep script that removes SysPrepExternal\Generalize entries referencing VMAgentDisabler.dll and polls ImageState until generalize completion.
  • Refactors Linux SSH-related validators to consume stdout/stderr directly from the new RunCommand wrapper instead of marshaling full JSON.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| e2e/test_helpers.go | Adds RunCommand v2 wrapper and a Windows sysprep script with registry cleanup + ImageState polling; updates CreateImage to use it. |
| e2e/validators.go | Refactors validator RunCommand call sites to use the new wrapper and parse stdout/stderr directly. |
Comments suppressed due to low confidence (1)

e2e/test_helpers.go:733

  • CreateImage only checks err from RunCommand; with RunCommand v2 the ARM operation can succeed even when the guest script fails (non-zero exit code / error output). Please fail fast here based on the runcommand instance view result (exit code/execution state), otherwise the test may proceed to capture a non-generalized disk and produce confusing downstream failures.
		if stderr != "" {
			s.T.Logf("Sysprep stderr: %s", stderr)
		}
		require.NoErrorf(s.T, err, "failed to run sysprep on Windows VM for image creation")
	}

Copilot AI review requested due to automatic review settings May 21, 2026 00:40

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings May 24, 2026 21:56
@r2k1 r2k1 force-pushed the akhantimirov/fix-windows-sysprep-vmagentdisabler-flake branch from 4c3d7dd to cda17e1 on May 24, 2026 23:14

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@r2k1 r2k1 force-pushed the akhantimirov/fix-windows-sysprep-vmagentdisabler-flake branch from 132d157 to a54b363 on May 24, 2026 23:31

@timmy-wright timmy-wright left a comment

LGTM.

Copilot AI review requested due to automatic review settings May 25, 2026 01:03

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@r2k1 r2k1 enabled auto-merge (squash) May 25, 2026 01:08
@r2k1 r2k1 force-pushed the akhantimirov/fix-windows-sysprep-vmagentdisabler-flake branch from 97ddec5 to c67a743 on May 26, 2026 20:14
r2k1 and others added 8 commits May 27, 2026 08:14
Windows2022 VHDCaching scenarios have been failing at the Sysprep
/generalize step in PR check-in runs since ~May 9 2026. The Sysprep
RunCommand never completes within the test's vmssCtx budget
(TestTimeoutVMSS - prepareAKSNode time, ~14m), and the validation step
fails with 'context deadline exceeded'.

Root cause: VMAgentDisabler.dll is a Sysprep provider shipped by the
Windows Azure Guest Agent. The agent self-updates from Azure fabric on
every boot, and in Jan 2026 added a WDAC catalog file install feature
(msazure ADO PR 14499782) for the DLL. The feature had bugs (hotfixes
14889344 / 14901019) and rolled out unevenly Feb-May 2026. On hosts
where the catalog install failed, Code Integrity cannot validate the
DLL and LoadLibrary stalls long enough to exhaust our test timeout.
This matches a 2020 incident (ICM 210726081) — the existing
vhdbuilder/packer/windows/sysprep.ps1 already has the same workaround
during VHD bake.

Causal proof: on a healthy Win2022 host where sysprep normally
completes in ~10s, renaming VMAgentDisabler.dll while leaving the
SysPrepExternal\Generalize registry entry intact reproduces the
stall.

Fix (e2e/test_helpers.go):
- New windowsSysprepScript that removes any SysPrepExternal\Generalize
  registry value pointing at VMAgentDisabler.dll before invoking
  Sysprep, then polls ImageState until generalization completes.
- Replaces the inline sysprep invocation in CreateImage; reads
  res.Output / res.Error instead of marshaling JSON.

Migrate RunCommand from v1 (VMSSVM.BeginRunCommand) to v2
(VMSSVMRunCommands.BeginCreateOrUpdate). v2 is the supported path
going forward and matches the migration done in aks-rp PR 15721814 to
avoid the 'Keyset does not exist' failure mode of the v1 extension on
newer Windows hosts. Two call sites in validators.go refactored to use
the new wrapper.

Verified: Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes
end-to-end in ~9m36s with sysprep completing in ~1m, versus hanging for
the full vmssCtx budget on broken hosts before this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the poll wrote a line every 10s for up to 10 min (~60 lines).
Log only when ImageState changes — typically 2-3 lines for a normal
sysprep run — to stay well under RunCommand's stdout cap and keep the
test log readable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
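
The change-only logging described in the commit message above can be sketched in Go. This is an illustration only: the PR's poll runs in PowerShell inside the guest, and the function name `changedStates` and the sample sequence of readings are assumptions, not code from the PR.

```go
package main

import "fmt"

// changedStates returns only the observations that differ from the one
// before it, which is what a "log on change" poll would print.
func changedStates(observed []string) []string {
	var out []string
	last := ""
	for i, s := range observed {
		if i == 0 || s != last {
			out = append(out, s)
		}
		last = s
	}
	return out
}

func main() {
	// Hypothetical sequence of ImageState readings taken at a fixed interval.
	polls := []string{
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_UNDEPLOYABLE",
		"IMAGE_STATE_UNDEPLOYABLE",
		"IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE",
	}
	for _, s := range changedStates(polls) {
		fmt.Println("ImageState:", s)
	}
}
```

With this shape the number of log lines depends on the number of state transitions rather than on how long the poll runs.
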
The ARM CreateOrUpdate operation reports success when the RunCommand
extension successfully runs the script, regardless of whether the script
itself succeeded. A non-zero exit, PowerShell throw, or timeout inside
the script only shows up in InstanceView.ExecutionState / ExitCode (per
https://learn.microsoft.com/en-us/azure/virtual-machines/windows/run-command-managed).

Without this check the helper returns nil err on a failed script, and
callers like CreateImage proceed to capture a non-generalized VM — the
exact silent-failure mode our sysprep poll throw was designed to catch.

Return a descriptive error including ExecutionState / ExitCode / stdout
/ stderr so require.NoError fails with actionable info.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
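
A minimal Go sketch of the check described in the commit message above: treat the ARM operation's success as insufficient and inspect the script's own result. The `instanceView` struct is a simplified, hypothetical stand-in for the SDK's instance view type, and `checkScriptResult` is an illustrative name; neither is taken from the PR.

```go
package main

import "fmt"

// instanceView is a hypothetical, simplified stand-in for the RunCommand v2
// instance view. It carries only the fields the commit message mentions.
type instanceView struct {
	ExecutionState string
	ExitCode       int32
	Output         string
	Error          string
}

// checkScriptResult returns nil only when the guest script itself succeeded.
// Any other state or a non-zero exit code yields a descriptive error.
func checkScriptResult(v instanceView) error {
	if v.ExecutionState == "Succeeded" && v.ExitCode == 0 {
		return nil
	}
	return fmt.Errorf("run command script failed: state=%s exitCode=%d stdout=%q stderr=%q",
		v.ExecutionState, v.ExitCode, v.Output, v.Error)
}

func main() {
	ok := instanceView{ExecutionState: "Succeeded", ExitCode: 0, Output: "done"}
	bad := instanceView{ExecutionState: "Failed", ExitCode: 1, Error: "sysprep did not generalize"}
	fmt.Println(checkScriptResult(ok))
	fmt.Println(checkScriptResult(bad))
}
```

Putting stdout and stderr into the error text means a failing `require.NoError` shows what the script printed without a separate log lookup.
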
VirtualMachineRunCommand resources persist on the VM after CreateOrUpdate,
so the previous code accumulated them across e2e calls. Add a best-effort
BeginDelete in defer with a fresh 2-minute context so cleanup runs even if
the caller's ctx is cancelled.
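
The deferred, best-effort cleanup can be sketched as below. The `deleter` function type stands in for the real delete call, and `runAndCleanup` and `simulateCancelledCaller` are hypothetical names chosen for this example; only the 2-minute fresh context comes from the commit message.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleter is a hypothetical stand-in for the client call that removes the
// RunCommand resource.
type deleter func(ctx context.Context, name string) error

// runAndCleanup runs work under the caller's ctx, then always attempts cleanup
// under a fresh context so a cancelled caller ctx does not skip the delete.
func runAndCleanup(ctx context.Context, name string, work func(context.Context) error, del deleter) error {
	defer func() {
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		defer cancel()
		if err := del(cleanupCtx, name); err != nil {
			// Best effort: report the problem without masking the work error.
			fmt.Printf("warning: failed to delete run command %q: %v\n", name, err)
		}
	}()
	return work(ctx)
}

// simulateCancelledCaller exercises runAndCleanup with an already-cancelled
// caller context and reports whether the cleanup context was still usable.
func simulateCancelledCaller() (cleanupUsable bool, workErr error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	workErr = runAndCleanup(ctx, "sysprep",
		func(ctx context.Context) error { return ctx.Err() },
		func(ctx context.Context, name string) error {
			cleanupUsable = ctx.Err() == nil
			return nil
		})
	return cleanupUsable, workErr
}

func main() {
	usable, err := simulateCancelledCaller()
	fmt.Println("cleanup context usable:", usable)
	fmt.Println("work error:", err)
}
```

Deriving the cleanup context from `context.Background()` rather than from the caller's context is what keeps the delete from being cancelled along with the test.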

Also stop logging the full script body — multi-line scripts (like
windowsSysprepScript) flooded the log and could surface secrets if a future
caller embedded them. Log a quoted first line plus a byte count instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
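
The log-summary idea from the commit message above (a quoted first line plus a byte count instead of the full script body) could look roughly like this in Go. `summarizeScript` is a hypothetical helper name and the sample script is made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// summarizeScript returns the quoted first line and the total byte count,
// so logs identify the script without echoing its full body.
func summarizeScript(script string) string {
	firstLine, _, _ := strings.Cut(strings.TrimSpace(script), "\n")
	return fmt.Sprintf("%q (%d bytes)", strings.TrimSpace(firstLine), len(script))
}

func main() {
	// Hypothetical multi-line script body.
	script := "$ErrorActionPreference = 'Stop'\nWrite-Output 'step one'\nWrite-Output 'step two'\n"
	fmt.Println("running script:", summarizeScript(script))
}
```

Quoting with `%q` keeps control characters and stray quotes in the first line from corrupting the log line.
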
Sync windowsSysprepScript with the docs-grounded version applied to the
aks-rp PIS bake path:

- Add C:\Windows\Panther cleanup (per Azure generalize doc: stale
  Panther logs can cause Sysprep to fail).
- Use -like instead of .Contains (case-insensitive, REG_MULTI_SZ-safe).
- exit $LASTEXITCODE so RunCommand surfaces sysprep failures.
- Drop ImageState poll: /quit waits for sysprep to finish per Microsoft's
  Sysprep command-line options doc ("Closes the Sysprep tool without
  rebooting or shutting down the computer after Sysprep runs the
  specified commands").

Refs:
- https://learn.microsoft.com/en-us/azure/virtual-machines/generalize
- https://learn.microsoft.com/en-us/windows-hardware/manufacture/desktop/sysprep-command-line-options

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
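
To illustrate why the commit above prefers a case-insensitive match that also tolerates multi-string data, here is a Go sketch of the same selection logic. The PR does this in PowerShell with `-like`; the function `valuesToRemove`, the value names, and the entry-point strings are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const blockedDLL = "vmagentdisabler.dll"

// valuesToRemove returns the names of registry values whose data references
// the DLL. Each value is modelled as a slice so that multi-string data
// (REG_MULTI_SZ) and single-string data are handled the same way.
func valuesToRemove(values map[string][]string) []string {
	var names []string
	for name, data := range values {
		for _, entry := range data {
			// Lower-case before comparing so path casing does not matter.
			if strings.Contains(strings.ToLower(entry), blockedDLL) {
				names = append(names, name)
				break
			}
		}
	}
	sort.Strings(names) // deterministic order for logging and tests
	return names
}

func main() {
	// Hypothetical contents of the SysPrepExternal\Generalize key.
	values := map[string][]string{
		"{guid-1}": {`C:\Windows\System32\VMAGENTDISABLER.DLL,SomeEntryPoint`},
		"{guid-2}": {`C:\Windows\System32\other.dll,Generalize`},
		"{guid-3}": {`first.dll,Entry`, `c:\windows\system32\VMAgentDisabler.dll,SomeEntryPoint`},
	}
	fmt.Println(valuesToRemove(values))
}
```

A case-sensitive substring check would miss the first and third entries if the stored path casing differed from the literal being searched for.
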
Two issues raised in rubber-duck review:

1. Dropping the ImageState poll in the previous revision was unsafe. Sysprep
   /quit on Server 2022 can return before the background SetupHost.exe
   finishes generalizing; deallocating before the registry transitions
   to IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE races sysprep and can
   capture a partially-generalized disk. The same poll has lived in
   vhdbuilder/packer/windows/sysprep.ps1 since 2020 (PR #429).

2. $ErrorActionPreference = 'Stop' is set at the top, but the registry
   cleanup loop uses Get-Item / Remove-ItemProperty without per-cmdlet
   overrides. A transient access error there would have terminated the
   script before Sysprep.exe ever ran. Wrap the cleanup in try/catch
   and log a warning so a registry hiccup doesn't block the bake.

Also tightened the exit-code check (throw instead of "exit $LASTEXITCODE")
so a sysprep non-zero exit fails the RunCommand v2 instance view.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
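
The restored poll from the commit message above can be sketched in Go as a bounded wait for the target state. The real poll lives in the PowerShell script and reads the registry; here `waitForGeneralized` and `scriptedReader` are hypothetical names, the reader is a fake, and treating read errors as transient is an assumption of this sketch rather than documented behaviour of the PR.

```go
package main

import (
	"errors"
	"fmt"
)

const generalizedState = "IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE"

// waitForGeneralized calls readState up to maxAttempts times and returns nil
// once the generalized state is seen. sleep is injected so callers control the
// delay between attempts; a nil sleep means no delay.
func waitForGeneralized(readState func() (string, error), maxAttempts int, sleep func()) error {
	last := ""
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		state, err := readState()
		if err != nil {
			// Assumption: a failed read is transient, so keep polling.
			last = "read error: " + err.Error()
		} else {
			last = state
			if state == generalizedState {
				return nil
			}
		}
		if sleep != nil && attempt < maxAttempts {
			sleep()
		}
	}
	return fmt.Errorf("ImageState did not reach %s after %d attempts (last: %s)",
		generalizedState, maxAttempts, last)
}

// scriptedReader replays the given states in order and then keeps returning
// the final one; it stands in for the registry read in this example.
func scriptedReader(states ...string) func() (string, error) {
	i := 0
	return func() (string, error) {
		if len(states) == 0 {
			return "", errors.New("no state available")
		}
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s, nil
	}
}

func main() {
	healthy := scriptedReader("IMAGE_STATE_COMPLETE", "IMAGE_STATE_UNDEPLOYABLE", generalizedState)
	fmt.Println(waitForGeneralized(healthy, 5, nil))

	stuck := scriptedReader("IMAGE_STATE_UNDEPLOYABLE")
	fmt.Println(waitForGeneralized(stuck, 3, nil))
}
```

Returning an error when the attempts run out, instead of returning silently, is what lets the caller refuse to capture a partially generalized disk.
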
Extends sysprep coverage to Windows Server 2025 — previously only
Windows 2022 exercised the sysprep path via VHDCaching, leaving the
newer OS untested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@r2k1 r2k1 force-pushed the akhantimirov/fix-windows-sysprep-vmagentdisabler-flake branch from c67a743 to cdf6503 on May 26, 2026 20:15
@r2k1 r2k1 disabled auto-merge May 28, 2026 08:19
@r2k1 r2k1 merged commit 20f849c into main May 28, 2026
26 of 30 checks passed
@r2k1 r2k1 deleted the akhantimirov/fix-windows-sysprep-vmagentdisabler-flake branch May 28, 2026 08:19