fix(test-plans): mitigate scheduled e2e-autotest flakiness #1622
Merged
Triage of the last 8 scheduled e2e-autotest runs identified three failure categories: a real plan bug, LLM screenshot-based false downgrades, and real timing flakes. This change addresses all three.

Category A: real plan bug
* java-pack-help-center-webview was missing vscjava.vscode-java-pack from setup.extensions. On scheduled runs (no PR VSIX) java.welcome was unregistered and the open-help-center step silently timed out. This was the #1 failure across the last 8 nightly runs (7/8). Now installs the pack from the marketplace on schedule runs while still letting --vsix override on PR runs.

Category B: LLM downgrade noise on ls-ready
* Add skipLlmVerify: true (introduced in @vscjava/vscode-autotest 0.7.5) to every ls-ready step that has no structured verify* field. The waitForLanguageServer action is itself the authoritative deterministic check; the LLM was downgrading these whenever the status bar still showed background indexing ("Java: Searching... 0%"), even though the LS was fully functional. Affected: java-dependency-viewer, java-extension-pack, java-fresh-import, java-maven-resolve-type, java-maven, java-new-file-snippet, java-single-file, java-webview-migration.

Category C: real timing flakes
* java-test-runner: bump wait-test-discovery from 45s to 90s (the vscode-java-test discovery scan can take longer than 45s on a cold cache) and add retries: 1 to run-all-tests so a discovery-still-warming first invocation can retry.
* java-maven-resolve-type: add retries: 1 to save-after-resolve so a slow Maven re-import on a cold cache (where the LS hasn't yet republished zero-errors at the time of save) can retry instead of failing the plan.

Plans whose flaky steps already carry a structured verify* field (e.g. verify-completion with verifyCompletion: { notEmpty: true }, save-after-organize with verifyFile, verify-help-center-content with verifyWebview) no longer need plan changes because the framework auto-skip in @vscjava/vscode-autotest 0.7.5 already short-circuits the LLM re-check whenever any structured verifier is present.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
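The `retries: 1` setting in Category C can be pictured with a small sketch. This is not code from @vscjava/vscode-autotest: `runWithRetries`, `StepResult`, and `coldThenWarm` are hypothetical names, and the sketch assumes that `retries: 1` means one extra attempt after the first failure.

```typescript
// Hypothetical sketch; none of these names come from @vscjava/vscode-autotest.
type StepResult = "pass" | "fail";

// Run `attempt` once, then up to `retries` more times while it keeps failing.
function runWithRetries(
  attempt: (n: number) => StepResult,
  retries: number
): { result: StepResult; attempts: number } {
  let attempts = 0;
  let result: StepResult = "fail";
  while (attempts <= retries) {
    result = attempt(attempts);
    attempts += 1;
    if (result === "pass") break; // stop as soon as the step passes
  }
  return { result, attempts };
}

// Simulated step: fails on a cold cache (first call), passes once warmed up.
const coldThenWarm = (n: number): StepResult => (n === 0 ? "fail" : "pass");

const withRetry = runWithRetries(coldThenWarm, 1);
const withoutRetry = runWithRetries(coldThenWarm, 0);
console.log(withRetry, withoutRetry);
```

Under that assumption, a step that only fails while the cache is cold passes on the second attempt, while a step that fails for a real reason still fails after both attempts.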
…d-cache flakes
Reverts the over-broad framework auto-skip (any structured verify -> no LLM)
that landed in autotest v0.7.5/0.7.6. LLM screenshot verification is the
anti-silent-pass safety net and must stay enabled on steps where the screenshot
carries unique signal (popup visibility, decoration lag, panel content).
Final policy:
- skipLlmVerify=true on Group 1 (16 ls-ready steps): waitForLanguageServer
polls the same status bar text the LLM would read, so LLM adds zero signal.
- skipLlmVerify=true on Group 4 (3 disk-write steps: save-after-organize,
add-gson-dependency, create-formatter-profile): action mutates a file not
open in any editor; before/after screenshots are by-design identical and
LLM always downgrades. verifyFile from disk is the authoritative signal.
- retries: 1 on 8 verify-completion steps to mitigate cold-cache 'Loading...'
LLM downgrades while keeping the screenshot check enabled.
- retries: 1 on java-maven-resolve-type save-after-resolve (kept from prior
commit) for Maven indexer warm-up.
- Wait bump 45 -> 90s on java-test-runner wait-test-discovery (kept).
- java-pack-help-center-webview setup.extensions hard-requires java-pack
(kept) — fixes the real bug (5/8 failures).
LLM coverage preserved on verify-completion (popup visibility), verifyEditor
(guards against page-wide DOM stale-tab fallback), verifyProblems
(diagnostics red squiggle lag) and verifyWebview (visual rendering).
Requires autotest >= 0.7.7 to honor skipLlmVerify without the auto-skip side
effect.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
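The difference between the reverted auto-skip and the final explicit opt-in can be sketched as two predicates. This is an illustration only: the `PlanStep` shape and both function names are assumptions, and the real framework logic may differ. Only `skipLlmVerify` and the `verify*` field names come from the commit messages above.

```typescript
// Hypothetical step shape; only skipLlmVerify and the verify* names are from the PR.
interface PlanStep {
  id: string;
  skipLlmVerify?: boolean;
  verifyFile?: unknown;
  verifyEditor?: unknown;
  verifyCompletion?: unknown;
  verifyProblems?: unknown;
  verifyWebview?: unknown;
}

// Reverted rule (assumed shape): explicit flag OR any structured verify* field.
function skipsLlmUnderAutoSkip(step: PlanStep): boolean {
  const hasVerifier = Object.keys(step).some((key) => key.startsWith("verify"));
  return step.skipLlmVerify === true || hasVerifier;
}

// Final policy: the LLM check is skipped only on explicit opt-in.
function skipsLlmUnderOptIn(step: PlanStep): boolean {
  return step.skipLlmVerify === true;
}

const lsReady: PlanStep = { id: "ls-ready", skipLlmVerify: true };
const completion: PlanStep = { id: "verify-completion", verifyCompletion: { notEmpty: true } };

for (const step of [lsReady, completion]) {
  console.log(step.id, skipsLlmUnderAutoSkip(step), skipsLlmUnderOptIn(step));
}
```

In this sketch the ls-ready step skips the LLM check under both rules, while verify-completion skips it only under the reverted rule, which is the coverage the final policy restores.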
The 0.7.7 release did not actually honor skipLlmVerify because planParser dropped the field on deserialize. 0.7.8 contains the parser fix; this empty commit restarts CI so the matrix installs the correct version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
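One common way a parser ends up dropping a new field is by copying a fixed set of known properties. The sketch below illustrates that failure mode and its fix. It is not the real planParser code: `ParsedStep`, `parseStepDropping`, and `parseStepKeeping` are hypothetical, and the actual cause inside the parser is assumed rather than confirmed by the commit message.

```typescript
// Hypothetical sketch; this is not the real planParser implementation.
interface ParsedStep {
  id: string;
  retries?: number;
  skipLlmVerify?: boolean;
}

// Buggy shape: copies a fixed set of fields, so a newly added field is lost.
function parseStepDropping(raw: Record<string, unknown>): ParsedStep {
  const retries = raw["retries"];
  return {
    id: String(raw["id"]),
    retries: typeof retries === "number" ? retries : undefined,
  };
}

// Fixed shape: the new field is carried through explicitly.
function parseStepKeeping(raw: Record<string, unknown>): ParsedStep {
  return {
    ...parseStepDropping(raw),
    skipLlmVerify: raw["skipLlmVerify"] === true,
  };
}

const rawStep = JSON.parse('{"id":"ls-ready","skipLlmVerify":true}') as Record<string, unknown>;
const dropped = parseStepDropping(rawStep);
const kept = parseStepKeeping(rawStep);
console.log(dropped.skipLlmVerify, kept.skipLlmVerify);
```

With a parser of the first shape, a plan can set the flag correctly and the runner still never sees it, which matches the symptom described for 0.7.7.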
chagong approved these changes on May 20, 2026
Scheduled e2e-autotest CI has been failing 1-7 tests per day. After investigating 40 failed-job logs across 8 scheduled runs, the failures fall into three categories:

Categories

A. Real bug (1 plan): java-pack-help-center-webview (7/8 failures). The Help Center extension is part of vscode-java-pack, but the plan ran without extensions: [vscjava.vscode-java-pack] in setup, so the menu item never existed. Fixed by adding the required extension.

B. LLM screenshot noise on "no-visual-signal" steps (16+3 steps): two narrow patterns where the LLM screenshot check adds zero signal:
- waitForLanguageServer (16 ls-ready steps): the action polls the VS Code footer status bar for Java: Ready / 👍; the LLM screenshot only sees that same status bar, so it can never disagree productively.
- insertLineInFile / File: Save All (3 steps): the action mutates a file that is not open in any editor (or saves a non-active tab); before/after screenshots are by-design identical and the LLM always downgrades. verifyFile reading from disk is the authoritative signal.

These 19 steps now set skipLlmVerify: true so the LLM check is skipped only for them; every other step keeps full LLM coverage.

C. Timing flakes mitigated with retries (not skips): on these steps the LLM screenshot DOES add unique signal (popup visibility, decoration lag, panel rendering) and must stay enabled. Instead of skipping the LLM, we retry once on transient cold-cache states:
- verify-completion in 8 plans: retries: 1, which survives the cold-cache "Loading..." spinner the first time without sacrificing screenshot verification of the popup.
- java-maven-resolve-type::save-after-resolve: retries: 1 for Maven indexer warm-up.
- java-test-runner::wait-test-discovery: waitBefore bumped 45→90s.

Why this approach (revised after review)

An earlier version of this PR also landed a framework-side rule in @vscjava/vscode-autotest that automatically skipped the LLM check on any step with structured verify* fields. That was too aggressive: the LLM screenshot is the anti-silent-pass safety net for cases where deterministic checks read stale DOM (e.g. verifyEditor falls back to a page-wide .monaco-editor:has-text() that can match hidden tabs).

The framework auto-skip was reverted in autotest v0.7.7 / v0.7.8. skipLlmVerify: true is now an explicit opt-in marker, used only on the 19 steps above.

Required autotest version

This PR requires @vscjava/vscode-autotest >= 0.7.8 so the explicit skipLlmVerify field is honored without the broader auto-skip side-effect. The workflow installs latest, so this is automatic.

Observed top offenders (last 8 scheduled runs)

Update

The initial autotest fix landed in v0.7.7 but planParser.ts was dropping the new skipLlmVerify field on deserialize, so the field had no effect. v0.7.8 contains the parser fix. The PR now requires @vscjava/vscode-autotest@>=0.7.8.
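Because the workflow installs latest, the >= 0.7.8 requirement is met implicitly. If an explicit guard were wanted, a minimal version comparison might look like the sketch below. The `atLeast` helper is hypothetical, handles only plain major.minor.patch strings (no pre-release tags), and is not part of this PR or of the autotest package.

```typescript
// Hypothetical helper: compares plain "major.minor.patch" strings numerically.
// Pre-release or build suffixes are not handled.
function atLeast(installed: string, required: string): boolean {
  const a = installed.split(".").map(Number);
  const b = required.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y; // first differing component decides
  }
  return true; // all three components are equal
}

const required = "0.7.8";
for (const installed of ["0.7.7", "0.7.8", "0.7.10"]) {
  console.log(installed, atLeast(installed, required));
}
```

Comparing components as numbers rather than strings matters here: a string comparison would rank 0.7.10 below 0.7.8.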