fix(test-plans): mitigate scheduled e2e-autotest flakiness#1622

Merged
wenytang-ms merged 3 commits into main from fix/e2e-autotest-flakiness-mitigation
May 20, 2026

wenytang-ms commented May 20, 2026

Scheduled e2e-autotest CI has been failing 1-7 tests per day. After investigating 40 failed-job logs across 8 scheduled runs, the failures fall into three categories:

Categories

A. Real bug (1 plan): java-pack-help-center-webview (7/8 failures). The Help Center extension is part of vscode-java-pack, but the plan ran without extensions: [vscjava.vscode-java-pack] in setup, so the menu item never existed. Fixed by adding the required extension.
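
A plan-lint check along the following lines could catch this class of bug before a scheduled run. This is a minimal sketch under stated assumptions: the `Plan` and `PlanSetup` shapes and the `missingExtensions` helper are hypothetical and are not the autotest plan schema; only `setup.extensions` and the extension ID come from this PR.

```typescript
// Hypothetical plan shape, for illustration only.
interface PlanSetup {
  extensions?: string[];
}

interface Plan {
  name: string;
  setup: PlanSetup;
}

// Returns the required extension IDs that the plan does not declare in setup.extensions.
function missingExtensions(plan: Plan, required: string[]): string[] {
  const declared = new Set(plan.setup.extensions ?? []);
  return required.filter((id) => !declared.has(id));
}

const planBeforeFix: Plan = { name: "java-pack-help-center-webview", setup: {} };
const planAfterFix: Plan = {
  name: "java-pack-help-center-webview",
  setup: { extensions: ["vscjava.vscode-java-pack"] },
};

const missingBefore = missingExtensions(planBeforeFix, ["vscjava.vscode-java-pack"]);
const missingAfter = missingExtensions(planAfterFix, ["vscjava.vscode-java-pack"]);
console.log(missingBefore, missingAfter);
```

A non-empty result for the pre-fix plan would flag the problem at lint time rather than as a silent timeout in the open-help-center step.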

B. LLM screenshot noise on "no-visual-signal" steps (16+3 steps): two narrow patterns where the LLM screenshot check adds zero signal:

  • G1, waitForLanguageServer (16 ls-ready steps): the action polls the VS Code footer status bar for Java: Ready / 👍; the LLM screenshot only sees that same status bar, so it can never disagree productively.
  • G4, disk-only insertLineInFile/File: Save All (3 steps): the action mutates a file that is not open in any editor (or saves a non-active tab); before/after screenshots are by-design identical and the LLM always downgrades. verifyFile reading from disk is the authoritative signal.

These 19 steps now set skipLlmVerify: true, so the LLM check is skipped only for them; every other step keeps full LLM coverage.

C. Timing flakes mitigated with retries (not skips): on these steps the LLM screenshot DOES add unique signal (popup visibility, decoration lag, panel rendering) and must stay enabled. Instead of skipping the LLM, we retry once on transient cold-cache states:

  • verify-completion in 8 plans: retries: 1, which survives the cold-cache "Loading..." spinner the first time without sacrificing screenshot verification of the popup.
  • java-maven-resolve-type::save-after-resolve: retries: 1 for Maven indexer warm-up.
  • java-test-runner::wait-test-discovery: waitBefore bumped 45s → 90s.
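
The effect of retries: 1 on a cold-cache step can be sketched as below. This is an illustration only: `runWithRetries`, `StepOutcome`, and `makeColdCacheStep` are hypothetical names, and the real retry loop in @vscjava/vscode-autotest may differ (for example, it is presumably asynchronous).

```typescript
// Hypothetical sketch; these names are not part of @vscjava/vscode-autotest.
type StepOutcome = "pass" | "fail";

interface RetryReport {
  outcome: StepOutcome;
  attempts: number;
}

function runWithRetries(runStep: () => StepOutcome, retries: number): RetryReport {
  let attempts = 0;
  let outcome: StepOutcome = "fail";
  // One initial attempt plus up to `retries` additional attempts.
  while (attempts <= retries && outcome === "fail") {
    attempts += 1;
    outcome = runStep();
  }
  return { outcome, attempts };
}

// Simulates a cold-cache step: the first call fails (e.g. "Loading..."), later calls pass.
function makeColdCacheStep(): () => StepOutcome {
  let calls = 0;
  return () => {
    calls += 1;
    return calls === 1 ? "fail" : "pass";
  };
}

const withRetry = runWithRetries(makeColdCacheStep(), 1);
const withoutRetry = runWithRetries(makeColdCacheStep(), 0);
console.log(withRetry, withoutRetry);
```

In this model every attempt still runs the full step, including the LLM screenshot check, so a retry hides a transient first-attempt state without weakening verification.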

Why this approach (revised after review)

An earlier version of this PR also landed a framework-side rule in @vscjava/vscode-autotest that automatically skipped the LLM check on any step with structured verify* fields. That was too aggressive: the LLM screenshot is the anti-silent-pass safety net for cases where deterministic checks read stale DOM (e.g. verifyEditor falls back to a page-wide .monaco-editor:has-text() that can match hidden tabs).

The framework auto-skip was reverted in autotest v0.7.7 / v0.7.8. skipLlmVerify: true is now an explicit opt-in marker, used only on the 19 steps above.
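
The difference between the reverted auto-skip rule and the explicit opt-in can be sketched as two predicates. This is a rough model, not the framework's code: the `Step` shape and both function names are assumptions; only skipLlmVerify and the verify* naming convention come from this PR.

```typescript
// Hypothetical step shape, for illustration only.
type Step = { id: string; skipLlmVerify?: boolean; [field: string]: unknown };

// Reverted rule: any structured verify* field suppressed the LLM screenshot check.
function llmRunsUnderAutoSkip(step: Step): boolean {
  return !Object.keys(step).some((key) => key.startsWith("verify"));
}

// Current rule: only an explicit skipLlmVerify: true suppresses it.
function llmRunsUnderExplicitOptIn(step: Step): boolean {
  return step.skipLlmVerify !== true;
}

const verifyCompletionStep: Step = {
  id: "verify-completion",
  verifyCompletion: { notEmpty: true },
};
const lsReadyStep: Step = { id: "ls-ready", skipLlmVerify: true };

console.log(
  llmRunsUnderAutoSkip(verifyCompletionStep),
  llmRunsUnderExplicitOptIn(verifyCompletionStep),
  llmRunsUnderExplicitOptIn(lsReadyStep),
);
```

Under the auto-skip model, verify-completion would lose its screenshot check simply because it carries verifyCompletion; under the opt-in model it keeps the check, and only steps explicitly marked like ls-ready skip it.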

Required autotest version

This PR requires @vscjava/vscode-autotest >= 0.7.8 so the explicit skipLlmVerify field is honored without the broader auto-skip side-effect. The workflow installs latest, so this is automatic.
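
If a guard on the installed version were ever wanted (the workflow does not need one today, since it installs latest), a comparison could look roughly like this. The helper names are hypothetical, and the comparison handles only plain numeric major.minor.patch strings, not full semver ranges or pre-release tags.

```typescript
// Hypothetical helpers; handles only numeric "major.minor.patch" versions.
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/, "")
    .split(".")
    .map((part) => Number.parseInt(part, 10));
}

function isAtLeast(installed: string, required: string): boolean {
  const a = parseVersion(installed);
  const b = parseVersion(required);
  for (let i = 0; i < 3; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) {
      return x > y;
    }
  }
  return true; // all three components are equal
}

console.log(isAtLeast("0.7.7", "0.7.8"), isAtLeast("0.7.8", "0.7.8"));
```

Comparing component by component as numbers matters here: a plain string comparison would rank "0.7.10" below "0.7.8".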

Observed top offenders (last 8 scheduled runs)

| Plan | Failures | Cause | Fix |
| --- | --- | --- | --- |
| java-pack-help-center-webview | 7/8 | Missing extension | setup.extensions |
| java-dependency-viewer | 5/8 | ls-ready LLM noise | G1 skip |
| java-maven-resolve-type | 4/8 | Maven indexer slow + disk-write false fail | G4 skip + retries:1 |
| java-single-file | 3/8 | Cold-cache completion | retries:1 |
| java-basic-editing | 2/8 | ls-ready noise + Save All false fail | G1+G4 skips |
| java-extension-pack | 2/8 | ls-ready noise | G1 skip |
| java-webview-migration | 2/8 | ls-ready noise + xml disk write | G1+G4 skips |

Update

The initial autotest fix landed in v0.7.7 but planParser.ts was dropping the new skipLlmVerify field on deserialize, so the field had no effect. v0.7.8 contains the parser fix. The PR now requires @vscjava/vscode-autotest@>=0.7.8.
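
The class of bug described here, a parser that copies known fields one by one and silently drops a newly added one, can be illustrated as follows. This is not the actual planParser.ts, whose code is not shown in this PR; the `ParsedStep` shape and both function names are assumptions made for the sketch.

```typescript
// Hypothetical parsed-step shape; not the real planParser.ts types.
interface ParsedStep {
  id: string;
  action: string;
  skipLlmVerify?: boolean;
}

// Models the v0.7.7 behavior: the new field is never copied, so it has no effect.
function parseStepDroppingField(raw: Record<string, unknown>): ParsedStep {
  return { id: String(raw["id"]), action: String(raw["action"]) };
}

// Models the v0.7.8 behavior: the field is carried through deserialization.
function parseStepKeepingField(raw: Record<string, unknown>): ParsedStep {
  const step: ParsedStep = { id: String(raw["id"]), action: String(raw["action"]) };
  if (raw["skipLlmVerify"] === true) {
    step.skipLlmVerify = true;
  }
  return step;
}

const rawStep = { id: "ls-ready", action: "waitForLanguageServer", skipLlmVerify: true };
const dropped = parseStepDroppingField(rawStep);
const kept = parseStepKeepingField(rawStep);
console.log(dropped, kept);
```

A round-trip test of this kind (raw step in, parsed step out, assert the field survives) would have caught the regression before release.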

Triage of the last 8 scheduled e2e-autotest runs identified three failure
categories: a real plan bug, LLM screenshot-based false downgrades, and
real timing flakes. This change addresses all three.

Category A — real plan bug
* java-pack-help-center-webview was missing vscjava.vscode-java-pack from
  setup.extensions. On scheduled runs (no PR VSIX) java.welcome was
  unregistered and the open-help-center step silently timed out. This was
  the #1 failure across the last 8 nightly runs (7/8). Now installs the
  pack from the marketplace on schedule runs while still letting --vsix
  override on PR runs.

Category B — LLM downgrade noise on ls-ready
* Add skipLlmVerify: true (introduced in @vscjava/vscode-autotest 0.7.5) to
  every ls-ready step that has no structured verify* field. The
  waitForLanguageServer action is itself the authoritative deterministic
  check; the LLM was downgrading these whenever the status bar still showed
  background indexing ("Java: Searching... 0%"), even though the LS was
  fully functional. Affected: java-dependency-viewer, java-extension-pack,
  java-fresh-import, java-maven-resolve-type, java-maven,
  java-new-file-snippet, java-single-file, java-webview-migration.

Category C — real timing flakes
* java-test-runner: bump wait-test-discovery from 45s to 90s (the
  vscode-java-test discovery scan can take longer than 45s on a cold cache)
  and add retries: 1 to run-all-tests so a discovery-still-warming first
  invocation can retry.
* java-maven-resolve-type: add retries: 1 to save-after-resolve so a slow
  Maven re-import on a cold cache (where the LS hasn't yet republished
  zero-errors at the time of save) can retry instead of failing the plan.

Plans whose flaky steps already carry a structured verify* field (e.g.
verify-completion with verifyCompletion: { notEmpty: true },
save-after-organize with verifyFile, verify-help-center-content with
verifyWebview) no longer need plan changes because the framework
auto-skip in @vscjava/vscode-autotest 0.7.5 already short-circuits the
LLM re-check whenever any structured verifier is present.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wenytang-ms and others added 2 commits May 20, 2026 14:16
…d-cache flakes

Reverts the over-broad framework auto-skip (any structured verify -> no LLM)
that was landed in autotest v0.7.5/0.7.6. LLM screenshot verification is the
anti-silent-pass safety net and must stay enabled on steps where the screenshot
carries unique signal (popup visibility, decoration lag, panel content).

Final policy:
  - skipLlmVerify=true on Group 1 (16 ls-ready steps): waitForLanguageServer
    polls the same status bar text the LLM would read, so LLM adds zero signal.
  - skipLlmVerify=true on Group 4 (3 disk-write steps: save-after-organize,
    add-gson-dependency, create-formatter-profile): action mutates a file not
    open in any editor; before/after screenshots are by-design identical and
    LLM always downgrades. verifyFile from disk is the authoritative signal.
  - retries: 1 on 8 verify-completion steps to mitigate cold-cache 'Loading...'
    LLM downgrades while keeping the screenshot check enabled.
  - retries: 1 on java-maven-resolve-type save-after-resolve (kept from prior
    commit) for Maven indexer warm-up.
  - Wait bump 45 -> 90s on java-test-runner wait-test-discovery (kept).
  - java-pack-help-center-webview setup.extensions hard-requires java-pack
    (kept), which fixes the real bug (7/8 failures).

LLM coverage preserved on verify-completion (popup visibility), verifyEditor
(guards against page-wide DOM stale-tab fallback), verifyProblems
(diagnostics red squiggle lag) and verifyWebview (visual rendering).

Requires autotest >= 0.7.7 to honor skipLlmVerify without the auto-skip side
effect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 0.7.7 release did not actually honor skipLlmVerify because
planParser dropped the field on deserialize. 0.7.8 contains the
parser fix; this empty commit restarts CI so the matrix installs
the correct version.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wenytang-ms merged commit 8de1414 into main on May 20, 2026
53 checks passed
wenytang-ms deleted the fix/e2e-autotest-flakiness-mitigation branch on May 20, 2026 07:23