
fix(swarm): WindowsTerminalBackend pidFile health check + 5-state lifecycle #1237

Open
amDosion wants to merge 2 commits into claude-code-best:main from amDosion:fix/wt-pidfile-health


@amDosion
Collaborator

@amDosion amDosion commented May 18, 2026

Summary

Fixes two classes of problems in the fork's in-house Windows Terminal Agent Teams backend (src/utils/swarm/backends/WindowsTerminalBackend.ts):

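  • wt.exe split-pane is fire-and-forget: wt.exe returning exit 0 does not mean PowerShell actually started. The original code wrote to the mailbox immediately, which left the teammate hung and TeamDelete stuck on "active teammate". This PR adds waitForPidFile(), which polls the pidFile after wt.exe returns until the child PowerShell has actually written its PID. The default timeout is 8s (override with the env var CLAUDE_WT_PANE_TIMEOUT_MS); on timeout it throws with full diagnostic information.
  • kill-while-spawn race + state divergence: introduces a 5-state lifecycle (registered/spawning/ready/killing/dead). While a pane is spawning, killPane first awaits the in-flight Promise and only then decides what to do (including a TOCTOU re-read of the status). It prefers the cached pane.pid to avoid a disk read, and on any Stop-Process failure it clears the cache and marks the pane dead so that a reused PID is never killed by mistake.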
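
The polling step in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper name resolveTimeoutMs, the error wording, and the way the environment variable is validated are assumptions. Only the 8s default, the CLAUDE_WT_PANE_TIMEOUT_MS override, and the 200ms polling interval come from this description.

```typescript
import { readFile } from "node:fs/promises";

const DEFAULT_TIMEOUT_MS = 8000; // default stated in the PR description
const POLL_INTERVAL_MS = 200; // interval mentioned in the follow-up list

// Hypothetical helper: read the override from the environment and fall back
// to the default when it is missing or not a positive integer.
export function resolveTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env.CLAUDE_WT_PANE_TIMEOUT_MS?.trim();
  if (raw === undefined || !/^\d+$/.test(raw)) return DEFAULT_TIMEOUT_MS;
  const ms = Number(raw);
  return Number.isFinite(ms) && ms > 0 ? ms : DEFAULT_TIMEOUT_MS;
}

// Poll pidFile until it holds a valid PID or the deadline passes.
export async function waitForPidFile(pidFile: string, timeoutMs: number): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  let lastSeen = "<missing>";
  while (true) {
    try {
      const raw = (await readFile(pidFile, "utf8")).trim();
      lastSeen = JSON.stringify(raw);
      if (/^\d+$/.test(raw)) {
        const pid = Number(raw);
        if (Number.isFinite(pid) && pid > 0) return pid;
      }
    } catch {
      lastSeen = "<missing>"; // file not written yet or unreadable: keep polling
    }
    if (Date.now() >= deadline) break;
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
  throw new Error(
    `pidFile ${pidFile} had no valid PID after ${timeoutMs}ms (last content: ${lastSeen}); ` +
      `set CLAUDE_WT_PANE_TIMEOUT_MS to override the timeout`,
  );
}
```

A caller would typically pass resolveTimeoutMs(process.env) as the second argument, so tests can shorten the wait without touching the code.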

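Also: pid parsing is stricter (/^\d+$/ + Number.isFinite + > 0, rejecting "123abc" and similar), the constructor now takes an options object that supports pidFileDir injection (for test isolation), and makePidFile moves from a module-level function to a private method.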
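
The three pid checks named above compose as in the sketch below. The function name parsePid and the whitespace trimming are assumptions for illustration; the PR only states the three conditions.

```typescript
// Hypothetical helper; the name parsePid is not taken from the PR.
export function parsePid(raw: string): number | null {
  // Assumption: surrounding whitespace (e.g. a trailing newline) is tolerated.
  const text = raw.trim();
  // Digits only: rejects "123abc", "-5", "1.5", and the empty string.
  if (!/^\d+$/.test(text)) return null;
  const pid = Number(text);
  // Number.isFinite catches digit runs long enough to overflow to Infinity;
  // the > 0 check rejects "0".
  if (!Number.isFinite(pid) || pid <= 0) return null;
  return pid;
}
```

The regex matters because Number.parseInt("123abc", 10) would silently return 123, which is exactly the kind of corrupted content the PR wants to reject.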

Test plan

  • bun test src/utils/swarm/backends/__tests__/WindowsTerminalBackend.test.ts: 12 pass / 0 fail (5 adapted v1 tests + 7 new v2 scenarios: kill-while-spawn race / re-spawn on a ready pane / corrupted pid / cache cleared on Stop-Process failure, etc.)
  • bun run tsc --noEmit: zero new errors (the 4 pre-existing errors in doubaoSTT.ts about the missing doubaoime-asr module are unrelated to this PR)
  • bun test src/utils/swarm/backends/__tests__/PaneBackendExecutor.test.ts: 2 pass (existing PaneBackendExecutor cases are not broken)
  • Manual e2e in a real Windows Terminal environment (wt.exe is a UWP app and can only be tested on Windows; help from a reviewer with verification is suggested)

Scope

  • Only 2 files changed, +462/-51:
    • src/utils/swarm/backends/WindowsTerminalBackend.ts (+215/-38)
    • src/utils/swarm/backends/__tests__/WindowsTerminalBackend.test.ts (+247/-13)
  • Not changed: PaneBackendExecutor / registry / detection / other backends / CI / workflow / other fork-specific configuration

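Follow-up (not blocking this PR)

  1. When the timeout throws, proactively best-effort kill the wt process that was already handed off (prevents orphan processes)
  2. Dead pane GC policy (dead panes lingering in the map affect the isFirstTeammate check)
  3. Per-spawn pidFile generation token (fully resolves the underlying race between PowerShell's finally block deleting the file and a new spawn writing it)
  4. Replace the 200ms polling with fs.watch (performance optimization)
  5. Make a false result from killPane distinguish "pane does not exist" from "kill actually failed" (observability)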

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable timeout for Windows Terminal pane initialization.
  • Bug Fixes

    • Improved reliability of pane lifecycle management with enhanced state tracking and safety checks.
    • Fixed handling of concurrent spawn and kill operations.
    • Enhanced robustness when dealing with corrupted or missing configuration files.
  • Tests

    • Expanded test coverage for pane management edge cases.

Review Change Stack

unraid added 2 commits May 18, 2026 19:05
…ecycle

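Fixes several problems caused by wt.exe split-pane being fire-and-forget: hung teammates,
TeamDelete getting stuck, the kill-while-spawn race, and others.

- Add waitForPidFile(): after wt.exe returns, wait for powershell.exe to actually start and write the pidFile.
  Default timeout 8s, overridable via env CLAUDE_WT_PANE_TIMEOUT_MS; on timeout it throws with full diagnostics
- Add a 5-state lifecycle (registered/spawning/ready/killing/dead); sendCommandToPane
  wraps its inner Promise as spawnPromise, and re-spawning a ready pane throws immediately
- killPane TOCTOU fix: re-read status after awaiting spawnPromise; prefer the cached pane.pid
  to avoid a disk read; a Stop-Process failure also clears the cache and marks the pane dead to avoid killing a reused PID
- Stricter pid parsing: /^\d+$/ + Number.isFinite + >0; remove a dead (unreachable) try/catch
- Constructor options object injects pidFileDir (still compatible with the original positional parameters)
- Clear any stale pidFile before launch; killPane falls back to 3×500ms retries as a last resort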
…, pid validation

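Adds 12 tests for WindowsTerminalBackend covering all new v2 behavior: 5 v1-compatibility tests
+ 7 new v2 scenarios. Together with the constructor options object, the tests use
pidFileDir: tempDir for isolation so that nothing leaks into the real OS tmpdir.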
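
To illustrate why injecting pidFileDir isolates tests, here is a rough sketch of an options-object constructor with a private makePidFile. The class name SketchBackend, the public pidFileFor wrapper, and the file naming scheme are all hypothetical; the real backend also accepts runCommand and getPlatform, which are omitted here.

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Only the pidFileDir option is sketched here.
interface SketchBackendOptions {
  pidFileDir?: string;
}

export class SketchBackend {
  private readonly pidFileDir: string;

  constructor(options: SketchBackendOptions = {}) {
    // Assumption: production falls back to the OS temp directory.
    this.pidFileDir = options.pidFileDir ?? tmpdir();
  }

  // Private, as in the PR; the "wt-pane-<id>.pid" naming is made up.
  private makePidFile(paneId: string): string {
    return join(this.pidFileDir, `wt-pane-${paneId}.pid`);
  }

  // Hypothetical public wrapper so the sketch can be exercised from outside.
  pidFileFor(paneId: string): string {
    return this.makePidFile(paneId);
  }
}
```

A test can then pass a fresh temp directory as pidFileDir and delete it afterwards, so no pidFile ever lands in the shared OS tmpdir.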

New scenarios covered:
- unlinks stale pidFile so a stale pid is not adopted
- rejects re-spawn on a ready pane
- throws on unknown paneId in sendCommandToPane
- rejects corrupted pidFile content ("123abc") and times out
- killPane awaits in-flight spawn before killing (kill-while-spawn race)
- Stop-Process failure clears cached pid and marks pane dead
- killPane uses cached pid and returns false when pane is unknown

The createBackend helper now takes an options object and uses simulatePidWrite to simulate
PowerShell writing the pidFile; pidFileDir is injected as tempDir, and the env var
CLAUDE_WT_PANE_TIMEOUT_MS is set in beforeEach and cleaned up in afterEach.
@coderabbitai
Contributor

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: efd110fd-867d-487b-8acd-cbf46a06606f

📥 Commits

Reviewing files that changed from the base of the PR and between 2bca31e and 9b6b9a5.

📒 Files selected for processing (2)
  • src/utils/swarm/backends/WindowsTerminalBackend.ts
  • src/utils/swarm/backends/__tests__/WindowsTerminalBackend.test.ts

📝 Walkthrough


WindowsTerminalBackend pane lifecycle management is refactored to support dependency injection, explicit state tracking via a PaneStatus enum, and timeout-driven PID file polling. sendCommandToPane and killPane are rewritten as state machines that properly handle concurrent spawn/kill races and prevent invalid lifecycle transitions. Tests validate state transitions, timeout behavior, and edge cases.

Changes

Windows Terminal Pane Lifecycle Management

| Layer / File(s) | Summary |
|---|---|
| Pane State Data Shapes and Configuration / src/utils/swarm/backends/WindowsTerminalBackend.ts | Introduces PaneStatus enum and extends WindowsTerminalPane to track lifecycle state (registered, spawning, ready, killing, dead), optional cached pid, and in-flight spawnPromise. Adds configurable CLAUDE_WT_PANE_TIMEOUT_MS and waitForPidFile polling helper. Updates fs import to include unlink for PID file cleanup. |
| Constructor Refactoring and Dependency Injection / src/utils/swarm/backends/WindowsTerminalBackend.ts | Refactors constructor to accept optional injected runCommand, getPlatform, and pidFileDir via options object or positional parameters. Replaces module-level PID file creation with instance method makePidFile that respects the configured directory. |
| Pane and Window Registration with Lifecycle State / src/utils/swarm/backends/WindowsTerminalBackend.ts | When registering new panes and windows, PID file path now comes from injected makePidFile and each pane's initial status is set to 'registered' for explicit lifecycle tracking. |
| sendCommandToPane Spawn State Machine / src/utils/swarm/backends/WindowsTerminalBackend.ts | Reworks sendCommandToPane as a state-driven flow: validates pane status against invalid transitions, creates and attaches spawnPromise before awaiting, launches wt.exe via PowerShell -EncodedCommand, clears existing PID file, polls for PID file within timeout window, updates cached pid and transitions to ready on success, or transitions to dead and rejects promise on timeout/error. |
| killPane Race-Safe Lifecycle Handling / src/utils/swarm/backends/WindowsTerminalBackend.ts | Rewrites killPane to avoid kill-while-spawn races: awaits any in-flight spawnPromise when called during spawning, validates pane status, marks pane as killing, uses cached PID when available (otherwise retries reading/parsing from disk), issues Stop-Process command, then marks pane dead and removes it from the internal map to prevent PID reuse errors, even if the stop attempt fails. |
| Test Infrastructure and Comprehensive Coverage / src/utils/swarm/backends/__tests__/WindowsTerminalBackend.test.ts | Test setup manages CLAUDE_WT_PANE_TIMEOUT_MS environment variable per test. Test harness extended to simulate delayed PID file writes when wt.exe runs with -EncodedCommand. Adds 11 targeted test cases validating cached PID usage, timeout diagnostics with override hints, stale PID cleanup, re-spawn rejection when pane is ready, unknown paneId errors, corrupted PID content handling, kill-while-spawn race avoidance, failed Stop-Process with state clearing, and successful PID caching across operations. |
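
One way to picture the five lifecycle states is as a small transition table. The sketch below is an assumption-laden illustration: the walkthrough calls PaneStatus an enum, whereas a string union is used here, and the allowed edges are inferred from the summary rather than copied from the source.

```typescript
type PaneStatus = "registered" | "spawning" | "ready" | "killing" | "dead";

// Assumed transition table; the PR does not list the allowed edges explicitly.
const ALLOWED: Record<PaneStatus, readonly PaneStatus[]> = {
  registered: ["spawning", "killing"],
  spawning: ["ready", "dead"], // spawn succeeded, or timed out / failed
  ready: ["killing"],
  killing: ["dead"],
  dead: [],
};

export function canTransition(from: PaneStatus, to: PaneStatus): boolean {
  return ALLOWED[from].includes(to);
}

// Mirrors the summary: only a freshly registered pane may be spawned,
// so re-spawning a ready (or spawning/killing/dead) pane throws.
export function assertCanSpawn(status: PaneStatus): void {
  if (status !== "registered") {
    throw new Error(`cannot spawn pane in status "${status}"`);
  }
}
```

Note that there is no direct spawning to killing edge in this sketch: a kill request that arrives mid-spawn waits for the spawn to settle into ready or dead first.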

Sequence Diagram

sequenceDiagram
  participant Caller
  participant Backend
  participant PowerShell
  participant FileSystem
  Caller->>Backend: sendCommandToPane(paneId, cmd)
  Backend->>Backend: validate pane status not spawning/ready/killing/dead
  Backend->>Backend: create & attach spawnPromise
  Backend->>PowerShell: wt.exe via PowerShell -EncodedCommand
  Backend->>FileSystem: delete existing PID file
  Backend->>FileSystem: waitForPidFile (poll with timeout)
  FileSystem-->>Backend: PID file appears & validates
  Backend->>Backend: update pid, status=ready, resolve spawnPromise
  Backend-->>Caller: resolved promise
  note over Backend: on error: status=dead, reject promise, clear promise
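
The kill side of the race is not shown in the diagram. A minimal sketch of the idea, under stated assumptions, is given below: stopProcess is a hypothetical injected stand-in for the Stop-Process call, the disk-read fallback and map removal are omitted, and panes that never reached ready simply return false.

```typescript
type PaneStatus = "registered" | "spawning" | "ready" | "killing" | "dead";

interface Pane {
  status: PaneStatus;
  pid: number | undefined;
  spawnPromise: Promise<void> | undefined;
}

export async function killPane(
  pane: Pane | undefined,
  stopProcess: (pid: number) => Promise<boolean>, // hypothetical stand-in
): Promise<boolean> {
  if (!pane) return false; // unknown pane
  if (pane.status === "spawning" && pane.spawnPromise) {
    // Wait for the in-flight spawn to settle; a failed spawn is not an error here.
    await pane.spawnPromise.catch(() => undefined);
  }
  // TOCTOU: re-read the status after the await, since the spawn may have
  // moved the pane to ready or dead in the meantime.
  if (pane.status !== "ready" || pane.pid === undefined) return false;
  pane.status = "killing";
  const pid = pane.pid; // cached pid, no disk read needed
  let stopped = false;
  try {
    stopped = await stopProcess(pid);
  } catch {
    stopped = false;
  } finally {
    // Clear the cache even on failure so a recycled PID is never killed later.
    pane.pid = undefined;
    pane.status = "dead";
  }
  return stopped;
}
```

Awaiting the spawn before deciding is what prevents a kill from acting on a pane whose pid has not been written yet.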

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A pane's life unfolds in states most clear,
From spawning dreams to ready cheer,
With timeouts, caches, races tamed,
This Terminal lifecycle's now renamed—
State machines guard each kill and spawn! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main changes: adding pidFile health checking and implementing a 5-state lifecycle for pane management in WindowsTerminalBackend. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

