feat(bench): live observe→steer join (real worker + real observer) #195

Merged
drewstone merged 1 commit into main from feat/loops-live-join · Jun 8, 2026
@drewstone

What

Adds bench/src/cloud-loop.mts — the observe→steer loop closed on live endpoints:

round → REAL cloud worker (openSandboxRun, opencode in a box) over task + accumulated steers
      → its real event trace
      → observe() with a REAL router LLM reads the trace → an AnalystFinding
      → finding.recommended_action injected as the next round's steer
      → stop on the deterministic verifier, or budget
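
The round structure above can be sketched as a small driver. This is a minimal illustration, not the contents of cloud-loop.mts: `AnalystFinding` and `recommended_action` are named in this PR, but `runLoop`, the `Worker`/`Observer`/`Verifier` types, the array return shape, and the choice to verify before observing are all assumptions made for the example. The stubs stand in for openSandboxRun and observe().

```typescript
// Hypothetical shapes. Only AnalystFinding.recommended_action comes from the PR text.
interface TraceEvent { type: string; text: string }
interface AnalystFinding { recommended_action?: string }

type Worker = (task: string, steers: readonly string[]) => Promise<TraceEvent[]>;
type Observer = (trace: readonly TraceEvent[]) => Promise<AnalystFinding[]>;
type Verifier = (trace: readonly TraceEvent[]) => boolean;

interface LoopResult { passed: boolean; rounds: number; steers: string[] }

async function runLoop(
  task: string,
  worker: Worker,
  observer: Observer,
  verify: Verifier,
  maxRounds: number,
): Promise<LoopResult> {
  const steers: string[] = []; // accumulated across rounds
  for (let round = 1; round <= maxRounds; round++) {
    // Worker runs over the task plus every steer collected so far.
    const trace = await worker(task, steers);
    // Stop as soon as the deterministic verifier accepts the trace.
    if (verify(trace)) return { passed: true, rounds: round, steers };
    // Otherwise the observer reads the trace and its recommendations
    // become steers for the next round.
    for (const finding of await observer(trace)) {
      if (finding.recommended_action) steers.push(finding.recommended_action);
    }
  }
  // Budget exhausted without a passing trace.
  return { passed: false, rounds: maxRounds, steers };
}

// Stubbed usage: the worker only says "hello" once it has been steered.
const demo = runLoop(
  "say hello",
  async (_task, steers) => [{ type: "message", text: steers.length > 0 ? "hello" : "hi" }],
  async () => [{ recommended_action: "Reply with exactly 'hello'." }],
  (trace) => trace.some((e) => e.text === "hello"),
  3,
);
demo.then((r) => console.log(r.passed, r.rounds, r.steers.length));
```

With these stubs the first round fails verification, the observer contributes one steer, and the second round passes.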

Why

PR #194 shipped observe-steer-workspace-loop.mts, which proves the join through a mock observer (transport:'mock', canned findings) feeding a canned worker — "the grammar talking to itself," the exact pattern docs/research/loop-facade-postmortem.md (also in #194) warns against. This proves the same join with both ends real.

The two are complementary surfaces, not duplicates:

  • observe-steer-workspace-loop.mts — exercises the Scope/Supervisor/coordination-MCP/git-workspace plumbing (mock ends, deterministic, no creds).
  • cloud-loop.mts (this PR) — exercises the live worker + live observer path (openSandboxRun + observe()).

Status (honest)

The join ran live end-to-end for 3 rounds (real worker → real trace → real router-LLM finding → real steer injection). Re-runs are currently blocked at provisioning by a sandbox egress regression: router.tangle.tools returns CONNECT-403 from inside the box (only that host is affected; id/pangolin/sandbox.tangle.tools and api.openai.com all pass). It worked on 2026-06-06, so this is a platform regression, tracked as ops-board #984. This PR therefore proves the live join; efficacy (does the steer improve behavior at equal budget?) is gated on that unblock.

Follow-up (recommendation, not in this PR)

Once #984 is unblocked, run for efficacy. Separately, consider converting observe-steer-workspace-loop.mts into a real CI unit test under tests/loops/ (it currently runs as a standalone tsx demo and asserts nothing), or retiring it now that the live join exists.

Test

Code is byte-identical to the version that ran live for 3 rounds (only the header docstring changed). bench/** is outside the root biome scope (consistent with sibling fleet.mts/workspace-loop.mts); build is verified-by-execution.

…r observer

The merged observe-steer-workspace-loop.mts proves the join through a mock
observer (transport:'mock', canned findings) and a canned worker — the
grammar talking to itself, which docs/research/loop-facade-postmortem.md
warns against. This closes the same join on LIVE endpoints: a real cloud
opencode worker (openSandboxRun) produces a real event trace, observe()
reads it with a real router LLM, and the finding's recommended_action is
injected as the next round's steer.

The join ran live end-to-end for 3 rounds. Re-runs are currently blocked at
provisioning by a sandbox egress regression (router.tangle.tools CONNECT-403
from inside the box; only that host — every other tangle host + provider
egress passes), tracked as ops-board #984. So this proves the live JOIN;
efficacy (does the steer improve behavior at equal budget) is gated on that
unblock.
@tangletools

✅ No Blockers — 302af97e

Readiness 76/100 · Confidence 65/100 · 6 findings (1 medium, 5 low)

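| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 76 | 89 | 76 |
| Confidence | 65 | 65 | 65 |
| Correctness | 76 | 89 | 76 |
| Security | 76 | 89 | 76 |
| Testing | 76 | 89 | 76 |
| Architecture | 76 | 89 | 76 |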

Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.

🟠 MEDIUM observer call has no timeout/abort signal — bench/src/cloud-loop.mts

L117-120: await observe(...) is called without an AbortSignal. The per-round 240s timeout on L88 (setTimeout(() => controller.abort(), 240_000)) only aborts openSandboxRun, and the timer is cleared on L109 before observe runs. If the router LLM inside observe hangs (network stall, model degradation), the script hangs indefinitely with no timeout. The controller is still in scope — pass signal: controller.signal into the observe options to give it a hard deadline. The observe() function forwards opts.signal to chat.chat() (observe.ts:150), so the plumbing already exists.
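
A minimal sketch of the suggested fix, assuming one per-round deadline should cover both legs. `runRound`, `Leg`, and `hang` are hypothetical names for this example; the only detail taken from the finding is that both the worker and the observer accept a `signal` option. The key change is that the timer is cleared in `finally`, after the observer settles, rather than between the two calls.

```typescript
// Hypothetical stand-in for a leg that honours an AbortSignal.
type Leg<T> = (opts: { signal: AbortSignal }) => Promise<T>;

async function runRound<W, O>(
  worker: Leg<W>,
  observer: (trace: W, opts: { signal: AbortSignal }) => Promise<O>,
  timeoutMs: number,
): Promise<O> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const trace = await worker({ signal: controller.signal });
    // Same signal: a stalled observer is cut off by the same deadline.
    return await observer(trace, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // cleared only after BOTH legs have settled
  }
}

// Stub that never resolves on its own and rejects once the signal aborts,
// imitating a stalled router LLM call.
function hang<T>(signal: AbortSignal): Promise<T> {
  return new Promise<T>((_resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    signal.addEventListener("abort", () => reject(new Error("aborted")), { once: true });
  });
}

// Usage: the worker returns at once, the observer stalls, the deadline fires.
runRound(async () => ["event"], (_trace, opts) => hang<string>(opts.signal), 50)
  .catch((err: Error) => console.error("round ended:", err.message));
```

Whether the observer should share the worker's deadline or get its own budget is a design choice the finding leaves open; a second controller created after the worker returns would give the observer a full, separate window.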

🟡 LOW AbortSignal not propagated to observe() call — bench/src/cloud-loop.mts

Line 111: observe(...) accepts opts.signal (ObserveOptions.signal exists per src/runtime/observe.ts:46) but the cloud-loop does not pass controller.signal. If the observe LLM call hangs, the round cannot be cancelled. The overall loop is bounded by ROUNDS, so this is not a hang risk, but it means a timed-out round's observer call continues burning tokens after the worker was already aborted. Pass { chat, model, signal: controller.signal } to observe.

🟡 LOW Final status message checks current steers, not cumulative history — bench/src/cloud-loop.mts

Line 125: steers.length ? 'steered ' : '' reflects only whether the LAST round had steers (steers is cleared and refilled each round at line 117-118). If the observer returned findings in round 2 but not round 3, the final message says 'rounds' without 'steered', which is misleading. Cosmetic only. Track a boolean everSteered if accurate reporting matters.
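
The `everSteered` suggestion could look roughly like this. `summarize` and its per-round count input are hypothetical, and the exact wording of the status message is an assumption based on the fragment quoted in the finding.

```typescript
// Hypothetical helper: builds the final status line from the number of
// steers the observer produced in each round.
function summarize(steersPerRound: readonly number[]): string {
  let everSteered = false;
  for (const count of steersPerRound) {
    // Sticky flag: once any round was steered, it stays true.
    if (count > 0) everSteered = true;
  }
  return `${steersPerRound.length} ${everSteered ? "steered " : ""}rounds`;
}

console.log(summarize([0, 2, 0])); // steered in round 2 only
console.log(summarize([0, 0, 0])); // never steered
```

The first call reports "3 steered rounds" even though the last round produced no steers, which is the case the finding describes.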

🟡 LOW no test coverage — bench/src/cloud-loop.mts

No tests exist for this file. The vitest config (vitest.config.ts:5) excludes bench/** entirely, so even if tests were written they wouldn't run in CI. This is a bench/tooling script by design, but the verify() and tools() functions are pure and testable. Consider extracting them to a testable location or adding an integration check gated on env vars.

🟡 LOW observer failure unhandled — crashes the loop — bench/src/cloud-loop.mts

L117-120: observe() is called outside the try/catch that protects the sandbox run (L91-108). If the router LLM returns a malformed JSON response, or the network errors, observe() throws → bypasses the catch on L105 → propagates to main().catch() on L135 → logs and exits 1. The per-round error handling pattern (log, continue) is broken for the observer leg. Wrap in try/catch and continue on failure so a transient router blip doesn't kill the whole bench run.
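
One way to restore the log-and-continue pattern for the observer leg is a small wrapper. This is a sketch only: `observeSafely`, the `Finding` shape, and the injected `log` function are hypothetical, and the real observe() call would be passed in as `observeFn`.

```typescript
// Hypothetical minimal finding shape for the example.
interface Finding { recommended_action?: string }

async function observeSafely(
  observeFn: () => Promise<Finding[]>,
  round: number,
  log: (msg: string) => void = console.error,
): Promise<Finding[]> {
  try {
    return await observeFn();
  } catch (err) {
    // Malformed JSON or a network error lands here instead of in main().catch().
    const reason = err instanceof Error ? err.message : String(err);
    log(`round ${round}: observer failed (${reason}); continuing without a steer`);
    return []; // no findings means no new steers this round
  }
}

// Usage with a failing stub: the loop would carry on with an empty result.
observeSafely(async () => {
  throw new Error("malformed JSON from router");
}, 2).then((findings) => console.log("findings:", findings.length));
```

Returning an empty list keeps the caller's loop body unchanged; a caller that needs to distinguish "no findings" from "observer failed" would want a richer return type.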

🟡 LOW unnecessary as never type assertion — bench/src/cloud-loop.mts

L96: fromEvents: (e) => answerOutput.parse(e as never). The e parameter is typed SandboxEvent[] from Deliverable<'events'>. answerOutput.parse accepts ReadonlyArray<unknown> per OutputAdapter<string> (experiment.ts:45). SandboxEvent[] is assignable to ReadonlyArray<unknown> without any cast. The as never is dead code — remove it. If the cast was suppressing a real type error, the root cause should be fixed instead of papered over.
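
The assignability claim can be checked in isolation. The shapes below are simplified assumptions, not the repo's real `SandboxEvent` or `OutputAdapter` definitions; the point is only that a mutable array of any element type is accepted where `ReadonlyArray<unknown>` is expected.

```typescript
// Hypothetical minimal shapes standing in for the real types.
interface SandboxEvent { type: string; data?: unknown }
interface OutputAdapter<T> { parse(events: ReadonlyArray<unknown>): T }

// Stub adapter: reports how many events it was handed.
const answerOutput: OutputAdapter<string> = {
  parse: (events) => String(events.length),
};

// Compiles under strict mode without `as never`:
// SandboxEvent[] is assignable to ReadonlyArray<unknown>.
const fromEvents = (e: SandboxEvent[]): string => answerOutput.parse(e);

console.log(fromEvents([{ type: "message" }, { type: "done" }]));
```

If removing the cast in the real file produces a compile error, the actual parameter types differ from what the finding assumes, and that mismatch is the thing to fix.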


tangletools · 2026-06-08T14:30:03Z · trace

@tangletools tangletools left a comment

✅ Approved — 6 non-blocking findings — 302af97e

Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-08T14:30:03Z · immutable trace

@drewstone drewstone merged commit 4917ef6 into main Jun 8, 2026
1 check passed
@drewstone drewstone deleted the feat/loops-live-join branch June 8, 2026 15:26
2 participants