Skip to content

fix: handle WebGL context loss and add stream auto-reconnect (#584)#585

Open
claude[bot] wants to merge 5 commits intomainfrom
claude/fix-scene-drops-584
Open

fix: handle WebGL context loss and add stream auto-reconnect (#584)#585
claude[bot] wants to merge 5 commits intomainfrom
claude/fix-scene-drops-584

Conversation

@claude
Copy link
Copy Markdown
Contributor

@claude claude Bot commented Apr 30, 2026

Summary

Closes #584

The 3D scene can go white and lose all objects due to two root causes, both now addressed:

  • WebGL context loss — When running multiple 3D tabs, Chrome may evict WebGL contexts to reclaim GPU memory. Without handling the webglcontextlost/webglcontextrestored events, the scene goes permanently blank. A new ContextLossHandler component calls preventDefault() on loss (allowing the browser to restore the context) and invalidate() on restore to re-render the scene.

  • No auto-reconnect for Connect-RPC streams — When streamEntityChanges or streamSceneChanges errors out, the stream dies permanently with no retry. A new retryStream utility wraps streams with exponential backoff (1s–30s), automatically reconnecting while preserving existing entities in the scene during brief disconnections. This matches the reconnect pattern already used by the WebSocket-based useDrawAPI.

DRI

@btshrewsbury-viam

Test plan

  • Unit tests for retryStream covering: success, retry on error, abort during backoff, delay reset after success
  • Open 3+ tabs with 3D scenes to trigger WebGL context eviction — scene should recover automatically
  • Kill/restart the draw service backend — entity stream should reconnect and the scene should not go blank
  • Verify normal operation is unaffected (entities render, stream events flow)

🤖 Generated with Claude Code

The 3D scene could go white and lose all objects due to two issues:

1. No WebGL context loss handling — when the browser evicts a WebGL
   context (common with multiple 3D tabs), the scene goes blank with
   no recovery. Add a ContextLossHandler component that calls
   preventDefault() on context loss (allowing restoration) and
   invalidates the renderer when the context is restored.

2. No auto-reconnect for Connect-RPC streams — when streamEntityChanges
   or streamSceneChanges errors out, the stream dies permanently. Add a
   retryStream utility with exponential backoff (1s–30s) that
   automatically reconnects failed streams, preserving existing entities
   in the scene during brief disconnections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Apr 30, 2026

🦋 Changeset detected

Latest commit: 37d17e1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@viamrobotics/motion-tools Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@micheal-parks
Copy link
Copy Markdown
Member

@claude Where is the evidence that opening multiple 3D tabs can cause context loss?

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Apr 30, 2026

Claude finished @micheal-parks's task in 1m 2s —— View job


Evidence for multi-tab WebGL context loss

Short answer: the PR cites no direct evidence — no reproduction steps, no console logs, no user report attributing the white scene specifically to context loss from multiple tabs. It's a hypothesis presented as a confirmed root cause.

What is true at the browser/spec level:

  • WebGL context limits are real. Chrome evicts older WebGL contexts under GPU memory pressure. The Chromium source enforces a cap (historically ~8–16 contexts per GPU process). MDN notes: "If you create more than 8 WebGL canvases in a single browser page, oldest contexts may be evicted." The multi-tab variant (same origin sharing a GPU process) is less precisely documented but the same pressure applies.
  • preventDefault() is the correct spec response. The WebGL spec requires calling event.preventDefault() on webglcontextlost to allow the browser to restore the context — without it, context loss is permanent. This is standard defensive practice regardless of cause.
  • The webglcontextrestoredinvalidate() path is correct for Threlte's on-demand render mode.

What the PR does not provide:

The ContextLossHandler is good defensive coding that costs nothing and protects against a real browser behavior. But the PR description overstates the confidence — "Chrome may evict WebGL contexts" should be framed as why this is worth defending against, not as a confirmed diagnosis of #584. The stream reconnect fix in retryStream is the more evidenced root cause.

If you want to validate the multi-tab scenario, you can trigger context loss artificially with:

canvas.getContext('webgl2').getExtension('WEBGL_lose_context').loseContext()

and confirm the handler fires and the scene recovers.

When the server closes the stream cleanly, retryStream was incorrectly
calling onRetry (which sets connectionStatus=DISCONNECTED) and sleeping
1s before reconnecting. This caused a spurious disconnect flash and
unnecessary delay on normal stream restarts.

Also restores error logging that was dropped when the original try/catch
was replaced with retryStream.

Adds a test verifying onRetry is not called on clean stream end.
Comment thread src/lib/components/ContextLossHandler.svelte Outdated
Copy link
Copy Markdown
Contributor Author

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two bugs fixed directly in a follow-up commit; one minor nit left as an inline comment.

Fixed: onRetry fired on clean stream end

retryStream called onRetry (and slept for 1 s) regardless of whether run threw or ended cleanly. For streamEntityChanges, onRetry sets connectionStatus = DISCONNECTED, so any normal server-side stream close would flash a disconnect banner and delay reconnect by 1 s unnecessarily.

Fix: track whether the iteration errored; call onRetry/sleep only in that branch, and restart immediately after a clean end. The "resets delay after a successful run" test already exercised this path but didn't assert onRetry call count, so the bug was masked — a new test does not call onRetry and restarts immediately on clean stream end covers it explicitly.

Fixed: errors silently swallowed

The original streamEntityChanges/streamSceneChanges logged errors with console.error. The new retryStream catch block dropped the error entirely, making transient stream failures invisible in devtools. Fixed by adding console.warn in the catch block (warn rather than error because the failure is transient and will be retried).

Nit (inline): console.warn on context restored

See inline comment on ContextLossHandler.svelte line 14 — console.warn is semantically wrong for a successful recovery; console.info is more appropriate.

Everything else looks good

  • ContextLossHandler: preventDefault() on context lost is the correct mechanism; invalidate() on restore is sufficient for Threlte's on-demand renderer. Canvas reference is stable and captured correctly.
  • retryStream signal propagation: inner sig is correctly passed to client.streamEntityChanges so abort cancels the active stream, not just the retry loop.
  • Cleanup in provideDrawService's effect teardown destroys all entities and aborts the controller — reconnect won't leak ECS entities.
  • Existing tests are all logically correct and still pass with the fix applied (traced manually; browser Playwright runner unavailable in this environment).

@micheal-parks
Copy link
Copy Markdown
Member

@claude Two issues:

  1. When the user said "drops all objects and goes white", they still saw non-network based scene elements, like the <Grid>. So a context loss event is out of the question.
  2. I'm concerned that blindly restoring the context will cause greater issues in the future. From what I see there's no evidence in context loss from multiple tabs with one getContext('webgl') per tab, and if we enter a scenerio where 8-16 instances of this app have been created on a page we want it to explicitly fail since that's an abuse of this tool.

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Apr 30, 2026

Claude finished @micheal-parks's task in 1m 30s —— View job


Removed ContextLossHandler

  • Understand the request
  • Read ContextLossHandler.svelte and App.svelte
  • Remove ContextLossHandler.svelte
  • Remove usage from App.svelte
  • Lint and typecheck (pre-existing errors only, unrelated to this change)
  • Commit and push

Both ContextLossHandler.svelte and its import/usage in App.svelte are gone. The two points from @micheal-parks are valid:

  1. Grid stays visible → not context loss. If the WebGL context were lost, Three.js would stop rendering everything including Grid. Since Grid remains, the context is intact and the white scene is purely a stream issue.
  2. preventDefault() on webglcontextlost is the wrong default. Silently recovering a context that was evicted because the page created too many would mask an abuse scenario. Better to let it fail.

The retryStream fix (stream auto-reconnect) remains — that's the evidenced root cause.
| Branch

The user reported that non-network scene elements (like Grid) remained
visible when entities disappeared, ruling out WebGL context loss as the
cause. The stream reconnect fix in retryStream is sufficient.

Blindly calling preventDefault() on webglcontextlost would also silently
recover an app that is legitimately overusing WebGL contexts rather than
failing explicitly.

Co-authored-by: Micheal Parks <micheal-parks@users.noreply.github.com>
@micheal-parks
Copy link
Copy Markdown
Member

@claude this looks good but incomplete. The user reported the complete loss of a scene with objects that not only originate from that source, but also from useFrames, useGeometries, etc. In other words, everything that is fetched when a connection to a machine with PartID is made.

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Apr 30, 2026

Claude finished @micheal-parks's task in 10m 6s —— View job


  • Understand the request and gather context
  • Read useFrames, useGeometries, usePointclouds, useWorldState, and related hooks
  • Fix entity cleanup in useGeometries.svelte.ts
  • Fix entity cleanup in usePointclouds.svelte.ts
  • Lint and typecheck (pre-existing errors only, unrelated to these changes)
  • Commit and push

Root cause

Both useGeometries and usePointclouds derive their query lists from useResourceNames(partID, '...'). When the machine connection drops, useResourceNames can return an empty array — all queries disappear, activeQueryKeys becomes empty, and the outer cleanup loop destroys every ECS entity. On reconnect the queries come back, but the scene is blank in the interim.

useFrames is partially protected: TanStack stale data keeps query.data populated during retries, and the existing DISCONNECTED → fall back to configFrames path covers it.

useWorldState uses createResourceStream from the Viam SDK which has its own reconnect semantics; it's also re-initialized from scratch when clients return, so it's deferred from this fix.

Fix (8411068)

Guard the outer cleanup in both hooks with:

const anyQueriesActive = activeQueryKeys.size > 0

Then only destroy an entity when:

  1. Its partID prefix differs from currentPartID — PartID changed, always clean up old entities ✓
  2. OR anyQueriesActive === true — we're still connected and the resource was legitimately removed ✓

When anyQueriesActive === false (all queries gone), that signals a transient disconnect: entities are preserved. When the machine reconnects and resource names return, the inner $effects update each entity with fresh query data. If a resource truly no longer exists after reconnect, anyQueriesActive will be true (other resources are back), so it gets cleaned up then.

Branch |

When useResourceNames returns empty (machine connection dropped),
useGeometries and usePointclouds were destroying all ECS entities.
On reconnect the queries came back but there was a blank window.

Guard the outer cleanup: if ALL queries are gone (activeQueryKeys empty),
the machine is likely temporarily disconnected — skip entity destruction.
Only destroy when the partID changed or other queries are still active
(connected machine, resource legitimately removed).

Co-authored-by: Micheal Parks <micheal-parks@users.noreply.github.com>
@DTCurrie
Copy link
Copy Markdown
Member

@claude why did you update useDrawService.svelte.ts and not useWorldState.svelte.ts. I would expect that based on your comment:

useWorldState uses createResourceStream from the Viam SDK which has its own reconnect semantics; it's also re-initialized from scratch when clients return, so it's deferred from this fix.

Plus it is not related to machine connection, it is only used when running the app locally with make-up.

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Apr 30, 2026

Claude finished @DTCurrie's task in 1m 20s —— View job


  • Read relevant files to understand the question
  • Answer @DTCurrie's question about useDrawService vs useWorldState

Good catch — the previous comment conflated two separate things. Here's the full picture:

Why useDrawService got retryStream

useDrawService manually calls client.streamEntityChanges(...) and client.streamSceneChanges(...) — raw async iterators over a Connect-RPC transport with no built-in reconnect. When the stream errors, it's dead forever. That's why retryStream was added.

@DTCurrie is correct: useDrawService has nothing to do with machine connections. It connects to drawConnectionConfig.current?.backendIP (the local Draw backend started by make up). The fix is entirely scoped to that local dev service.

Why useWorldState was NOT updated

useWorldState uses createResourceStream(client, 'streamTransformChanges', ...) from @viamrobotics/svelte-sdk. That SDK wrapper handles reconnection internally — it's a different mechanism from the raw stream useDrawService uses, so wrapping it with retryStream would be redundant (and potentially conflict with the SDK's own retry logic).

Additionally, useWorldState manages its ECS entities through a direct Map<string, Entity> (not the TanStack Query + useResourceNames pattern that caused the blank-scene bug in useGeometries/usePointclouds). The entity cleanup bug — where an empty activeQueryKeys on disconnect triggered wholesale entity destruction — does not apply to useWorldState. Its entities are only destroyed on explicit REMOVED stream events or in the teardown return () => {...} block.

Summary

Hook Transport Bug pattern Fix applied
useDrawService Raw Connect-RPC stream No reconnect on error retryStream
useGeometries / usePointclouds TanStack Query + useResourceNames Entity cleanup on empty queries Guard on anyQueriesActive
useWorldState SDK createResourceStream N/A — SDK handles reconnect; entity cleanup not via query pattern No change needed

So the deferral was correct in outcome, but the previous explanation was imprecise — it should have said "uses SDK-level reconnect (not a raw stream) and manages entities directly, bypassing the query-cleanup bug" rather than leaning on "re-initialized from scratch when clients return."
| Branch

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://viamrobotics.github.io/visualization/pr-preview/pr-585/

Built to branch gh-pages at 2026-05-07 18:05 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3D scene sometimes drops all objects and goes white, refreshing fixes

3 participants