Add connection resilience layer with heartbeat and network awareness#7

Merged
alexchaomander merged 6 commits into main from
claude/research-moshi-protocol-Ml4Cz
Mar 25, 2026
Conversation

@alexchaomander
Owner

Summary

Implements a multi-layered connection resilience system to detect and recover from silent WebSocket failures on mobile networks. Adds server-side heartbeat pings, client-side watchdog timeout detection, network-change event handling, and improved wake-from-sleep recovery. This ensures the terminal reconnects within seconds of network disruptions (phone sleep, Wi-Fi↔cellular switches, signal loss) rather than waiting for TCP timeout.

Type of change

  • New feature
  • Documentation update

Changes

Backend (backend/src/terminal/routes.ts)

  • Server heartbeat: Sends a ping message every 15 seconds and closes the socket if no pong arrives before the next interval
  • Pong handler: Receives client pong responses to confirm the connection is alive
  • Cleanup: Properly clears the heartbeat timer on socket close/error
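
The heartbeat bullets above can be sketched as a small state machine. This is a minimal illustration, not the code in routes.ts: the `Heartbeat` class, the `HeartbeatSocket` interface, `startHeartbeat`, and the `{ type: 'ping' }` message shape are all assumed names chosen for the example. Only the 15-second interval and the `1001, 'Ping timeout'` close come from this PR.

```typescript
// Hypothetical sketch: names and message shape are assumptions.
interface HeartbeatSocket {
  send(data: string): void;
  close(code: number, reason: string): void;
}

class Heartbeat {
  private alive = true; // set to false after each ping until a pong arrives
  private stopped = false;
  private readonly socket: HeartbeatSocket;

  constructor(socket: HeartbeatSocket) {
    this.socket = socket;
  }

  // Call when a pong message arrives from the client.
  onPong(): void {
    this.alive = true;
  }

  // Call once per interval. Returns false once the heartbeat has stopped,
  // so the caller knows to clear its timer.
  tick(): boolean {
    if (this.stopped) return false;
    if (!this.alive) {
      // No pong since the previous ping: treat the connection as dead.
      this.stopped = true;
      this.socket.close(1001, 'Ping timeout');
      return false;
    }
    this.alive = false;
    this.socket.send(JSON.stringify({ type: 'ping' }));
    return true;
  }

  // Call from close/error handlers and from any early-return path.
  stop(): void {
    this.stopped = true;
  }
}

// Wires the state machine to a real timer; cleanup is safe to call repeatedly.
function startHeartbeat(
  socket: HeartbeatSocket,
  intervalMs = 15000,
): { heartbeat: Heartbeat; cleanup: () => void } {
  const heartbeat = new Heartbeat(socket);
  const timer = setInterval(() => {
    if (!heartbeat.tick()) clearInterval(timer);
  }, intervalMs);
  const cleanup = (): void => {
    heartbeat.stop();
    clearInterval(timer);
  };
  return { heartbeat, cleanup };
}
```

In this sketch, a link that dies right after a pong is only noticed on the second following tick, so detection can take up to roughly two intervals.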

Frontend (frontend/src/hooks/useTerminal.ts)

  • Client-side ping watchdog: Tracks the last server ping timestamp and force-closes the socket if no ping arrives within 35 seconds (catches stale TCP sockets after phone sleep)
  • Reconnect-now function: Bypasses exponential backoff to immediately reconnect when a strong signal indicates the connection is dead
  • Network-aware reconnect: Listens to browser online/offline events and reconnects immediately on network transitions (Wi-Fi↔cellular), with a 200 ms debounce to avoid burst reconnects
  • Improved visibility handling: Detects not only closed sockets but also stuck CONNECTING sockets and terminates them immediately
  • Session ended state: New sessionEnded boolean in the hook result to distinguish between "syncing" (reconnecting) and "ended" (PTY exited) states
  • Stale socket guards: Prevents race conditions where reconnectNow() fires while a socket handshake is in-flight
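
As a rough illustration of the client-side ping watchdog listed above, the sketch below keeps the timing logic separate from any real timer. `PingWatchdog`, `handleServerMessage`, `checkWatchdog`, and the `ping`/`pong` message shapes are assumptions made for the example; only the 35-second threshold is taken from this PR.

```typescript
// Hypothetical sketch: names and message shapes are assumptions.
interface ClosableSocket {
  send(data: string): void;
  close(): void;
}

class PingWatchdog {
  private lastPingAt: number;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number, now: number) {
    this.timeoutMs = timeoutMs;
    this.lastPingAt = now;
  }

  // Record that a server ping was seen at time `now` (milliseconds).
  notePing(now: number): void {
    this.lastPingAt = now;
  }

  // True when no ping has been seen for longer than the timeout.
  isStale(now: number): boolean {
    return now - this.lastPingAt > this.timeoutMs;
  }
}

// On a server ping: refresh the watchdog and answer with a pong.
function handleServerMessage(
  raw: string,
  watchdog: PingWatchdog,
  socket: ClosableSocket,
  now: number,
): void {
  const msg = JSON.parse(raw) as { type?: string };
  if (msg.type === 'ping') {
    watchdog.notePing(now);
    socket.send(JSON.stringify({ type: 'pong' }));
  }
}

// Run periodically; force-closes the socket when the server has gone silent.
function checkWatchdog(
  watchdog: PingWatchdog,
  socket: ClosableSocket,
  now: number,
): boolean {
  if (!watchdog.isStale(now)) return false;
  socket.close();
  return true;
}
```

A caller would typically construct `new PingWatchdog(35000, Date.now())` when the socket opens and run `checkWatchdog` from an interval that is cleared on close, on forced reconnect, and on unmount.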

UI (frontend/src/components/Terminal.tsx)

  • Status indicator: Terminal header now shows "Ended" (gray) when the session has terminated, vs "Live" (green) or "Syncing" (amber)
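
One way to derive the three header states from the hook result is a small pure function. This is a sketch under assumptions: `terminalStatus` and `TerminalStatus` are hypothetical names, and giving `sessionEnded` precedence over `isConnected` is a choice made for the example rather than something confirmed for Terminal.tsx.

```typescript
// Hypothetical sketch: function and type names are assumptions.
type TerminalStatus = {
  label: 'Live' | 'Syncing' | 'Ended';
  color: 'green' | 'amber' | 'gray';
};

function terminalStatus(isConnected: boolean, sessionEnded: boolean): TerminalStatus {
  // An ended session wins over connection state: there is nothing left to sync.
  if (sessionEnded) return { label: 'Ended', color: 'gray' };
  if (isConnected) return { label: 'Live', color: 'green' };
  return { label: 'Syncing', color: 'amber' };
}
```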

Documentation

  • how-it-works.md: New "Connection Resilience" section explaining the layered approach (server heartbeat, client watchdog, network events, wake-from-sleep recovery)
  • remote-control.md: New "Connection Resilience on Mobile" table showing recovery times for common scenarios
  • README.md: Added resilience feature to the feature list
  • CHANGELOG.md: Documented the new heartbeat and watchdog features

Testing

  • Ran npm run build successfully
  • Tested locally end-to-end (heartbeat/pong exchange, watchdog timeout, network event handling)
  • Tested on mobile browser (visibility change, network transitions, session ended state)

Checklist

  • Code follows the project's TypeScript conventions (strict mode, no any)
  • No hardcoded secrets or credentials
  • No console.log in production code paths
  • Documentation updated with resilience details
  • Proper cleanup of timers and event listeners to prevent leaks
  • Stale socket guards prevent race conditions during rapid reconnects

https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX

claude added 5 commits March 24, 2026 11:18
…nect

Addresses the core weakness the Moshi creator identified: tmux handles
server-side persistence, but the client-to-server WebSocket link dies
silently on phone sleep and WiFi↔cellular switches — the same fragility
mosh was designed to fix at the transport layer.

Since we're bound to WebSocket (browser), we implement the equivalent
resilience at the application layer:

1. Server-side heartbeat (routes.ts)
   - Server pings every 15 s; if no pong returns before the next ping
     the connection is considered dead and closed immediately.
   - Dead connections now detected in <20 s instead of TCP's multi-minute
     timeout window.

2. Client-side ping watchdog (useTerminal.ts)
   - Client tracks time of last server ping; if silent for 35 s the
     socket is forcibly closed to trigger a fresh reconnect cycle.
   - Catches the mirror case: client is "alive" but server can't reach
     it (common after mobile sleep with NAT table expiry).

3. Immediate reconnect on network change (useTerminal.ts)
   - The `online` event on `window` fires when connectivity returns, e.g.
     once a WiFi↔cellular switch completes.
   - New `reconnectNow()` helper kills any pending backoff timer and
     opens a fresh WebSocket immediately — no waiting for backoff queue.

4. Improved visibility reconnect (useTerminal.ts)
   - Existing handler only checked CLOSED state; now also detects stuck
     CONNECTING sockets (common after wake) and force-restarts them.
   - Cancels any queued backoff retry before calling connect().
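
The difference between the normal backoff path and `reconnectNow()` can be sketched as follows. This is illustrative only: the `Reconnector` class, the injected `Schedule` function, and the 1 s base / 30 s cap are assumptions, since the commit message does not state the backoff parameters used in useTerminal.ts.

```typescript
// Hypothetical sketch: class name and backoff parameters are assumptions.
type Cancel = () => void;
type Schedule = (fn: () => void, delayMs: number) => Cancel;

class Reconnector {
  private retryCount = 0;
  private cancelPending: Cancel | null = null;
  private readonly connect: () => void;
  private readonly schedule: Schedule;
  private readonly baseMs: number;
  private readonly maxMs: number;

  constructor(connect: () => void, schedule: Schedule, baseMs = 1000, maxMs = 30000) {
    this.connect = connect;
    this.schedule = schedule;
    this.baseMs = baseMs;
    this.maxMs = maxMs;
  }

  // Normal path after an unexpected close: wait base * 2^retryCount, capped.
  scheduleRetry(): number {
    if (this.cancelPending) this.cancelPending();
    const delay = Math.min(this.baseMs * Math.pow(2, this.retryCount), this.maxMs);
    this.retryCount += 1;
    this.cancelPending = this.schedule(() => {
      this.cancelPending = null;
      this.connect();
    }, delay);
    return delay;
  }

  // Strong signal that the link is dead: drop any queued retry and connect now.
  reconnectNow(): void {
    if (this.cancelPending) this.cancelPending();
    this.cancelPending = null;
    this.connect();
  }

  // Call from the socket's open handler so the next failure starts at baseMs.
  onConnected(): void {
    this.retryCount = 0;
  }
}

// Real-timer adapter for production use; tests can inject a fake instead.
const timerSchedule: Schedule = (fn, delayMs) => {
  const timer = setTimeout(fn, delayMs);
  return () => clearTimeout(timer);
};
```

Injecting the scheduler keeps the backoff logic testable without waiting on real timers.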

https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
Updates four docs to explain the heartbeat, ping watchdog, network-aware
reconnect, and improved visibility handling added in the previous commit.

- CHANGELOG.md — four new bullet points under [Unreleased]
- README.md — new "Resilient connection" feature bullet + new FAQ entry
  covering the phone sleep / Wi-Fi↔cellular scenario explicitly
- docs/how-it-works.md — new "Connection Resilience" section (§3)
  explaining each layer of the system; section numbers bumped; summary
  flow updated with a resilience step
- docs/remote-control.md — new "Connection Resilience on Mobile" section
  with a scenario table covering sleep, network switch, signal loss, and
  page-visibility cases

https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
Two bugs in useTerminal fixed:

1. Ping watchdog interval leak — reconnectNow() suppresses the onclose
   callback (by setting ws.onclose = null) so the backoff path is skipped
   on forced reconnects. As a side-effect, the clearInterval(pingWatchdog)
   call inside onclose was never reached, leaking one setInterval per
   forced reconnect. Fix: track the active watchdog in pingWatchdogRef and
   clear it explicitly in reconnectNow() and on unmount.

2. Stale-socket race in onopen — if reconnectNow fires while a
   CONNECTING socket's handshake is in-flight, the old socket's onopen
   can still fire after the new socket has taken wsRef.current. That
   would corrupt shared state (retryCountRef, pendingMessagesRef,
   isConnected) on behalf of the wrong socket. Fix: guard onopen and
   onclose with if (ws !== wsRef.current) and silently close the
   orphaned socket.
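
The stale-socket guard described in fix 2 can be illustrated with a handler factory that compares the socket it was created for against the current ref. `guardedOnOpen`, `SocketLike`, and `SocketRef` are hypothetical names for this sketch; the real hook uses `wsRef` and its own state setters.

```typescript
// Hypothetical sketch: helper and type names are assumptions.
interface SocketLike {
  close(): void;
}

interface SocketRef {
  current: SocketLike | null;
}

// Builds an open handler that only touches shared state when `ws` is still
// the active socket; an orphaned socket is closed and otherwise ignored.
function guardedOnOpen(
  ws: SocketLike,
  wsRef: SocketRef,
  onActiveOpen: () => void,
): () => void {
  return () => {
    if (ws !== wsRef.current) {
      ws.close(); // a newer socket has taken over; discard this one quietly
      return;
    }
    onActiveOpen();
  };
}
```

The same identity check applies to the close handler, so a late close from an orphaned socket does not schedule a retry on behalf of the active one.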

https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
Reflects workspace version bumps (backend/frontend 0.1.5→0.1.6) and
dev/peer flag corrections for optional rollup platform packages.

https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
Three follow-up improvements to connection resilience:

1. offline event listener
   Listening to window 'offline' now calls setIsConnected(false)
   immediately, so the terminal header flips to "Syncing" the moment the
   network drops rather than waiting up to 35 s for the ping watchdog or
   20 s for the server heartbeat to notice.

2. handleOnline OPEN guard + 200 ms debounce
   - Guard: if wsRef.current is already OPEN, skip the reconnect entirely.
     Some mobile browsers fire 'online' even when the existing socket is
     healthy (e.g. switching back to a known Wi-Fi while LTE stays up),
     which would needlessly tear down a working connection.
   - Debounce: some OS/browser combos emit multiple 'online' events during
     a single network transition. The 200 ms debounce collapses the burst
     into one reconnect attempt instead of spinning up multiple sockets
     that the stale-socket guard then has to clean up.

3. session.status stopped/error handling
   The 'session.status' case was a no-op. The backend sends
   {type:'session.status', status:'stopped'} when the PTY exits and
   {status:'error'} on failures. Now these set sessionEnded=true and
   isConnected=false so the terminal header correctly shows "Ended"
   (gray dot) instead of staying on "Live" or "Syncing" after the agent
   finishes. sessionEnded is reset to false when a new session mounts.
   Terminal.tsx updated to consume the new sessionEnded field.
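
The OPEN guard and the 200 ms debounce from item 2 can be sketched as a handler factory. `createOnlineHandler`, `OnlineHandlerDeps`, and the injected `schedule` function are assumptions for the example; in the hook the guard reads `wsRef.current?.readyState` and the debounce uses `setTimeout` directly.

```typescript
// Hypothetical sketch: factory and dependency names are assumptions.
type CancelFn = () => void;
type ScheduleFn = (fn: () => void, delayMs: number) => CancelFn;

interface OnlineHandlerDeps {
  isSocketOpen: () => boolean; // e.g. readyState === WebSocket.OPEN
  reconnectNow: () => void;
  schedule: ScheduleFn;
  debounceMs?: number;
}

function createOnlineHandler(
  deps: OnlineHandlerDeps,
): { handleOnline: () => void; dispose: () => void } {
  const debounceMs = deps.debounceMs ?? 200;
  let cancelPending: CancelFn | null = null;

  const handleOnline = (): void => {
    // Guard: a socket that reports OPEN is left alone.
    if (deps.isSocketOpen()) return;
    // Debounce: restart the timer so a burst of events yields one reconnect.
    if (cancelPending) cancelPending();
    cancelPending = deps.schedule(() => {
      cancelPending = null;
      deps.reconnectNow();
    }, debounceMs);
  };

  // Call when removing the event listener (e.g. effect cleanup).
  const dispose = (): void => {
    if (cancelPending) cancelPending();
    cancelPending = null;
  };

  return { handleOnline, dispose };
}
```

With a real scheduler such as `(fn, ms) => { const t = setTimeout(fn, ms); return () => clearTimeout(t); }`, `handleOnline` would be registered via `window.addEventListener('online', handleOnline)` and `dispose` called in the effect cleanup.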

https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust, multi-layered connection resilience system for WebSockets, specifically designed to improve the user experience on mobile networks. By implementing server-side heartbeats, client-side watchdogs, network-aware reconnects, and improved wake-from-sleep recovery, the system can quickly detect and recover from silent WebSocket failures caused by network disruptions like phone sleep, Wi-Fi/cellular switches, or signal loss. This ensures the terminal reconnects within seconds, significantly reducing downtime and enhancing the reliability of the application.

Highlights

  • Server-side Heartbeat: Implemented a server-side heartbeat that sends pings every 15 seconds and closes the socket if no pong is received, reducing dead connection detection time to under 20 seconds.
  • Client-side Ping Watchdog: Added a client-side watchdog that force-closes and reconnects the socket if no server ping is received for 35 seconds, addressing silently stale TCP sockets common after phone sleep.
  • Network-aware Reconnect: Implemented immediate reconnection on browser 'online'/'offline' events with a 200ms debounce, bypassing exponential backoff for faster recovery during network transitions (e.g., Wi-Fi ↔ cellular).
  • Improved Wake-from-Sleep Recovery: Enhanced visibility change handling to detect and immediately terminate 'CONNECTING' sockets that are stuck after waking from sleep, preventing connection timeouts.
  • Session Ended State: Introduced a new 'sessionEnded' boolean in the 'useTerminal' hook and updated the UI to distinguish between 'syncing' (reconnecting) and 'ended' (PTY exited) states.
  • Stale Socket Guards: Added guards to prevent race conditions during rapid reconnects, ensuring only the active WebSocket is used and preventing state corruption.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9824f8d01f

Comment on lines +151 to +154
const heartbeatTimer = setInterval(() => {
  if (!heartbeatAlive) {
    ws.close(1001, 'Ping timeout');
    return;

P1: Stop heartbeat timer on unauthenticated websocket exits

The heartbeat interval is started before authentication, but the unauthenticated branch returns immediately after ws.close(...) without clearing it. Because the close/error handlers that call cleanupHeartbeat() are registered later, this path leaves one live interval per rejected connection attempt, which can accumulate into avoidable memory/CPU load under repeated unauthorized probes.


useEffect(() => {
  let onlineDebounceTimer: ReturnType<typeof setTimeout> | null = null
  const handleOnline = () => {
    if (wsRef.current?.readyState === WebSocket.OPEN) return

P2: Reconnect on online event even when socket still reports OPEN

The online handler bails out whenever readyState === OPEN, but after Wi-Fi/cellular transitions a dead TCP/WebSocket often remains OPEN in browser state for a while. In that common mobile scenario this return prevents the new immediate-reconnect path from running, so recovery falls back to heartbeat/watchdog timeouts instead of happening right after the network change.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the connection resilience of CloudCode's WebSocket-based terminal. Key changes include implementing a server-side heartbeat to detect and close dead connections within 20 seconds, and a client-side ping watchdog that forces a reconnect if no server ping is received for 35 seconds. The client also now intelligently handles network changes (e.g., Wi-Fi to cellular) and page visibility changes (e.g., waking from sleep) by immediately attempting to re-establish the WebSocket connection. UI elements have been updated to reflect these new connection states, including a 'Session ended' status. The documentation has been thoroughly updated to describe these new resilience features. A minor improvement opportunity was noted in the ws.onerror handler in useTerminal.ts, where the explicit clearInterval(pingWatchdog) call is redundant as the ws.onclose handler already performs this cleanup.

Comment on lines 270 to 273
ws.onerror = () => {
  clearInterval(pingWatchdog)
  ws.close()
}

Severity: medium

The error event on a browser WebSocket is always followed by a close event. Since your onclose handler already performs all necessary cleanup, including clearing the pingWatchdog interval, calling clearInterval here is redundant. You can simplify this handler to just ws.close() and let onclose manage the cleanup to avoid duplication and improve clarity.

    ws.onerror = () => {
      ws.close();
    };

Two fixes to the heartbeat cleanup logic:

1. The auth failure early-return path now calls cleanupHeartbeat() before
   closing the socket. Previously the ws.on('close') handler (which calls
   cleanupHeartbeat) was registered after the auth check, so unauthenticated
   connections leaked the setInterval forever.

2. The heartbeat interval callback now clears itself before calling ws.close()
   when a ping timeout is detected, rather than relying solely on the onclose
   handler to stop it. This prevents the interval from firing additional times
   while the socket transitions through CLOSING state.
Owner Author

@alexchaomander alexchaomander left a comment


Review

The overall design is solid — layered heartbeat + watchdog + network events is the right approach for mobile WebSocket resilience, and the implementation is careful (stale socket guards, debouncing, proper cleanup in most paths).

One bug was found and fixed in the pushed commit (d910d41):

Bug: heartbeat timer leaked on unauthenticated connections

setInterval for the heartbeat was started before the auth check, but ws.on('close', cleanupHeartbeat) was registered after the auth failure return. Since the close handler was never registered on the unauthenticated path, the interval ran forever.

Fix: call cleanupHeartbeat() explicitly before the early return. Also took the opportunity to self-clear the interval inside the ping-timeout branch of the callback so it stops immediately rather than continuing to fire while the socket is in CLOSING state.

// Before
if (!authenticated) {
  ws.send(...);
  ws.close(1008, 'Unauthorized');
  return;  // ws.on('close') never registered → timer leaks
}

// After
if (!authenticated) {
  ws.send(...);
  cleanupHeartbeat();  // ← added
  ws.close(1008, 'Unauthorized');
  return;
}

Minor nit (not blocking): ws.readyState === 1 appears in several places. The ws library exports WebSocket.OPEN — using that constant would make the intent explicit, though this is cosmetic.

Everything else looks good: the reconnectNow stale-socket guard, the ping watchdog ref-vs-local handling, the debounce on the online event, and the sessionEnded state reset on sessionId change all check out.

@alexchaomander alexchaomander merged commit 05bab22 into main Mar 25, 2026
1 check passed
@alexchaomander alexchaomander deleted the claude/research-moshi-protocol-Ml4Cz branch March 25, 2026 15:34