Add connection resilience layer with heartbeat and network awareness #7
…nect
Addresses the core weakness the Moshi creator identified: tmux handles
server-side persistence, but the client-to-server WebSocket link dies
silently on phone sleep and WiFi↔cellular switches — the same fragility
mosh was designed to fix at the transport layer.
Since we're bound to WebSocket (browser), we implement the equivalent
resilience at the application layer:
1. Server-side heartbeat (routes.ts)
- Server pings every 15 s; if no pong returns before the next ping
the connection is considered dead and closed immediately.
- Dead connections now detected in <20 s instead of TCP's multi-minute
timeout window.
2. Client-side ping watchdog (useTerminal.ts)
- Client tracks time of last server ping; if silent for 35 s the
socket is forcibly closed to trigger a fresh reconnect cycle.
- Catches the mirror case: client is "alive" but server can't reach
it (common after mobile sleep with NAT table expiry).
3. Immediate reconnect on network change (useTerminal.ts)
- The window `online` event fires when a Wi-Fi↔cellular switch completes.
- New `reconnectNow()` helper kills any pending backoff timer and
opens a fresh WebSocket immediately — no waiting for backoff queue.
4. Improved visibility reconnect (useTerminal.ts)
- Existing handler only checked CLOSED state; now also detects stuck
CONNECTING sockets (common after wake) and force-restarts them.
- Cancels any queued backoff retry before calling connect().
https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
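As a rough illustration of item 4, the wake-up check can be reduced to the decision below. This is a sketch under assumptions, not the code in useTerminal.ts: `ReadyState`, `VisibilityDeps`, and `onBecameVisible` are hypothetical names, every CONNECTING socket seen at wake-up is treated as stuck, and CLOSING sockets are left to their own close handler.

```typescript
// Numeric readyState values as defined by the WebSocket API.
const ReadyState = { CONNECTING: 0, OPEN: 1, CLOSING: 2, CLOSED: 3 } as const;

// Hypothetical hooks into the surrounding reconnect logic.
interface VisibilityDeps {
  readyState(): number | null; // null when no socket exists yet
  cancelBackoff(): void;       // drop any queued retry timer
  abandonSocket(): void;       // detach handlers and close a stuck socket
  connect(): void;             // open a fresh WebSocket
}

// Call when the page becomes visible. Returns true if a reconnect was started.
function onBecameVisible(deps: VisibilityDeps): boolean {
  const state = deps.readyState();
  const stuck = state === ReadyState.CONNECTING; // assumption: CONNECTING at wake means stuck
  const gone = state === null || state === ReadyState.CLOSED;
  if (!stuck && !gone) return false; // OPEN or CLOSING: nothing to do here

  deps.cancelBackoff(); // cancel the queued retry before calling connect()
  if (stuck) deps.abandonSocket();
  deps.connect();
  return true;
}
```

Keeping the decision separate from the DOM event listener makes the CLOSED and CONNECTING cases easy to exercise without a browser.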
Updates four docs to explain the heartbeat, ping watchdog, network-aware reconnect, and improved visibility handling added in the previous commit.
- CHANGELOG.md: four new bullet points under [Unreleased]
- README.md: new "Resilient connection" feature bullet + new FAQ entry covering the phone sleep / Wi-Fi↔cellular scenario explicitly
- docs/how-it-works.md: new "Connection Resilience" section (§3) explaining each layer of the system; section numbers bumped; summary flow updated with a resilience step
- docs/remote-control.md: new "Connection Resilience on Mobile" section with a scenario table covering sleep, network switch, signal loss, and page-visibility cases
https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
Two bugs in useTerminal fixed:
1. Ping watchdog interval leak: reconnectNow() suppresses the onclose callback (by setting ws.onclose = null) so the backoff path is skipped on forced reconnects. As a side effect, the clearInterval(pingWatchdog) call inside onclose was never reached, leaking one setInterval per forced reconnect. Fix: track the active watchdog in pingWatchdogRef and clear it explicitly in reconnectNow() and on unmount.
2. Stale-socket race in onopen: if reconnectNow fires while a CONNECTING socket's handshake is in flight, the old socket's onopen can still fire after the new socket has taken wsRef.current. That would corrupt shared state (retryCountRef, pendingMessagesRef, isConnected) on behalf of the wrong socket. Fix: guard onopen and onclose with if (ws !== wsRef.current) and silently close the orphaned socket.
https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
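The stale-socket guard from the second fix can be sketched as below. `SocketLike`, `Ref`, and `attachGuardedOpen` are hypothetical stand-ins used only for illustration; the comparison `ws !== wsRef.current` is the part taken from the commit message, and the real hook presumably applies the same check in onclose.

```typescript
// Hypothetical stand-in for a browser WebSocket; only the parts the guard touches.
interface SocketLike {
  onopen: (() => void) | null;
  close(): void;
}

// Same shape as a React ref object.
interface Ref<T> {
  current: T;
}

// Attaches an onopen handler that ignores sockets that are no longer current.
function attachGuardedOpen(
  ws: SocketLike,
  wsRef: Ref<SocketLike | null>,
  onReady: () => void,
): void {
  ws.onopen = () => {
    if (ws !== wsRef.current) {
      ws.close(); // orphaned socket: close it quietly and touch no shared state
      return;
    }
    onReady(); // e.g. reset the retry count, flush pending messages, set isConnected
  };
}
```

Because each handler closes over its own `ws`, a late onopen from a replaced socket cannot be mistaken for the current one.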
Reflects workspace version bumps (backend/frontend 0.1.5→0.1.6) and dev/peer flag corrections for optional rollup platform packages. https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
Three follow-up improvements to connection resilience:
1. offline event listener
Listening to window 'offline' now calls setIsConnected(false)
immediately, so the terminal header flips to "Syncing" the moment the
network drops rather than waiting up to 35 s for the ping watchdog or
20 s for the server heartbeat to notice.
2. handleOnline OPEN guard + 200 ms debounce
- Guard: if wsRef.current is already OPEN, skip the reconnect entirely.
Some mobile browsers fire 'online' even when the existing socket is
healthy (e.g. switching back to a known Wi-Fi while LTE stays up),
which would needlessly tear down a working connection.
- Debounce: some OS/browser combos emit multiple 'online' events during
a single network transition. The 200 ms debounce collapses the burst
into one reconnect attempt instead of spinning up multiple sockets
that the stale-socket guard then has to clean up.
3. session.status stopped/error handling
The 'session.status' case was a no-op. The backend sends
{type:'session.status', status:'stopped'} when the PTY exits and
{status:'error'} on failures. Now these set sessionEnded=true and
isConnected=false so the terminal header correctly shows "Ended"
(grey dot) instead of staying on "Live" or "Syncing" after the agent
finishes. sessionEnded is reset to false when a new session mounts.
Terminal.tsx updated to consume the new sessionEnded field.
https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX
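The session.status handling in item 3 amounts to a small state transition, sketched here. `TerminalStatus`, `applySessionStatus`, and `headerLabel` are hypothetical names; the message shape, the stopped/error values, and the three header labels come from the commit message. Treating every other status value as a no-op is an assumption.

```typescript
interface TerminalStatus {
  isConnected: boolean;
  sessionEnded: boolean;
}

interface SessionStatusMessage {
  type: 'session.status';
  status: string;
}

// Returns the next hook state after a session.status message.
function applySessionStatus(prev: TerminalStatus, msg: SessionStatusMessage): TerminalStatus {
  if (msg.status === 'stopped' || msg.status === 'error') {
    return { isConnected: false, sessionEnded: true };
  }
  return prev; // assumption: other status values leave the state unchanged
}

// Maps hook state to the label shown in the terminal header.
function headerLabel(s: TerminalStatus): 'Ended' | 'Live' | 'Syncing' {
  if (s.sessionEnded) return 'Ended';
  return s.isConnected ? 'Live' : 'Syncing';
}
```

Checking `sessionEnded` first is what keeps the header from falling back to "Syncing" once the PTY has exited.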
Summary of Changes (Gemini Code Assist)
This pull request introduces a multi-layered connection resilience system for WebSockets, designed to improve the user experience on mobile networks. By implementing server-side heartbeats, client-side watchdogs, network-aware reconnects, and improved wake-from-sleep recovery, the system can quickly detect and recover from silent WebSocket failures caused by network disruptions such as phone sleep, Wi-Fi/cellular switches, or signal loss. The terminal then reconnects within seconds, which reduces downtime.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9824f8d01f
const heartbeatTimer = setInterval(() => {
  if (!heartbeatAlive) {
    ws.close(1001, 'Ping timeout');
    return;
Stop heartbeat timer on unauthenticated websocket exits
The heartbeat interval is started before authentication, but the unauthenticated branch returns immediately after ws.close(...) without clearing it. Because the close/error handlers that call cleanupHeartbeat() are registered later, this path leaves one live interval per rejected connection attempt, which can accumulate into avoidable memory/CPU load under repeated unauthorized probes.
useEffect(() => {
  let onlineDebounceTimer: ReturnType<typeof setTimeout> | null = null
  const handleOnline = () => {
    if (wsRef.current?.readyState === WebSocket.OPEN) return
Reconnect on online event even when socket still reports OPEN
The online handler bails out whenever readyState === OPEN, but after Wi-Fi/cellular transitions a dead TCP/WebSocket often remains OPEN in browser state for a while. In that common mobile scenario this return prevents the new immediate-reconnect path from running, so recovery falls back to heartbeat/watchdog timeouts instead of happening right after the network change.
Code Review
This pull request significantly enhances the connection resilience of CloudCode's WebSocket-based terminal. Key changes include implementing a server-side heartbeat to detect and close dead connections within 20 seconds, and a client-side ping watchdog that forces a reconnect if no server ping is received for 35 seconds. The client also now intelligently handles network changes (e.g., Wi-Fi to cellular) and page visibility changes (e.g., waking from sleep) by immediately attempting to re-establish the WebSocket connection. UI elements have been updated to reflect these new connection states, including a 'Session ended' status. The documentation has been thoroughly updated to describe these new resilience features. A minor improvement opportunity was noted in the ws.onerror handler in useTerminal.ts, where the explicit clearInterval(pingWatchdog) call is redundant as the ws.onclose handler already performs this cleanup.
ws.onerror = () => {
  clearInterval(pingWatchdog)
  ws.close()
}
The error event on a browser WebSocket is always followed by a close event. Since your onclose handler already performs all necessary cleanup, including clearing the pingWatchdog interval, calling clearInterval here is redundant. You can simplify this handler to just ws.close() and let onclose manage the cleanup to avoid duplication and improve clarity.
ws.onerror = () => {
  ws.close();
};
Two fixes to the heartbeat cleanup logic:
1. The auth failure early-return path now calls cleanupHeartbeat() before
closing the socket. Previously the ws.on('close') handler (which calls
cleanupHeartbeat) was registered after the auth check, so unauthenticated
connections leaked the setInterval forever.
2. The heartbeat interval callback now clears itself before calling ws.close()
when a ping timeout is detected, rather than relying solely on the onclose
handler to stop it. This prevents the interval from firing additional times
while the socket transitions through CLOSING state.
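Both fixes rely on the heartbeat being stoppable from any exit path. A minimal sketch, assuming a hypothetical `IntervalApi` wrapper around setInterval/clearInterval and a hypothetical `PingableSocket` surface (how the ping is encoded is not shown in this PR); `heartbeatAlive`, `cleanupHeartbeat`, and the 1001 'Ping timeout' close come from the reviewed code:

```typescript
// Hypothetical timer surface so the sketch can run without real timers;
// in production it would wrap setInterval and clearInterval.
interface IntervalApi {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

// Hypothetical minimal socket surface.
interface PingableSocket {
  sendPing(): void; // assumption: the ping encoding is left to the caller
  close(code: number, reason: string): void;
}

function startHeartbeat(ws: PingableSocket, intervals: IntervalApi, periodMs = 15000) {
  let heartbeatAlive = true;
  let handle: unknown = null;

  // Idempotent: safe to call from the auth-failure path, the timeout path,
  // and the close/error handlers.
  const cleanupHeartbeat = (): void => {
    if (handle === null) return;
    intervals.clear(handle);
    handle = null;
  };

  handle = intervals.set(() => {
    if (!heartbeatAlive) {
      cleanupHeartbeat(); // stop first so no tick fires while the socket is CLOSING
      ws.close(1001, 'Ping timeout');
      return;
    }
    heartbeatAlive = false; // must be flipped back by a pong before the next tick
    ws.sendPing();
  }, periodMs);

  return {
    cleanupHeartbeat,
    onPong: (): void => {
      heartbeatAlive = true;
    },
  };
}
```

Making `cleanupHeartbeat` idempotent means the order in which the timeout branch and the close handler run does not matter.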
alexchaomander left a comment
Review
The overall design is solid — layered heartbeat + watchdog + network events is the right approach for mobile WebSocket resilience, and the implementation is careful (stale socket guards, debouncing, proper cleanup in most paths).
One bug was found and fixed in the pushed commit (d910d41):
Bug: heartbeat timer leaked on unauthenticated connections
setInterval for the heartbeat was started before the auth check, but ws.on('close', cleanupHeartbeat) was registered after the auth failure return. Since the close handler was never registered on the unauthenticated path, the interval ran forever.
Fix: call cleanupHeartbeat() explicitly before the early return. Also took the opportunity to self-clear the interval inside the ping-timeout branch of the callback so it stops immediately rather than continuing to fire while the socket is in CLOSING state.
// Before
if (!authenticated) {
  ws.send(...);
  ws.close(1008, 'Unauthorized');
  return; // ws.on('close') never registered → timer leaks
}
// After
if (!authenticated) {
  ws.send(...);
  cleanupHeartbeat(); // ← added
  ws.close(1008, 'Unauthorized');
  return;
}
Minor nit (not blocking): ws.readyState === 1 appears in several places. The ws library exports WebSocket.OPEN — using that constant would make the intent explicit, though this is cosmetic.
Everything else looks good: the reconnectNow stale-socket guard, the ping watchdog ref-vs-local handling, the debounce on the online event, and the sessionEnded state reset on sessionId change all check out.
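For reference, the OPEN guard plus debounce on the online event can be sketched roughly as follows. `Timers` and `createOnlineHandler` are hypothetical; in the browser `timers` would wrap setTimeout and clearTimeout. The 200 ms window and the guard-before-debounce order follow the commit description and the reviewed snippet.

```typescript
// Hypothetical timer surface; a browser version would delegate to
// setTimeout and clearTimeout.
interface Timers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

// Builds the handler to register for the window 'online' event.
function createOnlineHandler(
  isSocketOpen: () => boolean,
  reconnectNow: () => void,
  timers: Timers,
  debounceMs = 200,
): () => void {
  let pending: unknown = null;
  return () => {
    if (isSocketOpen()) return; // socket reports OPEN: leave it alone
    if (pending !== null) timers.clear(pending); // collapse a burst of events
    pending = timers.set(() => {
      pending = null;
      reconnectNow();
    }, debounceMs);
  };
}
```

With this shape, a burst of online events inside the debounce window produces a single reconnectNow() call.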
Summary
Implements a multi-layered connection resilience system to detect and recover from silent WebSocket failures on mobile networks. Adds server-side heartbeat pings, client-side watchdog timeout detection, network-change event handling, and improved wake-from-sleep recovery. This ensures the terminal reconnects within seconds of network disruptions (phone sleep, Wi-Fi↔cellular switches, signal loss) rather than waiting for TCP timeout.
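The client-side watchdog mentioned above reduces to a timestamp comparison. A minimal sketch, assuming a hypothetical `createPingWatchdog` helper with an injected clock value; the 35 s threshold is the one stated in this PR, everything else is illustrative:

```typescript
// Tracks the time of the last server ping and reports when the link has
// been silent for too long. Times are in milliseconds.
function createPingWatchdog(timeoutMs: number, startedAt: number) {
  let lastPingAt = startedAt;
  return {
    // Call whenever a ping arrives from the server.
    notePing(now: number): void {
      lastPingAt = now;
    },
    // Call periodically; closes the socket once the silence reaches timeoutMs.
    check(now: number, closeSocket: () => void): boolean {
      if (now - lastPingAt < timeoutMs) return false;
      closeSocket(); // forces the normal reconnect cycle
      return true;
    },
  };
}

// Example wiring with the 35 s window from this PR.
const exampleWatchdog = createPingWatchdog(35000, 0);
exampleWatchdog.notePing(15000); // a server ping arrives at t = 15 s
```

In the hook, `check` would run from a setInterval whose handle is kept in a ref so that forced reconnects and unmount can clear it.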
Type of change
Changes
Backend (backend/src/terminal/routes.ts)
- … `ping` message every 15 seconds and closes the socket if no `pong` arrives before the next interval
Frontend (frontend/src/hooks/useTerminal.ts)
- … `online`/`offline` events and immediately reconnects on network transitions (Wi-Fi↔cellular) with 200 ms debounce to avoid burst reconnects
- … `CONNECTING` sockets and terminates them immediately
- … `sessionEnded` boolean in the hook result to distinguish between "syncing" (reconnecting) and "ended" (PTY exited) states
- … `reconnectNow()` fires while a socket handshake is in-flight
UI (frontend/src/components/Terminal.tsx)
Documentation
Testing
- … `npm run build` successfully
Checklist
- … `any`)
- … `console.log` in production code paths
https://claude.ai/code/session_01EuCbuu1DNGduvLdMP11ykX