Telemetry: instrument connection lifecycle (SSH process, WebSocket) #905

@EhabY

Part of the Telemetry Phase A rollout. See the RFC in Linear: AIGOV-154.

Events

  • ssh.process.discovered with result, measurement durationMs.
  • ssh.network.info with p2p, derp, measurements latencyMs, downloadMbits, uploadMbits. Sampled every 60 seconds or on a meaningful change (p2p flip, derp flip, or >10% latency swing), whichever comes first. The emit cadence is decoupled from the 3-second networkPollInterval in sshProcess.ts.
  • ssh.process.lost with cause, measurement uptimeMs.
  • ssh.process.recovered with measurement recoveryDurationMs.
  • connection.open with url, measurement connectDurationMs.
  • connection.drop with cause, closeCode, measurement connectionDurationMs. On unexpected close, include an error block.
  • connection.reconnect with result, measurements attempts, totalDurationMs.
  • connection.state_transition with from, to, reason. One event per transition in the 6-state machine: IDLE, CONNECTING, CONNECTED, AWAITING_RETRY, DISCONNECTED, DISPOSED.
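
The sampling rule for ssh.network.info could be sketched as follows. This is a minimal illustration, not the planned implementation: the NetworkInfo shape, the NetworkInfoSampler class, and the choice to compare latency against the last emitted reading (rather than the previous poll) are all assumptions made for the example.

```typescript
// Hypothetical shape of one polled reading; field names follow the event above.
interface NetworkInfo {
  p2p: boolean;
  derp: string;
  latencyMs: number;
  downloadMbits: number;
  uploadMbits: number;
}

const SAMPLE_INTERVAL_MS = 60_000; // 60-second cadence
const LATENCY_SWING_RATIO = 0.1; // ">10% latency swing"

// Hypothetical helper: decides on each poll whether an event should be emitted.
class NetworkInfoSampler {
  private last: { info: NetworkInfo; atMs: number } | undefined;
  private readonly emit: (info: NetworkInfo) => void;

  constructor(emit: (info: NetworkInfo) => void) {
    this.emit = emit;
  }

  // Call once per poll. Returns true when the reading was emitted.
  observe(info: NetworkInfo, nowMs: number): boolean {
    const last = this.last;
    if (last !== undefined && !this.shouldEmit(last, info, nowMs)) {
      return false;
    }
    this.last = { info, atMs: nowMs };
    this.emit(info);
    return true;
  }

  private shouldEmit(
    last: { info: NetworkInfo; atMs: number },
    info: NetworkInfo,
    nowMs: number,
  ): boolean {
    if (nowMs - last.atMs >= SAMPLE_INTERVAL_MS) return true; // interval elapsed
    if (info.p2p !== last.info.p2p) return true; // p2p flip
    if (info.derp !== last.info.derp) return true; // derp flip
    // Assumption: the swing is measured relative to the last emitted latency.
    const base = last.info.latencyMs;
    if (base === 0) return info.latencyMs !== 0;
    return Math.abs(info.latencyMs - base) / base > LATENCY_SWING_RATIO;
  }
}
```

Passing the timestamp in as an argument keeps the sampler independent of the poll timer, so the 3-second poll can call observe on every read while tests drive time explicitly.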

Sites

  • src/remote/sshProcess.ts: discovery, loss, and recovery paths. Add the network-info sampler, which reads the existing polled file but emits on the 60s-or-on-change cadence.
  • src/websocket/reconnectingWebSocket.ts: state reducer and handlers.
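
As a rough sketch of how the state reducer site might emit connection.state_transition, the following wraps state changes in a small tracker. ConnectionStateTracker and its emit callback are hypothetical names; the real reducer in reconnectingWebSocket.ts may be structured quite differently. The only rule taken from this issue is one event per transition across the six listed states.

```typescript
type ConnectionState =
  | "IDLE"
  | "CONNECTING"
  | "CONNECTED"
  | "AWAITING_RETRY"
  | "DISCONNECTED"
  | "DISPOSED";

interface StateTransitionEvent {
  name: "connection.state_transition";
  from: ConnectionState;
  to: ConnectionState;
  reason: string;
}

// Hypothetical wrapper: owns the current state and reports each change.
class ConnectionStateTracker {
  private state: ConnectionState = "IDLE";
  private readonly emit: (event: StateTransitionEvent) => void;

  constructor(emit: (event: StateTransitionEvent) => void) {
    this.emit = emit;
  }

  get current(): ConnectionState {
    return this.state;
  }

  // Emits exactly one event when the state actually changes.
  // Assumption: a no-op "transition" to the same state is not reported.
  transition(to: ConnectionState, reason: string): boolean {
    const from = this.state;
    if (from === to) return false;
    this.state = to;
    this.emit({ name: "connection.state_transition", from, to, reason });
    return true;
  }
}
```

Routing every state write through a single method is one way to guarantee that no transition can happen without its event.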

Tests

  • Per-event assertions via TestSink.
  • Network info sampling emits on first read, after 60 seconds, and on a p2p flip inside the 60s window. It does not emit on every 3-second poll.
  • State transition events fire on every reducer change.
  • Reconnect cycle aggregates attempts and totalDurationMs correctly.
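
The reconnect aggregation that the last test covers might look something like the sketch below. ReconnectCycle is a hypothetical helper, and the "success" / "failure" values for result are assumptions; this issue does not specify them. The sketch counts attempts and measures totalDurationMs from the first attempt of a cycle to its outcome.

```typescript
interface ReconnectEvent {
  name: "connection.reconnect";
  result: "success" | "failure"; // assumed values, not specified in the issue
  attempts: number;
  totalDurationMs: number;
}

// Hypothetical accumulator for one reconnect cycle.
class ReconnectCycle {
  private startedAtMs: number | undefined;
  private attempts = 0;

  // Record one reconnect attempt; the first attempt starts the clock.
  recordAttempt(nowMs: number): void {
    if (this.startedAtMs === undefined) {
      this.startedAtMs = nowMs;
    }
    this.attempts += 1;
  }

  // Close the cycle and return the aggregated event, or undefined
  // when no attempt was recorded. Resets so the next cycle starts clean.
  finish(result: "success" | "failure", nowMs: number): ReconnectEvent | undefined {
    const startedAtMs = this.startedAtMs;
    if (startedAtMs === undefined) return undefined;
    const event: ReconnectEvent = {
      name: "connection.reconnect",
      result,
      attempts: this.attempts,
      totalDurationMs: nowMs - startedAtMs,
    };
    this.startedAtMs = undefined;
    this.attempts = 0;
    return event;
  }
}
```

A test along these lines would record several attempts at known timestamps, call finish, and assert on the returned attempts and totalDurationMs before checking that a second cycle starts from zero.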

Depends on AIGOV-243.
