Telemetry: coder.exportTelemetry command (Logs, Traces, and Metrics OTLP/JSON) #903

@EhabY

Part of the Telemetry Phase A rollout. See the RFC in Linear: AIGOV-154.

Scope

Register the coder.exportTelemetry command via the dispatcher from ENG-2458. The command exports stored JSONL events as OTLP/JSON covering all three OTel signal types (Logs, Traces, Metrics) so the output is natively understood by Honeycomb, Datadog, Tempo+Loki, New Relic, Splunk, Elastic APM, and any other OTel-aware backend.

Inputs

  • Date range selection via QuickPick with presets ("Last 24 hours", "Last 7 days", "Last 30 days", "All time", "Custom range…"). Custom range falls back to two InputBox prompts with validateInput for YYYY-MM-DD dates.
  • File filtering by filename only (daily files are date-stamped). No file contents are opened outside the selected range, so "export last 3 days" stays fast regardless of total storage size.
  • Format selection via a second QuickPick:
    • JSON array (default): single JSON document for human inspection or compliance review.
    • OTLP/JSON: three OTLP envelopes (resourceLogs, resourceSpans, resourceMetrics), packaged so each can be POSTed unchanged to its corresponding OTel Collector endpoint (/v1/logs, /v1/traces, /v1/metrics).
  • vscode.window.showSaveDialog with suggested filename coder-telemetry-{range}.{ext} and a filter matching the chosen format.
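The date validation and filename-only filter described above could be sketched as follows. The file naming pattern (any name containing a YYYY-MM-DD stamp) and the function names parseDay, validateDayInput, and filesInRange are assumptions for illustration, not part of the plan. validateDayInput returns a message or undefined, which is the shape an InputBox validateInput callback expects.

```typescript
// Hypothetical naming: any file whose name contains a YYYY-MM-DD stamp,
// for example "telemetry-2025-01-15.jsonl".
const DATE_STAMP = /(\d{4}-\d{2}-\d{2})/;

/** Returns the input if it is a real calendar date in YYYY-MM-DD form, else undefined. */
function parseDay(input: string): string | undefined {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(input);
  if (!m) return undefined;
  const y = Number(m[1]);
  const mo = Number(m[2]);
  const d = Number(m[3]);
  // Round-trip through Date to reject impossible dates such as 2025-02-30.
  const date = new Date(Date.UTC(y, mo - 1, d));
  const valid =
    date.getUTCFullYear() === y &&
    date.getUTCMonth() === mo - 1 &&
    date.getUTCDate() === d;
  return valid ? input : undefined;
}

/** Shaped like a validateInput callback: an error message, or undefined when valid. */
function validateDayInput(value: string): string | undefined {
  return parseDay(value) === undefined ? "Enter a date as YYYY-MM-DD" : undefined;
}

/** Selects files by name only; no file is opened to make this decision. */
function filesInRange(names: string[], fromDay: string, toDay: string): string[] {
  return names
    .filter((name) => {
      const m = DATE_STAMP.exec(name);
      const day = m ? parseDay(m[1]) : undefined;
      // ISO dates sort lexicographically, so plain string comparison is enough.
      return day !== undefined && day >= fromDay && day <= toDay;
    })
    .sort();
}
```

Because the comparison is on date strings taken from filenames, the cost scales with the number of files in the directory rather than with their size.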

OTLP routing

Events are routed to one of three envelopes based on shape:

Logs (resourceLogs) — events with no traceId: log, logError.

  • Each event becomes one LogRecord. Resource attributes are lifted from context; properties and measurements are mapped to attributes. The body is the eventName. Severity is inferred from the presence of error (info vs error).
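A minimal sketch of the LogRecord mapping, under stated assumptions: the stored event shape (LogEvent) is hypothetical, timestamp is assumed to be an ISO 8601 string, properties are assumed to be strings and measurements numbers. OTLP/JSON uses lowerCamelCase keys and encodes enum values as integers, so severity appears as severityNumber 9 (INFO) or 17 (ERROR). Resource attributes from context are left out here because they live on the enclosing resource, not on the record.

```typescript
// Hypothetical stored-event shape; field names follow the issue text.
interface LogEvent {
  eventName: string;
  timestamp: string; // assumed ISO 8601
  properties?: { [key: string]: string };
  measurements?: { [key: string]: number };
  error?: { type?: string; message?: string; code?: string };
}

type LogAttrValue = { stringValue: string } | { doubleValue: number };
interface LogAttr { key: string; value: LogAttrValue }

interface OtlpLogRecord {
  timeUnixNano: string;
  severityNumber: number;
  severityText: string;
  body: { stringValue: string };
  attributes: LogAttr[];
}

// Milliseconds since epoch to a nanosecond decimal string (64-bit values are strings in OTLP/JSON).
function isoToNanos(iso: string): string {
  return `${Date.parse(iso)}000000`;
}

function toLogRecord(event: LogEvent): OtlpLogRecord {
  const isError = event.error !== undefined;
  const attributes: LogAttr[] = [];
  const props = event.properties ?? {};
  for (const key of Object.keys(props)) {
    attributes.push({ key, value: { stringValue: props[key] } });
  }
  const meas = event.measurements ?? {};
  for (const key of Object.keys(meas)) {
    attributes.push({ key, value: { doubleValue: meas[key] } });
  }
  return {
    timeUnixNano: isoToNanos(event.timestamp),
    severityNumber: isError ? 17 : 9, // SEVERITY_NUMBER_ERROR : SEVERITY_NUMBER_INFO
    severityText: isError ? "ERROR" : "INFO",
    body: { stringValue: event.eventName },
    attributes,
  };
}
```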

Traces (resourceSpans) — events with traceId: time, trace parent, every Span.phase descendant.

  • eventId → span_id. traceId → trace_id. parentEventId → parent_span_id (omitted on root spans).
  • eventName.split('.').pop() → span name. The full hierarchical eventName is preserved as an attribute (coder.event_name) so prefix queries still work in span backends.
  • timestamp + measurements.durationMs → start_time_unix_nano / end_time_unix_nano.
  • properties.result → status.code (STATUS_CODE_OK for success, STATUS_CODE_ERROR for error).
  • error block → status.message and a span event with exception.* attributes (exception.type, exception.message, exception.code).
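The span mapping above can be sketched as below. Several details are assumptions rather than decisions from this issue: the TimedEvent shape is hypothetical, timestamp is taken to be an ISO 8601 string marking the span start, eventId and traceId are taken to already be valid hex ids, a missing or unrecognized result falls back to STATUS_CODE_UNSET (0), and the exception event is stamped with the span end time. The field names listed above are the protobuf names; OTLP/JSON keys are their lowerCamelCase forms (spanId, parentSpanId, startTimeUnixNano).

```typescript
// Hypothetical stored-event shape; field names follow the issue text.
interface TimedEvent {
  eventId: string; // assumed to already be a 16-hex-char span id
  traceId: string; // assumed to already be a 32-hex-char trace id
  parentEventId?: string;
  eventName: string;
  timestamp: string; // assumed ISO 8601, marking the span START
  measurements: { durationMs: number };
  properties?: { result?: string };
  error?: { type?: string; message?: string; code?: string };
}

interface SpanAttr { key: string; value: { stringValue: string } }

interface OtlpSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes: SpanAttr[];
  status: { code: number; message?: string };
  events: { name: string; timeUnixNano: string; attributes: SpanAttr[] }[];
}

// Whole milliseconds to a nanosecond decimal string; sub-millisecond precision is dropped.
const msToNanos = (ms: number): string => `${Math.round(ms)}000000`;

const strAttr = (key: string, value: string): SpanAttr => ({ key, value: { stringValue: value } });

function toSpan(e: TimedEvent): OtlpSpan {
  const startMs = Date.parse(e.timestamp);
  const endNanos = msToNanos(startMs + e.measurements.durationMs);
  const result = e.properties?.result;
  const span: OtlpSpan = {
    traceId: e.traceId,
    spanId: e.eventId,
    // Last dot-separated segment becomes the span name.
    name: e.eventName.split(".").pop() || e.eventName,
    startTimeUnixNano: msToNanos(startMs),
    endTimeUnixNano: endNanos,
    // Full hierarchical name kept so prefix queries still work.
    attributes: [strAttr("coder.event_name", e.eventName)],
    // 1 = STATUS_CODE_OK, 2 = STATUS_CODE_ERROR, 0 = STATUS_CODE_UNSET (assumed fallback).
    status: { code: result === "success" ? 1 : result === "error" ? 2 : 0 },
    events: [],
  };
  if (e.parentEventId) span.parentSpanId = e.parentEventId; // omitted on root spans
  if (e.error) {
    if (e.error.message) span.status.message = e.error.message;
    const attrs: SpanAttr[] = [];
    if (e.error.type) attrs.push(strAttr("exception.type", e.error.type));
    if (e.error.message) attrs.push(strAttr("exception.message", e.error.message));
    if (e.error.code) attrs.push(strAttr("exception.code", e.error.code));
    span.events.push({ name: "exception", timeUnixNano: endNanos, attributes: attrs });
  }
  return span;
}
```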

Metrics (resourceMetrics) — pre-aggregated events.

The current Phase A instrumentation plan has two metric-shaped event families:

  • http.requests (per-window rollup): count_2xx/count_3xx/count_4xx/count_5xx/count_network_error emit as Sum data points (delta temporality, monotonic). avg_duration_ms / p95_duration_ms emit as Gauge. Window timestamps drive the data point time range.
  • ssh.network.info (sampled gauge): latencyMs, downloadMbits, uploadMbits emit as Gauge data points.
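As an illustration of the http.requests mapping, the sketch below turns one rollup event into Sum and Gauge metrics. The windowStart and windowEnd field names, the HttpRollupEvent shape, and the metric naming scheme (http.requests.<key>) are hypothetical; only the measurement keys come from the list above. In OTLP/JSON, delta temporality is the integer 1 and asInt is a decimal string.

```typescript
// Hypothetical rollup shape; only the measurement keys are taken from the issue.
interface HttpRollupEvent {
  eventName: "http.requests";
  windowStart: string; // hypothetical field name, ISO 8601
  windowEnd: string; // hypothetical field name, ISO 8601
  measurements: { [key: string]: number | undefined };
}

interface SumPoint { startTimeUnixNano: string; timeUnixNano: string; asInt: string }
interface GaugePoint { timeUnixNano: string; asDouble: number }
interface OtlpMetric {
  name: string;
  sum?: { aggregationTemporality: number; isMonotonic: boolean; dataPoints: SumPoint[] };
  gauge?: { dataPoints: GaugePoint[] };
}

const COUNT_KEYS = ["count_2xx", "count_3xx", "count_4xx", "count_5xx", "count_network_error"];
const GAUGE_KEYS = ["avg_duration_ms", "p95_duration_ms"];
const AGGREGATION_TEMPORALITY_DELTA = 1;

const windowNanos = (iso: string): string => `${Date.parse(iso)}000000`;

function httpRollupToMetrics(e: HttpRollupEvent): OtlpMetric[] {
  const start = windowNanos(e.windowStart);
  const end = windowNanos(e.windowEnd);
  const metrics: OtlpMetric[] = [];
  for (const key of COUNT_KEYS) {
    const value = e.measurements[key];
    if (value === undefined) continue; // skip counters absent from this window
    metrics.push({
      name: `http.requests.${key}`, // hypothetical naming scheme
      sum: {
        aggregationTemporality: AGGREGATION_TEMPORALITY_DELTA,
        isMonotonic: true,
        // The window bounds drive the data point time range.
        dataPoints: [{ startTimeUnixNano: start, timeUnixNano: end, asInt: String(Math.round(value)) }],
      },
    });
  }
  for (const key of GAUGE_KEYS) {
    const value = e.measurements[key];
    if (value === undefined) continue;
    metrics.push({
      name: `http.requests.${key}`,
      gauge: { dataPoints: [{ timeUnixNano: end, asDouble: value }] },
    });
  }
  return metrics;
}
```

The ssh.network.info family would follow the Gauge branch only, with latencyMs, downloadMbits, and uploadMbits as keys.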

Routing is driven by a small list of metric-eligible event names rather than a runtime flag, since these events are pre-aggregated by definition.
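A small sketch of that routing decision, assuming the metric-name check runs before the traceId check (the order is not specified here). The names METRIC_EVENT_NAMES and routeEvent are illustrative.

```typescript
type Envelope = "resourceLogs" | "resourceSpans" | "resourceMetrics";

// The two metric-shaped event families named in the Phase A plan.
const METRIC_EVENT_NAMES: ReadonlySet<string> = new Set(["http.requests", "ssh.network.info"]);

function routeEvent(event: { eventName: string; traceId?: string }): Envelope {
  // Assumption: metric-eligible names win even if the event carries a traceId.
  if (METRIC_EVENT_NAMES.has(event.eventName)) return "resourceMetrics";
  // Any remaining event with a traceId is a span; everything else is a log.
  return event.traceId ? "resourceSpans" : "resourceLogs";
}
```

Keeping the list as data means adding a third metric family is a one-line change with no new flag on stored events.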

Other events' durationMs is NOT derived into Histograms by the export command. Users who want RED metrics (rate / errors / duration) per span name pipe the traces output through the OTel Collector's spanmetrics connector, which produces traces.span.metrics.calls (Sum) and traces.span.metrics.duration (Histogram) keyed by span name + status.code and configurable additional dimensions. The connector is more accurate and more configurable than ad-hoc derivation in our export command.

Streaming

Memory usage stays proportional to a single line via readline.createInterface + createReadStream. The OTLP writer dispatches each event to its envelope based on traceId presence and the metric-eligible event-name list. JSON array wraps with [\n and \n]; OTLP/JSON streams entries directly into the appropriate envelope's array.
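The JSON-array side of this can be sketched as a writer that holds only a counter, so memory stays bounded by the current line. JsonArrayWriter and its sink callback are hypothetical names; in the extension the lines would presumably come from iterating a readline interface over createReadStream, and the sink would write to the output file.

```typescript
/** Wraps JSONL lines into one JSON array without buffering them. */
class JsonArrayWriter {
  private count = 0;
  private readonly sink: (chunk: string) => void;

  constructor(sink: (chunk: string) => void) {
    this.sink = sink;
  }

  /** Emits the array header. */
  begin(): void {
    this.sink("[\n");
  }

  /** Emits one stored JSONL line, prefixing a separator for every entry after the first. */
  push(line: string): void {
    const trimmed = line.trim();
    if (trimmed === "") return; // tolerate blank lines in the source file
    this.sink(this.count === 0 ? trimmed : ",\n" + trimmed);
    this.count++;
  }

  /** Emits the array footer. */
  end(): void {
    this.sink("\n]");
  }
}

// Usage sketch with an in-memory sink; a real caller would call push()
// once per line read from the selected daily files.
const outputChunks: string[] = [];
const arrayWriter = new JsonArrayWriter((chunk) => {
  outputChunks.push(chunk);
});
arrayWriter.begin();
for (const line of ['{"eventName":"a"}', "", '{"eventName":"b"}']) {
  arrayWriter.push(line);
}
arrayWriter.end();
const exportedDocument = outputChunks.join("");
```

Each line is passed through as-is rather than parsed and re-serialized, which keeps the writer cheap but means a malformed stored line would produce an invalid document; validating per line is a possible trade-off.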

Tests

  • Unit tests for date range parsing and the filename-based file filter.
  • JSON-array writer (header, comma-separated lines, footer).
  • OTLP/JSON writer:
    • LogRecord shape for events without traceId. Severity inference.
    • Span shape for events with traceId. parent_span_id derivation. Composed eventName preserved as attribute. Status code and exception event mapping.
    • Metric data points for http.requests (Sums + Gauges) and ssh.network.info (Gauges). Timestamp ranges. Resource attributes from context.
  • Memfs-backed integration test that seeds JSONL files (mix of log/error/timed/phase/metric-shaped), exports a range, and verifies all three OTLP envelopes.

Out of scope

  • Live OTLP push from the extension. The export is batch-only by design. A live exporter (push to a Coder server's collector or directly to an OTLP endpoint) is a Phase C concern.
  • span.log() / span.logError() for log-in-trace correlation: tracked separately in #925. The exporter must coordinate with that work on the routing discriminator (whether traceId presence or durationMs presence is used to split LogRecord vs Span at export time).
  • Prometheus endpoint. Prometheus cannot ingest event files directly. Users who want Prometheus feed the OTLP/JSON output into an OTel Collector and let it forward via prometheusremotewrite.

Depends on #900, #901, ENG-2458.
