Allow summaries to include string data values #49

Open
derrickstolee wants to merge 4 commits into git-ecosystem:main from derrickstolee:trace-data-strings

Conversation

Collaborator

derrickstolee commented Apr 20, 2026

In #44, we created a custom model for generating performance summaries based on region timings and printf events.

This change introduces a new customization to allow promoting trace2 string data messages into the output telemetry event.

My motivation is that the microsoft/git fork includes a git-gvfs-helper tool that emits string data messages for certain kinds of failures. Those messages are not being elevated to the telemetry, so we don't know how often they are happening or whether they will go away when we make certain server-side changes.

This is accomplished by adding a new string_patterns list to the summary config. For example, I expect to use this:

string_patterns:
  # Errors during gvfs protocol
  - category: "gvfs-helper"
    key_prefix: "error/"
    field_name: "gvfs_helper_errors"

This will take the error strings whose category matches exactly and whose key matches by prefix, and put them into a JSON array as follows:

"gvfs_helper_errors":["(curl:35) SSL connect error [hard_fail]"]

This is a real example that I was able to trigger in my own testing, along with other non-error strings, using an internal telemetry service that consumes git-ecosystem/trace2receiver via a fork of git-ecosystem/sample-trace2-otel-collector.

I used agentic coding to produce these results, including the substantial test cases.
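The (category, key-prefix) matching and array accumulation described above can be sketched in Go as follows. This is a minimal illustration, not the receiver's actual implementation; the type and function names (`StringPattern`, `promote`) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// StringPattern is a hypothetical stand-in for one string_patterns rule.
type StringPattern struct {
	Category  string // must match the data event's category exactly
	KeyPrefix string // matches the data event's key by prefix
	FieldName string // the summary field the value is appended to
}

// promote appends the value of a matching data event to the summary
// field named by the rule. Values accumulate as an array.
func promote(rules []StringPattern, category, key, value string, summary map[string][]string) {
	for _, r := range rules {
		if category == r.Category && strings.HasPrefix(key, r.KeyPrefix) {
			summary[r.FieldName] = append(summary[r.FieldName], value)
		}
	}
}

func main() {
	rules := []StringPattern{{
		Category:  "gvfs-helper",
		KeyPrefix: "error/",
		FieldName: "gvfs_helper_errors",
	}}
	summary := map[string][]string{}
	promote(rules, "gvfs-helper", "error/curl",
		"(curl:35) SSL connect error [hard_fail]", summary)
	out, _ := json.Marshal(summary)
	fmt.Println(string(out))
	// → {"gvfs_helper_errors":["(curl:35) SSL connect error [hard_fail]"]}
}
```

A non-matching category (or a key that does not start with the configured prefix) simply leaves the summary untouched.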

Review comment on Docs/Examples/summary_example.yml, lines +34 to +42 (outdated):
# Data pattern rules - capture values from data events matching
# a (category, key prefix) pair. Matched values are promoted into
# the summary regardless of detail level or nesting.
data_patterns:
# Capture curl/http error details from gvfs-helper
- category: "gvfs-helper"
key_prefix: "error/"
field_name: "gvfs_helper_errors"

Contributor


So data_patterns log the actual values from the data events, whereas the message_patterns just log a count.

Something about the difference in summarisation behaviour between two similarly named things makes me a little uneasy from a 'permanent API' perspective.

message_patterns aggregates, but data_patterns collects.

Collaborator Author


How do you feel about something like string_patterns to signal "match which strings to escalate up the stack" or do you feel we need to move these patterns out of the "summary" config?

Collaborator Author


(I'm pushing a version that includes that model, just in case.)

derrickstolee and others added 4 commits April 22, 2026 14:59
Git's trace2 "data" events carry structured (category, key, value)
tuples that can contain important diagnostic information, such as
curl/SSL error details from gvfs-helper. Previously, these values
were only included in the OTLP output at higher detail levels:
"dl:process" for process-level data and "dl:verbose" for
region-level data. At the default "dl:summary" level, they were
silently dropped.

This is a problem for deployments that want lightweight telemetry
(summary level) but still need visibility into specific error
conditions reported via data events.

Add a new "string_patterns" rule type to the summary configuration,
alongside the existing "message_patterns" and "region_timers".
Each rule matches data events by exact category and key prefix,
and captures the event's actual value (not just a count) into the
summary. This is emitted at all detail levels via the existing
"trace2.process.summary" span attribute.

Key design decisions:

 - Values are always emitted as arrays for schema stability, even
   when only a single event matches. This avoids downstream parsers
   having to handle both scalar and array types for the same field.

 - Summary promotion runs before region attachment in
   apply__data_generic(), so matching is not skipped when region
   stack lookup fails (e.g. orphaned data events at nesting > 1).

 - Validation enforces that field names are unique across all three
   rule types (message_patterns, region_timers, string_patterns).
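The uniqueness validation in the last point can be sketched as a simple config-load-time check. This is an illustrative sketch only; `validateFieldNames` is a hypothetical name, not the receiver's actual function.

```go
package main

import "fmt"

// validateFieldNames rejects duplicate field names collected across
// message_patterns, region_timers, and string_patterns, since all
// three rule types write into the same summary JSON object.
func validateFieldNames(names []string) error {
	seen := map[string]bool{}
	for _, n := range names {
		if seen[n] {
			return fmt.Errorf("duplicate summary field name: %q", n)
		}
		seen[n] = true
	}
	return nil
}

func main() {
	// Two rules sharing "gvfs_helper_errors" would be rejected
	// before any events are processed.
	err := validateFieldNames([]string{
		"checkout_ms", "gvfs_helper_errors", "gvfs_helper_errors",
	})
	fmt.Println(err)
}
```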

Example configuration:

  string_patterns:
    - category: "gvfs-helper"
      key_prefix: "error/"
      field_name: "gvfs_helper_errors"

For a data event like:

  {"event":"data", "category":"gvfs-helper",
   "key":"error/curl",
   "value":"(curl:35) SSL connect error [hard_fail]"}

The summary output will contain:

  {"gvfs_helper_errors":
    ["(curl:35) SSL connect error [hard_fail]"]}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the "summary" config key to the collector configuration docs
so that users can discover and configure summary metric tracking.
The summary feature was previously undocumented in the config
reference, making it difficult for users to learn about it without
reading the source code.

Update the example summary YAML to include a string_patterns section
demonstrating how to capture gvfs-helper curl error values, and
update the example output to show the array format used by
string_patterns fields.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The region stack bounds check in apply__data_generic() used a strict
less-than comparison (`len(regionStack) < rWant`) which allowed
an out-of-bounds access when rWant exactly equalled the stack length.
For example, a data event with nesting=2 and an empty region stack
would compute rWant=0, pass the `0 < 0` check, then panic on
`regionStack[0]`.

In a real service, this meant that data events nested inside
regions that weren't tracked (common with gvfs-helper errors) would
crash the receiver rather than being silently skipped. This also
prevented the string_patterns summary promotion from working
end-to-end, since the crash occurred after the promotion call.

Fix the check to use `<=` and also guard against negative rWant.
Add an end-to-end test that feeds the exact trace.json event
through parse and apply to verify the full path works.
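The corrected guard can be sketched as below. This is a simplified illustration, not the receiver's code: `attachRegion` is a hypothetical name, and the `rWant = nesting - 2` computation is inferred from the commit message's example (nesting=2 with an empty stack yielding rWant=0).

```go
package main

import "fmt"

// attachRegion looks up the region a data event belongs to. The
// original guard, `len(regionStack) < rWant`, passed when rWant was
// exactly len(regionStack) and then panicked on regionStack[rWant];
// the fixed guard also rejects negative rWant.
func attachRegion(regionStack []string, nesting int) (string, bool) {
	rWant := nesting - 2
	if rWant < 0 || len(regionStack) <= rWant {
		return "", false // orphaned event: skip attachment silently
	}
	return regionStack[rWant], true
}

func main() {
	// nesting=2 with an empty region stack no longer panics:
	if _, ok := attachRegion(nil, 2); !ok {
		fmt.Println("orphaned data event skipped")
	}
	if r, ok := attachRegion([]string{"checkout"}, 2); ok {
		fmt.Println("attached to region:", r)
	}
}
```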

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add comprehensive end-to-end tests that exercise the complete path:
raw trace2 JSON events -> parse -> apply -> prepareDataset ->
ToTraces -> extract OTLP span attributes -> verify summary JSON.

These tests are designed to be adversarial and expose any gaps in
the data flow. They cover:

 - Process-level data events (nesting=1) promoted at dl:summary
 - Region-level data events (nesting=2) with a proper region stack:
   value appears in BOTH the summary and the region's own data
 - Orphaned nested data events (nesting=2 with no region stack):
   previously would have crashed due to the off-by-one bounds bug;
   now the value is promoted and region attachment fails silently
 - Multiple data events accumulating in the same field as an array
 - Non-matching events excluded (wrong category, wrong key prefix)
 - Integer values from data events (not just strings)
 - Coexistence with message_patterns and region_timers in one summary
 - Promoted values visible at ALL detail levels (summary/process/verbose)
 - No summary config at all (nil) doesn't crash
 - No string_patterns configured doesn't create spurious entries
 - Data events from non-main threads
 - Deeply nested events (nesting=5) with only a partial region stack
 - Multiple rules matching different categories independently

Also replaces the earlier single-event e2e test with a richer
version using the test harness infrastructure (x_make_* helpers,
load_test_dataset_with_config, extractSummaryJSON).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>