Allow summaries to include string data values#49
Open
derrickstolee wants to merge 4 commits intogit-ecosystem:mainfrom
Open
Allow summaries to include string data values#49derrickstolee wants to merge 4 commits intogit-ecosystem:mainfrom
derrickstolee wants to merge 4 commits intogit-ecosystem:mainfrom
Conversation
mjcheetham
reviewed
Apr 22, 2026
Comment on lines
+34
to
+42
| # Data pattern rules - capture values from data events matching | ||
| # a (category, key prefix) pair. Matched values are promoted into | ||
| # the summary regardless of detail level or nesting. | ||
| data_patterns: | ||
| # Capture curl/http error details from gvfs-helper | ||
| - category: "gvfs-helper" | ||
| key_prefix: "error/" | ||
| field_name: "gvfs_helper_errors" | ||
|
|
Contributor
There was a problem hiding this comment.
So data_patterns log the actual objects from the event_data events, whereas the message_patterns just log a count.
Something about the difference in summarisation behaviour between two similarly named things makes me feel a little uneasy from a 'permanent API' thing.
message_patterns aggregates, but data_patterns collects.
Collaborator
Author
There was a problem hiding this comment.
How do you feel about something like string_patterns to signal "match which strings to escalate up the stack" or do you feel we need to move these patterns out of the "summary" config?
Collaborator
Author
There was a problem hiding this comment.
(I'm pushing a version that includes that model, just in case.)
Git's trace2 "data" events carry structured (category, key, value)
tuples that can contain important diagnostic information, such as
curl/SSL error details from gvfs-helper. Previously, these values
were only included in the OTLP output at higher detail levels:
"dl:process" for process-level data and "dl:verbose" for
region-level data. At the default "dl:summary" level, they were
silently dropped.
This is a problem for deployments that want lightweight telemetry
(summary level) but still need visibility into specific error
conditions reported via data events.
Add a new "string_patterns" rule type to the summary configuration,
alongside the existing "message_patterns" and "region_timers".
Each rule matches data events by exact category and key prefix,
and captures the event's actual value (not just a count) into the
summary. This is emitted at all detail levels via the existing
"trace2.process.summary" span attribute.
Key design decisions:
- Values are always emitted as arrays for schema stability, even
when only a single event matches. This avoids downstream parsers
having to handle both scalar and array types for the same field.
- Summary promotion runs before region attachment in
apply__data_generic(), so matching is not skipped when region
stack lookup fails (e.g. orphaned data events at nesting > 1).
- Validation enforces that field names are unique across all three
rule types (message_patterns, region_timers, string_patterns).
Example configuration:
string_patterns:
- category: "gvfs-helper"
key_prefix: "error/"
field_name: "gvfs_helper_errors"
For a data event like:
{"event":"data", "category":"gvfs-helper",
"key":"error/curl",
"value":"(curl:35) SSL connect error [hard_fail]"}
The summary output will contain:
{"gvfs_helper_errors":
["(curl:35) SSL connect error [hard_fail]"]}
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the "summary" config key to the collector configuration docs so that users can discover and configure summary metric tracking. The summary feature was previously undocumented in the config reference, making it difficult for users to learn about it without reading the source code. Update the example summary YAML to include a string_patterns section demonstrating how to capture gvfs-helper curl error values, and update the example output to show the array format used by string_patterns fields. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The region stack bounds check in apply__data_generic() used a strict less-than comparison (`len(regionStack) < rWant`) which allowed an out-of-bounds access when rWant exactly equalled the stack length. For example, a data event with nesting=2 and an empty region stack would compute rWant=0, pass the `0 < 0` check, then panic on `regionStack[0]`. In a real service, this meant that data events nested inside regions that weren't tracked (common with gvfs-helper errors) would crash the receiver rather than being silently skipped. This also prevented the string_patterns summary promotion from working end-to-end, since the crash occurred after the promotion call. Fix the check to use `<=` and also guard against negative rWant. Add an end-to-end test that feeds the exact trace.json event through parse and apply to verify the full path works. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add comprehensive end-to-end tests that exercise the complete path: raw trace2 JSON events -> parse -> apply -> prepareDataset -> ToTraces -> extract OTLP span attributes -> verify summary JSON. These tests are designed to be adversarial and expose any gaps in the data flow. They cover: - Process-level data events (nesting=1) promoted at dl:summary - Region-level data events (nesting=2) with a proper region stack: value appears in BOTH the summary and the region's own data - Orphaned nested data events (nesting=2 with no region stack): previously would have crashed due to the off-by-one bounds bug; now the value is promoted and region attachment fails silently - Multiple data events accumulating in the same field as an array - Non-matching events excluded (wrong category, wrong key prefix) - Integer values from data events (not just strings) - Coexistence with message_patterns and region_timers in one summary - Promoted values visible at ALL detail levels (summary/process/verbose) - No summary config at all (nil) doesn't crash - No string_patterns configured doesn't create spurious entries - Data events from non-main threads - Deeply nested events (nesting=5) with only a partial region stack - Multiple rules matching different categories independently Also replaces the earlier single-event e2e test with a richer version using the test harness infrastructure (x_make_* helpers, load_test_dataset_with_config, extractSummaryJSON). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6dd6c7e to
12d81b1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In #44, we created a custom model for generating performance summaries based on region timings and printf events.
This change introduces a new customization to allow promoting trace2 string data messages into the output telemetry event.
My motivation is that the
microsoft/gitfork includes agit-gvfs-helpertool that triggers certain string data failures when certain kinds of errors occur. Those messages are not being elevated to the telemetry, so we don't know how often they are happening or if they will go away when we make certain server-side changes.This is accomplished by adding a new
string_patternslist to the summary config. For example, I expect to use this:This will take the errors sent that match the category exactly and the key by prefix and put them into a JSON array as follows:
This is a real example that I was able to trigger in my own testing along with other non-error strings using an internal telemetry service that consumes the
git-ecosystem/trace2recievervia a fork of thegit-ecosystem/sample-trace2-otel-collector.I used agentic coding to produce these results, including the substantial test cases.