feat(feature-flagging): FFE APM feature-flag span enrichment (experimental, gated)#11658
Draft
leoromanovsky wants to merge 5 commits into
Draft
feat(feature-flagging): FFE APM feature-flag span enrichment (experimental, gated)#11658leoromanovsky wants to merge 5 commits into
leoromanovsky wants to merge 5 commits into
Conversation
…nrichment gate - Add nullable Integer serialId to UFC Split model + constructor (Moshi reflectively deserializes the serialId JSON field; no custom adapter needed) - DDEvaluator surfaces __dd_split_serial_id (when present) and __dd_do_log in OpenFeature eval metadata for APM span enrichment (JAVA-01) - Add FeatureFlaggingConfig.SPAN_ENRICHMENT_ENABLED dot-form constant mapping to DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED (distinct, off by default) - Register the env var in metadata/supported-configurations.json (version A, boolean) - Update DDEvaluatorTest Split call sites for the new constructor arg
…ceptor (gate-gated)
- ULeb128Encoder: ULEB128 delta-varint + base64 codec ported verbatim from the
frozen Node reference (dd-trace-js#8343); golden vector {100,108,128,130} -> ZAgUAg==,
SHA-256 hex targeting-key hashing; JDK stdlib only (java.util.Base64, MessageDigest)
- SpanEnrichmentAccumulator: per-trace state keyed by trace id in a shared
ConcurrentHashMap; frozen limits (200/10/20/5/64), UTF-8-safe truncation, object
default -> JSON (Pattern F tag shapes: ffe_flags_enc bare base64, ffe_subjects_enc /
ffe_runtime_defaults JSON objects)
- SpanEnrichmentHook: OpenFeature finally hook (capture) resolving the active local
root via AgentTracer and applying the Node branch (serialId -> addSerialId/addSubject;
missing variant -> addDefault); error-isolated
- SpanEnrichmentInterceptor: TraceInterceptor (write) flushing ffe_* onto the local
root onTraceComplete with unique priority 4, then clearing state; disable() for
provider-close cleanup
- Provider: gate-gated construction via ConfigHelper.env (DG-005 — nothing built/
registered when off), hook added to getProviderHooks, interceptor registered via
GlobalTracer, shutdown() disables interceptor + drains state
- feature-flagging-api: add compileOnly/testImplementation :internal-api for the
tracer/interceptor types (runtime-provided by the agent)
… golden round-trip)
- Codec golden-vector round-trip {100,108,128,130} -> ZAgUAg== (encode + decode +
empty + structural dedupe)
- no-span, finished-root (accumulate+dedupe+flush), runtime-default missing-variant
(ffe_runtime_defaults JSON object), per-subject cap (10 subj / 20 exp/subj) +
doLog gating, max-200 serial ids, object-default JSON + 64-char truncation
- gate-off negative control: no hook/interceptor constructed, not in getProviderHooks,
no accumulator state (DG-005)
- gate-on construction + provider-close cleanup (shutdown disables interceptor +
drains state); error isolation; interceptor priority uniqueness
- JUnit 5 only (no Groovy); injectable RootSpanResolver + Provider gate override
avoid static mocking (project mockito uses the subclass mock maker)
…econfig) Gap-closure for the three BLOCKER lifecycle defects in 02-REVIEW-java.md. The frozen contract (codec/limits/gate/tag shapes, golden vector ZAgUAg==) is unchanged — this fixes only how per-trace state is scoped, bounded, and flushed. CR-01 (partial-flush data loss + tag misattribution): CoreTracer.write -> interceptCompleteTrace runs onTraceComplete on EVERY flush, and PendingTrace.write(isPartial) excludes the still-open local root from a partial flush (getRootSpan() is null until the final write). The interceptor previously resolved the root by reference (reachable even when absent from the fragment) and unconditionally removed the accumulator on the first partial flush, dropping all pre-flush flags and tagging a not-yet-finished root. SpanEnrichmentInterceptor now flushes + removes ONLY when the local root is actually present in the flushed collection (the final write); a partial flush leaves the accumulator intact. Also adds the missing getTraceId() null guard (WR-01) and never falls back to tagging a non-root span (WR-02). CR-02 (unbounded state leak): State lived in a static ConcurrentHashMap whose only removal paths were trace-complete and shutdown, so dropped / Noop / never-finishing traces leaked forever (#4844 leak class). State now lives in a per-provider SpanEnrichmentStates store, hard-bounded at MAX_TRACES with FIFO eviction of the oldest in-flight entry, so a never-completing trace cannot grow the map unboundedly. CR-03 (reconfiguration corruption): Provider ignored addTraceInterceptor's return value, so a second gate-on provider (rejected on duplicate priority 4) still wired a hook into shared static state and its shutdown() cleared the first provider's live state. Provider now honors the registration result (wires the hook/interceptor only when registration succeeds) and each provider owns its own state store, so one provider's shutdown can never clear another's. Removes the global static map (review IN-03 root cause). Adds an injectable TraceInterceptorRegistrar test seam (mirrors the existing gate-override seam). Runtime-default rendering (String() parity) and hasData() are intentionally unchanged. L0 suite migrated to the instance-owned store; module 146 tests green.
Adds SpanEnrichmentLifecycleRegressionTest exercising the real tracer lifecycle
paths the original L0 suite bypassed (single root, no children, one provider).
Each test FAILS against the pre-fix code and PASSES after the fix:
CR-01 partialFlushExcludingRootPreservesPreFlushFlags / partialFlushNeverWrites-
TagsOnAChildSpan: a partial flush carrying only child spans (root excluded, as
dd-trace-core does) must not drain state or tag a child; the full pre+post-flush
serial-id set {100,108,128,130} -> 'ZAgUAg==' must land on the root at final
completion. Fail-before: tags written on the partial flush (NeverWantedButInvoked).
CR-02 neverCompletingTracesDoNotLeakUnbounded / boundedStoreEvictsOldestFirst:
3x MAX_TRACES distinct never-flushed traces keep the store bounded; eviction is
FIFO. Fail-before: store grew to 3x the cap (expected <=cap but was larger).
CR-03 secondProviderRejectedRegistrationDoesNotClearFirstProviderState /
eachProviderOwnsADistinctStateStore / hookAndInterceptorOfOneProviderShareThe-
SameStore: a second provider whose registration is rejected wires no
hook/interceptor and its shutdown does not clear the first provider's in-flight
state. Fail-before: second provider kept a non-null interceptor (expected null).
Fail-before verified by temporarily reintroducing each defect: 5/7 failed (the
remaining 2 are structural-invariant checks that hold by construction).
Contributor
|
Contributor
🟡 Java Benchmark SLOs — Performance SLO warning (near threshold)
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
feat(feature-flagging): FFE APM feature-flag span enrichment
Summary
Adds Feature Flag Events (FFE) span enrichment to the OpenFeature integration. When feature
flags are evaluated, the evaluation metadata is attached to the root APM span so APM customers
can filter traces and errors by active flag variant, and the FFE/Experimentation platform can
correlate spans with experiments. The wire format matches the merged reference implementation
(
dd-trace-js#8343) so backend/Trino decode is identical.How it works
finallyhook captures the evaluation (serial ID, targeting key, runtime default).SpanEnrichmentInterceptor.findLocalRootInFragment).ffe_*tags.Configuration
Opt-in, off by default:
This is distinct from
DD_EXPERIMENTAL_FLAGGING_PROVIDER_ENABLED.Span tags added
ffe_flags_encffe_subjects_encdoLog=true){ sha256(key): encodedIds }ffe_runtime_defaults{ flagKey: value }Limits: 200 serial IDs, 10 subjects, 20 experiments/subject, 5 runtime defaults, 64 chars/runtime-default value (UTF-8-safe truncation).
Changes
DD_EXPERIMENTAL_FLAGGING_PROVIDER_SPAN_ENRICHMENT_ENABLED(FeatureFlaggingConfig), off by default; surface the splitserialIdin eval metadata (ufc/v1/Split.java).ULeb128Encoder— delta-varint serial IDs, SHA256-hashed subject keys.SpanEnrichmentAccumulator,SpanEnrichmentHook,SpanEnrichmentInterceptor,SpanEnrichmentStates— accumulate evaluations and writeffe_*tags onto the local root span.Decisions
ffe_*tags.ffe_*are bare tag names on spanmeta(not_dd.-prefixed); subject keys are SHA256 hashes emitted only when logging is authorized.Validation
FFE dogfooding app
Validated live against the
ffe-dogfoodingapp via atrace-intaketee-proxy that captures the raw trace payload and decodes theffe_*tags. Flagffe-dogfooding-string-flag(serial2312):trace.annotation) carriedffe_flags_enc = iBI=decoding to serial[2312]plus a SHA256-hashedffe_subjects_enc→[2312](doLog=true), through the realdd-openfeatureprovider path.ffe_*tags.Local system-tests run
Ran the frozen
system-testsparametric suite (tests/parametric/test_ffe/test_span_enrichment.py, unchanged) againstdd-java-agent 1.64.0-SNAPSHOT(+ customdd-trace-api,dd-openfeature) built from this branch (HEAD8491123c5d):All 18 cases ran and asserted green, covering
ffe_flags_enc/ffe_subjects_enc/ffe_runtime_defaults, delta-varint encoding, SHA256 subject keys with doLog gating, and the 200/10/20/5/64 limits. (Reported asxpassedonly because the manifest declares the feature at releasev1.64.0while the local artifact is pre-release1.64.0-SNAPSHOT; it resolves to a cleanpassedat release.) No SDK source changes were required to pass. The system-tests enablement (parametric handler +manifests/java.yml) is a separate draft PR againstDataDog/system-tests.