Skip to content

fix: classify permanent ingest failures; cut a per-flush allocation#21

Merged
acoshift merged 1 commit into
mainfrom
feat/ingest-reliability
Jun 13, 2026
Merged

fix: classify permanent ingest failures; cut a per-flush allocation#21
acoshift merged 1 commit into
mainfrom
feat/ingest-reliability

Conversation

@acoshift

Copy link
Copy Markdown
Member

The DO-NOW tier from the performance + send-failure review.

1. HTTP failure taxonomy — stop retrying permanent 4xx forever (the important one)

flush() treated every non-200 as retryable, so retryFlush(false) retried forever. A single misconfiguration — wrong index (404), expired token (401), or a schema-rejected batch (400/422) — parked a worker in the infinite-retry loop. With the default 2 workers, a couple of poison batches froze all ingest and blocked every Ingest caller. There was no permanent-failure path.

Now flush() classifies the status:

  • Permanent (400/401/403/404/409/422): settle tracked items with a new *IngestError{Reason: ReasonServer}, invoke OnDiscard for fire-and-forget items, drop the batch, and keep draining.
  • Retryable (5xx, 408, 425, 429, transport errors): unchanged. 413 keeps its own auto-reduce/re-chunk path.

Tracked (IngestSync) callers now get a definitive Nack/dead-letter verdict instead of an ambiguous context timeout.

2. Drop a per-flush allocation

flush() built a fresh encoded := make([]ingestItem, 0, len(batch)) (~48 KB + a full 1000-element copy at the default batch size) every round-trip, only to replay the settle list. Removed — on HTTP 200 we settle over batch directly. Identical behavior, because encode-failure and fire-and-forget items have ack==nil, so settle(nil) is a no-op for them.

3. Recover a 413-reduced batch size on the ticker

The post-413 batch-size reset only ran on a successful flush, but flush() never runs on an empty buffer — so after a transient 413 spike followed by quiet traffic, the batch could stay shrunk (down to 10%) indefinitely. The reset now also runs in the worker's ticker branch, firing within ~one maxDelay of the 10-minute window expiring. Gated on SetAutoReduceBatchSize, off by default.

Invariants preserved

Fire-and-forget batching, ordering, the 413 split, and the IngestSync at-least-once / exactly-once-settle guarantees are unchanged. The permanent path settles each item exactly once and drops the batch, so nothing is re-flushed or double-settled.

Tests

5 new (ingest_reliability_test.go): permanent 400 → prompt ReasonServer; permanent 404 doesn't wedge a single worker (second call still gets a verdict); 5xx still retried to success (guards against over-classifying); fire-and-forget 422 → OnDiscard + Close returns promptly; worker-side encode failure still discarded with the encoded slice gone (good docs delivered, poison discarded). Full suite green, go vet clean, race-clean across -count=2.

Reviewed but deferred (not in this PR)

429/Retry-After, backoff jitter, ResponseHeaderTimeout, ingest-response-body inspection (silent reject loss), SetCloseTimeout, and commit=wait_for — all from the "with care" tier, to be decided separately.

🤖 Generated with Claude Code

Three hardening changes from the perf/reliability review.

1. HTTP failure taxonomy (the important one). flush() treated EVERY non-200
   as retryable, so retryFlush(false) retried forever. A single misconfig — a
   wrong index (404), an expired token (401), or a schema-rejected batch
   (400/422) — parked a worker in the infinite-retry loop, and with the default
   2 workers a couple of poison batches froze all ingest and blocked every
   Ingest caller. flush() now treats 400/401/403/404/409/422 as permanent:
   it settles tracked items with a new *IngestError{Reason: ReasonServer},
   invokes OnDiscard for fire-and-forget items, and drops the batch so the
   worker keeps draining. 5xx, 408, 425, 429, 413 and transport errors stay
   retryable (413 keeps its auto-reduce path). Tracked callers now get a
   definitive verdict instead of an ambiguous context timeout.

2. Drop a per-flush allocation. flush() built a fresh `encoded` slice
   (~48KB + a full copy at the default batch size) every round-trip just to
   replay the settle list. Removed: on HTTP 200 we settle over `batch`
   directly, which is identical because encode-failure and fire-and-forget
   items have ack==nil so settle(nil) is a no-op for them.

3. Recover a 413-reduced batch size on the ticker, not only on a later
   successful flush. flush() never runs on an empty buffer, so after a
   transient 413 spike followed by quiet traffic the batch could stay shrunk
   (down to 10%) indefinitely. The reset now also runs in the worker's ticker
   branch, so it fires within ~one maxDelay of the window expiring.

Fire-and-forget batching, ordering, the 413 split, and the IngestSync
at-least-once / exactly-once-settle invariants are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@acoshift acoshift merged commit 3a9d169 into main Jun 13, 2026
1 check passed
@acoshift acoshift deleted the feat/ingest-reliability branch June 13, 2026 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant