fix: classify permanent ingest failures; cut a per-flush allocation by acoshift · Pull Request #21 · moonrhythm/quickwit

acoshift · 2026-06-13T06:24:17Z

The DO-NOW tier from the performance + send-failure review.

1. HTTP failure taxonomy — stop retrying permanent 4xx forever (the important one)

flush() treated every non-200 as retryable, so retryFlush(false) retried forever. A single misconfiguration — wrong index (404), expired token (401), or a schema-rejected batch (400/422) — parked a worker in the infinite-retry loop. With the default 2 workers, a couple of poison batches froze all ingest and blocked every Ingest caller. There was no permanent-failure path.

Now flush() classifies the status:

Permanent (400/401/403/404/409/422): settle tracked items with a new *IngestError{Reason: ReasonServer}, invoke OnDiscard for fire-and-forget items, drop the batch, and keep draining.
Retryable (5xx, 408, 425, 429, transport errors): unchanged. 413 keeps its own auto-reduce/re-chunk path.

Tracked (IngestSync) callers now get a definitive Nack/dead-letter verdict instead of an ambiguous context timeout.

2. Drop a per-flush allocation

flush() built a fresh encoded := make([]ingestItem, 0, len(batch)) (~48 KB + a full 1000-element copy at the default batch size) every round-trip, only to replay the settle list. Removed — on HTTP 200 we settle over batch directly. Identical behavior, because encode-failure and fire-and-forget items have ack==nil, so settle(nil) is a no-op for them.

3. Recover a 413-reduced batch size on the ticker

The post-413 batch-size reset only ran on a successful flush, but flush() never runs on an empty buffer — so after a transient 413 spike followed by quiet traffic, the batch could stay shrunk (down to 10%) indefinitely. The reset now also runs in the worker's ticker branch, firing within ~one maxDelay of the 10-minute window expiring. Gated on SetAutoReduceBatchSize, off by default.

Invariants preserved

Fire-and-forget batching, ordering, the 413 split, and the IngestSync at-least-once / exactly-once-settle guarantees are unchanged. The permanent path settles each item exactly once and drops the batch, so nothing is re-flushed or double-settled.

Tests

5 new (ingest_reliability_test.go): permanent 400 → prompt ReasonServer; permanent 404 doesn't wedge a single worker (second call still gets a verdict); 5xx still retried to success (guards against over-classifying); fire-and-forget 422 → OnDiscard + Close returns promptly; worker-side encode failure still discarded with the encoded slice gone (good docs delivered, poison discarded). Full suite green, go vet clean, race-clean across -count=2.

Reviewed but deferred (not in this PR)

429/Retry-After, backoff jitter, ResponseHeaderTimeout, ingest-response-body inspection (silent reject loss), SetCloseTimeout, and commit=wait_for — all from the "with care" tier, to be decided separately.

🤖 Generated with Claude Code

Three hardening changes from the perf/reliability review. 1. HTTP failure taxonomy (the important one). flush() treated EVERY non-200 as retryable, so retryFlush(false) retried forever. A single misconfig — a wrong index (404), an expired token (401), or a schema-rejected batch (400/422) — parked a worker in the infinite-retry loop, and with the default 2 workers a couple of poison batches froze all ingest and blocked every Ingest caller. flush() now treats 400/401/403/404/409/422 as permanent: it settles tracked items with a new *IngestError{Reason: ReasonServer}, invokes OnDiscard for fire-and-forget items, and drops the batch so the worker keeps draining. 5xx, 408, 425, 429, 413 and transport errors stay retryable (413 keeps its auto-reduce path). Tracked callers now get a definitive verdict instead of an ambiguous context timeout. 2. Drop a per-flush allocation. flush() built a fresh `encoded` slice (~48KB + a full copy at the default batch size) every round-trip just to replay the settle list. Removed: on HTTP 200 we settle over `batch` directly, which is identical because encode-failure and fire-and-forget items have ack==nil so settle(nil) is a no-op for them. 3. Recover a 413-reduced batch size on the ticker, not only on a later successful flush. flush() never runs on an empty buffer, so after a transient 413 spike followed by quiet traffic the batch could stay shrunk (down to 10%) indefinitely. The reset now also runs in the worker's ticker branch, so it fires within ~one maxDelay of the window expiring. Fire-and-forget batching, ordering, the 413 split, and the IngestSync at-least-once / exactly-once-settle invariants are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

acoshift merged commit 3a9d169 into main Jun 13, 2026
1 check passed

acoshift deleted the feat/ingest-reliability branch June 13, 2026 06:28

This was referenced Jun 13, 2026

feat: honor Retry-After, jitter backoff, bound response-header wait #22

Merged

feat: configurable SetCloseTimeout for the shutdown flush #24

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: classify permanent ingest failures; cut a per-flush allocation#21

fix: classify permanent ingest failures; cut a per-flush allocation#21
acoshift merged 1 commit into
mainfrom
feat/ingest-reliability

acoshift commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

acoshift commented Jun 13, 2026

1. HTTP failure taxonomy — stop retrying permanent 4xx forever (the important one)

2. Drop a per-flush allocation

3. Recover a 413-reduced batch size on the ticker

Invariants preserved

Tests

Reviewed but deferred (not in this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant