feat: honor Retry-After, jitter backoff, bound response-header wait#22
Merged
Conversation
Three transport/retry hardening items from the review. - 429/503 Retry-After: flush() now reads the Retry-After header on a 429 (or 503), parses both delta-seconds and HTTP-date forms, clamps to RetryAfterMax (60s), and paces the next retry to it instead of the fixed 100ms-1s backoff. The value is carried to retryFlush via a worker-local; the retry sleep is now a closeSignal-interruptible timer, so a long Retry-After (or Close) cannot delay shutdown. Retry-After is deliberately ignored on the closing path. - Backoff jitter: both retry loops use equal jitter (sleep in [d/2, d], via math/rand/v2) so retries across workers spread out instead of arriving in lockstep waves that can re-tip a recovering server. - ResponseHeaderTimeout: the default transport now bounds the wait for response headers to getIngestTimeout()/3 (~5s default). A server that completes the request but stalls before responding frees the worker in ~5s instead of the full 15s deadline. Firing returns a transport error, already retryable. User-supplied clients via SetHTTPClient are untouched. 429/503 stay retryable; the failure taxonomy, fire-and-forget batching, ordering, and the at-least-once/settle invariants are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The cheap reliability bundle from the review (#2, #4, #3). All in the
flush/retryFlush/transport path next to the taxonomy from #21.429 / Retry-After (#2)
flush()now readsRetry-Afteron a 429 (or 503), parses both delta-seconds and HTTP-date forms, clamps toRetryAfterMax(60s), and paces the next retry to it instead of the fixed 100ms–1s backoff. The value is carried toretryFlushvia a worker-local, and the retry sleep is now acloseSignal-interruptible timer, so a longRetry-After(or aClose) can't delay shutdown.Retry-Afteris intentionally ignored on the closing path so a server-requested delay can't stretch shutdown.→ Stops the client re-POSTing a saturated Quickwit ingest queue once/sec and lets it cooperate with server-side pacing.
Backoff jitter (#3)
Both retry loops use equal jitter (sleep in
[d/2, d], viamath/rand/v2— no new dep, concurrency-safe). De-synchronizes retries across workers so they don't arrive in lockstep waves that can re-tip a recovering server.ResponseHeaderTimeout (#4)
The default transport bounds the wait for response headers to
getIngestTimeout()/3(~5s default). A server that completes the request but stalls before responding (typical L4 LB fronting an overloaded indexer) frees the worker in ~5s instead of the full 15s deadline. Firing returns a transport error, which is already retryable. User-supplied clients viaSetHTTPClientare untouched.Invariants
429/503 stay retryable; the failure taxonomy, fire-and-forget batching, ordering, and the at-least-once/exactly-once-settle guarantees are unchanged.
retryAfteris a worker-local (one perloop()goroutine), so no new shared state.Tests
parseRetryAfter(seconds / HTTP-date / past-date / empty / garbage / zero / negative),jitterBackoffbounds over 1000 draws,ResponseHeaderTimeout == ingestTimeout/3.Retry-After: 1paces the retry to ~1s (vs the 100ms backoff) and still delivers; held over-count=4 -race.Full suite green,
go vetclean, race-clean across-count=2.🤖 Generated with Claude Code