Problem
TwirpErrorResponse stringifies the originating reqwest::Error into a rust_error: Option<String> field, which throws away the typed error and prevents downstream callers from distinguishing transport-level failures from server-returned errors with the same twirp code.
crates/twirp/src/error.rs:180:
rust_error: Option<String>,
crates/twirp/src/error.rs:215:
pub fn with_rust_error<E: std::error::Error>(self, err: E) -> Self {
self.with_rust_error_string(format!("{err:?}"))
}
After #314, From<reqwest::Error> correctly maps connect/timeout/request errors to TwirpErrorCode::Unavailable. But on the receive path, a server that responds with a twirp body whose code is unavailable (HTTP 503 — load shed, drain, brief overload, panic recovery) round-trips into the same TwirpErrorCode::Unavailable. Callers cannot tell "I couldn't reach the host" from "the host told me it was unavailable."
Why this matters
We use blackbird's twirp client to fan out queries across a sharded index. We want to evict a host from our routing table only when it's actually unreachable (connection refused, DNS failure, etc.) — not when it returned a 503. The current API forces us to either:
- Treat all
Unavailable errors as eviction-worthy, which kicks healthy-but-slow hosts out of rotation and concentrates load on the survivors (which is what we hit and reverted in github/blackbird#15118).
- Treat no
Unavailable errors as eviction-worthy, which works but loses the ability to evict known-dead hosts on the hot path (we now rely on a ~1s background poller).
If we could downcast the underlying error and call reqwest::Error::is_connect() / is_timeout(), the distinction would be trivial.
Proposed fix
Replace rust_error: Option<String> with something that preserves the typed source. Two shapes I considered:
Option A — boxed std::error::Error:
pub struct TwirpErrorResponse {
// ...
source: Option<Box<dyn std::error::Error + Send + Sync + 'static>>,
}
impl TwirpErrorResponse {
pub fn source(&self) -> Option<&(dyn std::error::Error + 'static)> { ... }
}
Callers downcast:
if let Some(src) = err.source() {
if let Some(reqwest_err) = src.downcast_ref::<reqwest::Error>() {
if reqwest_err.is_connect() { /* evict */ }
}
}
Pros: flexible, no enum churn when new source types are added.
Cons: downcasting is a little awkward and only works when the caller knows the concrete type.
Option B — typed enum:
pub enum TwirpErrorSource {
Reqwest(reqwest::Error),
Url(url::ParseError),
// ...
}
Pros: easier to pattern-match.
Cons: forces reqwest into the public API of the twirp crate (it's already a dep, but this makes it part of the surface), and the enum has to grow whenever a new From<...> impl is added.
I'd suggest Option A — it stays small and doesn't pin the public API to specific transport crates.
Either way, the existing rust_error() accessor that returns the stringified form can be preserved as a convenience: self.source.as_ref().map(|s| format!("{s:?}")).
Backward compatibility
The current rust_error: Option<String> field is private (accessed via rust_error()). Replacing the field while keeping the accessor returning Option<&String> is a breaking change for that accessor's return type but not for the rest of the API. If you'd rather avoid a return-type change, the accessor could return Option<String> by formatting on the fly.
Workaround we're using
Drop the per-query eviction entirely and rely on a background liveness poller. See github/blackbird#15118 for context.
Generated via Copilot (Claude Opus 4.7) on behalf of @tclem
Problem
TwirpErrorResponsestringifies the originatingreqwest::Errorinto arust_error: Option<String>field, which throws away the typed error and prevents downstream callers from distinguishing transport-level failures from server-returned errors with the same twirp code.crates/twirp/src/error.rs:180:crates/twirp/src/error.rs:215:After #314,
From<reqwest::Error>correctly maps connect/timeout/request errors toTwirpErrorCode::Unavailable. But on the receive path, a server that responds with a twirp body whosecodeisunavailable(HTTP 503 — load shed, drain, brief overload, panic recovery) round-trips into the sameTwirpErrorCode::Unavailable. Callers cannot tell "I couldn't reach the host" from "the host told me it was unavailable."Why this matters
We use blackbird's twirp client to fan out queries across a sharded index. We want to evict a host from our routing table only when it's actually unreachable (connection refused, DNS failure, etc.) — not when it returned a 503. The current API forces us to either:
Unavailableerrors as eviction-worthy, which kicks healthy-but-slow hosts out of rotation and concentrates load on the survivors (which is what we hit and reverted in github/blackbird#15118).Unavailableerrors as eviction-worthy, which works but loses the ability to evict known-dead hosts on the hot path (we now rely on a ~1s background poller).If we could downcast the underlying error and call
reqwest::Error::is_connect()/is_timeout(), the distinction would be trivial.Proposed fix
Replace
rust_error: Option<String>with something that preserves the typed source. Two shapes I considered:Option A — boxed
std::error::Error:Callers downcast:
Pros: flexible, no enum churn when new source types are added.
Cons: downcasting is a little awkward and only works when the caller knows the concrete type.
Option B — typed enum:
Pros: easier to pattern-match.
Cons: forces
reqwestinto the public API of the twirp crate (it's already a dep, but this makes it part of the surface), and the enum has to grow whenever a newFrom<...>impl is added.I'd suggest Option A — it stays small and doesn't pin the public API to specific transport crates.
Either way, the existing
rust_error()accessor that returns the stringified form can be preserved as a convenience:self.source.as_ref().map(|s| format!("{s:?}")).Backward compatibility
The current
rust_error: Option<String>field is private (accessed viarust_error()). Replacing the field while keeping the accessor returningOption<&String>is a breaking change for that accessor's return type but not for the rest of the API. If you'd rather avoid a return-type change, the accessor could returnOption<String>by formatting on the fly.Workaround we're using
Drop the per-query eviction entirely and rely on a background liveness poller. See github/blackbird#15118 for context.