Untyped ValueError/AssertionError bug#47091
Conversation
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.
Changes:
- Introduces
_OverlapDetectedas a dedicated overlap sentinel and improves overlap diagnostics inCollectionRoutingMap.is_complete_set_of_range. - Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
- Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py | Adds overlap sentinel, dedups full-load ranges by id, converts overlap ValueError into _OverlapDetected, and improves overlap error messages. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py | Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap ValueError into _IncrementalMergeFailed. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py | Sync provider catches _OverlapDetected, applies retry/backoff, and surfaces 503 on budget exhaustion. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py | Async provider mirrors the same _OverlapDetected retry/backoff behavior via await asyncio.sleep. |
| sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py | Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing. |
| sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py | Adds async equivalents plus an incremental-overlap regression test for _IncrementalMergeFailed conversion. |
| sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py | Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics. |
|
✅ Review complete (52:47) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
@sdkReviewAgent-2 |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (51:24) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
@sdkReviewAgent-2 |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (40:25) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
simorenoh
left a comment
There was a problem hiding this comment.
LGTM overall, just one question. Would ideally like for us to come up with some sort of doc or to expand on the existing docs on all the new behaviors we are adding across these PRs, but can be done as a follow-up in order to not block on testing.
| raise CosmosHttpResponseError( | ||
| status_code=http_constants.StatusCodes.SERVICE_UNAVAILABLE, | ||
| message=( | ||
| "Routing-map fetch for collection '{}' returned overlapping " | ||
| "or gapped ranges on {} attempt(s)." | ||
| ).format(collection_link, retry_attempt_count), |
There was a problem hiding this comment.
So the idea is to mix this with the changes in #47066 to ensure we don't get a retry storm out of this behavior?
| # lane can run several minutes. 900s gives ~15x headroom over the slowest | ||
| # known test while still failing fast on a genuinely hung test instead of | ||
| # letting it consume the ADO step cap. | ||
| timeout = 900 |
There was a problem hiding this comment.
also curious on how this will affect our overall testing mechanisms currently - I imagine this was just to get past the split tests?
There was a problem hiding this comment.
I thn this would be helpful addition- Before this change there was no per-test cap configured at the cosmos package level. A single hung test would consume the entire 60+ minutes for live-account lanes, starving the rest of the suite and producing only a generic step-timeout failure.( i ran into locking/stuck issues for this PR0
So no legitimate test on any current lane should fail because of the new cap. Tests that do hit the cap are the genuine stuck-test cases the change is meant to surface — now they fail in 15 min with a clean per-test traceback.
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…etry tests
Two emulator-pipeline failures:
1) routing_map_provider.py / aio: change _shared_cache_lock from Lock to RLock. `__init__` holds this lock while indexing _shared_routing_map_cache; indexing can synchronously trigger GC of a prior PartitionKeyRangeCache on the same thread (e.g. MagicMock attribute access via _mock_set_magics, or any production scenario where GC sweeps a prior instance during __init__). The swept instance's __del__ -> release() re-acquires the same lock -> deadlock under non-reentrant Lock. Adds regression tests in test_shared_pk_range_cache{,_async}.py that hang under Lock and pass under RLock.
2) test_service_retry_policies{,_async}.py: when host points at the emulator (localhost/127.0.0.1), _setup_read_regions / _setup_write_regions now route every 'region' to the real emulator URL instead of synthesizing localhost-westus:8081 etc. After the perf change keyed the shared pkrange cache by endpoint URL and the testing-gap commit started calling update_location_cache(), the SDK actually routed pkranges/ReadFeed to the synthetic per-region URLs, which fail DNS resolution. The real ServiceRequestError surfaced before the mock could run, leaving mf.counter at 0 (sync) or drifting (async). Mapping all regions to the actual host keeps both the routing-map cache valid and the request reachable.
Also tightens test_per_partition_circuit_breaker_mm_async to skip when the recovery-phase precondition isn't met instead of asserting a flaky timing window.
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent-2 |
| return True | ||
| return False | ||
|
|
||
| def _is_read_retryable_request(request, request_params: Optional[RequestObject] = None): |
There was a problem hiding this comment.
💬 Observation — Scope: This helper widens retry classification beyond PK-range cache paths
def _is_read_retryable_request(request, request_params: Optional[RequestObject] = None):This replaces _has_read_retryable_headers in three call sites (_handle_service_response_retries, sync ConnectionRetryPolicy.send, async _ConnectionRetryPolicy.send) and adds a fallback: if the thin-client proxy header is absent, it checks request_params.operation_type. This means any read-only operation whose request lacked the proxy header is now retryable where it previously was not.
In production, headers and operation_type are both set from the same source (_base.GetHeaders), so behavior is effectively unchanged for normal flows. The widening only matters for edge paths where headers are absent but request_params is threaded through — exactly the /pkranges ReadFeed scenario motivating this PR.
No action needed — the change is correct and well-tested by the 4 new TestServiceRetryPolicyHelpers tests. Noting it here because it's a behavioral expansion that affects all retry paths (not just PK-range cache) and is not mentioned in the PR description. If this is intentionally scoped broader than the PK-range fix, a brief note in the PR description would help future readers.
| # or gap) before the caller surfaces a 503. Shared by sync and async providers. | ||
| _TRANSIENT_SNAPSHOT_RETRY_MAX_ATTEMPTS = 3 | ||
| # Initial backoff (seconds) before the next retry; doubles each attempt and | ||
| # is jittered uniformly in ``[0, upper_bound]``. With MAX_ATTEMPTS=3 the |
There was a problem hiding this comment.
💬 Observation — Comment: "0 + 0.1 + 0.2" leading zero is mildly misleading
_TRANSIENT_SNAPSHOT_RETRY_INITIAL_BACKOFF_SECONDS = 0.1The comment on lines 60–61 says "worst-case cumulative sleep is 0 + 0.1 + 0.2 = 0.3s". The actual flow is: fetch #1 fails → sleep [0, 0.1] → fetch #2 fails → sleep [0, 0.2] → fetch #3 fails → raise 503. That's 2 sleeps, not 3. The leading "0" in the sum represents "no sleep before the first attempt," which is technically correct but reads like there are 3 sleep calls. Consider rephrasing to "worst-case cumulative sleep is 0.1 + 0.2 = 0.3s (2 sleeps between 3 attempts)" for clarity.
Very minor — feel free to resolve without changing.
|
✅ Review complete (32:07) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| # GC runs ``__del__`` -> ``release()``, which re-acquires this same lock on | ||
| # the same thread; with a non-reentrant ``Lock`` that re-acquisition would | ||
| # deadlock the main thread (see ``test_fetch_routing_map_recovers_after_transient_gap``). | ||
| _shared_cache_lock = threading.RLock() |
There was a problem hiding this comment.
RLock justification leans on a MagicMock repro; pin the production scenario
Files: sync and async routing_map_provider.py (threading.Lock → threading.RLock); tests/routing/test_shared_pk_range_cache{,_async}.py.
The new test (test_reentrant_release_during_init_does_not_deadlock) simulates the deadlock with _shared_cache_lock directly, not by exercising the init → GC → del chain. So:
If init is later refactored to release the lock before any GC-triggering code, the test still passes — and the RLock might no longer be needed.
If the production GC path isn't actually a deadlock source (e.g., because CPython's GC runs synchronously only at specific safe points), the change pays RLock overhead for a hypothetical bug.
RLock has measurable overhead vs. Lock; this is on a hot path acquired on every cache read.
Fix: Add a test that constructs two PartitionKeyRangeCache instances back-to-back against the same endpoint with gc.collect() forced inside init. If that deadlocks with Lock() and passes with RLock(), the RLock is justified. If not, reconsider whether the safer fix is to release the lock before any GC-triggering code path.
There was a problem hiding this comment.
yeap real deadlock - updated tests. also the Lock cost is small: It pays zero per get_routing_map call, which is what runs per cache read.
kushagraThapar
left a comment
There was a problem hiding this comment.
LGTM, thanks @dibahlfi !
|
also not going to change the code for offline status as gateway is actually rmeoving offlined ranges and we dont have the same check in other SDKs. |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:
Both are caused by the same class of transient metadata inconsistency, but with different shapes:
Why this change is needed
We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:
So transient metadata blips could fail user workloads instead of being treated as retryable.
What changed
Reviewer-relevant behavior change
Before:
After:
Retry budget and jittered backoff
When the gateway returns an inconsistent /pkranges snapshot, it's almost always because a partition split is mid-propagation across the gateway's internal caches — different gateway nodes see slightly different partition layouts for a few hundred milliseconds, and a request that lands on a lagging node sees overlap or gap. A short pause lets the gateway catch up to itself. This is different from the usual "too many requests, back off and spread out" scenario, so the retry schedule is tuned to always wait a minimum amount of time before retrying (the floor) rather than risk retrying so fast we just see the same stale snapshot again..
4 attempts total (3 sleeps before surfacing 503). Worst-case cumulative blocking time is ~1.4 s; expected ~0.775 s when all three retries occur.
Exponential schedule with a base of 0.2 s: deterministic upper bounds are 0.2 s, 0.4 s, 0.8 s.
Per-sleep cap of 2.0 s so a future bump to the attempt count cannot accidentally produce multi-second blocking sleeps.
Floored full jitter: each sleep is drawn uniformly from [floor, upper] where floor = min(0.05 s, upper / 4).
The non-zero floor eliminates the near-zero-sleep tail of pure full jitter, which would otherwise burn a retry attempt before the gateway has had any time to converge.
The uniform distribution across [floor, upper] (rather than max(floor, uniform(0, upper))) preserves full jitter's fleet-wide herd dispersion without creating a probability spike at the floor.
The upper / 4 clamp keeps the jitter range at ≥75% of the deterministic upper bound on the smallest attempts, so the floor never collapses an attempt into a near-constant wait.
Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.