Untyped ValueError/AssertionError bug by dibahlfi · Pull Request #47091 · Azure/azure-sdk-for-python

dibahlfi · 2026-05-22T22:14:51Z

This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:

ValueError("Ranges overlap ...")
AssertionError("code bug: returned overlapping ranges ... is empty")

Both are caused by the same class of transient metadata inconsistency, but with different shapes:

overlap: two ranges claim the same key space
gap: no range claims part of key space

Why this change is needed

We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:

overlap could escape as a hard ValueError
gap could flow through as an incomplete map and later crash in SmartRoutingMapProvider with the empty-overlap assertion

So transient metadata blips could fail user workloads instead of being treated as retryable.

What changed

Unified handling for full-load transient snapshot inconsistencies (overlap + gap):
detect inconsistency
retry with bounded jittered backoff
if still inconsistent after retry budget, raise typed CosmosHttpResponseError(status_code=503) so existing retry policy can absorb it
Kept incremental-path safety:
overlap from incremental merge is converted into incremental-merge failure/fallback behavior (with guarded conversion so unrelated ValueErrors still surface normally)
Added dedup-by-range-id before validation to tolerate duplicate IDs across paginated snapshots from slightly skewed gateway views.
Applied consistently in sync and async routing map providers.

Reviewer-relevant behavior change
Before:

transient overlap/gap could surface as internal exceptions (ValueError / assertion)

After:

transient inconsistency is retried internally
persistent inconsistency surfaces as retryable 503 (instead of non-retryable internal crash)

Retry budget and jittered backoff
When the gateway returns an inconsistent /pkranges snapshot, it's almost always because a partition split is mid-propagation across the gateway's internal caches — different gateway nodes see slightly different partition layouts for a few hundred milliseconds, and a request that lands on a lagging node sees overlap or gap. A short pause lets the gateway catch up to itself. This is different from the usual "too many requests, back off and spread out" scenario, so the retry schedule is tuned to always wait a minimum amount of time before retrying (the floor) rather than risk retrying so fast we just see the same stale snapshot again..

4 attempts total (3 sleeps before surfacing 503). Worst-case cumulative blocking time is ~1.4 s; expected ~0.775 s when all three retries occur.
Exponential schedule with a base of 0.2 s: deterministic upper bounds are 0.2 s, 0.4 s, 0.8 s.
Per-sleep cap of 2.0 s so a future bump to the attempt count cannot accidentally produce multi-second blocking sleeps.
Floored full jitter: each sleep is drawn uniformly from [floor, upper] where floor = min(0.05 s, upper / 4).
The non-zero floor eliminates the near-zero-sleep tail of pure full jitter, which would otherwise burn a retry attempt before the gateway has had any time to converge.
The uniform distribution across [floor, upper] (rather than max(floor, uniform(0, upper))) preserves full jitter's fleet-wide herd dispersion without creating a probability spike at the floor.
The upper / 4 clamp keeps the jitter range at ≥75% of the deterministic upper bound on the smallest attempts, so the floor never collapses an attempt into a near-constant wait.

Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.

dibahlfi · 2026-05-22T22:15:31Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T22:16:06Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T22:16:23Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.

Changes:

Introduces _OverlapDetected as a dedicated overlap sentinel and improves overlap diagnostics in CollectionRoutingMap.is_complete_set_of_range.
Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py	Adds overlap sentinel, dedups full-load ranges by id, converts overlap `ValueError` into `_OverlapDetected`, and improves overlap error messages.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py	Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap `ValueError` into `_IncrementalMergeFailed`.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py	Sync provider catches `_OverlapDetected`, applies retry/backoff, and surfaces 503 on budget exhaustion.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py	Async provider mirrors the same `_OverlapDetected` retry/backoff behavior via `await asyncio.sleep`.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py	Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py	Adds async equivalents plus an incremental-overlap regression test for `_IncrementalMergeFailed` conversion.
sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py	Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics.

xinlian12 · 2026-05-22T23:08:34Z

✅ Review complete (52:47)

Posted 4 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-23T03:40:59Z

/azp run python - cosmos - tests

dibahlfi · 2026-05-23T03:41:09Z

@sdkReviewAgent-2

azure-pipelines · 2026-05-23T03:41:14Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-23T04:33:17Z

✅ Review complete (51:24)

Posted 2 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-23T04:51:33Z

/azp run python - cosmos - tests

dibahlfi · 2026-05-23T04:51:37Z

@sdkReviewAgent-2

azure-pipelines · 2026-05-23T04:51:47Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-25T14:33:12Z

✅ Review complete (40:25)

Posted 1 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-26T20:54:49Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-26T20:55:05Z

Azure Pipelines successfully started running 1 pipeline(s).

dibahlfi · 2026-05-26T21:17:30Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-26T21:17:46Z

Azure Pipelines successfully started running 1 pipeline(s).

simorenoh

LGTM overall, just one question. Would ideally like for us to come up with some sort of doc or to expand on the existing docs on all the new behaviors we are adding across these PRs, but can be done as a follow-up in order to not block on testing.

simorenoh · 2026-05-26T23:50:58Z

+        raise CosmosHttpResponseError(
+            status_code=http_constants.StatusCodes.SERVICE_UNAVAILABLE,
+            message=(
+                "Routing-map fetch for collection '{}' returned overlapping "
+                "or gapped ranges on {} attempt(s)."
+            ).format(collection_link, retry_attempt_count),


So the idea is to mix this with the changes in #47066 to ensure we don't get a retry storm out of this behavior?

simorenoh · 2026-05-27T00:05:54Z

+# lane can run several minutes. 900s gives ~15x headroom over the slowest
+# known test while still failing fast on a genuinely hung test instead of
+# letting it consume the ADO step cap.
+timeout = 900


also curious on how this will affect our overall testing mechanisms currently - I imagine this was just to get past the split tests?

I thn this would be helpful addition- Before this change there was no per-test cap configured at the cosmos package level. A single hung test would consume the entire 60+ minutes for live-account lanes, starving the rest of the suite and producing only a generic step-timeout failure.( i ran into locking/stuck issues for this PR0
So no legitimate test on any current lane should fail because of the new cap. Tests that do hit the cap are the genuine stuck-test cases the change is meant to surface — now they fail in 15 min with a clean per-test traceback.

dibahlfi · 2026-05-27T01:03:59Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T01:04:14Z

Azure Pipelines successfully started running 1 pipeline(s).

…etry tests Two emulator-pipeline failures: 1) routing_map_provider.py / aio: change _shared_cache_lock from Lock to RLock. `__init__` holds this lock while indexing _shared_routing_map_cache; indexing can synchronously trigger GC of a prior PartitionKeyRangeCache on the same thread (e.g. MagicMock attribute access via _mock_set_magics, or any production scenario where GC sweeps a prior instance during __init__). The swept instance's __del__ -> release() re-acquires the same lock -> deadlock under non-reentrant Lock. Adds regression tests in test_shared_pk_range_cache{,_async}.py that hang under Lock and pass under RLock. 2) test_service_retry_policies{,_async}.py: when host points at the emulator (localhost/127.0.0.1), _setup_read_regions / _setup_write_regions now route every 'region' to the real emulator URL instead of synthesizing localhost-westus:8081 etc. After the perf change keyed the shared pkrange cache by endpoint URL and the testing-gap commit started calling update_location_cache(), the SDK actually routed pkranges/ReadFeed to the synthetic per-region URLs, which fail DNS resolution. The real ServiceRequestError surfaced before the mock could run, leaving mf.counter at 0 (sync) or drifting (async). Mapping all regions to the actual host keeps both the routing-map cache valid and the request reachable. Also tightens test_per_partition_circuit_breaker_mm_async to skip when the recovery-phase precondition isn't met instead of asserting a flaky timing window.

dibahlfi · 2026-05-27T04:42:42Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T04:42:57Z

Azure Pipelines successfully started running 1 pipeline(s).

dibahlfi · 2026-05-27T04:57:34Z

@sdkReviewAgent-2

xinlian12 · 2026-05-27T14:29:08Z

        return True
    return False

+def _is_read_retryable_request(request, request_params: Optional[RequestObject] = None):


💬 Observation — Scope: This helper widens retry classification beyond PK-range cache paths

def _is_read_retryable_request(request, request_params: Optional[RequestObject] = None):

This replaces _has_read_retryable_headers in three call sites (_handle_service_response_retries, sync ConnectionRetryPolicy.send, async _ConnectionRetryPolicy.send) and adds a fallback: if the thin-client proxy header is absent, it checks request_params.operation_type. This means any read-only operation whose request lacked the proxy header is now retryable where it previously was not.

In production, headers and operation_type are both set from the same source (_base.GetHeaders), so behavior is effectively unchanged for normal flows. The widening only matters for edge paths where headers are absent but request_params is threaded through — exactly the /pkranges ReadFeed scenario motivating this PR.

No action needed — the change is correct and well-tested by the 4 new TestServiceRetryPolicyHelpers tests. Noting it here because it's a behavioral expansion that affects all retry paths (not just PK-range cache) and is not mentioned in the PR description. If this is intentionally scoped broader than the PK-range fix, a brief note in the PR description would help future readers.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

xinlian12 · 2026-05-27T14:29:25Z

+# or gap) before the caller surfaces a 503. Shared by sync and async providers.
+_TRANSIENT_SNAPSHOT_RETRY_MAX_ATTEMPTS = 3
+# Initial backoff (seconds) before the next retry; doubles each attempt and
+# is jittered uniformly in ``[0, upper_bound]``. With MAX_ATTEMPTS=3 the


💬 Observation — Comment: "0 + 0.1 + 0.2" leading zero is mildly misleading

_TRANSIENT_SNAPSHOT_RETRY_INITIAL_BACKOFF_SECONDS = 0.1

The comment on lines 60–61 says "worst-case cumulative sleep is 0 + 0.1 + 0.2 = 0.3s". The actual flow is: fetch #1 fails → sleep [0, 0.1] → fetch #2 fails → sleep [0, 0.2] → fetch #3 fails → raise 503. That's 2 sleeps, not 3. The leading "0" in the sum represents "no sleep before the first attempt," which is technically correct but reads like there are 3 sleep calls. Consider rephrasing to "worst-case cumulative sleep is 0.1 + 0.2 = 0.3s (2 sleeps between 3 attempts)" for clarity.

Very minor — feel free to resolve without changing.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

xinlian12 · 2026-05-27T14:29:59Z

✅ Review complete (32:07)

Posted 3 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-27T16:39:45Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T16:40:02Z

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 · 2026-05-27T18:33:55Z

+# GC runs ``__del__`` -> ``release()``, which re-acquires this same lock on
+# the same thread; with a non-reentrant ``Lock`` that re-acquisition would
+# deadlock the main thread (see ``test_fetch_routing_map_recovers_after_transient_gap``).
+_shared_cache_lock = threading.RLock()


RLock justification leans on a MagicMock repro; pin the production scenario

Files: sync and async routing_map_provider.py (threading.Lock → threading.RLock); tests/routing/test_shared_pk_range_cache{,_async}.py.

The new test (test_reentrant_release_during_init_does_not_deadlock) simulates the deadlock with _shared_cache_lock directly, not by exercising the init → GC → del chain. So:

If init is later refactored to release the lock before any GC-triggering code, the test still passes — and the RLock might no longer be needed.
If the production GC path isn't actually a deadlock source (e.g., because CPython's GC runs synchronously only at specific safe points), the change pays RLock overhead for a hypothetical bug.
RLock has measurable overhead vs. Lock; this is on a hot path acquired on every cache read.

Fix: Add a test that constructs two PartitionKeyRangeCache instances back-to-back against the same endpoint with gc.collect() forced inside init. If that deadlocks with Lock() and passes with RLock(), the RLock is justified. If not, reconsider whether the safer fix is to release the lock before any GC-triggering code path.

yeap real deadlock - updated tests. also the Lock cost is small: It pays zero per get_routing_map call, which is what runs per cache read.

tvaron3

LGTM

kushagraThapar

LGTM, thanks @dibahlfi !

dibahlfi · 2026-05-27T22:48:14Z

also not going to change the code for offline status as gateway is actually rmeoving offlined ranges and we dont have the same check in other SDKs.

dibahlfi · 2026-05-27T23:28:28Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T23:28:43Z

Azure Pipelines successfully started running 1 pipeline(s).

fixing ValueError bug

34447a0

dibahlfi requested a review from a team as a code owner May 22, 2026 22:14

Copilot AI review requested due to automatic review settings May 22, 2026 22:14

Copilot started reviewing on behalf of dibahlfi May 22, 2026 22:15 View session

github-actions Bot added the Cosmos label May 22, 2026

github-project-automation Bot added this to CosmosDB Python Eco-System May 22, 2026

Copilot AI reviewed May 22, 2026

View reviewed changes

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated

dibahlfi changed the title ~~fixing ValueError bug~~ Untyped ValueError/AssertionError bug May 23, 2026

fixing copilot comments

15a1c68

xinlian12 reviewed May 23, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py

xinlian12 reviewed May 23, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py

fixing tests

d439dad

xinlian12 reviewed May 25, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py

fixing test

2a52bdc

fixing logic bug

cd4f71f

simorenoh reviewed May 27, 2026

View reviewed changes

fixing testing gap

fa0335c

dibahlfi added 2 commits May 26, 2026 22:48

fixing testing gap

6f3494c

xinlian12 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Outdated

xinlian12 reviewed May 27, 2026

View reviewed changes

addressing pylint errors

668cfc4

tvaron3 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/dev_requirements.txt Outdated

tvaron3 reviewed May 27, 2026

View reviewed changes

tvaron3 approved these changes May 27, 2026

View reviewed changes

kushagraThapar approved these changes May 27, 2026

View reviewed changes

adding new RLock specific tests

dbe2df9

Conversation

dibahlfi commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dibahlfi commented May 22, 2026

Uh oh!

dibahlfi commented May 22, 2026

Uh oh!

azure-pipelines Bot commented May 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xinlian12 commented May 22, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

azure-pipelines Bot commented May 23, 2026

Uh oh!

Uh oh!

Uh oh!

xinlian12 commented May 23, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

azure-pipelines Bot commented May 23, 2026

Uh oh!

Uh oh!

xinlian12 commented May 25, 2026

Uh oh!

dibahlfi commented May 26, 2026

Uh oh!

azure-pipelines Bot commented May 26, 2026

Uh oh!

dibahlfi commented May 26, 2026

Uh oh!

azure-pipelines Bot commented May 26, 2026

Uh oh!

simorenoh left a comment

Choose a reason for hiding this comment

Uh oh!

simorenoh May 26, 2026

Choose a reason for hiding this comment

Uh oh!

simorenoh May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dibahlfi May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

Uh oh!

xinlian12 May 27, 2026

Choose a reason for hiding this comment

Uh oh!

xinlian12 May 27, 2026

Choose a reason for hiding this comment

dibahlfi commented May 22, 2026 •

edited

Loading

tvaron3 May 27, 2026 •

edited

Loading