Skip to content

Untyped ValueError/AssertionError bug#47091

Open
dibahlfi wants to merge 13 commits into
mainfrom
users/dibahl/pkrangevalue-error-fix
Open

Untyped ValueError/AssertionError bug#47091
dibahlfi wants to merge 13 commits into
mainfrom
users/dibahl/pkrangevalue-error-fix

Conversation

@dibahlfi
Copy link
Copy Markdown
Member

@dibahlfi dibahlfi commented May 22, 2026

This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:

  1. ValueError("Ranges overlap ...")
  2. AssertionError("code bug: returned overlapping ranges ... is empty")

Both are caused by the same class of transient metadata inconsistency, but with different shapes:

  • overlap: two ranges claim the same key space
  • gap: no range claims part of key space

Why this change is needed

We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:

  • overlap could escape as a hard ValueError
  • gap could flow through as an incomplete map and later crash in SmartRoutingMapProvider with the empty-overlap assertion

So transient metadata blips could fail user workloads instead of being treated as retryable.

What changed

  • Unified handling for full-load transient snapshot inconsistencies (overlap + gap):
  • detect inconsistency
  • retry with bounded jittered backoff
  • if still inconsistent after retry budget, raise typed CosmosHttpResponseError(status_code=503) so existing retry policy can absorb it
  • Kept incremental-path safety:
  • overlap from incremental merge is converted into incremental-merge failure/fallback behavior (with guarded conversion so unrelated ValueErrors still surface normally)
  • Added dedup-by-range-id before validation to tolerate duplicate IDs across paginated snapshots from slightly skewed gateway views.
  • Applied consistently in sync and async routing map providers.

Reviewer-relevant behavior change
Before:

  • transient overlap/gap could surface as internal exceptions (ValueError / assertion)

After:

  • transient inconsistency is retried internally
  • persistent inconsistency surfaces as retryable 503 (instead of non-retryable internal crash)

Retry budget and jittered backoff
When the gateway returns an inconsistent /pkranges snapshot, it's almost always because a partition split is mid-propagation across the gateway's internal caches — different gateway nodes see slightly different partition layouts for a few hundred milliseconds, and a request that lands on a lagging node sees overlap or gap. A short pause lets the gateway catch up to itself. This is different from the usual "too many requests, back off and spread out" scenario, so the retry schedule is tuned to always wait a minimum amount of time before retrying (the floor) rather than risk retrying so fast we just see the same stale snapshot again..

4 attempts total (3 sleeps before surfacing 503). Worst-case cumulative blocking time is ~1.4 s; expected ~0.775 s when all three retries occur.
Exponential schedule with a base of 0.2 s: deterministic upper bounds are 0.2 s, 0.4 s, 0.8 s.
Per-sleep cap of 2.0 s so a future bump to the attempt count cannot accidentally produce multi-second blocking sleeps.
Floored full jitter: each sleep is drawn uniformly from [floor, upper] where floor = min(0.05 s, upper / 4).
The non-zero floor eliminates the near-zero-sleep tail of pure full jitter, which would otherwise burn a retry attempt before the gateway has had any time to converge.
The uniform distribution across [floor, upper] (rather than max(floor, uniform(0, upper))) preserves full jitter's fleet-wide herd dispersion without creating a probability spike at the floor.
The upper / 4 clamp keeps the jitter range at ≥75% of the deterministic upper bound on the smallest attempts, so the floor never collapses an attempt into a near-constant wait.

Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.

@dibahlfi dibahlfi requested a review from a team as a code owner May 22, 2026 22:14
Copilot AI review requested due to automatic review settings May 22, 2026 22:14
@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.

Changes:

  • Introduces _OverlapDetected as a dedicated overlap sentinel and improves overlap diagnostics in CollectionRoutingMap.is_complete_set_of_range.
  • Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
  • Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py Adds overlap sentinel, dedups full-load ranges by id, converts overlap ValueError into _OverlapDetected, and improves overlap error messages.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap ValueError into _IncrementalMergeFailed.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Sync provider catches _OverlapDetected, applies retry/backoff, and surfaces 503 on budget exhaustion.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Async provider mirrors the same _OverlapDetected retry/backoff behavior via await asyncio.sleep.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py Adds async equivalents plus an incremental-overlap regression test for _IncrementalMergeFailed conversion.
sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics.

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated
@xinlian12
Copy link
Copy Markdown
Member

Review complete (52:47)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi dibahlfi changed the title fixing ValueError bug Untyped ValueError/AssertionError bug May 23, 2026
@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Copy Markdown
Member

Review complete (51:24)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py
@xinlian12
Copy link
Copy Markdown
Member

Review complete (40:25)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just one question. Would ideally like for us to come up with some sort of doc or to expand on the existing docs on all the new behaviors we are adding across these PRs, but can be done as a follow-up in order to not block on testing.

Comment on lines +102 to +107
raise CosmosHttpResponseError(
status_code=http_constants.StatusCodes.SERVICE_UNAVAILABLE,
message=(
"Routing-map fetch for collection '{}' returned overlapping "
"or gapped ranges on {} attempt(s)."
).format(collection_link, retry_attempt_count),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is to mix this with the changes in #47066 to ensure we don't get a retry storm out of this behavior?

# lane can run several minutes. 900s gives ~15x headroom over the slowest
# known test while still failing fast on a genuinely hung test instead of
# letting it consume the ADO step cap.
timeout = 900
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also curious on how this will affect our overall testing mechanisms currently - I imagine this was just to get past the split tests?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thn this would be helpful addition- Before this change there was no per-test cap configured at the cosmos package level. A single hung test would consume the entire 60+ minutes for live-account lanes, starving the rest of the suite and producing only a generic step-timeout failure.( i ran into locking/stuck issues for this PR0
So no legitimate test on any current lane should fail because of the new cap. Tests that do hit the cap are the genuine stuck-test cases the change is meant to surface — now they fail in 15 min with a clean per-test traceback.

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

dibahlfi added 2 commits May 26, 2026 22:48
…etry tests

Two emulator-pipeline failures:

1) routing_map_provider.py / aio: change _shared_cache_lock from Lock to RLock. `__init__` holds this lock while indexing _shared_routing_map_cache; indexing can synchronously trigger GC of a prior PartitionKeyRangeCache on the same thread (e.g. MagicMock attribute access via _mock_set_magics, or any production scenario where GC sweeps a prior instance during __init__). The swept instance's __del__ -> release() re-acquires the same lock -> deadlock under non-reentrant Lock. Adds regression tests in test_shared_pk_range_cache{,_async}.py that hang under Lock and pass under RLock.

2) test_service_retry_policies{,_async}.py: when host points at the emulator (localhost/127.0.0.1), _setup_read_regions / _setup_write_regions now route every 'region' to the real emulator URL instead of synthesizing localhost-westus:8081 etc. After the perf change keyed the shared pkrange cache by endpoint URL and the testing-gap commit started calling update_location_cache(), the SDK actually routed pkranges/ReadFeed to the synthetic per-region URLs, which fail DNS resolution. The real ServiceRequestError surfaced before the mock could run, leaving mf.counter at 0 (sync) or drifting (async). Mapping all regions to the actual host keeps both the routing-map cache valid and the request reachable.

Also tightens test_per_partition_circuit_breaker_mm_async to skip when the recovery-phase precondition isn't met instead of asserting a flaky timing window.
@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Outdated
return True
return False

def _is_read_retryable_request(request, request_params: Optional[RequestObject] = None):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Observation — Scope: This helper widens retry classification beyond PK-range cache paths

def _is_read_retryable_request(request, request_params: Optional[RequestObject] = None):

This replaces _has_read_retryable_headers in three call sites (_handle_service_response_retries, sync ConnectionRetryPolicy.send, async _ConnectionRetryPolicy.send) and adds a fallback: if the thin-client proxy header is absent, it checks request_params.operation_type. This means any read-only operation whose request lacked the proxy header is now retryable where it previously was not.

In production, headers and operation_type are both set from the same source (_base.GetHeaders), so behavior is effectively unchanged for normal flows. The widening only matters for edge paths where headers are absent but request_params is threaded through — exactly the /pkranges ReadFeed scenario motivating this PR.

No action needed — the change is correct and well-tested by the 4 new TestServiceRetryPolicyHelpers tests. Noting it here because it's a behavioral expansion that affects all retry paths (not just PK-range cache) and is not mentioned in the PR description. If this is intentionally scoped broader than the PK-range fix, a brief note in the PR description would help future readers.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

# or gap) before the caller surfaces a 503. Shared by sync and async providers.
_TRANSIENT_SNAPSHOT_RETRY_MAX_ATTEMPTS = 3
# Initial backoff (seconds) before the next retry; doubles each attempt and
# is jittered uniformly in ``[0, upper_bound]``. With MAX_ATTEMPTS=3 the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Observation — Comment: "0 + 0.1 + 0.2" leading zero is mildly misleading

_TRANSIENT_SNAPSHOT_RETRY_INITIAL_BACKOFF_SECONDS = 0.1

The comment on lines 60–61 says "worst-case cumulative sleep is 0 + 0.1 + 0.2 = 0.3s". The actual flow is: fetch #1 fails → sleep [0, 0.1] → fetch #2 fails → sleep [0, 0.2] → fetch #3 fails → raise 503. That's 2 sleeps, not 3. The leading "0" in the sum represents "no sleep before the first attempt," which is technically correct but reads like there are 3 sleep calls. Consider rephrasing to "worst-case cumulative sleep is 0.1 + 0.2 = 0.3s (2 sleeps between 3 attempts)" for clarity.

Very minor — feel free to resolve without changing.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (32:07)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/dev_requirements.txt Outdated
# GC runs ``__del__`` -> ``release()``, which re-acquires this same lock on
# the same thread; with a non-reentrant ``Lock`` that re-acquisition would
# deadlock the main thread (see ``test_fetch_routing_map_recovers_after_transient_gap``).
_shared_cache_lock = threading.RLock()
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RLock justification leans on a MagicMock repro; pin the production scenario

Files: sync and async routing_map_provider.py (threading.Lock → threading.RLock); tests/routing/test_shared_pk_range_cache{,_async}.py.

The new test (test_reentrant_release_during_init_does_not_deadlock) simulates the deadlock with _shared_cache_lock directly, not by exercising the init → GC → del chain. So:

If init is later refactored to release the lock before any GC-triggering code, the test still passes — and the RLock might no longer be needed.
If the production GC path isn't actually a deadlock source (e.g., because CPython's GC runs synchronously only at specific safe points), the change pays RLock overhead for a hypothetical bug.
RLock has measurable overhead vs. Lock; this is on a hot path acquired on every cache read.

Fix: Add a test that constructs two PartitionKeyRangeCache instances back-to-back against the same endpoint with gc.collect() forced inside init. If that deadlocks with Lock() and passes with RLock(), the RLock is justified. If not, reconsider whether the safer fix is to release the lock before any GC-triggering code path.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap real deadlock - updated tests. also the Lock cost is small: It pays zero per get_routing_map call, which is what runs per cache read.

Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @dibahlfi !

@dibahlfi
Copy link
Copy Markdown
Member Author

also not going to change the code for offline status as gateway is actually rmeoving offlined ranges and we dont have the same check in other SDKs.

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

6 participants