feed range query fix by dibahlfi · Pull Request #47105 · Azure/azure-sdk-for-python

dibahlfi · 2026-05-24T21:55:13Z

Summary
This PR makes Python Cosmos SDK query continuation tokens safe to round-trip across SDK versions during a rolling deployment, while fixing the silent split-bug class for feed_range and prefix partition_key queries.
The behavior is governed by a single rule, applied per request based on how many physical partitions the caller's input scope currently maps to in the PKRANGE cache:
• Currently 1 partition → emit the legacy single-string continuation (read and write). Old SDKs can read it; new SDKs can resume it.
• Currently >1 partition → emit the structured v=1 envelope (read and write). Required to safely represent one resume position per partition.
• Inbound legacy token, but the input scope now covers >1 partition (split happened since the token was minted) → restart that input range once, then emit v=1 from that page onward.
Full-PK queries are structurally always single-partition, so they always emit legacy format regardless of inbound — the strongest cross-version guarantee in the SDK.
Token formats
• Legacy — opaque single-string continuation the SDK has emitted for years. One position slot. Readable by any SDK version.
• v=1 envelope — base64-encoded JSON carrying one resume position per partition plus identity hashes. Only the new SDK can read it; old SDKs forward it verbatim and the service rejects it with 400 BadRequest.
Decoded v=1 shape(conceptual):
{
"v": 1,
"qh": "<hash of query text + parameters>",
"cr": "/dbs/.../colls/...",
"frh": "",
"c": [
{ "min": "...", "max": "...", "bc": "" },
...
]
}
qh / cr / frh let the SDK refuse a token replayed against a different query, container, or feed_range. On mismatch the SDK raises ValueError unwrapped to the caller.

Cross-SDK compatibility matrix

Direction	Full-PK	Feed-range / prefix-PK (single-partition input range)	Feed-range / prefix-PK (multi-partition input range)
Old → Old	Restart on split when service no longer honors parent-era continuation (today's behavior).	Original silent split bug if a split happens before replay.	Original silent split bug if a split happens before replay.
New → New	Always legacy format; resume with legacy token; restart only if service rejects parent-era continuation after split (fallback restart).	Legacy format; resume from saved position; restart only if a split crossed the input range and it is now multi-partition.	v=1; resume per partition; split-safe.
New → Old	Zero impact (always legacy format).	Zero impact (new SDK emitted legacy format).	400 on the old reader.

Mid-page failure handling:
When the pagination loop is mid-page and a backend POST raises (any HTTP error, network failure, etc.), the SDK stamps a resumable checkpoint into last_response_headers[Continuation] before re-raising the original exception. The checkpoint shape follows the same per-request rule above: legacy if the input range currently maps to one partition (always, for full-PK), v=1 if currently more than one. The caller's by_page() loop sees the exception and can resume on the saved checkpoint.
Rollout safety
• Single-partition input ranges (the common case): zero impact across versions in either direction. Bookmarks stay legacy both ways.
• Full-PK: structurally single-partition forever. Strongest guarantee — persist a bookmark, replay from any SDK version, no 400.
• Multi-partition input ranges: bookmarks are v=1 and only readable by the new SDK. During rolling upgrade, an old pod that picks up a new pod's v=1 token loses that page with a 400 and clears-and-restarts.

(cherry picked from commit cc412f0)

dibahlfi · 2026-05-24T21:55:52Z

@sdkReviewAgent-2

dibahlfi · 2026-05-24T21:56:03Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-24T21:56:17Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR updates the Cosmos DB Python SDK query pipeline to make query_items continuation tokens split-safe for feed_range and prefix partition-key queries, and to allow safe continuation round-trips across SDK versions during rolling deployments by emitting legacy vs structured tokens based on the current physical partition overlap.

Changes:

Introduces a v=1 structured continuation token format for multi-partition-overlap feed_range/prefix-PK queries, while preserving legacy single-string continuations for full-PK and single-partition scopes.
Refactors sync/async __QueryFeed feed-range pagination logic to be split-aware, checkpoint on mid-page failures, and validate token identity (collection/query/feed_range).
Adds/updates extensive unit + end-to-end tests covering multi-partition overlap pagination, post-split resume behavior, checkpointing, and aggregate merge behavior; updates index-metrics assertions to be region/build tolerant.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
sdk/cosmos/azure-cosmos/CHANGELOG.md	Adds bug-fix entries for feed_range split pagination and VALUE AVG multi-partition behavior.
sdk/cosmos/azure-cosmos/azure/cosmos/_base.py	Updates query result merging to better classify/handle aggregate partials; adds clearer error for unsupported VALUE AVG merges.
sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py	Refactors sync `__QueryFeed` to use structured feed_range continuation and split-safe pagination helpers, plus checkpoint-on-failure.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/base_execution_context.py	Adds per-query response-header capture dict and uses it to resume from checkpoint after 410 split retries.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/base_execution_context.py	Async equivalent of per-query header capture and checkpoint-based 410 retry resume.
sdk/cosmos/azure-cosmos/azure/cosmos/_query_aggregate_utils.py	New helpers to detect/identify VALUE aggregates and classify aggregate partial rows.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/feed_range_continuation.py	New shared implementation for structured v=1 token codec, identity hashing, and feed-range pagination helpers.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py	Refactors async `__QueryFeed` similarly to sync, using shared helpers for split-safe feed_range pagination and checkpointing.
sdk/cosmos/azure-cosmos/azure/cosmos/container.py	Passes container properties into the request pipeline to support full-PK feed-range routing logic.
sdk/cosmos/azure-cosmos/tests/test_crud_subpartition.py	Adds resume/pagination tests for full and prefix hierarchical partition keys (MultiHash).
sdk/cosmos/azure-cosmos/tests/test_crud_subpartition_async.py	Async equivalents of new subpartition resume/pagination tests.
sdk/cosmos/azure-cosmos/tests/test_partition_split_retry_unit.py	Adds unit tests validating internal header capture + checkpoint continuation use during 410 retry flows (sync).
sdk/cosmos/azure-cosmos/tests/test_partition_split_retry_unit_async.py	Async equivalents of new 410 retry checkpoint/capture unit tests.
sdk/cosmos/azure-cosmos/tests/test_query.py	Adds tests asserting full-PK queries emit legacy continuations and resume correctly; relaxes index-metrics assertions.
sdk/cosmos/azure-cosmos/tests/test_query_async.py	Async equivalents of full-PK legacy continuation tests; relaxes index-metrics assertions.
sdk/cosmos/azure-cosmos/tests/test_query_cross_partition.py	Relaxes index-metrics assertions to tolerate backend variations.
sdk/cosmos/azure-cosmos/tests/test_query_feed_range_multipartition.py	New end-to-end tests for multi-partition-overlap feed_range pagination, merge fallback, checkpointing, identity mismatch, and post-split resume.
sdk/cosmos/azure-cosmos/tests/test_query_feed_range_multipartition_async.py	Async equivalents of new multi-partition feed_range end-to-end tests.

dibahlfi · 2026-05-25T00:19:30Z

@sdkReviewAgent-2

xinlian12 · 2026-05-25T15:45:37Z

✅ Review complete (11:14)

Posted 5 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-25T20:43:13Z

@sdkReviewAgent-2

dibahlfi · 2026-05-25T20:43:26Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-25T20:43:40Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-25T21:51:59Z

✅ Review complete (07:47)

Posted 2 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

simorenoh · 2026-05-26T20:57:17Z

Testing Gap: No Live Test for SELECT VALUE AVG(...) Raising ValueError

Background: The PR's new _query_aggregate_utils.py + _base.py logic detects when a SELECT VALUE AVG(...) query spans multiple physical partitions. Previously, the SDK would silently merge AVG partials by addition — producing mathematically wrong results. Now it raises ValueError instead.

What's covered:

Unit tests in test_feed_range_continuation_token.py cover the parser/classifier (detecting AVG as an aggregate type)
Unit tests cover COUNT/SUM/MIN/MAX merge paths
Live tests cover COUNT and SUM across multi-partition feed ranges

What's missing:

No unit test exercises the actual AVG merge-raises-ValueError path
No live test runs SELECT VALUE AVG(...) across a multi-partition feed range

The code path that's untested:
container.query_items("SELECT VALUE AVG(c.age) FROM c", feed_range=<spans 2+ partitions>)
→ __QueryFeed → multi-overlap loop → _merge_query_results()
→ _classify_aggregate_partial() detects AVG
→ raises ValueError("SELECT VALUE AVG ... cannot be merged client-side")

This only triggers when (a) the query is SELECT VALUE AVG(...) AND (b) the scope spans multiple physical partitions requiring client-side merge. Single-partition or server-handled AVG still succeeds.

Why it matters: Without this test, a regression could silently return wrong averages again (the original bug this PR fixes) or fail to surface the documented error. The most dangerous failure mode — returning incorrect data without any error — would be completely invisible.

What a proper test would look like:

Create a feed_range crossing ≥2 physical partitions
Run SELECT VALUE AVG(c.age) FROM c
Assert ValueError is raised with the expected message
Control test: same AVG query on a single partition should still succeed

simorenoh

Just one comment on missing tests that would be good to have, LGTM otherwise

dibahlfi added 2 commits May 23, 2026 19:51

Backport feed-range continuation and split-resume fixes for 4.14.7

f8fbc5d

(cherry picked from commit cc412f0)

feed range query fix

bde4d56

Copilot AI review requested due to automatic review settings May 24, 2026 21:55

dibahlfi requested a review from a team as a code owner May 24, 2026 21:55

Copilot started reviewing on behalf of dibahlfi May 24, 2026 21:55 View session

github-actions Bot added the Cosmos label May 24, 2026

github-project-automation Bot added this to CosmosDB Python Eco-System May 24, 2026

Copilot AI reviewed May 24, 2026

View reviewed changes