Summary
The partition key range cache in azure-cosmos has two improvement opportunities:
1. Resume from last failed page instead of restarting
In _fetch_routing_map (sync: routing_map_provider.py, async: aio/routing_map_provider.py), when _ReadPartitionKeyRanges raises a CosmosHttpResponseError, the exception propagates up and the caller (get_routing_map) does not persist any partial progress. On the next attempt (force-refresh or cache miss), the fetch restarts from scratch with no continuation/etag, re-fetching pages that previously succeeded.
Current behavior:
    try:
        pk_range_generator = self._document_client._ReadPartitionKeyRanges(...)
        ranges.extend(list(pk_range_generator))
    except CosmosHttpResponseError as e:
        logger.error(...)
        raise  # all progress lost; next attempt starts from page 1
Proposed behavior:
- On failure, preserve the etag/continuation from the last successful page in the cache entry.
- On the next cache population attempt, use that continuation to resume rather than starting over.
- This is especially relevant for containers with a large number of physical partitions where the response spans multiple pages.
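The proposal above could look roughly like the following. This is a minimal sketch, not the SDK's implementation: `ResumablePkRangeFetch` and the `fetch_page` callable are hypothetical names, and the sketch assumes a page fetcher that accepts a continuation token and returns the page together with the next token. It does not handle the case where a stored token is no longer accepted by the service.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page fetcher: takes a continuation token (None for the first
# page) and returns (ranges_on_page, next_token), with next_token None when
# there are no more pages.
PageFetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


class ResumablePkRangeFetch:
    """Hypothetical cache-entry helper that remembers partial progress."""

    def __init__(self) -> None:
        self.ranges: List[dict] = []             # ranges from pages that succeeded
        self.continuation: Optional[str] = None  # token for the next page to fetch
        self.complete = False

    def run(self, fetch_page: PageFetcher) -> List[dict]:
        """Fetch remaining pages; safe to call again after a failure."""
        while not self.complete:
            # If fetch_page raises, ranges and continuation keep the values
            # from the last successful page, so the next call resumes there.
            page, next_token = fetch_page(self.continuation)
            self.ranges.extend(page)
            self.continuation = next_token
            self.complete = next_token is None
        return self.ranges
```

Keeping the state on an object stored alongside the cache entry means a retry only needs to call `run` again; pages that already succeeded are not requested a second time.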
2. Move partition key range fetch to background
Currently, the pkrange fetch is performed synchronously on the critical path of the first data-plane request that needs routing information. For large containers or high-latency accounts, this blocks the user's request while the full set of ranges is fetched.
Proposed behavior:
- Perform the pkrange cache population in a background task (thread or asyncio task).
- If a request arrives while the background fetch is in progress, either:
  - Wait on the in-flight fetch (current single-flight behavior is acceptable here), or
  - Return a partial/stale routing map if available while the refresh completes in the background.
- This reduces first-request latency for large containers.
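One possible shape for the thread-based variant is sketched below. `BackgroundRoutingMapCache`, its methods, and the `fetch` callable are hypothetical and stand in for the real provider; the sketch only illustrates single-flight refresh combined with returning a stale map when one exists.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class BackgroundRoutingMapCache:
    """Hypothetical cache: one background fetch per collection, shared by callers."""

    def __init__(self, fetch: Callable[[str], List[dict]], max_workers: int = 2) -> None:
        self._fetch = fetch  # hypothetical: returns all ranges for a collection
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._maps: Dict[str, List[dict]] = {}
        self._in_flight: Dict[str, Future] = {}

    def refresh(self, collection_id: str) -> Future:
        """Start a background fetch, or join the one already in progress."""
        with self._lock:
            future = self._in_flight.get(collection_id)
            if future is None:
                future = self._executor.submit(self._populate, collection_id)
                self._in_flight[collection_id] = future
            return future

    def _populate(self, collection_id: str) -> List[dict]:
        try:
            ranges = self._fetch(collection_id)
            with self._lock:
                self._maps[collection_id] = ranges
            return ranges
        finally:
            # Clear the in-flight marker on success and on failure.
            with self._lock:
                self._in_flight.pop(collection_id, None)

    def get(self, collection_id: str, force_refresh: bool = False,
            timeout: Optional[float] = None) -> List[dict]:
        with self._lock:
            cached = self._maps.get(collection_id)
        if cached is not None and not force_refresh:
            return cached
        future = self.refresh(collection_id)
        if cached is not None:
            return cached  # stale map; the refresh continues in the background
        return future.result(timeout=timeout)  # nothing cached yet, so wait
```

Calling `refresh` early, for example when a container client is created, would move the fetch off the first data-plane request. An async variant could follow the same structure with an `asyncio.Task` in place of the `Future`.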
Affected files
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py
Sibling issue