[Cosmos] Partition key range cache: resume from last failed page and move fetch to background #47026

@tvaron3

Summary

The partition key range cache in azure-cosmos has two improvement opportunities:

1. Resume from last failed page instead of restarting

In _fetch_routing_map (sync: routing_map_provider.py, async: aio/routing_map_provider.py), when _ReadPartitionKeyRanges raises a CosmosHttpResponseError, the exception propagates up and the caller (get_routing_map) does not persist any partial progress. On the next attempt (force-refresh or cache miss), the fetch restarts from scratch with no continuation/etag, re-fetching pages that previously succeeded.

Current behavior:

try:
    pk_range_generator = self._document_client._ReadPartitionKeyRanges(...)
    ranges.extend(list(pk_range_generator))
except CosmosHttpResponseError as e:
    logger.error(...)
    raise  # all progress lost — next attempt starts from page 1

Proposed behavior:

  • On failure, preserve the etag/continuation from the last successful page in the cache entry.
  • On the next cache population attempt, use that continuation to resume rather than starting over.
  • This is especially relevant for containers with a large number of physical partitions where the response spans multiple pages.
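The resume idea above can be sketched independently of the SDK. This is a minimal illustration, not the azure-cosmos implementation: `fetch_page`, `PartialFetchState`, and `fetch_ranges_resumable` are hypothetical names, and the page fetcher is assumed to accept a continuation token and return the items plus the next token (`None` after the last page).

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page fetcher (an assumption, not an SDK call): takes a
# continuation token (None for the first page) and returns
# (items, next_continuation); next_continuation is None after the last page.
FetchPage = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


class PartialFetchState:
    """Progress kept in the cache entry between attempts (illustrative only)."""

    def __init__(self) -> None:
        self.ranges: List[dict] = []
        self.continuation: Optional[str] = None
        self.complete = False


def fetch_ranges_resumable(fetch_page: FetchPage, state: PartialFetchState) -> List[dict]:
    """Fetch remaining pages, resuming from state.continuation after a failure."""
    while not state.complete:
        # If fetch_page raises, state is untouched for this page, so it still
        # holds every page that succeeded and the token needed to retry.
        items, next_token = fetch_page(state.continuation)
        state.ranges.extend(items)
        state.continuation = next_token
        state.complete = next_token is None
    return state.ranges
```

A caller would keep one `PartialFetchState` per cache entry and pass it again on the next attempt. The sketch does not address whether a saved token is still valid if the partition layout changes between attempts; a real implementation would need a rule for discarding stale progress.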

2. Move partition key range fetch to background

Currently, the partition key range (pkrange) fetch runs synchronously on the critical path of the first data-plane request that needs routing information. For large containers or high-latency accounts, that request is blocked until the full set of ranges has been fetched.

Proposed behavior:

  • Perform the pkrange cache population in a background task (thread or asyncio task).
  • If a request arrives while the background fetch is in progress, either:
    • Wait on the in-flight fetch (current single-flight behavior is acceptable here), or
    • Return a partial/stale routing map if available while the refresh completes in the background.
  • This reduces first-request latency for large containers.
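One possible shape for the sync client is sketched below, using only the standard library. All names here (`BackgroundRangeCache`, `load`, `refresh`, `get`, `allow_stale`) are hypothetical and not part of azure-cosmos; `load` stands in for whatever blocking call fetches the full set of ranges. The sketch covers both options from the list: concurrent callers share one in-flight fetch, and a previously cached map can be returned while a refresh runs.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class BackgroundRangeCache:
    """Illustrative cache: background population with single-flight refresh."""

    def __init__(self, load: Callable[[str], List[dict]]) -> None:
        self._load = load  # hypothetical blocking loader, one call per refresh
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._maps: Dict[str, List[dict]] = {}
        self._inflight: Dict[str, Future] = {}

    def refresh(self, collection_id: str) -> Future:
        """Start a background fetch, or join the one already in flight."""
        with self._lock:
            fut = self._inflight.get(collection_id)
            if fut is None:
                fut = self._executor.submit(self._run, collection_id)
                self._inflight[collection_id] = fut
            return fut

    def _run(self, collection_id: str) -> List[dict]:
        try:
            result = self._load(collection_id)  # runs without holding the lock
            with self._lock:
                self._maps[collection_id] = result
            return result
        finally:
            # Clear the in-flight slot on success or failure so a later
            # refresh can start a new fetch.
            with self._lock:
                self._inflight.pop(collection_id, None)

    def get(self, collection_id: str, allow_stale: bool = True) -> List[dict]:
        """Return a cached map if allowed; otherwise wait on a fetch."""
        with self._lock:
            cached = self._maps.get(collection_id)
        if cached is not None and allow_stale:
            return cached
        return self.refresh(collection_id).result()
```

Under these assumptions, a client could call `refresh()` when a container is first referenced so that the fetch is already under way by the time a data-plane request calls `get()`. The async client would presumably use an asyncio task in place of the executor; that variant is not shown here.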

Affected files

  • sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py
  • sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
  • sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py

Sibling issue
