Description
Determine this is the right repository
- I determined this is the correct repository in which to report this bug.
Happy to provide a PR if it's likely to be merged.
Summary of the issue
Context:
Using BulkWriter to bulk-write ~1M documents to Firestore from a Cloud Run job. Firestore returns retryable errors (contention) for multiple operations within the same batch window.
Expected Behaviour:
BulkWriter retries the failed operations and completes the flush successfully.
Actual Behaviour:
BulkWriter._schedule_ready_retries() crashes with IndexError: pop from an empty deque when ≥2 retries are ready simultaneously. The method interleaves popleft() with retry.retry(self), and retry.retry() re-enters the send loop (set() → _enqueue_current_batch → _ensure_sending → _send_until_queue_is_empty → _schedule_ready_retries), which drains the remaining retries from the deque. The outer loop then calls popleft() on the now-empty deque.
Related: #15385 (open refactor ticket for BulkWriter retries)
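For intuition, the interleaved popleft()/callback pattern can be reproduced on a plain deque with no Firestore dependency. Scheduler below is a hypothetical stand-in for BulkWriter, not library code:

```python
from collections import deque

# Hypothetical stand-in for BulkWriter's retry scheduling: the count of
# ready items is fixed up front, then popleft() is interleaved with a
# callback that may re-enter schedule_ready(), the same shape as
# _schedule_ready_retries interleaving popleft() with retry.retry(self).
class Scheduler:
    def __init__(self):
        self.retries = deque()

    def schedule_ready(self):
        take_until_index = len(self.retries)  # computed once, up front
        for _ in range(take_until_index):
            cb = self.retries.popleft()  # crashes if cb() drained the deque
            cb()                         # callback may re-enter schedule_ready()

s = Scheduler()
# The first retry re-enters the scheduler, mirroring retry.retry(self)
# -> set() -> ... -> _schedule_ready_retries; the inner call drains the
# second retry, so the outer loop's second popleft() hits an empty deque.
s.retries.append(lambda: s.schedule_ready())
s.retries.append(lambda: None)
try:
    s.schedule_ready()
except IndexError as exc:
    print(exc)  # pop from an empty deque
```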
API client name and version
google-cloud-firestore 2.25.0
Reproduction steps: code
file: test_bulkwriter_bug.py
import bisect
import concurrent.futures
import datetime
from unittest.mock import patch
import pytest
from google.cloud.firestore import Client
from google.cloud.firestore_v1.bulk_writer import (
    BulkWriterSetOperation,
    OperationRetry,
)
def test_bulkwriter_schedule_ready_retries_reentrancy_bug():
    """BulkWriter crashes when ≥2 retries are ready simultaneously.

    _schedule_ready_retries computes take_until_index=N, then interleaves
    popleft() with retry.retry(self). retry.retry() re-enters the send loop
    via set() → _enqueue_current_batch → _ensure_sending →
    _send_until_queue_is_empty → _schedule_ready_retries, which drains the
    remaining retries. The outer loop then calls popleft() on the empty deque.

    batch_size=1 forces _enqueue_current_batch on every set(), making the
    re-entrant path deterministic with just 2 retries.
    """
    client = Client(project="any-project")
    writer = client.bulk_writer()
    writer.batch_size = 1
    collection = client.collection("test")
    for i in range(2):
        op = BulkWriterSetOperation(
            reference=collection.document(f"doc-{i}"),
            document_data={"i": i},
            merge=False,
            attempts=1,
        )
        bisect.insort(
            writer._retries,
            OperationRetry(
                operation=op,
                run_at=datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc),
            ),
        )
    f = concurrent.futures.Future()
    f.set_result(None)
    with patch.object(writer, "_send_batch", return_value=f):
        with pytest.raises(IndexError, match="pop from an empty deque"):
            writer.flush()
Reproduction steps: supporting files
None required. The test is self-contained — no Firestore emulator, network access, or credentials needed.
Reproduction steps: actual results
IndexError: pop from an empty deque
Full traceback (from _schedule_ready_retries → re-entrant _send_until_queue_is_empty → _schedule_ready_retries):
  File "google/cloud/firestore_v1/bulk_writer.py", line 511, in _schedule_ready_retries
    retry: OperationRetry = self._retries.popleft()
IndexError: pop from an empty deque
Reproduction steps: expected results
flush() completes without error. All retryable operations are re-queued and processed.
OS & version + platform
First observed on Cloud Run (GCP, australia-southeast1). Reproduced locally on macOS (Apple Silicon).
Python environment
Python 3.13.7
Python dependencies
google-cloud-firestore 2.25.0
(Only google-cloud-firestore is relevant; the bug is in synchronous retry scheduling logic with no external dependencies.)
Additional context
Root cause: _schedule_ready_retries (line 499 of bulk_writer.py) computes the number of ready retries via bisect, then loops that many times, interleaving popleft() with retry.retry(self). retry.retry(self) calls bulk_writer.set(), which, if the batch is full, re-entrantly triggers _enqueue_current_batch → _ensure_sending → _send_until_queue_is_empty → _schedule_ready_retries. The inner call drains the remaining retries; the outer loop then calls popleft() on the empty deque.
Suggested fix: Drain all ready retries into a local list before processing:
    def _schedule_ready_retries(self) -> None:
        take_until_index = bisect.bisect(
            self._retries, datetime.datetime.now(tz=datetime.timezone.utc)
        )
        ready = [self._retries.popleft() for _ in range(take_until_index)]
        for retry in ready:
            retry.retry(self)
No memory concern — ready holds the same OperationRetry references already in the deque (no data copying). The maximum number of simultaneous retries is bounded by the batch size (default 20).
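As a sanity check on the drain-first pattern, the same kind of toy model (hypothetical, not library code) survives re-entrancy: the inner call finds an already-empty deque and is a no-op.

```python
from collections import deque

# Hypothetical scheduler using the proposed drain-first pattern:
# all ready callbacks are popped into a local list before any are run.
class Scheduler:
    def __init__(self):
        self.retries = deque()

    def schedule_ready(self):
        take_until_index = len(self.retries)
        ready = [self.retries.popleft() for _ in range(take_until_index)]
        for cb in ready:
            cb()  # a re-entrant call now sees an empty deque and does nothing

s = Scheduler()
calls = []
s.retries.append(lambda: (calls.append("a"), s.schedule_ready()))
s.retries.append(lambda: calls.append("b"))
s.schedule_ready()
print(calls)  # ['a', 'b']: both retries ran exactly once, no IndexError
```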
Related: #15385 — open refactor ticket for BulkWriter retry internals. This bug is a concrete crash caused by the hand-rolled retry logic flagged in that issue.