Skip to content

Fix explicit wait_for_schema_agreement() failing #837

Open
nyh wants to merge 2 commits intoscylladb:masterfrom
nyh:fix-604
Open

Fix explicit wait_for_schema_agreement() failing #837
nyh wants to merge 2 commits intoscylladb:masterfrom
nyh:fix-604

Conversation

@nyh
Copy link
Copy Markdown

@nyh nyh commented May 3, 2026

The Python driver's wait_for_schema_agreement() function has an optional connection parameter, and whether or not this parameter is given has two different purposes:

  1. connection is given when the user's DDL request generated a "RESULT_KIND_SCHEMA_CHANGE" response, and the driver is required to use the same node.
  2. connection is not given, the driver may connection to any node, and will typically use its "control connection".

The second use case sometimes failed if one of the node recently went down. The problem is that it was possible that the control connection was to that node that went down, and the driver did not yet set up a new control connection. The wait_for_schema_agreement() had an error on that one specific control connection, and instead of retrying on a new control connection, it gave up immediately - which was wrong. The first patch in this series fixes this wait_for_schema_agreement(connection=None) case, and the second patch adds reproducing tests using in-process MockConnection/MockCluster objects.

This fix will improve, for example, ScyllaDB's dtest. dtest's create_ks_query() runs session.cluster.control_connection.wait_for_schema_agreement(wait_time=120). It is arguable why it needs to do this (since the DDL that preceeded it will have already waited for schema agreement!), but it does. And that used to fail - rarely but not rarely enough - when one of the nodes went down just before trying to call this function.

Fixes #604.

nyh added 2 commits May 3, 2026 18:36
…nection

When wait_for_schema_agreement() is called without an explicit connection
(e.g., via the public API or from dtest's create_ks()), it uses the
control connection. If that connection is closed concurrently — for
example because a node was stopped — a ConnectionShutdown exception is
raised and propagated to the caller, causing an unrecoverable failure.

The root cause is that the connection was captured once at the start of
the function and never refreshed. The control connection can reconnect
to another node at any time, but the function was unaware of this.

Fix this by distinguishing between two call modes:

1. Explicit connection (passed after DDL RESULT_KIND_SCHEMA_CHANGE):
   The caller must verify schema agreement through the specific
   coordinator node. A ConnectionShutdown here is a real error and is
   still re-raised.

2. Control-connection mode (connection=None): There is no hard
   requirement on which node is queried. When a ConnectionShutdown
   occurs, wait briefly for the control connection to reconnect, pick
   up the new self._connection, and retry within the existing timeout
   budget. This matches the behavior for OperationTimedOut, which
   already retried in a loop.

This avoids the need for callers to work around the bug by discarding
and recreating their session after a node stop, as seen for example
in several scylla-dtest workarounds (e.g., SCYLLADB-1256).

Fixes: scylladb#604
…ction

Add two unit tests to ControlConnectionTest that cover the bug fixed in
the previous commit (scylladb#604):

test_wait_for_schema_agreement_recovers_from_closed_control_connection:
  Simulates a node being stopped while wait_for_schema_agreement() is
  running in control-connection mode (no explicit connection passed).
  The mock initial connection raises ConnectionShutdown on its first
  query and simultaneously installs a replacement connection, mirroring
  what the driver's reconnect logic does in production.
  Before the fix this exception propagated to the caller.
  After the fix the function transparently picks up the new control
  connection and returns True.

test_wait_for_schema_agreement_explicit_connection_raises_on_shutdown:
  Guards the DDL code path, where the caller passes an explicit
  connection to ensure schema agreement is verified through the DDL
  coordinator. In this case ConnectionShutdown must still be re-raised
  — there is no safe fallback to a different node.
  This test passes both before and after the fix, confirming that the
  fix does not weaken the explicit-connection path.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes ControlConnection.wait_for_schema_agreement(connection=None) to recover when the control connection is closed during the schema agreement polling loop (e.g., a node goes down) by retrying using an updated/reconnected control connection, and adds unit test coverage for the regression.

Changes:

  • Retry schema agreement checks on ConnectionShutdown when no explicit connection was provided, switching to a refreshed control connection when available.
  • Preserve existing behavior when an explicit connection is passed (propagate ConnectionShutdown).
  • Add unit tests covering both the recovery path (implicit control connection) and the explicit-connection failure path.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
cassandra/cluster.py Adds ConnectionShutdown recovery logic for the implicit-control-connection schema agreement path.
tests/unit/test_control_connection.py Adds regression tests for control connection shutdown recovery and explicit connection behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/unit/test_control_connection.py
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/unit/test_control_connection.py:26

  • There are now two separate imports from cassandra.connection (ConnectionShutdown and EndPoint/DefaultEndPoint/DefaultEndPointFactory). Consider combining them into a single import statement to avoid duplication and keep imports organized (helps linters like isort/flake8).
from cassandra import OperationTimedOut, SchemaTargetType, SchemaChangeType
from cassandra.connection import ConnectionShutdown
from cassandra.protocol import ResultMessage, RESULT_KIND_ROWS
from cassandra.cluster import ControlConnection, _Scheduler, ProfileManager, EXEC_PROFILE_DEFAULT, ExecutionProfile
from cassandra.pool import Host
from cassandra.connection import EndPoint, DefaultEndPoint, DefaultEndPointFactory
from cassandra.policies import (SimpleConvictionPolicy, RoundRobinPolicy,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@nyh
Copy link
Copy Markdown
Author

nyh commented May 4, 2026

@avikivity @mykaul this is the kind of issues that caused numerous different Scylla tests over the years to sporadically fail (one of the recent ones, https://scylladb.atlassian.net/browse/SCYLLADB-1256), and it's important to finally fix.

@sylwiaszunejko sylwiaszunejko requested a review from Lorak-mmk May 4, 2026 10:30
Copy link
Copy Markdown

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@Lorak-mmk
Copy link
Copy Markdown

I'll wait with merging so that @dkropachev can take a look too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wait_for_schema_agreement fails when connection is closed

3 participants