Fix explicit wait_for_schema_agreement() failing #837
Fix explicit wait_for_schema_agreement() failing #837nyh wants to merge 2 commits intoscylladb:masterfrom
Conversation
…nection When wait_for_schema_agreement() is called without an explicit connection (e.g., via the public API or from dtest's create_ks()), it uses the control connection. If that connection is closed concurrently — for example because a node was stopped — a ConnectionShutdown exception is raised and propagated to the caller, causing an unrecoverable failure. The root cause is that the connection was captured once at the start of the function and never refreshed. The control connection can reconnect to another node at any time, but the function was unaware of this. Fix this by distinguishing between two call modes: 1. Explicit connection (passed after DDL RESULT_KIND_SCHEMA_CHANGE): The caller must verify schema agreement through the specific coordinator node. A ConnectionShutdown here is a real error and is still re-raised. 2. Control-connection mode (connection=None): There is no hard requirement on which node is queried. When a ConnectionShutdown occurs, wait briefly for the control connection to reconnect, pick up the new self._connection, and retry within the existing timeout budget. This matches the behavior for OperationTimedOut, which already retried in a loop. This avoids the need for callers to work around the bug by discarding and recreating their session after a node stop, as seen for example in several scylla-dtest workarounds (e.g., SCYLLADB-1256). Fixes: scylladb#604
…ction Add two unit tests to ControlConnectionTest that cover the bug fixed in the previous commit (scylladb#604): test_wait_for_schema_agreement_recovers_from_closed_control_connection: Simulates a node being stopped while wait_for_schema_agreement() is running in control-connection mode (no explicit connection passed). The mock initial connection raises ConnectionShutdown on its first query and simultaneously installs a replacement connection, mirroring what the driver's reconnect logic does in production. Before the fix this exception propagated to the caller. After the fix the function transparently picks up the new control connection and returns True. test_wait_for_schema_agreement_explicit_connection_raises_on_shutdown: Guards the DDL code path, where the caller passes an explicit connection to ensure schema agreement is verified through the DDL coordinator. In this case ConnectionShutdown must still be re-raised — there is no safe fallback to a different node. This test passes both before and after the fix, confirming that the fix does not weaken the explicit-connection path. Signed-off-by: Nadav Har'El <nyh@scylladb.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes ControlConnection.wait_for_schema_agreement(connection=None) to recover when the control connection is closed during the schema agreement polling loop (e.g., a node goes down) by retrying using an updated/reconnected control connection, and adds unit test coverage for the regression.
Changes:
- Retry schema agreement checks on
ConnectionShutdownwhen no explicit connection was provided, switching to a refreshed control connection when available. - Preserve existing behavior when an explicit
connectionis passed (propagateConnectionShutdown). - Add unit tests covering both the recovery path (implicit control connection) and the explicit-connection failure path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
cassandra/cluster.py |
Adds ConnectionShutdown recovery logic for the implicit-control-connection schema agreement path. |
tests/unit/test_control_connection.py |
Adds regression tests for control connection shutdown recovery and explicit connection behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
tests/unit/test_control_connection.py:26
- There are now two separate imports from
cassandra.connection(ConnectionShutdownandEndPoint/DefaultEndPoint/DefaultEndPointFactory). Consider combining them into a single import statement to avoid duplication and keep imports organized (helps linters like isort/flake8).
from cassandra import OperationTimedOut, SchemaTargetType, SchemaChangeType
from cassandra.connection import ConnectionShutdown
from cassandra.protocol import ResultMessage, RESULT_KIND_ROWS
from cassandra.cluster import ControlConnection, _Scheduler, ProfileManager, EXEC_PROFILE_DEFAULT, ExecutionProfile
from cassandra.pool import Host
from cassandra.connection import EndPoint, DefaultEndPoint, DefaultEndPointFactory
from cassandra.policies import (SimpleConvictionPolicy, RoundRobinPolicy,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@avikivity @mykaul this is the kind of issues that caused numerous different Scylla tests over the years to sporadically fail (one of the recent ones, https://scylladb.atlassian.net/browse/SCYLLADB-1256), and it's important to finally fix. |
|
I'll wait with merging so that @dkropachev can take a look too. |
The Python driver's
wait_for_schema_agreement()function has an optionalconnectionparameter, and whether or not this parameter is given has two different purposes:connectionis given when the user's DDL request generated a "RESULT_KIND_SCHEMA_CHANGE" response, and the driver is required to use the same node.connectionis not given, the driver may connection to any node, and will typically use its "control connection".The second use case sometimes failed if one of the node recently went down. The problem is that it was possible that the control connection was to that node that went down, and the driver did not yet set up a new control connection. The
wait_for_schema_agreement()had an error on that one specific control connection, and instead of retrying on a new control connection, it gave up immediately - which was wrong. The first patch in this series fixes thiswait_for_schema_agreement(connection=None)case, and the second patch adds reproducing tests using in-process MockConnection/MockCluster objects.This fix will improve, for example, ScyllaDB's dtest. dtest's
create_ks_query()runssession.cluster.control_connection.wait_for_schema_agreement(wait_time=120). It is arguable why it needs to do this (since the DDL that preceeded it will have already waited for schema agreement!), but it does. And that used to fail - rarely but not rarely enough - when one of the nodes went down just before trying to call this function.Fixes #604.