
[DRAFT] Fix pool renewal race #835

Draft
andrzej-jackowski-scylladb wants to merge 2 commits into scylladb:master from andrzej-jackowski-scylladb:fix-pool-renewal-race

Conversation

@andrzej-jackowski-scylladb

@dkropachev, @sylwiaszunejko, #317 continuously causes ScyllaDB CI failures. This is a draft PR that attempts to solve the problem, but I don't have a deep understanding of all the python-driver corner cases. Do you think this approach is worth pursuing?

=====

When pool creation races for the same host, a slower attempt can
overwrite a pool that another thread already published and close
connections with in-flight requests.

Capture the previous pool before connection setup, then compare
that state under the session lock before publishing the new pool.
If another thread changed the pool, discard the stale pool instead
of replacing the current one.

Keep pool removals behind the same lock so the check observes all
writers.

Fixes: #317
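
A minimal sketch of the publish step (names like _pools and _lock mirror the driver's Session, but this is only an illustration of the idea, not the exact diff):

    # Hypothetical helper illustrating the compare-before-publish idea.
    # previous_pool is the state captured before connection setup began.
    def _publish_pool(session, host, new_pool, previous_pool):
        with session._lock:
            current = session._pools.get(host)
            if current is not previous_pool:
                # Another thread published (or removed) a pool while we were
                # connecting: keep theirs and discard ours.
                new_pool.shutdown()
                return False
            session._pools[host] = new_pool
        # Shut the replaced pool down outside the lock, if there was one.
        if previous_pool is not None:
            previous_pool.shutdown()
        return True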

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

Add a deterministic unit test for the case where another thread
publishes a pool while a slower add attempt is still constructing
its pool.

This guards against closing in-flight connections by replacing the
pool that should remain current.

Refs: scylladb#317
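
A self-contained sketch of what such a deterministic test could look like, using a fake pool and threading events to pause the slow attempt; all names here are illustrative, not the actual test:

    import threading
    import unittest

    class FakePool:
        def __init__(self):
            self.shutdown_called = False

        def shutdown(self):
            self.shutdown_called = True

    class RenewalRaceTest(unittest.TestCase):
        def test_slow_attempt_does_not_replace_newer_pool(self):
            lock = threading.Lock()
            pools = {}
            paused = threading.Event()
            resume = threading.Event()

            def slow_add(host):
                previous = pools.get(host)     # state captured before "connecting"
                paused.set()
                resume.wait()                  # simulate slow connection setup
                new_pool = FakePool()
                with lock:
                    if pools.get(host) is not previous:
                        new_pool.shutdown()    # someone else published first
                        return
                    pools[host] = new_pool

            slow = threading.Thread(target=slow_add, args=("host1",))
            slow.start()
            paused.wait()
            winner = FakePool()
            with lock:
                pools["host1"] = winner        # the other thread publishes its pool
            resume.set()
            slow.join()

            self.assertIs(pools["host1"], winner)
            self.assertFalse(winner.shutdown_called)
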
@andrzej-jackowski-scylladb
Author

@dkropachev, please reassign the review if there are more appropriate developers on the driver team.

@Lorak-mmk Lorak-mmk self-requested a review April 29, 2026 15:41
@Lorak-mmk

I'll look into it tomorrow. I did think about how to solve the issue, and couldn't figure out anything sensible, so I'm looking forward to reading your solution.

@Lorak-mmk

Eh. Will it solve this specific issue? Maybe, I'm not sure.
The problem that we should IMO solve is why there are even multiple calls to add_or_renew_pool. With your approach the useless connections will still be opened.
And I don't even want to think about the implications of those concurrent calls invoking self.cluster.signal_connection_failure or self.cluster.on_down...

It is called in a few places:

  • Cluster.on_up
  • Cluster.on_add
  • Session.__init__
  • Session.update_created_pools

The problem we are encountering is with cluster restart - adding nodes and session init are most likely not important here.
Cluster.on_up generally has a deduplication mechanism - self._currently_handling_node_up. So perhaps it is not working for some reason?
This flag is unset at the end of on_up, unless have_future is True. It is true if add_or_renew_pool returned a non-None future for any Session.
For any such future, _on_up_future_completed is set as a callback, and this function will unset the flag.
_on_up_future_completed has a check at the beginning to see if there are more futures left, and will only execute the rest of the function for the last finished future. This check looks solid to me.
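
For context, the deduplication guard at the top of on_up looks roughly like this (a paraphrase, not an exact quote from the driver):

    def on_up(self, host):
        if self.is_shutdown:
            return

        with host.lock:
            if host._currently_handling_node_up:
                # Another thread is already handling this host coming up.
                return
            host._currently_handling_node_up = True

        # ... set the host up and call add_or_renew_pool on each session;
        # the flag is only cleared later, either in _on_up_future_completed
        # or at the end of on_up when no futures were returned.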

What looks weird to me is the ending of that function:

        finally:
            with host.lock:
                host._currently_handling_node_up = False

        # see if there are any pools to add or remove now that the host is marked up
        for session in tuple(self.sessions):
            session.update_created_pools()

It first unsets the flag (allowing other on_up calls to do work) and then calls session.update_created_pools().
So I thought that update_created_pools() updates some pools while, at the same time, a concurrent on_up calls add_or_renew_pool.

However, this now seems unlikely to me. update_created_pools will only update a pool if it is missing or broken, and that should not be the case after we have finished reconnecting following a node restart.

            if not pool or pool.is_shutdown:
                # we don't eagerly set is_up on previously ignored hosts. None is included here
                # to allow us to attempt connections to hosts that have gone from ignored to something
                # else.
                if distance != HostDistance.IGNORED and host.is_up in (True, None):
                    future = self.add_or_renew_pool(host, False)

What is more likely here is actually an on_down problem.
Cluster.on_down handling is split into two parts, a synchronous one and an asynchronous one.

    @run_in_executor
    def on_down_potentially_blocking(self, host, is_host_addition):
        self.profile_manager.on_down(host)
        self.control_connection.on_down(host)
        for session in tuple(self.sessions):
            session.on_down(host)

        for listener in self.listeners:
            listener.on_down(host)

        self._start_reconnector(host, is_host_addition)

    def on_down(self, host, is_host_addition, expect_host_to_be_down=False):
        """
        Intended for internal use only.
        """
        if self.is_shutdown:
            return

        with host.lock:
            was_up = host.is_up

            # ignore down signals if we have open pools to the host
            # this is to avoid closing pools when a control connection host became isolated
            if self._discount_down_events and self.profile_manager.distance(host) != HostDistance.IGNORED:
                connected = False
                for session in tuple(self.sessions):
                    pool_states = session.get_pool_state()
                    pool_state = pool_states.get(host)
                    if pool_state:
                        connected |= pool_state['open_count'] > 0
                if connected:
                    return

            host.set_down()
            if (not was_up and not expect_host_to_be_down) or host.is_currently_reconnecting():
                return
        log.warning("Host %s has been marked down", host)

        self.on_down_potentially_blocking(host, is_host_addition)

on_down performs some checks under the lock, and then SCHEDULES on_down_potentially_blocking to be executed on a thread pool. It can actually take some time for it to be called, especially when we get a lot of events, such as during restarts.
on_down_potentially_blocking will destroy the pools.
One possible scenario here is a race between on_up and a stale on_down_potentially_blocking.

  1. Node goes down
  2. Driver calls on_down, which schedules on_down_potentially_blocking
  3. Node gets up
  4. Driver calls on_up
  5. Pool is created
  6. on_down_potentially_blocking destroys the pool, which is then scheduled for recreation.

I thought we could just copy the checks from on_down to on_down_potentially_blocking, or make it one function that runs on the executor. I did find some problems with that, but I don't really remember what they were.
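
Something like the following standalone sketch, where the blocking half re-checks the host state under the lock before doing anything destructive, so a stale scheduled call becomes a no-op once the node is back up (names and structure are illustrative only):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class Host:
        def __init__(self):
            self.lock = threading.Lock()
            self.is_up = False

    def on_down_potentially_blocking(host, destroy_pools):
        with host.lock:
            if host.is_up:
                # The node came back up (and on_up ran) after this call was
                # scheduled; destroying the fresh pools now would reintroduce
                # the race, so do nothing.
                return
        destroy_pools()

    executor = ThreadPoolExecutor(max_workers=1)
    host = Host()
    host.is_up = True   # on_up won the race before the stale call ran
    executor.submit(on_down_potentially_blocking, host,
                    lambda: print("pools destroyed")).result()
    executor.shutdown()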


@nyh nyh left a comment


I like this "optimistic locking" solution, where you allow creating the unneeded pool but end up not using it. Since this race happens very rarely, the small performance hit is not a concern.

I let Claude write a fix for this bug and it came up with something similar, but using a more elaborate mechanism: if we're already in the process of creating a pool, we return the existing future of that creation instead of creating a new one and later throwing it away. But the implementation is longer and, in my opinion, not worth it.
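
For comparison, that more elaborate mechanism might look roughly like this (a hypothetical sketch, not the code from that fix):

    import threading
    from concurrent.futures import Future

    class PoolCreationDeduper:
        """Hand back the in-flight Future for a host instead of starting
        a second, redundant pool creation."""

        def __init__(self):
            self._lock = threading.Lock()
            self._in_flight = {}   # host -> Future

        def get_or_start(self, host, create_pool):
            with self._lock:
                existing = self._in_flight.get(host)
                if existing is not None:
                    return existing          # a creation is already running
                future = Future()
                self._in_flight[host] = future
            try:
                future.set_result(create_pool(host))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    self._in_flight.pop(host, None)
            return future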

@nyh

nyh commented May 3, 2026

Why is this fix in draft? What's missing?

I think it's very important for our CI stability; we've had a lot of tests fail on this problem over the years and had to add workarounds for each.

@nyh

nyh commented May 3, 2026

Eh. Will it solve this specific issue? Maybe, I'm not sure. The problem that we should imo solve is why are there even multiple calls to add_or_renew_pool. With your approach the useless connections will still be opened.

@Lorak-mmk here is an alternative fix that avoids creating the extra useless connections in the first place, and also explains why we have those multiple calls to add_or_renew_pool at all: #838

That alternative fix was 100% written by AI, including the fix, the tests, and the commit messages that nicely explain everything, including how and why this race even happens.

@andrzej-jackowski-scylladb
Author

andrzej-jackowski-scylladb commented May 3, 2026

Why is this fix in draft? What's missing?

@nyh, I created this PR as a draft to emphasize that a general discussion is needed to decide whether the approach is correct.

Probably your #838 will win over this PR, but let the Driver Team members decide.


Development

Successfully merging this pull request may close these issues.

Connection pool renewal after concurrent node bootstraps causes double statement execution
