
[DRAFT] Fix pool renewal race #835

Draft
andrzej-jackowski-scylladb wants to merge 2 commits into scylladb:master from andrzej-jackowski-scylladb:fix-pool-renewal-race

Conversation

@andrzej-jackowski-scylladb

@dkropachev, @sylwiaszunejko, #317 continuously causes ScyllaDB CI failures. This is a draft PR that attempts to solve the problem, but I don't have a deep understanding of all the python-driver corner cases. Do you think this approach is worth pursuing?

=====

When pool creation races for the same host, a slower attempt can
overwrite a pool that another thread already published and close
connections with in-flight requests.

Capture the previous pool before connection setup, then compare
that state under the session lock before publishing the new pool.
If another thread changed the pool, discard the stale pool instead
of replacing the current one.

Keep pool removals behind the same lock so the check observes all
writers.

Fixes: #317
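
A minimal sketch of the publish step (names like _pools and _lock mirror the driver's Session, but this is only an illustration of the idea, not the exact diff):

    # Hypothetical helper illustrating the compare-before-publish idea.
    # previous_pool is the state captured before connection setup began.
    def _publish_pool(session, host, new_pool, previous_pool):
        with session._lock:
            current = session._pools.get(host)
            if current is not previous_pool:
                # Another thread published (or removed) a pool while we were
                # connecting: keep theirs and discard ours.
                new_pool.shutdown()
                return False
            session._pools[host] = new_pool
        # Shut the replaced pool down outside the lock, if there was one.
        if previous_pool is not None:
            previous_pool.shutdown()
        return True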

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

Add a deterministic unit test for the case where another thread
publishes a pool while a slower add attempt is still constructing
its pool.

This guards against closing in-flight connections by replacing the
pool that should remain current.

Refs: scylladb#317
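
A self-contained sketch of what such a deterministic test could look like, using a fake pool and threading events to pause the slow attempt; all names here are illustrative, not the actual test:

    import threading
    import unittest

    class FakePool:
        def __init__(self):
            self.shutdown_called = False

        def shutdown(self):
            self.shutdown_called = True

    class RenewalRaceTest(unittest.TestCase):
        def test_slow_attempt_does_not_replace_newer_pool(self):
            lock = threading.Lock()
            pools = {}
            paused = threading.Event()
            resume = threading.Event()

            def slow_add(host):
                previous = pools.get(host)     # state captured before "connecting"
                paused.set()
                resume.wait()                  # simulate slow connection setup
                new_pool = FakePool()
                with lock:
                    if pools.get(host) is not previous:
                        new_pool.shutdown()    # someone else published first
                        return
                    pools[host] = new_pool

            slow = threading.Thread(target=slow_add, args=("host1",))
            slow.start()
            paused.wait()
            winner = FakePool()
            with lock:
                pools["host1"] = winner        # the other thread publishes its pool
            resume.set()
            slow.join()

            self.assertIs(pools["host1"], winner)
            self.assertFalse(winner.shutdown_called)
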
@andrzej-jackowski-scylladb
Author

@dkropachev, please reassign the review if there are more appropriate developers on the driver team.

@Lorak-mmk Lorak-mmk self-requested a review April 29, 2026 15:41
@Lorak-mmk

I'll look into it tomorrow. I did think about how to solve the issue, and couldn't figure out anything sensible, so I'm looking forward to reading your solution.

@Lorak-mmk

Eh. Will it solve this specific issue? Maybe, I'm not sure.
The problem that we should IMO solve is why there are even multiple calls to add_or_renew_pool. With your approach the useless connections will still be opened.
And I don't even want to think about the implications of those concurrent calls invoking self.cluster.signal_connection_failure or self.cluster.on_down...

It is called in a few places:

  • Cluster.on_up
  • Cluster.on_add
  • Session.__init__
  • Session.update_created_pools

The problem we are encountering is with cluster restart - adding nodes and session init are most likely not important here.
Cluster.on_up generally has a deduplication mechanism - self._currently_handling_node_up. So perhaps it is not working for some reason?
This flag is unset at the end of on_up, unless have_future is True. It is true if add_or_renew_pool returned a non-None future for any Session.
For any such future, _on_up_future_completed is set as a callback, and this function will unset the flag.
_on_up_future_completed has a check at the beginning to see if there are more futures left, and will only execute the rest of the function for the last finished future. This check looks solid to me.
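
For context, the deduplication guard at the top of on_up looks roughly like this (a paraphrase, not an exact quote from the driver):

    def on_up(self, host):
        if self.is_shutdown:
            return

        with host.lock:
            if host._currently_handling_node_up:
                # Another thread is already handling this host coming up.
                return
            host._currently_handling_node_up = True

        # ... set the host up and call add_or_renew_pool on each session;
        # the flag is only cleared later, either in _on_up_future_completed
        # or at the end of on_up when no futures were returned.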

What looks weird to me is the ending of that function:

        finally:
            with host.lock:
                host._currently_handling_node_up = False

        # see if there are any pools to add or remove now that the host is marked up
        for session in tuple(self.sessions):
            session.update_created_pools()

It first unsets the flag (allowing other on_up calls to do work) and then calls session.update_created_pools().
So I thought that update_created_pools() updates some pools while, at the same time, a concurrent on_up calls add_or_renew_pool.

However, this now seems unlikely to me. update_created_pools will only update a pool if it is missing or broken, and that should not be the case after we have finished reconnecting following a node restart.

            if not pool or pool.is_shutdown:
                # we don't eagerly set is_up on previously ignored hosts. None is included here
                # to allow us to attempt connections to hosts that have gone from ignored to something
                # else.
                if distance != HostDistance.IGNORED and host.is_up in (True, None):
                    future = self.add_or_renew_pool(host, False)

What is more likely here is actually an on_down problem.
Cluster.on_down handling is split into two parts, a synchronous one and an asynchronous one.

    @run_in_executor
    def on_down_potentially_blocking(self, host, is_host_addition):
        self.profile_manager.on_down(host)
        self.control_connection.on_down(host)
        for session in tuple(self.sessions):
            session.on_down(host)

        for listener in self.listeners:
            listener.on_down(host)

        self._start_reconnector(host, is_host_addition)

    def on_down(self, host, is_host_addition, expect_host_to_be_down=False):
        """
        Intended for internal use only.
        """
        if self.is_shutdown:
            return

        with host.lock:
            was_up = host.is_up

            # ignore down signals if we have open pools to the host
            # this is to avoid closing pools when a control connection host became isolated
            if self._discount_down_events and self.profile_manager.distance(host) != HostDistance.IGNORED:
                connected = False
                for session in tuple(self.sessions):
                    pool_states = session.get_pool_state()
                    pool_state = pool_states.get(host)
                    if pool_state:
                        connected |= pool_state['open_count'] > 0
                if connected:
                    return

            host.set_down()
            if (not was_up and not expect_host_to_be_down) or host.is_currently_reconnecting():
                return
        log.warning("Host %s has been marked down", host)

        self.on_down_potentially_blocking(host, is_host_addition)

on_down performs some checks under the lock, and then SCHEDULES on_down_potentially_blocking to be executed on a thread pool. It can actually take some time for it to be called, especially when we get a lot of events, such as during restarts.
on_down_potentially_blocking will destroy the pools.
One possible scenario here is a race between on_up and a stale on_down_potentially_blocking.

  1. Node goes down
  2. Driver calls on_down, which schedules on_down_potentially_blocking
  3. Node gets up
  4. Driver calls on_up
  5. Pool is created
  6. on_down_potentially_blocking destroys the pool, which is then scheduled for recreation.

I thought we could just copy the checks from on_down to on_down_potentially_blocking, or make it one function that runs on the executor. I did find some problems with that, but I don't really remember what they were.
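
Something like the following standalone sketch, where the blocking half re-checks the host state under the lock before doing anything destructive, so a stale scheduled call becomes a no-op once the node is back up (names and structure are illustrative only):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class Host:
        def __init__(self):
            self.lock = threading.Lock()
            self.is_up = False

    def on_down_potentially_blocking(host, destroy_pools):
        with host.lock:
            if host.is_up:
                # The node came back up (and on_up ran) after this call was
                # scheduled; destroying the fresh pools now would reintroduce
                # the race, so do nothing.
                return
        destroy_pools()

    executor = ThreadPoolExecutor(max_workers=1)
    host = Host()
    host.is_up = True   # on_up won the race before the stale call ran
    executor.submit(on_down_potentially_blocking, host,
                    lambda: print("pools destroyed")).result()
    executor.shutdown()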


@nyh nyh left a comment


I like this "optimistic locking" solution, where you allow creating the unneeded pool but end up not using it. Since this race happens very rarely, the small performance hit is not a concern.

I let Claude write a fix for this bug and it came up with something similar, but using a more elaborate mechanism: if we're already in the process of creating a pool, we return the existing future of that creation instead of creating a new one and later throwing it away. But the implementation is longer and, in my opinion, not worth it.
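
For comparison, that more elaborate mechanism might look roughly like this (a hypothetical sketch, not the code from that fix):

    import threading
    from concurrent.futures import Future

    class PoolCreationDeduper:
        """Hand back the in-flight Future for a host instead of starting
        a second, redundant pool creation."""

        def __init__(self):
            self._lock = threading.Lock()
            self._in_flight = {}   # host -> Future

        def get_or_start(self, host, create_pool):
            with self._lock:
                existing = self._in_flight.get(host)
                if existing is not None:
                    return existing          # a creation is already running
                future = Future()
                self._in_flight[host] = future
            try:
                future.set_result(create_pool(host))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    self._in_flight.pop(host, None)
            return future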

@nyh

nyh commented May 3, 2026

Why is this fix in draft? What's missing?

I think it's very important for our CI stability; we've had a lot of tests fail on this problem over the years and had to add workarounds for each.

@nyh

nyh commented May 3, 2026

Eh. Will it solve this specific issue? Maybe, I'm not sure. The problem that we should imo solve is why are there even multiple calls to add_or_renew_pool. With your approach the useless connections will still be opened.

@Lorak-mmk here is an alternative fix that avoids creating the extra useless connections in the first place, and also explains why we have those multiple calls to add_or_renew_pool at all: #838

That alternative fix was 100% written by AI, including the fix, the tests, and the commit messages that nicely explain everything, including how and why this race even happens.

@andrzej-jackowski-scylladb
Author

andrzej-jackowski-scylladb commented May 3, 2026

Why is this fix in draft? What's missing?

@nyh, I created this PR as a draft to emphasize that a general discussion is needed to decide whether the approach is correct.

Probably your #838 will win over this PR, but let the Driver Team members decide.


Development

Successfully merging this pull request may close these issues.

Connection pool renewal after concurrent node bootstraps causes double statement execution
