CASSANDRA-21138: Add upgrade test for AutoRepair#4644
pauloricardomg wants to merge 4 commits into apache:trunk
Conversation
Force-pushed 352a572 to ca63a7a
Force-pushed ca63a7a to 64e95c6
The test failed as expected in CI because it requires the 5.0 patch to be merged.
```java
 * In 5.0, the JVM property {@code cassandra.autorepair.enable=true} is also required.
 */
public class AutoRepairUpgradeTest extends UpgradeTestBase
{
```
I had to add:
```java
@BeforeClass
public static void before()
{
    CassandraRelevantProperties.DTEST_IGNORE_SHUTDOWN_THREADCOUNT.setBoolean(true);
}
```
which may imply some kind of thread leak, although I'm having a hard time discerning where this might be. I don't see this every time, but it seems to happen most of the time.
ERROR [node1_MutationStage-1] 2026-03-25T07:27:48,072 AbstractCluster.java:582 - uncaught exception in thread Thread[node1_MutationStage-1,5,SharedPool]
java.lang.RuntimeException: java.lang.IllegalStateException: HintsService is shut down and can't accept new hints
at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:3266)
at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:151)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalStateException: HintsService is shut down and can't accept new hints
at org.apache.cassandra.hints.HintsService.write(HintsService.java:190)
at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:3370)
at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:3262)
... 6 common frames omitted
**INFO [node1_AutoRepair-Repair-full:1] 2026-03-25T07:27:48,072 SubstituteLogger.java:222 - INFO [node1_AutoRepair-Repair-full:1] node1 2026-03-25T07:27:48,072 AutoRepairUtils.java:1027 - Auto repair finished for 6d194555-f6eb-41d0-c000-000000000001**
INFO [node1_isolatedExecutor:2] 2026-03-25T07:27:48,183 SubstituteLogger.java:222 - ERROR 02:27:48,183 Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
INFO [node1_isolatedExecutor:2] 2026-03-25T07:27:48,184 SubstituteLogger.java:222 - ERROR [node1_isolatedExecutor:2] node1 2026-03-25T07:27:48,183 Invariants.java:56 - Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
INFO [node3_isolatedExecutor:2] 2026-03-25T07:27:48,202 SubstituteLogger.java:222 - ERROR 02:27:48,202 Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
INFO [node2_isolatedExecutor:2] 2026-03-25T07:27:48,202 SubstituteLogger.java:222 - ERROR 02:27:48,202 Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
I think the issue may be the presence of the node1_AutoRepair-Repair-full:1 thread after the test finishes.
Yeah, that seems to be the issue. I enabled withThreadLeakCheck and found:
INFO [node3_isolatedExecutor:2] 2026-03-25T07:36:53,732 SubstituteLogger.java:222 - ERROR 02:36:53,732 Exception in thread Thread[node3_isolatedExecutor:2,5,isolatedExecutor]
java.lang.RuntimeException:
Unterminated thread detected node3_shutdown in group isolatedExecutor
Unterminated thread detected node3_AutoRepair-Repair-incremental:1 in group AutoRepair-Repair-incremental
Unterminated thread detected node3_isolatedExecutor:2 in group isolatedExecutor
Unterminated thread detected node3_AutoRepair-Repair-preview_repaired:1 in group AutoRepair-Repair-preview_repaired
Unterminated thread detected node3_AutoRepair-Repair-full:1 in group AutoRepair-Repair-full
at org.apache.cassandra.distributed.impl.Instance.withThreadLeakCheck(Instance.java:1116)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1095)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Getting late, but I thought of one possible issue: AutoRepair.shutdownBlocking simply calls shutdown() on its executors, which does not interrupt any running tasks.
Another possible issue is that we call AutoRepair.shutdownBlocking() fairly late in StorageService.drain; it should arguably be done before ActiveRepairService.instance().stop().
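The shutdown() behavior mentioned above can be sketched in isolation. This is a hedged illustration of the plain JDK ExecutorService contract, not AutoRepair's actual code: shutdown() stops accepting new tasks but lets an in-flight task (standing in for a long-running repair) keep going, while shutdownNow() additionally interrupts it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch
{
    // Returns true if the executor terminated within the grace period.
    // The sleeping task stands in for a long-running repair.
    public static boolean terminatesQuickly(boolean useShutdownNow) throws InterruptedException
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        executor.submit(() -> {
            started.countDown();
            try
            {
                Thread.sleep(10_000); // long-running "repair" task
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // only shutdownNow() gets here
            }
        });
        started.await();
        if (useShutdownNow)
            executor.shutdownNow(); // interrupts the in-flight task
        else
            executor.shutdown();    // only stops accepting new tasks
        return executor.awaitTermination(500, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println("shutdown() terminated in time: " + terminatesQuickly(false));   // false
        System.out.println("shutdownNow() terminated in time: " + terminatesQuickly(true)); // true
    }
}
```

So an executor shut down with shutdown() alone can still hold a live worker thread past drain, which is consistent with the unterminated AutoRepair threads reported above.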
Thanks for catching this @tolbertam. The issue is that AutoRepair.shutdownBlocking was not called on the dtest Instance. I updated this on trunk (6a8e8ad) and 5.0 (6401f75), and this no longer fails.
Awesome, thanks for addressing this; I'll give this a quick test and get back to you! This also explains what was perplexing me: I added a bunch of breakpoints and the shutdown code was not getting engaged. I had chalked this up to some kind of dtest quirk, but didn't realize there was separate shutdown logic here.
```java
waitForNewRepairRound(cluster, 3, threshold);

Map<String, Long> postUpgradeTimestamps = captureFinishTimestamps(cluster);
assertEquals("Expected repair history for all 3 nodes after upgrade",
```
I'm seeing 5 repair_history entries when I run this:
Caused by: java.lang.AssertionError: Expected repair history for all 3 nodes after upgrade expected:<3> but was:<5>
at org.junit.Assert.fail(Assert.java:88)
I see you note in comments that host-ids may change as part of upgrade which I think explains this.
It's odd to me that host id is changing on upgrade though, is it just a quirk of the dtest framework exclusively?
> It's odd to me that host id is changing on upgrade though, is it just a quirk of the dtest framework exclusively?
I also find it odd, but it seems this is expected behavior due to TCM changing host ids, so AutoRepair creates new entries with the new host IDs and the old entries are eventually cleaned up.
I updated the test here (b570b9b) to filter out entries with a repair timestamp prior to the upgrade, which fixes this. LMK what you think.
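The timestamp filter described here could look something like the following hypothetical sketch (the class, method, and parameter names are illustrative, not the actual b570b9b code): drop any history entry whose finish timestamp predates the upgrade, so only post-upgrade host ids are counted.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class RepairHistoryFilter
{
    // Keep only entries whose repair finished at or after the upgrade started,
    // discarding pre-upgrade entries keyed by the old host ids.
    public static Map<String, Long> filterPostUpgrade(Map<String, Long> finishTimestamps,
                                                      long upgradeStartMillis)
    {
        return finishTimestamps.entrySet().stream()
                               .filter(e -> e.getValue() >= upgradeStartMillis)
                               .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```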
Oh right; I suppose between 5.0 and trunk, TCM/CMS comes into play. I wonder if the upgrade dtests take care of initializing CMS or if things stay in a gossip-frozen mode. Will look into it for my own benefit.
Before Upgrade:
[00000000-0000-4000-8000-000000000002, 00000000-0000-4000-8000-000000000003, 00000000-0000-4000-8000-000000000001]
After Upgrade:
[6d194555-f6eb-41d0-c000-000000000001, 6d194555-f6eb-41d0-c000-000000000003, 6d194555-f6eb-41d0-c000-000000000002]
Ok, that is interesting, so the host ids change entirely as part of the upgrade. I guess I didn't realize TCM would do this.
Ok, I see: the host id becomes the NodeId (an incrementing value) with a consistent magic prefix.
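Based on the post-upgrade host ids shown above, the trailing digits appear to carry the incrementing node id under a fixed prefix. A hypothetical sketch of recovering that counter follows; the "low 48 bits" assumption is inferred from the observed values, not from the actual TCM encoding API:

```java
import java.util.UUID;

public class HostIdSketch
{
    // Assumption: the low 48 bits of a post-upgrade host id hold the node id
    // (e.g. 6d194555-f6eb-41d0-c000-000000000001 -> node id 1).
    public static long nodeIdFromHostId(UUID hostId)
    {
        return hostId.getLeastSignificantBits() & 0xFFFF_FFFF_FFFFL;
    }

    public static void main(String[] args)
    {
        System.out.println(nodeIdFromHostId(UUID.fromString("6d194555-f6eb-41d0-c000-000000000001")));
    }
}
```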
I took some time to think about how this might impact AutoRepair, as I know there is logic in AutoRepairUtils.myTurnToRepair for determining whether it's a node's turn, but it incidentally appears to be resilient to this because it filters out anything not in CMS. That this works is a happy coincidence.
However... it looks like AutoRepair.hasMultipleLiveMajorVersions() will not be resilient to this. As a node comes online with 6.0, I think it will not have any visibility into nodes that aren't currently in CMS, and may submit repairs during the upgrade process, which we will want to avoid. To get around this we can probably check whether CMS is initialized and, if it is not, avoid submitting repairs. Debugging through this now.
EDIT: looks like I'm wrong about the major-version upgrade thing; it appears NodeIds are populated for nodes at older versions as well.
The other side effect of the host ids changing is that it somewhat invalidates the scheduling of repairs.
A node will see that it has never done a repair and run sooner. I'm not sure we need to bother handling this; maybe we should just add something to NEWS.txt for it. I think the repair history for the older entries will be cleaned up eventually.
But this in general is an interesting side effect of upgrading to a version with TCM. I wonder if this impacts any other tooling... (started thread here: https://the-asf.slack.com/archives/CK23JSY2K/p1774717256278899)
Force-pushed 64e95c6 to b570b9b
tolbertam left a comment:
👍 this seems to work great now, thanks!
I think one thing we may just need to accept is that, since the HostIds change across the upgrade, existing repair history may be invalidated on upgrade to 6.0. This will provoke more repairs initially after the upgrade completes on all nodes, and the existing repair history will eventually be purged.
Functionally this may be OK; it'd be equivalent to turning AutoRepair on for the first time, but maybe less impactful because the data had previously been actively repaired. @jaydeepkumar1984, curious what you think about this? I think we can maybe add something in CHANGES.txt, but I'm not sure what else we can do here, as repair history used the host id as the key and I'm not sure how we can resolve the previous host id.