CASSANDRA-21138: Add upgrade test for AutoRepair#4644
pauloricardomg wants to merge 4 commits into apache:trunk
Conversation
Force-pushed 352a572 to ca63a7a
Force-pushed ca63a7a to 64e95c6
The test failed as expected in CI because it requires the 5.0 patch to be merged.
```java
 * In 5.0, the JVM property {@code cassandra.autorepair.enable=true} is also required.
 */
public class AutoRepairUpgradeTest extends UpgradeTestBase
{
```
I had to add:
```java
@BeforeClass
public static void before()
{
    CassandraRelevantProperties.DTEST_IGNORE_SHUTDOWN_THREADCOUNT.setBoolean(true);
}
```
which may imply some kind of thread leak, although I'm having a hard time discerning where this might be. I don't see this every time, but it seems to happen most of the time.
ERROR [node1_MutationStage-1] 2026-03-25T07:27:48,072 AbstractCluster.java:582 - uncaught exception in thread Thread[node1_MutationStage-1,5,SharedPool]
java.lang.RuntimeException: java.lang.IllegalStateException: HintsService is shut down and can't accept new hints
at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:3266)
at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:151)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalStateException: HintsService is shut down and can't accept new hints
at org.apache.cassandra.hints.HintsService.write(HintsService.java:190)
at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:3370)
at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:3262)
... 6 common frames omitted
**INFO [node1_AutoRepair-Repair-full:1] 2026-03-25T07:27:48,072 SubstituteLogger.java:222 - INFO [node1_AutoRepair-Repair-full:1] node1 2026-03-25T07:27:48,072 AutoRepairUtils.java:1027 - Auto repair finished for 6d194555-f6eb-41d0-c000-000000000001**
INFO [node1_isolatedExecutor:2] 2026-03-25T07:27:48,183 SubstituteLogger.java:222 - ERROR 02:27:48,183 Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
INFO [node1_isolatedExecutor:2] 2026-03-25T07:27:48,184 SubstituteLogger.java:222 - ERROR [node1_isolatedExecutor:2] node1 2026-03-25T07:27:48,183 Invariants.java:56 - Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
INFO [node3_isolatedExecutor:2] 2026-03-25T07:27:48,202 SubstituteLogger.java:222 - ERROR 02:27:48,202 Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
INFO [node2_isolatedExecutor:2] 2026-03-25T07:27:48,202 SubstituteLogger.java:222 - ERROR 02:27:48,202 Invariant failed
java.lang.IllegalStateException: null
at accord.utils.Invariants.createIllegalState(Invariants.java:76)
at accord.utils.Invariants.expect(Invariants.java:142)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
I think the issue may be the presence of the node1_AutoRepair-Repair-full:1 thread after the test finishes.
Yeah, that seems to be the issue. I enabled withThreadLeakCheck and found:
INFO [node3_isolatedExecutor:2] 2026-03-25T07:36:53,732 SubstituteLogger.java:222 - ERROR 02:36:53,732 Exception in thread Thread[node3_isolatedExecutor:2,5,isolatedExecutor]
java.lang.RuntimeException:
Unterminated thread detected node3_shutdown in group isolatedExecutor
Unterminated thread detected node3_AutoRepair-Repair-incremental:1 in group AutoRepair-Repair-incremental
Unterminated thread detected node3_isolatedExecutor:2 in group isolatedExecutor
Unterminated thread detected node3_AutoRepair-Repair-preview_repaired:1 in group AutoRepair-Repair-preview_repaired
Unterminated thread detected node3_AutoRepair-Repair-full:1 in group AutoRepair-Repair-full
at org.apache.cassandra.distributed.impl.Instance.withThreadLeakCheck(Instance.java:1116)
at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1095)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Getting late, but I thought of one possible issue: AutoRepair.shutdownBlocking simply calls shutdown() on its executors, which does not interrupt any running tasks.
Another possible issue is that we call AutoRepair.shutdownBlocking() fairly late in StorageService.drain; it should arguably be done before ActiveRepairService.instance().stop().
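The shutdown() behavior mentioned above can be sketched in isolation. This is a hedged illustration of the plain JDK ExecutorService contract, not AutoRepair's actual code: shutdown() stops accepting new tasks but lets an in-flight task (standing in for a long-running repair) keep going, while shutdownNow() additionally interrupts it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch
{
    // Returns true if the executor terminated within the grace period.
    // The sleeping task stands in for a long-running repair.
    public static boolean terminatesQuickly(boolean useShutdownNow) throws InterruptedException
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        executor.submit(() -> {
            started.countDown();
            try
            {
                Thread.sleep(10_000); // long-running "repair" task
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // only shutdownNow() gets here
            }
        });
        started.await();
        if (useShutdownNow)
            executor.shutdownNow(); // interrupts the in-flight task
        else
            executor.shutdown();    // only stops accepting new tasks
        return executor.awaitTermination(500, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println("shutdown() terminated in time: " + terminatesQuickly(false));   // false
        System.out.println("shutdownNow() terminated in time: " + terminatesQuickly(true)); // true
    }
}
```

So an executor shut down with shutdown() alone can still hold a live worker thread past drain, which is consistent with the unterminated AutoRepair threads reported above.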
Thanks for catching this @tolbertam. The issue is that AutoRepair.shutdownBlocking was not called on the dtest Instance. I updated this on trunk (6a8e8ad) and 5.0 (6401f75), and this no longer fails.
Awesome, thanks for addressing this; I'll give this a quick test and get back to you! This also explains what was perplexing me: I added a bunch of breakpoints and the shutdown code was not getting engaged. I had chalked this up to some kind of dtest quirk, but didn't realize there was separate shutdown logic here.
```java
waitForNewRepairRound(cluster, 3, threshold);

Map<String, Long> postUpgradeTimestamps = captureFinishTimestamps(cluster);
assertEquals("Expected repair history for all 3 nodes after upgrade",
```
I'm seeing 5 repair_history entries when I run this:
Caused by: java.lang.AssertionError: Expected repair history for all 3 nodes after upgrade expected:<3> but was:<5>
at org.junit.Assert.fail(Assert.java:88)
I see you note in comments that host-ids may change as part of upgrade which I think explains this.
It's odd to me that host id is changing on upgrade though, is it just a quirk of the dtest framework exclusively?
> It's odd to me that host id is changing on upgrade though, is it just a quirk of the dtest framework exclusively?
I also find it odd, but it seems this is expected behavior due to TCM changing host ids, so AutoRepair creates new entries with the new host IDs and the old entries are eventually cleaned up.
I updated the test here (b570b9b) to filter out entries with a repair timestamp prior to the upgrade, which fixes this. LMK what you think.
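The timestamp filter described here could look something like the following hypothetical sketch (the class, method, and parameter names are illustrative, not the actual b570b9b code): drop any history entry whose finish timestamp predates the upgrade, so only post-upgrade host ids are counted.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class RepairHistoryFilter
{
    // Keep only entries whose repair finished at or after the upgrade started,
    // discarding pre-upgrade entries keyed by the old host ids.
    public static Map<String, Long> filterPostUpgrade(Map<String, Long> finishTimestamps,
                                                      long upgradeStartMillis)
    {
        return finishTimestamps.entrySet().stream()
                               .filter(e -> e.getValue() >= upgradeStartMillis)
                               .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```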
Oh right; I suppose between 5.0 and trunk, TCM/CMS comes into play. I wonder if the upgrade dtests take care of initializing CMS or if things stay in a gossip-frozen mode. Will look into it for my own benefit.
Before Upgrade:
[00000000-0000-4000-8000-000000000002, 00000000-0000-4000-8000-000000000003, 00000000-0000-4000-8000-000000000001]
After Upgrade:
[6d194555-f6eb-41d0-c000-000000000001, 6d194555-f6eb-41d0-c000-000000000003, 6d194555-f6eb-41d0-c000-000000000002]
Ok, that is interesting, so the host ids change entirely as part of the upgrade. I guess I didn't realize TCM would do this.
Ok, I see: the host id becomes the NodeId (an incrementing value) with a consistent magic prefix.
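Based on the post-upgrade host ids shown above, the trailing digits appear to carry the incrementing node id under a fixed prefix. A hypothetical sketch of recovering that counter follows; the "low 48 bits" assumption is inferred from the observed values, not from the actual TCM encoding API:

```java
import java.util.UUID;

public class HostIdSketch
{
    // Assumption: the low 48 bits of a post-upgrade host id hold the node id
    // (e.g. 6d194555-f6eb-41d0-c000-000000000001 -> node id 1).
    public static long nodeIdFromHostId(UUID hostId)
    {
        return hostId.getLeastSignificantBits() & 0xFFFF_FFFF_FFFFL;
    }

    public static void main(String[] args)
    {
        System.out.println(nodeIdFromHostId(UUID.fromString("6d194555-f6eb-41d0-c000-000000000001")));
    }
}
```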
I took some time to think about how this might impact AutoRepair, as I know there is logic in AutoRepairUtils.myTurnToRepair for determining whether it's a node's turn, but it incidentally appears to be resilient to this because it filters out anything not in CMS. That this works is a happy coincidence.
However... it looks like AutoRepair.hasMultipleLiveMajorVersions() will not be resilient to this. As a node comes online with 6.0, I think it will not have any visibility into nodes that aren't currently in CMS, and may submit repairs during the upgrade process, which we will want to avoid. To get around this we can probably check whether CMS is initialized and, if it is not, avoid submitting repairs. Debugging through this now.
EDIT: looks like I'm wrong about the major-version upgrade thing; it appears NodeIds are populated for nodes at older versions as well.
The other side effect of the host ids changing is that it somewhat invalidates the scheduling of repairs.
A node will see that it has never done a repair and run sooner. I'm not sure we need to bother handling this; maybe we should just add something to NEWS.txt for it. I think the repair history for the older entries will be cleaned up eventually.
But this in general is an interesting side effect of upgrading to a version with TCM. I wonder if this impacts any other tooling... (started thread here: https://the-asf.slack.com/archives/CK23JSY2K/p1774717256278899)
Force-pushed 64e95c6 to b570b9b
tolbertam left a comment:
👍 this seems to work great now, thanks!
I think one thing we may just need to accept is that, since the HostIds change across the upgrade, existing repair history may be invalidated on upgrade to 6.0. This will provoke more repairs initially after the upgrade completes on all nodes, and the existing repair history will eventually be purged.
Functionally this may be OK; it'd be equivalent to turning AutoRepair on for the first time, but maybe less impactful because the data had previously been actively repaired. @jaydeepkumar1984, curious what you think about this? I think we can maybe add something in CHANGES.txt, but I'm not sure what else we can do here, as repair history used the host id as the key and I'm not sure how we can resolve the previous host id.