CASSANDRA-21138: Add upgrade test for AutoRepair#4644

Open
pauloricardomg wants to merge 4 commits into apache:trunk from pauloricardomg:CASSANDRA-21138-trunk

Conversation

@pauloricardomg
Contributor

Thanks for sending a pull request! Here are some tips if you're new here:

  • Ensure you have added or run the appropriate tests for your PR.
  • Be sure to keep the PR description updated to reflect all changes.
  • Write your PR title to summarize what this PR proposes.
  • If possible, provide a concise example to reproduce the issue for a faster review.
  • Read our contributor guidelines
  • If you're making a documentation change, see our guide to documentation contribution

Commit messages should use the following format:

<One sentence description, usually Jira title or CHANGES.txt summary>

<Optional lengthier description (context on patch)>

patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####

Co-authored-by: Name1 <email1>
Co-authored-by: Name2 <email2>

The Cassandra Jira

@pauloricardomg
Contributor Author

* In 5.0, the JVM property {@code cassandra.autorepair.enable=true} is also required.
*/
public class AutoRepairUpgradeTest extends UpgradeTestBase
{
Contributor

@tolbertam Mar 25, 2026


I had to add:

    @BeforeClass
    public static void before()
    {
        CassandraRelevantProperties.DTEST_IGNORE_SHUTDOWN_THREADCOUNT.setBoolean(true);
    }

which may imply some kind of thread leak, although I'm having a hard time discerning where it might be. I don't see this always, but it seems to happen most of the time.

Contributor


ERROR [node1_MutationStage-1] 2026-03-25T07:27:48,072 AbstractCluster.java:582 - uncaught exception in thread Thread[node1_MutationStage-1,5,SharedPool]
java.lang.RuntimeException: java.lang.IllegalStateException: HintsService is shut down and can't accept new hints
	at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:3266)
	at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:151)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalStateException: HintsService is shut down and can't accept new hints
	at org.apache.cassandra.hints.HintsService.write(HintsService.java:190)
	at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:3370)
	at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:3262)
	... 6 common frames omitted

**INFO  [node1_AutoRepair-Repair-full:1] 2026-03-25T07:27:48,072 SubstituteLogger.java:222 - INFO  [node1_AutoRepair-Repair-full:1] node1 2026-03-25T07:27:48,072 AutoRepairUtils.java:1027 - Auto repair finished for 6d194555-f6eb-41d0-c000-000000000001**

INFO  [node1_isolatedExecutor:2] 2026-03-25T07:27:48,183 SubstituteLogger.java:222 - ERROR 02:27:48,183 Invariant failed
java.lang.IllegalStateException: null
	at accord.utils.Invariants.createIllegalState(Invariants.java:76)
	at accord.utils.Invariants.expect(Invariants.java:142)
	at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

INFO  [node1_isolatedExecutor:2] 2026-03-25T07:27:48,184 SubstituteLogger.java:222 - ERROR [node1_isolatedExecutor:2] node1 2026-03-25T07:27:48,183 Invariants.java:56 - Invariant failed
java.lang.IllegalStateException: null
	at accord.utils.Invariants.createIllegalState(Invariants.java:76)
	at accord.utils.Invariants.expect(Invariants.java:142)
	at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

INFO  [node3_isolatedExecutor:2] 2026-03-25T07:27:48,202 SubstituteLogger.java:222 - ERROR 02:27:48,202 Invariant failed
java.lang.IllegalStateException: null
	at accord.utils.Invariants.createIllegalState(Invariants.java:76)
	at accord.utils.Invariants.expect(Invariants.java:142)
	at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

INFO  [node2_isolatedExecutor:2] 2026-03-25T07:27:48,202 SubstituteLogger.java:222 - ERROR 02:27:48,202 Invariant failed
java.lang.IllegalStateException: null
	at accord.utils.Invariants.createIllegalState(Invariants.java:76)
	at accord.utils.Invariants.expect(Invariants.java:142)
	at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1087)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

I think maybe the issue is the presence of the node1_AutoRepair-Repair-full:1 thread after the test finishes.

Contributor


Yeah, that seems to be the issue. I enabled withThreadLeakCheck and found:

INFO  [node3_isolatedExecutor:2] 2026-03-25T07:36:53,732 SubstituteLogger.java:222 - ERROR 02:36:53,732 Exception in thread Thread[node3_isolatedExecutor:2,5,isolatedExecutor]
java.lang.RuntimeException: 
Unterminated thread detected node3_shutdown in group isolatedExecutor
Unterminated thread detected node3_AutoRepair-Repair-incremental:1 in group AutoRepair-Repair-incremental
Unterminated thread detected node3_isolatedExecutor:2 in group isolatedExecutor
Unterminated thread detected node3_AutoRepair-Repair-preview_repaired:1 in group AutoRepair-Repair-preview_repaired
Unterminated thread detected node3_AutoRepair-Repair-full:1 in group AutoRepair-Repair-full
	at org.apache.cassandra.distributed.impl.Instance.withThreadLeakCheck(Instance.java:1116)
	at org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$52(Instance.java:1095)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

Contributor


It's getting late, but I thought of one possible issue: AutoRepair.shutdownBlocking simply calls shutdown() on its executors, which does not interrupt any running tasks.

Another possible issue is that we call AutoRepair.shutdownBlocking() fairly late in StorageService.drain; it should arguably be done before ActiveRepairService.instance().stop();
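To illustrate the first point in isolation, here is a plain JDK sketch (not the AutoRepair code itself): `ExecutorService.shutdown()` only stops new submissions and lets in-flight tasks run to completion, while `shutdownNow()` interrupts them — which is why a long-running repair task can outlive a shutdown() call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch
{
    public static void main(String[] args) throws InterruptedException
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            // Simulates a long-running repair task.
            try { Thread.sleep(60_000); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        executor.shutdown(); // rejects new tasks, but the sleeping task keeps running
        boolean done = executor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println("terminated after shutdown(): " + done); // false

        executor.shutdownNow(); // interrupts the in-flight task
        done = executor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println("terminated after shutdownNow(): " + done); // true
    }
}
```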

Contributor Author


Thanks for catching this @tolbertam. The issue is that AutoRepair.shutdownBlocking was not called on dtest's Instance. I updated this on trunk (6a8e8ad) and 5.0 (6401f75), and it no longer fails.

Contributor

@tolbertam Mar 28, 2026


Awesome, thanks for addressing this; I will give it a quick test and get back to you! This also explains what was perplexing me: I added a bunch of breakpoints and the shutdown code was not getting engaged. I had chalked this up to some kind of dtest quirk, but didn't realize there was separate shutdown logic here.

waitForNewRepairRound(cluster, 3, threshold);

Map<String, Long> postUpgradeTimestamps = captureFinishTimestamps(cluster);
assertEquals("Expected repair history for all 3 nodes after upgrade",
Contributor


I'm seeing 5 repair_history entries when I run this:

Caused by: java.lang.AssertionError: Expected repair history for all 3 nodes after upgrade expected:<3> but was:<5>
	at org.junit.Assert.fail(Assert.java:88)

I see you note in the comments that host ids may change as part of the upgrade, which I think explains this.

It's odd to me that the host id changes on upgrade though; is it just a quirk of the dtest framework exclusively?

Contributor Author


It's odd to me that host id is changing on upgrade though, is it just a quirk of the dtest framework exclusively?

I also find it odd, but it seems this is expected behavior due to TCM changing host ids, so AutoRepair creates new entries with the new host IDs and the old entries are eventually cleaned up.

I updated the test here (b570b9b) to filter out entries with a repair timestamp prior to the upgrade, which fixes this. LMK what you think.
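The filtering approach can be sketched roughly as follows (a simplified stand-in; `RepairEntry` and `postUpgrade` are hypothetical names for illustration, not the actual test helpers in b570b9b):

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class RepairHistoryFilter
{
    // Hypothetical record standing in for a repair_history row.
    public record RepairEntry(String hostId, Instant finishedAt) {}

    // Keep only entries whose repair finished after the upgrade started,
    // discarding stale pre-upgrade rows keyed by the old host ids.
    public static List<RepairEntry> postUpgrade(List<RepairEntry> entries, Instant upgradeStart)
    {
        return entries.stream()
                      .filter(e -> e.finishedAt().isAfter(upgradeStart))
                      .collect(Collectors.toList());
    }
}
```

With this filter, the assertion on the entry count only sees the rows written under the new host IDs, so the stale pre-upgrade rows no longer inflate the result.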

Contributor


Oh right; I suppose between 5.0 and trunk, TCM/CMS comes into play. I wonder whether the upgrade dtests take care of initializing CMS or whether things stay in a gossip-frozen mode. Will look into it for my own benefit.

Contributor

@tolbertam Mar 28, 2026


Before Upgrade:
[00000000-0000-4000-8000-000000000002, 00000000-0000-4000-8000-000000000003, 00000000-0000-4000-8000-000000000001]

After Upgrade:
[6d194555-f6eb-41d0-c000-000000000001, 6d194555-f6eb-41d0-c000-000000000003, 6d194555-f6eb-41d0-c000-000000000002]

Ok, that is interesting; so the host ids change entirely as part of the upgrade. I guess I didn't realize TCM would do this.

Contributor

@tolbertam Mar 28, 2026


Ok, I see: the host id becomes the NodeId (an incrementing value) with some magic consistent prefix:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tcm/membership/NodeId.java#L61-L73
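Based on the observed IDs above, the mapping can be sketched as packing the incrementing node id into the low bits of a UUID behind a fixed prefix. The constants below are inferred from the logged UUIDs, not copied from NodeId.java, so treat this purely as an illustration of the shape:

```java
import java.util.UUID;

public class NodeIdUuidSketch
{
    // Inferred from the post-upgrade host ids in the logs; hypothetical constants.
    static final long MAGIC_MSB = 0x6d194555f6eb41d0L;        // "6d194555-f6eb-41d0"
    static final long MAGIC_LSB_PREFIX = 0xc000000000000000L; // "c000-..." prefix

    // Pack the incrementing node id into the low bits of the UUID.
    static UUID toUuid(int nodeId)
    {
        return new UUID(MAGIC_MSB, MAGIC_LSB_PREFIX | nodeId);
    }

    public static void main(String[] args)
    {
        System.out.println(toUuid(1)); // 6d194555-f6eb-41d0-c000-000000000001
        System.out.println(toUuid(3)); // 6d194555-f6eb-41d0-c000-000000000003
    }
}
```

This matches the "After Upgrade" IDs above, where only the final digits differ per node.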

I took some time to think about how this might impact AutoRepair, as I know there is some logic around determining whether it's a node's turn in AutoRepairUtils.myTurnToRepair, but it incidentally appears to be resilient to this because it filters out anything not in CMS. I guess it's a happy coincidence that this works.

However... it looks like AutoRepair.hasMultipleLiveMajorVersions() will not be resilient to this. As a node comes online with 6.0, I think it will not have any visibility into nodes that aren't currently in CMS, and may submit repairs during the upgrade process, which we will want to avoid. I think to get around this we can probably check whether CMS is initialized and, if it is not, avoid submitting repairs. Debugging through this now.

EDIT: looks like I'm wrong about the major version upgrade thing; it appears NodeIds are populated for nodes at older versions as well:

[screenshot: NodeIds populated for nodes on older versions]

Contributor

@tolbertam Mar 28, 2026


The other side effect of the host ids changing is that it somewhat invalidates the scheduling of repairs.

A node will see that it has never done a repair and run sooner. I'm not sure if we need to bother handling this; maybe we should just add something to NEWS.txt for it. I think the repair history for the older entries will be cleaned up eventually.

But this in general is an interesting side effect of upgrading to a version with TCM. I wonder if this impacts any other tooling... (started thread here: https://the-asf.slack.com/archives/CK23JSY2K/p1774717256278899)

@pauloricardomg force-pushed the CASSANDRA-21138-trunk branch from 64e95c6 to b570b9b on March 26, 2026 at 16:12
Contributor

@tolbertam tolbertam left a comment


👍 this seems to work great now, thanks!

I think one thing we may just need to accept is that, since the host IDs change during upgrade, existing repair history may be invalidated on upgrade to 6.0. This will provoke more repairs initially after the upgrade completes on all nodes, and the existing repair history will eventually be purged.

Functionally this may be ok: it'd be equivalent to turning AutoRepair on for the first time, but perhaps less impactful because the data had previously been actively repaired. @jaydeepkumar1984 curious what you think about this? I think we can maybe add something to CHANGES.txt, but I'm not sure what else we can do here, as repair history uses host id as the key and I'm not sure how we can resolve the previous host id.
