Skip to content

[CASSANDRA-21138] Add autorepair to Cassandra 5.0#4558

Open
pauloricardomg wants to merge 4 commits intoapache:cassandra-5.0from
pauloricardomg:autorepair_5_0_7
Open

[CASSANDRA-21138] Add autorepair to Cassandra 5.0#4558
pauloricardomg wants to merge 4 commits intoapache:cassandra-5.0from
pauloricardomg:autorepair_5_0_7

Conversation

@pauloricardomg
Copy link
Copy Markdown
Contributor

Thanks for sending a pull request! Here are some tips if you're new here:

  • Ensure you have added or run the appropriate tests for your PR.
  • Be sure to keep the PR description updated to reflect all changes.
  • Write your PR title to summarize what this PR proposes.
  • If possible, provide a concise example to reproduce the issue for a faster review.
  • Read our contributor guidelines
  • If you're making a documentation change, see our guide to documentation contribution

Commit messages should follow the following format:

<One sentence description, usually Jira title or CHANGES.txt summary>

<Optional lengthier description (context on patch)>

patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####

Co-authored-by: Name1 <email1>
Co-authored-by: Name2 <email2>

The Cassandra Jira

@pauloricardomg pauloricardomg changed the base branch from trunk to cassandra-5.0 January 12, 2026 16:24
@pauloricardomg pauloricardomg force-pushed the autorepair_5_0_7 branch 5 times, most recently from 00f48df to 29b0090 Compare January 27, 2026 21:57
@pauloricardomg pauloricardomg changed the title [JIRA PENDING] Add autorepair to Cassandra 5.0 [CASSANDRA-21138] Add autorepair to Cassandra 5.0 Jan 28, 2026
for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0 ; do
# Note: cassandra-5.0.6 tag is used instead of cassandra-5.0 branch to enable
# testing upgrades from 5.0.6 to the current local build for autorepair feature
for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0.6 ; do
Copy link
Copy Markdown
Contributor Author

@pauloricardomg pauloricardomg Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cassandra-5.0 dtest jar was unused since we don't do tests upgrading from the current version to the current version so I modified this to build the cassandra-5.0.6 dtest jar where autorepair is not included to test a minor version upgrade on AutoRepairDisabledSchemaUpgradeTest.

echo "Reusing ${TMP_DIR}/cassandra-dtest-jars for past branch dtest jars"
if [ "x" == "x${OFFLINE}" ] ; then
until git -C ${TMP_DIR}/cassandra-dtest-jars fetch --quiet origin ; do echo "git -C ${TMP_DIR}/cassandra-dtest-jars fetch failed… trying again… " ; done
until git -C ${TMP_DIR}/cassandra-dtest-jars fetch --quiet --tags origin ; do echo "git -C ${TMP_DIR}/cassandra-dtest-jars fetch failed… trying again… " ; done
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the --tags is needed because cassandra-5.0.6 is a tag and not a branch.

* Versions at or below this version (5.0.6) do not support auto-repair.
*/
@VisibleForTesting
static final CassandraVersion LAST_UNSUPPORTED_VERSION_FOR_AUTO_REPAIR = new CassandraVersion("5.0.6");
Copy link
Copy Markdown
Contributor Author

@pauloricardomg pauloricardomg Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated to the latest version before the release version of autorepair in 5.0. The reason for this is because we want the tests to pass if it's running 5.0.7-SNAPSHOT which is below 5.0.7 but above 5.0.6.

if (currentRepairStatus == null || parallelRepairNumber > currentRepairStatus.hostIdsWithOnGoingRepair.size())
{
// more repairs can be run, I might be the new one
if (autoRepairHistories != null)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaydeepkumar1984 do you recall why do we re-fetch the repair history here? In what circumstances can autoRepairHistories can be null, is this just on an uninitialized system ? I am wondering if this could cause a race where multiple nodes will attempt to run repair more than the allowed parallelNumber. Should we not perform this re-fetch outside the if parallelRepairNumber > currentRepairStatus.hostIdsWithOnGoingRepair.size() ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not recall any specific reason, so this seems unnecessary. We can remove ti from here and refresh the autoRepairHistories above in line no. 876. which should be sufficient.
CurrentRepairStatus currentRepairStatus = getCurrentRepairStatus(repairType, getAutoRepairHistory(repairType), myId);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated on 13f0b29 let me know if this is what you had in mind

}

@VisibleForTesting
public static void setInitialTokens(String initial_token)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused code

logger.info("Updated CDC block_writes from {} to {}", oldVal, val);
}


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unwanted change

if (row.has("incremental_backups"))
builder.incrementalBackups(row.getBoolean("incremental_backups"));

// auto_repair column was introduced in 5.1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "auto_repair column was introduced in 5.1" -> "auto_repair column was introduced in 5.0.7"

import com.google.common.collect.ImmutableList;
import com.vdurmont.semver4j.Semver;
import com.vdurmont.semver4j.SemverException;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unwanted code change

Object[][] result1 = cluster.coordinator(1).execute("SELECT * FROM " + KEYSPACE + ".tbl WHERE pk = 1", ConsistencyLevel.ALL);
Object[][] result2 = cluster.coordinator(2).execute("SELECT * FROM " + KEYSPACE + ".tbl WHERE pk = 1", ConsistencyLevel.ALL);
assertEquals("Data should be readable from node 1", 1, result1.length);
assertEquals("Data should be readable from node 2", 1, result2.length);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extend the test case here by downgrading node1 to 5.0.6, so that would also complete the rollback scenario.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jaydeepkumar1984
Copy link
Copy Markdown
Contributor

Overall, the changes look great. A few nits and a comment in the test case. Please take a look at it.

@Kokokokoka
Copy link
Copy Markdown

Hello. can an option be added to skip TWCS tables automatically?
Like reaper does.
P.S. Thanks for an awesome change!

@pauloricardomg
Copy link
Copy Markdown
Contributor Author

@Kokokokoka can you create a new issue requesting this feature? https://issues.apache.org/jira/secure/CreateIssue!default.jspa

@pauloricardomg
Copy link
Copy Markdown
Contributor Author

@jaydeepkumar1984 addressed review comments on d7eb86f

@jaydeepkumar1984
Copy link
Copy Markdown
Contributor

@jaydeepkumar1984 addressed review comments on d7eb86f

Thanks. I will do one more round next week and then likely approve it.

* gen 7: add auto_repair_history and auto_repair_priority tables for AutoRepair feature
*/
public static final long GENERATION = 6;
public static final long GENERATION = 7;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also then maintain the same version if CassandraRelevantProperties.AUTOREPAIR_ENABLE.getBoolean() is false?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed on 81d9547

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was sort of worried about this, but I see that 7 is whats used for trunk for AutoRepair anyways, and has since advanced to 8 for zstd dictionary compression, so this shouldn't be a problem.

@tolbertam
Copy link
Copy Markdown
Contributor

Thanks for making the suggested changes! I will review this weekend

Copy link
Copy Markdown
Contributor

@tolbertam tolbertam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, played with this a little bit this weekend and it worked well.

I ran through all tests except the python upgrade dtests (which i ran into some configuration issues for) and everything worked except some AutoRepair dtests were missing the property to indicate auto repair is enabled (added comment for that). Other than that, one small cassandra.yaml issue which also exists on trunk that would be nice to fix here.

Did hit a couple known repeated flakes (https://issues.apache.org/jira/browse/CASSANDRA-20933, https://issues.apache.org/jira/browse/CASSANDRA-18098), but everything else appears to have run well.

Kicked off upgrade dtests overnight and will share how that went tomorrow. Assuming the property gets added for the auto repair dtests, i should be +1 on this. I'm wondering if we should just do a last formal vote for this before merging in assuming Jaydeep also +1s?

# repair as anti-compaction during repair may contribute to additional space temporarily.
# For example, setting this to 0.2 means at least 20% of disk must be unused.
# Set to 0.0 to disable this check. Defaults to 0.0 (disabled) on 5.0 for backward-compatibility.
# repair_disk_headroom_reject_ratio: 0.0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't noticed before but as it seems the case on trunk as well, but the semicolon seems extraneous here:

Suggested change
# repair_disk_headroom_reject_ratio: 0.0;
# repair_disk_headroom_reject_ratio: 0.0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder on this tiny one ^

@jaydeepkumar1984
Copy link
Copy Markdown
Contributor

Thanks a lot @pauloricardomg for incorporating the review comments. I have also just reviewed the updated PR.
I am ready to +1.
Regarding the vote, I think we already had a consensus on this, and Paulo recently fyi'd the dev list. I think it should be ok as long as two of us officially +1 the PR.

@tolbertam
Copy link
Copy Markdown
Contributor

sounds good 👍. I'm having trouble running the upgrade python-dtests but everything else looking good other than the issue with the dtests missing the system property that I mentioned in the review. I think assuming that gets fixed up we should be all set. A final rebase would be good too, there's one new unit test failing (CompactionTaskTest#testFullyExpiredSSTablesAreNotReleasedPrematurely) that was added on 5.0 that addresses a fixed issue on cassandra-5.0 that this branch doesn't have yet. It'd be great to get a clean test run (other than known flakes) if we can.

@pauloricardomg
Copy link
Copy Markdown
Contributor Author

rebased and added missing system property to dtests on f8b14b0

There's also #4644 that adds an upgrade test to trunk that check that it's possible to run repair after upgrade from 5.0

  • Trunk pre-CI (should fail before the 5.0 version is merged)

@pauloricardomg
Copy link
Copy Markdown
Contributor Author

Once you are good with the changes and CI results can you +1 on https://issues.apache.org/jira/browse/CASSANDRA-21138 ? Thanks

@tolbertam
Copy link
Copy Markdown
Contributor

tolbertam commented Mar 13, 2026

taking a look now! I think only thing I noticed not addressed was just a tiny suggestion in cassandra.yaml (#4558 (comment)). I'm kicking off test suite now and will finish reviewing and if all looks good i'll give my +1!

@tolbertam
Copy link
Copy Markdown
Contributor

tests ran clean and everything looks great, +1 the PR and will also add comment to the JIRA.

@Kokokokoka
Copy link
Copy Markdown

@pauloricardomg
Copy link
Copy Markdown
Contributor Author

@jaydeepkumar1984 @tolbertam talking with other developers (mck, smiklosovic) we decided to ship this in 5.0.8 as a standalone release, so now that 5.0.7 is out I rebased this and updated the version check and references to 5.0.8(2d18e4b). @tolbertam would you mind taking a quick look and resubmitting CI again?

Also created #4644 to add a new trunk test to validate upgrade from 5.0.X if you can take a look and approve.

@tolbertam
Copy link
Copy Markdown
Contributor

sounds good, will kick off a new CI build and will also try to review 4644 as well!

for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0 ; do
# Note: cassandra-5.0.6 tag is used instead of cassandra-5.0 branch to enable
# testing upgrades from 5.0.6 to the current local build for autorepair feature
for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0.6 ; do
Copy link
Copy Markdown
Contributor

@tolbertam tolbertam Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we change this to cassandra-5.0.7 now?

*/
public class AutoRepairDisabledSchemaUpgradeTest extends UpgradeTestBase
{
private static final Semver v506 = new Semver("5.0.6", SemverType.STRICT);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static final Semver v506 = new Semver("5.0.6", SemverType.STRICT);
private static final Semver v506 = new Semver("5.0.7", SemverType.STRICT);

Version from = versions.get(v506);
Version to = AbstractCluster.CURRENT_VERSION;

assertNotNull("5.0.6 version not available - ensure dtest-5.0.6.jar is built", from);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertNotNull("5.0.6 version not available - ensure dtest-5.0.6.jar is built", from);
assertNotNull("5.0.7 version not available - ensure dtest-5.0.7.jar is built", from);

import static org.junit.Assert.assertNotNull;

/**
* Tests that upgrading from 5.0.6 (where auto-repair is not included) to the current version
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Tests that upgrading from 5.0.6 (where auto-repair is not included) to the current version
* Tests that upgrading from 5.0.7 (where auto-repair is not included) to the current version

I guess technically this test is still valid now that i think about it, but would be good to make it to the most proximate to the the version where autorepair is introduced.

@tolbertam
Copy link
Copy Markdown
Contributor

I attached CI results to the JIRA everything came back good. I see a change got committed overnight, hopefully that doesn't impact plans to do a standalone release!

pauloricardomg and others added 3 commits March 24, 2026 15:24
Includes bug fixes and features:
- Improved observability in AutoRepair (CASSANDRA-20581)
- Stop repair scheduler if two major versions detected (CASSANDRA-20048)
- Safeguard Full repair against disk protection (CASSANDRA-20045)
- Stop AutoRepair monitoring thread upon shutdown (CASSANDRA-20623)
- Fix race condition in auto-repair scheduler (CASSANDRA-20265)
- Minimum repair task duration setting (CASSANDRA-20160)
- Preview_repaired auto-repair type (CASSANDRA-20046)
- Gate auto-repair behind cassandra.autorepair.enable JVM property
- Add cassandra.autorepair.check_min_version to gate minimum version enforcement
- Prevent auto-repair from running if any node is below 5.0.7
- Make system_distributed auto-repair schema conditional on feature being enabled
- Add user-friendly errors for disabled auto-repair and schema incompatibility
- Disable repair_disk_headroom_reject_ratio by default and remove deprecated references

patch by Paulo Motta; reviewed by Jaydeepkumar Chovatia for CASSANDRA-21138

Co-Authored-By: Andy Tolbert <andy_tolbert@apple.com>
Co-Authored-By: Chris Lohfink <clohfink@netflix.com>
Co-Authored-By: Francisco Guerrero <frankgh@apache.org>
Co-Authored-By: Himanshu Jindal <himanshj@amazon.com>
Co-Authored-By: Jaydeepkumar Chovatia <jchovati@uber.com>
Co-Authored-By: Kristijonas Zalys <kzalys@uber.com>
Co-Authored-By: jaydeepkumar1984 <chovatia.jaydeep@gmail.com>
@pauloricardomg
Copy link
Copy Markdown
Contributor Author

good catch @tolbertam addressed remaining comments on b514b54 - rebased and submitted a new CI run

I will commit both after you have a chance to check #4644

@tolbertam
Copy link
Copy Markdown
Contributor

Just approved #4644, thanks @pauloricardomg ! I kicked off another round of CI on this branch for good measure, will upload the results later this afternoon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants