[CASSANDRA-21138] Add autorepair to Cassandra 5.0#4558
[CASSANDRA-21138] Add autorepair to Cassandra 5.0#4558pauloricardomg wants to merge 4 commits intoapache:cassandra-5.0from
Conversation
00f48df to
29b0090
Compare
.build/run-tests.sh
Outdated
| for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0 ; do | ||
| # Note: cassandra-5.0.6 tag is used instead of cassandra-5.0 branch to enable | ||
| # testing upgrades from 5.0.6 to the current local build for autorepair feature | ||
| for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0.6 ; do |
There was a problem hiding this comment.
The cassandra-5.0 dtest jar was unused since we don't do tests upgrading from the current version to the current version so I modified this to build the cassandra-5.0.6 dtest jar where autorepair is not included to test a minor version upgrade on AutoRepairDisabledSchemaUpgradeTest.
| echo "Reusing ${TMP_DIR}/cassandra-dtest-jars for past branch dtest jars" | ||
| if [ "x" == "x${OFFLINE}" ] ; then | ||
| until git -C ${TMP_DIR}/cassandra-dtest-jars fetch --quiet origin ; do echo "git -C ${TMP_DIR}/cassandra-dtest-jars fetch failed… trying again… " ; done | ||
| until git -C ${TMP_DIR}/cassandra-dtest-jars fetch --quiet --tags origin ; do echo "git -C ${TMP_DIR}/cassandra-dtest-jars fetch failed… trying again… " ; done |
There was a problem hiding this comment.
the --tags is needed because cassandra-5.0.6 is a tag and not a branch.
29b0090 to
b866027
Compare
| * Versions at or below this version (5.0.6) do not support auto-repair. | ||
| */ | ||
| @VisibleForTesting | ||
| static final CassandraVersion LAST_UNSUPPORTED_VERSION_FOR_AUTO_REPAIR = new CassandraVersion("5.0.6"); |
There was a problem hiding this comment.
This needs to be updated to the latest version before the release version of autorepair in 5.0. The reason for this is because we want the tests to pass if it's running 5.0.7-SNAPSHOT which is below 5.0.7 but above 5.0.6.
| if (currentRepairStatus == null || parallelRepairNumber > currentRepairStatus.hostIdsWithOnGoingRepair.size()) | ||
| { | ||
| // more repairs can be run, I might be the new one | ||
| if (autoRepairHistories != null) |
There was a problem hiding this comment.
@jaydeepkumar1984 do you recall why do we re-fetch the repair history here? In what circumstances can autoRepairHistories can be null, is this just on an uninitialized system ? I am wondering if this could cause a race where multiple nodes will attempt to run repair more than the allowed parallelNumber. Should we not perform this re-fetch outside the if parallelRepairNumber > currentRepairStatus.hostIdsWithOnGoingRepair.size() ?
There was a problem hiding this comment.
I do not recall any specific reason, so this seems unnecessary. We can remove ti from here and refresh the autoRepairHistories above in line no. 876. which should be sufficient.
CurrentRepairStatus currentRepairStatus = getCurrentRepairStatus(repairType, getAutoRepairHistory(repairType), myId);
There was a problem hiding this comment.
updated on 13f0b29 let me know if this is what you had in mind
b866027 to
13f0b29
Compare
| } | ||
|
|
||
| @VisibleForTesting | ||
| public static void setInitialTokens(String initial_token) |
| logger.info("Updated CDC block_writes from {} to {}", oldVal, val); | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
nit: unwanted change
| if (row.has("incremental_backups")) | ||
| builder.incrementalBackups(row.getBoolean("incremental_backups")); | ||
|
|
||
| // auto_repair column was introduced in 5.1 |
There was a problem hiding this comment.
nit: "auto_repair column was introduced in 5.1" -> "auto_repair column was introduced in 5.0.7"
| import com.google.common.collect.ImmutableList; | ||
| import com.vdurmont.semver4j.Semver; | ||
| import com.vdurmont.semver4j.SemverException; | ||
|
|
There was a problem hiding this comment.
nit: unwanted code change
| Object[][] result1 = cluster.coordinator(1).execute("SELECT * FROM " + KEYSPACE + ".tbl WHERE pk = 1", ConsistencyLevel.ALL); | ||
| Object[][] result2 = cluster.coordinator(2).execute("SELECT * FROM " + KEYSPACE + ".tbl WHERE pk = 1", ConsistencyLevel.ALL); | ||
| assertEquals("Data should be readable from node 1", 1, result1.length); | ||
| assertEquals("Data should be readable from node 2", 1, result2.length); |
There was a problem hiding this comment.
Let's extend the test case here by downgrading node1 to 5.0.6, so that would also complete the rollback scenario.
|
Overall, the changes look great. A few nits and a comment in the test case. Please take a look at it. |
|
Hello. can an option be added to skip TWCS tables automatically? |
|
@Kokokokoka can you create a new issue requesting this feature? https://issues.apache.org/jira/secure/CreateIssue!default.jspa |
13f0b29 to
d7eb86f
Compare
|
@jaydeepkumar1984 addressed review comments on d7eb86f |
Thanks. I will do one more round next week and then likely approve it. |
src/java/org/apache/cassandra/cql3/statements/schema/TableAttributes.java
Show resolved
Hide resolved
src/java/org/apache/cassandra/schema/SystemDistributedKeyspace.java
Outdated
Show resolved
Hide resolved
test/distributed/org/apache/cassandra/distributed/test/repair/AutoRepairTablePropertyDTest.java
Outdated
Show resolved
Hide resolved
d7eb86f to
0bdf37c
Compare
| * gen 7: add auto_repair_history and auto_repair_priority tables for AutoRepair feature | ||
| */ | ||
| public static final long GENERATION = 6; | ||
| public static final long GENERATION = 7; |
There was a problem hiding this comment.
Should we also then maintain the same version if CassandraRelevantProperties.AUTOREPAIR_ENABLE.getBoolean() is false?
There was a problem hiding this comment.
was sort of worried about this, but I see that 7 is whats used for trunk for AutoRepair anyways, and has since advanced to 8 for zstd dictionary compression, so this shouldn't be a problem.
46f1c84 to
0f8e9fb
Compare
|
Thanks for making the suggested changes! I will review this weekend |
tolbertam
left a comment
There was a problem hiding this comment.
This is looking great, played with this a little bit this weekend and it worked well.
I ran through all tests except the python upgrade dtests (which i ran into some configuration issues for) and everything worked except some AutoRepair dtests were missing the property to indicate auto repair is enabled (added comment for that). Other than that, one small cassandra.yaml issue which also exists on trunk that would be nice to fix here.
Did hit a couple known repeated flakes (https://issues.apache.org/jira/browse/CASSANDRA-20933, https://issues.apache.org/jira/browse/CASSANDRA-18098), but everything else appears to have run well.
Kicked off upgrade dtests overnight and will share how that went tomorrow. Assuming the property gets added for the auto repair dtests, i should be +1 on this. I'm wondering if we should just do a last formal vote for this before merging in assuming Jaydeep also +1s?
| # repair as anti-compaction during repair may contribute to additional space temporarily. | ||
| # For example, setting this to 0.2 means at least 20% of disk must be unused. | ||
| # Set to 0.0 to disable this check. Defaults to 0.0 (disabled) on 5.0 for backward-compatibility. | ||
| # repair_disk_headroom_reject_ratio: 0.0; |
There was a problem hiding this comment.
Haven't noticed before but as it seems the case on trunk as well, but the semicolon seems extraneous here:
| # repair_disk_headroom_reject_ratio: 0.0; | |
| # repair_disk_headroom_reject_ratio: 0.0 |
There was a problem hiding this comment.
reminder on this tiny one ^
...ributed/test/repair/AutoRepairSchedulerDisallowParallelReplicaRepairAcrossSchedulesTest.java
Show resolved
Hide resolved
|
Thanks a lot @pauloricardomg for incorporating the review comments. I have also just reviewed the updated PR. |
|
sounds good 👍. I'm having trouble running the upgrade python-dtests but everything else looking good other than the issue with the dtests missing the system property that I mentioned in the review. I think assuming that gets fixed up we should be all set. A final rebase would be good too, there's one new unit test failing ( |
e016d47 to
f8b14b0
Compare
|
Once you are good with the changes and CI results can you +1 on https://issues.apache.org/jira/browse/CASSANDRA-21138 ? Thanks |
|
taking a look now! I think only thing I noticed not addressed was just a tiny suggestion in cassandra.yaml (#4558 (comment)). I'm kicking off test suite now and will finish reviewing and if all looks good i'll give my +1! |
|
tests ran clean and everything looks great, +1 the PR and will also add comment to the JIRA. |
|
f8b14b0 to
2d18e4b
Compare
|
@jaydeepkumar1984 @tolbertam talking with other developers (mck, smiklosovic) we decided to ship this in 5.0.8 as a standalone release, so now that 5.0.7 is out I rebased this and updated the version check and references to 5.0.8(2d18e4b). @tolbertam would you mind taking a quick look and resubmitting CI again? Also created #4644 to add a new trunk test to validate upgrade from 5.0.X if you can take a look and approve. |
|
sounds good, will kick off a new CI build and will also try to review 4644 as well! |
.build/run-tests.sh
Outdated
| for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0 ; do | ||
| # Note: cassandra-5.0.6 tag is used instead of cassandra-5.0 branch to enable | ||
| # testing upgrades from 5.0.6 to the current local build for autorepair feature | ||
| for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0.6 ; do |
There was a problem hiding this comment.
nit: can we change this to cassandra-5.0.7 now?
| */ | ||
| public class AutoRepairDisabledSchemaUpgradeTest extends UpgradeTestBase | ||
| { | ||
| private static final Semver v506 = new Semver("5.0.6", SemverType.STRICT); |
There was a problem hiding this comment.
| private static final Semver v506 = new Semver("5.0.6", SemverType.STRICT); | |
| private static final Semver v506 = new Semver("5.0.7", SemverType.STRICT); |
| Version from = versions.get(v506); | ||
| Version to = AbstractCluster.CURRENT_VERSION; | ||
|
|
||
| assertNotNull("5.0.6 version not available - ensure dtest-5.0.6.jar is built", from); |
There was a problem hiding this comment.
| assertNotNull("5.0.6 version not available - ensure dtest-5.0.6.jar is built", from); | |
| assertNotNull("5.0.7 version not available - ensure dtest-5.0.7.jar is built", from); |
| import static org.junit.Assert.assertNotNull; | ||
|
|
||
| /** | ||
| * Tests that upgrading from 5.0.6 (where auto-repair is not included) to the current version |
There was a problem hiding this comment.
| * Tests that upgrading from 5.0.6 (where auto-repair is not included) to the current version | |
| * Tests that upgrading from 5.0.7 (where auto-repair is not included) to the current version |
I guess technically this test is still valid now that i think about it, but would be good to make it to the most proximate to the the version where autorepair is introduced.
|
I attached CI results to the JIRA everything came back good. I see a change got committed overnight, hopefully that doesn't impact plans to do a standalone release! |
Includes bug fixes and features: - Improved observability in AutoRepair (CASSANDRA-20581) - Stop repair scheduler if two major versions detected (CASSANDRA-20048) - Safeguard Full repair against disk protection (CASSANDRA-20045) - Stop AutoRepair monitoring thread upon shutdown (CASSANDRA-20623) - Fix race condition in auto-repair scheduler (CASSANDRA-20265) - Minimum repair task duration setting (CASSANDRA-20160) - Preview_repaired auto-repair type (CASSANDRA-20046) - Gate auto-repair behind cassandra.autorepair.enable JVM property - Add cassandra.autorepair.check_min_version to gate minimum version enforcement - Prevent auto-repair from running if any node is below 5.0.7 - Make system_distributed auto-repair schema conditional on feature being enabled - Add user-friendly errors for disabled auto-repair and schema incompatibility - Disable repair_disk_headroom_reject_ratio by default and remove deprecated references patch by Paulo Motta; reviewed by Jaydeepkumar Chovatia for CASSANDRA-21138 Co-Authored-By: Andy Tolbert <andy_tolbert@apple.com> Co-Authored-By: Chris Lohfink <clohfink@netflix.com> Co-Authored-By: Francisco Guerrero <frankgh@apache.org> Co-Authored-By: Himanshu Jindal <himanshj@amazon.com> Co-Authored-By: Jaydeepkumar Chovatia <jchovati@uber.com> Co-Authored-By: Kristijonas Zalys <kzalys@uber.com> Co-Authored-By: jaydeepkumar1984 <chovatia.jaydeep@gmail.com>
2d18e4b to
b514b54
Compare
|
good catch @tolbertam addressed remaining comments on b514b54 - rebased and submitted a new CI run I will commit both after you have a chance to check #4644 |
|
Just approved #4644, thanks @pauloricardomg ! I kicked off another round of CI on this branch for good measure, will upload the results later this afternoon. |
Thanks for sending a pull request! Here are some tips if you're new here:
Commit messages should follow the following format:
The Cassandra Jira