Bound Ratis reconfiguration retries to avoid stuck region migration#17894
Closed
CRZbulabula wants to merge 1 commit into
Closed
Bound Ratis reconfiguration retries to avoid stuck region migration#17894CRZbulabula wants to merge 1 commit into
CRZbulabula wants to merge 1 commit into
Conversation
When scaling in a DataNode, if a peer that is being ADDED to a Ratis schema/data region (during region migration) is killed before it catches up, the leader's reconfiguration can never commit. The Ratis client used for reconfiguration retried forever (retryForeverWithSleep), so the coordinator DataNode's AddRegionPeerTask blocked indefinitely inside setConfiguration, leaving the migration permanently stuck -- and CANCEL ineffective, since the task never left PROCESSING. Bound the reconfiguration retries instead of retrying forever: after the limit is exhausted the last failure propagates, so the upper layer can fail and roll back the migration (and CANCEL becomes reachable again). Expose the limit as a per-group config, pushed from ConfigNode to DataNode via TRatisConfig (optional fields, for rolling-upgrade safety): - config_node_ratis_reconfiguration_max_retry_attempts - schema_region_ratis_reconfiguration_max_retry_attempts - data_region_ratis_reconfiguration_max_retry_attempts Default 600 attempts at the fixed 2s interval (~20min cap). Also rename the now-misnamed EndlessRetryFactory / RatisEndlessRetryPolicy to Reconfiguration*, since the policy is no longer endless.
Contributor
Author
|
Superseded by #17895, recreated from apache/iotdb:enhance-ratis-reconfiguration-retry-limit instead of the personal fork branch. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
While benchmarking concurrent read/write and scaling in one healthy DataNode on a Ratis (3-replica) schema region, killing (
kill -9) the DataNode that hosts an ADDING peer mid-migration leaves the schema-regionAddRegionPeertask stuck forever (observed:CHECK_ADD_REGION_PEER/WAITINGfor 50+ minutes).Root cause
Ratis reconfiguration (add/remove peer) requires every new staging peer to catch up before the leader can commit the new configuration. If the ADDING peer is killed and can never catch up:
setConfigurationattempt is rejected withReconfigurationInProgressException.retryForeverWithSleep(2s)in the formerRatisEndlessRetryPolicy), so it retries forever.AddRegionPeerTask.run()is a plain synchronousRunnableblocked insideaddRemotePeer → setConfiguration, with no cancel checkpoint. It therefore never returns and the task status staysPROCESSINGindefinitely.Because the task never leaves
PROCESSING, the existing CANCEL ALL MIGRATIONS mechanism (cooperative, only effective at a safe point) cannot recover this case either — there is no safe point to stop at.Fix
Bound the reconfiguration retries instead of retrying forever. The former
RatisEndlessRetryPolicynow uses:After the bound is exhausted, the last failure propagates (
ConsensusException) so the upper layer (AddRegionPeerProcedure) can fail and roll back the migration — andCANCELbecomes reachable again, since the DataNode task finally leavesPROCESSING.The class/factory were renamed
EndlessRetryFactory/RatisEndlessRetryPolicy→ReconfigurationRetryFactory/RatisReconfigurationRetryPolicy, since the policy is no longer endless.New configuration
Per-consensus-group, pushed from ConfigNode to DataNode via
TRatisConfig:config_node_ratis_reconfiguration_max_retry_attemptsschema_region_ratis_reconfiguration_max_retry_attemptsdata_region_ratis_reconfiguration_max_retry_attemptsReconfiguration retries use a fixed 2s interval, so the bound roughly caps the wait at
attempts × 2s(600 ≈ 20 min). Tune lower for faster failover, higher for very large regions.Compatibility (rolling upgrade)
The two new
TRatisConfigfields areoptional. A ConfigNode running the old version will not set them, in which case the DataNode keeps its local default instead of overwriting it with0(guarded byisSet...()inIoTDBDescriptor).Verification
mvn compilepasses foriotdb-core/consensus,iotdb-core/confignode,iotdb-core/datanode(thrift regenerated).spotless:checkis clean.Note for reviewers
600(~20 min) favors not regressing legitimate slowaddPeer(large snapshot transfer) over fast failover. Happy to adjust — schema-region snapshots are small, so a smallerschema_region_*default (e.g. 30–60) may be preferable.🤖 Generated with Claude Code