Skip to content

Bound Ratis reconfiguration retries to avoid stuck region migration#17894

Closed
CRZbulabula wants to merge 1 commit into
apache:masterfrom
CRZbulabula:enhance-ratis-reconfiguration-retry-limit
Closed

Bound Ratis reconfiguration retries to avoid stuck region migration#17894
CRZbulabula wants to merge 1 commit into
apache:masterfrom
CRZbulabula:enhance-ratis-reconfiguration-retry-limit

Conversation

@CRZbulabula

Copy link
Copy Markdown
Contributor

Problem

While benchmarking concurrent read/write and scaling in one healthy DataNode on a Ratis (3-replica) schema region, killing (kill -9) the DataNode that hosts an ADDING peer mid-migration leaves the schema-region AddRegionPeer task stuck forever (observed: CHECK_ADD_REGION_PEER / WAITING for 50+ minutes).

Root cause

Ratis reconfiguration (add/remove peer) requires every new staging peer to catch up before the leader can commit the new configuration. If the ADDING peer is killed and can never catch up:

  1. On the leader, the reconfiguration stays in the staging state and each setConfiguration attempt is rejected with ReconfigurationInProgressException.
  2. The Ratis client used for reconfiguration uses an unbounded retry policy (retryForeverWithSleep(2s) in the former RatisEndlessRetryPolicy), so it retries forever.
  3. The coordinator DataNode's AddRegionPeerTask.run() is a plain synchronous Runnable blocked inside addRemotePeer → setConfiguration, with no cancel checkpoint. It therefore never returns and the task status stays PROCESSING indefinitely.

Because the task never leaves PROCESSING, the existing CANCEL ALL MIGRATIONS mechanism (cooperative, only effective at a safe point) cannot recover this case either — there is no safe point to stop at.

Fix

Bound the reconfiguration retries instead of retrying forever. The former RatisEndlessRetryPolicy now uses:

RetryPolicies.retryUpToMaximumCountWithFixedSleep(
    config.getReconfigurationMaxRetryAttempts(),   // default 600
    TimeDuration.valueOf(2, TimeUnit.SECONDS));

After the bound is exhausted, the last failure propagates (ConsensusException) so the upper layer (AddRegionPeerProcedure) can fail and roll back the migration — and CANCEL becomes reachable again, since the DataNode task finally leaves PROCESSING.

The class/factory were renamed EndlessRetryFactory / RatisEndlessRetryPolicyReconfigurationRetryFactory / RatisReconfigurationRetryPolicy, since the policy is no longer endless.

New configuration

Per-consensus-group, pushed from ConfigNode to DataNode via TRatisConfig:

property default
config_node_ratis_reconfiguration_max_retry_attempts 600
schema_region_ratis_reconfiguration_max_retry_attempts 600
data_region_ratis_reconfiguration_max_retry_attempts 600

Reconfiguration retries use a fixed 2s interval, so the bound roughly caps the wait at attempts × 2s (600 ≈ 20 min). Tune lower for faster failover, higher for very large regions.

Compatibility (rolling upgrade)

The two new TRatisConfig fields are optional. A ConfigNode running the old version will not set them, in which case the DataNode keeps its local default instead of overwriting it with 0 (guarded by isSet...() in IoTDBDescriptor).

Verification

  • mvn compile passes for iotdb-core/consensus, iotdb-core/confignode, iotdb-core/datanode (thrift regenerated).
  • spotless:check is clean.

Note for reviewers

  • The default 600 (~20 min) favors not regressing legitimate slow addPeer (large snapshot transfer) over fast failover. Happy to adjust — schema-region snapshots are small, so a smaller schema_region_* default (e.g. 30–60) may be preferable.
  • An integration test reproducing the killed-ADDING-peer scenario can be added as a follow-up if desired.

🤖 Generated with Claude Code

When scaling in a DataNode, if a peer that is being ADDED to a Ratis
schema/data region (during region migration) is killed before it catches
up, the leader's reconfiguration can never commit. The Ratis client used
for reconfiguration retried forever (retryForeverWithSleep), so the
coordinator DataNode's AddRegionPeerTask blocked indefinitely inside
setConfiguration, leaving the migration permanently stuck -- and CANCEL
ineffective, since the task never left PROCESSING.

Bound the reconfiguration retries instead of retrying forever: after the
limit is exhausted the last failure propagates, so the upper layer can
fail and roll back the migration (and CANCEL becomes reachable again).

Expose the limit as a per-group config, pushed from ConfigNode to
DataNode via TRatisConfig (optional fields, for rolling-upgrade safety):
- config_node_ratis_reconfiguration_max_retry_attempts
- schema_region_ratis_reconfiguration_max_retry_attempts
- data_region_ratis_reconfiguration_max_retry_attempts
Default 600 attempts at the fixed 2s interval (~20min cap).

Also rename the now-misnamed EndlessRetryFactory /
RatisEndlessRetryPolicy to Reconfiguration*, since the policy is no
longer endless.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Superseded by #17895, recreated from apache/iotdb:enhance-ratis-reconfiguration-retry-limit instead of the personal fork branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant