Bound Ratis reconfiguration retries to avoid stuck region migration by CRZbulabula · Pull Request #17894 · apache/iotdb

CRZbulabula · 2026-06-10T01:55:38Z

Problem

While benchmarking concurrent reads and writes and scaling in one healthy DataNode with a 3-replica Ratis schema region, killing (kill -9) the DataNode that hosts an ADDING peer mid-migration leaves the schema-region AddRegionPeer task stuck indefinitely (observed: CHECK_ADD_REGION_PEER / WAITING for 50+ minutes).

Root cause

Ratis reconfiguration (add/remove peer) requires every new staging peer to catch up before the leader can commit the new configuration. If the ADDING peer is killed and can never catch up:

- On the leader, the reconfiguration stays in the staging state and each setConfiguration attempt is rejected with ReconfigurationInProgressException.
- The Ratis client used for reconfiguration uses an unbounded retry policy (retryForeverWithSleep(2s) in the former RatisEndlessRetryPolicy), so it retries forever.
- The coordinator DataNode's AddRegionPeerTask.run() is a plain synchronous Runnable blocked inside addRemotePeer → setConfiguration, with no cancel checkpoint. It therefore never returns and the task status stays PROCESSING indefinitely.

Because the task never leaves PROCESSING, the existing CANCEL ALL MIGRATIONS mechanism (cooperative, only effective at a safe point) cannot recover this case either — there is no safe point to stop at.
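
To illustrate why a cooperative cancel cannot help here, the following is a minimal sketch of a task that checks a cancel flag only between steps. The class and method names are hypothetical and are not taken from the IoTDB code base; the sketch only shows the general pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative only: hypothetical names, not the IoTDB migration task. */
final class CooperativeCancelSketch {

  /**
   * Runs the steps in order and checks the cancel flag only between steps
   * (the "safe points"). Returns how many steps were executed.
   */
  static int runUntilCancelled(List<Runnable> steps, AtomicBoolean cancelRequested) {
    int completed = 0;
    for (Runnable step : steps) {
      if (cancelRequested.get()) {
        break; // safe point: the flag is only observed here
      }
      step.run(); // if this call never returns, no later safe point is reached
      completed++;
    }
    return completed;
  }

  public static void main(String[] args) {
    AtomicBoolean cancel = new AtomicBoolean(false);
    List<Runnable> steps = Arrays.asList(() -> {}, () -> cancel.set(true), () -> {});
    System.out.println(runUntilCancelled(steps, cancel) + " of " + steps.size() + " steps ran");
  }
}
```

A step that blocks forever, as setConfiguration does under an unbounded retry policy, keeps the loop inside `step.run()`, so the flag is never read again.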

Fix

Bound the reconfiguration retries instead of retrying forever. The former RatisEndlessRetryPolicy now uses:

RetryPolicies.retryUpToMaximumCountWithFixedSleep(
    config.getReconfigurationMaxRetryAttempts(),   // default 600
    TimeDuration.valueOf(2, TimeUnit.SECONDS));

After the bound is exhausted, the last failure propagates (ConsensusException) so the upper layer (AddRegionPeerProcedure) can fail and roll back the migration — and CANCEL becomes reachable again, since the DataNode task finally leaves PROCESSING.
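
The behaviour described above (retry a fixed number of times with a fixed sleep, then surface the last failure) can be sketched without any Ratis dependency. This is a hand-rolled loop under stated assumptions, not the Ratis RetryPolicy implementation: `BoundedRetrySketch`, `callWithBoundedRetry` and `RetriesExhaustedException` are hypothetical names, and whether Ratis counts attempts or retries against its limit is not something this sketch asserts.

```java
import java.util.function.Supplier;

/** Illustrative only: a hand-rolled bounded retry loop, not the Ratis RetryPolicy machinery. */
final class BoundedRetrySketch {

  /** Hypothetical exception raised once the attempt budget is spent. */
  static final class RetriesExhaustedException extends RuntimeException {
    RetriesExhaustedException(String message, Throwable lastFailure) {
      super(message, lastFailure);
    }
  }

  /** Calls the action up to maxAttempts times, sleeping sleepMillis between failed attempts. */
  static <T> T callWithBoundedRetry(Supplier<T> action, int maxAttempts, long sleepMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (RuntimeException e) {
        lastFailure = e; // remember the most recent failure so it can be surfaced
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve the interrupt for the caller
          throw new RetriesExhaustedException("interrupted while waiting to retry", ie);
        }
      }
    }
    throw new RetriesExhaustedException("gave up after " + maxAttempts + " attempts", lastFailure);
  }

  public static void main(String[] args) {
    int[] calls = {0};
    try {
      callWithBoundedRetry(
          () -> {
            calls[0]++;
            throw new IllegalStateException("reconfiguration in progress");
          },
          3,
          10L);
    } catch (RetriesExhaustedException e) {
      System.out.println(e.getMessage() + " (" + calls[0] + " calls), cause: " + e.getCause().getMessage());
    }
  }
}
```

Because the loop always terminates, the caller regains control and can decide to fail or roll back, which is the property the unbounded policy lacked.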

The class/factory were renamed EndlessRetryFactory / RatisEndlessRetryPolicy → ReconfigurationRetryFactory / RatisReconfigurationRetryPolicy, since the policy is no longer endless.

New configuration

Per-consensus-group, pushed from ConfigNode to DataNode via TRatisConfig:

| property | default |
| --- | --- |
| `config_node_ratis_reconfiguration_max_retry_attempts` | 600 |
| `schema_region_ratis_reconfiguration_max_retry_attempts` | 600 |
| `data_region_ratis_reconfiguration_max_retry_attempts` | 600 |

Reconfiguration retries use a fixed 2s interval, so the bound roughly caps the wait at attempts × 2s (600 ≈ 20 min). Tune lower for faster failover, higher for very large regions.
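
As a rough aid for tuning, the cap can be computed as attempts × 2s. The helper below is a hypothetical sketch (`ReconfigurationWaitCap` is not part of IoTDB) and ignores the time each attempt itself takes, so the real wait may be somewhat longer.

```java
import java.time.Duration;

/** Illustrative only: approximates the wait cap implied by a retry-attempt limit. */
final class ReconfigurationWaitCap {

  /** Fixed sleep between reconfiguration retries, as stated in this PR. */
  static final Duration RETRY_INTERVAL = Duration.ofSeconds(2);

  /** Approximate upper bound on total sleep time: attempts x 2s. */
  static Duration approximateCap(int maxRetryAttempts) {
    return RETRY_INTERVAL.multipliedBy(maxRetryAttempts);
  }

  public static void main(String[] args) {
    for (int attempts : new int[] {30, 60, 600}) {
      System.out.println(
          attempts + " attempts -> about " + approximateCap(attempts).toMinutes() + " min");
    }
    // prints:
    // 30 attempts -> about 1 min
    // 60 attempts -> about 2 min
    // 600 attempts -> about 20 min
  }
}
```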

Compatibility (rolling upgrade)

The two new TRatisConfig fields are optional. A ConfigNode running the old version will not set them, in which case the DataNode keeps its local default instead of overwriting it with 0 (guarded by isSet...() in IoTDBDescriptor).
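
The guard can be sketched as follows. `RatisConfigStub` is a hypothetical stand-in that mimics the `isSet...()` accessor Thrift generates for optional fields; the field and method names are assumptions for illustration, not the actual TRatisConfig or IoTDBDescriptor code.

```java
/** Illustrative only: shows why an unset optional int must not overwrite a local default. */
final class OptionalFieldGuardSketch {

  /** Hypothetical stand-in for a Thrift-generated struct with one optional int field. */
  static final class RatisConfigStub {
    private int maxRetryAttempts; // a primitive int reads as 0 when never set
    private boolean maxRetryAttemptsSet;

    RatisConfigStub setMaxRetryAttempts(int value) {
      this.maxRetryAttempts = value;
      this.maxRetryAttemptsSet = true;
      return this;
    }

    boolean isSetMaxRetryAttempts() {
      return maxRetryAttemptsSet;
    }

    int getMaxRetryAttempts() {
      return maxRetryAttempts;
    }
  }

  /** Takes the pushed value only when the sender actually set it; otherwise keeps the local default. */
  static int resolveMaxRetryAttempts(RatisConfigStub pushed, int localDefault) {
    return pushed.isSetMaxRetryAttempts() ? pushed.getMaxRetryAttempts() : localDefault;
  }

  public static void main(String[] args) {
    RatisConfigStub fromOldConfigNode = new RatisConfigStub(); // field never set
    RatisConfigStub fromNewConfigNode = new RatisConfigStub().setMaxRetryAttempts(30);
    System.out.println(resolveMaxRetryAttempts(fromOldConfigNode, 600)); // 600
    System.out.println(resolveMaxRetryAttempts(fromNewConfigNode, 600)); // 30
  }
}
```

Without the guard, reading the unset field would yield 0 and silently replace the local default.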

Verification

- mvn compile passes for iotdb-core/consensus, iotdb-core/confignode, iotdb-core/datanode (thrift regenerated).
- spotless:check is clean.

Note for reviewers

- The default 600 (~20 min) favors not regressing legitimate slow addPeer (large snapshot transfer) over fast failover. Happy to adjust: schema-region snapshots are small, so a smaller schema_region_* default (e.g. 30–60) may be preferable.
- An integration test reproducing the killed-ADDING-peer scenario can be added as a follow-up if desired.

🤖 Generated with Claude Code

When scaling in a DataNode, if a peer that is being ADDED to a Ratis schema/data region (during region migration) is killed before it catches up, the leader's reconfiguration can never commit. The Ratis client used for reconfiguration retried forever (retryForeverWithSleep), so the coordinator DataNode's AddRegionPeerTask blocked indefinitely inside setConfiguration, leaving the migration permanently stuck -- and CANCEL ineffective, since the task never left PROCESSING.

Bound the reconfiguration retries instead of retrying forever: after the limit is exhausted the last failure propagates, so the upper layer can fail and roll back the migration (and CANCEL becomes reachable again).

Expose the limit as a per-group config, pushed from ConfigNode to DataNode via TRatisConfig (optional fields, for rolling-upgrade safety):

- config_node_ratis_reconfiguration_max_retry_attempts
- schema_region_ratis_reconfiguration_max_retry_attempts
- data_region_ratis_reconfiguration_max_retry_attempts

Default 600 attempts at the fixed 2s interval (~20min cap).

Also rename the now-misnamed EndlessRetryFactory / RatisEndlessRetryPolicy to Reconfiguration*, since the policy is no longer endless.

CRZbulabula · 2026-06-10T02:08:27Z

Superseded by #17895, recreated from apache/iotdb:enhance-ratis-reconfiguration-retry-limit instead of the personal fork branch.

CRZbulabula mentioned this pull request Jun 10, 2026

Bound Ratis reconfiguration retries and add region migration ITs #17895

Merged

CRZbulabula closed this Jun 10, 2026

CRZbulabula wants to merge 1 commit into apache:master from CRZbulabula:enhance-ratis-reconfiguration-retry-limit
