Skip to content

fix: startup races#3172

Draft
alpe wants to merge 2 commits intomainfrom
alex/fix_startup
Draft

fix: startup races#3172
alpe wants to merge 2 commits intomainfrom
alex/fix_startup

Conversation

@alpe
Copy link
Contributor

@alpe alpe commented Mar 17, 2026

Overview

E2E HA tests fail sometimes on a race when the leader is waiting for p2p sync complete on a fresh start.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b219a5f2-e0b1-439f-bcd4-6728d0844ba5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch alex/fix_startup
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can get early access to new features in CodeRabbit.

Enable the early_access setting to enable early access features such as new models, tools, and more.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 17, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedMar 17, 2026, 2:48 PM

@claude
Copy link
Contributor

claude bot commented Mar 17, 2026

Claude finished @alpe's task in 2m 28s —— View job


Code Review

  • Gather context and understand PR scope
  • Review pkg/sync/syncer_status.go changes
  • Review pkg/sync/sync_service.go changes
  • Review node/failover.go changes
  • Review execution/evm/execution.go changes
  • Review test files
  • Provide final review summary

Overview

This PR correctly identifies and fixes a real race: a raft leader with an empty store would deadlock inside initFromP2PWithRetry since no peer can serve block 1 yet. The fix introduces StartForPublishing mode that skips the blocking P2P sync init and defers it to WriteToStoreAndBroadcast. The changes are generally well-structured with good test coverage. A few issues noted below.


pkg/sync/syncer_status.go — ✅ Correct

The switch from atomic.Bool to a sync.Mutex-protected startOnce() is the right fix. The old pattern had a TOCTOU race: two goroutines could both read started=false then both call syncer.Start(). The new startOnce() atomizes the check + execute + set, preventing double-starts.

Nit: isStarted() acquires a full mutex just for a read that was previously an atomic load. Since isStarted() is no longer called in production code (only in tests), this is fine — but consider whether it needs to exist at all, or whether it should be package-private documentation for the test helper only.


pkg/sync/sync_service.go⚠️ One concern

StartForPublishing — well designed. The new method correctly skips initFromP2PWithRetry and starts the subscriber, allowing the gossip network to be ready before the first block is produced without blocking on P2P peers.

Timer leak fix — the replacement of time.After(backoff) in a loop with a single retryTimer.Reset(backoff) is correct. time.After in a hot loop leaks timer goroutines for the duration of each backoff.

Potential race in WriteToStoreAndBroadcast — the CAS-based storeInitialized guard has a window:

// pkg/sync/sync_service.go:136-149
if syncService.storeInitialized.CompareAndSwap(false, true) {
    storeInitialized, err = syncService.initStore(ctx, headerOrData)  // ← Goroutine A is here
    ...
}
firstStart, err := syncService.startSyncer(ctx)  // ← Goroutine B can reach here

If Goroutine A wins the CAS and is still inside initStore(), Goroutine B fails the CAS (sees true) and falls through to startSyncer(). The syncer can start operating on a partially-initialized or empty store. startOnce() correctly prevents a double-start, but doesn't prevent the syncer from starting before the store is ready. This is pre-existing, not introduced by this PR, but worth documenting or addressing.

Fix this →


node/failover.go — ✅ Correct

shouldStartSyncInPublisherMode is conservative and correct: it only skips blocking P2P sync when all three conditions are met: aggregator + raft leader + empty local store. Non-aggregators, followers, and nodes with existing data all use the standard blocking path.

The concurrent startup of header and data sync services via syncWg is well-structured; both switch modes atomically via the pre-computed startSyncInPublisherMode bool.


execution/evm/execution.go — ✅ Good defensive fix

The new check in reconcileExecutionAtHeight validates that a stored ExecStageStarted execution's inputs (timestamp + tx hash) match the current call before resuming via payload ID. This prevents a stale payload from a different block attempt being incorrectly resumed.

The hashTxs() extraction is clean. Note that hashTxs(nil) returns nil while hashTxs([][]byte{}) also returns nil (both have len==0), which is consistent with the prior behavior.


Tests — ⚠️ One bug

TestHeaderSyncServiceRestart has a dead loop: sync_service_test.go:183

for i := signedHeader.Height() + 1; i < 2; i++ {

signedHeader.Height() is 9 after the first write loop (which runs i < 10), so i starts at 10 and the condition i < 2 is immediately false — the loop body never executes. The second WriteToStoreAndBroadcast call that was supposed to verify post-restart broadcasting is dead code. This should likely be i < signedHeader.Height() + 3 or similar.

Fix this →

TestSyncerStatusStartOnce/concurrent_start_only_runs_once — well designed. The started channel synchronizes the test to confirm the start function is actively running before releasing all goroutines, which correctly validates the mutex prevents concurrent double-starts.

TestReconcileExecutionAtHeight_StartedExecMeta — good coverage of the three new input-matching cases (match, timestamp mismatch, tx mismatch).


Minor observations

  • e2e/failover_e2e_test.go has significant duplication between TestLeaseFailoverE2E and TestHASequencerRollingRestartE2E for setup, node starting, and verification. Not a bug, but a candidate for refactoring in a follow-up.
  • The 30-second p2pInitTimeout in initFromP2PWithRetry is hardcoded. A config option or named constant would make it easier to tune in tests.

Summary

The core fix (publisher mode for raft leaders with empty stores + SyncerStatus.startOnce() for race-free syncer startup) is correct and addresses the described race condition. Two actionable items:

  1. The dead loop in TestHeaderSyncServiceRestart should be fixed — it silently provides no test coverage for post-restart broadcasting.
  2. The WriteToStoreAndBroadcast store-init / syncer-start window is worth documenting even if deferring the fix.

@codecov
Copy link

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 54.65116% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.40%. Comparing base (8849e05) to head (24cb707).

Files with missing lines Patch % Lines
pkg/sync/sync_service.go 47.36% 13 Missing and 7 partials ⚠️
node/failover.go 37.03% 11 Missing and 6 partials ⚠️
pkg/sync/syncer_status.go 90.47% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3172   +/-   ##
=======================================
  Coverage   60.40%   60.40%           
=======================================
  Files         115      115           
  Lines       11902    11961   +59     
=======================================
+ Hits         7189     7225   +36     
- Misses       3905     3920   +15     
- Partials      808      816    +8     
Flag Coverage Δ
combined 60.40% <54.65%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant