fix: startup races by alpe · Pull Request #3172 · evstack/ev-node

alpe · 2026-03-17T13:16:12Z

Overview

E2E HA tests fail sometimes on a race when the leader is waiting for p2p sync complete on a fresh start.

coderabbitai · 2026-03-17T13:16:18Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b219a5f2-e0b1-439f-bcd4-6728d0844ba5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch alex/fix_startup

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can get early access to new features in CodeRabbit.

Enable the early_access setting to enable early access features such as new models, tools, and more.

github-actions · 2026-03-17T13:16:25Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Mar 17, 2026, 2:48 PM

claude · 2026-03-17T13:16:31Z

Claude finished @alpe's task in 2m 28s —— View job

Code Review

Gather context and understand PR scope
Review pkg/sync/syncer_status.go changes
Review pkg/sync/sync_service.go changes
Review node/failover.go changes
Review execution/evm/execution.go changes
Review test files
Provide final review summary

Overview

This PR correctly identifies and fixes a real race: a raft leader with an empty store would deadlock inside initFromP2PWithRetry since no peer can serve block 1 yet. The fix introduces StartForPublishing mode that skips the blocking P2P sync init and defers it to WriteToStoreAndBroadcast. The changes are generally well-structured with good test coverage. A few issues noted below.

`pkg/sync/syncer_status.go` — ✅ Correct

The switch from atomic.Bool to a sync.Mutex-protected startOnce() is the right fix. The old pattern had a TOCTOU race: two goroutines could both read started=false then both call syncer.Start(). The new startOnce() atomizes the check + execute + set, preventing double-starts.

Nit: isStarted() acquires a full mutex just for a read that was previously an atomic load. Since isStarted() is no longer called in production code (only in tests), this is fine — but consider whether it needs to exist at all, or whether it should be package-private documentation for the test helper only.

`pkg/sync/sync_service.go` — ⚠️ One concern

StartForPublishing — well designed. The new method correctly skips initFromP2PWithRetry and starts the subscriber, allowing the gossip network to be ready before the first block is produced without blocking on P2P peers.

Timer leak fix — the replacement of time.After(backoff) in a loop with a single retryTimer.Reset(backoff) is correct. time.After in a hot loop leaks timer goroutines for the duration of each backoff.

Potential race in WriteToStoreAndBroadcast — the CAS-based storeInitialized guard has a window:

// pkg/sync/sync_service.go:136-149
if syncService.storeInitialized.CompareAndSwap(false, true) {
    storeInitialized, err = syncService.initStore(ctx, headerOrData)  // ← Goroutine A is here
    ...
}
firstStart, err := syncService.startSyncer(ctx)  // ← Goroutine B can reach here

If Goroutine A wins the CAS and is still inside initStore(), Goroutine B fails the CAS (sees true) and falls through to startSyncer(). The syncer can start operating on a partially-initialized or empty store. startOnce() correctly prevents a double-start, but doesn't prevent the syncer from starting before the store is ready. This is pre-existing, not introduced by this PR, but worth documenting or addressing.

Fix this →

`node/failover.go` — ✅ Correct

shouldStartSyncInPublisherMode is conservative and correct: it only skips blocking P2P sync when all three conditions are met: aggregator + raft leader + empty local store. Non-aggregators, followers, and nodes with existing data all use the standard blocking path.

The concurrent startup of header and data sync services via syncWg is well-structured; both switch modes atomically via the pre-computed startSyncInPublisherMode bool.

`execution/evm/execution.go` — ✅ Good defensive fix

The new check in reconcileExecutionAtHeight validates that a stored ExecStageStarted execution's inputs (timestamp + tx hash) match the current call before resuming via payload ID. This prevents a stale payload from a different block attempt being incorrectly resumed.

The hashTxs() extraction is clean. Note that hashTxs(nil) returns nil while hashTxs([][]byte{}) also returns nil (both have len==0), which is consistent with the prior behavior.

Tests — ⚠️ One bug

TestHeaderSyncServiceRestart has a dead loop: sync_service_test.go:183

for i := signedHeader.Height() + 1; i < 2; i++ {

signedHeader.Height() is 9 after the first write loop (which runs i < 10), so i starts at 10 and the condition i < 2 is immediately false — the loop body never executes. The second WriteToStoreAndBroadcast call that was supposed to verify post-restart broadcasting is dead code. This should likely be i < signedHeader.Height() + 3 or similar.

Fix this →

TestSyncerStatusStartOnce/concurrent_start_only_runs_once — well designed. The started channel synchronizes the test to confirm the start function is actively running before releasing all goroutines, which correctly validates the mutex prevents concurrent double-starts.

TestReconcileExecutionAtHeight_StartedExecMeta — good coverage of the three new input-matching cases (match, timestamp mismatch, tx mismatch).

Minor observations

e2e/failover_e2e_test.go has significant duplication between TestLeaseFailoverE2E and TestHASequencerRollingRestartE2E for setup, node starting, and verification. Not a bug, but a candidate for refactoring in a follow-up.
The 30-second p2pInitTimeout in initFromP2PWithRetry is hardcoded. A config option or named constant would make it easier to tune in tests.

Summary

The core fix (publisher mode for raft leaders with empty stores + SyncerStatus.startOnce() for race-free syncer startup) is correct and addresses the described race condition. Two actionable items:

The dead loop in TestHeaderSyncServiceRestart should be fixed — it silently provides no test coverage for post-restart broadcasting.
The WriteToStoreAndBroadcast store-init / syncer-start window is worth documenting even if deferring the fix.

codecov · 2026-03-17T13:19:48Z

Codecov Report

❌ Patch coverage is 54.65116% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.40%. Comparing base (8849e05) to head (24cb707).

Files with missing lines	Patch %	Lines
pkg/sync/sync_service.go	47.36%	13 Missing and 7 partials ⚠️
node/failover.go	37.03%	11 Missing and 6 partials ⚠️
pkg/sync/syncer_status.go	90.47%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3172   +/-   ##
=======================================
  Coverage   60.40%   60.40%           
=======================================
  Files         115      115           
  Lines       11902    11961   +59     
=======================================
+ Hits         7189     7225   +36     
- Misses       3905     3920   +15     
- Partials      808      816    +8

Flag	Coverage Δ
combined	`60.40% <54.65%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fix startup races

522863f

github-actions bot assigned alpe Mar 17, 2026

Review feedback

24cb707

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: startup races#3172

fix: startup races#3172
alpe wants to merge 2 commits intomainfrom
alex/fix_startup

alpe commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026 •

edited

Loading

Review skipped

Uh oh!

github-actions bot commented Mar 17, 2026 •

edited

Loading

Uh oh!

claude bot commented Mar 17, 2026 •

edited

Loading

Uh oh!

codecov bot commented Mar 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alpe commented Mar 17, 2026

Overview

Uh oh!

coderabbitai bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

github-actions bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Overview

pkg/sync/syncer_status.go — ✅ Correct

pkg/sync/sync_service.go — ⚠️ One concern

node/failover.go — ✅ Correct

execution/evm/execution.go — ✅ Good defensive fix

Tests — ⚠️ One bug

Minor observations

Summary

Uh oh!

codecov bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai bot commented Mar 17, 2026 •

edited

Loading

github-actions bot commented Mar 17, 2026 •

edited

Loading

claude bot commented Mar 17, 2026 •

edited

Loading

`pkg/sync/syncer_status.go` — ✅ Correct

`pkg/sync/sync_service.go` — ⚠️ One concern

`node/failover.go` — ✅ Correct

`execution/evm/execution.go` — ✅ Good defensive fix

codecov bot commented Mar 17, 2026 •

edited

Loading