
blocksync: fix sendError in block sync (CON-276) #3440

Open

wen-coding wants to merge 2 commits into main from claude/lucid-sanderson-aa951a

Conversation

Contributor

@wen-coding wen-coding commented May 14, 2026

Describe your changes and provide context

BlockPool.AddBlock, BlockPool.removeTimedoutPeers, and bpPeer.onTimeout previously called sendError (a blocking send on errorsCh) while still holding pool.mtx. This change releases the mutex before the send.

In AddBlock and removeTimedoutPeers the send is registered as a defer before the defer pool.mtx.Unlock(), so LIFO ordering runs Unlock first and then performs the send. This keeps panic safety from the deferred unlock and avoids tracking an explicit Unlock on every return path. onTimeout's critical section is a single boolean write, so it uses an explicit Unlock instead.

AddBlock's return values and error wrapping are unchanged.

Testing performed to validate your change

  • go test ./sei-tendermint/internal/blocksync/... -race -count=1 — full package green.
  • New regression test TestBlockPoolAddBlockReleasesLockBeforeSend asserts the invariant for AddBlock. Verified that the test discriminates by temporarily reverting the AddBlock edit; the unfixed code then blocks until the package timeout.
  • gofmt -s -l clean.

Restructures BlockPool.AddBlock, removeTimedoutPeers, and
bpPeer.onTimeout so pool.mtx is released before any send on errorsCh.
In AddBlock and removeTimedoutPeers the send is registered as a defer
before the deferred Unlock; LIFO ordering runs Unlock first, preserving
panic safety from the deferred unlock. onTimeout's critical section is
a single boolean write, so it uses an explicit Unlock instead.

Original error semantics and return values are preserved.

Adds TestBlockPoolAddBlockReleasesLockBeforeSend asserting the
invariant for AddBlock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wen-coding wen-coding changed the title from "blocksync: release pool.mtx before sending on errorsCh (CON-276)" to "blocksync: BlockPool refactor (CON-276)" on May 14, 2026

github-actions Bot commented May 14, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ✅ passed | ✅ passed | ✅ passed | May 15, 2026, 3:02 PM



codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.36%. Comparing base (5060dcd) to head (34f12ad).
⚠️ Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
sei-tendermint/internal/blocksync/pool.go | 80.00% | 2 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3440      +/-   ##
==========================================
+ Coverage   59.31%   59.36%   +0.05%     
==========================================
  Files        2120     2120              
  Lines      175523   175867     +344     
==========================================
+ Hits       104106   104412     +306     
- Misses      62338    62368      +30     
- Partials     9079     9087       +8     
Flag | Coverage | Δ
sei-chain-pr | 72.80% <80.00%> | (?)
sei-db | 70.41% <ø> | (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines | Coverage | Δ
sei-tendermint/internal/blocksync/pool.go | 84.91% <80.00%> | (+2.34%) ⬆️

... and 2 files with indirect coverage changes


@wen-coding wen-coding changed the title from "blocksync: BlockPool refactor (CON-276)" to "blocksync: fix sendError in block sync (CON-276)" on May 14, 2026

errorsCh <- peerError{errors.New("filler"), peerID}

farHeight := int64(1 + maxDiffBetweenCurrentAndReceivedBlockHeight + 1000)
Contributor

please document what 1 + ... + 1000 is about

Contributor Author

done

farBlock := &types.Block{Header: types.Header{Height: farHeight}}

addBlockDone := make(chan struct{})
go func() {
Contributor

spawn -> wait looks time sensitive, can you make it more robust?

Contributor

if not, then perhaps we shouldn't really have this test.

Contributor Author

Changed it to check runtime.Stack to detect that sendError happened; how does this look?

t.Fatal("AddBlock did not complete after errorsCh was drained")
}

select {
Contributor

what's the point of this clause?

Contributor Author

removed

close(probeDone)
}()

select {
Contributor

can we skip the individual timeouts? I'm pretty sure it will be flaky otherwise.

Contributor Author

done

Rewrites TestBlockPoolAddBlockReleasesLockBeforeSend to remove the
time-sensitive sleep and per-select timeouts. Switches errorsCh to
unbuffered so AddBlock's sendError reliably parks on the send, then
uses runtime.Stack to detect when the goroutine is parked in
sendError and probes pool.mtx with TryLock. The test now passes/fails
deterministically against fixed/buggy code respectively, and relies
on require.Eventually with a single budget rather than scattered
timeouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor Bot commented May 15, 2026

PR Summary

Medium Risk
Touches blocksync concurrency by changing lock/unlock ordering around blocking errorsCh sends; mistakes could introduce races or missed error delivery, but scope is limited to error paths/timeouts.

Overview
Prevents potential deadlocks/contention in blocksync by ensuring BlockPool releases pool.mtx before doing the blocking errorsCh send in AddBlock, removeTimedoutPeers, and bpPeer.onTimeout.

Adds a regression test (TestBlockPoolAddBlockReleasesLockBeforeSend) that parks AddBlock on an unbuffered errorsCh send and asserts the mutex remains acquirable while the goroutine is blocked.

Reviewed by Cursor Bugbot for commit 34f12ad. Bugbot is set up for automated code reviews on this repo.


3 participants