blocksync: fix sendError in block sync (CON-276)#3440
Conversation
Restructures BlockPool.AddBlock, removeTimedoutPeers, and bpPeer.onTimeout so pool.mtx is released before any send on errorsCh. In AddBlock and removeTimedoutPeers the send is registered as a defer before the deferred Unlock; LIFO ordering runs Unlock first, preserving panic safety from the deferred unlock. onTimeout's critical section is a single boolean write, so it uses an explicit Unlock instead. Original error semantics and return values are preserved. Adds TestBlockPoolAddBlockReleasesLockBeforeSend asserting the invariant for AddBlock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3440 +/- ##
==========================================
+ Coverage 59.31% 59.36% +0.05%
==========================================
Files 2120 2120
Lines 175523 175867 +344
==========================================
+ Hits 104106 104412 +306
- Misses 62338 62368 +30
- Partials 9079 9087 +8
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
|
||
| errorsCh <- peerError{errors.New("filler"), peerID} | ||
|
|
||
| farHeight := int64(1 + maxDiffBetweenCurrentAndReceivedBlockHeight + 1000) |
There was a problem hiding this comment.
please document what 1 + ... + 1000 is about
| farBlock := &types.Block{Header: types.Header{Height: farHeight}} | ||
|
|
||
| addBlockDone := make(chan struct{}) | ||
| go func() { |
There was a problem hiding this comment.
spawn -> wait looks time sensitive, can you make it more robust?
There was a problem hiding this comment.
if not, then perhaps we shouldn't really have this test.
There was a problem hiding this comment.
Changed to check runtime.Stack to detect sendError happened, how does this look?
| t.Fatal("AddBlock did not complete after errorsCh was drained") | ||
| } | ||
|
|
||
| select { |
There was a problem hiding this comment.
what's the point of this clause?
| close(probeDone) | ||
| }() | ||
|
|
||
| select { |
There was a problem hiding this comment.
can we skip the individual timeouts? I'm pretty sure it will be flaky otherwise.
Rewrites TestBlockPoolAddBlockReleasesLockBeforeSend to remove the time-sensitive sleep and per-select timeouts. Switches errorsCh to unbuffered so AddBlock's sendError reliably parks on the send, then uses runtime.Stack to detect when the goroutine is parked in sendError and probes pool.mtx with TryLock. The test now passes/fails deterministically against fixed/buggy code respectively, and relies on require.Eventually with a single budget rather than scattered timeouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview Adds a regression test ( Reviewed by Cursor Bugbot for commit 34f12ad. Bugbot is set up for automated code reviews on this repo. Configure here. |
Describe your changes and provide context
BlockPool.AddBlock,BlockPool.removeTimedoutPeers, andbpPeer.onTimeoutpreviously calledsendError(a blocking send onerrorsCh) while still holdingpool.mtx. This change releases the mutex before the send.In
AddBlockandremoveTimedoutPeersthe send is registered as adeferbefore thedefer pool.mtx.Unlock(), so LIFO ordering runsUnlockfirst and then performs the send. This keeps panic safety from the deferred unlock and avoids tracking an explicitUnlockon every return path.onTimeout's critical section is a single boolean write, so it uses an explicitUnlockinstead.AddBlock's return values and error wrapping are unchanged.Testing performed to validate your change
go test ./sei-tendermint/internal/blocksync/... -race -count=1— full package green.TestBlockPoolAddBlockReleasesLockBeforeSendasserts the invariant forAddBlock. Verified discriminating by temporarily reverting theAddBlockedit — the test then blocks until the package timeout.gofmt -s -lclean.