fix(pullsync): prevent sync interval over-advancement #5424

Closed
misaakidis wants to merge 10 commits into master from fix/pullsync-interval-advancement

Conversation

misaakidis (Member) commented Apr 5, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Three bugs caused the puller's sync interval for a peer to advance past BinIDs whose chunks were never delivered or verified. Once an interval is marked done, those BinIDs are never re-requested from that peer.

Background

The puller tracks, per peer per bin, which BinID ranges have been synced. After each Sync() call it records [start, topmost] as done and retries from topmost+1. topmost comes from offer.Topmost set by the downstream peer.


Bug 1 — live chunk inflated topmost past historical cursor

collectAddrs subscribed to SubscribeBin with no upper bound. A newly-pushed chunk arriving at a high BinID while the historical cursor was still low produced an inflated offer.Topmost, causing the upstream peer to
skip the historical range permanently.

downstream historical cursor: 50
live chunk arrives at BinID:     75

offer.Topmost = 75  ← upstream peer advances [start, 75], skipping 51–74

Fix: snapshot ReserveLastBinIDs() before subscribing (historicalCursor). Live chunks are still included in the offer for eager storage, but Topmost is capped at historicalCursor.
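As a rough sketch of the cap (the helper name `capTopmost` and its parameters are illustrative, not bee's actual API):

```go
package main

import "fmt"

// capTopmost illustrates the Bug 1 fix: live chunks may still appear in
// the offer, but the advertised Topmost never exceeds the snapshot of
// the historical cursor taken before subscribing to the bin.
func capTopmost(highestOffered, historicalCursor uint64) uint64 {
	if highestOffered > historicalCursor {
		return historicalCursor
	}
	return highestOffered
}

func main() {
	// Historical cursor is 50; a live chunk arrived at BinID 75.
	fmt.Println(capTopmost(75, 50)) // 50: upstream advances only to 50
	fmt.Println(capTopmost(40, 50)) // 40: purely historical offers unaffected
}
```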


Bug 2 — leading gap produced wrong topmost

If the downstream peer had no chunks at [start, X-1] but had a chunk at BinID X, it sent a non-empty offer with Topmost = X. The upstream peer advanced its interval to [start, X], permanently skipping [start, X-1].

Downstream store:  ·  ·  ·  5  6  7    (BinIDs 2–4 absent)
Upstream start: 2
offer.Topmost = 5  ← upstream marks [2, 5] done, never re-requests 2–4

Fix: detect the leading gap in collectAddrs: when the first chunk has BinID > start and is within the historical range, return an empty offer with Topmost = BinID - 1. The upstream peer advances only to the gap boundary, then retries from BinID on the next round.

Round 1: empty offer, Topmost = 4    → upstream marks [2, 4] done
Round 2: offer {5,6,7}, Topmost = 7  → upstream marks [5, 7] done
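A minimal sketch of the leading-gap check (function and parameter names are hypothetical; in bee this logic lives inside collectAddrs):

```go
package main

import "fmt"

// leadingGap reports whether the server should answer with an empty
// offer, and which Topmost to advertise, when the first stored chunk
// sits beyond the requested start but within the historical range.
func leadingGap(start, firstBinID, historicalCursor uint64) (empty bool, topmost uint64) {
	if firstBinID > start && firstBinID <= historicalCursor {
		return true, firstBinID - 1 // stop at the gap boundary
	}
	return false, firstBinID
}

func main() {
	// Downstream store holds BinIDs {5,6,7}; upstream asks from start=2.
	empty, topmost := leadingGap(2, 5, 100)
	fmt.Println(empty, topmost) // round 1: empty offer, Topmost = 4
}
```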

Bug 3 — mid-offer gap produced wrong topmost

If the downstream peer's store has internal gaps (e.g. BinIDs {3, 7, 11} with start=3), the offer included all three chunks with Topmost = 11. The upstream peer advanced its interval to [3, 11], silently marking BinIDs 4–6 and 8–10 as synced.

Downstream store:  3  ·  ·  ·  7  ·  ·  ·  11   (gaps at 4–6 and 8–10)
Upstream start: 3
offer.Topmost = 11  ← upstream marks [3, 11] done, never re-requests 4–6, 8–10

This cannot be fixed by returning only the contiguous prefix, as that would require re-delivering chunks the upstream peer already stored. Instead, all chunks are included in a single offer and Topmost is capped at the contiguous Topmost — the highest BinID reachable from start without a gap.

contiguous Topmost = 3  (gap at 4)

Round 1: offer {3, 7, 11}, Topmost = 3   → upstream stores all 3, marks [3, 3] done
Round 2: empty offer,      Topmost = 6   → upstream marks [4, 6] done  (Bug 2 fix)
Round 3: offer {7, 11},    Topmost = 7   → upstream has both (Want = ∅), marks [7, 7] done
Round 4: empty offer,      Topmost = 10  → upstream marks [8, 10] done
Round 5: offer {11},       Topmost = 11  → upstream has it, marks [11, 11] done

No chunk data is retransmitted after round 1. Rounds 2–5 carry only metadata (empty offers / empty Want bitvectors).
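The contiguous-Topmost cap can be sketched as a walk over the sorted offered BinIDs (an illustrative helper; bee computes this incrementally while building the offer, and assumes start >= 1 since the BinID namespace starts at 1):

```go
package main

import "fmt"

// contiguousTopmost returns the highest BinID reachable from start
// without a gap, and whether at least one chunk extended the frontier.
func contiguousTopmost(start uint64, sortedBinIDs []uint64) (uint64, bool) {
	next, extended := start, false
	for _, id := range sortedBinIDs {
		if id != next {
			break // first gap: stop extending the frontier
		}
		extended = true
		next = id + 1
	}
	return next - 1, extended
}

func main() {
	top, ok := contiguousTopmost(3, []uint64{3, 7, 11})
	fmt.Println(top, ok) // 3 true: offer {3,7,11} is sent, Topmost capped at 3
}
```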


Cleanup — per-chunk errors no longer returned as sync failures

Per-chunk validation failures (ErrOverwriteNewerChunk, invalid stamp, expired batch, invalid CAC/SOC) were accumulated and returned to the caller, causing the puller to log spurious "syncWorker interval failed" entries and
increment its error counter for entirely normal situations. These are permanent properties of individual chunks — not indications that the sync session itself failed. They are now logged at debug level and skipped; the error return from Sync() is reserved for stream-level failures.
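The split between the two error classes can be sketched as follows (the sentinel errors here are illustrative stand-ins; the real ones live in bee's storage and pullsync packages):

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errInvalidStamp = errors.New("invalid stamp") // per-chunk: permanent property
	errStreamRead   = errors.New("read failed")   // stream-level: session failure
)

// handleChunkErr sketches the cleanup: per-chunk failures are logged at
// debug level and skipped; only stream-level errors abort Sync() and
// reach its error return.
func handleChunkErr(err error) (abort bool) {
	switch {
	case err == nil:
		return false
	case errors.Is(err, errInvalidStamp):
		fmt.Println("debug: skipping invalid chunk:", err)
		return false // not surfaced as a sync failure
	default:
		return true // returned from Sync() to the caller
	}
}

func main() {
	fmt.Println(handleChunkErr(errInvalidStamp)) // false: skipped
	fmt.Println(handleChunkErr(errStreamRead))   // true: aborts the sync
}
```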


Related Issue

Surfaced during pull-sync optimisation work. See also: fix/puller-disconnect-backoff (PR #5423)

gacevicljubisa and others added 8 commits March 18, 2026 08:35
… on successful delivery

Sync() previously returned offer.Topmost as topmost regardless of whether
the offered chunks passed stamp validation and were actually stored. The
puller used this value to advance the sync interval, permanently marking
BinIDs as synced even when no chunk was stored there.

Sync() now returns a separate stored value alongside topmost:
- stored = offer.Topmost when all chunks were validated and stored
- stored = 0 when any chunk failed validation

The puller advances the interval only when stored >= start. topmost is
retained for the MaxUint64 overflow guard. The mock mirrors this
behaviour: stored defaults to Topmost on success and is forced to 0
when a sync error is configured.

Known limitation: a serving peer whose historical reserve has a gap at
[start, X-1] can return a live chunk at BinID X >> start, inflating
offer.Topmost and therefore stored past BinIDs that were never held.
Fixing this requires the server to cap offer.Topmost at its own
historical cursor; that change is deferred.
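The stored/topmost contract described in this commit can be sketched from the puller's side (hypothetical helper; not the actual puller code):

```go
package main

import "fmt"

// advanceInterval applies the guard from this commit: the interval is
// written only when Sync() reports chunks actually stored at or past start.
func advanceInterval(start, stored uint64) (nextStart uint64, advanced bool) {
	if stored >= start {
		return stored + 1, true
	}
	return start, false // a chunk failed validation: retry from start
}

func main() {
	fmt.Println(advanceInterval(5, 10)) // 11 true: marks [5, 10] done
	fmt.Println(advanceInterval(5, 0))  // 5 false: interval write suppressed
}
```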
Two correctness bugs are fixed.

When a delivered chunk fails stamp validation, the peer's sync interval
must not advance past that BinID. Previously, Sync() returned
offer.Topmost unconditionally; the puller marked those BinIDs as done
even though no chunk was stored. Now Sync() returns topmost=0 on any
validation failure (invalid stamp, unsolicited, or structural error),
so the puller's existing guard suppresses the interval write.
ErrOverwriteNewerChunk is excluded: the chunk is already in the reserve
with a newer stamp, so advancing past it is correct.

collectAddrs blocked on SubscribeBin until the first chunk arrived. A
live chunk at BinID X far beyond the server's historical frontier caused
offer.Topmost=X, inflating the client's interval past historical BinIDs
it had never received. collectAddrs now snapshots ReserveLastBinIDs()
before subscribing and caps topmost at that cursor. Live chunks are
still delivered; only Topmost is bounded to the historical frontier.

Sync() retains its 3-value signature. topmost is the server's watermark
(capped at the historical cursor) on success, and 0 on validation failure.
The puller code is unchanged.
…rror accumulators

The hasValidationError bool was set at six separate sites and acted on in
a deferred block at the end of Sync(). This is an outlier in the bee
codebase, where the standard pattern is to use error accumulators whose
nil-ness is the condition.

Replace with two accumulators that express the existing semantic distinction
structurally:

  chunkErr    — stamp, solicitation, and structural failures. A non-nil
                chunkErr zeros topmost, preventing interval advancement past
                BinIDs whose chunks were never stored.

  overwriteErr — ErrOverwriteNewerChunk only. The chunk is already present
                 in the reserve; advancing the interval past it is correct,
                 so overwriteErr does not affect topmost.

Both accumulators are joined on return, preserving the caller contract:
all errors remain reachable via errors.Is. No interface change. No
behaviour change.
… over-advancement

When SubscribeBin returns a first chunk whose BinID is beyond the requested
start, the server has no chunks at [start, firstBinID-1]. Previously the
server included the chunk in the offer with Topmost=firstBinID, causing the
client to advance its interval across BinIDs it never received.

collectAddrs now detects this case: when the first chunk is within the
historical range (BinID <= historicalCursor) and its BinID is beyond the
requested start, the server returns an empty offer with Topmost=firstBinID-1.
The client advances its interval to the gap boundary only, then issues a new
Get from firstBinID and receives the chunk cleanly on the next round.

The check is skipped for start=0 (the BinID namespace starts at 1; the
puller always uses start>=1 in practice, but test helpers may pass 0).

A new test, TestSync_HistoricalGapReturnsEmptyOfferAtBoundary, covers the
gap-boundary case directly.
…r gap advancement

collectAddrs tracks the highest BinID reachable from start without a gap
(contiguousEnd). When the server's store has internal gaps (e.g. BinIDs
{3,7,11} with start=3), all matching chunks are still included in the
offer for eager client-side storage, but Topmost is capped at contiguousEnd
(3) so the client's sync interval does not advance past missing BinIDs.

The cap only fires when contiguousEnd >= start (at least one historical
chunk extended the frontier). This avoids capping when the offer contains
only live chunks (BinID > historicalCursor), where topmost is already
bounded by the existing historicalCursor cap.

Adds TestSync_MidOfferGapCapsTopmostAtContiguousFrontier to cover this case.
Commits 7429496 and 4f9650f introduced four style regressions:

- ErrUnsolicitedChunk wrapped in a single-element var () block; unwrap
  to a plain var declaration matching the rest of the package.
- Blank lines after the opening brace of New(), handler(), Sync(), and
  makeOffer(); remove to match bee's conventional style.

Also note: commit 4fd11e0 silently added
s.logger.Debug("batch timeout timer triggered") to the collectAddrs
timer case; that addition is intentional and remains.

No behaviour change.
misaakidis force-pushed the fix/pullsync-interval-advancement branch 2 times, most recently from c290234 to 3569e87 on April 5, 2026 19:46
misaakidis force-pushed the fix/pullsync-interval-advancement branch from 3569e87 to 1051de6 on April 5, 2026 19:55
Per-chunk validation failures (invalid stamp, unsolicited chunk, unknown
or expired batch, invalid CAC/SOC structure) are now logged at debug
level and skipped. They no longer affect topmost or the error return.

Previously chunkErr accumulated these failures and zeroed topmost,
stalling interval advancement indefinitely. Since each failure is a
permanent property of the chunk data, retrying the same BinID yields
the same result. With multi-peer historical sync each peer's interval
is tracked independently, so a chunk one peer cannot deliver will be
covered by another.

Stream-level errors (read failure, non-overwrite put error) still abort
the sync and are returned to the caller unchanged.
acud (Contributor) commented Apr 6, 2026

thanks @misaakidis for this. i still didn't fully decipher the first 2 bugs, but 3 does not look like a bug and was intended to work like this.

the thing is that because BinIDs are always increasing, you can thoroughly expect that if a chunk was deleted in the past, it turns into a gap that will never be filled (unless binid overflows - different problem). so in fact this was done this way so that old data that was deleted and synced in the future does not get re-requested and is pretty much the expected behavior.

also bear in mind there are other implications here. having this sort of logic will significantly fragment the intervals, causing historical syncing to always have to go through those unused gaps (which will never be filled, and therefore stall) every time the peer is being synced from.

i'm not sure this is desirable as it will slow down the receiving of actual data from the peer, and the amount of gaps you'd need to iterate on will grow linearly on every node restart. i suggest to rethink this change.

edit: bug 2 seems to be directly related to bug 3. again not sure this is what we want to have. about bug 1 - i can't really fully understand it honestly.

misaakidis (Member, Author) commented

Thanks @acud for the review! You're right on Bug 3, and the same reasoning applies to all three.

Bug 3 and Bug 2: confirmed not bugs. Gaps are permanent; the contiguous-cap and leading-gap fixes would fragment intervals and add overhead proportional to eviction history. Agreed on all points.

Bug 1: also not a bug. I had argued this was a race where BinID=75 could appear in the offer while 51–74 "hadn't arrived yet." That's wrong: IncBinID is a strict monotonic counter, so BinID=75 can only be assigned after 74 prior chunks were stored. SubscribeBin uses a sorted IterateBin scan, so 51–74 are always delivered before 75 in the stream. Either they exist (delivered in order) or they were deleted (permanent gaps, correct to advance past). No race is possible.

I originally modeled pullsync as a single-peer protocol in the context of SWIP-25, where interval advancement past an undelivered BinID would mean permanent data loss. That framing led me to treat gaps as delivery failures rather than permanent store state. Under the current multi-peer protocol the completeness guarantee lives at the protocol level, not the per-peer level, and v2.7.1's advancement behavior is correct.

One unrelated improvement worth a separate PR: per-chunk errors (ErrOverwriteNewerChunk, stamp validation failures, expired batch) are currently accumulated and returned to the caller, causing spurious syncWorker interval failed metrics for entirely normal situations. They could be logged at debug level and dropped from the error return, reserving it for stream-level failures only.

Closing this PR.
