fix(pullsync): prevent sync interval over-advancement #5424

Closed
misaakidis wants to merge 10 commits into master from fix/pullsync-interval-advancement

Conversation

misaakidis (Member) commented Apr 5, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Three bugs caused the puller's sync interval for a peer to advance past BinIDs whose chunks were never delivered or verified. Once an interval is marked done, those BinIDs are never re-requested from that peer.

Background

The puller tracks, per peer per bin, which BinID ranges have been synced. After each Sync() call it records [start, topmost] as done and retries from topmost+1. topmost comes from offer.Topmost set by the downstream peer.


Bug 1 — live chunk inflated topmost past historical cursor

collectAddrs subscribed to SubscribeBin with no upper bound. A newly-pushed chunk arriving at a high BinID while the historical cursor was still low produced an inflated offer.Topmost, causing the upstream peer to
skip the historical range permanently.

downstream historical cursor: 50
live chunk arrives at BinID:     75

offer.Topmost = 75  ← upstream peer advances [start, 75], skipping 51–74

Fix: snapshot ReserveLastBinIDs() before subscribing (historicalCursor). Live chunks are still included in the offer for eager storage, but Topmost is capped at historicalCursor.
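As a rough sketch of the cap (the helper name `capTopmost` and its parameters are illustrative, not bee's actual API):

```go
package main

import "fmt"

// capTopmost illustrates the Bug 1 fix: live chunks may still appear in
// the offer, but the advertised Topmost never exceeds the snapshot of
// the historical cursor taken before subscribing to the bin.
func capTopmost(highestOffered, historicalCursor uint64) uint64 {
	if highestOffered > historicalCursor {
		return historicalCursor
	}
	return highestOffered
}

func main() {
	// Historical cursor is 50; a live chunk arrived at BinID 75.
	fmt.Println(capTopmost(75, 50)) // 50: upstream advances only to 50
	fmt.Println(capTopmost(40, 50)) // 40: purely historical offers unaffected
}
```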


Bug 2 — leading gap produced wrong topmost

If the downstream peer had no chunks at [start, X-1] but had a chunk at BinID X, it sent a non-empty offer with Topmost = X. The upstream peer advanced its interval to [start, X], permanently skipping [start, X-1].

Downstream store:  ·  ·  ·  5  6  7    (BinIDs 2–4 absent)
Upstream start: 2
offer.Topmost = 5  ← upstream marks [2, 5] done, never re-requests 2–4

Fix: detect the leading gap in collectAddrs: when the first chunk has BinID > start and is within the historical range, return an empty offer with Topmost = BinID - 1. The upstream peer advances only to the gap boundary, then retries from BinID on the next round.

Round 1: empty offer, Topmost = 4    → upstream marks [2, 4] done
Round 2: offer {5,6,7}, Topmost = 7  → upstream marks [5, 7] done
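A minimal sketch of the leading-gap check (function and parameter names are hypothetical; in bee this logic lives inside collectAddrs):

```go
package main

import "fmt"

// leadingGap reports whether the server should answer with an empty
// offer, and which Topmost to advertise, when the first stored chunk
// sits beyond the requested start but within the historical range.
func leadingGap(start, firstBinID, historicalCursor uint64) (empty bool, topmost uint64) {
	if firstBinID > start && firstBinID <= historicalCursor {
		return true, firstBinID - 1 // stop at the gap boundary
	}
	return false, firstBinID
}

func main() {
	// Downstream store holds BinIDs {5,6,7}; upstream asks from start=2.
	empty, topmost := leadingGap(2, 5, 100)
	fmt.Println(empty, topmost) // round 1: empty offer, Topmost = 4
}
```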

Bug 3 — mid-offer gap produced wrong topmost

If the downstream peer's store has internal gaps (e.g. BinIDs {3, 7, 11} with start=3), the offer included all three chunks with Topmost = 11. The upstream peer advanced its interval to [3, 11], silently marking BinIDs 4–6 and 8–10 as synced.

Downstream store:  3  ·  ·  ·  7  ·  ·  ·  11   (gaps at 4–6 and 8–10)
Upstream start: 3
offer.Topmost = 11  ← upstream marks [3, 11] done, never re-requests 4–6, 8–10

This cannot be fixed by returning only the contiguous prefix, as that would require re-delivering chunks the upstream peer already stored. Instead, all chunks are included in a single offer and Topmost is capped at the contiguous Topmost — the highest BinID reachable from start without a gap.

contiguous Topmost = 3  (gap at 4)

Round 1: offer {3, 7, 11}, Topmost = 3   → upstream stores all 3, marks [3, 3] done
Round 2: empty offer,      Topmost = 6   → upstream marks [4, 6] done  (Bug 2 fix)
Round 3: offer {7, 11},    Topmost = 7   → upstream has both (Want = ∅), marks [7, 7] done
Round 4: empty offer,      Topmost = 10  → upstream marks [8, 10] done
Round 5: offer {11},       Topmost = 11  → upstream has it, marks [11, 11] done

No chunk data is retransmitted after round 1. Rounds 2–5 carry only metadata (empty offers / empty Want bitvectors).
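The contiguous-Topmost cap can be sketched as a walk over the sorted offered BinIDs (an illustrative helper; bee computes this incrementally while building the offer, and assumes start >= 1 since the BinID namespace starts at 1):

```go
package main

import "fmt"

// contiguousTopmost returns the highest BinID reachable from start
// without a gap, and whether at least one chunk extended the frontier.
func contiguousTopmost(start uint64, sortedBinIDs []uint64) (uint64, bool) {
	next, extended := start, false
	for _, id := range sortedBinIDs {
		if id != next {
			break // first gap: stop extending the frontier
		}
		extended = true
		next = id + 1
	}
	return next - 1, extended
}

func main() {
	top, ok := contiguousTopmost(3, []uint64{3, 7, 11})
	fmt.Println(top, ok) // 3 true: offer {3,7,11} is sent, Topmost capped at 3
}
```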


Cleanup — per-chunk errors no longer returned as sync failures

Per-chunk validation failures (ErrOverwriteNewerChunk, invalid stamp, expired batch, invalid CAC/SOC) were accumulated and returned to the caller, causing the puller to log spurious "syncWorker interval failed" entries and
increment its error counter for entirely normal situations. These are permanent properties of individual chunks — not indications that the sync session itself failed. They are now logged at debug level and skipped; the error return from Sync() is reserved for stream-level failures.
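The split between the two error classes can be sketched as follows (the sentinel errors here are illustrative stand-ins; the real ones live in bee's storage and pullsync packages):

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errInvalidStamp = errors.New("invalid stamp") // per-chunk: permanent property
	errStreamRead   = errors.New("read failed")   // stream-level: session failure
)

// handleChunkErr sketches the cleanup: per-chunk failures are logged at
// debug level and skipped; only stream-level errors abort Sync() and
// reach its error return.
func handleChunkErr(err error) (abort bool) {
	switch {
	case err == nil:
		return false
	case errors.Is(err, errInvalidStamp):
		fmt.Println("debug: skipping invalid chunk:", err)
		return false // not surfaced as a sync failure
	default:
		return true // returned from Sync() to the caller
	}
}

func main() {
	fmt.Println(handleChunkErr(errInvalidStamp)) // false: skipped
	fmt.Println(handleChunkErr(errStreamRead))   // true: aborts the sync
}
```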


Related Issue

Surfaced during pull-sync optimisation work. See also: fix/puller-disconnect-backoff (PR #5423)

gacevicljubisa and others added 8 commits March 18, 2026 08:35
… on successful delivery

Sync() previously returned offer.Topmost as topmost regardless of whether
the offered chunks passed stamp validation and were actually stored. The
puller used this value to advance the sync interval, permanently marking
BinIDs as synced even when no chunk was stored there.

Sync() now returns a separate stored value alongside topmost:
- stored = offer.Topmost when all chunks were validated and stored
- stored = 0 when any chunk failed validation

The puller advances the interval only when stored >= start. topmost is
retained for the MaxUint64 overflow guard. The mock mirrors this
behaviour: stored defaults to Topmost on success and is forced to 0
when a sync error is configured.

Known limitation: a serving peer whose historical reserve has a gap at
[start, X-1] can return a live chunk at BinID X >> start, inflating
offer.Topmost and therefore stored past BinIDs that were never held.
Fixing this requires the server to cap offer.Topmost at its own
historical cursor; that change is deferred.
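The stored/topmost contract described in this commit can be sketched from the puller's side (hypothetical helper; not the actual puller code):

```go
package main

import "fmt"

// advanceInterval applies the guard from this commit: the interval is
// written only when Sync() reports chunks actually stored at or past start.
func advanceInterval(start, stored uint64) (nextStart uint64, advanced bool) {
	if stored >= start {
		return stored + 1, true
	}
	return start, false // a chunk failed validation: retry from start
}

func main() {
	fmt.Println(advanceInterval(5, 10)) // 11 true: marks [5, 10] done
	fmt.Println(advanceInterval(5, 0))  // 5 false: interval write suppressed
}
```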
Two correctness bugs are fixed.

When a delivered chunk fails stamp validation, the peer's sync interval
must not advance past that BinID. Previously, Sync() returned
offer.Topmost unconditionally; the puller marked those BinIDs as done
even though no chunk was stored. Now Sync() returns topmost=0 on any
validation failure (invalid stamp, unsolicited, or structural error),
so the puller's existing guard suppresses the interval write.
ErrOverwriteNewerChunk is excluded: the chunk is already in the reserve
with a newer stamp, so advancing past it is correct.

collectAddrs blocked on SubscribeBin until the first chunk arrived. A
live chunk at BinID X far beyond the server's historical frontier caused
offer.Topmost=X, inflating the client's interval past historical BinIDs
it had never received. collectAddrs now snapshots ReserveLastBinIDs()
before subscribing and caps topmost at that cursor. Live chunks are
still delivered; only Topmost is bounded to the historical frontier.

Sync() retains its 3-value signature. topmost is the server's watermark
(capped at the historical cursor) on success, and 0 on validation failure.
The puller code is unchanged.
…rror accumulators

The hasValidationError bool was set at six separate sites and acted on in
a deferred block at the end of Sync(). This is an outlier in the bee
codebase, where the standard pattern is to use error accumulators whose
nil-ness is the condition.

Replace with two accumulators that express the existing semantic distinction
structurally:

  chunkErr    — stamp, solicitation, and structural failures. A non-nil
                chunkErr zeros topmost, preventing interval advancement past
                BinIDs whose chunks were never stored.

  overwriteErr — ErrOverwriteNewerChunk only. The chunk is already present
                 in the reserve; advancing the interval past it is correct,
                 so overwriteErr does not affect topmost.

Both accumulators are joined on return, preserving the caller contract:
all errors remain reachable via errors.Is. No interface change. No
behaviour change.
… over-advancement

When SubscribeBin returns a first chunk whose BinID is beyond the requested
start, the server has no chunks at [start, firstBinID-1]. Previously the
server included the chunk in the offer with Topmost=firstBinID, causing the
client to advance its interval across BinIDs it never received.

collectAddrs now detects this case: when the first chunk is within the
historical range (BinID <= historicalCursor) and its BinID is beyond the
requested start, the server returns an empty offer with Topmost=firstBinID-1.
The client advances its interval to the gap boundary only, then issues a new
Get from firstBinID and receives the chunk cleanly on the next round.

The check is skipped for start=0 (the BinID namespace starts at 1; the
puller always uses start>=1 in practice, but test helpers may pass 0).

A new test, TestSync_HistoricalGapReturnsEmptyOfferAtBoundary, covers the
gap-boundary case directly.
…r gap advancement

collectAddrs tracks the highest BinID reachable from start without a gap
(contiguousEnd). When the server's store has internal gaps (e.g. BinIDs
{3,7,11} with start=3), all matching chunks are still included in the
offer for eager client-side storage, but Topmost is capped at contiguousEnd
(3) so the client's sync interval does not advance past missing BinIDs.

The cap only fires when contiguousEnd >= start (at least one historical
chunk extended the frontier). This avoids capping when the offer contains
only live chunks (BinID > historicalCursor), where topmost is already
bounded by the existing historicalCursor cap.

Adds TestSync_MidOfferGapCapsTopmostAtContiguousFrontier to cover this case.
Commits 7429496 and 4f9650f introduced four style regressions:

- ErrUnsolicitedChunk wrapped in a single-element var () block; unwrap
  to a plain var declaration matching the rest of the package.
- Blank lines after the opening brace of New(), handler(), Sync(), and
  makeOffer(); remove to match bee's conventional style.

Also note: commit 4fd11e0 silently added
s.logger.Debug("batch timeout timer triggered") to the collectAddrs
timer case; that addition is intentional and remains.

No behaviour change.
misaakidis force-pushed the fix/pullsync-interval-advancement branch 2 times, most recently from c290234 to 3569e87 on April 5, 2026 19:46
misaakidis force-pushed the fix/pullsync-interval-advancement branch from 3569e87 to 1051de6 on April 5, 2026 19:55
Per-chunk validation failures (invalid stamp, unsolicited chunk, unknown
or expired batch, invalid CAC/SOC structure) are now logged at debug
level and skipped. They no longer affect topmost or the error return.

Previously chunkErr accumulated these failures and zeroed topmost,
stalling interval advancement indefinitely. Since each failure is a
permanent property of the chunk data, retrying the same BinID yields
the same result. With multi-peer historical sync each peer's interval
is tracked independently, so a chunk one peer cannot deliver will be
covered by another.

Stream-level errors (read failure, non-overwrite put error) still abort
the sync and are returned to the caller unchanged.
acud (Contributor) commented Apr 6, 2026

thanks @misaakidis for this. i still didn't fully decipher the first 2 bugs, but 3 does not look like a bug and was intended to work like this.

the thing is that because BinIDs are always increasing, you can thoroughly expect that if a chunk was deleted in the past, it turns into a gap that will never be filled (unless binid overflows - different problem). so in fact this was done this way so that old data that was deleted and synced in the future does not get re-requested and is pretty much the expected behavior.

also bear in mind there are other implications here. having this sort of logic will significantly fragment the intervals, causing historical syncing to always have to go through those unused gaps (which will never be filled, and therefore stall) every time the peer is being synced from.

i'm not sure this is desirable as it will slow down the receiving of actual data from the peer, and the amount of gaps you'd need to iterate on will grow linearly on every node restart. i suggest to rethink this change.

edit: bug 2 seems to be directly related to bug 3. again not sure this is what we want to have. about bug 1 - i can't really fully understand it honestly.

misaakidis (Member, Author) commented

Thanks @acud for the review! You're right on Bug 3, and the same reasoning applies to all three.

Bug 3 and Bug 2: confirmed not bugs. Gaps are permanent; the contiguous-cap and leading-gap fixes would fragment intervals and add overhead proportional to eviction history. Agreed on all points.

Bug 1: also not a bug. I had argued this was a race where BinID=75 could appear in the offer while 51–74 "hadn't arrived yet." That's wrong: IncBinID is a strict monotonic counter, so BinID=75 can only be assigned after 74 prior chunks were stored. SubscribeBin uses a sorted IterateBin scan, so 51–74 are always delivered before 75 in the stream. Either they exist (delivered in order) or they were deleted (permanent gaps, correct to advance past). No race is possible.

I originally modeled pullsync as a single-peer protocol in the context of SWIP-25, where interval advancement past an undelivered BinID would mean permanent data loss. That framing led me to treat gaps as delivery failures rather than permanent store state. Under the current multi-peer protocol the completeness guarantee lives at the protocol level, not the per-peer level, and v2.7.1's advancement behavior is correct.

One unrelated improvement worth a separate PR: per-chunk errors (ErrOverwriteNewerChunk, stamp validation failures, expired batch) are currently accumulated and returned to the caller, causing spurious syncWorker interval failed metrics for entirely normal situations. They could be logged at debug level and dropped from the error return, reserving it for stream-level failures only.

Closing this PR.
