fix(pullsync): prevent sync interval over-advancement #5424
misaakidis wants to merge 10 commits into master from
Conversation
… on successful delivery

Sync() previously returned offer.Topmost as topmost regardless of whether the offered chunks passed stamp validation and were actually stored. The puller used this value to advance the sync interval, permanently marking BinIDs as synced even when no chunk was stored there.

Sync() now returns a separate stored value alongside topmost:

- stored = offer.Topmost when all chunks were validated and stored
- stored = 0 when any chunk failed validation

The puller advances the interval only when stored >= start. topmost is retained for the MaxUint64 overflow guard. The mock mirrors this behaviour: stored defaults to Topmost on success and is forced to 0 when a sync error is configured.

Known limitation: a serving peer whose historical reserve has a gap at [start, X-1] can return a live chunk at BinID X >> start, inflating offer.Topmost and therefore stored past BinIDs that were never held. Fixing this requires the server to cap offer.Topmost at its own historical cursor; that change is deferred.
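A minimal sketch of the puller side of this stored/topmost contract. The function name `advance` and its exact shape are illustrative (the real puller inlines this logic): the interval write happens only when stored covers the requested start, while topmost still feeds the MaxUint64 overflow guard.

```go
package main

import "fmt"

const maxUint64 = ^uint64(0)

// advance decides whether the puller may record [start, stored] as done.
// topmost is only consulted for the overflow guard; stored carries the
// "actually validated and stored" watermark.
func advance(start, topmost, stored uint64) (next uint64, write bool) {
	if topmost == maxUint64 {
		return start, false // overflow guard: never wrap past the end of the namespace
	}
	if stored >= start {
		return stored + 1, true // mark [start, stored] synced, retry from stored+1
	}
	return start, false // nothing stored: do not advance, retry from start
}

func main() {
	fmt.Println(advance(51, 75, 75)) // all chunks stored → 76 true
	fmt.Println(advance(51, 75, 0))  // a chunk failed validation → 51 false
}
```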
Two correctness bugs are fixed.

When a delivered chunk fails stamp validation, the peer's sync interval must not advance past that BinID. Previously, Sync() returned offer.Topmost unconditionally; the puller marked those BinIDs as done even though no chunk was stored. Now Sync() returns topmost=0 on any validation failure (invalid stamp, unsolicited, or structural error), so the puller's existing guard suppresses the interval write. ErrOverwriteNewerChunk is excluded: the chunk is already in the reserve with a newer stamp, so advancing past it is correct.

collectAddrs blocked on SubscribeBin until the first chunk arrived. A live chunk at BinID X far beyond the server's historical frontier caused offer.Topmost=X, inflating the client's interval past historical BinIDs it had never received. collectAddrs now snapshots ReserveLastBinIDs() before subscribing and caps topmost at that cursor. Live chunks are still delivered; only Topmost is bounded to the historical frontier.

Sync() retains its 3-value signature. topmost is the server's watermark (capped at the historical cursor) on success, and 0 on validation failure. The puller code is unchanged.
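The two pieces above can be sketched as follows. Helper names are hypothetical (the real Sync() and collectAddrs inline this logic): the watermark is capped at the historical cursor snapshotted before subscribing, and zeroed on any per-chunk validation failure.

```go
package main

import "fmt"

// capTopmost bounds the offered watermark at the historical cursor taken
// from ReserveLastBinIDs() before subscribing, so a live chunk far beyond
// the historical frontier cannot inflate the client's interval.
func capTopmost(offerTopmost, historicalCursor uint64) uint64 {
	if offerTopmost > historicalCursor {
		return historicalCursor
	}
	return offerTopmost
}

// syncResult mirrors the client-side contract: topmost is the (capped)
// watermark on success and 0 when any chunk failed validation, which
// makes the puller skip the interval write.
func syncResult(cappedTopmost uint64, validationFailed bool) uint64 {
	if validationFailed {
		return 0
	}
	return cappedTopmost
}

func main() {
	fmt.Println(capTopmost(75, 50))   // live chunk at 75, cursor at 50 → 50
	fmt.Println(syncResult(50, true)) // validation failure → 0
}
```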
…rror accumulators
The hasValidationError bool was set at six separate sites and acted on in
a deferred block at the end of Sync(). This is an outlier in the bee
codebase, where the standard pattern is to use error accumulators whose
nil-ness is the condition.
Replace with two accumulators that express the existing semantic distinction
structurally:
- chunkErr — stamp, solicitation, and structural failures. A non-nil chunkErr zeros topmost, preventing interval advancement past BinIDs whose chunks were never stored.
- overwriteErr — ErrOverwriteNewerChunk only. The chunk is already present in the reserve; advancing the interval past it is correct, so overwriteErr does not affect topmost.
Both accumulators are joined on return, preserving the caller contract:
all errors remain reachable via errors.Is. No interface change. No
behaviour change.
… over-advancement

When SubscribeBin returns a first chunk whose BinID is beyond the requested start, the server has no chunks at [start, firstBinID-1]. Previously the server included the chunk in the offer with Topmost=firstBinID, causing the client to advance its interval across BinIDs it never received.

collectAddrs now detects this case: when the first chunk is within the historical range (BinID <= historicalCursor) and its BinID is beyond the requested start, the server returns an empty offer with Topmost=firstBinID-1. The client advances its interval to the gap boundary only, then issues a new Get from firstBinID and receives the chunk cleanly on the next round.

The check is skipped for start=0 (the BinID namespace starts at 1; the puller always uses start>=1 in practice, but test helpers may pass 0).

A new test, TestSync_HistoricalGapReturnsEmptyOfferAtBoundary, covers the gap-boundary case directly.
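The leading-gap check can be sketched as follows. The offer type and makeOffer signature are illustrative, reduced to the single BinID needed to show the branch; the real collectAddrs works on the subscription stream.

```go
package main

import "fmt"

// offer mirrors the shape of a pullsync offer: included BinIDs plus the
// Topmost watermark (field names are illustrative).
type offer struct {
	BinIDs  []uint64
	Topmost uint64
}

// makeOffer applies the leading-gap rule: if the first stored BinID is
// beyond start but still historical, return an empty offer whose Topmost
// is the gap boundary, so the client advances only to firstBinID-1 and
// re-requests from firstBinID on the next round. start=0 is exempt.
func makeOffer(start, firstBinID, historicalCursor uint64) offer {
	if start > 0 && firstBinID > start && firstBinID <= historicalCursor {
		return offer{Topmost: firstBinID - 1} // empty offer at the gap boundary
	}
	return offer{BinIDs: []uint64{firstBinID}, Topmost: firstBinID}
}

func main() {
	fmt.Println(makeOffer(3, 7, 100)) // gap at [3,6] → empty offer, Topmost 6
}
```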
…r gap advancement
collectAddrs tracks the highest BinID reachable from start without a gap
(contiguousEnd). When the server's store has internal gaps (e.g. BinIDs
{3,7,11} with start=3), all matching chunks are still included in the
offer for eager client-side storage, but Topmost is capped at contiguousEnd
(3) so the client's sync interval does not advance past missing BinIDs.
The cap only fires when contiguousEnd >= start (at least one historical
chunk extended the frontier). This avoids capping when the offer contains
only live chunks (BinID > historicalCursor), where topmost is already
bounded by the existing historicalCursor cap.
Adds TestSync_MidOfferGapCapsTopmostAtContiguousFrontier to cover this case.
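A minimal sketch of the contiguousEnd tracking, assuming start >= 1 (the BinID namespace starts at 1, so start-1 cannot underflow) and that BinIDs arrive sorted ascending, as SubscribeBin yields them:

```go
package main

import "fmt"

// contiguousTopmost computes the cap for a mid-offer gap: all chunks are
// still offered, but the watermark stops at the highest BinID reachable
// from start without a gap. The boolean reports whether the cap fires at
// all; when no historical chunk extends the frontier, the existing
// historicalCursor cap stays in charge.
func contiguousTopmost(start uint64, binIDs []uint64) (topmost uint64, capped bool) {
	contiguousEnd := start - 1 // requires start >= 1
	for _, id := range binIDs {
		if id == contiguousEnd+1 {
			contiguousEnd = id // frontier extends without a gap
		}
	}
	if contiguousEnd >= start {
		return contiguousEnd, true // at least one chunk extended the frontier
	}
	return 0, false
}

func main() {
	fmt.Println(contiguousTopmost(3, []uint64{3, 7, 11})) // 3 true: gap after 3
	fmt.Println(contiguousTopmost(3, []uint64{3, 4, 5}))  // 5 true: no gap
}
```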
Commits 7429496 and 4f9650f introduced four style regressions:

- ErrUnsolicitedChunk wrapped in a single-element var () block; unwrap to a plain var declaration matching the rest of the package.
- Blank lines after the opening brace of New(), handler(), Sync(), and makeOffer(); remove to match bee's conventional style.

Also note: commit 4fd11e0 silently added s.logger.Debug("batch timeout timer triggered") to the collectAddrs timer case; that addition is intentional and remains.

No behaviour change.
Per-chunk validation failures (invalid stamp, unsolicited chunk, unknown or expired batch, invalid CAC/SOC structure) are now logged at debug level and skipped. They no longer affect topmost or the error return.

Previously chunkErr accumulated these failures and zeroed topmost, stalling interval advancement indefinitely. Since each failure is a permanent property of the chunk data, retrying the same BinID yields the same result. With multi-peer historical sync each peer's interval is tracked independently, so a chunk one peer cannot deliver will be covered by another.

Stream-level errors (read failure, non-overwrite put error) still abort the sync and are returned to the caller unchanged.
thanks @misaakidis for this. i still didn't fully decipher the first 2 bugs, but 3 does not look like a bug and was intended to work like this. the thing is that because BinIDs are always increasing, you can safely expect that if a chunk was deleted in the past, it turns into a gap that will never be filled (unless the binid overflows - different problem). so in fact this was done this way so that old data that was deleted and synced in the future does not get re-requested, and is pretty much the expected behavior.

also bear in mind there are other implications here. having this sort of logic will significantly fragment the intervals, causing historical syncing to always have to go through those unused gaps (which will never be filled, and therefore stall) every time the peer is being synced from. i'm not sure this is desirable, as it will slow down the receiving of actual data from the peer, and the amount of gaps you'd need to iterate over will grow linearly on every node restart. i suggest rethinking this change.

edit: bug 2 seems to be directly related to bug 3. again not sure this is what we want to have. about bug 1 - i can't really fully understand it honestly.
Thanks @acud for the review! You're right on Bug 3, and the same reasoning applies to all three.

Bug 3 and Bug 2: confirmed not bugs. Gaps are permanent; the contiguous-cap and leading-gap fixes would fragment intervals and add overhead proportional to eviction history. Agreed on all points.

Bug 1: also not a bug. I had argued this was a race where BinID=75 could appear in the offer while 51–74 "hadn't arrived yet." That's wrong: IncBinID is a strict monotonic counter, so BinID=75 can only be assigned after 74 prior chunks were stored. SubscribeBin uses a sorted IterateBin scan, so 51–74 are always delivered before 75 in the stream. Either they exist (delivered in order) or they were deleted (permanent gaps, correct to advance past). No race is possible.

I originally modeled pullsync as a single-peer protocol in the context of SWIP-25, where interval advancement past an undelivered BinID would mean permanent data loss. That framing led me to treat gaps as delivery failures rather than permanent store state. Under the current multi-peer protocol the completeness guarantee lives at the protocol level, not the per-peer level, and v2.7.1's advancement behavior is correct.

One unrelated improvement worth a separate PR: per-chunk errors (ErrOverwriteNewerChunk, stamp validation failures, expired batch) are currently accumulated and returned to the caller, causing spurious "syncWorker interval failed" metrics for entirely normal situations. They could be logged at debug level and dropped from the error return, reserving it for stream-level failures only.

Closing this PR.
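The monotonicity argument above can be illustrated with a toy bin. This is a hypothetical stand-in, not bee's reserve: IncBinID hands out strictly increasing IDs, deletion leaves a permanent gap, and iteration is always in ascending order, which is why surviving lower BinIDs are delivered before higher ones.

```go
package main

import (
	"fmt"
	"sort"
)

// bin is a toy model of a reserve bin: a strictly increasing counter plus
// the set of surviving BinIDs.
type bin struct {
	next  uint64
	items map[uint64]struct{}
}

// IncBinID assigns the next BinID; IDs are never reused, so a deleted ID
// becomes a gap that is never filled (absent counter overflow).
func (b *bin) IncBinID() uint64 {
	b.next++
	b.items[b.next] = struct{}{}
	return b.next
}

func (b *bin) Delete(id uint64) { delete(b.items, id) }

// IterateBin yields surviving BinIDs >= start in ascending order,
// mirroring the sorted scan that backs SubscribeBin.
func (b *bin) IterateBin(start uint64) []uint64 {
	var ids []uint64
	for id := range b.items {
		if id >= start {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	b := &bin{items: map[uint64]struct{}{}}
	for i := 0; i < 5; i++ {
		b.IncBinID() // assigns 1..5
	}
	b.Delete(2)                  // permanent gap at 2
	fmt.Println(b.IterateBin(1)) // [1 3 4 5]
}
```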
Checklist
Description
Three bugs caused the puller's sync interval for a peer to advance past BinIDs whose chunks were never delivered or verified. Once an interval is marked done, those BinIDs are never re-requested from that peer.
Background
The puller tracks, per peer per bin, which BinID ranges have been synced. After each Sync() call it records [start, topmost] as done and retries from topmost+1. topmost comes from offer.Topmost set by the downstream peer.

Bug 1 — live chunk inflated topmost past historical cursor

collectAddrs subscribed to SubscribeBin with no upper bound. A newly-pushed chunk arriving at a high BinID while the historical cursor was still low produced an inflated offer.Topmost, causing the upstream peer to skip the historical range permanently.

Fix: snapshot ReserveLastBinIDs() before subscribing (historicalCursor). Live chunks are still included in the offer for eager storage, but Topmost is capped at historicalCursor.

Bug 2 — leading gap produced wrong topmost

If the downstream peer has no chunks at [start, X-1] but has a chunk at BinID X, it sent a non-empty offer with Topmost = X. The upstream peer advanced its interval to [start, X], permanently skipping [start, X-1].

Fix: detect the leading gap in collectAddrs: when the first chunk has BinID > start and is within the historical range, return an empty offer with Topmost = BinID - 1. The upstream peer advances only to the gap boundary, then retries from BinID on the next round.

Bug 3 — mid-offer gap produced wrong topmost

If the downstream peer's store has internal gaps (e.g. BinIDs {3, 7, 11} with start=3), the offer included all three chunks with Topmost = 11. The upstream peer advanced its interval to [3, 11], silently marking BinIDs 4–6 and 8–10 as synced.

This cannot be fixed by returning only the contiguous prefix, as that would require re-delivering chunks the upstream peer already stored. Instead, all chunks are included in a single offer and Topmost is capped at the contiguous frontier — the highest BinID reachable from start without a gap.

No chunk data is retransmitted after round 1. Rounds 2–5 carry only metadata (empty offers / empty Want bitvectors).
Cleanup — per-chunk errors no longer returned as sync failures

Per-chunk validation failures (ErrOverwriteNewerChunk, invalid stamp, expired batch, invalid CAC/SOC) were accumulated and returned to the caller, causing the puller to log spurious "syncWorker interval failed" entries and increment its error counter for entirely normal situations. These are permanent properties of individual chunks — not indications that the sync session itself failed.

They are now logged at debug level and skipped; the error return from Sync() is reserved for stream-level failures.

Related Issue
Surfaced during pull-sync optimisation work. See also: fix/puller-disconnect-backoff (PR #5423)