
Revert "Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update"#4730

Merged
yrobla merged 1 commit into main from revert-4669-issue-4494
Apr 10, 2026

Conversation

@yrobla
Contributor

@yrobla yrobla commented Apr 10, 2026

Reverts #4669

Large PR Justification

This is a revert of a PR that passed with breaking tests.

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Apr 10, 2026
Contributor

@github-actions github-actions bot left a comment


Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@codecov

codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 81.45161% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.74%. Comparing base (9c974a3) to head (c9673d7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/server/sessionmanager/session_manager.go 55.55% 12 Missing and 8 partials ⚠️
...kg/transport/session/session_data_storage_local.go 93.54% 1 Missing and 1 partial ⚠️
pkg/vmcp/server/sessionmanager/cache.go 97.82% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4730      +/-   ##
==========================================
+ Coverage   68.68%   68.74%   +0.06%     
==========================================
  Files         518      518              
  Lines       54826    54826              
==========================================
+ Hits        37657    37690      +33     
+ Misses      14274    14242      -32     
+ Partials     2895     2894       -1     

☔ View full report in Codecov by Sentry.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 10, 2026
@github-actions
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot dismissed their stale review April 10, 2026 06:12

Large PR justification has been provided. Thank you!

@yrobla yrobla merged commit acc1f66 into main Apr 10, 2026
42 checks passed
@yrobla yrobla deleted the revert-4669-issue-4494 branch April 10, 2026 06:13
Comment on lines +613 to +630

 	ctx, cancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
 	defer cancel()
 
-	// Update only succeeds if the key still exists. A concurrent Delete (same
-	// pod or cross-pod) returns (false, nil), and we bail without resurrecting.
-	updated, err := sm.storage.Update(ctx, sessionID, metadata)
-	if err != nil {
-		return err
-	}
-	if !updated {
-		return nil // session was terminated; nothing to update
-	}
+	// Cross-pod guard: re-check that the storage record still exists before
+	// upserting. If another pod terminated the session (deleting the key) after
+	// NotifyBackendExpired's initial Load, we must not recreate the record.
+	if _, err := sm.storage.Load(ctx, sessionID); err != nil {
+		if errors.Is(err, transportsession.ErrSessionNotFound) {
+			return nil // session was terminated elsewhere; nothing to update
+		}
+		return err
+	}
+	if err := sm.storage.Upsert(ctx, sessionID, metadata); err != nil {
+		return err
+	}
 	// The cache self-heals lazily: on the next GetMultiSession, checkSession detects
 	// either the absent storage key or stale MetadataKeyBackendIDs and evicts the
 	// entry, triggering a fresh RestoreSession.
 	sm.sessions.Delete(sessionID)
Contributor


🔴 In updateMetadata, a single context.WithTimeout(context.Background(), notifyBackendExpiredTimeout) context is created at line 614 and shared between both the cross-pod guard storage.Load call and the subsequent storage.Upsert call. If the Load consumes most of the 5-second budget (e.g., slow Redis or network jitter), the Upsert fails with a deadline-exceeded error even though the guard check succeeded and the write was safe to perform. The fix is to create two separate contexts—one for Load and one for Upsert—matching the documented intent of the notifyBackendExpiredTimeout constant, which explicitly states each storage operation is 'capped independently' and the worst-case wall-clock is '2 × 5 s = 10 s'.

Extended reasoning...

What the bug is: updateMetadata (session_manager.go lines 614–630) creates a single context with a 5-second timeout and passes it to two sequential storage operations: a cross-pod guard storage.Load (line 620) and a subsequent storage.Upsert (line 627). These two operations share one deadline, so the 5-second budget is split between them rather than each getting its own independent 5 seconds.

The specific code path: Inside updateMetadata, the code is:

ctx, cancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
defer cancel()
if _, err := sm.storage.Load(ctx, sessionID); err != nil { ... }      // consumes from shared ctx
if err := sm.storage.Upsert(ctx, sessionID, metadata); err != nil { ... } // residual budget only

If the Load call takes, say, 4.9 seconds (slow Redis or network jitter), the Upsert starts with only 0.1 seconds remaining and fails immediately with a context deadline exceeded error.

Why existing code doesn't prevent it: The constant notifyBackendExpiredTimeout has an explicit comment that reads: 'bounds each individual storage operation inside NotifyBackendExpired() — one Load and one Upsert, each capped independently. Each is a single-key Redis operation, so 5 s per call is consistent with terminateTimeout and decorateTimeout. Worst-case wall-clock for the function is 2 × 5 s = 10 s.' The implementation directly contradicts this documented intent. The old code (before the revert) used a single storage.Update atomic operation, so there was only one operation per context and this sharing problem did not exist. The revert introduced two sequential storage operations under one shared context.

What the impact would be: When the Upsert fails due to the exhausted context, updateMetadata returns an error, NotifyBackendExpired logs a warning, the session is NOT evicted from the node-local cache (only evicted on success), and the backend-expiry metadata is silently dropped. The next GetMultiSession call will still use the stale in-memory session holding references to the now-expired backend. This is a subtle correctness issue that only manifests under degraded network conditions (slow Redis, network jitter) where the re-check Load takes close to 5 seconds.

How to fix it: Create two separate contexts, one for the Load and one for the Upsert:

loadCtx, loadCancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
defer loadCancel()
if _, err := sm.storage.Load(loadCtx, sessionID); err != nil { ... }

upsertCtx, upsertCancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
defer upsertCancel()
if err := sm.storage.Upsert(upsertCtx, sessionID, metadata); err != nil { ... }

This matches the documented intent and ensures both operations receive their full independent 5-second budget.

Step-by-step proof: (1) Backend workload-A expires; NotifyBackendExpired is called. (2) Initial storage.Load in NotifyBackendExpired at line 540 succeeds in ~0.1s (its own independent context). (3) updateMetadata is called with the updated metadata. (4) The shared context is created with a 5-second deadline at line 614. (5) The cross-pod guard storage.Load at line 620 takes 4.9s due to Redis latency—context still has 0.1s remaining. (6) The storage.Upsert at line 627 starts with 0.1s remaining, Redis round-trip exceeds this, and Upsert returns context deadline exceeded. (7) updateMetadata returns an error; sm.sessions.Delete(sessionID) is never called. (8) NotifyBackendExpired logs a warning and returns. (9) The node-local cache still holds the stale session; subsequent GetMultiSession returns stale backend metadata until the session is next evicted by checkSession detecting drift.

Comment on lines +775 to 788
 	if !sm.sessions.CompareAndSwap(sessionID, sess, decorated) {
 		return fmt.Errorf("DecorateSession: session %q was terminated or concurrently modified during decoration", sessionID)
 	}
 	// Persist updated metadata to storage. On failure, attempt to rollback
 	// the local-map entry so the caller can retry. If Terminate() has since
 	// replaced the decorated entry with a sentinel, the rollback CAS returns
 	// false and we leave the sentinel in place.
 	decorateCtx, decorateCancel := context.WithTimeout(context.Background(), decorateTimeout)
 	defer decorateCancel()
-	updated, err := sm.storage.Update(decorateCtx, sessionID, decorated.GetMetadata())
-	if err != nil {
+	if err := sm.storage.Upsert(decorateCtx, sessionID, decorated.GetMetadata()); err != nil {
 		_ = sm.sessions.CompareAndSwap(sessionID, decorated, sess)
 		return fmt.Errorf("DecorateSession: failed to store decorated session metadata: %w", err)
 	}
-	if !updated {
-		// Session was deleted (by Terminate or TTL) between Get and Update.
-		// Evict the stale cache entry so onEvict closes backend connections.
-		sm.sessions.Remove(sessionID)
-		return fmt.Errorf("DecorateSession: session %q was deleted during decoration", sessionID)
-	}
-	sm.sessions.Set(sessionID, decorated)
 	return nil
Contributor


🔴 The revert replaces storage.Update (SET XX / conditional write) with unconditional storage.Upsert in DecorateSession, creating a window where a concurrent Terminate() can delete the storage key and then DecorateSession's Upsert silently recreates it. On the same pod the consequence is bounded (the decorator wraps an already-closed session), but cross-pod RestoreSession calls will find the resurrected record in Redis and open fresh backend connections for a session that has been terminated.

Extended reasoning...

The race window

After DecorateSession's CompareAndSwap(sessionID, sess, decorated) succeeds at line 775, the local cache holds decorated as a live vmcpsession.MultiSession. At this point there is a concurrent execution window before line 784's storage.Upsert runs. During that window Terminate() can execute fully: (1) Peek sees decorated (a valid MultiSession, not a sentinel), (2) Store(terminatedSentinel{}) overwrites the cache entry unconditionally, (3) storage.Delete removes the Redis/local key, (4) sessions.Delete removes the sentinel, (5) multiSess.Close() releases backend connections. At this point both the cache and storage are clean — the session is fully terminated.

Why Upsert causes resurrection

DecorateSession then reaches line 784 and calls storage.Upsert(decorateCtx, sessionID, decorated.GetMetadata()). Because Upsert is unconditional (Redis SET with no XX flag), it recreates the storage key even though Terminate just deleted it. The rollback CompareAndSwap(sessionID, decorated, sess) at line 785 only fires on a storage error, not on this success path — so the local cache remains empty but storage now holds a non-terminated metadata record.

Same-pod impact is bounded; cross-pod impact is not

The code comment at lines 757–760 claims "its consequence is bounded: Terminate() already called Close() on the underlying MultiSession before deleting it, so any resurrected decorator wraps an already-closed session and will fail on first use". This is only true for same-pod access via the in-memory RestorableCache. Any other pod performing GetMultiSession → loadSession → factory.RestoreSession will find the resurrected storage record, see it is not marked terminated, and call factory.RestoreSession to open fresh backend connections for a session ID that was explicitly terminated. The closed local session object is never involved; the other pod constructs a brand-new MultiSession from the resurrected metadata.

Concrete proof of the race

  1. Pod A calls DecorateSession("sess-1", fn). GetMultiSession returns sess; fn runs and returns decorated. CAS succeeds — cache: {"sess-1": decorated}.
  2. Pod A's Terminate("sess-1") runs on a concurrent goroutine: Peek sees decorated, stores terminatedSentinel, calls storage.Delete, calls sessions.Delete, calls decorated.Close(). Cache: empty. Storage: key absent.
  3. DecorateSession resumes at line 784: storage.Upsert(ctx, "sess-1", decorated.GetMetadata()) — storage key recreated with non-terminated metadata. Cache: empty. Storage: {"sess-1": {metadata without terminated flag}}.
  4. Pod B calls GetMultiSession("sess-1") → cache miss → loadSession → storage.Load returns the resurrected record → metadata has no terminated=true flag and has MetadataKeyTokenHash set → factory.RestoreSession opens new backend connections.

The fix that was reverted

The previous code (PR #4669) used storage.Update (Redis SET XX) which is an atomic conditional write: it returns (false, nil) when the key does not exist. After Terminate's storage.Delete, Update would find the key absent and return (false, nil), and DecorateSession would bail without recreating the record. The revert replaced Update with Upsert, removing this protection entirely.

Comment on lines +122 to +131
 	// Guard against a sentinel being stored between load() completing and
 	// this Store call (Terminate() running concurrently). LoadOrStore is
 	// atomic: if a sentinel got in, we discard the freshly loaded value
 	// via onEvict rather than silently overwriting the sentinel.
 	if _, loaded := c.m.LoadOrStore(key, v); loaded {
 		if c.onEvict != nil {
 			c.onEvict(key, v)
 		}
 		return nil, errSentinelFound
 	}
Contributor


🟡 In RestorableCache.Get (cache.go lines 126–130), after load() completes, LoadOrStore returns loaded=true for any pre-existing value, but the code unconditionally calls onEvict(key, v) and returns errSentinelFound without checking whether the pre-existing value is a sentinel or a legitimate V. If a concurrent CreateSession stores a valid MultiSession between storage.Upsert (line 348) and sessions.Store (line 355), a simultaneous GetMultiSession → loadSession → RestoreSession call will find that V via LoadOrStore, close the freshly-restored session's backend connections via onEvict, and return (nil, false) to all singleflight waiters even though the cache holds a valid session. The old ValidatingCache handled this correctly by reading back the winner via ContainsOrAdd+Get and calling onEvict only on the loser; the new code always treats any concurrent store as a sentinel, and the test TestValidatingCache_Singleflight_EvictsLoserWhenLoadRacesWriter that verified the correct behavior was removed in this PR.

Extended reasoning...

The logical flaw: At cache.go lines 126–130, LoadOrStore(key, v) returns (_, loaded=true) whenever anything is already stored under key. The code comment says "if a sentinel got in" but the code cannot distinguish a terminatedSentinel{} from a legitimate vmcpsession.MultiSession. Both cause onEvict(key, v) to fire on the freshly-loaded v (closing its backend connections) and errSentinelFound to be returned, causing all singleflight waiters to receive (zero, false).

The code path that triggers it: In CreateSession (session_manager.go lines 348–355), storage.Upsert writes the session metadata including MetadataKeyTokenHash to Redis, making the session restorable, and then sessions.Store(sessionID, sess) stores the live MultiSession in the node-local cache. These two operations are not atomic. During the window where metadata is in Redis but the V is not yet in the cache: a concurrent GetMultiSession(sessionID) sees a cache miss, enters the singleflight group, calls loadSession, finds MetadataKeyTokenHash in storage, and invokes factory.RestoreSession (which opens HTTP backend connections — typically 100–500 ms). Meanwhile CreateSession completes its sessions.Store. When RestoreSession finally returns and tries LoadOrStore, it finds the V from CreateSession and incorrectly treats it as a sentinel.

Why existing code does not prevent it: The singleflight re-check (lines 110–116) inspects the cache before load() is called and correctly returns the live value or sentinel found there. The bug is in the post-load LoadOrStore guard, which has no type assertion — it cannot tell a just-stored MultiSession from a terminatedSentinel.

Addressing the refutations: Five verifiers argue the trigger is implausible because the window between storage.Upsert and sessions.Store is nanoseconds while RestoreSession takes hundreds of milliseconds. This timing argument is sound: for the race to fire, the concurrent loadSession goroutine must start its storage.Load call after storage.Upsert runs, then RestoreSession must take long enough that sessions.Store completes before LoadOrStore is reached. In a single-pod scenario this requires either a very unlucky scheduler interleaving (cache miss seen between Upsert and Store in the same process) or a cross-pod client known to the pod before the session is registered with the SDK — both extremely unlikely. The refutations are persuasive that the race is near-impossible with real HTTP backends.

What makes it worth noting regardless: The code carries an incorrect invariant comment ("if a sentinel got in") and has silently dropped the correctness guarantee that the previous ValidatingCache implementation provided via ContainsOrAdd+Get. The removed test TestValidatingCache_Singleflight_EvictsLoserWhenLoadRacesWriter explicitly verified that a concurrent valid-V store was handled correctly (winner's value returned, loser evicted). The new TestRestorableCache_Sentinel_StoredDuringLoad only tests the sentinel case. This is a latent correctness regression: if RestoreSession ever becomes faster (e.g., in tests or with connection-pool prewarming) or a new caller path that calls GetMultiSession for an in-progress session emerges, the bug will manifest silently.

Step-by-step proof (theoretical, requires unlucky scheduling):

  1. sm.Generate() stores empty placeholder for sessionID; no MetadataKeyTokenHash yet.
  2. sm.CreateSession() calls factory.MakeSessionWithID() (slow — opens backends).
  3. CreateSession calls sm.storage.Upsert(sessionID, sess.GetMetadata()) → MetadataKeyTokenHash now in Redis.
  4. Race window begins: sessions.Store has not yet executed.
  5. Concurrently, another goroutine calls sm.GetMultiSession(sessionID) — cache miss.
  6. loadSession calls storage.Load — sees MetadataKeyTokenHash, session is restorable.
  7. loadSession calls factory.RestoreSession — slow (opens HTTP connections).
  8. CreateSession executes sessions.Store(sessionID, sess) — valid V now in cache.
  9. RestoreSession completes; LoadOrStore(key, restoredSess) finds the V from step 8.
  10. Code calls onEvict(key, restoredSess) → closes the freshly-restored session's backend connections.
  11. Returns errSentinelFound → all singleflight waiters receive (nil, false).

Fix: Check whether the pre-existing value is actually a sentinel before calling onEvict and returning errSentinelFound. If loaded=true and the existing value is a V, call onEvict on the freshly-loaded v (as before) but return the winner's V to callers, mirroring the old ContainsOrAdd+Get logic.

yrobla pushed a commit that referenced this pull request Apr 10, 2026
…pattern, add storage Update" (#4730)"

This reverts commit acc1f66.