Skip to content

Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update#4669

Merged
yrobla merged 7 commits intomainfrom
issue-4494
Apr 10, 2026
Merged

Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update#4669
yrobla merged 7 commits intomainfrom
issue-4494

Conversation

@yrobla
Copy link
Copy Markdown
Contributor

@yrobla yrobla commented Apr 8, 2026

Summary

  • Introduce pkg/cache.ValidatingCache[K,V]: a fully-typed, capacity-bounded LRU cache using mutex+map+container/list. Replaces the sessionmanager-internal RestorableCache and its sync.Map backing store.
  • Add cache.ErrExpired, cache.New(capacity, load, check, onEvict). A capacity of 0 keeps the existing unbounded behaviour; non-zero evicts the LRU entry (calling onEvict) when the limit is reached.
  • Eliminate terminatedSentinel / Peek(any) / Store(any) / CompareAndSwap(any): all cache operations are now fully typed. The no-resurrection invariant is pushed into the storage layer instead.
  • Add DataStorage.Update (SET XX semantics): overwrites metadata only if the key already exists, returning (false, nil) when absent. Implemented via Redis SetArgs{Mode:"XX"} and a mutex-protected Load+Store in the local in-memory backend.
  • Replace Terminate's sentinel+rollback dance with a simpler storage.Delete followed by cache.Delete. Concurrent RestoreSession calls fail naturally via storage.Load returning ErrSessionNotFound.
  • Replace updateMetadata's Load+Upsert+sentinel-check with a single storage.Update call, closing the TOCTOU resurrection window entirely.
  • Replace DecorateSession's storage.Upsert with storage.Update; a (false, nil) result rolls back the in-cache CAS instead of silently resurrecting a deleted session.
  • Add FactoryConfig.CacheCapacity to wire the limit through sessionmanager.New.
  • Remove sessionmanager/cache.go and cache_test.go; cache tests live in pkg/cache/validating_cache_test.go, covering LRU eviction order, capacity=1, unlimited capacity, singleflight deduplication, and liveness expiry.

Fixes #4494

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Changes

File Change

Does this introduce a user-facing change?

Special notes for reviewers

Large PR Justification

This is a complete pr covering a single functionality. Cannot be split.

@yrobla yrobla requested a review from Copilot April 8, 2026 10:26
@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Apr 8, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a fully-typed, capacity-bounded LRU ValidatingCache and updates the session manager/storage integration to remove the prior sentinel-based anti-resurrection approach by using a conditional storage Update (SET XX semantics).

Changes:

  • Add pkg/cache.ValidatingCache[K,V] with optional LRU capacity and singleflight-deduplicated loads.
  • Extend transport/session.DataStorage with Update (conditional overwrite) and wire it into session metadata updates/decoration to prevent resurrection after deletes.
  • Replace sessionmanager’s internal cache/sentinel pattern with the shared cache package and simplify termination flow.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/vmcp/server/sessionmanager/session_manager.go Switch to pkg/cache.ValidatingCache, remove sentinel-based termination/metadata logic, use storage Update.
pkg/vmcp/server/sessionmanager/session_manager_test.go Update tests to the typed cache and storage Update behavior; adjust race/resurrection coverage.
pkg/vmcp/server/sessionmanager/factory.go Add FactoryConfig.CacheCapacity to configure node-local cache size.
pkg/vmcp/server/sessionmanager/cache.go Remove the old RestorableCache implementation and sentinel-related APIs.
pkg/transport/session/session_data_storage.go Add DataStorage.Update to the interface contract (SET XX semantics).
pkg/transport/session/session_data_storage_redis.go Implement Redis-backed Update via SET XX + TTL.
pkg/transport/session/session_data_storage_local.go Implement local Update with mutex to be atomic w.r.t. Delete.
pkg/cache/validating_cache.go New shared typed cache with LRU capacity, per-hit validation, and singleflight miss deduplication.
pkg/cache/validating_cache_test.go Move/reshape cache tests to cover LRU eviction and typed APIs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 72.68293% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.62%. Comparing base (b542727) to head (653fe28).

Files with missing lines Patch % Lines
pkg/vmcp/server/sessionmanager/session_manager.go 39.58% 17 Missing and 12 partials ⚠️
pkg/cache/validating_cache.go 80.95% 16 Missing and 8 partials ⚠️
...kg/transport/session/session_data_storage_redis.go 82.35% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4669      +/-   ##
==========================================
+ Coverage   68.60%   68.62%   +0.01%     
==========================================
  Files         515      515              
  Lines       53518    53641     +123     
==========================================
+ Hits        36718    36811      +93     
- Misses      13958    13975      +17     
- Partials     2842     2855      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 8, 2026
@yrobla yrobla requested a review from Copilot April 8, 2026 10:42
@github-actions github-actions bot dismissed their stale review April 8, 2026 10:43

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 8, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

pkg/vmcp/server/sessionmanager/session_manager.go:142

  • The onEvict callback now runs for both expiry-driven evictions and capacity/LRU evictions, but the log message says "evicted expired session". This will be misleading once CacheCapacity is set. Consider adjusting the message (or passing eviction reason through the cache) so LRU evictions aren’t reported as expirations.
		func(id string, sess vmcpsession.MultiSession) {
			if closeErr := sess.Close(); closeErr != nil {
				slog.Warn("session cache: error closing evicted session",
					"session_id", id, "error", closeErr)
			}
			slog.Warn("session cache: evicted expired session from node-local cache",
				"session_id", id)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 8, 2026
@yrobla yrobla force-pushed the issue-4494 branch 2 times, most recently from 36eeed9 to 49e8631 Compare April 8, 2026 12:42
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 8, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
…-healing

Remove the Delete/Peek/CompareAndSwap methods from ValidatingCache — the
cache now self-heals lazily: on the next Get, checkSession loads from
storage and evicts if the session is gone or terminated.

Terminate detects phase (full MultiSession vs placeholder) by loading
from storage and checking MetadataKeyTokenHash rather than Peeking the
cache. Full sessions trigger a storage.Delete; placeholders are marked
terminated in storage. The node-local cache catches up on next access.

DecorateSession drops the CAS+rollback pattern in favour of
sessions.Set followed by storage.Update (SET XX), which acts as the
concurrency guard. A concurrent termination between the two calls is
detected by Update returning false.

LocalSessionDataStorage replaces its mixed sync.Map+sync.Mutex with a
plain map guarded exclusively by a single sync.Mutex, eliminating the
dual-primitive complexity. The Update contract tests are extended to
cover missing-key, existing-key, and post-Delete cases.

Add defaultCacheCapacity = 1000 so capacity=0 no longer means
"unlimited"; document the long-term goal of unifying cache and storage
behind a single interface.
@yrobla yrobla requested review from Copilot and jerm-dro April 9, 2026 12:37
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pkg/cache/validating_cache_test.go:405

  • These tests use sync.WaitGroup.Go(...), but elsewhere in this same file sync.WaitGroup is used with Add + go + Done. Unless you’re relying on a Go version that adds WaitGroup.Go, this won’t compile; even if it does, the mixed styles are confusing. Consider rewriting these to the standard Add/Done pattern (or use errgroup.Group if you want a Go helper).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this one! The move to hashicorp/golang-lru/v2 and the storage.Update (SET XX) pattern are solid choices. The sentinel removal cleans things up a lot. I do have a few things I'd like to discuss before we merge though.

The most important one I couldn't leave as an inline comment because server.go isn't in the diff:

pkg/vmcp/server/server.go:453-460 builds the FactoryConfig without setting CacheCapacity. Go defaults int to 0, and sessionmanager.New (line 105) now rejects anything < 1. So the vMCP server won't start on this branch. This needs to be wired from config or given a sensible default.

Other things worth discussing:

  • Phase 1 Terminate still uses Upsert instead of Update (line 485), which can recreate a deleted key... the exact race we're fixing.
  • Phase 2 Terminate deletes from storage but doesn't evict from cache. ValidatingCache has no Delete method. Terminated sessions hold backend connections until the next Get or LRU pressure.
  • DecorateSession writes cache before storage (line 722). Flipping the order would avoid cache/storage divergence on transient storage errors.
  • Duplicate Update test cases in session_data_storage_test.go.
  • cleanupFailedPlaceholder mutates the caller's map in place (line 392).

See the inline comments for details on each. None of these are blocking... but the server.go wiring and the Upsert on line 485 are pretty important to address.

@yrobla yrobla requested review from JAORMX and Copilot April 9, 2026 14:48
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions inline — all can be addressed in follow-up PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vmcp: address RestorableCache capacity limit and sentinel pattern complexity (follow-up to #4464)

5 participants