[feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming #15493

Draft
key4ng wants to merge 7 commits into NVIDIA:main from key4ng:feat/grpc-subscribe-kv-events

@key4ng key4ng commented Jun 19, 2026

Description

The tensorrt_llm/grpc/ server (used for high-performance integration with external routers such as sgl-router) implements Generate / Embed / HealthCheck / Abort / GetModelInfo / GetServerInfo, but not SubscribeKvEvents, so it inherits the base UNIMPLEMENTED. Routers that do KV-event-driven, cache-aware load balancing therefore fall back to an approximate prefix tree for TensorRT-LLM workers instead of routing on each worker's actual KV-cache state.

This PR implements SubscribeKvEvents on TrtllmServiceServicer, bridging TRT-LLM's LLM.get_kv_cache_events_async() to the common.KvEventBatch stream the routers already consume (the same proto this module already imports from smg-grpc-proto).

What's added

  • tensorrt_llm/grpc/kv_events.py (new) — a pure, engine-free converter (no GPU, unit-testable):
    • to_int64() — reduces TRT-LLM's unsigned 64-bit block/parent hashes to signed int64 so block identity and parent chaining stay consistent with the router's index.
    • convert_event() / convert_batch() — map stored → KvBlocksStored and removed → KvBlocksRemoved; created / updated are skipped (no common.proto equivalent).
  • tensorrt_llm/grpc/grpc_servicer.py — add the SubscribeKvEvents async-generator RPC:
    • Gates on KvCacheConfig (enable_block_reuse and event_buffer_max_size > 0); aborts UNIMPLEMENTED when KV events aren't enabled, so routers fall back cleanly.
    • Drains get_kv_cache_events_async() and streams converted batches. Sequence numbers are monotonic and persist across reconnects (they advance only on yield, so a disconnect window introduces no sequence gap and the consumer's stale-batch filter never drops fresh post-reconnect events).
  • tests/unittest/llmapi/test_grpc_kv_events.py (new) — 11 CPU-only unit tests: converter mapping incl. unsigned→signed wrap, created skipped, contiguous sequence numbers, the UNIMPLEMENTED gate, and sequence-number persistence across reconnects.
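
The unsigned-to-signed reduction performed by to_int64() can be illustrated with a minimal sketch. It assumes the conversion is a plain two's-complement reinterpretation of the low 64 bits, which is what the "unsigned→signed wrap" test name suggests; the actual body in kv_events.py may differ.

```python
def to_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed int64 (assumed behaviour)."""
    value &= (1 << 64) - 1   # keep only the low 64 bits
    if value >= 1 << 63:     # top bit set: maps into the negative int64 range
        value -= 1 << 64
    return value


def to_uint64(value: int) -> int:
    """Inverse mapping, included only to show the conversion loses no information."""
    return value & ((1 << 64) - 1)
```

Because the mapping is a bijection on 64-bit values, two distinct block hashes never collide after conversion, and a parent hash converted the same way still matches the converted hash of the parent block.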

Scope / follow-ups

v1 targets the single-tier default. Out of scope (follow-ups): KV-offloading (host-tier) removed semantics, sliding-window multi-window events, and replay on reconnect (start_sequence_number is currently ignored).
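
The sequence-number behaviour described above (a counter that outlives any single stream and moves only when a batch is actually yielded) might look roughly like the sketch below. SequencedStream and its stream() method are hypothetical names used for illustration, not the servicer's real structure, and the events are plain strings rather than proto messages.

```python
import asyncio
from typing import AsyncIterator, Iterable, List, Tuple


class SequencedStream:
    """Hypothetical helper: numbers batches with a counter shared across streams."""

    def __init__(self) -> None:
        self._next_seq = 0  # lives on the servicer, not on an individual stream

    async def stream(
        self, batches: Iterable[List[str]]
    ) -> AsyncIterator[Tuple[int, List[str]]]:
        for events in batches:
            if not events:
                continue  # nothing to send, so the counter does not move
            seq = self._next_seq
            self._next_seq += 1  # advance only for a batch that is yielded
            yield seq, events


async def collect(gen: AsyncIterator[Tuple[int, List[str]]]) -> List[int]:
    return [seq async for seq, _ in gen]


def demo() -> Tuple[List[int], List[int]]:
    servicer = SequencedStream()
    first = asyncio.run(collect(servicer.stream([["a"], [], ["b"]])))
    second = asyncio.run(collect(servicer.stream([["c"]])))  # a "reconnect"
    return first, second
```

In this sketch the second stream continues from where the first stopped, so a consumer that filters batches older than the last sequence number it saw would not discard anything after reconnecting.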

Test Plan

Unit (CPU, no model):

pytest tests/unittest/llmapi/test_grpc_kv_events.py
# 11 passed

End-to-end (1× H100, TinyLlama-1.1B-Chat, --grpc --backend pytorch):

| Worker config | Result |
| --- | --- |
| KvCacheConfig(enable_block_reuse=True, event_buffer_max_size=16384) | Cache-aware router subscribes → stream connects (start_seq=0) → router learns block_size=32 (== tokens_per_block) from live events → repeated prefixes route stickily (~7× TTFT improvement on cache hits). |
| event_buffer_max_size=0 (events disabled) | RPC aborts UNIMPLEMENTED → router disables the subscription and falls back to its approximate tree (graceful). |
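
The two rows above exercise the gate on KvCacheConfig. A minimal sketch of that predicate follows; KvCacheConfigStub is a stand-in dataclass invented for this example so it runs without TensorRT-LLM, and only the two field names come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KvCacheConfigStub:
    """Stand-in for KvCacheConfig, carrying only the two fields the gate reads."""
    enable_block_reuse: bool
    event_buffer_max_size: int


def kv_events_enabled(cfg: Optional[KvCacheConfigStub]) -> bool:
    """True only when block reuse is on and the event buffer has capacity."""
    if cfg is None:
        return False
    return bool(cfg.enable_block_reuse) and cfg.event_buffer_max_size > 0
```

When the predicate is false, the RPC would abort with UNIMPLEMENTED (in grpc.aio, via the servicer context's abort() with grpc.StatusCode.UNIMPLEMENTED), which is the signal the router uses to fall back to its approximate tree.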

Launch used for the events-enabled worker:

trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --grpc --backend pytorch \
  --host 127.0.0.1 --port 31001 \
  --extra_llm_api_options <(echo 'kv_cache_config: {enable_block_reuse: true, event_buffer_max_size: 16384}')

Draft — opening for discussion/review.

🤖 Generated with Claude Code

key4ng added 5 commits June 18, 2026 18:48
Signed-off-by: key4ng <rukeyang@gmail.com>
…ng in test

Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
…ests

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng key4ng changed the title from "[None][feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming" to "[feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming" Jun 19, 2026
key4ng added 2 commits June 22, 2026 20:48
SubscribeKvEvents emitted one KvEventBatch per event and ran the blocking
get_kv_cache_events drain on the asyncio loop; under load this saturated the
serving process. Pack each drain cycle's events into one batch (capped at
1024), matching the ZMQ-based bridges, and run the blocking drain in a worker
thread so the serving event loop is not stalled.

Signed-off-by: key4ng <rukeyang@gmail.com>
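
The batching change in the commit above can be sketched as follows. This is an illustration under assumptions, not the committed code: blocking_drain stands in for the blocking get_kv_cache_events call, events are plain dicts, and splitting an oversized drain into several capped batches is one possible reading of "capped at 1024".

```python
import asyncio
from typing import Callable, List, Sequence

MAX_EVENTS_PER_BATCH = 1024  # cap quoted in the commit message


def pack(events: Sequence[dict], cap: int = MAX_EVENTS_PER_BATCH) -> List[List[dict]]:
    """Group one drain cycle's events into batches of at most `cap` events."""
    return [list(events[i:i + cap]) for i in range(0, len(events), cap)]


async def drain_once(blocking_drain: Callable[[], Sequence[dict]]) -> List[List[dict]]:
    # Run the blocking drain in a worker thread so the event loop keeps serving.
    events = await asyncio.to_thread(blocking_drain)
    return pack(events)


def demo() -> List[int]:
    fake_drain = lambda: [{"event_id": i} for i in range(2500)]
    return [len(batch) for batch in asyncio.run(drain_once(fake_drain))]
```

asyncio.to_thread requires Python 3.9 or later; on older interpreters loop.run_in_executor serves the same purpose.
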
TRT-LLM's get_kv_cache_events is a single-consumer drain: with one drain per
SubscribeKvEvents stream, N gateways subscribing to a worker split its event
stream so each saw only ~1/N of the blocks, degrading multi-gateway cache-aware
routing (a 1/2/4-gateway sweep showed event throughput collapsing from 23 to
18.5 req/s as gateways increased). A shared background task now drains once and
broadcasts each batch to every subscriber, so every gateway receives the full
stream -- matching the ZMQ pub/sub behaviour of the other engines. With fan-out,
4-gateway event throughput recovers to 22.6 req/s, on par with the approximate
tree.

Signed-off-by: key4ng <rukeyang@gmail.com>
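
The fan-out described in the commit above (one shared drain, every subscriber receives every batch) can be sketched with one asyncio.Queue per subscriber. Broadcaster is a hypothetical name for illustration; the real servicer may bound its queues or handle slow subscribers differently.

```python
import asyncio
from typing import List, Set


class Broadcaster:
    """Hypothetical fan-out hub: a single producer publishes to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: Set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()  # unbounded in this sketch
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def publish(self, batch: List[str]) -> None:
        # Every subscriber gets the same batch, so no stream is split 1/N.
        for queue in self._subscribers:
            queue.put_nowait(batch)


async def demo():
    hub = Broadcaster()
    a, b = hub.subscribe(), hub.subscribe()
    for batch in (["e1", "e2"], ["e3"]):
        hub.publish(batch)
    got_a = [await a.get(), await a.get()]
    got_b = [b.get_nowait(), b.get_nowait()]
    hub.unsubscribe(b)       # e.g. gateway b disconnects
    hub.publish(["e4"])      # only a receives this batch
    return got_a, got_b, a.qsize(), b.qsize()
```

Each SubscribeKvEvents stream would then read from its own queue, while a single background task performs the drain and calls publish().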