[feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming by key4ng · Pull Request #15493 · NVIDIA/TensorRT-LLM

key4ng · 2026-06-19T04:23:45Z

Description

The tensorrt_llm/grpc/ server (used for high-performance integration with external routers such as sgl-router) implements Generate / Embed / HealthCheck / Abort / GetModelInfo / GetServerInfo, but not SubscribeKvEvents, so it inherits the base UNIMPLEMENTED. Routers that do KV-event-driven, cache-aware load balancing therefore fall back to an approximate prefix tree for TensorRT-LLM workers instead of routing on each worker's actual KV-cache state.

This PR implements SubscribeKvEvents on TrtllmServiceServicer, bridging TRT-LLM's LLM.get_kv_cache_events_async() to the common.KvEventBatch stream the routers already consume (the same proto this module already imports from smg-grpc-proto).

What's added

tensorrt_llm/grpc/kv_events.py (new) — a pure, engine-free converter (no GPU, unit-testable):
- to_int64() — reduces TRT-LLM's unsigned 64-bit block/parent hashes to signed int64 so block identity and parent chaining stay consistent with the router's index.
- convert_event() / convert_batch() — map stored → KvBlocksStored and removed → KvBlocksRemoved; created / updated are skipped (no common.proto equivalent).
tensorrt_llm/grpc/grpc_servicer.py — add the SubscribeKvEvents async-generator RPC:
- Gates on KvCacheConfig (enable_block_reuse and event_buffer_max_size > 0); aborts UNIMPLEMENTED when KV events aren't enabled, so routers fall back cleanly.
- Drains get_kv_cache_events_async() and streams converted batches. Sequence numbers are monotonic and persist across reconnects (they advance only on yield, so a disconnect window introduces no sequence gap and the consumer's stale-batch filter never drops fresh post-reconnect events).
tests/unittest/llmapi/test_grpc_kv_events.py (new) — 11 CPU-only unit tests: converter mapping incl. unsigned→signed wrap, created skipped, contiguous sequence numbers, the UNIMPLEMENTED gate, and sequence-number persistence across reconnects.
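
The unsigned-to-signed reduction listed for to_int64() can be illustrated with a short sketch. The body of the real function is not shown in this description, so the name `to_int64_sketch` and the two's-complement wrapping below are assumptions about what such a conversion would plausibly look like, not the code in kv_events.py.

```python
def to_int64_sketch(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed int64 (two's complement).

    Hypothetical helper for illustration; not the PR's to_int64().
    """
    value &= 0xFFFFFFFFFFFFFFFF  # keep only the low 64 bits
    # Values with the top bit set map into the negative int64 range.
    return value - (1 << 64) if value >= (1 << 63) else value


# The mapping is one-to-one, so a block hash and a parent hash that refers
# to the same block still compare equal after conversion.
print(to_int64_sketch(0xFFFFFFFFFFFFFFFF))  # -1
print(to_int64_sketch(1 << 63))             # -9223372036854775808
print(to_int64_sketch(42))                  # 42
```

Because the wrap is a bijection on 64-bit values, applying it to both the block hash and the parent hash preserves parent chaining, which is the property the converter bullet relies on.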

Scope / follow-ups

v1 targets the single-tier default. Out of scope (follow-ups): KV-offloading (host-tier) removed semantics, sliding-window multi-window events, and replay on reconnect (start_sequence_number is currently ignored).

Test Plan

Unit (CPU, no model):

pytest tests/unittest/llmapi/test_grpc_kv_events.py
# 11 passed

End-to-end (1× H100, TinyLlama-1.1B-Chat, --grpc --backend pytorch):

| Worker config | Result |
| --- | --- |
| `KvCacheConfig(enable_block_reuse=True, event_buffer_max_size=16384)` | Cache-aware router subscribes → stream connects (`start_seq=0`) → router learns `block_size=32` (== `tokens_per_block`) from live events → repeated prefixes route stickily (~7× TTFT improvement on cache hits). |
| `event_buffer_max_size=0` (events disabled) | RPC aborts `UNIMPLEMENTED` → router disables the subscription and falls back to its approximate tree (graceful). |
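
The two rows above correspond to the two outcomes of the gate on KvCacheConfig. A minimal sketch of that predicate follows; `KvCacheConfigStub` and `kv_events_enabled` are hypothetical names used only for illustration, and the stub carries just the two fields the gate is described as reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KvCacheConfigStub:
    """Hypothetical stand-in with only the two fields the gate reads."""
    enable_block_reuse: bool = True
    event_buffer_max_size: int = 0


def kv_events_enabled(cfg: Optional[KvCacheConfigStub]) -> bool:
    """True only when block reuse is on and the event buffer has capacity."""
    if cfg is None:
        return False
    return bool(cfg.enable_block_reuse) and cfg.event_buffer_max_size > 0


print(kv_events_enabled(KvCacheConfigStub(True, 16384)))  # True
print(kv_events_enabled(KvCacheConfigStub(True, 0)))      # False
```

In a grpc.aio servicer, a False result would typically be followed by `await context.abort(grpc.StatusCode.UNIMPLEMENTED, ...)`, which is the status the router is described as treating as a signal to fall back.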

Launch used for the events-enabled worker:

trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --grpc --backend pytorch \
  --host 127.0.0.1 --port 31001 \
  --extra_llm_api_options <(echo 'kv_cache_config: {enable_block_reuse: true, event_buffer_max_size: 16384}')

Draft — opening for discussion/review.

🤖 Generated with Claude Code

SubscribeKvEvents emitted one KvEventBatch per event and ran the blocking get_kv_cache_events drain on the asyncio loop; under load this saturated the serving process. Pack each drain cycle's events into one batch (capped at 1024), matching the ZMQ-based bridges, and run the blocking drain in a worker thread so the serving event loop is not stalled.

Signed-off-by: key4ng <rukeyang@gmail.com>
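
The change described in this commit message combines two ideas: moving a blocking drain off the event loop and packing events into capped batches. The sketch below shows one way to do both with `asyncio.to_thread`. `fake_drain`, `chunk_events`, and `drain_once` are hypothetical names, and splitting an oversized cycle into several batches is an assumption; the message only states that batches are capped at 1024.

```python
import asyncio
import time
from typing import Any, Callable, List

MAX_EVENTS_PER_BATCH = 1024  # cap quoted in the commit message


def chunk_events(events: List[Any], cap: int = MAX_EVENTS_PER_BATCH) -> List[List[Any]]:
    """Split one drain cycle's events into batches of at most `cap` events."""
    return [events[i:i + cap] for i in range(0, len(events), cap)]


async def drain_once(blocking_drain: Callable[[], List[Any]]) -> List[List[Any]]:
    """Run the blocking drain in a worker thread, then pack its events."""
    events = await asyncio.to_thread(blocking_drain)  # event loop stays free
    return chunk_events(events)


def fake_drain() -> List[int]:
    """Hypothetical stand-in for the engine's blocking drain call."""
    time.sleep(0.01)
    return list(range(2500))


batches = asyncio.run(drain_once(fake_drain))
print([len(b) for b in batches])  # [1024, 1024, 452]
```

While the worker thread sleeps inside `fake_drain`, other coroutines on the loop (for example Generate handlers) can continue to run, which is the stall the commit is meant to avoid.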

TRT-LLM's get_kv_cache_events is a single-consumer drain. With one drain per SubscribeKvEvents stream, N gateways subscribing to a worker split its event stream, so each saw only ~1/N of the blocks, which degraded multi-gateway cache-aware routing (a 1/2/4-gateway sweep showed event throughput collapsing from 23 to 18.5 req/s as gateways increased). A shared background task now drains once and broadcasts each batch to every subscriber, so every gateway receives the full stream, matching the ZMQ pub/sub behaviour of the other engines. With fan-out, 4-gateway event throughput recovers to 22.6 req/s, on par with the approximate tree.

Signed-off-by: key4ng <rukeyang@gmail.com>
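
The drain-once, broadcast-to-all pattern described in this commit message can be sketched with one asyncio queue per subscriber. The `Broadcaster` class below is a hypothetical illustration, not the servicer's code; it uses unbounded queues for brevity, whereas a production version would need a policy for slow subscribers.

```python
import asyncio
from typing import Any, List, Set


class Broadcaster:
    """Hypothetical fan-out hub: one producer, one queue per subscriber."""

    def __init__(self) -> None:
        self._queues: Set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._queues.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._queues.discard(q)

    def publish(self, batch: Any) -> None:
        # Every current subscriber receives the same batch.
        for q in list(self._queues):
            q.put_nowait(batch)


async def demo() -> List[List[str]]:
    hub = Broadcaster()
    gateways = [hub.subscribe() for _ in range(4)]
    # A single shared drain task would call publish() once per batch.
    for batch in ("batch-0", "batch-1"):
        hub.publish(batch)
    return [[q.get_nowait() for _ in range(2)] for q in gateways]


received = asyncio.run(demo())
print(received[0])                              # ['batch-0', 'batch-1']
print(all(r == received[0] for r in received))  # True
```

Each of the four simulated gateways sees both batches, in contrast to the per-stream drain where the same events would have been divided among them.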

key4ng added 5 commits June 18, 2026 18:48

feat(grpc): add KV-cache event -> proto converter

aba968c

Signed-off-by: key4ng <rukeyang@gmail.com>

fix(grpc): use TRT-LLM event_id directly; exercise parent-hash wrapping in test

8283695

Signed-off-by: key4ng <rukeyang@gmail.com>

feat(grpc): implement SubscribeKvEvents on TrtllmServiceServicer

e2d0630

Signed-off-by: key4ng <rukeyang@gmail.com>

test(grpc): move SubscribeKvEvents test imports to file top

c6c2e65

Signed-off-by: key4ng <rukeyang@gmail.com>

fix(grpc): persist KV-event seq across reconnects; tidy imports and tests

f2e2d90

Signed-off-by: key4ng <rukeyang@gmail.com>

github-actions Bot assigned key4ng Jun 19, 2026

key4ng changed the title ~~[None][feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming~~ [feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming Jun 19, 2026

key4ng added 2 commits June 22, 2026 20:48

key4ng wants to merge 7 commits into NVIDIA:main from key4ng:feat/grpc-subscribe-kv-events
