[feat] gRPC: implement SubscribeKvEvents for KV-cache event streaming #15493
Draft
key4ng wants to merge 7 commits into
Signed-off-by: key4ng <rukeyang@gmail.com>
…ng in test Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
Signed-off-by: key4ng <rukeyang@gmail.com>
…ests Signed-off-by: key4ng <rukeyang@gmail.com>
SubscribeKvEvents emitted one KvEventBatch per event and ran the blocking get_kv_cache_events drain on the asyncio loop; under load this saturated the serving process. Pack each drain cycle's events into one batch (capped at 1024), matching the ZMQ-based bridges, and run the blocking drain in a worker thread so the serving event loop is not stalled.
Signed-off-by: key4ng <rukeyang@gmail.com>
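The two changes in this commit (per-cycle batching with a cap, and moving the blocking drain off the event loop) can be illustrated with a minimal sketch. It is not the code in grpc_servicer.py: `blocking_drain` is a hypothetical stand-in for the blocking get_kv_cache_events call, and splitting an oversized cycle into several capped batches is an assumption about how the 1024 cap is applied.

```python
import asyncio
import queue
from typing import Any, Iterator, List

MAX_EVENTS_PER_BATCH = 1024  # cap named in the commit message


def chunk_events(events: List[Any],
                 limit: int = MAX_EVENTS_PER_BATCH) -> Iterator[List[Any]]:
    """Split one drain cycle's events into batches of at most `limit` events."""
    for start in range(0, len(events), limit):
        yield events[start:start + limit]


def blocking_drain(source: "queue.Queue[Any]", timeout: float = 0.1) -> List[Any]:
    """Hypothetical stand-in for a blocking, single-consumer drain call."""
    drained: List[Any] = []
    try:
        drained.append(source.get(timeout=timeout))  # may block the caller
        while True:
            drained.append(source.get_nowait())  # take whatever else is ready
    except queue.Empty:
        pass
    return drained


async def drain_once(source: "queue.Queue[Any]") -> List[List[Any]]:
    """Run the blocking drain in a worker thread, then pack capped batches."""
    # asyncio.to_thread keeps the event loop free while the drain blocks.
    events = await asyncio.to_thread(blocking_drain, source)
    return list(chunk_events(events))
```

With 2500 queued events, one `drain_once` call yields three batches instead of 2500 single-event messages, which is the reduction in per-message overhead the commit is after.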
TRT-LLM's get_kv_cache_events is a single-consumer drain: with one drain per SubscribeKvEvents stream, N gateways subscribing to a worker split its event stream so each saw only ~1/N of the blocks, degrading multi-gateway cache-aware routing (a 1/2/4-gateway sweep showed event throughput falling from 23 to 18.5 req/s as gateways increased). A shared background task now drains once and broadcasts each batch to every subscriber, so every gateway receives the full stream, matching the ZMQ pub/sub behaviour of the other engines. With fan-out, 4-gateway event throughput recovers to 22.6 req/s, on par with the approximate tree.
Signed-off-by: key4ng <rukeyang@gmail.com>
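The fan-out described in this commit can be sketched with one asyncio.Queue per subscriber. This is an illustration of the pattern only; the `Broadcaster` class and its method names are hypothetical and are not taken from the PR.

```python
import asyncio
from typing import Any, List, Set


class Broadcaster:
    """Deliver every published batch to every current subscriber."""

    def __init__(self) -> None:
        self._subscribers: Set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def publish(self, batch: List[Any]) -> None:
        # A single drain result is pushed to all queues, so no subscriber
        # sees only a fraction of the stream.
        for q in self._subscribers:
            q.put_nowait(batch)


async def demo() -> List[List[Any]]:
    hub = Broadcaster()
    gateways = [hub.subscribe() for _ in range(4)]
    for batch in (["a", "b"], ["c"]):
        hub.publish(batch)
    # Each gateway reads both batches, in publish order.
    return [[await q.get(), await q.get()] for q in gateways]
```

The queues here are unbounded for brevity; a real servicer would likely bound them or drop slow subscribers so that one stalled gateway cannot grow memory without limit.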
Description
The tensorrt_llm/grpc/ server (used for high-performance integration with external routers such as sgl-router) implements Generate / Embed / HealthCheck / Abort / GetModelInfo / GetServerInfo, but not SubscribeKvEvents, so it inherits the base UNIMPLEMENTED. Routers that do KV-event-driven, cache-aware load balancing therefore fall back to an approximate prefix tree for TensorRT-LLM workers instead of routing on each worker's actual KV-cache state.

This PR implements SubscribeKvEvents on TrtllmServiceServicer, bridging TRT-LLM's LLM.get_kv_cache_events_async() to the common.KvEventBatch stream the routers already consume (the same proto this module already imports from smg-grpc-proto).

What's added
- tensorrt_llm/grpc/kv_events.py (new): a pure, engine-free converter (no GPU, unit-testable):
  - to_int64(): reduces TRT-LLM's unsigned 64-bit block/parent hashes to signed int64 so block identity and parent chaining stay consistent with the router's index.
  - convert_event() / convert_batch(): map stored → KvBlocksStored and removed → KvBlocksRemoved; created / updated are skipped (no common.proto equivalent).
- tensorrt_llm/grpc/grpc_servicer.py: add the SubscribeKvEvents async-generator RPC:
  - Gated on KvCacheConfig (enable_block_reuse and event_buffer_max_size > 0); aborts with UNIMPLEMENTED when KV events aren't enabled, so routers fall back cleanly.
  - Drains get_kv_cache_events_async() and streams converted batches. Sequence numbers are monotonic and persist across reconnects (they advance only on yield, so a disconnect window introduces no sequence gap and the consumer's stale-batch filter never drops fresh post-reconnect events).
- tests/unittest/llmapi/test_grpc_kv_events.py (new): 11 CPU-only unit tests covering the converter mapping (incl. unsigned→signed wrap), created skipped, contiguous sequence numbers, the UNIMPLEMENTED gate, and sequence-number persistence across reconnects.

Scope / follow-ups
v1 targets the single-tier default. Out of scope (follow-ups): KV-offloading (host-tier) removed semantics, sliding-window multi-window events, and replay on reconnect (start_sequence_number is currently ignored).

Test Plan
Unit (CPU, no model):
End-to-end (1× H100, TinyLlama-1.1B-Chat, --grpc --backend pytorch):

- KvCacheConfig(enable_block_reuse=True, event_buffer_max_size=16384): start_seq=0) → router learns block_size=32 (== tokens_per_block) from live events → repeated prefixes route stickily (~7× TTFT improvement on cache hits).
- event_buffer_max_size=0 (events disabled): UNIMPLEMENTED → router disables the subscription and falls back to its approximate tree (graceful).

Launch used for the events-enabled worker:
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --grpc --backend pytorch \
  --host 127.0.0.1 --port 31001 \
  --extra_llm_api_options <(echo 'kv_cache_config: {enable_block_reuse: true, event_buffer_max_size: 16384}')

🤖 Generated with Claude Code
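For reviewers unfamiliar with the unsigned-to-signed reduction mentioned under "What's added", the sketch below shows one way a uint64 hash can be reinterpreted as a two's-complement int64, and how stored / removed events might be mapped while created / updated are skipped. It is not the code in kv_events.py: the dict-based event layout and the name `convert_event_sketch` are hypothetical, chosen only so the example runs without TRT-LLM or the proto definitions.

```python
from typing import Any, Dict, Optional

_UINT64_MASK = (1 << 64) - 1
_INT64_SIGN_BIT = 1 << 63


def to_int64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed two's-complement int64."""
    value &= _UINT64_MASK
    return value - (1 << 64) if value & _INT64_SIGN_BIT else value


def convert_event_sketch(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a simplified event dict (hypothetical layout) to a router-side shape."""
    kind = event.get("type")
    if kind == "stored":
        parent = event.get("parent_hash")
        return {
            "stored": {
                "block_hashes": [to_int64(h) for h in event["block_hashes"]],
                # The parent goes through the same mapping as the blocks,
                # so parent chaining still lines up after conversion.
                "parent_hash": None if parent is None else to_int64(parent),
            }
        }
    if kind == "removed":
        return {
            "removed": {
                "block_hashes": [to_int64(h) for h in event["block_hashes"]]
            }
        }
    return None  # created / updated have no counterpart and are skipped
```

Because the mapping is a bijection on 64-bit values, two distinct unsigned hashes never collide after conversion, which is what keeps block identity consistent on the router side.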