[https://nvbugs/6317600][fix] Add an early return at the head of _run_attention_warmup when…#15486
tensorrt-cicd wants to merge 2 commits
… reqs by per-window free blocks

The Qwen3-Next-80B-A3B-Thinking tp4ep4 hang has two contributing pieces:

1. gdn_mixer.forward used boolean-mask indexing (state_indices_p[~has_initial_states_p]) on a CUDA bool tensor for the prefix-cache state-reset block. That forces a GPU->CPU sync per prefill step (twice per layer, for ssm_states and conv_states) so PyTorch can read the mask reduction count and allocate the output. Combined with TP=4 + EP=4 + the overlap scheduler, the variable per-rank latency of this sync was enough to desync subsequent TP/EP collectives and deadlock the forward pass mid-MMLU on Qwen3-Next-80B-A3B-Thinking.

   Replace mask indexing with the same pattern already used in mamba2_mixer.py: gate on the host-side use_initial_states flag and fall through to torch.where + index_copy_ for the mixed-batch case. Both paths preserve the original semantic (zero rows whose request has no prior mamba state, keep rows that resume from prefix cache) and have output shapes that do not depend on tensor contents, so no implicit CPU sync is introduced.

2. CppMambaHybridCacheManager.add_dummy_requests delegated straight to the base KVCacheManager without checking the recurrent-states window in the unified C++ KV pool. CudaGraphConfig.batch_sizes (with max_batch_size=720 and enable_padding=True) generates capture batches that can exceed the most-constrained window. _create_cuda_graph_warmup_request only checks get_num_free_blocks (full-attention window), so the recurrent-states window underflows and add_sequence_batch raises 'No free block found' from the C++ side, leaving collectives in an incomplete state.

   Add an upfront guard over min(num_free_blocks_per_window_size.values()) so oversized warmup batches return None, matching the 'if requests is None: return None' contract expected by _create_cuda_graph_warmup_request in model_engine.py.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
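The guard described in the second piece of this commit message can be sketched in plain Python. This is an illustration only, not the code from the PR: the function name `add_dummy_requests_guarded`, the `blocks_per_request` parameter, and the one-block-per-request default are assumptions made for the example. Only the idea of comparing against `min(num_free_blocks_per_window_size.values())` and returning `None` comes from the commit message.

```python
from typing import Dict, List, Optional


def add_dummy_requests_guarded(
    request_ids: List[int],
    num_free_blocks_per_window_size: Dict[int, int],
    blocks_per_request: int = 1,  # assumption: the real per-request cost is not stated in the source
) -> Optional[List[int]]:
    """Hypothetical stand-in for the guarded add_dummy_requests.

    Returns None when the batch cannot fit in the most-constrained window,
    so the caller can skip this warmup batch instead of hitting an
    allocation failure deeper in the cache manager.
    """
    if not num_free_blocks_per_window_size:
        return None

    # The most-constrained window bounds how many dummy requests fit.
    min_free = min(num_free_blocks_per_window_size.values())
    if len(request_ids) * blocks_per_request > min_free:
        return None

    # Stand-in for delegating to the base cache manager.
    return list(request_ids)


# A window with only 2 free blocks rejects a 3-request warmup batch,
# even though the other window has plenty of room.
free_blocks = {4096: 100, 1: 2}
print(add_dummy_requests_guarded([0, 1, 2], free_blocks))
print(add_dummy_requests_guarded([0, 1], free_blocks))
```

Checking the minimum across all windows, rather than a single window's free-block count, is what lets the caller's existing `if requests is None: return None` path handle oversized capture batches.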
…ngine configs

TestQwen3NextThinking::test_auto_dtype[tp4ep4] hangs during engine startup in _run_attention_warmup. The C++ TRTLLM-Gen FMHA JIT warmup enumerates a (batchSize x seqLenKv) cartesian grid sized by engine maxima. For Qwen3-Next-80B-A3B-Thinking with max_batch_size=2048 and max_seq_len=262144, the densified grid pushes warmup TMA descriptor shapes past the flashinfer 2^32 limit and hangs engine startup.

Skip the warmup whenever the maxima product exceeds 256 * 16384 (the pre-PR NVIDIA#15305 effective grid size). Any kernel not pre-warmed will JIT-compile lazily on its first real request - correct in all cases, only slightly slower for that first request. Same approach as the sister fix for GPT-OSS-120B (nvbugs/6316980 / nvbugs/6275959).

Verified passing: MMLU 85.79% (threshold 84.18%), GSM8K 85.10% (threshold 78.37%), 1 passed in 534s on B200 tp4ep4.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
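The skip condition in this commit message reduces to a single product comparison. The sketch below shows that arithmetic for the Qwen3-Next configuration quoted above. The helper name `should_skip_attention_warmup` and the constant name `WARMUP_GRID_LIMIT` are hypothetical; the actual early return lives inside `_run_attention_warmup` and may be written differently.

```python
# Threshold quoted in the commit message: 256 * 16384 = 2**22.
WARMUP_GRID_LIMIT = 256 * 16384


def should_skip_attention_warmup(batch_size: int, max_seq_len: int) -> bool:
    """Hypothetical helper: True when the engine maxima product exceeds the limit."""
    return batch_size * max_seq_len > WARMUP_GRID_LIMIT


# Qwen3-Next-80B-A3B-Thinking maxima from the commit message:
# 2048 * 262144 = 2**29, well above 2**22, so the warmup is skipped.
print(should_skip_attention_warmup(2048, 262144))

# A configuration exactly at the limit still runs the warmup,
# because the comparison is strict.
print(should_skip_attention_warmup(256, 16384))
```

Skipping only changes when kernels are compiled: per the commit message, anything not pre-warmed is JIT-compiled on its first real request.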
No actionable comments were generated in the recent review. Files selected for processing: 3.
Walkthrough
Three independent defensive fixes: (1) Mamba2 prefill state-cache initialization in …
Changes
Mamba State-Cache Init and Dummy-Request Guard
FMHA JIT Warmup Early-Exit
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 3 passed, 2 failed (1 warning, 1 inconclusive)
Summary
Add an early return at the head of _run_attention_warmup when self.batch_size * self.max_seq_len > 256 * 16384 (the effective grid size before PR #15305, "[https://nvbugs/6248837][fix] Densify trtllm-gen fmha warmup grid to catch missing kernels"). Un-warmed kernels JIT-compile lazily on the first real request: correct in all cases, slightly slower for that first request only.
Test plan
Links
Summary by CodeRabbit