[https://nvbugs/6317600][fix] Add an early return at the head of `_run_attention_warmup` when… by tensorrt-cicd · Pull Request #15486 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-19T00:27:28Z

Summary

Root cause: The TRTLLM-Gen FMHA JIT warmup grid (batchSize × seqLenKv) is sized by engine maxima. For Qwen3-Next-80B-A3B-Thinking tp4ep4 (maxima product 2048 × 262144 ≈ 5.4e8), the grid pushes warmup TMA descriptor shapes past the FlashInfer 2^32 limit and hangs engine startup before any forward pass runs.
Fix: Add an early return at the head of _run_attention_warmup when self.batch_size * self.max_seq_len > 256 * 16384, the effective grid size before PR #15305 ("[https://nvbugs/6248837][fix] Densify trtllm-gen fmha warmup grid to catch missing kernels"). Un-warmed kernels JIT-compile lazily on the first real request; this is correct in all cases and only slightly slower for that first request.
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6317600

Summary by CodeRabbit

Bug Fixes
- Refined Mamba2 state cache initialization during prefill operations for improved stability
- Added KV-pool capacity safeguard to prevent resource exhaustion failures
- Optimized startup process to prevent hangs with large batch and sequence length configurations

… reqs by per-window free blocks

The Qwen3-Next-80B-A3B-Thinking tp4ep4 hang has two contributing pieces:

1. gdn_mixer.forward used boolean-mask indexing (state_indices_p[~has_initial_states_p]) on a CUDA bool tensor for the prefix-cache state-reset block. That forces a GPU->CPU sync per prefill step (twice per layer, for ssm_states and conv_states) so PyTorch can read the mask reduction count and allocate the output. Combined with TP=4 + EP=4 + the overlap scheduler, the variable per-rank latency of this sync was enough to desync subsequent TP/EP collectives and deadlock the forward pass mid-MMLU on Qwen3-Next-80B-A3B-Thinking.

   Replace mask indexing with the same pattern already used in mamba2_mixer.py: gate on the host-side use_initial_states flag and fall through to torch.where + index_copy_ for the mixed-batch case. Both paths preserve the original semantic (zero rows whose request has no prior mamba state, keep rows that resume from prefix cache) and have output shapes that do not depend on tensor contents, so no implicit CPU sync is introduced.

2. CppMambaHybridCacheManager.add_dummy_requests delegated straight to the base KVCacheManager without checking the recurrent-states window in the unified C++ KV pool. CudaGraphConfig.batch_sizes (with max_batch_size=720 and enable_padding=True) generates capture batches that can exceed the most-constrained window. _create_cuda_graph_warmup_request only checks get_num_free_blocks (full-attention window), so the recurrent-states window underflows and add_sequence_batch raises 'No free block found' from the C++ side, leaving collectives in an incomplete state.

   Add an upfront guard over min(num_free_blocks_per_window_size.values()) so oversized warmup batches return None, matching the 'if requests is None: return None' contract expected by _create_cuda_graph_warmup_request in model_engine.py.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
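The state-reset pattern described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from gdn_mixer.py: the function name `reset_rows_without_initial_state` and its argument names are hypothetical, and the real tensors have more dimensions. It only shows why `torch.where` + `index_copy_` keeps output shapes independent of the mask contents.

```python
import torch


def reset_rows_without_initial_state(states, state_indices, has_initial_states,
                                     use_initial_states):
    """Zero cache rows whose request has no prior state, in place.

    Hypothetical helper: names and signature are illustrative only.
    `use_initial_states` is a host-side Python bool, so branching on it
    does not read anything back from the GPU.
    """
    if not use_initial_states:
        # No request resumes from a prefix cache: zero every selected row.
        states.index_fill_(0, state_indices, 0)
        return states

    # Mixed batch: gather a fixed number of rows (one per prefill request).
    rows = states.index_select(0, state_indices)
    # Reshape the per-request mask so it broadcasts over the trailing dims.
    mask = has_initial_states.view(-1, *([1] * (states.dim() - 1)))
    # Keep rows that resume from cache, zero the rest; shape never changes.
    kept = torch.where(mask, rows, torch.zeros_like(rows))
    states.index_copy_(0, state_indices, kept)
    return states


states = torch.arange(1, 13, dtype=torch.float32).reshape(4, 3)
idx = torch.tensor([2, 0])          # cache slots used by the prefill requests
has = torch.tensor([True, False])   # slot 2 resumes, slot 0 starts fresh
reset_rows_without_initial_state(states, idx, has, use_initial_states=True)
```

By contrast, `state_indices[~has_initial_states]` produces a tensor whose length depends on how many mask entries are set, which is the data-dependent shape the commit message identifies as the source of the sync.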

…ngine configs

TestQwen3NextThinking::test_auto_dtype[tp4ep4] hangs during engine startup in _run_attention_warmup. The C++ TRTLLM-Gen FMHA JIT warmup enumerates a (batchSize x seqLenKv) cartesian grid sized by engine maxima. For Qwen3-Next-80B-A3B-Thinking with max_batch_size=2048 and max_seq_len=262144, the densified grid pushes warmup TMA descriptor shapes past the flashinfer 2^32 limit and hangs engine startup.

Skip the warmup whenever the maxima product exceeds 256 * 16384 (the pre-PR NVIDIA#15305 effective grid size). Any kernel not pre-warmed will JIT-compile lazily on its first real request - correct in all cases, only slightly slower for that first request. Same approach as the sister fix for GPT-OSS-120B (nvbugs/6316980 / nvbugs/6275959).

Verified passing: MMLU 85.79% (threshold 84.18%), GSM8K 85.10% (threshold 78.37%), 1 passed in 534s on B200 tp4ep4.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
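The threshold check can be sketched as a standalone predicate. The constant name `WARMUP_GRID_LIMIT` and the function name below are hypothetical; the PR only states the comparison `batch_size * max_seq_len > 256 * 16384`, and the real guard lives inside `_run_attention_warmup`.

```python
# Hypothetical constant name; the PR only gives the value 256 * 16384.
WARMUP_GRID_LIMIT = 256 * 16384  # 4_194_304 == 2**22


def should_skip_attention_warmup(batch_size: int, max_seq_len: int) -> bool:
    """Return True when the engine-maxima product exceeds the threshold."""
    return batch_size * max_seq_len > WARMUP_GRID_LIMIT


# Engine maxima from the failing Qwen3-Next-80B-A3B-Thinking config.
product = 2048 * 262144  # 536_870_912 == 2**29, about 5.4e8
print(product // WARMUP_GRID_LIMIT)                 # 128
print(should_skip_attention_warmup(2048, 262144))   # True
print(should_skip_attention_warmup(256, 16384))     # False (strictly greater-than)
```

The failing configuration's product is 128 times the threshold, so it falls well inside the skipped range, while a configuration exactly at 256 × 16384 still runs the warmup.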

coderabbitai · 2026-06-19T00:32:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8564371e-803d-4279-a887-0b73064610ba

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8b7af and 52ed7a1.

📒 Files selected for processing (3)

tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

📝 Walkthrough

Three independent defensive fixes: (1) Mamba2 prefill state-cache initialization in Qwen3NextGatedDeltaNet.forward is reworked to use index_copy_ with torch.where instead of boolean-mask assignment; (2) add_dummy_requests gains a KV-pool free-block pre-check that returns None when batch size exceeds window capacity; (3) _run_attention_warmup adds an early-exit when batch_size * max_seq_len exceeds a fixed threshold.

Changes

Mamba State-Cache Init and Dummy-Request Guard

| Layer / File(s) | Summary |
| --- | --- |
| GDN mixer prefill state-cache initialization `tensorrt_llm/_torch/modules/mamba/gdn_mixer.py` | Replaces boolean-mask-based zeroing of `ssm_states`/`conv_states` with `index_copy_` + `torch.where`, computing `has_initial_states_p` and reshaping the mask to broadcast across tensor dimensions. Unconditional zero-assignment is kept when `use_initial_states` is false. |
| `add_dummy_requests` KV-pool capacity pre-check `tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` | Reads `get_kv_cache_stats().num_free_blocks_per_window_size`, computes the minimum free blocks across windows, and returns `None` early if `len(request_ids)` exceeds that minimum to prevent proceeding when the unified pool's recurrent-states window is too small. |
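The capacity pre-check can be sketched in isolation. This is a simplified stand-in, not the `CppMambaHybridCacheManager` code: the function name is hypothetical, the free-block counts are passed in as a plain dict rather than read from `get_kv_cache_stats()`, and the final `return` merely stands in for delegating to the base manager. It assumes at least one window is present.

```python
from typing import Dict, List, Optional


def add_dummy_requests_guarded(
        request_ids: List[int],
        free_blocks_per_window: Dict[int, int]) -> Optional[List[int]]:
    """Return None when the batch cannot fit in the most-constrained window.

    Hypothetical helper: `free_blocks_per_window` maps a window size to its
    free-block count, mirroring `num_free_blocks_per_window_size`.
    """
    min_free = min(free_blocks_per_window.values())
    if len(request_ids) > min_free:
        # Oversized warmup batch: the caller treats None as "skip this batch".
        return None
    # Stand-in for delegating to the base KV cache manager.
    return list(request_ids)


# Window sizes and block counts below are made-up illustration values.
windows = {4096: 512, 1: 8}  # the second window is the constrained one
print(add_dummy_requests_guarded(list(range(8)), windows) is not None)  # True
print(add_dummy_requests_guarded(list(range(9)), windows))              # None
```

Taking the minimum across all windows, rather than checking only the full-attention window, is what lets the guard catch the case where the recurrent-states window runs out first.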

FMHA JIT Warmup Early-Exit

| Layer / File(s) | Summary |
| --- | --- |
| Attention warmup threshold guard `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Adds a guard in `_run_attention_warmup` that compares `batch_size * max_seq_len` against `256 * 16384`; if exceeded, logs an info message with the engine maxima and returns without running the JIT warmup enumeration. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14841: Modifies Mamba prefill/replay state-cache handling with per-slot ssm_states/conv_states zeroing in mamba2_mixer.py, directly analogous to the gdn_mixer.py changes here.
NVIDIA/TensorRT-LLM#15305: Modifies the TRTLLM-Gen FMHA JIT warmup grid candidate selection logic in TllmGenFmhaKernel, touching the same warmup path guarded by the new early-exit added here.

Suggested reviewers

yunruis
yuxianq
shaharmor98
yechank-nvidia

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title begins with the NVBugs ID and [fix] type but is incomplete and truncated with an ellipsis, making it impossible to determine the full scope of changes. | Complete the pull request title to clearly summarize all changes: include the main point about skipping FMHA warmup, and consider whether gdn_mixer and CppMamba fixes should be highlighted as they appear to be significant components. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description includes a clear summary, test plan results, and links to tracking, but lacks structured coverage of the required template sections like the Description section clarity, Test Coverage details, and PR Checklist completion. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


tensorrt-cicd added 2 commits June 18, 2026 11:42

tensorrt-cicd requested review from a team as code owners June 19, 2026 00:27

tensorrt-cicd assigned VALLIS-NERIA Jun 19, 2026

tensorrt-cicd requested review from syuoni and tomeras91 June 19, 2026 00:27

github-actions Bot assigned tensorrt-cicd Jun 19, 2026

tensorrt-cicd wants to merge 2 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6317600
