[https://nvbugs/6317600][fix] Add an early return at the head of _run_attention_warmup when…#15486

Open
tensorrt-cicd wants to merge 2 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6317600
tensorrt-cicd (Collaborator) commented Jun 19, 2026

Summary

  • Root cause: TRTLLM-Gen FMHA JIT warmup grid (batchSize × seqLenKv) sized by engine maxima exceeds the FlashInfer 2^32 TMA descriptor limit for Qwen3-Next-80B-A3B-Thinking tp4ep4 (2048 × 262144 = 5.4e8), hanging engine startup before any forward pass runs.
  • Fix: Add an early return at the head of _run_attention_warmup when self.batch_size * self.max_seq_len > 256 * 16384, the effective grid size before PR #15305 ("[https://nvbugs/6248837][fix] Densify trtllm-gen fmha warmup grid to catch missing kernels"). Un-warmed kernels JIT-compile lazily on the first real request, which is correct in all cases and only slightly slower for that first request.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes
    • Refined Mamba2 state cache initialization during prefill operations for improved stability
    • Added KV-pool capacity safeguard to prevent resource exhaustion failures
    • Optimized startup process to prevent hangs with large batch and sequence length configurations

… reqs by per-window free blocks

The Qwen3-Next-80B-A3B-Thinking tp4ep4 hang has two contributing pieces:

1. gdn_mixer.forward used boolean-mask indexing
   (state_indices_p[~has_initial_states_p]) on a CUDA bool tensor for the
   prefix-cache state-reset block. That forces a GPU->CPU sync per prefill
   step (twice per layer, for ssm_states and conv_states) so PyTorch can
   read the mask reduction count and allocate the output. Combined with
   TP=4 + EP=4 + the overlap scheduler, the variable per-rank latency of
   this sync was enough to desync subsequent TP/EP collectives and
   deadlock the forward pass mid-MMLU on Qwen3-Next-80B-A3B-Thinking.

   Replace mask indexing with the same pattern already used in
   mamba2_mixer.py: gate on the host-side use_initial_states flag and fall
   through to torch.where + index_copy_ for the mixed-batch case. Both
   paths preserve the original semantic (zero rows whose request has no
   prior mamba state, keep rows that resume from prefix cache) and have
   output shapes that do not depend on tensor contents, so no implicit
   CPU sync is introduced.

2. CppMambaHybridCacheManager.add_dummy_requests delegated straight to
   the base KVCacheManager without checking the recurrent-states window
   in the unified C++ KV pool. CudaGraphConfig.batch_sizes (with
   max_batch_size=720 and enable_padding=True) generates capture batches
   that can exceed the most-constrained window. _create_cuda_graph_warmup_request
   only checks get_num_free_blocks (full-attention window), so the
   recurrent-states window underflows and add_sequence_batch raises
   'No free block found' from the C++ side, leaving collectives in an
   incomplete state.

   Add an upfront guard over min(num_free_blocks_per_window_size.values())
   so oversized warmup batches return None, matching the
   'if requests is None: return None' contract expected by
   _create_cuda_graph_warmup_request in model_engine.py.
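
The guard described in item 2 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from mamba_cache_manager.py: the function name `admit_dummy_requests` and the dict standing in for `num_free_blocks_per_window_size` are hypothetical, and the real method goes on to delegate to the base KVCacheManager rather than returning the IDs.

```python
from typing import List, Mapping, Optional, Sequence


def admit_dummy_requests(
    request_ids: Sequence[int],
    free_blocks_per_window: Mapping[int, int],
) -> Optional[List[int]]:
    """Return the request IDs if every window can hold them, else None.

    Hypothetical stand-in for the upfront check: the batch is rejected when
    it is larger than the free-block count of the most-constrained window.
    """
    # The most-constrained window decides whether the batch fits.
    min_free = min(free_blocks_per_window.values())
    if len(request_ids) > min_free:
        # Caller contract: None means "skip this warmup batch".
        return None
    return list(request_ids)


# Assumed numbers for illustration: the full-attention window has plenty of
# room, but the recurrent-states window has only 512 free blocks.
free = {262144: 4096, 1: 512}
print(admit_dummy_requests(range(720), free))        # None: 720 > 512
print(len(admit_dummy_requests(range(256), free)))   # 256
```

Checking the minimum across all windows, rather than only the full-attention window, is what lets an oversized capture batch be skipped before add_sequence_batch can fail on the C++ side.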

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…ngine configs

TestQwen3NextThinking::test_auto_dtype[tp4ep4] hangs during engine startup
in _run_attention_warmup. The C++ TRTLLM-Gen FMHA JIT warmup enumerates a
(batchSize x seqLenKv) cartesian grid sized by engine maxima. For
Qwen3-Next-80B-A3B-Thinking with max_batch_size=2048 and max_seq_len=262144,
the densified grid pushes warmup TMA descriptor shapes past the flashinfer
2^32 limit and hangs engine startup.

Skip the warmup whenever the maxima product exceeds 256 * 16384
(the pre-PR NVIDIA#15305 effective grid size). Any kernel not pre-warmed will
JIT-compile lazily on its first real request - correct in all cases, only
slightly slower for that first request. Same approach as the sister fix
for GPT-OSS-120B (nvbugs/6316980 / nvbugs/6275959).
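
A rough sketch of the threshold check, using only the numbers quoted above. The helper name `should_skip_attention_warmup` and the constant name are hypothetical; the actual guard lives inline at the head of _run_attention_warmup and also logs the engine maxima.

```python
# Effective warmup grid size before PR #15305, as stated in the commit message.
WARMUP_GRID_LIMIT = 256 * 16384


def should_skip_attention_warmup(batch_size: int, max_seq_len: int,
                                 limit: int = WARMUP_GRID_LIMIT) -> bool:
    """Return True when the engine-maxima product exceeds the warmup limit."""
    return batch_size * max_seq_len > limit


# Qwen3-Next-80B-A3B-Thinking tp4ep4 maxima from the failing test.
product = 2048 * 262144
print(WARMUP_GRID_LIMIT)                            # 4194304 (2**22)
print(product)                                      # 536870912 (2**29)
print(product // WARMUP_GRID_LIMIT)                 # 128
print(should_skip_attention_warmup(2048, 262144))   # True
print(should_skip_attention_warmup(256, 16384))     # False
```

The comparison is strict, so an engine configured at exactly 256 × 16384 still runs the warmup.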

Verified passing: MMLU 85.79% (threshold 84.18%), GSM8K 85.10%
(threshold 78.37%), 1 passed in 534s on B200 tp4ep4.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
coderabbitai (Bot, Contributor) commented Jun 19, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8564371e-803d-4279-a887-0b73064610ba

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8b7af and 52ed7a1.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py

📝 Walkthrough

Three independent defensive fixes: (1) Mamba2 prefill state-cache initialization in Qwen3NextGatedDeltaNet.forward is reworked to use index_copy_ with torch.where instead of boolean-mask assignment; (2) add_dummy_requests gains a KV-pool free-block pre-check that returns None when batch size exceeds window capacity; (3) _run_attention_warmup adds an early-exit when batch_size * max_seq_len exceeds a fixed threshold.

Changes

Mamba State-Cache Init and Dummy-Request Guard

| Layer / File(s) | Summary |
| --- | --- |
| GDN mixer prefill state-cache initialization (tensorrt_llm/_torch/modules/mamba/gdn_mixer.py) | Replaces boolean-mask-based zeroing of ssm_states/conv_states with index_copy_ + torch.where, computing has_initial_states_p and reshaping the mask to broadcast across tensor dimensions. Unconditional zero-assignment is kept when use_initial_states is false. |
| add_dummy_requests KV-pool capacity pre-check (tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py) | Reads get_kv_cache_stats().num_free_blocks_per_window_size, computes the minimum free blocks across windows, and returns None early if len(request_ids) exceeds that minimum to prevent proceeding when the unified pool's recurrent-states window is too small. |
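
The gdn_mixer.py change summarized above can be illustrated with a small PyTorch sketch. It is an approximation under stated assumptions rather than the code in the PR: the function name `reset_rows_without_initial_state` is hypothetical, the tensors are tiny CPU stand-ins for ssm_states/conv_states, and `state_indices` is assumed to be an int64 tensor of unique cache slots.

```python
import torch


def reset_rows_without_initial_state(
    states: torch.Tensor,              # cache, shape (num_slots, ...)
    state_indices: torch.Tensor,       # int64, shape (N,), slots used by this batch
    has_initial_states: torch.Tensor,  # bool, shape (N,)
    use_initial_states: bool,          # host-side flag, no device read needed
) -> torch.Tensor:
    """Zero the rows whose request has no prior state, keep the others."""
    if not use_initial_states:
        # No request resumes from the prefix cache: zero every touched row.
        states[state_indices] = 0
        return states
    # Mixed batch: gather the touched rows. The output shape depends only on
    # len(state_indices), never on the mask contents.
    current = states.index_select(0, state_indices)
    # Reshape the mask to (N, 1, ..., 1) so it broadcasts over the state dims.
    keep = has_initial_states.view(-1, *([1] * (states.dim() - 1)))
    updated = torch.where(keep, current, torch.zeros_like(current))
    states.index_copy_(0, state_indices, updated)
    return states


states = torch.ones(6, 2, 3)
state_indices = torch.tensor([4, 1, 3])
has_initial_states = torch.tensor([True, False, True])
reset_rows_without_initial_state(states, state_indices, has_initial_states, True)
print(states[1].sum().item(), states[4].sum().item())  # 0.0 6.0
```

By contrast, `state_indices[~has_initial_states]` produces a tensor whose length depends on how many mask entries are true, which is why that form needs the element count on the host before it can allocate its output.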

FMHA JIT Warmup Early-Exit

| Layer / File(s) | Summary |
| --- | --- |
| Attention warmup threshold guard (tensorrt_llm/_torch/pyexecutor/model_engine.py) | Adds a guard in _run_attention_warmup that compares batch_size * max_seq_len against 256 * 16384; if exceeded, logs an info message with the engine maxima and returns without running the JIT warmup enumeration. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14841: Modifies Mamba prefill/replay state-cache handling with per-slot ssm_states/conv_states zeroing in mamba2_mixer.py, directly analogous to the gdn_mixer.py changes here.
  • NVIDIA/TensorRT-LLM#15305: Modifies the TRTLLM-Gen FMHA JIT warmup grid candidate selection logic in TllmGenFmhaKernel, touching the same warmup path guarded by the new early-exit added here.

Suggested reviewers

  • yunruis
  • yuxianq
  • shaharmor98
  • yechank-nvidia
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title begins with the NVBugs ID and [fix] type but is incomplete and truncated with an ellipsis, making it impossible to determine the full scope of changes. | Complete the pull request title to clearly summarize all changes: include the main point about skipping FMHA warmup, and consider whether gdn_mixer and CppMamba fixes should be highlighted as they appear to be significant components. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description includes a clear summary, test plan results, and links to tracking, but lacks structured coverage of the required template sections like the Description section clarity, Test Coverage details, and PR Checklist completion. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

