[https://nvbugs/6316980][fix] Added a runtime guard in FlashInferTrtllmGenAttention.is_supported using the… by tensorrt-cicd · Pull Request #15496 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-19T08:47:08Z

Summary

Root cause: FlashInferTrtllmGenAttention.is_supported has no engine-maxima guard for flashinfer's 2^32 per-dimension TMA limit. The runtime trtllm-gen forward path, which runs right after the (now-skipped) JIT warmup, therefore still calls buildNdTmaDescriptor with overflowing shapes for GPT-OSS-120B configs with max_batch_size=720 and max_seq_len=131072.
Fix: Added a runtime guard in FlashInferTrtllmGenAttention.is_supported using the same 256*16384 threshold as the JIT-warmup skip. Oversized configs now route through the legacy thop.attention path, whose C++ runtime does not use flashinfer-side TMA descriptors.
Automated fix generated by repair-bot
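
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `EngineMeta`, the standalone `is_supported` function, and `select_backend` are hypothetical stand-ins, and the `(True, None)` success value is an assumption. Only the threshold value, the `(False, <reason>)` return shape, and the fallback to thop.attention come from the PR text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Same value the PR uses for both the JIT-warmup skip and the runtime guard.
MAX_TMA_GUARD_THRESHOLD = 256 * 16384


@dataclass
class EngineMeta:
    """Hypothetical stand-in for the metadata object passed to is_supported()."""
    max_num_requests: int
    max_seq_len: int


def is_supported(meta: EngineMeta) -> Tuple[bool, Optional[str]]:
    """Return (True, None) when trtllm-gen may be used, else (False, reason).

    The (True, None) success value is an assumption made for this sketch.
    """
    if meta.max_num_requests * meta.max_seq_len > MAX_TMA_GUARD_THRESHOLD:
        return (
            False,
            f"engine maxima (max_num_requests={meta.max_num_requests}, "
            f"max_seq_len={meta.max_seq_len}) would overflow flashinfer "
            "2^32 TMA shape limit.",
        )
    return (True, None)


def select_backend(meta: EngineMeta) -> str:
    """Hypothetical dispatcher: fall back to the legacy path when unsupported."""
    supported, _reason = is_supported(meta)
    return "trtllm-gen" if supported else "thop.attention"


if __name__ == "__main__":
    # The failing configuration from the bug report.
    print(select_backend(EngineMeta(max_num_requests=720, max_seq_len=131072)))
    # A configuration exactly at the threshold is still accepted.
    print(select_backend(EngineMeta(max_num_requests=256, max_seq_len=16384)))
```

Because the comparison is strict (`>`), a product exactly equal to the threshold still uses trtllm-gen in this sketch.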

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6316980

Summary by CodeRabbit

Bug Fixes

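Added configuration validation that rejects the FlashInfer trtllm-gen attention backend when the product of the maximum number of requests and the maximum sequence length exceeds the guard threshold, preventing TMA shape overflow when large batch sizes are combined with long sequence lengths.
Engine initialization warmup now skips the TRTLLM-Gen FMHA JIT warmup for such oversized configurations, improving stability and reducing initialization time for large deployments.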

coderabbitai · 2026-06-19T08:51:11Z

Walkthrough

Adds a MAX_TMA_GUARD_THRESHOLD constant (256 * 16384) to FlashInferTrtllmGenAttention to cap the engine-size product that triggers TMA shape overflow. The threshold is enforced in is_supported() to reject oversized configurations and in _run_attention_warmup to skip TRTLLM-Gen FMHA JIT warmup when the product would exceed it.
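
To make the numbers concrete, the short script below works through the arithmetic for the threshold, the 2^32 per-dimension limit, and the failing configuration named in the PR. The variable names are illustrative. Note that the failing product is itself below 2^32, so the overflowing TMA dimension presumably includes further factors that the PR text does not spell out; that reading is an inference, not something the source states.

```python
# Per-dimension TMA shape limit cited in the PR (flashinfer buildNdTmaDescriptor).
TMA_DIM_LIMIT = 2**32

# Guard threshold introduced by the PR.
MAX_TMA_GUARD_THRESHOLD = 256 * 16384

# Failing configuration from the bug report (GPT-OSS-120B).
failing_product = 720 * 131072

print(MAX_TMA_GUARD_THRESHOLD)                     # 4194304 (2^22)
print(failing_product)                             # 94371840
print(failing_product > MAX_TMA_GUARD_THRESHOLD)   # True: the guard rejects it
print(failing_product / MAX_TMA_GUARD_THRESHOLD)   # 22.5 times the threshold
print(TMA_DIM_LIMIT // MAX_TMA_GUARD_THRESHOLD)    # 1024: headroom below 2^32
print(failing_product < TMA_DIM_LIMIT)             # True: product alone is under 2^32
```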

Changes

TMA overflow guard for FlashInfer trtllm-gen

| Layer / File(s) | Summary |
|---|---|
| `MAX_TMA_GUARD_THRESHOLD` constant and `is_supported()` guard: `tensorrt_llm/_torch/attention_backend/trtllm_gen.py` | Adds `MAX_TMA_GUARD_THRESHOLD = 256 * 16384` as a class-level constant with explanatory comments, and adds a check in `is_supported()` that returns `(False, <reason>)` when `meta.max_num_requests * meta.max_seq_len` exceeds the threshold. |
| JIT warmup skip in `_run_attention_warmup`: `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Imports `FlashInferTrtllmGenAttention` and adds an early-return pre-check in `_run_attention_warmup` that logs and skips TRTLLM-Gen FMHA JIT warmup when `self.batch_size * self.max_seq_len` exceeds `MAX_TMA_GUARD_THRESHOLD`. |
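
The early-return pre-check in `_run_attention_warmup` might look roughly like the sketch below. `EngineSketch`, the `warmup_ran` flag, and the log wording are hypothetical and exist only to make the control flow runnable; the real method lives in `model_engine.py` and reads the constant from `FlashInferTrtllmGenAttention`.

```python
import logging

logger = logging.getLogger(__name__)

# In the PR this is a class-level constant on FlashInferTrtllmGenAttention;
# it is a module-level name here only to keep the sketch self-contained.
MAX_TMA_GUARD_THRESHOLD = 256 * 16384


class EngineSketch:
    """Hypothetical stand-in for the model engine, reduced to the warmup check."""

    def __init__(self, batch_size: int, max_seq_len: int) -> None:
        self.batch_size = batch_size
        self.max_seq_len = max_seq_len
        self.warmup_ran = False  # illustrative flag, not part of the real engine

    def _run_attention_warmup(self) -> None:
        product = self.batch_size * self.max_seq_len
        if product > MAX_TMA_GUARD_THRESHOLD:
            # Log and return early so the JIT warmup never builds oversized shapes.
            logger.info(
                "Skipping TRTLLM-Gen FMHA JIT warmup: batch_size * max_seq_len "
                "= %d exceeds %d",
                product,
                MAX_TMA_GUARD_THRESHOLD,
            )
            return
        # Placeholder for the actual JIT warmup work.
        self.warmup_ran = True


if __name__ == "__main__":
    engine = EngineSketch(batch_size=720, max_seq_len=131072)
    engine._run_attention_warmup()
    print(engine.warmup_ran)
```

Using the same constant in both the warmup skip and `is_supported()` keeps the two paths in agreement on which configurations count as oversized, which is the point the second commit message makes.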

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title is incomplete and cut off mid-sentence, making it unclear what the actual fix is. It references a guard being added but doesn't finish the thought. | Complete the title with a clear summary, e.g., '[https://nvbugs/6316980][fix] Add runtime guard for flashinfer TMA 2^32 shape overflow in FlashInferTrtllmGenAttention' |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description includes a summary section explaining the root cause and fix clearly. However, it lacks detailed explanations in the Test Coverage section (only checkmarks without specifics) and doesn't address all PR Checklist items required by the template. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (2)

tensorrt_llm/_torch/attention_backend/trtllm_gen.py (2)

638-646: ⚡ Quick win

Clarify the error message to reference the actual threshold value.

The error message mentions "flashinfer 2^32 TMA shape limit" but the actual guard threshold is MAX_TMA_GUARD_THRESHOLD (256 × 16384 = 4,194,304), which is significantly smaller than 2³². While the underlying root cause is flashinfer's 2³² constraint, the immediate reason for rejection is exceeding the conservative threshold. Including the actual threshold in the message would help users and maintainers understand the specific limit being enforced.

📝 Proposed improvement to error message

         if meta.max_num_requests * meta.max_seq_len > self.MAX_TMA_GUARD_THRESHOLD:
             return (
                 False,
-                f"engine maxima (max_num_requests={meta.max_num_requests}, "
-                f"max_seq_len={meta.max_seq_len}) would overflow flashinfer 2^32 TMA shape limit.",
+                f"engine maxima product (max_num_requests={meta.max_num_requests} × "
+                f"max_seq_len={meta.max_seq_len} = {meta.max_num_requests * meta.max_seq_len}) "
+                f"exceeds threshold ({self.MAX_TMA_GUARD_THRESHOLD}) to prevent flashinfer TMA overflow.",
             )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/trtllm_gen.py` around lines 638 - 646,
The error message in the return statement of the condition checking against
MAX_TMA_GUARD_THRESHOLD currently only references flashinfer's 2^32 TMA shape
limit, but it should clarify the actual threshold value being enforced. Update
the error message string (in the f-string starting with "engine maxima") to
include the actual MAX_TMA_GUARD_THRESHOLD constant value so users understand
the specific conservative limit that triggered the rejection, not just the
underlying flashinfer constraint.

431-436: 💤 Low value

Consider clarifying why the threshold is 256×16384 rather than closer to 2³².

The comment mentions that flashinfer's buildNdTmaDescriptor enforces a 2³² (4,294,967,296) limit, but the actual threshold is 256 × 16384 = 4,194,304. While the PR context indicates this value matches the existing JIT-warmup threshold, future maintainers would benefit from a brief explanation of why this specific conservative value was chosen (e.g., empirical testing, safety margin, or other flashinfer internals).

📝 Suggested comment enhancement

-    # flashinfer's buildNdTmaDescriptor (kernelParams.h:598) enforces shapes[ii] <= 2^32.
-    # When max_batch_size * max_seq_len exceeds this, the K/V cache pool's TMA shape
-    # overflows on one rank and the rest hang on the next NCCL collective
+    # flashinfer's buildNdTmaDescriptor (kernelParams.h:598) enforces shapes[ii] <= 2^32.
+    # This conservative threshold (256 × 16384 = 4,194,304) prevents TMA shape overflow
+    # in the K/V cache pool when max_batch_size * max_seq_len exceeds it, avoiding
+    # one-rank abort + NCCL hang on subsequent collectives

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/trtllm_gen.py` around lines 431 - 436,
The constant MAX_TMA_GUARD_THRESHOLD is set to 256 * 16384 which is
significantly lower than the 2^32 limit mentioned in the comment. Enhance the
comment above MAX_TMA_GUARD_THRESHOLD to clarify why this specific conservative
threshold value was chosen instead of being closer to the actual 2^32 limit
enforced by flashinfer's buildNdTmaDescriptor, such as whether it comes from
empirical testing, a safety margin, alignment with the JIT-warmup threshold, or
other flashinfer internals. This will help future maintainers understand the
design decision.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c2bce37-fcf2-42d4-8cbc-3c6d0fade52c

📥 Commits

Reviewing files that changed from the base of the PR and between 7060827 and 5bb0569.

⛔ Files ignored due to path filters (1)

tests/integration/defs/examples/visual_gen/golden/visual_gen_lpips/visual_gen_lpips_golden_media.zip is excluded by !**/*.zip

📒 Files selected for processing (2)

tensorrt_llm/_torch/attention_backend/trtllm_gen.py
tensorrt_llm/_torch/pyexecutor/model_engine.py

…ngine configs Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

…ne configs

flashinfer's buildNdTmaDescriptor at kernelParams.h:598 enforces shapes[ii] <= 2^32 per TMA dim. The previous fix (commit 9a05ee8 / nvbugs/6275959) skipped only the C++ TRTLLM-Gen FMHA JIT warmup grid; the runtime trtllm-gen forward path that runs immediately after (inside the general warmup) is still unguarded. With max_batch_size=720 and max_seq_len=131072 on GPT-OSS-120B, the runtime decode call into flashinfer aborts on one rank and the rest hang on the next NCCL collective, producing the silent multi-hour hang we saw post-fix.

Add a runtime guard in FlashInferTrtllmGenAttention.is_supported that falls back to the legacy thop.attention path (C++ TRT-LLM runtime, no flashinfer TMA descriptors) when meta.max_num_requests * meta.max_seq_len exceeds the same 256*16384 threshold used by the JIT-warmup skip. The two paths now agree on what 'oversized' means and the legacy runtime takes over end-to-end for those configs.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

tensorrt-cicd requested review from a team as code owners June 19, 2026 08:47

tensorrt-cicd requested a review from QiJune June 19, 2026 08:47

github-actions Bot assigned tensorrt-cicd Jun 19, 2026

coderabbitai Bot reviewed Jun 19, 2026

tensorrt-cicd added 2 commits June 19, 2026 02:22

[nvbugs/6316980][fix] Skip TRTLLM-Gen FMHA JIT warmup for oversized e…

c5025b3

…ngine configs Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

tensorrt-cicd force-pushed the repair-bot-bug6316980 branch from 5bb0569 to 2f3e040 Compare June 19, 2026 09:23

tensorrt-cicd wants to merge 2 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6316980