[https://nvbugs/6317074][fix] One-line bump of free_gpu_memory_fraction from 0.6 to 0.8 in… by tensorrt-cicd · Pull Request #15477 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-18T13:13:42Z

Summary

Root cause: Ultra-550B TEP4 block_reuse test ran with free_gpu_memory_fraction=0.6, under-allocating KV cache and causing aggressive eviction that broke prefix-cache correctness, dropping GSM8K accuracy ~9 pt below threshold.
Fix: One-line bump of free_gpu_memory_fraction from 0.6 to 0.8 in TestNemotronV3Ultra.test_nvfp4_4gpus_block_reuse; only the test file staged.
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6317074

Summary by CodeRabbit

Tests
- Updated test configuration for GPU memory allocation optimization in inference testing scenarios.

coderabbitai · 2026-06-18T13:16:36Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 485175e0-4b3b-4d94-81e6-83d8d25d7b2d

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa232a and 11cdf22.

📒 Files selected for processing (1)

tests/integration/defs/accuracy/test_llm_api_pytorch.py

📝 Walkthrough

Walkthrough

In TestNemotronV3Ultra.test_nvfp4_4gpus_block_reuse, the KvCacheConfig parameter free_gpu_memory_fraction is increased from 0.6 to 0.8.

Changes

NVFP4 Block-Reuse Test KV Cache Configuration

Layer / File(s)	Summary
KvCacheConfig memory fraction update `tests/integration/defs/accuracy/test_llm_api_pytorch.py`	`free_gpu_memory_fraction` in `test_nvfp4_4gpus_block_reuse` is changed from `0.6` to `0.8`.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title references a specific NVBugs ticket and describes the main change: bumping free_gpu_memory_fraction from 0.6 to 0.8, which directly matches the code modification shown in the summary.
Description check	✅ Passed	The PR description clearly explains the root cause, the fix, test coverage, and includes relevant links, though it omits some template sections like explicit Test Coverage list.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…_reuse test The TEP4 variant (tp=4, ep=4, max_batch_size=32, enable_block_reuse=True) on the 550B Ultra model was failing GSM8K accuracy at 85.292 vs threshold 91.452 (~6 pt drop). Root cause: 'max_seq_len' was inferred from the model HF config to 262144. With tokens_per_block=32 that gives blocks_per_seq=8192, which is essentially equal to the per-window primary KV pool size (8224 blocks at free_gpu_memory_fraction=0.5). Block reuse on Nemotron-Ultra at this scale cannot retain prefix snapshots: a single in-flight sequence consumes the entire pool, every previously-cached prefix is evicted, and the Mamba snapshot pool (sized off the same primary budget per resource_manager.py:1820-1869 and the 'intercept = 0 when mamba_slope > 0' heuristic in mamba_cache_manager.py:1750-1770) is also starved. Block reuse is logically enabled but functionally inert, so multi-step generation drifts numerically and GSM8K drops ~6 points. GSM8K (MAX_INPUT_LEN=4096, MAX_OUTPUT_LEN=256) and MMLU (MAX_INPUT_LEN=4094) never need anywhere near 262144 tokens. Capping max_seq_len=8192 reduces blocks_per_seq from 8192 to 256, leaving room for ~32 sequences of cache plus the snapshot pool, so block_reuse becomes effective and GSM8K accuracy returns to spec. The same value is already used by analogous large-model tests in this file (TestLlama3_3_70BInstruct.test_fp8_tp4 and TestLlama4MaverickInstruct .test_auto_dtype with the comment 'Keep this low to avoid warmup OOM in CI'). Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

tensorrt-cicd requested a review from a team as a code owner June 18, 2026 13:13

tensorrt-cicd assigned VALLIS-NERIA Jun 18, 2026

github-actions Bot assigned tensorrt-cicd Jun 18, 2026

tensorrt-cicd force-pushed the repair-bot-bug6317074 branch from 11cdf22 to 0b1e9a3 Compare June 20, 2026 01:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/6317074][fix] One-line bump of free_gpu_memory_fraction from 0.6 to 0.8 in…#15477

[https://nvbugs/6317074][fix] One-line bump of free_gpu_memory_fraction from 0.6 to 0.8 in…#15477
tensorrt-cicd wants to merge 1 commit into
NVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6317074

tensorrt-cicd commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tensorrt-cicd commented Jun 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Links

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 18, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tensorrt-cicd commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading