add gptq fused kernel #1291

Open
sychen52 wants to merge 2 commits into NVIDIA:main from sychen52:gptq_triton

Conversation

Contributor

@sychen52 sychen52 commented Apr 17, 2026

What does this PR do?

Add a GPTQ fused kernel to improve speed. This fused kernel is not yet wired into the production code path.

Usage

See the unit test for example usage.

Testing

Added a unit test.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GPTQ fused kernel for optimized blockwise weight quantization
    • Introduced compute_fp4_scales public function for FP4 scale computation
  • Refactor

    • Consolidated FP4 quantization logic into shared utilities for improved reusability across kernels
  • Tests

    • Extended test coverage for GPTQ fused kernel operations
    • Added Triton availability checks to test infrastructure

Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
@sychen52 sychen52 requested a review from a team as a code owner April 17, 2026 23:06
@sychen52 sychen52 requested a review from Fridah-nv April 17, 2026 23:06
Contributor

coderabbitai bot commented Apr 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5f813bd9-fdaf-41a1-86fa-988ec7cd363b

📥 Commits

Reviewing files that changed from the base of the PR and between fc14872 and dfc7d4c.

📒 Files selected for processing (1)
  • tests/gpu/torch/quantization/test_gptq.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu/torch/quantization/test_gptq.py

📝 Walkthrough

Walkthrough

The changes consolidate NVFP4 (FP4) quantization logic into a new shared module, refactor the existing FP4 kernels to use the centralized quantization functions, introduce a new GPTQ fused kernel implementation for blockwise weight updates, and add test coverage with parameterized tests and benchmarking.
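For context, NVFP4 fake quantization rounds a scaled magnitude to the nearest E2M1-representable value. A pure-Python sketch follows; the real logic lives in the PR's Triton JIT helpers (fp4_round_magnitude, nvfp4_scalar_quant), and details such as midpoint tie-breaking are assumptions here, not taken from the kernel code.

```python
# Pure-Python sketch of NVFP4 (E2M1) fake quantization: scale, round the
# magnitude to the nearest representable value, then unscale. Tie-breaking
# at midpoints is an assumption; the Triton kernel uses threshold compares.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nvfp4_fake_quant_scalar(x: float, scale: float) -> float:
    """Fake-quantize a scalar to E2M1 under the given per-block scale."""
    if scale == 0.0:  # degenerate scale: everything maps to zero
        return 0.0
    mag = abs(x) / scale
    q = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))  # nearest value
    return (q if x >= 0.0 else -q) * scale
```

Note how values above 6x the scale saturate at the largest E2M1 magnitude, which is why scale computation (compute_fp4_scales) matters for accuracy.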

Changes

  • NVFP4 Quantization Core (modelopt/torch/quantization/triton/nvfp4_quant.py):
    New module introducing Triton JIT functions fp4_round_magnitude and nvfp4_scalar_quant that implement NVFP4 (E2M1) fake quantization with threshold-based magnitude rounding and degenerate-scale handling.
  • Existing FP4 Kernel Refactoring (modelopt/torch/quantization/triton/fp4_kernel.py, modelopt/torch/quantization/triton/fp4_kernel_hopper.py):
    Refactored to delegate FP4 quantization logic to shared utilities; fp4_kernel.py adds a compute_fp4_scales function and updates its __all__ exports, while both kernels replace inline quantization code with calls to nvfp4_scalar_quant and fp4_round_magnitude.
  • GPTQ Fused Kernel (modelopt/torch/quantization/triton/gptq_fused_kernel.py):
    New module implementing a Triton-jitted fused kernel for GPTQ blockwise weight updates with per-row program instances, scalar quantization via nvfp4_scalar_quant, and in-register error propagation; exports the gptq_fused_block_scalar wrapper.
  • Test Infrastructure & Coverage (tests/gpu/torch/quantization/conftest.py, tests/gpu/torch/quantization/test_gptq.py):
    Added a requires_triton pytest marker in conftest; introduced a parameterized test test_fused_vs_unfused_nvfp4 comparing fused and unfused GPTQ paths, plus a non-pytest benchmark bench_fused_nvfp4 for runtime measurement.
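The blockwise update that the fused kernel performs in-register can be sketched as a reference (unfused) NumPy loop. This is the standard GPTQ column sweep, not the kernel's actual code; in the PR, `quant` would be nvfp4_scalar_quant and the loop is fused per row on the GPU.

```python
import numpy as np

def gptq_block_update(w, h_inv, quant):
    """Reference GPTQ column loop over one block.

    Quantize each column in turn, then propagate the scaled quantization
    error into the not-yet-processed columns through the inverse-Hessian
    rows. `quant` is any columnwise quantizer.
    """
    w = w.copy()
    q = np.zeros_like(w)
    for j in range(w.shape[1]):
        q[:, j] = quant(w[:, j])
        err = (w[:, j] - q[:, j]) / h_inv[j, j]
        # The fused kernel keeps this update in registers instead of
        # writing the partially-updated block back to memory each step.
        w[:, j:] -= np.outer(err, h_inv[j, j:])
    return q
```

With an identity inverse Hessian the error term never reaches later columns, so the result reduces to plain columnwise quantization, which is a convenient sanity check.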

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 63.16%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the primary change: a new GPTQ fused kernel is introduced with supporting infrastructure for NVFP4 quantization.
  • Security Anti-Patterns: ✅ Passed. No security anti-patterns (torch.load without weights_only, numpy.load with allow_pickle, trust_remote_code, eval/exec, or # nosec comments) found in modified or new files.


Comment @coderabbitai help to get the list of available commands and usage tips.

@sychen52 sychen52 requested review from cjluo-nv and sugunav14 April 17, 2026 23:06
Collaborator

@cjluo-nv cjluo-nv left a comment


Good PR overall: clean deduplication of FP4 rounding logic into nvfp4_quant.py, well-structured fused GPTQ kernel, and solid correctness tests. A few issues to address:

  1. Missing __all__ in nvfp4_quant.py (minor but important): Since __init__.py does from .nvfp4_quant import *, internal Triton JIT helpers (fp4_round_magnitude, nvfp4_scalar_quant) leak into the public API. These are internal building blocks, not user-facing.

  2. Copyright year 2026 in both new files — should be 2025.

  3. Benchmark tests in CI: test_fused_nvfp4_benchmark is purely a performance test with no assertions. It prints results but can't fail for correctness. Consider marking it with a custom marker (e.g., @pytest.mark.benchmark) or adding a soft speedup assertion to make it more useful, or just being explicit that it's informational-only.
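The opt-in gating suggested in point 3 could look like the sketch below. The RUN_GPU_BENCH variable name is hypothetical; the PR ultimately took a different route (moving the benchmark behind an `if __name__ == "__main__"` guard).

```python
import os

import pytest

# Opt-in benchmark gate: skipped unless the (hypothetical) RUN_GPU_BENCH
# environment variable is set, so routine CI runs never execute it.
benchmark_only = pytest.mark.skipif(
    os.getenv("RUN_GPU_BENCH") != "1",
    reason="informational benchmark; set RUN_GPU_BENCH=1 to run",
)

@benchmark_only
def test_fused_nvfp4_benchmark():
    ...  # timing loop over the fused vs. unfused paths
```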

Comment thread modelopt/torch/quantization/triton/nvfp4_quant.py
Comment thread modelopt/torch/quantization/triton/gptq_fused_kernel.py
Comment thread modelopt/torch/quantization/triton/nvfp4_quant.py
Comment thread modelopt/torch/quantization/triton/gptq_fused_kernel.py
Comment thread tests/gpu/torch/quantization/test_gptq.py Outdated
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
modelopt/torch/quantization/triton/__init__.py (1)

36-36: Avoid wildcard-importing nvfp4_quant without an explicit export list.

nvfp4_quant.py does not define __all__, so this also re-exports its implementation imports (triton, tl, libdevice) from modelopt.torch.quantization.triton. Either add __all__ = ["fp4_round_magnitude", "nvfp4_scalar_quant"] in the new module or import those two names explicitly here.
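The leak is easy to reproduce with a throwaway module (a standalone demonstration of Python's wildcard-import semantics, not modelopt code):

```python
import sys
import types

# Fake module mirroring nvfp4_quant.py's shape: one implementation import
# aliased like Triton's `tl`, one public function, and no __all__.
mod = types.ModuleType("nvfp4_demo")
exec("import math as tl\ndef nvfp4_scalar_quant(x):\n    return x\n", mod.__dict__)
sys.modules["nvfp4_demo"] = mod

ns = {}
exec("from nvfp4_demo import *", ns)
assert "tl" in ns  # the implementation alias leaks through the wildcard

# Declaring an explicit export list stops the leak:
mod.__all__ = ["nvfp4_scalar_quant"]
ns = {}
exec("from nvfp4_demo import *", ns)
assert "tl" not in ns and "nvfp4_scalar_quant" in ns
```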

tests/gpu/torch/quantization/test_gptq.py (1)

265-269: Initialize hessian on CUDA to avoid unnecessary device transfers.

The current code initializes hessian on CPU and then calls hessian.to("cuda") after update_hessian. While this works because update_hessian explicitly handles device transfers via .to(hessian.device), initializing hessian directly on CUDA with device=inp.device avoids the redundant transfer and simplifies the code.

Suggested fix
-    hessian = torch.zeros(dim, dim, dtype=torch.float32)
+    hessian = torch.zeros(dim, dim, device=inp.device, dtype=torch.float32)
     hessian, _ = update_hessian(inp, hessian, 0)
-    hessian = hessian.to("cuda")
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 387-454: The test test_fused_nvfp4_benchmark currently runs 6
heavy configs defined by _NVFP4_BENCH_CONFIGS/_NVFP4_BENCH_IDS and should not
run in normal GPU CI; either (A) gate it behind an opt-in marker/env var by
adding a decorator like `@pytest.mark.skipif`(os.getenv("RUN_GPU_BENCH") != "1",
reason="benchmark tests disabled by default") or `@pytest.mark.benchmark` and
document how to opt-in, or (B) reduce it to a tiny smoke case by replacing
_NVFP4_BENCH_CONFIGS with a single small config (eg (16,128,256,512)) and keep
the rest of the code, making sure to update _NVFP4_BENCH_IDS and the
pytest.mark.parametrize call for test_fused_nvfp4_benchmark accordingly.
- Around line 28-29: The module-level import of compute_fp4_scales causes pytest
collection to fail when Triton is unavailable; move the import into the function
that uses it (/_make_nvfp4_test_data/) or wrap it with an availability check
similar to how gptq_fused_block_scalar is handled so the import only runs when
requires_triton applies; update _make_nvfp4_test_data to import
compute_fp4_scales locally (or perform a conditional import and raise/skip
accordingly) so tests can be collected without Triton present.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ebaef65c-34a9-46a3-a419-43885b2cefb0

📥 Commits

Reviewing files that changed from the base of the PR and between dc7ad66 and fc14872.

📒 Files selected for processing (7)
  • modelopt/torch/quantization/triton/__init__.py
  • modelopt/torch/quantization/triton/fp4_kernel.py
  • modelopt/torch/quantization/triton/fp4_kernel_hopper.py
  • modelopt/torch/quantization/triton/gptq_fused_kernel.py
  • modelopt/torch/quantization/triton/nvfp4_quant.py
  • tests/gpu/torch/quantization/conftest.py
  • tests/gpu/torch/quantization/test_gptq.py

Comment thread tests/gpu/torch/quantization/test_gptq.py Outdated
Comment thread tests/gpu/torch/quantization/test_gptq.py Outdated

codecov bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 54.54545% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.14%. Comparing base (e4b054b) to head (dfc7d4c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...opt/torch/quantization/triton/gptq_fused_kernel.py 39.39% 20 Missing ⚠️
modelopt/torch/quantization/triton/nvfp4_quant.py 46.66% 8 Missing ⚠️
modelopt/torch/quantization/triton/fp4_kernel.py 93.75% 1 Missing ⚠️
...opt/torch/quantization/triton/fp4_kernel_hopper.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1291      +/-   ##
==========================================
+ Coverage   72.54%   76.14%   +3.60%     
==========================================
  Files         459      461       +2     
  Lines       48649    48746      +97     
==========================================
+ Hits        35290    37117    +1827     
+ Misses      13359    11629    -1730     
Flag Coverage Δ
examples 40.53% <16.66%> (+1.15%) ⬆️
gpu 59.96% <54.54%> (+7.64%) ⬆️
unit 52.12% <0.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
@sychen52 sychen52 requested a review from cjluo-nv April 18, 2026 00:22
Collaborator

@cjluo-nv cjluo-nv left a comment


Re-review of PR #1291: add gptq fused kernel

Addressed (critical/important):

  • ✅ Module-level compute_fp4_scales import moved inside function (was breaking pytest collection without Triton)
  • ✅ Benchmark test moved out of pytest into if __name__ == "__main__" guard
  • ✅ Wildcard import of nvfp4_quant removed from __init__.py
  • ✅ @requires_triton skip marker properly applied to fused kernel tests

Still outstanding (minor):

  • ❌ Copyright year 2026 in nvfp4_quant.py and gptq_fused_kernel.py — author argued "this year is 2026" but it's 2025
  • ❌ Hessian initialized on CPU then moved to CUDA (minor nitpick, not blocking)

Correctness assessment: The fused kernel logic is correct — the per-column GPTQ error propagation matches the unfused reference, the nvfp4_scalar_quant deduplication is clean, and the test (test_fused_vs_unfused_nvfp4) provides meaningful correctness assertions with torch.testing.assert_close. The PR is well-structured at ~490 lines with good code reuse of FP4 rounding logic.

Comment thread modelopt/torch/quantization/triton/gptq_fused_kernel.py
scale = tl.load(scales_base + (block_start + col) // quant_block_size)

w_scalar = tl.sum(tl.where(j_range == col, w_full, 0.0))
q_scalar = tl.sum(
Collaborator


Nit: When col >= n_cols, the kernel still iterates and reads from scales_base + (block_start + col) // quant_block_size. In the current usage BLOCK_SIZE == n_cols always (the caller passes w[:, bs:be]), so this is fine. But if someone calls gptq_fused_block_scalar with n_cols < block_size in the future, this would read past the valid scale range. Consider adding an `if col >= n_cols: break`-equivalent guard (or a comment documenting that n_cols == BLOCK_SIZE is assumed).
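A NumPy stand-in for the guarded loop (illustrative only; in Triton the equivalent would be a load mask or early exit, and the function below is hypothetical, not part of the PR):

```python
import numpy as np

def load_block_scales(scales, block_start, block_size, n_cols, quant_block_size):
    """Gather per-column scales for one block, guarding against col >= n_cols."""
    out = np.zeros(block_size, dtype=scales.dtype)
    for col in range(block_size):
        if col >= n_cols:  # the suggested guard: stop before indexing
            break          # past the valid scale range
        out[col] = scales[(block_start + col) // quant_block_size]
    return out
```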
