add gptq fused kernel #1291

Open
sychen52 wants to merge 2 commits into NVIDIA:main from sychen52:gptq_triton

Conversation

Contributor

@sychen52 sychen52 commented Apr 17, 2026

What does this PR do?

Add a GPTQ fused kernel to improve speed. This fused kernel is not yet wired into the production code path.

Usage

See the unit test for example usage.

Testing

Added a unit test.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added GPTQ fused kernel for optimized blockwise weight quantization
    • Introduced compute_fp4_scales public function for FP4 scale computation
  • Refactor

    • Consolidated FP4 quantization logic into shared utilities for improved reusability across kernels
  • Tests

    • Extended test coverage for GPTQ fused kernel operations
    • Added Triton availability checks to test infrastructure

Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
@sychen52 sychen52 requested a review from a team as a code owner April 17, 2026 23:06
@sychen52 sychen52 requested a review from Fridah-nv April 17, 2026 23:06
Contributor

coderabbitai bot commented Apr 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5f813bd9-fdaf-41a1-86fa-988ec7cd363b

📥 Commits

Reviewing files that changed from the base of the PR and between fc14872 and dfc7d4c.

📒 Files selected for processing (1)
  • tests/gpu/torch/quantization/test_gptq.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/gpu/torch/quantization/test_gptq.py

📝 Walkthrough

Walkthrough

The changes consolidate NVFP4 (FP4) quantization logic into a new shared module, refactor the existing FP4 kernels to use the centralized quantization functions, introduce a new GPTQ fused kernel implementation for blockwise weight updates, and add test coverage with parameterized tests and benchmarking.
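For context, NVFP4 fake quantization rounds a scaled magnitude to the nearest E2M1-representable value. A pure-Python sketch follows; the real logic lives in the PR's Triton JIT helpers (fp4_round_magnitude, nvfp4_scalar_quant), and details such as midpoint tie-breaking are assumptions here, not taken from the kernel code.

```python
# Pure-Python sketch of NVFP4 (E2M1) fake quantization: scale, round the
# magnitude to the nearest representable value, then unscale. Tie-breaking
# at midpoints is an assumption; the Triton kernel uses threshold compares.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nvfp4_fake_quant_scalar(x: float, scale: float) -> float:
    """Fake-quantize a scalar to E2M1 under the given per-block scale."""
    if scale == 0.0:  # degenerate scale: everything maps to zero
        return 0.0
    mag = abs(x) / scale
    q = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))  # nearest value
    return (q if x >= 0.0 else -q) * scale
```

Note how values above 6x the scale saturate at the largest E2M1 magnitude, which is why scale computation (compute_fp4_scales) matters for accuracy.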

Changes

  • NVFP4 Quantization Core (modelopt/torch/quantization/triton/nvfp4_quant.py):
    New module introducing Triton JIT functions fp4_round_magnitude and nvfp4_scalar_quant that implement NVFP4 (E2M1) fake quantization with threshold-based magnitude rounding and degenerate-scale handling.
  • Existing FP4 Kernel Refactoring (modelopt/torch/quantization/triton/fp4_kernel.py, modelopt/torch/quantization/triton/fp4_kernel_hopper.py):
    Refactored to delegate FP4 quantization logic to shared utilities; fp4_kernel.py adds a compute_fp4_scales function and updates its __all__ exports, while both kernels replace inline quantization code with calls to nvfp4_scalar_quant and fp4_round_magnitude.
  • GPTQ Fused Kernel (modelopt/torch/quantization/triton/gptq_fused_kernel.py):
    New module implementing a Triton-jitted fused kernel for GPTQ blockwise weight updates with per-row program instances, scalar quantization via nvfp4_scalar_quant, and in-register error propagation; exports the gptq_fused_block_scalar wrapper.
  • Test Infrastructure & Coverage (tests/gpu/torch/quantization/conftest.py, tests/gpu/torch/quantization/test_gptq.py):
    Added a requires_triton pytest marker in conftest; introduced a parameterized test test_fused_vs_unfused_nvfp4 comparing fused and unfused GPTQ paths, plus a non-pytest benchmark bench_fused_nvfp4 for runtime measurement.
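The blockwise update that the fused kernel performs in-register can be sketched as a reference (unfused) NumPy loop. This is the standard GPTQ column sweep, not the kernel's actual code; in the PR, `quant` would be nvfp4_scalar_quant and the loop is fused per row on the GPU.

```python
import numpy as np

def gptq_block_update(w, h_inv, quant):
    """Reference GPTQ column loop over one block.

    Quantize each column in turn, then propagate the scaled quantization
    error into the not-yet-processed columns through the inverse-Hessian
    rows. `quant` is any columnwise quantizer.
    """
    w = w.copy()
    q = np.zeros_like(w)
    for j in range(w.shape[1]):
        q[:, j] = quant(w[:, j])
        err = (w[:, j] - q[:, j]) / h_inv[j, j]
        # The fused kernel keeps this update in registers instead of
        # writing the partially-updated block back to memory each step.
        w[:, j:] -= np.outer(err, h_inv[j, j:])
    return q
```

With an identity inverse Hessian the error term never reaches later columns, so the result reduces to plain columnwise quantization, which is a convenient sanity check.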

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 63.16%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the primary change: a new GPTQ fused kernel is introduced with supporting infrastructure for NVFP4 quantization.
  • Security Anti-Patterns: ✅ Passed. No security anti-patterns (torch.load without weights_only, numpy.load with allow_pickle, trust_remote_code, eval/exec, or # nosec comments) found in modified or new files.


Comment @coderabbitai help to get the list of available commands and usage tips.

@sychen52 sychen52 requested review from cjluo-nv and sugunav14 April 17, 2026 23:06
Collaborator

@cjluo-nv cjluo-nv left a comment


Good PR overall: clean deduplication of FP4 rounding logic into nvfp4_quant.py, well-structured fused GPTQ kernel, and solid correctness tests. A few issues to address:

  1. Missing __all__ in nvfp4_quant.py (minor but important): Since __init__.py does from .nvfp4_quant import *, internal Triton JIT helpers (fp4_round_magnitude, nvfp4_scalar_quant) leak into the public API. These are internal building blocks, not user-facing.

  2. Copyright year 2026 in both new files — should be 2025.

  3. Benchmark tests in CI: test_fused_nvfp4_benchmark is purely a performance test with no assertions. It prints results but can't fail for correctness. Consider marking it with a custom marker (e.g., @pytest.mark.benchmark) or adding a soft speedup assertion to make it more useful, or just being explicit that it's informational-only.
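The opt-in gating suggested in point 3 could look like the sketch below. The RUN_GPU_BENCH variable name is hypothetical; the PR ultimately took a different route (moving the benchmark behind an `if __name__ == "__main__"` guard).

```python
import os

import pytest

# Opt-in benchmark gate: skipped unless the (hypothetical) RUN_GPU_BENCH
# environment variable is set, so routine CI runs never execute it.
benchmark_only = pytest.mark.skipif(
    os.getenv("RUN_GPU_BENCH") != "1",
    reason="informational benchmark; set RUN_GPU_BENCH=1 to run",
)

@benchmark_only
def test_fused_nvfp4_benchmark():
    ...  # timing loop over the fused vs. unfused paths
```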

Comment thread modelopt/torch/quantization/triton/nvfp4_quant.py
Comment thread modelopt/torch/quantization/triton/gptq_fused_kernel.py
Comment thread modelopt/torch/quantization/triton/nvfp4_quant.py
Comment thread modelopt/torch/quantization/triton/gptq_fused_kernel.py
Comment thread tests/gpu/torch/quantization/test_gptq.py Outdated
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
modelopt/torch/quantization/triton/__init__.py (1)

36-36: Avoid wildcard-importing nvfp4_quant without an explicit export list.

nvfp4_quant.py does not define __all__, so this also re-exports its implementation imports (triton, tl, libdevice) from modelopt.torch.quantization.triton. Either add __all__ = ["fp4_round_magnitude", "nvfp4_scalar_quant"] in the new module or import those two names explicitly here.
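The leak is easy to reproduce with a throwaway module (a standalone demonstration of Python's wildcard-import semantics, not modelopt code):

```python
import sys
import types

# Fake module mirroring nvfp4_quant.py's shape: one implementation import
# aliased like Triton's `tl`, one public function, and no __all__.
mod = types.ModuleType("nvfp4_demo")
exec("import math as tl\ndef nvfp4_scalar_quant(x):\n    return x\n", mod.__dict__)
sys.modules["nvfp4_demo"] = mod

ns = {}
exec("from nvfp4_demo import *", ns)
assert "tl" in ns  # the implementation alias leaks through the wildcard

# Declaring an explicit export list stops the leak:
mod.__all__ = ["nvfp4_scalar_quant"]
ns = {}
exec("from nvfp4_demo import *", ns)
assert "tl" not in ns and "nvfp4_scalar_quant" in ns
```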

tests/gpu/torch/quantization/test_gptq.py (1)

265-269: Initialize hessian on CUDA to avoid unnecessary device transfers.

The current code initializes hessian on CPU and then calls hessian.to("cuda") after update_hessian. While this works because update_hessian explicitly handles device transfers via .to(hessian.device), initializing hessian directly on CUDA with device=inp.device avoids the redundant transfer and simplifies the code.

Suggested fix
-    hessian = torch.zeros(dim, dim, dtype=torch.float32)
+    hessian = torch.zeros(dim, dim, device=inp.device, dtype=torch.float32)
     hessian, _ = update_hessian(inp, hessian, 0)
-    hessian = hessian.to("cuda")
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around line 387-454: The test test_fused_nvfp4_benchmark currently runs 6
heavy configs defined by _NVFP4_BENCH_CONFIGS/_NVFP4_BENCH_IDS and should not
run in normal GPU CI; either (A) gate it behind an opt-in marker/env var by
adding a decorator like `@pytest.mark.skipif`(os.getenv("RUN_GPU_BENCH") != "1",
reason="benchmark tests disabled by default") or `@pytest.mark.benchmark` and
document how to opt-in, or (B) reduce it to a tiny smoke case by replacing
_NVFP4_BENCH_CONFIGS with a single small config (eg (16,128,256,512)) and keep
the rest of the code, making sure to update _NVFP4_BENCH_IDS and the
pytest.mark.parametrize call for test_fused_nvfp4_benchmark accordingly.
- Around line 28-29: The module-level import of compute_fp4_scales causes pytest
collection to fail when Triton is unavailable; move the import into the function
that uses it (/_make_nvfp4_test_data/) or wrap it with an availability check
similar to how gptq_fused_block_scalar is handled so the import only runs when
requires_triton applies; update _make_nvfp4_test_data to import
compute_fp4_scales locally (or perform a conditional import and raise/skip
accordingly) so tests can be collected without Triton present.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ebaef65c-34a9-46a3-a419-43885b2cefb0

📥 Commits

Reviewing files that changed from the base of the PR and between dc7ad66 and fc14872.

📒 Files selected for processing (7)
  • modelopt/torch/quantization/triton/__init__.py
  • modelopt/torch/quantization/triton/fp4_kernel.py
  • modelopt/torch/quantization/triton/fp4_kernel_hopper.py
  • modelopt/torch/quantization/triton/gptq_fused_kernel.py
  • modelopt/torch/quantization/triton/nvfp4_quant.py
  • tests/gpu/torch/quantization/conftest.py
  • tests/gpu/torch/quantization/test_gptq.py

Comment thread tests/gpu/torch/quantization/test_gptq.py Outdated
Comment thread tests/gpu/torch/quantization/test_gptq.py Outdated

codecov bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 54.54545% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.14%. Comparing base (e4b054b) to head (dfc7d4c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...opt/torch/quantization/triton/gptq_fused_kernel.py 39.39% 20 Missing ⚠️
modelopt/torch/quantization/triton/nvfp4_quant.py 46.66% 8 Missing ⚠️
modelopt/torch/quantization/triton/fp4_kernel.py 93.75% 1 Missing ⚠️
...opt/torch/quantization/triton/fp4_kernel_hopper.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1291      +/-   ##
==========================================
+ Coverage   72.54%   76.14%   +3.60%     
==========================================
  Files         459      461       +2     
  Lines       48649    48746      +97     
==========================================
+ Hits        35290    37117    +1827     
+ Misses      13359    11629    -1730     
Flag Coverage Δ
examples 40.53% <16.66%> (+1.15%) ⬆️
gpu 59.96% <54.54%> (+7.64%) ⬆️
unit 52.12% <0.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
@sychen52 sychen52 requested a review from cjluo-nv April 18, 2026 00:22
Collaborator

@cjluo-nv cjluo-nv left a comment


Re-review of PR #1291: add gptq fused kernel

Addressed (critical/important):

  • ✅ Module-level compute_fp4_scales import moved inside function (was breaking pytest collection without Triton)
  • ✅ Benchmark test moved out of pytest into if __name__ == "__main__" guard
  • ✅ Wildcard import of nvfp4_quant removed from __init__.py
  • ✅ @requires_triton skip marker properly applied to fused kernel tests

Still outstanding (minor):

  • ❌ Copyright year 2026 in nvfp4_quant.py and gptq_fused_kernel.py — author argued "this year is 2026" but it's 2025
  • ❌ Hessian initialized on CPU then moved to CUDA (minor nitpick, not blocking)

Correctness assessment: The fused kernel logic is correct — the per-column GPTQ error propagation matches the unfused reference, the nvfp4_scalar_quant deduplication is clean, and the test (test_fused_vs_unfused_nvfp4) provides meaningful correctness assertions with torch.testing.assert_close. The PR is well-structured at ~490 lines with good code reuse of FP4 rounding logic.

Comment thread modelopt/torch/quantization/triton/gptq_fused_kernel.py
scale = tl.load(scales_base + (block_start + col) // quant_block_size)

w_scalar = tl.sum(tl.where(j_range == col, w_full, 0.0))
q_scalar = tl.sum(
Collaborator


Nit: When col >= n_cols, the kernel still iterates and reads from scales_base + (block_start + col) // quant_block_size. In the current usage BLOCK_SIZE == n_cols always (the caller passes w[:, bs:be]), so this is fine. But if someone calls gptq_fused_block_scalar with n_cols < block_size in the future, this would read past the valid scale range. Consider adding an `if col >= n_cols: break`-equivalent guard (or a comment documenting that n_cols == BLOCK_SIZE is assumed).
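A NumPy stand-in for the guarded loop (illustrative only; in Triton the equivalent would be a load mask or early exit, and the function below is hypothetical, not part of the PR):

```python
import numpy as np

def load_block_scales(scales, block_start, block_size, n_cols, quant_block_size):
    """Gather per-column scales for one block, guarding against col >= n_cols."""
    out = np.zeros(block_size, dtype=scales.dtype)
    for col in range(block_size):
        if col >= n_cols:  # the suggested guard: stop before indexing
            break          # past the valid scale range
        out[col] = scales[(block_start + col) // quant_block_size]
    return out
```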
