Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

The changes consolidate NVFP4 FP4 quantization logic into a new shared module, refactor the existing FP4 kernels to use the centralized quantization functions, introduce a new GPTQ fused kernel implementation for blockwise weight updates, and add comprehensive test coverage with parameterized testing and benchmarking.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
cjluo-nv
left a comment
Good PR overall: clean deduplication of FP4 rounding logic into nvfp4_quant.py, well-structured fused GPTQ kernel, and solid correctness tests. A few issues to address:
- **Missing `__all__` in `nvfp4_quant.py`** (minor but important): Since `__init__.py` does `from .nvfp4_quant import *`, internal Triton JIT helpers (`fp4_round_magnitude`, `nvfp4_scalar_quant`) leak into the public API. These are internal building blocks, not user-facing.
- **Copyright year 2026** in both new files: should be 2025.
- **Benchmark tests in CI:** `test_fused_nvfp4_benchmark` is purely a performance test with no assertions. It prints results but can't fail for correctness. Consider marking it with a custom marker (e.g., `@pytest.mark.benchmark`), adding a soft speedup assertion to make it more useful, or being explicit that it's informational-only.
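The missing-`__all__` point above can be illustrated with a minimal, self-contained demo of Python's `*`-import semantics. The module body below is a stand-in, not the real `nvfp4_quant.py`:

```python
import sys
import types

# Stand-in for nvfp4_quant.py: an implementation import plus two public
# helpers. Declaring __all__ limits what "from ... import *" re-exports.
mod = types.ModuleType("nvfp4_quant_demo")
exec(
    "import math  # stands in for the real triton/tl/libdevice imports\n"
    "__all__ = ['fp4_round_magnitude', 'nvfp4_scalar_quant']\n"
    "def fp4_round_magnitude(x):\n    return x\n"
    "def nvfp4_scalar_quant(x):\n    return x\n",
    mod.__dict__,
)
sys.modules["nvfp4_quant_demo"] = mod

# Simulate __init__.py doing "from .nvfp4_quant import *".
ns = {}
exec("from nvfp4_quant_demo import *", ns)
public = sorted(k for k in ns if not k.startswith("__"))
print(public)  # ['fp4_round_magnitude', 'nvfp4_scalar_quant'] -- no 'math'
```

Without the `__all__` line, `math` (standing in for `triton`, `tl`, `libdevice`) would also appear in `public`, which is exactly the leak the review flags.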
Actionable comments posted: 2
🧹 Nitpick comments (2)
`modelopt/torch/quantization/triton/__init__.py` (1)

Line 36: Avoid wildcard-importing `nvfp4_quant` without an explicit export list. `nvfp4_quant.py` does not define `__all__`, so this also re-exports its implementation imports (`triton`, `tl`, `libdevice`) from `modelopt.torch.quantization.triton`. Either add `__all__ = ["fp4_round_magnitude", "nvfp4_scalar_quant"]` in the new module or import those two names explicitly here.

`tests/gpu/torch/quantization/test_gptq.py` (1)
Lines 265-269: Initialize `hessian` on CUDA to avoid unnecessary device transfers. The current code initializes `hessian` on CPU and then calls `hessian.to("cuda")` after `update_hessian`. While this works because `update_hessian` explicitly handles device transfers via `.to(hessian.device)`, initializing `hessian` directly on CUDA with `device=inp.device` avoids the redundant transfer and simplifies the code.

Suggested fix:

```diff
- hessian = torch.zeros(dim, dim, dtype=torch.float32)
+ hessian = torch.zeros(dim, dim, device=inp.device, dtype=torch.float32)
  hessian, _ = update_hessian(inp, hessian, 0)
- hessian = hessian.to("cuda")
```
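The suggested allocation can be sketched in isolation. This assumes `torch` is installed; on a CUDA machine `inp` would live on the GPU, and the accumulator would follow it automatically:

```python
import torch

dim = 8
inp = torch.randn(4, dim)  # on "cuda" in the real test; CPU here without a GPU
# Allocate the Hessian accumulator where the activations already live,
# so no later hessian.to("cuda") copy is needed.
hessian = torch.zeros(dim, dim, device=inp.device, dtype=torch.float32)
print(hessian.device == inp.device)  # True
```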
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/gpu/torch/quantization/test_gptq.py`:
- Around lines 387-454: The test `test_fused_nvfp4_benchmark` currently runs 6 heavy configs defined by `_NVFP4_BENCH_CONFIGS`/`_NVFP4_BENCH_IDS` and should not run in normal GPU CI; either (A) gate it behind an opt-in marker/env var by adding a decorator like `@pytest.mark.skipif(os.getenv("RUN_GPU_BENCH") != "1", reason="benchmark tests disabled by default")` or `@pytest.mark.benchmark` and document how to opt in, or (B) reduce it to a tiny smoke case by replacing `_NVFP4_BENCH_CONFIGS` with a single small config (e.g. `(16, 128, 256, 512)`) and keep the rest of the code, making sure to update `_NVFP4_BENCH_IDS` and the `pytest.mark.parametrize` call for `test_fused_nvfp4_benchmark` accordingly.
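Option (A) could look like the following sketch. The `RUN_GPU_BENCH` variable name comes from the comment itself; the test body is elided and stands in for the real timing loop:

```python
import os

import pytest


@pytest.mark.skipif(
    os.getenv("RUN_GPU_BENCH") != "1",
    reason="benchmark tests disabled by default; set RUN_GPU_BENCH=1 to run",
)
def test_fused_nvfp4_benchmark():
    # ... run the fused vs. unfused timing loop and print results ...
    pass
```

With this gate, normal CI collects the test but skips it, and a developer can opt in locally with `RUN_GPU_BENCH=1 pytest tests/gpu/torch/quantization/test_gptq.py`.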
- Around lines 28-29: The module-level import of `compute_fp4_scales` causes pytest collection to fail when Triton is unavailable; move the import into the function that uses it (`_make_nvfp4_test_data`) or wrap it with an availability check similar to how `gptq_fused_block_scalar` is handled, so the import only runs when `requires_triton` applies; update `_make_nvfp4_test_data` to import `compute_fp4_scales` locally (or perform a conditional import and raise/skip accordingly) so tests can be collected without Triton present.
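The lazy-import pattern this comment asks for can be sketched as below. Here `importlib` probing for `triton` stands in for the real `from modelopt.torch.quantization.triton import compute_fp4_scales`, so the demo runs whether or not Triton is installed:

```python
import importlib


def _get_fp4_helper():
    """Resolve the Triton-backed helper lazily; return None if unavailable.

    In the real test file this would import compute_fp4_scales inside
    _make_nvfp4_test_data (or skip via pytest) instead of at module level.
    """
    try:
        return importlib.import_module("triton")  # fails cleanly if absent
    except ImportError:
        return None


# Module import (and therefore pytest collection) succeeds either way;
# only the test body that actually needs Triton would skip or fail.
available = _get_fp4_helper() is not None
print(type(available).__name__)  # bool -- collection never crashed
```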
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ebaef65c-34a9-46a3-a419-43885b2cefb0
📒 Files selected for processing (7)
- `modelopt/torch/quantization/triton/__init__.py`
- `modelopt/torch/quantization/triton/fp4_kernel.py`
- `modelopt/torch/quantization/triton/fp4_kernel_hopper.py`
- `modelopt/torch/quantization/triton/gptq_fused_kernel.py`
- `modelopt/torch/quantization/triton/nvfp4_quant.py`
- `tests/gpu/torch/quantization/conftest.py`
- `tests/gpu/torch/quantization/test_gptq.py`
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1291      +/-   ##
==========================================
+ Coverage   72.54%   76.14%   +3.60%
==========================================
  Files         459      461       +2
  Lines       48649    48746      +97
==========================================
+ Hits        35290    37117    +1827
+ Misses      13359    11629    -1730
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
cjluo-nv
left a comment
Re-review of PR #1291: add gptq fused kernel
Addressed (critical/important):
- ✅ Module-level `compute_fp4_scales` import moved inside function (was breaking pytest collection without Triton)
- ✅ Benchmark test moved out of pytest into an `if __name__ == "__main__"` guard
- ✅ Wildcard import of `nvfp4_quant` removed from `__init__.py`
- ✅ `@requires_triton` skip marker properly applied to fused kernel tests

Still outstanding (minor):

- ❌ Copyright year 2026 in `nvfp4_quant.py` and `gptq_fused_kernel.py`; the author argued "this year is 2026" but it's 2025
- ❌ Hessian initialized on CPU then moved to CUDA (minor nitpick, not blocking)
Correctness assessment: The fused kernel logic is correct: the per-column GPTQ error propagation matches the unfused reference, the `nvfp4_scalar_quant` deduplication is clean, and the test (`test_fused_vs_unfused_nvfp4`) provides meaningful correctness assertions with `torch.testing.assert_close`. The PR is well-structured at ~490 lines with good code reuse of the FP4 rounding logic.
Quoted kernel code under review:

```python
scale = tl.load(scales_base + (block_start + col) // quant_block_size)

w_scalar = tl.sum(tl.where(j_range == col, w_full, 0.0))
q_scalar = tl.sum(
```

Nit: When `col >= n_cols`, the kernel still iterates and reads from `scales_base + (block_start + col) // quant_block_size`. In the current usage `BLOCK_SIZE == n_cols` always (the caller passes `w[:, bs:be]`), so this is fine. But if someone calls `gptq_fused_block_scalar` with `n_cols < block_size` in the future, this would read past the valid scale range. Consider adding an `if col >= n_cols: break`-equivalent guard (or a comment documenting that `n_cols == BLOCK_SIZE` is assumed).
What does this PR do?
Add gptq fused kernel to improve speed. This fused kernel is not yet used in the production code path.
Usage
Check the unit test.
Testing
Added a unit test.
Before your PR is "Ready for review"

- Make sure you read and follow the Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
Release Notes
New Features
- `compute_fp4_scales` public function for FP4 scale computation

Refactor

Tests