Conversation
CodeRabbit: Review skipped (draft detected). Configuration used: `.coderabbit.yaml`; review profile: CHILL; plan: Pro Plus.
Force-pushed from fc7619e to ff39c79 (Compare)

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report

❌ Patch coverage is

```diff
@@            Coverage Diff             @@
##             main    #1217      +/-   ##
==========================================
- Coverage   76.91%   76.68%   -0.24%
==========================================
  Files         350      355       +5
  Lines       40481    41414     +933
==========================================
+ Hits        31137    31758     +621
- Misses       9344     9656     +312
```

Flags with carried-forward coverage won't be shown.
Force-pushed from 7b0bb08 to 9b41b8e (Compare)

Signed-off-by: Kai Xu <kaix@nvidia.com>

Force-pushed from 9b41b8e to 4b44815 (Compare)
```diff
 raise RuntimeError(
     "MoE calibration incomplete: some experts received no tokens during calibration. "
-    "Increase --calib-size to ensure all experts see calibration data."
+    "Increase --calib-size to ensure all experts see calibration data.",
```

Maybe do these linting fixes in a separate PR?
```python
    **kwargs,
):
    # Capture detached CPU copies before quantizers touch them.
    helper.collected_Q.append(query_states.detach().cpu())
```

Why move to CPU? Couldn't these tensors be large?
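For readers following the thread, the detach-and-offload pattern being discussed can be sketched as below. The `CaptureHelper` class and its wiring are a hypothetical reconstruction for illustration, not the PR's actual code:

```python
import torch

class CaptureHelper:
    """Accumulates detached CPU copies of activations during calibration."""

    def __init__(self):
        self.collected_Q = []

    def capture(self, query_states: torch.Tensor) -> None:
        # .detach() drops autograd history so the graph is not kept alive;
        # .cpu() moves the copy off the accelerator so a long calibration run
        # does not accumulate activations in device memory.
        self.collected_Q.append(query_states.detach().cpu())
```

The trade-off the reviewer raises is real: the host-side list still grows with every captured tensor, so very long calibration runs may need chunked processing or on-the-fly statistics instead of retaining full copies.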
```python
)

# Full quantisation.
W_hat, rate, nmse, Z_h, gamma_h = watersic_quantize(
```

So to clarify, you only quantize K, not V? Can it be applied to V also?
rohansjoshi left a comment:

Really cool feature! LGTM, left a few comments. How many bits can you compress to?
```python
    ``base * 4 ** (-max(0, target_rate - knee))``
    """
    return base * 4.0 ** (-max(0.0, target_rate - knee))
```

Hi @kaix-nv, why do we have 4 ** (-...) specifically? Also, knee seems to give piecewise behavior; is there a recommended value to use for it?
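To make the schedule being asked about concrete, here is a standalone sketch of the formula from the snippet above (the function name and the example `base`/`knee` values are assumptions for illustration). One plausible reading of the factor 4: quantizer distortion classically scales as 2^(-2R), i.e. a factor of 4 per bit, so the schedule compensates each extra bit of target rate beyond the knee with a 4x smaller multiplier:

```python
def decay(base: float, target_rate: float, knee: float) -> float:
    # Flat at `base` while target_rate <= knee; beyond the knee the
    # multiplier shrinks by 4x for every additional bit of target rate.
    return base * 4.0 ** (-max(0.0, target_rate - knee))

# With hypothetical values base=1.0, knee=2.0:
#   target_rate <= 2.0 -> 1.0 (flat region before the knee)
#   target_rate == 3.0 -> 0.25
#   target_rate == 4.0 -> 0.0625
```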
```python
    """Achieved coding rate (bits per element)."""


def _compute_importance_weights(P: Tensor, importance_clip: float = 50.0) -> Tensor:
```

Is this importance_clip of 50 the same for all layers, or should it be specific to each layer? Different layers may have different attention distributions; I'd expect the initial layers to be more uniform, with the differences widening in later layers. Just curious whether you considered making this layer-specific.
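For context on what a clip like this does, here is a minimal sketch consistent with the signature above. The body of `_compute_importance_weights` is not shown in the diff, so the normalization choice here is an assumption, not the PR's implementation:

```python
import torch

def compute_importance_weights(P: torch.Tensor, importance_clip: float = 50.0) -> torch.Tensor:
    # Normalize so the average weight is 1, then clip large outliers so a few
    # heavily-attended columns cannot dominate the downstream rate allocation.
    w = P / P.mean().clamp_min(1e-12)
    return w.clamp(max=importance_clip)
```

Under this reading, the reviewer's question amounts to whether the outlier threshold should track each layer's attention entropy rather than a single global constant.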
What does this PR do?
Type of change: New feature
WaterSIC is an information-theoretically near-optimal quantization algorithm (Lifar et al., 2026) that uses the waterfilling principle for per-column rate allocation combined with Successive Interference Cancellation (ZSIC) and Huffman entropy coding. This PR adds KV-cache quantization for HF models.
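The waterfilling principle mentioned above can be sketched as a standalone allocation routine. This is an illustrative reconstruction under simplified assumptions (per-column Gaussian sources, no importance weighting): WaterSIC's actual allocator also folds in importance weights, SIC, and entropy coding:

```python
import numpy as np

def waterfill_bits(variances: np.ndarray, total_bits: float, iters: int = 60) -> np.ndarray:
    # Reverse waterfilling: column i gets b_i = max(0, 0.5 * log2(var_i / lam)),
    # with the water level lam chosen by bisection so that sum(b_i) == total_bits.
    lo, hi = 1e-12, float(variances.max())
    for _ in range(iters):
        lam = (lo * hi) ** 0.5  # geometric midpoint: lam spans orders of magnitude
        bits = np.maximum(0.0, 0.5 * np.log2(variances / lam))
        if bits.sum() > total_bits:
            lo = lam  # too many bits allocated: raise the water level
        else:
            hi = lam
    return np.maximum(0.0, 0.5 * np.log2(variances / hi))
```

High-variance columns receive more bits, and columns whose variance falls below the water level receive none, which is exactly the per-column rate skew a uniform-bit quantizer cannot express.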
Usage
Testing
`python examples/watersic_kv_cache/kv_cache_real_model_plots.py` — captures real post-RoPE Q, K tensors from Qwen3-8B (layers 1, 12, 24, 35) and plots rate vs KL divergence for all 5 methods.
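The rate-vs-KL plot presumably measures how much quantizing K perturbs the attention distribution for real queries. A minimal sketch of such a metric is below; the script's actual implementation is not shown, so the function and its exact averaging are illustrative assumptions:

```python
import numpy as np

def attention_kl(q: np.ndarray, k: np.ndarray, k_hat: np.ndarray) -> float:
    """Mean KL(P || P_hat) over query rows, where P = softmax(q k^T / sqrt(d))."""
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    d = q.shape[-1]
    P = softmax(q @ k.T / np.sqrt(d))        # attention with original keys
    P_hat = softmax(q @ k_hat.T / np.sqrt(d))  # attention with quantized keys
    return float((P * np.log(P / P_hat)).sum(axis=-1).mean())
```

Plotting this KL against the achieved coding rate for each method gives the rate-distortion-style comparison the testing note describes.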
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information