Conversation
CodeRabbit: Review skipped (draft detected). Configuration used: `.coderabbit.yaml`; review profile: CHILL; plan: Pro Plus.
Force-pushed from fc7619e to ff39c79 (Compare)

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Codecov Report

❌ Patch coverage is

```diff
@@            Coverage Diff             @@
##             main    #1217      +/-   ##
==========================================
- Coverage   76.91%   76.68%   -0.24%
==========================================
  Files         350      355       +5
  Lines       40481    41414     +933
==========================================
+ Hits        31137    31758     +621
- Misses       9344     9656     +312
```

Flags with carried-forward coverage won't be shown.
Force-pushed from 7b0bb08 to 9b41b8e (Compare)

Signed-off-by: Kai Xu <kaix@nvidia.com>

Force-pushed from 9b41b8e to 4b44815 (Compare)
```diff
 raise RuntimeError(
     "MoE calibration incomplete: some experts received no tokens during calibration. "
-    "Increase --calib-size to ensure all experts see calibration data."
+    "Increase --calib-size to ensure all experts see calibration data.",
```

Maybe do these linting fixes in a separate PR?
```python
    **kwargs,
):
    # Capture detached CPU copies before quantizers touch them.
    helper.collected_Q.append(query_states.detach().cpu())
```

Why move to CPU? Couldn't these tensors be large?
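For readers following the thread, the detach-and-offload pattern being discussed can be sketched as below. The `CaptureHelper` class and its wiring are a hypothetical reconstruction for illustration, not the PR's actual code:

```python
import torch

class CaptureHelper:
    """Accumulates detached CPU copies of activations during calibration."""

    def __init__(self):
        self.collected_Q = []

    def capture(self, query_states: torch.Tensor) -> None:
        # .detach() drops autograd history so the graph is not kept alive;
        # .cpu() moves the copy off the accelerator so a long calibration run
        # does not accumulate activations in device memory.
        self.collected_Q.append(query_states.detach().cpu())
```

The trade-off the reviewer raises is real: the host-side list still grows with every captured tensor, so very long calibration runs may need chunked processing or on-the-fly statistics instead of retaining full copies.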
```python
)

# Full quantisation.
W_hat, rate, nmse, Z_h, gamma_h = watersic_quantize(
```

So to clarify, you only quantize K, not V? Can it be applied to V also?
rohansjoshi left a comment:

Really cool feature! LGTM, left a few comments. How many bits can you compress to?
```python
    ``base * 4 ** (-max(0, target_rate - knee))``
    """
    return base * 4.0 ** (-max(0.0, target_rate - knee))
```

Hi @kaix-nv, why do we have 4 ** (-...) specifically? Also, knee seems to give piecewise behavior; is there a recommended value to use for it?
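To make the schedule being asked about concrete, here is a standalone sketch of the formula from the snippet above (the function name and the example `base`/`knee` values are assumptions for illustration). One plausible reading of the factor 4: quantizer distortion classically scales as 2^(-2R), i.e. a factor of 4 per bit, so the schedule compensates each extra bit of target rate beyond the knee with a 4x smaller multiplier:

```python
def decay(base: float, target_rate: float, knee: float) -> float:
    # Flat at `base` while target_rate <= knee; beyond the knee the
    # multiplier shrinks by 4x for every additional bit of target rate.
    return base * 4.0 ** (-max(0.0, target_rate - knee))

# With hypothetical values base=1.0, knee=2.0:
#   target_rate <= 2.0 -> 1.0 (flat region before the knee)
#   target_rate == 3.0 -> 0.25
#   target_rate == 4.0 -> 0.0625
```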
```python
    """Achieved coding rate (bits per element)."""


def _compute_importance_weights(P: Tensor, importance_clip: float = 50.0) -> Tensor:
```

Is this importance_clip of 50 the same for all layers, or should it be specific to each layer? Different layers may have different attention distributions; I'd expect the initial layers to be more uniform, with the differences widening in later layers. Just curious whether you considered making this layer-specific.
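For context on what a clip like this does, here is a minimal sketch consistent with the signature above. The body of `_compute_importance_weights` is not shown in the diff, so the normalization choice here is an assumption, not the PR's implementation:

```python
import torch

def compute_importance_weights(P: torch.Tensor, importance_clip: float = 50.0) -> torch.Tensor:
    # Normalize so the average weight is 1, then clip large outliers so a few
    # heavily-attended columns cannot dominate the downstream rate allocation.
    w = P / P.mean().clamp_min(1e-12)
    return w.clamp(max=importance_clip)
```

Under this reading, the reviewer's question amounts to whether the outlier threshold should track each layer's attention entropy rather than a single global constant.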
What does this PR do?
Type of change: New feature
WaterSIC is an information-theoretically near-optimal quantization algorithm (Lifar et al., 2026) that uses the waterfilling principle for per-column rate allocation combined with Successive Interference Cancellation (ZSIC) and Huffman entropy coding. This PR adds KV-cache quantization for HF models.
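The waterfilling principle mentioned above can be sketched as a standalone allocation routine. This is an illustrative reconstruction under simplified assumptions (per-column Gaussian sources, no importance weighting): WaterSIC's actual allocator also folds in importance weights, SIC, and entropy coding:

```python
import numpy as np

def waterfill_bits(variances: np.ndarray, total_bits: float, iters: int = 60) -> np.ndarray:
    # Reverse waterfilling: column i gets b_i = max(0, 0.5 * log2(var_i / lam)),
    # with the water level lam chosen by bisection so that sum(b_i) == total_bits.
    lo, hi = 1e-12, float(variances.max())
    for _ in range(iters):
        lam = (lo * hi) ** 0.5  # geometric midpoint: lam spans orders of magnitude
        bits = np.maximum(0.0, 0.5 * np.log2(variances / lam))
        if bits.sum() > total_bits:
            lo = lam  # too many bits allocated: raise the water level
        else:
            hi = lam
    return np.maximum(0.0, 0.5 * np.log2(variances / hi))
```

High-variance columns receive more bits, and columns whose variance falls below the water level receive none, which is exactly the per-column rate skew a uniform-bit quantizer cannot express.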
Usage
Testing
`python examples/watersic_kv_cache/kv_cache_real_model_plots.py` — captures real post-RoPE Q, K tensors from Qwen3-8B (layers 1, 12, 24, 35) and plots rate vs KL divergence for all 5 methods.
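The rate-vs-KL plot presumably measures how much quantizing K perturbs the attention distribution for real queries. A minimal sketch of such a metric is below; the script's actual implementation is not shown, so the function and its exact averaging are illustrative assumptions:

```python
import numpy as np

def attention_kl(q: np.ndarray, k: np.ndarray, k_hat: np.ndarray) -> float:
    """Mean KL(P || P_hat) over query rows, where P = softmax(q k^T / sqrt(d))."""
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    d = q.shape[-1]
    P = softmax(q @ k.T / np.sqrt(d))        # attention with original keys
    P_hat = softmax(q @ k_hat.T / np.sqrt(d))  # attention with quantized keys
    return float((P * np.log(P / P_hat)).sum(axis=-1).mean())
```

Plotting this KL against the achieved coding rate for each method gives the rate-distortion-style comparison the testing note describes.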
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information