[None][feat] support per-layer mixed-precision MoE serving (GLM, Qwen3-MoE)#15485

Open
joshua-hill wants to merge 1 commit into NVIDIA:main from joshua-hill:feat/per-layer-mixed-precision-moe

@joshua-hill joshua-hill commented Jun 19, 2026

What

Enable serving per-layer mixed-precision MoE checkpoints (e.g. NVFP4 + FP8 across layers) for GLM-4.x and Qwen3-MoE.

Why

Today a MIXED_PRECISION MoE checkpoint fails to serve:

  • GLM asserts MIXED_PRECISION is ambiguous in Glm4DecoderLayer.
  • The fused MoE otherwise receives the global quant config, so under MIXED_PRECISION it allocates the experts as bf16 and crashes in load_expert_w3_w1_weight with a shape mismatch.

How

  • Pass the per-layer routed-experts quant config to the MoE. ConfigurableMoE already propagates override_quant_config to the backend before create_weights, so the buffers are then allocated in the correct per-layer format.
  • _get_experts_quant_config derives the block format from the per-expert keys ModelOpt exports (...mlp.experts.0.gate_proj) when a fused ...mlp.experts key is absent.
  • Drop the MIXED_PRECISION assert in GLM; disable eager fusion under MIXED_PRECISION.
  • Same one-line wiring for Qwen3MoE.

Constraint unchanged: a layer's experts block uses a single format (fused-kernel requirement); formats may differ across layers.
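
The lookup order described above (fused block key, then per-expert keys, then the global config) can be sketched in plain Python. This is a minimal illustration, not the code in this PR: quant configs are stand-in strings, `resolve_experts_quant_config` and the `model.layers.N` prefix are hypothetical names, the `up_proj` / `down_proj` suffixes are assumed by analogy with `gate_proj`, and raising on mixed formats inside one experts block is an assumption about how the single-format constraint could be enforced.

```python
from typing import Dict, Optional

# Assumed projection suffixes; only gate_proj is named in the PR description.
PROJ_SUFFIXES = ("gate_proj", "up_proj", "down_proj")


def resolve_experts_quant_config(
    experts_prefix: str,
    quant_config_dict: Optional[Dict[str, str]],
    global_config: str,
) -> str:
    """Hypothetical resolver: return the quant format for one layer's experts block."""
    # No per-layer dict at all: use the global config.
    if not quant_config_dict:
        return global_config

    # Stage 1: a fused "...mlp.experts" key is present.
    if experts_prefix in quant_config_dict:
        return quant_config_dict[experts_prefix]

    # Stage 2: scan per-expert keys such as "...mlp.experts.0.gate_proj".
    # The trailing dot keeps layer 3 from matching keys of layer 30.
    per_expert_prefix = experts_prefix + "."
    formats = {
        cfg
        for key, cfg in quant_config_dict.items()
        if key.startswith(per_expert_prefix) and key.endswith(PROJ_SUFFIXES)
    }
    if len(formats) > 1:
        # Assumption: a fused kernel needs one format per experts block.
        raise ValueError(f"mixed formats in {experts_prefix}: {sorted(formats)}")
    if formats:
        return formats.pop()

    # Stage 3: nothing for this layer, fall back to the global config.
    return global_config


# Hypothetical checkpoint dict: layer 3 has per-expert keys, layer 4 a fused key.
example = {
    "model.layers.3.self_attn.q_proj": "FP8",
    "model.layers.3.mlp.experts.0.gate_proj": "NVFP4",
    "model.layers.3.mlp.experts.1.down_proj": "NVFP4",
    "model.layers.4.mlp.experts": "FP8",
}
layer3 = resolve_experts_quant_config("model.layers.3.mlp.experts", example, "MIXED_PRECISION")
layer4 = resolve_experts_quant_config("model.layers.4.mlp.experts", example, "MIXED_PRECISION")
layer5 = resolve_experts_quant_config("model.layers.5.mlp.experts", example, "MIXED_PRECISION")
```

In this sketch the attention key for layer 3 is ignored because it does not sit under the experts prefix, and layer 5 falls through to the global value because the dict holds no keys for it.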

Test

  • Added tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py (pure-Python: per-expert-key fallback / bare-key / global-fallback, for both GLM and Qwen3-MoE).
  • Validated end-to-end on the equivalent code in a 1.3.0rc8 build: tiny synthetic GLM-4-MoE loads/runs (log-prob cosine 0.9999 vs the fake-quant reference; uniform-NVFP4 output bit-identical to baseline), and a real Qwen3-30B-A3B mixed NVFP4+FP8 checkpoint serves on a single GPU with coherent output at ~232 tok/s decode.
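
For readers unfamiliar with the metric, the log-prob cosine figure quoted above is presumably a cosine similarity between two log-probability vectors. The PR does not say how it was computed, so the following is only a generic sketch of that formula with a hypothetical helper name and made-up inputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up log-probs for illustration only (not measurements from this PR).
reference_logprobs = [-0.10, -2.30, -4.60]
candidate_logprobs = [-0.11, -2.28, -4.63]
similarity = cosine_similarity(reference_logprobs, candidate_logprobs)
```

A value close to 1.0 means the two vectors point in nearly the same direction; it does not by itself bound the per-token error.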

Summary by CodeRabbit

Release Notes

  • Improvements

    • Enhanced quantization configuration handling for mixture-of-experts models in mixed-precision checkpoints across GLM-4 and Qwen3 architectures.
  • Tests

    • Added unit tests for expert quantization configuration resolution.

…3-MoE)

GLM-4.x / Qwen3-MoE asserted against (or fell back to the global) MIXED_PRECISION
quant config for the fused MoE, so a per-layer mixed-precision (e.g. NVFP4 + FP8)
checkpoint either errored ("MIXED_PRECISION is ambiguous") or loaded the experts as
bf16 and crashed in weight loading.

Pass the per-layer routed-experts quant config to the MoE (ConfigurableMoE already
propagates override_quant_config to the backend), and derive that config from the
per-expert keys ModelOpt exports when a fused "...mlp.experts" key is absent. Drop
the MIXED_PRECISION assert in GLM; disable eager fusion under MIXED_PRECISION.

A layer's experts block still uses one format (fused-kernel requirement); formats
may vary across layers. Adds a unit test for the per-layer experts quant-config
resolution.

Signed-off-by: Joshua Hill <jhdhill@uwaterloo.ca>
@joshua-hill joshua-hill requested review from a team as code owners June 19, 2026 00:17
@joshua-hill

Author

/bot run

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 990bfb5d-4a83-461e-ab18-02c40da901ce

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8b7af and 0eda8cf.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_glm.py
  • tensorrt_llm/_torch/models/modeling_qwen3_moe.py
  • tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py

Walkthrough

Two MoE model implementations (GLM-4 and Qwen3-MoE) gain a _get_experts_quant_config helper that resolves per-layer expert quantization configuration from mixed-precision checkpoint dicts via a two-stage lookup (block key → per-expert projection key scan → global fallback). The resolved config is passed as override_quant_config into MoE expert construction. Unit tests cover both implementations.

Changes

Mixed-precision MoE expert quant config resolution

| Layer / File(s) | Summary |
| --- | --- |
| Expert quant config resolution helpers and MoE wiring: tensorrt_llm/_torch/models/modeling_glm.py, tensorrt_llm/_torch/models/modeling_qwen3_moe.py | Glm4MoE._get_experts_quant_config is updated to a two-stage lookup: first checks the direct block key in quant_config_dict, then scans per-expert keys ending in gate/up/down projections, and falls back to the global quant config. A parallel module-level _get_experts_quant_config is added to modeling_qwen3_moe.py with identical logic. In Glm4DecoderLayer.__init__, quant_config and experts_quant_config are now computed separately; is_nvfp4 is derived from the general config with an added MIXED_PRECISION assertion, and experts_quant_config is passed to Glm4MoE. In Qwen3MoE.__init__, create_moe receives the derived experts config via override_quant_config. |
| Parametrized unit tests: tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py | Adds a pytest module parametrized over both expert quant config getters, covering: global fallback when no dict is present, bare experts block key match, per-expert key derivation while ignoring attention keys, and global fallback when only other-layer keys exist. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: enabling per-layer mixed-precision MoE serving for GLM and Qwen3-MoE models, which directly aligns with the changeset's core purpose. |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation with 'What', 'Why', 'How', and 'Test' sections, covering the issue, solution, and validation. It addresses the template's core requirements despite not following the exact template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

