[None][feat] support per-layer mixed-precision MoE serving (GLM, Qwen3-MoE) by joshua-hill · Pull Request #15485 · NVIDIA/TensorRT-LLM

joshua-hill · 2026-06-19T00:17:13Z

What

Enable serving per-layer mixed-precision MoE checkpoints (e.g. NVFP4 + FP8 across layers) for GLM-4.x and Qwen3-MoE.

Why

Today a MIXED_PRECISION MoE checkpoint fails to serve:

- GLM fails an assertion in Glm4DecoderLayer ("MIXED_PRECISION is ambiguous").
- The fused MoE otherwise receives the global quant config, so under MIXED_PRECISION it allocates the experts as bf16 and crashes in load_expert_w3_w1_weight with a shape mismatch.

How

- Pass the per-layer routed-experts quant config to the MoE. ConfigurableMoE already propagates override_quant_config to the backend before create_weights, so the buffers are then allocated in the correct per-layer format.
- _get_experts_quant_config derives the block format from the per-expert keys ModelOpt exports (...mlp.experts.0.gate_proj) when a fused ...mlp.experts key is absent.
- Drop the MIXED_PRECISION assert in GLM; disable eager fusion under MIXED_PRECISION.
- Same one-line wiring for Qwen3MoE.
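The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `QuantConfigStub` is a stand-in dataclass, and the function signature, the `model.layers.<i>` key prefix, and the exact projection suffixes are assumptions. Only the order (fused block key, then per-expert projection keys, then global fallback) is taken from the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class QuantConfigStub:
    """Stand-in for the real quant config; only the algorithm name is modeled."""
    quant_algo: Optional[str] = None


# Projection suffixes assumed for per-expert keys (hypothetical list).
_EXPERT_PROJ_SUFFIXES = ("gate_proj", "up_proj", "down_proj")


def get_experts_quant_config(
    global_config: QuantConfigStub,
    quant_config_dict: Optional[Dict[str, QuantConfigStub]],
    layer_prefix: str,
) -> QuantConfigStub:
    """Resolve the quant config for one layer's routed-experts block."""
    if not quant_config_dict:
        return global_config

    block_key = f"{layer_prefix}.mlp.experts"

    # Stage 1: a fused "...mlp.experts" key, if the checkpoint has one.
    if block_key in quant_config_dict:
        return quant_config_dict[block_key]

    # Stage 2: per-expert keys such as "...mlp.experts.0.gate_proj".
    for key, cfg in quant_config_dict.items():
        if key.startswith(block_key + ".") and key.endswith(_EXPERT_PROJ_SUFFIXES):
            return cfg

    # Stage 3: nothing specific to this layer's experts, use the global config.
    return global_config


nvfp4, fp8 = QuantConfigStub("NVFP4"), QuantConfigStub("FP8")
mixed = QuantConfigStub("MIXED_PRECISION")
per_layer = {
    "model.layers.0.self_attn.q_proj": fp8,           # attention key, ignored
    "model.layers.0.mlp.experts.0.gate_proj": nvfp4,  # per-expert key
    "model.layers.1.mlp.experts": fp8,                # fused block key
}
for i in range(3):
    cfg = get_experts_quant_config(mixed, per_layer, f"model.layers.{i}")
    print(i, cfg.quant_algo)
```

In this sketch, layer 0 resolves through the per-expert key, layer 1 through the fused block key, and layer 2 falls back to the global config because the dict has no experts entry for it.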

Constraint unchanged: a layer's experts block uses a single format (fused-kernel requirement); formats may differ across layers.
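One way to check this constraint on a checkpoint's per-key quantization map before serving is sketched below. The helper names, the flat `key -> algorithm name` mapping, and the `model.layers.<i>` key layout are hypothetical and chosen for illustration; the PR does not describe such a validator.

```python
from typing import Dict, Set


def experts_formats_by_layer(algo_by_key: Dict[str, str]) -> Dict[str, Set[str]]:
    """Group the quant algorithms seen under each layer's '.mlp.experts' keys."""
    formats: Dict[str, Set[str]] = {}
    for key, algo in algo_by_key.items():
        layer_prefix, sep, _ = key.partition(".mlp.experts")
        if not sep:
            continue  # not a routed-experts key (e.g. attention)
        formats.setdefault(layer_prefix, set()).add(algo)
    return formats


def check_single_format_per_layer(algo_by_key: Dict[str, str]) -> None:
    """Raise if any layer's experts block mixes more than one format."""
    for layer, algos in experts_formats_by_layer(algo_by_key).items():
        if len(algos) > 1:
            raise ValueError(
                f"{layer}: experts block mixes formats {sorted(algos)}")


# Formats differ across layers but are uniform within each layer: accepted.
ok = {
    "model.layers.0.self_attn.q_proj": "FP8",
    "model.layers.0.mlp.experts.0.gate_proj": "NVFP4",
    "model.layers.0.mlp.experts.1.gate_proj": "NVFP4",
    "model.layers.1.mlp.experts.0.gate_proj": "FP8",
}
check_single_format_per_layer(ok)

# Two formats inside one layer's experts block: rejected.
bad = {
    "model.layers.0.mlp.experts.0.gate_proj": "NVFP4",
    "model.layers.0.mlp.experts.1.gate_proj": "FP8",
}
try:
    check_single_format_per_layer(bad)
except ValueError as err:
    print(err)
```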

Test

- Added tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py (pure-Python: per-expert-key fallback / bare-key / global-fallback, for both GLM and Qwen3-MoE).
- Validated end-to-end on the equivalent code in a 1.3.0rc8 build: tiny synthetic GLM-4-MoE loads/runs (log-prob cosine 0.9999 vs the fake-quant reference; uniform-NVFP4 output bit-identical to baseline), and a real Qwen3-30B-A3B mixed NVFP4+FP8 checkpoint serves on a single GPU with coherent output at ~232 tok/s decode.
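The description does not say how the log-prob cosine was computed. One plausible way to compare two log-probability vectors is sketched below; the function name and the sample values are illustrative placeholders, not data from this PR.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder log-probs for the same tokens from two runs (made-up numbers).
reference_logprobs = [-0.12, -2.31, -0.05, -1.40]
served_logprobs = [-0.13, -2.29, -0.05, -1.42]
print(f"{cosine_similarity(reference_logprobs, served_logprobs):.4f}")
```

A value close to 1 indicates the two runs assign nearly proportional log-probabilities; it does not by itself establish bit-identical output, which the description reports separately for the uniform-NVFP4 case.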

Summary by CodeRabbit

Release Notes

Improvements
- Enhanced quantization configuration handling for mixture-of-experts models in mixed-precision checkpoints across GLM-4 and Qwen3 architectures.
Tests
- Added unit tests for expert quantization configuration resolution.

…3-MoE)

GLM-4.x / Qwen3-MoE asserted against (or fell back to the global) MIXED_PRECISION quant config for the fused MoE, so a per-layer mixed-precision (e.g. NVFP4 + FP8) checkpoint either errored ("MIXED_PRECISION is ambiguous") or loaded the experts as bf16 and crashed in weight loading.

Pass the per-layer routed-experts quant config to the MoE (ConfigurableMoE already propagates override_quant_config to the backend), and derive that config from the per-expert keys ModelOpt exports when a fused "...mlp.experts" key is absent. Drop the MIXED_PRECISION assert in GLM; disable eager fusion under MIXED_PRECISION.

A layer's experts block still uses one format (fused-kernel requirement); formats may vary across layers.

Adds a unit test for the per-layer experts quant-config resolution.

Signed-off-by: Joshua Hill <jhdhill@uwaterloo.ca>

joshua-hill · 2026-06-19T00:17:15Z

/bot run

coderabbitai · 2026-06-19T00:20:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 990bfb5d-4a83-461e-ab18-02c40da901ce

📥 Commits

Reviewing files that changed from the base of the PR and between 4a8b7af and 0eda8cf.

📒 Files selected for processing (3)

tensorrt_llm/_torch/models/modeling_glm.py
tensorrt_llm/_torch/models/modeling_qwen3_moe.py
tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py

📝 Walkthrough


Two MoE model implementations (GLM-4 and Qwen3-MoE) gain a _get_experts_quant_config helper that resolves per-layer expert quantization configuration from mixed-precision checkpoint dicts via a two-stage lookup (block key → per-expert projection key scan → global fallback). The resolved config is passed as override_quant_config into MoE expert construction. Unit tests cover both implementations.

Changes

Mixed-precision MoE expert quant config resolution

| Layer / File(s) | Summary |
| --- | --- |
| Expert quant config resolution helpers and MoE wiring `tensorrt_llm/_torch/models/modeling_glm.py`, `tensorrt_llm/_torch/models/modeling_qwen3_moe.py` | `Glm4MoE._get_experts_quant_config` is updated to a two-stage lookup: first checks the direct block key in `quant_config_dict`, then scans per-expert keys ending in `gate`/`up`/`down` projections, and falls back to the global quant config. A parallel module-level `_get_experts_quant_config` is added to `modeling_qwen3_moe.py` with identical logic. In `Glm4DecoderLayer.__init__`, `quant_config` and `experts_quant_config` are now computed separately; `is_nvfp4` is derived from the general config with an added `MIXED_PRECISION` assertion, and `experts_quant_config` is passed to `Glm4MoE`. In `Qwen3MoE.__init__`, `create_moe` receives the derived experts config via `override_quant_config`. |
| Parametrized unit tests `tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py` | Adds a pytest module parametrized over both expert quant config getters, covering: global fallback when no dict is present, bare experts block key match, per-expert key derivation while ignoring attention keys, and global fallback when only other-layer keys exist. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: enabling per-layer mixed-precision MoE serving for GLM and Qwen3-MoE models, which directly aligns with the changeset's core purpose. |
| Description check | ✅ Passed | The PR description provides a comprehensive explanation with 'What', 'Why', 'How', and 'Test' sections, covering the issue, solution, and validation. It addresses the template's core requirements despite not following the exact template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


joshua-hill requested review from a team as code owners June 19, 2026 00:17

joshua-hill requested review from dongjiyingdjy and symphonylyh June 19, 2026 00:17

github-actions Bot assigned joshua-hill Jun 19, 2026

joshua-hill wants to merge 1 commit into NVIDIA:main from joshua-hill:feat/per-layer-mixed-precision-moe
