[None][feat] support per-layer mixed-precision MoE serving (GLM, Qwen3-MoE) #15485
joshua-hill wants to merge 1 commit into
GLM-4.x / Qwen3-MoE asserted against (or fell back to the global) MIXED_PRECISION
quant config for the fused MoE, so a per-layer mixed-precision (e.g. NVFP4 + FP8)
checkpoint either errored ("MIXED_PRECISION is ambiguous") or loaded the experts as
bf16 and crashed in weight loading.
Pass the per-layer routed-experts quant config to the MoE (ConfigurableMoE already
propagates override_quant_config to the backend), and derive that config from the
per-expert keys ModelOpt exports when a fused "...mlp.experts" key is absent. Drop
the MIXED_PRECISION assert in GLM; disable eager fusion under MIXED_PRECISION.
A layer's experts block still uses one format (fused-kernel requirement); formats
may vary across layers. Adds a unit test for the per-layer experts quant-config
resolution.
Signed-off-by: Joshua Hill <jhdhill@uwaterloo.ca>
/bot run
No actionable comments were generated in the recent review.
Walkthrough
Two MoE model implementations (GLM-4 and Qwen3-MoE) gain a …
Changes
Mixed-precision MoE expert quant config resolution
Estimated code review effort: 3 (Moderate) | ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
What
Enable serving per-layer mixed-precision MoE checkpoints (e.g. NVFP4 + FP8 across layers) for GLM-4.x and Qwen3-MoE.
Why
Today a MIXED_PRECISION MoE checkpoint fails to serve:
- The load errors with "MIXED_PRECISION is ambiguous" in Glm4DecoderLayer.
- Where the MoE falls back to the global MIXED_PRECISION config, it allocates the experts as bf16 and crashes in load_expert_w3_w1_weight with a shape mismatch.
How
- Pass the per-layer routed-experts quant config to the MoE. ConfigurableMoE already propagates override_quant_config to the backend before create_weights, so the buffers are then allocated in the correct per-layer format.
- _get_experts_quant_config derives the block format from the per-expert keys ModelOpt exports (...mlp.experts.0.gate_proj) when a fused ...mlp.experts key is absent.
- Drop the MIXED_PRECISION assert in GLM; disable eager fusion under MIXED_PRECISION.
- … Qwen3MoE.
Constraint unchanged: a layer's experts block uses a single format (fused-kernel requirement); formats may differ across layers.
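The resolution order described above (fused key first, then per-expert keys, then the global config) can be illustrated with a small standalone sketch. This is not the PR's _get_experts_quant_config: the function name, the flat dict of module name to format string, and the example key names are all assumptions made for illustration.

```python
from typing import Mapping, Optional


def resolve_experts_quant_algo(
    quantized_layers: Mapping[str, str],
    experts_prefix: str,
    global_algo: Optional[str] = None,
) -> Optional[str]:
    """Pick the single quant format for one layer's routed-experts block.

    Hypothetical helper: `quantized_layers` stands in for a per-module map
    of quant format names read from the checkpoint's quant config.
    """
    # 1) Bare (fused) key, e.g. "model.layers.0.mlp.experts".
    if experts_prefix in quantized_layers:
        return quantized_layers[experts_prefix]

    # 2) Per-expert keys, e.g. "model.layers.1.mlp.experts.0.gate_proj".
    child_prefix = experts_prefix + "."
    algos = {
        algo
        for name, algo in quantized_layers.items()
        if name.startswith(child_prefix)
    }
    if len(algos) > 1:
        # The fused kernel needs one format per experts block.
        raise ValueError(f"{experts_prefix}: mixed formats {sorted(algos)}")
    if algos:
        return next(iter(algos))

    # 3) Nothing recorded for this block: fall back to the global setting.
    return global_algo


# Illustrative per-layer map: layer 0 has a fused key, layer 1 only has
# per-expert keys, layer 2 has no entry at all.
layers = {
    "model.layers.0.mlp.experts": "FP8",
    "model.layers.1.mlp.experts.0.gate_proj": "NVFP4",
    "model.layers.1.mlp.experts.0.up_proj": "NVFP4",
    "model.layers.1.mlp.experts.1.gate_proj": "NVFP4",
}

fused = resolve_experts_quant_algo(layers, "model.layers.0.mlp.experts")
per_expert = resolve_experts_quant_algo(layers, "model.layers.1.mlp.experts")
fallback = resolve_experts_quant_algo(layers, "model.layers.2.mlp.experts", "FP8")
```

Raising when the per-expert keys disagree mirrors the stated constraint that one experts block uses one format, while different layers remain free to resolve to different formats.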
Test
tests/unittest/_torch/modeling/test_mixed_precision_moe_quant_config.py (pure-Python: per-expert-key fallback / bare-key / global-fallback, for both GLM and Qwen3-MoE).