[QDP] Add Mahout-AMD and PennyLane-AMDGPU frameworks to throughput benchmark#1291
Merged
400Ping merged 4 commits intoApr 26, 2026
Merged
Conversation
…nchmark
Extends the canonical multi-framework throughput benchmark
(benchmark_throughput.py: Mahout vs PennyLane vs Qiskit) with two new
frameworks so users on AMD ROCm hosts can benchmark on equal footing:
* mahout-amd — QDP AMD path via QdpEngine(backend="amd")
(TritonAmdEngine on ROCm). Drives the same
prefetched_batches loop as the existing 'mahout'
path. Auto-skips on hosts without is_triton_amd_available().
* pennylane-amdgpu — PennyLane lightning.amdgpu, the official ROCm
simulator (Kokkos+HIP backend). Per-sample loop
because the native sim doesn't broadcast over
batch dimension for AmplitudeEmbedding.
Both new framework names are added to FRAMEWORK_CHOICES and dispatched
in main(). Speedup reporting picks whichever Mahout backend ran.
Loader caveat: lightning.amdgpu's bundled liblightning_kokkos_catalyst.so
NEEDs libhsa-runtime64.so.1 / libamdhip64.so.7. Ubuntu 24.04's
/lib/x86_64-linux-gnu/libhsa-runtime64.so.1 is from ROCm 5.7 and shadows
newer symbols (e.g. hsa_amd_memory_get_preferred_copy_engine), making
the plugin fail at device init. Resolved by RTLD_GLOBAL-pre-loading the
matching ROCm 7.x libs at module top, gated on ROCM_LIB_DIR env var or
MAHOUT_PRELOAD_ROCM=1. The preload MUST happen before torch / pennylane
import — doing it after deadlocks because torch's HIP runtime has
already mapped the older libhsa.
Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
pennylane 0.44.1, pennylane-lightning-amdgpu 0.44.0). End-to-end at
q=12, 3200 samples:
Mahout-AMD 3,965 vec/s (winner)
PennyLane-AMDGPU 705 vec/s (5.6x slower)
PennyLane 627 vec/s (6.3x slower)
CUDA hosts and CPU-only hosts unaffected; new frameworks gracefully
skip when their dependencies are missing.
Member
Author
|
lg, cc @400Ping |
* Remove unused `# noqa: E402` directives (RUF100). The ruff config in this repo doesn't enable E402, so the markers were warnings rather than suppressions. * Apply `ruff format` to satisfy the pre-commit format hook. No behavior change.
Three pre-existing methodology issues were inflating Mahout's reported speedup vs PennyLane in the throughput numbers. Fixed all three so the honest result stands on its own: * run_pennylane was casting complex64 -> float32 on the GPU transfer, silently dropping the imaginary part of every encoded state (and triggering pytorch's "Casting complex values to real discards the imaginary part" warning on every batch). The broken pennylane path was ~4x slower than a correct one — the cast is removed and the state stays complex64 on the target device. * No warmup anywhere meant first-batch costs (Triton AMD JIT autotune, PennyLane QNode tracing/cache, Kokkos device init, Aer transpile cache) were inside every timer. Added WARMUP_BATCHES = 3 used by all runners. * run_mahout_amd ran float32 while both PennyLane runners ran float64. All runners now use float32 input for the same dtype across the comparison. Also tightened the ROCm preload gate: previously '_preload_rocm_libs_at_import' fired when EITHER MAHOUT_PRELOAD_ROCM=1 OR ROCM_LIB_DIR was set. A stale exported ROCM_LIB_DIR could trigger global symbol injection in unrelated processes that import this module. Now it requires the explicit MAHOUT_PRELOAD_ROCM=1 opt-in. Also added run_qiskit warmup and made torch.cuda.synchronize() calls in run_pennylane and run_qiskit conditional on torch.cuda.is_available() so the script no longer raises on AMD-only or CPU-only hosts when those runners try to bookend their timers. Honest results on AMD Instinct MI300X (batch=64, post-fix): q=8, 6400 Mahout-AMD 111,703 / PennyLane 44,085 => 2.53x q=12, 3200 Mahout-AMD 4,108 / PennyLane 2,545 => 1.61x q=16, 1920 Mahout-AMD 715 / PennyLane 771 => 0.93x Mahout-AMD wins at small states (lower per-call dispatch); default.qubit GPU's batched torch broadcast catches up at larger states. Worth a follow-up to trim per-call overhead in triton_amd.py. PennyLane-AMDGPU stays 2-76x behind both because it doesn't broadcast over the batch dimension for AmplitudeEmbedding — that's the public API reality, not a kernel-level comparison.
Adds a 'pytorch-ref' framework that runs the same amplitude-encoding workload as 'mahout-amd' but goes through the project's own PyTorch reference (qumat_qdp.torch_ref.amplitude_encode) — a pure PyTorch op chain (L2 normalize, zero-pad to 2**num_qubits, complex view) with no engine wrapper. Useful as a "what naive PyTorch on the same hardware can do" ceiling for the AMD comparison: gaps between this and 'mahout-amd' quantify the per-call overhead in the AMD engine adapter rather than any fundamental kernel-level difference. Honest result on AMD Instinct MI300X (batch=64, 12800 samples at q=8/q=12, 6400 at q=16): q=8 PyTorch-ref 119,815 / Mahout-AMD 112,650 => M-AMD 0.94x q=12 PyTorch-ref 4,163 / Mahout-AMD 4,918 => M-AMD 1.18x q=16 PyTorch-ref 926 / Mahout-AMD 890 => M-AMD 0.96x Mahout-AMD essentially ties PyTorch-ref within run-to-run noise. The AMD path's TritonAmdEngine is not currently delivering a kernel-level speedup over the same operation in plain PyTorch ops — closing or beating that gap would require real @triton.jit kernels (follow-up). The Mahout-AMD value-add on AMD is therefore the unified API surface and DLPack-zero-copy integration, not raw kernel throughput. The honest win remains 1.3-2.8x over PennyLane default.qubit GPU (skips MottonenStatePreparation) and 2.6-75x over PennyLane lightning.amdgpu (skips per-sample QNode dispatch).
Member
|
I only did a quick scan but since it is kind of urgent, so I am merging this. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related Issues
Follow-up to #1289 (AMD backend selection in encoding benchmarks).
Changes
Summary
Extends the canonical throughput benchmark
(
benchmark_throughput.py— Mahout vs PennyLane vs Qiskit) with threenew frameworks so users on AMD ROCm hosts can benchmark on equal
footing with the existing CUDA paths:
mahout-amd— QDP AMD path viaQdpEngine(backend="amd")(the TritonAmdEngine landed in [Feature][QDP] Add AMD GPU support via Triton backend #1158, exposed via the public router
in [QDP] Add AMD backend selection to QDP encoding benchmarks #1289).
pennylane-amdgpu— PennyLanelightning.amdgpu, the officialROCm simulator (Kokkos+HIP backend).
pytorch-ref— Pure-PyTorch reference implementation(
qumat_qdp.torch_ref.amplitude_encode). Same workload, no enginewrapper. Useful as a "what naive PyTorch on the same hardware can
do" ceiling — gaps between this and
mahout-amdquantify anyper-call overhead in the AMD engine adapter.
Both names are added to
FRAMEWORK_CHOICES, dispatched inmain(),and included in the speedup-ratio summary. Existing CUDA-only and
CPU-only hosts are unaffected; the new frameworks auto-skip with a
clear message when their runtimes aren't available.
While here, also fixed three pre-existing methodology issues that were
inflating Mahout's reported speedup vs PennyLane:
run_pennylanewas castingcomplex64 → float32for the GPUtransfer (line
state_cpu.to("cuda", dtype=torch.float32)),discarding the imaginary part of the encoded state. PyTorch's
Casting complex values to real discards the imaginary partwarning fired on every batch. This silently produced wrong results
AND inflated Mahout's win because the broken pennylane path was
~4× slower than a correct one.
autotune, PennyLane QNode tracing, Kokkos device init, AerSim
transpile cache) were inside every timer. Added
WARMUP_BATCHES = 3used by all runners.
ran float64. All runners now use float32 input.
Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 / pennylane 0.44.1)
End-to-end DataLoader → encode → consumer pipeline, batch_size=64,
3 warmup batches not timed, all runners on float32 input:
Mahout-AMD speedup ratios:
Honest takeaways:
sizes, within run-to-run noise). The AMD path's TritonAmdEngine is
not adding kernel-level speedup over the same operation expressed
in plain PyTorch ops. The Mahout-AMD value-add on AMD is not
speed; it's the unified Mahout API surface and DLPack-zero-copy
integration story. Closing or beating the PyTorch-ref gap
meaningfully would require real
@triton.jitkernels — flaggedas follow-up.
default.qubitGPU.Both paths run real GPU compute on the same MI300X (verified via
torch.cuda.memory_allocatedgrowth and 40× CPU-vs-CUDA speedup);the win is from skipping PennyLane's
MottonenStatePreparationdecomposition (~2^(n+1) gate ops) and writing the state vector
directly. Wins narrow at large states (compute-bound regime).
lightning.amdgpuis 2.6–75× behind everything else becauseit doesn't broadcast over the batch dimension for
AmplitudeEmbedding; the per-sample QNode dispatch (~1 ms each inPennyLane v0.44) dominates. This is the public-API reality today,
not a kernel-level comparison.
Earlier draft of this PR claimed "5.6×" against PennyLane — that was
inflated by a
complex64 → float32cast inrun_pennylanethatsilently dropped the imaginary part of the encoded state, making the
PennyLane baseline ~4× slower than a correct one. Fixed in this PR;
the corrected number is 1.6–2.8×.
Runtime caveat: lightning.amdgpu loader
The
pennylane-lightning-amdgpuwheel needslibhsa-runtime64.so.1and
libamdhip64.so.7matching the system ROCm install. Ubuntu 24.04ships an older
libhsa-runtime64.so.1(from ROCm 5.7) at/lib/x86_64-linux-gnu/, which shadows newer symbols(e.g.
hsa_amd_memory_get_preferred_copy_engine) the plugin requires.The script handles this by RTLD_GLOBAL-pre-loading the matching
ROCm 7.x libs at module top, gated on
MAHOUT_PRELOAD_ROCM=1(only —
ROCM_LIB_DIRalone no longer auto-triggers, to avoidsurprising other code that imports this module). The preload must
happen before
torch/pennylaneimport — doing it later deadlocksbecause torch's HIP runtime has already mapped the older libhsa.
Usage on a ROCm host:
If
pennylane-lightning-amdgpuisn't installed or the libs aren'tfound, the framework gracefully skips with a hint pointing at
MAHOUT_PRELOAD_ROCM+ROCM_LIB_DIR.Test plan
pennylane,qiskit,mahout) unchanged — all still callprefetched_batches/normalize_batchthe same way, plus they now warm up.mahout-amdruns end-to-end on MI300X (3 sizesverified above).
pennylane-amdgpuruns end-to-end withMAHOUT_PRELOAD_ROCM=1 ROCM_LIB_DIR=....pennylane-lightning-amdgpuinstalled:graceful skip.
pennylane-amdgpufails with clearhint pointing to the env vars.
mahout-amd,pennylane) workswithout ROCm libs preloaded.
ruff checkandruff formatclean.Checklist
benchmark script; runs are validated by the verification table
above)