
[QDP] Add Mahout-AMD and PennyLane-AMDGPU frameworks to throughput benchmark#1291

Merged: 400Ping merged 4 commits into apache:main from ryankert01:benchmark-amd-pennylane-comparison, Apr 26, 2026

Conversation

@ryankert01 (Member) commented Apr 25, 2026

Related Issues

Follow-up to #1289 (AMD backend selection in encoding benchmarks).

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Summary

Extends the canonical throughput benchmark
(benchmark_throughput.py — Mahout vs PennyLane vs Qiskit) with three
new frameworks so users on AMD ROCm hosts can benchmark on equal
footing with the existing CUDA paths:

  • mahout-amd — QDP AMD path via QdpEngine(backend="amd"). The
    TritonAmdEngine landed in #1158 ([Feature][QDP] Add AMD GPU support
    via Triton backend) and is exposed via the public router added in
    #1289 ([QDP] Add AMD backend selection to QDP encoding benchmarks).
  • pennylane-amdgpu — PennyLane lightning.amdgpu, the official
    ROCm simulator (Kokkos+HIP backend).
  • pytorch-ref — Pure-PyTorch reference implementation
    (qumat_qdp.torch_ref.amplitude_encode). Same workload, no engine
    wrapper. Useful as a "what naive PyTorch on the same hardware can
    do" ceiling — gaps between this and mahout-amd quantify any
    per-call overhead in the AMD engine adapter. A sketch of this
    reference path follows this list.
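
For concreteness, a minimal sketch of what the reference path computes,
assuming the L2-normalize / zero-pad / complex-view chain described in
the commits below (amplitude_encode_ref is an illustrative stand-in for
qumat_qdp.torch_ref.amplitude_encode, not its actual code):

import torch

def amplitude_encode_ref(batch: torch.Tensor, num_qubits: int) -> torch.Tensor:
    # L2-normalize each sample so its amplitudes form a valid state.
    normed = batch / batch.norm(dim=1, keepdim=True).clamp_min(1e-12)
    # Zero-pad the feature dimension out to the full 2**n state size.
    padded = torch.nn.functional.pad(normed, (0, 2**num_qubits - normed.shape[1]))
    # Promote to a complex64 statevector (imaginary part is zero).
    return padded.to(torch.complex64)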

All three names are added to FRAMEWORK_CHOICES, dispatched in main(),
and included in the speedup-ratio summary. Existing CUDA-only and
CPU-only hosts are unaffected; the new frameworks auto-skip with a
clear message when their runtimes aren't available.
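
In sketch form, the registration and auto-skip pattern (run_mahout_amd
is named in the commits below; the qumat_qdp import path and the
engine.encode call shape are assumptions for illustration):

FRAMEWORK_CHOICES = [
    "mahout", "pennylane", "qiskit",                  # existing paths
    "mahout-amd", "pennylane-amdgpu", "pytorch-ref",  # new in this PR
]

def run_mahout_amd(batches, num_qubits):
    try:
        from qumat_qdp import QdpEngine  # import path assumed
        engine = QdpEngine(backend="amd")  # routes to TritonAmdEngine on ROCm
    except Exception as exc:  # no AMD runtime on this host
        print(f"skipping mahout-amd: {exc}")
        return None
    total = 0
    for batch in batches:
        total += engine.encode(batch, num_qubits).shape[0]  # call shape assumed
    return total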

While here, also fixed three pre-existing methodology issues that were
inflating Mahout's reported speedup vs PennyLane:

  1. run_pennylane was casting complex64 → float32 for the GPU
    transfer (state_cpu.to("cuda", dtype=torch.float32)), discarding
    the imaginary part of the encoded state. PyTorch's "Casting
    complex values to real discards the imaginary part" warning fired
    on every batch. This silently produced wrong results AND inflated
    Mahout's win, because the broken PennyLane path was ~4× slower
    than a correct one.
  2. No warmup anywhere — first-batch costs (Triton AMD JIT
    autotune, PennyLane QNode tracing, Kokkos device init, AerSim
    transpile cache) were inside every timer. Added WARMUP_BATCHES = 3,
    used by all runners; see the timing sketch after this list.
  3. Dtype mismatch — Mahout-AMD ran float32, both PennyLane runners
    ran float64. All runners now use float32 input.
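
A sketch of the resulting timing discipline (timed_run is a
hypothetical helper; the script's actual runners inline this logic per
framework):

import time
import torch

WARMUP_BATCHES = 3  # shared by all runners

def timed_run(encode_fn, batches):
    # Warmup absorbs first-batch costs: Triton JIT autotune, QNode
    # tracing, Kokkos device init, Aer transpile cache.
    for batch in batches[:WARMUP_BATCHES]:
        encode_fn(batch)
    if torch.cuda.is_available():  # also true on ROCm builds of torch
        torch.cuda.synchronize()
    start = time.perf_counter()
    vectors = 0
    for batch in batches[WARMUP_BATCHES:]:
        encode_fn(batch)  # output stays complex64; no float32 downcast
        vectors += batch.shape[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return vectors / (time.perf_counter() - start)  # vec/s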

Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 / pennylane 0.44.1)

End-to-end DataLoader → encode → consumer pipeline, batch_size=64,
3 warmup batches not timed, all runners on float32 input:

Config                 Mahout-AMD   PyTorch-ref   PennyLane   PL-AMDGPU   (all vec/s)
q=8,  12800 samples       112,650       119,815      48,367       1,492
q=12, 12800 samples         4,918         4,163       1,755         658
q=16,  6400 samples           890           926         704         347

Mahout-AMD speedup ratios:

        vs PyTorch-ref   vs PennyLane   vs PennyLane-AMDGPU
q=8              0.94×          2.33×                75.5×
q=12             1.18×          2.80×                 7.48×
q=16             0.96×          1.27×                 2.57×

Honest takeaways:

  • Mahout-AMD essentially ties PyTorch-ref (±6–18% across qubit
    sizes, within run-to-run noise). The AMD path's TritonAmdEngine is
    not adding kernel-level speedup over the same operation expressed
    in plain PyTorch ops. The Mahout-AMD value-add on AMD is therefore
    not speed; it's the unified Mahout API surface and the DLPack
    zero-copy integration story. Closing or beating the PyTorch-ref
    gap meaningfully would require real @triton.jit kernels — flagged
    as follow-up.
  • Mahout-AMD wins 1.27–2.80× over PennyLane default.qubit GPU.
    Both paths run real GPU compute on the same MI300X (verified via
    torch.cuda.memory_allocated growth and a 40× CPU-vs-CUDA speedup);
    the win comes from skipping PennyLane's MottonenStatePreparation
    decomposition (~2^(n+1) gate ops) and writing the state vector
    directly. The baseline's call shape is sketched after this list.
    Wins narrow at large states (compute-bound regime).
  • lightning.amdgpu is 2.6–75× behind everything else because
    it doesn't broadcast over the batch dimension for
    AmplitudeEmbedding; the per-sample QNode dispatch (~1 ms each in
    PennyLane v0.44) dominates. This is the public-API reality today,
    not a kernel-level comparison.
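
For reference, the shape of the PennyLane baseline being timed; this is
a minimal illustrative QNode, not the benchmark's exact code:

import pennylane as qml
import torch

num_qubits = 8
dev = qml.device("default.qubit", wires=num_qubits)

@qml.qnode(dev, interface="torch")
def encode(x):
    # On devices without native state preparation this decomposes via
    # MottonenStatePreparation, the overhead the PR identifies.
    qml.AmplitudeEmbedding(x, wires=range(num_qubits), normalize=True)
    return qml.state()

state = encode(torch.rand(2**num_qubits))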

An earlier draft of this PR claimed "5.6×" against PennyLane — that
was inflated by a complex64 → float32 cast in run_pennylane that
silently dropped the imaginary part of the encoded state, making the
PennyLane baseline ~4× slower than a correct one. Fixed in this PR;
the corrected range is 1.3–2.8× (see the table above).

Runtime caveat: lightning.amdgpu loader

The pennylane-lightning-amdgpu wheel needs libhsa-runtime64.so.1
and libamdhip64.so.7 matching the system ROCm install. Ubuntu 24.04
ships an older libhsa-runtime64.so.1 (from ROCm 5.7) at
/lib/x86_64-linux-gnu/, which shadows newer symbols
(e.g. hsa_amd_memory_get_preferred_copy_engine) the plugin requires.

The script handles this by RTLD_GLOBAL-pre-loading the matching
ROCm 7.x libs at module top, gated on MAHOUT_PRELOAD_ROCM=1 only
(ROCM_LIB_DIR alone no longer auto-triggers the preload, to avoid
surprising other code that imports this module). The preload must
happen before the torch / pennylane imports — doing it later deadlocks
because torch's HIP runtime has already mapped the older libhsa.
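
In sketch form (the library names come from the caveat above; the
script's exact preload list may differ):

import ctypes
import os

# Explicit opt-in only; a stale ROCM_LIB_DIR alone must not trigger this.
if os.environ.get("MAHOUT_PRELOAD_ROCM") == "1":
    lib_dir = os.environ.get("ROCM_LIB_DIR", "/opt/rocm/lib")
    for name in ("libhsa-runtime64.so.1", "libamdhip64.so.7"):
        # RTLD_GLOBAL publishes these symbols so later loads resolve
        # against the ROCm 7.x copies, not the stale system copy.
        ctypes.CDLL(os.path.join(lib_dir, name), mode=ctypes.RTLD_GLOBAL)

import torch       # imports must come after the preload
import pennylane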

Usage on a ROCm host:

MAHOUT_PRELOAD_ROCM=1 ROCM_LIB_DIR=/opt/rocm-7.2.0/lib \
  uv run python benchmark/benchmark_throughput.py \
  --frameworks mahout-amd,pennylane,pennylane-amdgpu --qubits 12

If pennylane-lightning-amdgpu isn't installed or the libs aren't
found, the framework gracefully skips with a hint pointing at
MAHOUT_PRELOAD_ROCM + ROCM_LIB_DIR.

Test plan

  • CUDA host: existing frameworks (pennylane, qiskit,
    mahout) unchanged — all still call prefetched_batches /
    normalize_batch the same way, plus they now warm up.
  • AMD host: mahout-amd runs end-to-end on MI300X (3 sizes
    verified above).
  • AMD host: pennylane-amdgpu runs end-to-end with
    MAHOUT_PRELOAD_ROCM=1 ROCM_LIB_DIR=....
  • AMD host without pennylane-lightning-amdgpu installed:
    graceful skip.
  • Host without preload env: pennylane-amdgpu fails with clear
    hint pointing to the env vars.
  • Mixed framework selection (mahout-amd,pennylane) works
    without ROCm libs preloaded.
  • ruff check and ruff format clean.

Checklist

  • Added or updated documentation for all changes
  • Added or updated unit tests for all changes — N/A (this is a
    benchmark script; runs are validated by the verification table
    above)

…nchmark

Extends the canonical multi-framework throughput benchmark
(benchmark_throughput.py: Mahout vs PennyLane vs Qiskit) with two new
frameworks so users on AMD ROCm hosts can benchmark on equal footing:

* mahout-amd       — QDP AMD path via QdpEngine(backend="amd")
                     (TritonAmdEngine on ROCm). Drives the same
                     prefetched_batches loop as the existing 'mahout'
                     path. Auto-skips on hosts without is_triton_amd_available().

* pennylane-amdgpu — PennyLane lightning.amdgpu, the official ROCm
                     simulator (Kokkos+HIP backend). Per-sample loop
                     because the native sim doesn't broadcast over
                     batch dimension for AmplitudeEmbedding.

Both new framework names are added to FRAMEWORK_CHOICES and dispatched
in main(). Speedup reporting picks whichever Mahout backend ran.

Loader caveat: lightning.amdgpu's bundled liblightning_kokkos_catalyst.so
NEEDs libhsa-runtime64.so.1 / libamdhip64.so.7. Ubuntu 24.04's
/lib/x86_64-linux-gnu/libhsa-runtime64.so.1 is from ROCm 5.7 and shadows
newer symbols (e.g. hsa_amd_memory_get_preferred_copy_engine), making
the plugin fail at device init. Resolved by RTLD_GLOBAL-pre-loading the
matching ROCm 7.x libs at module top, gated on ROCM_LIB_DIR env var or
MAHOUT_PRELOAD_ROCM=1. The preload MUST happen before torch / pennylane
import — doing it after deadlocks because torch's HIP runtime has
already mapped the older libhsa.

Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
pennylane 0.44.1, pennylane-lightning-amdgpu 0.44.0). End-to-end at
q=12, 3200 samples:

  Mahout-AMD          3,965 vec/s    (winner)
  PennyLane-AMDGPU      705 vec/s    (5.6x slower)
  PennyLane             627 vec/s    (6.3x slower)

CUDA hosts and CPU-only hosts unaffected; new frameworks gracefully
skip when their dependencies are missing.
@ryankert01 (Member Author) commented:

lg, cc @400Ping

* Remove unused `# noqa: E402` directives (RUF100). The ruff config in
  this repo doesn't enable E402, so the markers were warnings rather
  than suppressions.
* Apply `ruff format` to satisfy the pre-commit format hook.

No behavior change.
@ryankert01 ryankert01 requested a review from rich7420 April 25, 2026 18:55
Three pre-existing methodology issues were inflating Mahout's reported
speedup vs PennyLane in the throughput numbers. Fixed all three so the
honest result stands on its own:

* run_pennylane was casting complex64 -> float32 on the GPU transfer,
  silently dropping the imaginary part of every encoded state (and
  triggering pytorch's "Casting complex values to real discards the
  imaginary part" warning on every batch). The broken pennylane path
  was ~4x slower than a correct one — the cast is removed and the
  state stays complex64 on the target device.

* No warmup anywhere meant first-batch costs (Triton AMD JIT autotune,
  PennyLane QNode tracing/cache, Kokkos device init, Aer transpile
  cache) were inside every timer. Added WARMUP_BATCHES = 3 used by all
  runners.

* run_mahout_amd ran float32 while both PennyLane runners ran float64.
  All runners now use float32 input for the same dtype across the
  comparison.

Also tightened the ROCm preload gate: previously '_preload_rocm_libs_at_import'
fired when EITHER MAHOUT_PRELOAD_ROCM=1 OR ROCM_LIB_DIR was set. A stale
exported ROCM_LIB_DIR could trigger global symbol injection in unrelated
processes that import this module. Now it requires the explicit
MAHOUT_PRELOAD_ROCM=1 opt-in.

Also added run_qiskit warmup and made torch.cuda.synchronize() calls
in run_pennylane and run_qiskit conditional on torch.cuda.is_available()
so the script no longer raises on AMD-only or CPU-only hosts when those
runners try to bookend their timers.

Honest results on AMD Instinct MI300X (batch=64, post-fix):
  q=8, 6400  Mahout-AMD 111,703 / PennyLane 44,085  =>  2.53x
  q=12, 3200 Mahout-AMD   4,108 / PennyLane  2,545  =>  1.61x
  q=16, 1920 Mahout-AMD     715 / PennyLane    771  =>  0.93x

Mahout-AMD wins at small states (lower per-call dispatch); default.qubit
GPU's batched torch broadcast catches up at larger states. Worth a
follow-up to trim per-call overhead in triton_amd.py.

PennyLane-AMDGPU stays 2-76x behind both because it doesn't broadcast
over the batch dimension for AmplitudeEmbedding — that's the public API
reality, not a kernel-level comparison.

Adds a 'pytorch-ref' framework that runs the same amplitude-encoding
workload as 'mahout-amd' but goes through the project's own PyTorch
reference (qumat_qdp.torch_ref.amplitude_encode) — a pure PyTorch op
chain (L2 normalize, zero-pad to 2**num_qubits, complex view) with no
engine wrapper.

Useful as a "what naive PyTorch on the same hardware can do" ceiling
for the AMD comparison: gaps between this and 'mahout-amd' quantify
the per-call overhead in the AMD engine adapter rather than any
fundamental kernel-level difference.

Honest result on AMD Instinct MI300X (batch=64, 12800 samples at
q=8/q=12, 6400 at q=16):

  q=8   PyTorch-ref 119,815 / Mahout-AMD 112,650  =>  M-AMD 0.94x
  q=12  PyTorch-ref   4,163 / Mahout-AMD   4,918  =>  M-AMD 1.18x
  q=16  PyTorch-ref     926 / Mahout-AMD     890  =>  M-AMD 0.96x

Mahout-AMD essentially ties PyTorch-ref within run-to-run noise. The
AMD path's TritonAmdEngine is not currently delivering a kernel-level
speedup over the same operation in plain PyTorch ops — closing or
beating that gap would require real @triton.jit kernels (follow-up).

The Mahout-AMD value-add on AMD is therefore the unified API surface
and DLPack-zero-copy integration, not raw kernel throughput. The
honest win remains 1.3-2.8x over PennyLane default.qubit GPU (skips
MottonenStatePreparation) and 2.6-75x over PennyLane lightning.amdgpu
(skips per-sample QNode dispatch).
@400Ping (Member) left a comment:

Overall LGTM

@400Ping (Member) commented Apr 26, 2026

I only did a quick scan, but since it is kind of urgent, I am merging this.

@400Ping 400Ping merged commit 696f902 into apache:main Apr 26, 2026
8 checks passed
@ryankert01 ryankert01 deleted the benchmark-amd-pennylane-comparison branch April 26, 2026 04:46
@ryankert01 ryankert01 added this to the Qumat 0.6.0 milestone May 13, 2026