
[QDP] Add Mahout-AMD and PennyLane-AMDGPU frameworks to throughput benchmark#1291

Merged: 400Ping merged 4 commits into apache:main from ryankert01:benchmark-amd-pennylane-comparison, Apr 26, 2026

Conversation

@ryankert01 (Member) commented Apr 25, 2026

Related Issues

Follow-up to #1289 (AMD backend selection in encoding benchmarks).

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Summary

Extends the canonical throughput benchmark
(benchmark_throughput.py — Mahout vs PennyLane vs Qiskit) with three
new frameworks so users on AMD ROCm hosts can benchmark on equal
footing with the existing CUDA paths:

  • mahout-amd — QDP AMD path via QdpEngine(backend="amd"). The
    TritonAmdEngine landed in #1158 ([Feature][QDP] Add AMD GPU support
    via Triton backend) and is exposed via the public router added in
    #1289 ([QDP] Add AMD backend selection to QDP encoding benchmarks).
  • pennylane-amdgpu — PennyLane lightning.amdgpu, the official
    ROCm simulator (Kokkos+HIP backend).
  • pytorch-ref — Pure-PyTorch reference implementation
    (qumat_qdp.torch_ref.amplitude_encode). Same workload, no engine
    wrapper. Useful as a "what naive PyTorch on the same hardware can
    do" ceiling — gaps between this and mahout-amd quantify any
    per-call overhead in the AMD engine adapter. A sketch of this
    reference path follows this list.
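
For concreteness, a minimal sketch of what the reference path computes,
assuming the L2-normalize / zero-pad / complex-view chain described in
the commits below (amplitude_encode_ref is an illustrative stand-in for
qumat_qdp.torch_ref.amplitude_encode, not its actual code):

import torch

def amplitude_encode_ref(batch: torch.Tensor, num_qubits: int) -> torch.Tensor:
    # L2-normalize each sample so its amplitudes form a valid state.
    normed = batch / batch.norm(dim=1, keepdim=True).clamp_min(1e-12)
    # Zero-pad the feature dimension out to the full 2**n state size.
    padded = torch.nn.functional.pad(normed, (0, 2**num_qubits - normed.shape[1]))
    # Promote to a complex64 statevector (imaginary part is zero).
    return padded.to(torch.complex64)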

All three names are added to FRAMEWORK_CHOICES, dispatched in main(),
and included in the speedup-ratio summary. Existing CUDA-only and
CPU-only hosts are unaffected; the new frameworks auto-skip with a
clear message when their runtimes aren't available.
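
In sketch form, the registration and auto-skip pattern (run_mahout_amd
is named in the commits below; the qumat_qdp import path and the
engine.encode call shape are assumptions for illustration):

FRAMEWORK_CHOICES = [
    "mahout", "pennylane", "qiskit",                  # existing paths
    "mahout-amd", "pennylane-amdgpu", "pytorch-ref",  # new in this PR
]

def run_mahout_amd(batches, num_qubits):
    try:
        from qumat_qdp import QdpEngine  # import path assumed
        engine = QdpEngine(backend="amd")  # routes to TritonAmdEngine on ROCm
    except Exception as exc:  # no AMD runtime on this host
        print(f"skipping mahout-amd: {exc}")
        return None
    total = 0
    for batch in batches:
        total += engine.encode(batch, num_qubits).shape[0]  # call shape assumed
    return total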

While here, also fixed three pre-existing methodology issues that were
inflating Mahout's reported speedup vs PennyLane:

  1. run_pennylane was casting complex64 → float32 for the GPU
    transfer (state_cpu.to("cuda", dtype=torch.float32)), discarding
    the imaginary part of the encoded state. PyTorch's "Casting
    complex values to real discards the imaginary part" warning fired
    on every batch. This silently produced wrong results AND inflated
    Mahout's win, because the broken PennyLane path was ~4× slower
    than a correct one.
  2. No warmup anywhere — first-batch costs (Triton AMD JIT
    autotune, PennyLane QNode tracing, Kokkos device init, AerSim
    transpile cache) were inside every timer. Added WARMUP_BATCHES = 3,
    used by all runners; see the timing sketch after this list.
  3. Dtype mismatch — Mahout-AMD ran float32, both PennyLane runners
    ran float64. All runners now use float32 input.
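
A sketch of the resulting timing discipline (timed_run is a
hypothetical helper; the script's actual runners inline this logic per
framework):

import time
import torch

WARMUP_BATCHES = 3  # shared by all runners

def timed_run(encode_fn, batches):
    # Warmup absorbs first-batch costs: Triton JIT autotune, QNode
    # tracing, Kokkos device init, Aer transpile cache.
    for batch in batches[:WARMUP_BATCHES]:
        encode_fn(batch)
    if torch.cuda.is_available():  # also true on ROCm builds of torch
        torch.cuda.synchronize()
    start = time.perf_counter()
    vectors = 0
    for batch in batches[WARMUP_BATCHES:]:
        encode_fn(batch)  # output stays complex64; no float32 downcast
        vectors += batch.shape[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return vectors / (time.perf_counter() - start)  # vec/s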

Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 / pennylane 0.44.1)

End-to-end DataLoader → encode → consumer pipeline, batch_size=64,
3 warmup batches not timed, all runners on float32 input:

Config                 Mahout-AMD   PyTorch-ref   PennyLane   PL-AMDGPU   (all vec/s)
q=8,  12800 samples       112,650       119,815      48,367       1,492
q=12, 12800 samples         4,918         4,163       1,755         658
q=16,  6400 samples           890           926         704         347

Mahout-AMD speedup ratios:

        vs PyTorch-ref   vs PennyLane   vs PennyLane-AMDGPU
q=8              0.94×          2.33×                75.5×
q=12             1.18×          2.80×                 7.48×
q=16             0.96×          1.27×                 2.57×

Honest takeaways:

  • Mahout-AMD essentially ties PyTorch-ref (±6–18% across qubit
    sizes, within run-to-run noise). The AMD path's TritonAmdEngine is
    not adding kernel-level speedup over the same operation expressed
    in plain PyTorch ops. The Mahout-AMD value-add on AMD is therefore
    not speed; it's the unified Mahout API surface and the DLPack
    zero-copy integration story. Closing or beating the PyTorch-ref
    gap meaningfully would require real @triton.jit kernels — flagged
    as follow-up.
  • Mahout-AMD wins 1.27–2.80× over PennyLane default.qubit GPU.
    Both paths run real GPU compute on the same MI300X (verified via
    torch.cuda.memory_allocated growth and a 40× CPU-vs-CUDA speedup);
    the win comes from skipping PennyLane's MottonenStatePreparation
    decomposition (~2^(n+1) gate ops) and writing the state vector
    directly. The baseline's call shape is sketched after this list.
    Wins narrow at large states (compute-bound regime).
  • lightning.amdgpu is 2.6–75× behind everything else because
    it doesn't broadcast over the batch dimension for
    AmplitudeEmbedding; the per-sample QNode dispatch (~1 ms each in
    PennyLane v0.44) dominates. This is the public-API reality today,
    not a kernel-level comparison.
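
For reference, the shape of the PennyLane baseline being timed; this is
a minimal illustrative QNode, not the benchmark's exact code:

import pennylane as qml
import torch

num_qubits = 8
dev = qml.device("default.qubit", wires=num_qubits)

@qml.qnode(dev, interface="torch")
def encode(x):
    # On devices without native state preparation this decomposes via
    # MottonenStatePreparation, the overhead the PR identifies.
    qml.AmplitudeEmbedding(x, wires=range(num_qubits), normalize=True)
    return qml.state()

state = encode(torch.rand(2**num_qubits))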

An earlier draft of this PR claimed "5.6×" against PennyLane — that
was inflated by a complex64 → float32 cast in run_pennylane that
silently dropped the imaginary part of the encoded state, making the
PennyLane baseline ~4× slower than a correct one. Fixed in this PR;
the corrected range is 1.3–2.8× (see the table above).

Runtime caveat: lightning.amdgpu loader

The pennylane-lightning-amdgpu wheel needs libhsa-runtime64.so.1
and libamdhip64.so.7 matching the system ROCm install. Ubuntu 24.04
ships an older libhsa-runtime64.so.1 (from ROCm 5.7) at
/lib/x86_64-linux-gnu/, which shadows newer symbols
(e.g. hsa_amd_memory_get_preferred_copy_engine) the plugin requires.

The script handles this by RTLD_GLOBAL-pre-loading the matching
ROCm 7.x libs at module top, gated on MAHOUT_PRELOAD_ROCM=1 only
(ROCM_LIB_DIR alone no longer auto-triggers the preload, to avoid
surprising other code that imports this module). The preload must
happen before the torch / pennylane imports — doing it later deadlocks
because torch's HIP runtime has already mapped the older libhsa.
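
In sketch form (the library names come from the caveat above; the
script's exact preload list may differ):

import ctypes
import os

# Explicit opt-in only; a stale ROCM_LIB_DIR alone must not trigger this.
if os.environ.get("MAHOUT_PRELOAD_ROCM") == "1":
    lib_dir = os.environ.get("ROCM_LIB_DIR", "/opt/rocm/lib")
    for name in ("libhsa-runtime64.so.1", "libamdhip64.so.7"):
        # RTLD_GLOBAL publishes these symbols so later loads resolve
        # against the ROCm 7.x copies, not the stale system copy.
        ctypes.CDLL(os.path.join(lib_dir, name), mode=ctypes.RTLD_GLOBAL)

import torch       # imports must come after the preload
import pennylane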

Usage on a ROCm host:

MAHOUT_PRELOAD_ROCM=1 ROCM_LIB_DIR=/opt/rocm-7.2.0/lib \
  uv run python benchmark/benchmark_throughput.py \
  --frameworks mahout-amd,pennylane,pennylane-amdgpu --qubits 12

If pennylane-lightning-amdgpu isn't installed or the libs aren't
found, the framework gracefully skips with a hint pointing at
MAHOUT_PRELOAD_ROCM + ROCM_LIB_DIR.

Test plan

  • CUDA host: existing frameworks (pennylane, qiskit,
    mahout) unchanged — all still call prefetched_batches /
    normalize_batch the same way, plus they now warm up.
  • AMD host: mahout-amd runs end-to-end on MI300X (3 sizes
    verified above).
  • AMD host: pennylane-amdgpu runs end-to-end with
    MAHOUT_PRELOAD_ROCM=1 ROCM_LIB_DIR=....
  • AMD host without pennylane-lightning-amdgpu installed:
    graceful skip.
  • Host without preload env: pennylane-amdgpu fails with clear
    hint pointing to the env vars.
  • Mixed framework selection (mahout-amd,pennylane) works
    without ROCm libs preloaded.
  • ruff check and ruff format clean.

Checklist

  • Added or updated documentation for all changes
  • Added or updated unit tests for all changes — N/A (this is a
    benchmark script; runs are validated by the verification table
    above)

…nchmark

Extends the canonical multi-framework throughput benchmark
(benchmark_throughput.py: Mahout vs PennyLane vs Qiskit) with two new
frameworks so users on AMD ROCm hosts can benchmark on equal footing:

* mahout-amd       — QDP AMD path via QdpEngine(backend="amd")
                     (TritonAmdEngine on ROCm). Drives the same
                     prefetched_batches loop as the existing 'mahout'
                     path. Auto-skips on hosts without is_triton_amd_available().

* pennylane-amdgpu — PennyLane lightning.amdgpu, the official ROCm
                     simulator (Kokkos+HIP backend). Per-sample loop
                     because the native sim doesn't broadcast over
                     batch dimension for AmplitudeEmbedding.

Both new framework names are added to FRAMEWORK_CHOICES and dispatched
in main(). Speedup reporting picks whichever Mahout backend ran.

Loader caveat: lightning.amdgpu's bundled liblightning_kokkos_catalyst.so
NEEDs libhsa-runtime64.so.1 / libamdhip64.so.7. Ubuntu 24.04's
/lib/x86_64-linux-gnu/libhsa-runtime64.so.1 is from ROCm 5.7 and shadows
newer symbols (e.g. hsa_amd_memory_get_preferred_copy_engine), making
the plugin fail at device init. Resolved by RTLD_GLOBAL-pre-loading the
matching ROCm 7.x libs at module top, gated on ROCM_LIB_DIR env var or
MAHOUT_PRELOAD_ROCM=1. The preload MUST happen before torch / pennylane
import — doing it after deadlocks because torch's HIP runtime has
already mapped the older libhsa.

Verified on AMD Instinct MI300X (ROCm 7.2 / torch 2.9.0+rocm6.4 /
pennylane 0.44.1, pennylane-lightning-amdgpu 0.44.0). End-to-end at
q=12, 3200 samples:

  Mahout-AMD          3,965 vec/s    (winner)
  PennyLane-AMDGPU      705 vec/s    (5.6x slower)
  PennyLane             627 vec/s    (6.3x slower)

CUDA hosts and CPU-only hosts unaffected; new frameworks gracefully
skip when their dependencies are missing.
@ryankert01 (Member Author) commented:

lg, cc @400Ping

* Remove unused `# noqa: E402` directives (RUF100). The ruff config in
  this repo doesn't enable E402, so the markers were warnings rather
  than suppressions.
* Apply `ruff format` to satisfy the pre-commit format hook.

No behavior change.
@ryankert01 ryankert01 requested a review from rich7420 April 25, 2026 18:55
Three pre-existing methodology issues were inflating Mahout's reported
speedup vs PennyLane in the throughput numbers. Fixed all three so the
honest result stands on its own:

* run_pennylane was casting complex64 -> float32 on the GPU transfer,
  silently dropping the imaginary part of every encoded state (and
  triggering pytorch's "Casting complex values to real discards the
  imaginary part" warning on every batch). The broken pennylane path
  was ~4x slower than a correct one — the cast is removed and the
  state stays complex64 on the target device.

* No warmup anywhere meant first-batch costs (Triton AMD JIT autotune,
  PennyLane QNode tracing/cache, Kokkos device init, Aer transpile
  cache) were inside every timer. Added WARMUP_BATCHES = 3 used by all
  runners.

* run_mahout_amd ran float32 while both PennyLane runners ran float64.
  All runners now use float32 input for the same dtype across the
  comparison.

Also tightened the ROCm preload gate: previously '_preload_rocm_libs_at_import'
fired when EITHER MAHOUT_PRELOAD_ROCM=1 OR ROCM_LIB_DIR was set. A stale
exported ROCM_LIB_DIR could trigger global symbol injection in unrelated
processes that import this module. Now it requires the explicit
MAHOUT_PRELOAD_ROCM=1 opt-in.

Also added run_qiskit warmup and made torch.cuda.synchronize() calls
in run_pennylane and run_qiskit conditional on torch.cuda.is_available()
so the script no longer raises on AMD-only or CPU-only hosts when those
runners try to bookend their timers.

Honest results on AMD Instinct MI300X (batch=64, post-fix):
  q=8, 6400  Mahout-AMD 111,703 / PennyLane 44,085  =>  2.53x
  q=12, 3200 Mahout-AMD   4,108 / PennyLane  2,545  =>  1.61x
  q=16, 1920 Mahout-AMD     715 / PennyLane    771  =>  0.93x

Mahout-AMD wins at small states (lower per-call dispatch); default.qubit
GPU's batched torch broadcast catches up at larger states. Worth a
follow-up to trim per-call overhead in triton_amd.py.

PennyLane-AMDGPU stays 2-76x behind both because it doesn't broadcast
over the batch dimension for AmplitudeEmbedding — that's the public API
reality, not a kernel-level comparison.

Adds a 'pytorch-ref' framework that runs the same amplitude-encoding
workload as 'mahout-amd' but goes through the project's own PyTorch
reference (qumat_qdp.torch_ref.amplitude_encode) — a pure PyTorch op
chain (L2 normalize, zero-pad to 2**num_qubits, complex view) with no
engine wrapper.

Useful as a "what naive PyTorch on the same hardware can do" ceiling
for the AMD comparison: gaps between this and 'mahout-amd' quantify
the per-call overhead in the AMD engine adapter rather than any
fundamental kernel-level difference.

Honest result on AMD Instinct MI300X (batch=64, 12800 samples at
q=8/q=12, 6400 at q=16):

  q=8   PyTorch-ref 119,815 / Mahout-AMD 112,650  =>  M-AMD 0.94x
  q=12  PyTorch-ref   4,163 / Mahout-AMD   4,918  =>  M-AMD 1.18x
  q=16  PyTorch-ref     926 / Mahout-AMD     890  =>  M-AMD 0.96x

Mahout-AMD essentially ties PyTorch-ref within run-to-run noise. The
AMD path's TritonAmdEngine is not currently delivering a kernel-level
speedup over the same operation in plain PyTorch ops — closing or
beating that gap would require real @triton.jit kernels (follow-up).

The Mahout-AMD value-add on AMD is therefore the unified API surface
and DLPack-zero-copy integration, not raw kernel throughput. The
honest win remains 1.3-2.8x over PennyLane default.qubit GPU (skips
MottonenStatePreparation) and 2.6-75x over PennyLane lightning.amdgpu
(skips per-sample QNode dispatch).
@400Ping (Member) left a comment:

Overall LGTM

@400Ping (Member) commented Apr 26, 2026

I only did a quick scan, but since it is kind of urgent, I am merging this.

@400Ping 400Ping merged commit 696f902 into apache:main Apr 26, 2026
8 checks passed
@ryankert01 ryankert01 deleted the benchmark-amd-pennylane-comparison branch April 26, 2026 04:46
@ryankert01 ryankert01 added this to the Qumat 0.6.0 milestone May 13, 2026