Fix: unify DAV subspace diagonalization to use lapack_hegvx#7302
Fix: unify DAV subspace diagonalization to use lapack_hegvx#7302Cstandardlib wants to merge 3 commits intodeepmodeling:developfrom
Conversation
Previously the GPU path used lapack_hegvd (all eigenvalues) while the CPU serial path used hegvx_op (subset eigenvalues), causing inconsistent behavior and wasted computation on GPU. This commit unifies both paths to use container::kernels::lapack_hegvx, which computes only the needed nband eigenpairs. The CPU serial path also drops ~30 lines of manual save/restore code since lapack_hegvx preserves input matrices internally. A GPU specialization of lapack_hegvx is added for ROCm using CPU fallback (D2H -> CPU solve -> H2D), following the existing pattern in lapack_heevd. Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
There was a problem hiding this comment.
Pull request overview
This PR unifies DAV subspace diagonalization across CPU and GPU to use container::kernels::lapack_hegvx, so only the required nband eigenpairs are computed (instead of all nbase), and adds an ROCm lapack_hegvx<T, DEVICE_GPU> specialization via a CPU fallback.
Changes:
- Switch DAV subspace diagonalization (CPU + GPU) to
lapack_hegvxwithnbandsubset selection. - Remove CPU-side manual save/restore boilerplate around
hegvx. - Add ROCm
lapack_hegvx<T, DEVICE_GPU>specialization that falls back to CPU LAPACK by copying data host<->device.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
source/source_hsolver/diago_dav_subspace.cpp |
Uses lapack_hegvx for both CPU and GPU paths; removes manual matrix save/restore in CPU serial path. |
source/source_base/module_container/ATen/kernels/rocm/lapack.hip.cu |
Adds ROCm lapack_hegvx GPU specialization implemented via CPU fallback and host/device copies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::vector<Real> H_eigen_val(n); | ||
| std::vector<T> H_eigen_vec(n * lda); | ||
|
|
||
| hipErrcheck(hipMemcpy(H_A.data(), A, sizeof(T) * n * lda, hipMemcpyDeviceToHost)); | ||
| hipErrcheck(hipMemcpy(H_B.data(), B, sizeof(T) * n * lda, hipMemcpyDeviceToHost)); | ||
|
|
||
| lapack_hegvx<T, DEVICE_CPU>()(n, lda, H_A.data(), H_B.data(), m, H_eigen_val.data(), H_eigen_vec.data()); | ||
|
|
||
| hipErrcheck(hipMemcpy(eigen_val, H_eigen_val.data(), sizeof(Real) * n, hipMemcpyHostToDevice)); | ||
| hipErrcheck(hipMemcpy(eigen_vec, H_eigen_vec.data(), sizeof(T) * n * lda, hipMemcpyHostToDevice)); |
There was a problem hiding this comment.
In this ROCm GPU specialization, the host buffers and memcpy sizes for outputs should match what lapack_hegvx actually produces: eigen_val is length m (see lapack_hegvx<T, DEVICE_CPU> usage in lapack_test.cpp), and eigen_vec is lda * m (Z is n-by-m). Copying n eigenvalues and n*lda eigenvector elements back can write past the caller’s allocated buffers when they size them to m (e.g., the hegvx unit test allocates E with neig entries). Please resize H_eigen_val to m and copy back only m values; likewise allocate/copy H_eigen_vec as lda * m (or at least only memcpy that many elements).
| std::vector<Real> H_eigen_val(n); | |
| std::vector<T> H_eigen_vec(n * lda); | |
| hipErrcheck(hipMemcpy(H_A.data(), A, sizeof(T) * n * lda, hipMemcpyDeviceToHost)); | |
| hipErrcheck(hipMemcpy(H_B.data(), B, sizeof(T) * n * lda, hipMemcpyDeviceToHost)); | |
| lapack_hegvx<T, DEVICE_CPU>()(n, lda, H_A.data(), H_B.data(), m, H_eigen_val.data(), H_eigen_vec.data()); | |
| hipErrcheck(hipMemcpy(eigen_val, H_eigen_val.data(), sizeof(Real) * n, hipMemcpyHostToDevice)); | |
| hipErrcheck(hipMemcpy(eigen_vec, H_eigen_vec.data(), sizeof(T) * n * lda, hipMemcpyHostToDevice)); | |
| std::vector<Real> H_eigen_val(m); | |
| std::vector<T> H_eigen_vec(lda * m); | |
| hipErrcheck(hipMemcpy(H_A.data(), A, sizeof(T) * n * lda, hipMemcpyDeviceToHost)); | |
| hipErrcheck(hipMemcpy(H_B.data(), B, sizeof(T) * n * lda, hipMemcpyDeviceToHost)); | |
| lapack_hegvx<T, DEVICE_CPU>()(n, lda, H_A.data(), H_B.data(), m, H_eigen_val.data(), H_eigen_vec.data()); | |
| hipErrcheck(hipMemcpy(eigen_val, H_eigen_val.data(), sizeof(Real) * m, hipMemcpyHostToDevice)); | |
| hipErrcheck(hipMemcpy(eigen_vec, H_eigen_vec.data(), sizeof(T) * lda * m, hipMemcpyHostToDevice)); |
The CPU specialization of lapack_hegvx used 'L' (lower) uplo while the existing hegvx_op uses 'U' (upper), causing eigenvalue divergence in the pyabacus dav_subspace tests. Additionally, passing nullptr for IWORK and IFAIL during the workspace query can cause crashes with some LAPACK implementations. Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
The GPU lapack_hegvx had a bug: it passed A/hcc directly to cuSOLVER hegvdx which overwrites the upper triangle, corrupting the subspace matrix for subsequent DAV iterations. Fixed by copying A to eigen_vec and backing up B before the call. The GPU path in diago_dav_subspace.cpp is kept on lapack_hegvd because cuSOLVER hegvdx correctly detects a pre-existing numerical issue (B not positive definite at higher orders) that hegvd silently ignores. Switching to hegvx on GPU is left for future work once the underlying subspace conditioning is addressed. Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Summary
container::kernels::lapack_hegvx, computing only the needednbandeigenpairs instead of allnbaseeigenvalueslapack_hegvx)lapack_hegvx<T, DEVICE_GPU>GPU specialization for ROCm using CPU fallback, following the existinglapack_heevdpatternMotivation
The DAV subspace diagonalization had inconsistent eigensolver calls:
lapack_hegvd(divide-and-conquer, ALL eigenvalues)hegvx_op(expert driver, subset eigenvalues)This caused wasted computation on GPU (computing all eigenvalues when only
nbandwere needed) and inconsistent semantics. Now both paths explicitly request only the firstnbandeigenpairs.Changes
rocm/lapack.hip.culapack_hegvxGPU specialization with CPU fallbackdiago_dav_subspace.cpplapack_hegvx, remove save/restore boilerplateTest Results
MODULE_HSOLVER_dav(double): 3/3 PASSEDMODULE_HSOLVER_dav_float(single): 3/3 PASSEDscf_dav: 7/7 PASSEDscf_dav_sub: 7/7 PASSED