Feature: [WIP] Enable NCCL library for GPU-Direct collective communications in plane wave (BPCG, stochastic KG, etc.) #7301
Open
Flying-dragon-boxing wants to merge 9 commits into deepmodeling:develop from
Conversation
Collaborator
Author
To-do:
Pull request overview
Adds an optional NCCL-backed implementation for GPU collective communications in Parallel_Common to improve stability vs CUDA-aware MPI on some platforms, and wires the build system to enable/link NCCL when requested.
Changes:
- Introduces NCCL implementations of `bcast`, `allreduce` (sum), and an `allgatherv`-style operation for GPU buffers, with runtime dispatch in `parallel_device` (see the dispatch sketch after this list).
- Updates `PGemmCN` to route reduction/gather steps through `*_dev` collectives (and adjusts temporary-buffer logic).
- Adds a CMake option plus a discovery/linking module to enable NCCL (`ENABLE_NCCL_PARALLEL_DEVICE`).
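As a rough illustration of the dispatch idea referenced above: a sum-allreduce on a device buffer can choose between NCCL, CUDA-aware MPI, and a host-staged fallback at compile time. This is a minimal sketch, not the PR's actual code; `reduce_dev_double` and `get_nccl_comm` are hypothetical names, while `__CUDA_MPI` and `__NCCL_PARALLEL_DEVICE` are the macros this PR uses.

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <mpi.h>
#ifdef __NCCL_PARALLEL_DEVICE
#include <nccl.h>
ncclComm_t get_nccl_comm(MPI_Comm mpi_comm); // hypothetical registry lookup (see sketch below)
#endif

// Hypothetical sketch: in-place sum-allreduce of a double buffer that lives on the GPU.
void reduce_dev_double(double* data_device, std::size_t count, MPI_Comm mpi_comm)
{
#if defined(__NCCL_PARALLEL_DEVICE)
    // GPU-direct path via NCCL: operate directly on the device buffer.
    ncclAllReduce(data_device, data_device, count, ncclDouble, ncclSum,
                  get_nccl_comm(mpi_comm), /*stream=*/0);
    cudaStreamSynchronize(0);
#elif defined(__CUDA_MPI)
    // GPU-direct path via CUDA-aware MPI: pass the device pointer straight to MPI.
    MPI_Allreduce(MPI_IN_PLACE, data_device, static_cast<int>(count),
                  MPI_DOUBLE, MPI_SUM, mpi_comm);
#else
    // Fallback: stage through a host buffer (D2H copy, reduce on host, H2D copy).
    std::vector<double> host(count);
    cudaMemcpy(host.data(), data_device, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, host.data(), static_cast<int>(count),
                  MPI_DOUBLE, MPI_SUM, mpi_comm);
    cudaMemcpy(data_device, host.data(), count * sizeof(double), cudaMemcpyHostToDevice);
#endif
}
```

The NCCL path keeps the data on the GPU without relying on the MPI library's CUDA support, which is the stability problem this PR works around.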
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `source/source_base/parallel_device.h` | Adds NCCL collective declarations and dispatch in `*_dev` collectives; introduces `get_buffer` to avoid unnecessary D2H copies on non-root receive paths. |
| `source/source_base/parallel_device.cpp` | Implements an NCCL communicator registry and NCCL-backed bcast/reduce/gatherv-style collectives (see the registry sketch below the table). |
| `source/source_base/para_gemm.cpp` | Switches gather/reduce paths to `reduce_dev`/`gatherv_dev` and adjusts temporary-buffer allocation conditions. |
| `cmake/SetupNccl.cmake` | Adds the NCCL find/link helper used by the top-level build. |
| `CMakeLists.txt` | Adds the `ENABLE_NCCL_PARALLEL_DEVICE` option and integrates NCCL setup into CUDA builds. |
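For the `parallel_device.cpp` row above, the usual way to build an NCCL communicator registry is to cache one `ncclComm_t` per `MPI_Comm`, bootstrapping each with an `ncclUniqueId` broadcast over MPI. The following is a minimal sketch of that standard pattern, not necessarily the PR's exact layout; `get_nccl_comm` is an assumed helper name.

```cpp
#include <map>
#include <mpi.h>
#include <nccl.h>

// Hypothetical registry: lazily create and cache one NCCL communicator per MPI communicator.
static std::map<MPI_Comm, ncclComm_t> nccl_comms;

ncclComm_t get_nccl_comm(MPI_Comm mpi_comm)
{
    auto it = nccl_comms.find(mpi_comm);
    if (it != nccl_comms.end()) { return it->second; }

    int rank = 0, nranks = 0;
    MPI_Comm_rank(mpi_comm, &rank);
    MPI_Comm_size(mpi_comm, &nranks);

    // Rank 0 generates the unique id; every rank receives it over plain MPI.
    ncclUniqueId id;
    if (rank == 0) { ncclGetUniqueId(&id); }
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, mpi_comm);

    // Assumes each rank has already selected its GPU with cudaSetDevice.
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);
    nccl_comms[mpi_comm] = comm;
    return comm;
}
```

The sketch omits error checking and the matching `ncclCommDestroy` calls that would normally run at finalization.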
Two hunks in `source/source_base/para_gemm.cpp` change the temporary-buffer guard from `#ifndef __CUDA_MPI` to a condition that also holds when NCCL is enabled:

    {
        resmem_dev_op()(A_tmp_device_, max_colA * LDA);
    -#ifndef __CUDA_MPI
    +#if !defined(__CUDA_MPI) || defined(__NCCL_PARALLEL_DEVICE)

    {
        resmem_dev_op()(C_local_tmp_, size_C_local);
    -#ifndef __CUDA_MPI
    +#if !defined(__CUDA_MPI) || defined(__NCCL_PARALLEL_DEVICE)
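For context on the `gatherv_dev`/`allgatherv`-style collective that such device temporaries feed: NCCL has no built-in allgatherv, so a common pattern is a group of per-root `ncclBroadcast` calls writing into the right offsets of the receive buffer. The sketch below is an assumption about how such an operation can be written, not this PR's code; the function name and parameters are illustrative.

```cpp
#include <cstddef>
#include <nccl.h>

// Hypothetical variable-count allgather of double buffers on the GPU.
// recvcounts[r] / displs[r] give the element count and offset contributed by rank r.
void allgatherv_dev_double(const double* send_device, double* recv_device,
                           const int* recvcounts, const int* displs,
                           int nranks, ncclComm_t comm, cudaStream_t stream)
{
    ncclGroupStart();
    for (int r = 0; r < nranks; ++r)
    {
        // Rank r broadcasts its block; sendbuff is only read on the root rank r,
        // and every rank receives the block at offset displs[r].
        ncclBroadcast(send_device, recv_device + displs[r],
                      static_cast<std::size_t>(recvcounts[r]), ncclDouble,
                      r, comm, stream);
    }
    ncclGroupEnd();
}
```

Wrapping the loop in `ncclGroupStart`/`ncclGroupEnd` lets NCCL aggregate the per-rank broadcasts into one launch instead of issuing them one by one.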
CUDA-Aware MPI seems to be unstable on several platforms (BW DCUs on Sugon SCNet, A800s on the 北极星 (Polaris) HPC platform at PKU CLS, and more). By enabling NCCL on these platforms, we can finally pass all CI tests on GPU with GPU-Direct collective communications enabled.
Still experimental.
Results on two A800s.


11_PW_GPU:
16_SDFT_GPU: