
Feature: [WIP] Enable NCCL library for GPU-Direct collective communications in plane wave (BPCG, stochastic KG, etc.) #7301

Open

Flying-dragon-boxing wants to merge 9 commits into deepmodeling:develop from Flying-dragon-boxing:fix-gpu-mpi-staging-comm

Conversation

Flying-dragon-boxing (Collaborator) commented Apr 30, 2026

CUDA-aware MPI appears to be unstable on several platforms (BW DCUs on Sugon SCNet, A800s on the 北极星 HPC platform at PKU CLS, and others). With NCCL enabled on these platforms, all GPU CI tests now pass with GPU-Direct collective communications enabled.
Still experimental.
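For context, here is a minimal sketch of the standard MPI-bootstrapped NCCL pattern that a GPU-Direct collective backend of this kind builds on (a hypothetical example, not the code in this PR; error handling omitted):

```cpp
// Minimal sketch (not the PR's code): bootstrap an NCCL communicator from MPI
// and run an in-place sum-allreduce directly on a device buffer, so the data
// never passes through host memory and CUDA-aware MPI is not required.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    cudaSetDevice(rank); // assumes one GPU per MPI rank

    // Rank 0 creates the NCCL unique id; everyone receives it over MPI.
    ncclUniqueId id;
    if (rank == 0) { ncclGetUniqueId(&id); }
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    // Device buffer that each rank fills with its own value.
    const size_t n = 1024;
    double* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(double));
    std::vector<double> h_buf(n, static_cast<double>(rank + 1));
    cudaMemcpy(d_buf, h_buf.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // In-place allreduce on the default stream; the buffer stays on the GPU.
    ncclAllReduce(d_buf, d_buf, n, ncclDouble, ncclSum, comm, 0);
    cudaStreamSynchronize(0);

    cudaFree(d_buf);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```

Because the collective operates directly on the device pointer, no host staging buffers or CUDA-aware MPI support are needed.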

Results on two A800s:
11_PW_GPU: [screenshot]
16_SDFT_GPU: [screenshot]

Flying-dragon-boxing commented Apr 30, 2026

To-do:

  • EXX PW broadcasts
  • CMake scripts
  • SCNet Sugon DCU BPCG

Flying-dragon-boxing commented:
Should fix #7076, but issue #7077 remains.

Flying-dragon-boxing marked this pull request as ready for review on April 30, 2026 at 11:46.
A Copilot AI review was requested due to automatic review settings on April 30, 2026 at 11:46.

Copilot AI left a comment

Pull request overview

Adds an optional NCCL-backed implementation for GPU collective communications in Parallel_Common to improve stability vs CUDA-aware MPI on some platforms, and wires the build system to enable/link NCCL when requested.

Changes:

  • Introduces NCCL implementations of bcast, allreduce(sum), and an allgatherv-style operation for GPU buffers, with runtime dispatch in parallel_device (a hedged sketch of such dispatch follows this list).
  • Updates PGemmCN to route reduction/gather steps through *_dev collectives (and adjusts temporary-buffer logic).
  • Adds CMake option + discovery/linking module to enable NCCL (ENABLE_NCCL_PARALLEL_DEVICE).
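As a rough illustration of the dispatch idea, here is a hedged sketch of a device broadcast that uses NCCL when the build enables it and otherwise stages through host memory with plain MPI. The names bcast_dev_sketch and get_nccl_comm are hypothetical, not the PR's actual interface; a sketch of the assumed get_nccl_comm registry helper follows the per-file summary below.

```cpp
// Hypothetical dispatching device broadcast (illustrative only, not the PR's code).
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>
#ifdef __NCCL_PARALLEL_DEVICE
#include <nccl.h>
// Assumed registry helper; a sketch of it appears after the per-file summary.
ncclComm_t get_nccl_comm(MPI_Comm comm);
#endif

void bcast_dev_sketch(double* d_buf, size_t count, int root, MPI_Comm comm)
{
#ifdef __NCCL_PARALLEL_DEVICE
    // GPU-Direct path: the buffer never leaves the device.
    ncclBcast(d_buf, count, ncclDouble, root, get_nccl_comm(comm), 0);
    cudaStreamSynchronize(0);
#else
    // Fallback path: stage through host memory and use plain MPI.
    std::vector<double> h_buf(count);
    cudaMemcpy(h_buf.data(), d_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Bcast(h_buf.data(), static_cast<int>(count), MPI_DOUBLE, root, comm);
    cudaMemcpy(d_buf, h_buf.data(), count * sizeof(double), cudaMemcpyHostToDevice);
#endif
}
```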

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

  • source/source_base/parallel_device.h: Adds NCCL collective declarations and dispatch in *_dev collectives; introduces get_buffer to avoid unnecessary D2H copies on non-root receive paths.
  • source/source_base/parallel_device.cpp: Implements the NCCL communicator registry and NCCL-backed bcast/reduce/gatherv-style collectives (a hedged sketch of such a registry follows this list).
  • source/source_base/para_gemm.cpp: Switches gather/reduce paths to reduce_dev/gatherv_dev and adjusts the conditions under which temporary buffers are allocated.
  • cmake/SetupNccl.cmake: Adds the NCCL find/link helper used by the top-level build.
  • CMakeLists.txt: Adds the ENABLE_NCCL_PARALLEL_DEVICE option and integrates NCCL setup into CUDA builds.
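For orientation, here is a hedged sketch of what an NCCL communicator registry of this kind might look like, caching one ncclComm_t per MPI communicator. This is hypothetical code using the get_nccl_comm name assumed in the dispatch sketch above, not the PR's actual implementation in parallel_device.cpp.

```cpp
// Hypothetical NCCL communicator registry: cache one ncclComm_t per MPI
// communicator so that device collectives can reuse it. Illustrative only.
#include <map>
#include <mpi.h>
#include <nccl.h>

ncclComm_t get_nccl_comm(MPI_Comm comm)
{
    static std::map<MPI_Comm, ncclComm_t> registry;
    auto it = registry.find(comm);
    if (it != registry.end()) { return it->second; }

    int rank = 0, nranks = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    // Rank 0 of this communicator generates the unique id and shares it over MPI.
    ncclUniqueId id;
    if (rank == 0) { ncclGetUniqueId(&id); }
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, comm);

    ncclComm_t nccl_comm;
    ncclCommInitRank(&nccl_comm, nranks, id, rank);
    registry[comm] = nccl_comm;
    return nccl_comm;
}
```

Real code would also destroy the cached communicators with ncclCommDestroy before MPI_Finalize; the sketch omits cleanup.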


Comment threads (collapsed):

  • source/source_base/parallel_device.cpp
  • source/source_base/parallel_device.h (two threads)
  • source/source_base/parallel_device.cpp

The attached diff context shows the preprocessor guards around the temporary-buffer allocations resmem_dev_op()(A_tmp_device_, max_colA * LDA) and resmem_dev_op()(C_local_tmp_, size_C_local) relaxed from #ifndef __CUDA_MPI to #if !defined(__CUDA_MPI) || defined(__NCCL_PARALLEL_DEVICE), so these buffers are also allocated when the NCCL path is enabled.
mohanchen added the "GPU & DCU & HPC" and "Compile & CICD & Docs & Dependencies" labels on May 2, 2026.