Skip to content

cuD-PDLP#1391

Open
Bubullzz wants to merge 130 commits into
NVIDIA:mainfrom
Bubullzz:cuD-PDLP
Open

cuD-PDLP#1391
Bubullzz wants to merge 130 commits into
NVIDIA:mainfrom
Bubullzz:cuD-PDLP

Conversation

@Bubullzz

@Bubullzz Bubullzz commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Implemented metis-partitionned multi-GPU PDLP.

On 8 NVLINKed B200 : 2x to 6x speedup against CuOpt PDLP. 0.5x to 4x speedup against D-PDLP. My implementation scales better on bigger instances.
I am coming back soon with the memory footprint gains

closes #891

Bubullzz added 30 commits May 7, 2026 15:07
@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test 56d5580

@Bubullzz

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp (1)

151-181: ⚠️ Potential issue | 🟠 Major

Add error checking to all NCCL calls.

The ncclGroupStart, ncclGroupEnd, ncclSend, ncclRecv, and ncclAllReduce calls do not check return codes. NCCL errors (e.g., network failures, rank mismatches) will silently corrupt distributed execution or cause hangs. Wrap these calls with error checking similar to how RAFT_CUDA_TRY is used for CUDA errors throughout the codebase. This applies to all occurrences in multi_gpu_engine.hpp (lines 151–181, 218–248, 260–267) and distributed_algorithms.cu.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp` around lines 151 - 181,
The NCCL calls ncclGroupStart, ncclSend, ncclRecv, and ncclGroupEnd in the
distributed variable synchronization loops do not check return codes, which can
cause silent errors or hangs. Wrap each of these NCCL function calls with error
checking (similar to the RAFT_CUDA_TRY pattern used elsewhere in the codebase)
to capture and handle potential NCCL errors such as network failures or rank
mismatches. Apply this error checking consistently to all NCCL calls in the
affected regions and ensure that errors are properly reported before continuing
with execution.
🧹 Nitpick comments (2)
cpp/src/pdlp/distributed_pdlp/partitioner.hpp (1)

9-9: 💤 Low value

Unused <string> include.

This header uses char const* for the context parameter in validate_partition, not std::string. The <string> include can be removed per IWYU.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/distributed_pdlp/partitioner.hpp` at line 9, The `#include
<string>` header on line 9 is unused because the `validate_partition` function
uses `char const*` for its context parameter instead of `std::string`. Remove
this include statement to follow IWYU (Include What You Use) principles and
reduce unnecessary dependencies.
cpp/src/pdlp/pdlp.cu (1)

2996-2999: ⚡ Quick win

Consider moving distributed average-restart validation to construction time.

This runtime check correctly guards against using average restart in distributed mode (since unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ remain zero-sized from the shape-0 placeholder delegation). However, failing deep in the solve loop provides a poor user experience.

Add a precondition check in the distributed constructor (around line 400) to reject the incompatible configuration early:

cuopt_expects(settings.hyper_params.never_restart_to_average,
              error_type_t::ValidationError,
              "Distributed PDLP requires never_restart_to_average = true");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/pdlp.cu` around lines 2996 - 2999, The validation check for
distributed PDLP's average restart constraint is currently performed during the
solve loop (around the cuopt_expects call with never_restart_to_average
message), which is too late for good user experience. Move this validation to
the distributed constructor (around line 400) to catch the incompatible
configuration early at construction time. Add a cuopt_expects precondition that
checks settings.hyper_params.never_restart_to_average is true and uses
error_type_t::ValidationError instead of RuntimeError. This allows users to
discover the configuration requirement immediately rather than failing deep in
the solve process.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp`:
- Around line 151-181: The NCCL calls ncclGroupStart, ncclSend, ncclRecv, and
ncclGroupEnd in the distributed variable synchronization loops do not check
return codes, which can cause silent errors or hangs. Wrap each of these NCCL
function calls with error checking (similar to the RAFT_CUDA_TRY pattern used
elsewhere in the codebase) to capture and handle potential NCCL errors such as
network failures or rank mismatches. Apply this error checking consistently to
all NCCL calls in the affected regions and ensure that errors are properly
reported before continuing with execution.

---

Nitpick comments:
In `@cpp/src/pdlp/distributed_pdlp/partitioner.hpp`:
- Line 9: The `#include <string>` header on line 9 is unused because the
`validate_partition` function uses `char const*` for its context parameter
instead of `std::string`. Remove this include statement to follow IWYU (Include
What You Use) principles and reduce unnecessary dependencies.

In `@cpp/src/pdlp/pdlp.cu`:
- Around line 2996-2999: The validation check for distributed PDLP's average
restart constraint is currently performed during the solve loop (around the
cuopt_expects call with never_restart_to_average message), which is too late for
good user experience. Move this validation to the distributed constructor
(around line 400) to catch the incompatible configuration early at construction
time. Add a cuopt_expects precondition that checks
settings.hyper_params.never_restart_to_average is true and uses
error_type_t::ValidationError instead of RuntimeError. This allows users to
discover the configuration requirement immediately rather than failing deep in
the solve process.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 67a6c70f-e5d7-4fb3-bb09-3e515a5f65f2

📥 Commits

Reviewing files that changed from the base of the PR and between c42f770 and 56d5580.

📒 Files selected for processing (22)
  • cpp/CMakeLists.txt
  • cpp/cuopt_cli.cpp
  • cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/distributed_pdlp/distributed_algorithms.cu
  • cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu
  • cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp
  • cpp/src/pdlp/distributed_pdlp/partition_loader.cu
  • cpp/src/pdlp/distributed_pdlp/partitioner.cpp
  • cpp/src/pdlp/distributed_pdlp/partitioner.hpp
  • cpp/src/pdlp/distributed_pdlp/rank_data.hpp
  • cpp/src/pdlp/distributed_pdlp/shard.cu
  • cpp/src/pdlp/distributed_pdlp/shard.hpp
  • cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu
  • cpp/src/pdlp/pdhg.cu
  • cpp/src/pdlp/pdhg.hpp
  • cpp/src/pdlp/pdlp.cu
  • cpp/src/pdlp/pdlp.cuh
  • cpp/src/pdlp/saddle_point.cu
  • cpp/src/pdlp/solve.cuh
  • cpp/src/pdlp/termination_strategy/convergence_information.cu
  • cpp/src/pdlp/termination_strategy/convergence_information.hpp
💤 Files with no reviewable changes (1)
  • cpp/src/pdlp/saddle_point.cu
🚧 Files skipped from review as they are similar to previous changes (15)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/distributed_pdlp/rank_data.hpp
  • cpp/CMakeLists.txt
  • cpp/src/pdlp/pdlp.cuh
  • cpp/src/pdlp/termination_strategy/convergence_information.hpp
  • cpp/src/pdlp/solve.cuh
  • cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
  • cpp/src/pdlp/distributed_pdlp/shard.hpp
  • cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu
  • cpp/cuopt_cli.cpp
  • cpp/src/pdlp/distributed_pdlp/partition_loader.cu
  • cpp/src/pdlp/distributed_pdlp/partitioner.cpp
  • cpp/src/pdlp/distributed_pdlp/shard.cu
  • cpp/src/pdlp/distributed_pdlp/distributed_algorithms.cu
  • cpp/src/pdlp/termination_strategy/convergence_information.cu

@Bubullzz Bubullzz marked this pull request as ready for review June 17, 2026 15:27
@Kh4ster Kh4ster added feature request New feature or request non-breaking Introduces a non-breaking change pdlp labels Jun 17, 2026
@Kh4ster Kh4ster changed the title CuD-PDLP cuD-PDLP Jun 17, 2026
@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test e2a36ab

@Bubullzz Bubullzz removed the do not merge Do not merge if this flag is set label Jun 18, 2026
@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test 9c1e345

@Bubullzz

Copy link
Copy Markdown
Contributor Author

/ok to test 34196b1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request non-breaking Introduces a non-breaking change pdlp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Multi GPU PDLP

2 participants