cuD-PDLP#1391
Conversation
…he cycle seems to be fixed, cuopt compiles
+ style too
compiles and runs
|
/ok to test 56d5580 |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp (1)
151-181:⚠️ Potential issue | 🟠 MajorAdd error checking to all NCCL calls.
The
ncclGroupStart,ncclGroupEnd,ncclSend,ncclRecv, andncclAllReducecalls do not check return codes. NCCL errors (e.g., network failures, rank mismatches) will silently corrupt distributed execution or cause hangs. Wrap these calls with error checking similar to howRAFT_CUDA_TRYis used for CUDA errors throughout the codebase. This applies to all occurrences inmulti_gpu_engine.hpp(lines 151–181, 218–248, 260–267) anddistributed_algorithms.cu.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp` around lines 151 - 181, The NCCL calls ncclGroupStart, ncclSend, ncclRecv, and ncclGroupEnd in the distributed variable synchronization loops do not check return codes, which can cause silent errors or hangs. Wrap each of these NCCL function calls with error checking (similar to the RAFT_CUDA_TRY pattern used elsewhere in the codebase) to capture and handle potential NCCL errors such as network failures or rank mismatches. Apply this error checking consistently to all NCCL calls in the affected regions and ensure that errors are properly reported before continuing with execution.
🧹 Nitpick comments (2)
cpp/src/pdlp/distributed_pdlp/partitioner.hpp (1)
9-9: 💤 Low valueUnused
<string>include.This header uses
char const*for the context parameter invalidate_partition, notstd::string. The<string>include can be removed per IWYU.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/distributed_pdlp/partitioner.hpp` at line 9, The `#include <string>` header on line 9 is unused because the `validate_partition` function uses `char const*` for its context parameter instead of `std::string`. Remove this include statement to follow IWYU (Include What You Use) principles and reduce unnecessary dependencies.cpp/src/pdlp/pdlp.cu (1)
2996-2999: ⚡ Quick winConsider moving distributed average-restart validation to construction time.
This runtime check correctly guards against using average restart in distributed mode (since
unscaled_primal_avg_solution_andunscaled_dual_avg_solution_remain zero-sized from the shape-0 placeholder delegation). However, failing deep in the solve loop provides a poor user experience.Add a precondition check in the distributed constructor (around line 400) to reject the incompatible configuration early:
cuopt_expects(settings.hyper_params.never_restart_to_average, error_type_t::ValidationError, "Distributed PDLP requires never_restart_to_average = true");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/pdlp.cu` around lines 2996 - 2999, The validation check for distributed PDLP's average restart constraint is currently performed during the solve loop (around the cuopt_expects call with never_restart_to_average message), which is too late for good user experience. Move this validation to the distributed constructor (around line 400) to catch the incompatible configuration early at construction time. Add a cuopt_expects precondition that checks settings.hyper_params.never_restart_to_average is true and uses error_type_t::ValidationError instead of RuntimeError. This allows users to discover the configuration requirement immediately rather than failing deep in the solve process.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp`:
- Around line 151-181: The NCCL calls ncclGroupStart, ncclSend, ncclRecv, and
ncclGroupEnd in the distributed variable synchronization loops do not check
return codes, which can cause silent errors or hangs. Wrap each of these NCCL
function calls with error checking (similar to the RAFT_CUDA_TRY pattern used
elsewhere in the codebase) to capture and handle potential NCCL errors such as
network failures or rank mismatches. Apply this error checking consistently to
all NCCL calls in the affected regions and ensure that errors are properly
reported before continuing with execution.
---
Nitpick comments:
In `@cpp/src/pdlp/distributed_pdlp/partitioner.hpp`:
- Line 9: The `#include <string>` header on line 9 is unused because the
`validate_partition` function uses `char const*` for its context parameter
instead of `std::string`. Remove this include statement to follow IWYU (Include
What You Use) principles and reduce unnecessary dependencies.
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 2996-2999: The validation check for distributed PDLP's average
restart constraint is currently performed during the solve loop (around the
cuopt_expects call with never_restart_to_average message), which is too late for
good user experience. Move this validation to the distributed constructor
(around line 400) to catch the incompatible configuration early at construction
time. Add a cuopt_expects precondition that checks
settings.hyper_params.never_restart_to_average is true and uses
error_type_t::ValidationError instead of RuntimeError. This allows users to
discover the configuration requirement immediately rather than failing deep in
the solve process.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 67a6c70f-e5d7-4fb3-bb09-3e515a5f65f2
📒 Files selected for processing (22)
cpp/CMakeLists.txtcpp/cuopt_cli.cppcpp/include/cuopt/linear_programming/pdlp/solver_settings.hppcpp/src/pdlp/CMakeLists.txtcpp/src/pdlp/distributed_pdlp/distributed_algorithms.cucpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cucpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hppcpp/src/pdlp/distributed_pdlp/partition_loader.cucpp/src/pdlp/distributed_pdlp/partitioner.cppcpp/src/pdlp/distributed_pdlp/partitioner.hppcpp/src/pdlp/distributed_pdlp/rank_data.hppcpp/src/pdlp/distributed_pdlp/shard.cucpp/src/pdlp/distributed_pdlp/shard.hppcpp/src/pdlp/initial_scaling_strategy/initial_scaling.cucpp/src/pdlp/pdhg.cucpp/src/pdlp/pdhg.hppcpp/src/pdlp/pdlp.cucpp/src/pdlp/pdlp.cuhcpp/src/pdlp/saddle_point.cucpp/src/pdlp/solve.cuhcpp/src/pdlp/termination_strategy/convergence_information.cucpp/src/pdlp/termination_strategy/convergence_information.hpp
💤 Files with no reviewable changes (1)
- cpp/src/pdlp/saddle_point.cu
🚧 Files skipped from review as they are similar to previous changes (15)
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/distributed_pdlp/rank_data.hpp
- cpp/CMakeLists.txt
- cpp/src/pdlp/pdlp.cuh
- cpp/src/pdlp/termination_strategy/convergence_information.hpp
- cpp/src/pdlp/solve.cuh
- cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
- cpp/src/pdlp/distributed_pdlp/shard.hpp
- cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu
- cpp/cuopt_cli.cpp
- cpp/src/pdlp/distributed_pdlp/partition_loader.cu
- cpp/src/pdlp/distributed_pdlp/partitioner.cpp
- cpp/src/pdlp/distributed_pdlp/shard.cu
- cpp/src/pdlp/distributed_pdlp/distributed_algorithms.cu
- cpp/src/pdlp/termination_strategy/convergence_information.cu
|
/ok to test e2a36ab |
|
/ok to test 9c1e345 |
|
/ok to test 34196b1 |
Implemented metis-partitionned multi-GPU PDLP.
On 8 NVLINKed B200 : 2x to 6x speedup against CuOpt PDLP. 0.5x to 4x speedup against D-PDLP. My implementation scales better on bigger instances.
I am coming back soon with the memory footprint gains
closes #891