Port distributed comm abilities to PR444#522

Open
PKUZHOU wants to merge 2 commits into hw-native-sys:main from PKUZHOU:pr_444_async
Conversation

Contributor

@PKUZHOU commented Apr 12, 2026

No description provided.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a distributed communication framework for the PTO runtime, enabling multi-rank execution on A2/A3 hardware via HCCL and on simulation via POSIX shared memory. Key features include RDMA window management, asynchronous task completion using completion queues, and per-chip bootstrap channels, along with new distributed examples and updated Python tooling. Feedback highlights a race condition in the simulation barrier implementation, missing error handling for shared memory allocation timeouts, and an issue where profiling is hardcoded to false for deferred tasks. Additionally, the reviewer noted that querying dlerror when loading optional symbols is unnecessary.

Comment on lines +174 to +183
int arrived = __atomic_add_fetch(&hdr->barrier_count, 1, __ATOMIC_ACQ_REL);

if (arrived == h->nranks) {
    __atomic_store_n(&hdr->barrier_count, 0, __ATOMIC_RELEASE);
    __atomic_add_fetch(&hdr->barrier_phase, 1, __ATOMIC_ACQ_REL);
} else {
    while (__atomic_load_n(&hdr->barrier_phase, __ATOMIC_ACQUIRE) == phase) {
        usleep(50);
    }
}
Severity: high

The barrier implementation in comm_barrier contains a race condition. The last rank to arrive resets barrier_count to 0 (line 177) before incrementing barrier_phase (line 178). If another rank (e.g., Rank B) exits the loop, returns from the function, and immediately enters the next barrier, it will increment barrier_count. If the last rank (Rank A) from the previous barrier then executes line 177, it will overwrite Rank B's increment, causing the next barrier to hang or behave incorrectly. A robust sense-reversing barrier should avoid resetting the shared counter in this manner or use multiple counters.

Comment on lines +40 to +46
T load_optional_symbol(void *handle, const char *name) {
    dlerror();
    void *sym = dlsym(handle, name);
    const char *err = dlerror();
    if (err) return nullptr;
    return reinterpret_cast<T>(sym);
}
Severity: medium

The implementation of load_optional_symbol unnecessarily queries dlerror(). According to the general rules for this repository, when looking up an optional symbol with dlsym, checking the return value for NULL is the correct approach. The absence of the symbol is intentional and should not be treated as an error by querying dlerror(), which can also be less efficient.

template <typename T>
T load_optional_symbol(void *handle, const char *name) {
    return reinterpret_cast<T>(dlsym(handle, name));
}
References
  1. When using dlsym to look up an optional symbol, checking the return value for NULL is the correct approach. Do not treat a NULL return as an error by checking dlerror(), as the absence of the symbol is intentional for reasons like backward compatibility.

Comment on lines +101 to +105
for (int i = 0; i < 5000; ++i) {
    struct stat st;
    if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) break;
    usleep(1000);
}
Severity: medium

In comm_alloc_windows, if the polling loop for the shared memory file size times out, the function continues to mmap (line 111) without verifying if the file was successfully initialized. This could lead to mapping an incomplete region or a subsequent crash (e.g., SIGBUS). Following repository guidelines, return values of resource allocation and setup functions like fstat should be checked, and failures handled gracefully. The function should return an error if the timeout is reached.

        bool ok = false;
        for (int i = 0; i < 5000; ++i) {
            struct stat st;
            if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) {
                ok = true;
                break;
            }
            usleep(1000);
        }
        if (!ok) {
            std::fprintf(stderr, "comm_sim: timeout waiting for shm initialization\n");
            close(fd);
            return -1;
        }
References
  1. Always check the return values of resource allocation functions (like pthread_key_create) and handle potential failures gracefully, for example, by rolling back partial allocations.

Comment on lines +1411 to +1417
PTO2AsyncPollResult poll_result = async_wait_list.poll_and_complete<false>(
    &rt->scheduler, local_bufs,
    deferred_release_slot_states, deferred_release_count, MAX_DEFERRED_RELEASES
#if PTO2_SCHED_PROFILING
    , thread_idx
#endif
);
Severity: medium

The call to poll_and_complete hardcodes the Profiling template parameter to false. This means that deferred-completion tasks will not be correctly profiled even when profiling_enabled is true. Additionally, the Profiling template parameter in poll_and_complete (and pto2_complete_task) appears to be unused in the current implementation, as the logic relies on the PTO2_SCHED_PROFILING macro instead. Consider removing the unused template parameter and ensuring profiling is correctly handled for async tasks.

hw-native-sys-bot pushed a commit to PKUZHOU/simpler that referenced this pull request Apr 22, 2026
Two-stage FFN tensor-parallel demo aligned with the L7'a allreduce
shape:

  Stage 1 (AIC matmul):  partial_local = x_shard @ w_shard
  Stage 2 (AIV reduce):  y             = sum_over_ranks(partial_local)

partial_local is a per-rank ``torch.share_memory_()`` tensor, OUTPUT of
stage 1 and INPUT of stage 2. TensorMap auto-discovers the
producer/consumer edge via ``buffer.addr`` equality, so the two
``submit_next_level`` calls need no explicit barrier.

scratch is a single HCCL-window buffer per chip laid out as
``[mailbox: nranks*M*N floats | signal tail: nranks int32 slots]`` and
drives the cross-rank publish/notify/wait/accumulate inside the AIV
kernel.

Kernel + orchestration C++ sources ported from PR hw-native-sys#522.

Co-authored-by: PKUZHOU <zhouzhe@stu.pku.edu.cn>
ChaoWao added a commit that referenced this pull request Apr 22, 2026
Two-stage FFN tensor-parallel demo aligned with the L7'a allreduce
shape:

  Stage 1 (AIC matmul):  partial_local = x_shard @ w_shard
  Stage 2 (AIV reduce):  y             = sum_over_ranks(partial_local)

partial_local is a per-rank ``torch.share_memory_()`` tensor, OUTPUT of
stage 1 and INPUT of stage 2. TensorMap auto-discovers the
producer/consumer edge via ``buffer.addr`` equality, so the two
``submit_next_level`` calls need no explicit barrier.

scratch is a single HCCL-window buffer per chip laid out as
``[mailbox: nranks*M*N floats | signal tail: nranks int32 slots]`` and
drives the cross-rank publish/notify/wait/accumulate inside the AIV
kernel.

Kernel + orchestration C++ sources ported from PR #522.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
Co-authored-by: PKUZHOU <zhouzhe@stu.pku.edu.cn>