Port distributed comm abilities to PR444#522

Open
PKUZHOU wants to merge 2 commits into hw-native-sys:main from PKUZHOU:pr_444_async
Conversation

Contributor

@PKUZHOU commented Apr 12, 2026

No description provided.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a distributed communication framework for the PTO runtime, enabling multi-rank execution on A2/A3 hardware via HCCL and on simulation via POSIX shared memory. Key features include RDMA window management, asynchronous task completion using completion queues, and per-chip bootstrap channels, along with new distributed examples and updated Python tooling. Feedback highlights a race condition in the simulation barrier implementation, missing error handling for shared memory allocation timeouts, and an issue where profiling is hardcoded to false for deferred tasks. Additionally, the reviewer noted that querying dlerror when loading optional symbols is unnecessary.

Comment on lines +174 to +183
int arrived = __atomic_add_fetch(&hdr->barrier_count, 1, __ATOMIC_ACQ_REL);

if (arrived == h->nranks) {
    __atomic_store_n(&hdr->barrier_count, 0, __ATOMIC_RELEASE);
    __atomic_add_fetch(&hdr->barrier_phase, 1, __ATOMIC_ACQ_REL);
} else {
    while (__atomic_load_n(&hdr->barrier_phase, __ATOMIC_ACQUIRE) == phase) {
        usleep(50);
    }
}
Severity: high

The barrier implementation in comm_barrier contains a race condition. The last rank to arrive resets barrier_count to 0 (line 177) before incrementing barrier_phase (line 178). If another rank (e.g., Rank B) exits the loop, returns from the function, and immediately enters the next barrier, it will increment barrier_count. If the last rank (Rank A) from the previous barrier then executes line 177, it will overwrite Rank B's increment, causing the next barrier to hang or behave incorrectly. A robust sense-reversing barrier should avoid resetting the shared counter in this manner or use multiple counters.

Comment on lines +40 to +46
T load_optional_symbol(void *handle, const char *name) {
    dlerror();
    void *sym = dlsym(handle, name);
    const char *err = dlerror();
    if (err) return nullptr;
    return reinterpret_cast<T>(sym);
}
Severity: medium

The implementation of load_optional_symbol unnecessarily queries dlerror(). According to the general rules for this repository, when looking up an optional symbol with dlsym, checking the return value for NULL is the correct approach. The absence of the symbol is intentional and should not be treated as an error by querying dlerror(), which can also be less efficient.

template <typename T>
T load_optional_symbol(void *handle, const char *name) {
    return reinterpret_cast<T>(dlsym(handle, name));
}
References
  1. When using dlsym to look up an optional symbol, checking the return value for NULL is the correct approach. Do not treat a NULL return as an error by checking dlerror(), as the absence of the symbol is intentional for reasons like backward compatibility.

Comment on lines +101 to +105
for (int i = 0; i < 5000; ++i) {
    struct stat st;
    if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) break;
    usleep(1000);
}
Severity: medium

In comm_alloc_windows, if the polling loop for the shared memory file size times out, the function continues to mmap (line 111) without verifying if the file was successfully initialized. This could lead to mapping an incomplete region or a subsequent crash (e.g., SIGBUS). Following repository guidelines, return values of resource allocation and setup functions like fstat should be checked, and failures handled gracefully. The function should return an error if the timeout is reached.

        bool ok = false;
        for (int i = 0; i < 5000; ++i) {
            struct stat st;
            if (fstat(fd, &st) == 0 && static_cast<size_t>(st.st_size) >= total) {
                ok = true;
                break;
            }
            usleep(1000);
        }
        if (!ok) {
            std::fprintf(stderr, "comm_sim: timeout waiting for shm initialization\n");
            close(fd);
            return -1;
        }
References
  1. Always check the return values of resource allocation functions (like pthread_key_create) and handle potential failures gracefully, for example, by rolling back partial allocations.

Comment on lines +1411 to +1417
PTO2AsyncPollResult poll_result = async_wait_list.poll_and_complete<false>(
    &rt->scheduler, local_bufs,
    deferred_release_slot_states, deferred_release_count, MAX_DEFERRED_RELEASES
#if PTO2_SCHED_PROFILING
    , thread_idx
#endif
);
Severity: medium

The call to poll_and_complete hardcodes the Profiling template parameter to false. This means that deferred-completion tasks will not be correctly profiled even when profiling_enabled is true. Additionally, the Profiling template parameter in poll_and_complete (and pto2_complete_task) appears to be unused in the current implementation, as the logic relies on the PTO2_SCHED_PROFILING macro instead. Consider removing the unused template parameter and ensuring profiling is correctly handled for async tasks.

hw-native-sys-bot pushed a commit to PKUZHOU/simpler that referenced this pull request Apr 22, 2026
Two-stage FFN tensor-parallel demo aligned with the L7'a allreduce
shape:

  Stage 1 (AIC matmul):  partial_local = x_shard @ w_shard
  Stage 2 (AIV reduce):  y             = sum_over_ranks(partial_local)

partial_local is a per-rank ``torch.share_memory_()`` tensor, OUTPUT of
stage 1 and INPUT of stage 2. TensorMap auto-discovers the
producer/consumer edge via ``buffer.addr`` equality, so the two
``submit_next_level`` calls need no explicit barrier.

scratch is a single HCCL-window buffer per chip laid out as
``[mailbox: nranks*M*N floats | signal tail: nranks int32 slots]`` and
drives the cross-rank publish/notify/wait/accumulate inside the AIV
kernel.

Kernel + orchestration C++ sources ported from PR hw-native-sys#522.

Co-authored-by: PKUZHOU <zhouzhe@stu.pku.edu.cn>
ChaoWao added a commit that referenced this pull request Apr 22, 2026
Two-stage FFN tensor-parallel demo aligned with the L7'a allreduce
shape:

  Stage 1 (AIC matmul):  partial_local = x_shard @ w_shard
  Stage 2 (AIV reduce):  y             = sum_over_ranks(partial_local)

partial_local is a per-rank ``torch.share_memory_()`` tensor, OUTPUT of
stage 1 and INPUT of stage 2. TensorMap auto-discovers the
producer/consumer edge via ``buffer.addr`` equality, so the two
``submit_next_level`` calls need no explicit barrier.

scratch is a single HCCL-window buffer per chip laid out as
``[mailbox: nranks*M*N floats | signal tail: nranks int32 slots]`` and
drives the cross-rank publish/notify/wait/accumulate inside the AIV
kernel.

Kernel + orchestration C++ sources ported from PR #522.

Co-authored-by: wcwxy <26245345+ChaoWao@users.noreply.github.com>
Co-authored-by: PKUZHOU <zhouzhe@stu.pku.edu.cn>