Fix: store_to_host flush ordering and validation in Worker bootstrap loop#630
Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom Apr 22, 2026
Merged
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces up-front validation for host-staging symmetry in the bootstrap context and refactors the worker process loop to flush 'store_to_host' buffers before publishing the task completion status. This change ensures that the parent process does not read incomplete data from shared memory. The review feedback identifies a potential crash when handling zero-sized buffers in the flush loop and suggests maintaining consistency by using the internal implementation for DMA operations. Similar robustness improvements are recommended for the newly added unit tests.
…loop PR hw-native-sys#307 wired the L5 store_to_host DMA flush into _chip_process_loop_with_bootstrap so the parent can read chip outputs from shared memory instead of round-tripping through the cross-fork control plane. The wiring shipped with three defects that only stay harmless while buffers are tiny: - Race: TASK_DONE was published before the D2H flush ran, so a parent returning from worker.run() could observe the state transition while the chip child still had the flush in flight — visible as zeroed or stale output SharedMemory on any production-sized buffer. Move the flush above _write_error + _mailbox_store_i32(_TASK_DONE) so the parent never sees the transition until the D2H copy has completed. - Error path: the flush ran unconditionally, including after a kernel failure, stamping undefined device memory into the parent's SharedMemory and masking the real error in any post-mortem. Guard the flush with `if code == 0:` and, on flush failure, convert the exception into code=1 with a store_to_host=<name> prefix so the parent sees a clear error rather than a silent mailbox hang. - Validation: a ChipBufferSpec with store_to_host=True but no matching HostBufferStaging in host_outputs caused a KeyError deep inside the chip child after bootstrap returned SUCCESS, leaving the parent polling an uninhabited mailbox forever. Add a symmetric up-front check in ChipWorker.bootstrap_context() so both load_from_host and store_to_host surface a ValueError through the channel before any device or communicator state is touched. Add two sim UTs in test_bootstrap_context_sim.py: a store_to_host round-trip that writes a known payload through the same flush path worker.py drives, and a single-process error-path case that trips the new symmetry check and asserts the ERROR payload on the channel.
053f6b7 to
86a96be
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up fix for three defects introduced by PR #307 in
_chip_process_loop_with_bootstrap'sstore_to_hostDMA wiring. All three stay harmless while buffers are tiny (the allreduce_distributed example currently moves 1 KB per rank), but surface on any production-sized output.TASK_DONEwas published before the D2H flush ran, so a parent returning fromworker.run()could observe the state transition while the chip child still had the flush in flight — zeroed or stale output SharedMemory on real workloads. Flush now runs above_write_error+_mailbox_store_i32(_TASK_DONE).if code == 0:, and flush-path exceptions convert tocode=1with astore_to_host=<name>prefix instead of propagating and hanging the mailbox.ChipBufferSpec(store_to_host=True)without a matchingHostBufferStaginginhost_outputsused toKeyError-crash the chip child after bootstrap reported SUCCESS, leaving the parent polling an uninhabited mailbox forever.ChipWorker.bootstrap_context()now does a symmetric up-front check for bothload_from_host ↔ host_inputsandstore_to_host ↔ host_outputsso a misconfiguration surfaces as aValueErroron the channel before any device or communicator state is touched.Test plan
TestBootstrapContextStoreToHost.test_store_to_host_round_trip— 2-rank sim fork that writes a known payload into the window viacopy_to, runs the same D2H flush the worker loop does, and asserts the parent's SharedMemory round-tripped the bytes.TestBootstrapContextMissingOutputStaging.test_store_to_host_without_host_outputs_raises— trips the new symmetry check, asserts child-sideValueErrorand the parent-visibleERRORchannel payload.pytest tests/ut/py/test_worker -v(sim)python examples/workers/l3/allreduce_distributed/main.py -d 0-1on hardware (the current — and, until this PR, only —store_to_hostconsumer)