Skip to content

Fix: store_to_host flush ordering and validation in Worker bootstrap loop#630

Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:fix-store-to-host-flush-ordering
Apr 22, 2026
Merged

Fix: store_to_host flush ordering and validation in Worker bootstrap loop#630
ChaoWao merged 1 commit intohw-native-sys:mainfrom
ChaoWao:fix-store-to-host-flush-ordering

Conversation

@ChaoWao
Copy link
Copy Markdown
Collaborator

@ChaoWao ChaoWao commented Apr 22, 2026

Summary

Follow-up fix for three defects introduced by PR #307 in _chip_process_loop_with_bootstrap's store_to_host DMA wiring. All three stay harmless while buffers are tiny (the allreduce_distributed example currently moves 1 KB per rank), but surface on any production-sized output.

  • Race: TASK_DONE was published before the D2H flush ran, so a parent returning from worker.run() could observe the state transition while the chip child still had the flush in flight — zeroed or stale output SharedMemory on real workloads. Flush now runs above _write_error + _mailbox_store_i32(_TASK_DONE).
  • Error path: the flush ran unconditionally, including after a kernel failure, stamping undefined device memory into the parent's SharedMemory and masking the real error. Flush is now gated by if code == 0:, and flush-path exceptions convert to code=1 with a store_to_host=<name> prefix instead of propagating and hanging the mailbox.
  • Validation: a ChipBufferSpec(store_to_host=True) without a matching HostBufferStaging in host_outputs used to KeyError-crash the chip child after bootstrap reported SUCCESS, leaving the parent polling an uninhabited mailbox forever. ChipWorker.bootstrap_context() now does a symmetric up-front check for both load_from_host ↔ host_inputs and store_to_host ↔ host_outputs so a misconfiguration surfaces as a ValueError on the channel before any device or communicator state is touched.

Test plan

  • Added TestBootstrapContextStoreToHost.test_store_to_host_round_trip — 2-rank sim fork that writes a known payload into the window via copy_to, runs the same D2H flush the worker loop does, and asserts the parent's SharedMemory round-tripped the bytes.
  • Added TestBootstrapContextMissingOutputStaging.test_store_to_host_without_host_outputs_raises — trips the new symmetry check, asserts child-side ValueError and the parent-visible ERROR channel payload.
  • pytest tests/ut/py/test_worker -v (sim)
  • python examples/workers/l3/allreduce_distributed/main.py -d 0-1 on hardware (the current — and, until this PR, only — store_to_host consumer)

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces up-front validation for host-staging symmetry in the bootstrap context and refactors the worker process loop to flush 'store_to_host' buffers before publishing the task completion status. This change ensures that the parent process does not read incomplete data from shared memory. The review feedback identifies a potential crash when handling zero-sized buffers in the flush loop and suggests maintaining consistency by using the internal implementation for DMA operations. Similar robustness improvements are recommended for the newly added unit tests.

Comment thread python/simpler/worker.py
Comment thread tests/ut/py/test_worker/test_bootstrap_context_sim.py
…loop

PR hw-native-sys#307 wired the L5 store_to_host DMA flush into
_chip_process_loop_with_bootstrap so the parent can read chip outputs
from shared memory instead of round-tripping through the cross-fork
control plane.  The wiring shipped with three defects that only stay
harmless while buffers are tiny:

- Race: TASK_DONE was published before the D2H flush ran, so a parent
  returning from worker.run() could observe the state transition while
  the chip child still had the flush in flight — visible as zeroed or
  stale output SharedMemory on any production-sized buffer.  Move the
  flush above _write_error + _mailbox_store_i32(_TASK_DONE) so the
  parent never sees the transition until the D2H copy has completed.

- Error path: the flush ran unconditionally, including after a kernel
  failure, stamping undefined device memory into the parent's
  SharedMemory and masking the real error in any post-mortem.  Guard
  the flush with `if code == 0:` and, on flush failure, convert the
  exception into code=1 with a store_to_host=<name> prefix so the
  parent sees a clear error rather than a silent mailbox hang.

- Validation: a ChipBufferSpec with store_to_host=True but no matching
  HostBufferStaging in host_outputs caused a KeyError deep inside the
  chip child after bootstrap returned SUCCESS, leaving the parent
  polling an uninhabited mailbox forever.  Add a symmetric up-front
  check in ChipWorker.bootstrap_context() so both load_from_host and
  store_to_host surface a ValueError through the channel before any
  device or communicator state is touched.

Add two sim UTs in test_bootstrap_context_sim.py: a store_to_host
round-trip that writes a known payload through the same flush path
worker.py drives, and a single-process error-path case that trips the
new symmetry check and asserts the ERROR payload on the channel.
@ChaoWao ChaoWao force-pushed the fix-store-to-host-flush-ordering branch from 053f6b7 to 86a96be Compare April 22, 2026 02:40
@ChaoWao ChaoWao merged commit dc79739 into hw-native-sys:main Apr 22, 2026
14 checks passed
@ChaoWao ChaoWao deleted the fix-store-to-host-flush-ordering branch April 22, 2026 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant