
Add: child_memory + TensorKey + scheduler affinity for device-resident tensors #579

Merged
ChaoWao merged 1 commit into hw-native-sys:main from
ChaoWao:feat/device-resident-tensor
Apr 18, 2026

Conversation

ChaoWao (Collaborator) commented Apr 16, 2026

Summary

Enable device-resident tensor buffers to flow through the L3 task pipeline without redundant H2D copies. Adds memory allocation on next-level workers via orch.malloc, scheduler worker affinity, and TensorMap compound keys for cross-NPU address disambiguation.

1. child_memory flag on ContinuousTensor

A 1-byte field in existing tail padding (sizeof stays 40B). When set, init_runtime_impl passes the tensor pointer through as-is instead of device_malloc + copy_to_device + record_tensor_pair.

2. device_malloc_ctx C API

Export device_malloc_ctx / device_free_ctx / copy_to_device_ctx / copy_from_device_ctx from pto_runtime_c_api (all 4 platforms). Wired through ChipWorker.malloc/free/copy_to/copy_from.

3. orch.malloc(worker_id, size) — memory allocation on next-level workers

Full path: Orchestrator → WorkerManager → WorkerThread → ChipWorker.

  • THREAD mode: direct call on ChipWorker (MemoryAllocator is mutex-protected for thread safety)
  • PROCESS mode: control command via mailbox IPC (CONTROL_REQUEST/CONTROL_DONE states)

4. TensorKey compound key for TensorMap

Replace the plain uint64_t key with a {uint64_t ptr, int8_t worker} struct, so identical device addresses on different NPUs no longer collide in the map.

5. Scheduler worker affinity

  • submit_next_level(worker=0) / submit_next_level_group(workers=[0,1]) store per-args affinities
  • Scheduler::dispatch_ready two-pass: satisfy affinity constraints first, then fill unconstrained slots from idle pool

6. Thread-safe MemoryAllocator

Add std::mutex to MemoryAllocator (all 4 platform implementations) so orch.malloc from the orch thread can run concurrently with init_runtime_impl on the worker thread.

Testing

  • C++ unit tests pass (8/8) — child_memory roundtrip + TensorKey compound key + orchestrator
  • Python unit tests pass (7/7 new + existing)
  • L3 scene test passes on a2a3sim — orch.malloc → orch.copy_to → child_memory=True → submit_next_level(worker=0) × 2
  • Hardware tests (requires device)

Related: #571


gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a child_memory flag to the ContinuousTensor structure, allowing the runtime to skip host-to-device copies for tensors already managed by a child process. The changes span the core task interface, Python bindings, and runtime implementations for multiple platforms. Feedback highlights a critical regression where skipping these tensors in the initialization phase causes misalignment in the result validation logic, potentially leading to data corruption. Furthermore, the provided system test uses host-side shared memory to simulate device memory, which may not accurately represent behavior on physical hardware.

Comment thread src/a2a3/runtime/aicpu_build_graph/host/runtime_maker.cpp
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
Comment thread src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
Comment thread tests/st/a2a3/tensormap_and_ringbuffer/test_l3_child_memory.py
ChaoWao force-pushed the feat/device-resident-tensor branch 2 times, most recently from cdef4ce to bdb2863 on April 18, 2026 at 04:24
ChaoWao changed the title from "Add: child_memory flag on ContinuousTensor to skip H2D copy" to "Add: child_memory + TensorKey + scheduler affinity for device-resident tensors" on Apr 18, 2026
Add a 1-byte `child_memory` field in the existing padding of
ContinuousTensor (sizeof stays 40B). When set, init_runtime_impl
passes the tensor pointer through as-is instead of malloc + H2D
copy + record_tensor_pair. This enables child-process-allocated
device buffers (e.g. HCCL windows, pre-staged weights) to be
referenced in TaskArgs without being re-copied or freed per task.

- tensor_arg.h: add child_memory field + is_child_memory() helper
- runtime_maker.cpp (a2a3 aicpu, a2a3 tmar, a5 tmar): skip loop
- nanobind: expose child_memory on ContinuousTensor.make() + property
- C++ unit tests: sizeof, default, blob roundtrip, view_to_chip_storage
- Python unit tests: make, property, repr, ChipStorageTaskArgs
- L3 scene test: same child_memory weight across two kernel invocations
ChaoWao force-pushed the feat/device-resident-tensor branch from bdb2863 to bdf71d4 on April 18, 2026 at 06:48
ChaoWao merged commit 093c0b3 into hw-native-sys:main Apr 18, 2026
15 checks passed
ChaoWao deleted the feat/device-resident-tensor branch April 18, 2026 07:58