Add: child_memory + TensorKey + scheduler affinity for device-resident tensors #579
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 18, 2026
Conversation
Code Review
This pull request introduces a child_memory flag to the ContinuousTensor structure, allowing the runtime to skip host-to-device copies for tensors already managed by a child process. The changes span the core task interface, Python bindings, and runtime implementations for multiple platforms. Feedback highlights a critical regression where skipping these tensors in the initialization phase causes misalignment in the result validation logic, potentially leading to data corruption. Furthermore, the provided system test uses host-side shared memory to simulate device memory, which may not accurately represent behavior on physical hardware.
cdef4ce to bdb2863
Add a 1-byte `child_memory` field in the existing padding of ContinuousTensor (sizeof stays 40 B). When set, init_runtime_impl passes the tensor pointer through as-is instead of malloc + H2D copy + record_tensor_pair. This enables child-process-allocated device buffers (e.g. HCCL windows, pre-staged weights) to be referenced in TaskArgs without being re-copied or freed per task.

- tensor_arg.h: add `child_memory` field + `is_child_memory()` helper
- runtime_maker.cpp (a2a3 aicpu, a2a3 tmar, a5 tmar): skip loop
- nanobind: expose `child_memory` on `ContinuousTensor.make()` + property
- C++ unit tests: sizeof, default, blob roundtrip, view_to_chip_storage
- Python unit tests: make, property, repr, ChipStorageTaskArgs
- L3 scene test: same `child_memory` weight across two kernel invocations
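The pass-through decision described in the commit message can be sketched as follows. This is a minimal stand-in, not the project's code: `ContinuousTensor`'s real field set is not shown in the PR, and host `malloc`/`memcpy` simulate `device_malloc` and the H2D copy.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical, stripped-down stand-in for ContinuousTensor (real fields differ).
struct ContinuousTensor {
    uint64_t ptr;           // host pointer, or device pointer when child_memory is set
    uint64_t size;          // bytes
    uint8_t  child_memory;  // 1 = buffer already device-resident; skip the copy path
    bool is_child_memory() const { return child_memory != 0; }
};

// Sketch of the init decision: child_memory tensors pass through as-is;
// everything else gets an allocation + copy (host malloc/memcpy stand in for
// device_malloc + the H2D copy; the caller would also record_tensor_pair).
uint64_t init_tensor(const ContinuousTensor& t) {
    if (t.is_child_memory())
        return t.ptr;  // pass through untouched; nothing to free per task
    void* dev = std::malloc(t.size);
    std::memcpy(dev, reinterpret_cast<const void*>(t.ptr), t.size);
    return reinterpret_cast<uint64_t>(dev);
}
```

For a `child_memory` tensor the returned address is exactly the input address, which is why the runtime must also skip it when freeing per-task buffers; ordinary tensors still get a fresh copy.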
bdb2863 to bdf71d4
Summary
Enable device-resident tensor buffers to flow through the L3 task pipeline without redundant H2D copies. Adds memory allocation on next-level workers via `orch.malloc`, scheduler worker affinity, and TensorMap compound keys for cross-NPU address disambiguation.

**1. `child_memory` flag on `ContinuousTensor`**

A 1-byte field in existing tail padding (`sizeof` stays 40 B). When set, `init_runtime_impl` passes the tensor pointer through as-is instead of `device_malloc` + `copy_to_device` + `record_tensor_pair`.
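A minimal illustration of why `sizeof` can stay at 40 bytes: the real field layout of `ContinuousTensor` is not shown in the PR, so the fields below are hypothetical, chosen only so that the pre-change struct carries tail padding the new 1-byte flag can occupy.

```cpp
#include <cstdint>

// Illustrative layouts only. The point: the new 1-byte flag lands in bytes the
// compiler already inserted as tail padding, so the struct size (and any
// serialized blob of it) does not change.
struct TensorBefore {   // aligned to 8 bytes because of the uint64_t members
    uint64_t ptr;
    uint64_t size;
    uint64_t stride;    // hypothetical filler fields to reach 40 bytes
    uint64_t shape;
    int32_t  dtype;     // 4 bytes used, 4 bytes of tail padding
};                      // sizeof == 40

struct TensorAfter {
    uint64_t ptr;
    uint64_t size;
    uint64_t stride;
    uint64_t shape;
    int32_t  dtype;
    uint8_t  child_memory;  // occupies one of the former padding bytes
};                          // sizeof still == 40

static_assert(sizeof(TensorBefore) == 40, "baseline layout");
static_assert(sizeof(TensorAfter) == 40, "flag fits in padding; no growth");
```

Because the flag sits in previously uninitialized padding, old serialized blobs deserialize with `child_memory == 0` only if the writer zeroed the struct first; that is one reason the PR's blob-roundtrip unit test matters.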
**2. `device_malloc_ctx` C API**

Export `device_malloc_ctx` / `device_free_ctx` / `copy_to_device_ctx` / `copy_from_device_ctx` from `pto_runtime_c_api` (all 4 platforms). Wired through `ChipWorker.malloc/free/copy_to/copy_from`.

**3. `orch.malloc(worker_id, size)`: memory allocation on next-level workers**

Full path: Orchestrator → WorkerManager → WorkerThread → ChipWorker.
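The delegation chain could look roughly like this. The class names mirror the path above, but the method signatures are assumptions, and host `malloc` stands in for the exported `device_malloc_ctx`:

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of the Orchestrator → WorkerManager → WorkerThread
// → ChipWorker path; real signatures are not in the PR text.
struct ChipWorker {
    uint64_t malloc_on_device(size_t size) {
        // Stand-in for the device_malloc_ctx C API call.
        return reinterpret_cast<uint64_t>(std::malloc(size));
    }
};
struct WorkerThread {
    ChipWorker chip;
    uint64_t malloc_on_device(size_t size) { return chip.malloc_on_device(size); }
};
struct WorkerManager {
    std::vector<WorkerThread> workers;
    uint64_t malloc_on_worker(int worker_id, size_t size) {
        return workers.at(worker_id).malloc_on_device(size);  // route by worker id
    }
};
struct Orchestrator {
    WorkerManager mgr;
    // Python-side orch.malloc(worker_id, size) would bottom out here.
    uint64_t malloc_on_worker(int worker_id, size_t size) {
        return mgr.malloc_on_worker(worker_id, size);
    }
};
```

The returned device address is what a caller would then place into a `ContinuousTensor` with `child_memory` set, so the runtime treats it as already resident.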
**4. `TensorKey` compound key for TensorMap**

Replace the `uint64_t` key with a `{uint64_t ptr, int8_t worker}` struct. Disambiguates identical device addresses across different NPUs.

**5. Scheduler worker affinity**
`submit_next_level(worker=0)` / `submit_next_level_group(workers=[0,1])` store per-args affinities. `Scheduler::dispatch_ready` is two-pass: satisfy affinity constraints first, then fill unconstrained slots from the idle pool.

**6. Thread-safe MemoryAllocator**
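The two-pass policy can be sketched as follows; `ReadyArgs` and the return shape are invented for illustration, not the scheduler's real types:

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Invented shape: each ready TaskArgs may carry a worker affinity set by
// submit_next_level(worker=...). dispatch returns (args_id, worker_id) pairs
// and marks chosen workers busy in `idle`.
struct ReadyArgs {
    int id;
    std::optional<int> affinity;
    bool placed = false;
};

std::vector<std::pair<int, int>> dispatch_ready(std::vector<ReadyArgs> ready,
                                                std::vector<bool>& idle) {
    std::vector<std::pair<int, int>> out;
    // Pass 1: satisfy affinity constraints first.
    for (auto& a : ready) {
        if (a.affinity && idle.at(*a.affinity)) {
            idle[*a.affinity] = false;
            out.push_back({a.id, *a.affinity});
            a.placed = true;
        }
    }
    // Pass 2: fill unconstrained args from whatever idle workers remain.
    // Constrained-but-unplaced args simply wait for their worker to free up.
    for (auto& a : ready) {
        if (a.placed || a.affinity) continue;
        for (std::size_t w = 0; w < idle.size(); ++w) {
            if (idle[w]) {
                idle[w] = false;
                out.push_back({a.id, static_cast<int>(w)});
                break;
            }
        }
    }
    return out;
}
```

Running the affinity pass first guarantees an unconstrained task can never steal the one worker a constrained task is pinned to.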
Add a `std::mutex` to MemoryAllocator (all 4 platform implementations) so `orch.malloc` from the orch thread can run concurrently with `init_runtime_impl` on the worker thread.

**Testing**
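A minimal picture of the locking, with a fake bump allocator standing in for the platform allocators (the real bookkeeping is not shown in the PR):

```cpp
#include <cstdint>
#include <mutex>

// Sketch: one lock around the allocator's bookkeeping so orch-thread mallocs
// and worker-thread init can interleave safely. The bump allocator over a
// fake address range is illustrative only.
class MemoryAllocator {
    std::mutex mu_;
    uint64_t next_ = 0x100000;  // fake device address space
public:
    uint64_t alloc(uint64_t size) {
        std::lock_guard<std::mutex> lock(mu_);      // serialize both threads
        uint64_t p = next_;
        next_ += (size + 63) & ~uint64_t{63};        // 64-byte-aligned bump
        return p;
    }
};
```

Without the mutex, two threads bumping `next_` concurrently could hand out overlapping ranges; the lock makes each reserve-then-advance step atomic.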
`orch.malloc` → `orch.copy_to` → `child_memory=True` → `submit_next_level(worker=0)` × 2

Related: #571