Add: child_memory + TensorKey + scheduler affinity for device-resident tensors #579
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 18, 2026
Conversation
Code Review
This pull request introduces a child_memory flag to the ContinuousTensor structure, allowing the runtime to skip host-to-device copies for tensors already managed by a child process. The changes span the core task interface, Python bindings, and runtime implementations for multiple platforms. Feedback highlights a critical regression where skipping these tensors in the initialization phase causes misalignment in the result validation logic, potentially leading to data corruption. Furthermore, the provided system test uses host-side shared memory to simulate device memory, which may not accurately represent behavior on physical hardware.
cdef4ce to bdb2863
Add a 1-byte `child_memory` field in the existing padding of ContinuousTensor (sizeof stays 40 B). When set, init_runtime_impl passes the tensor pointer through as-is instead of malloc + H2D copy + record_tensor_pair. This enables child-process-allocated device buffers (e.g. HCCL windows, pre-staged weights) to be referenced in TaskArgs without being re-copied or freed per task.

- tensor_arg.h: add `child_memory` field + `is_child_memory()` helper
- runtime_maker.cpp (a2a3 aicpu, a2a3 tmar, a5 tmar): skip loop
- nanobind: expose `child_memory` on `ContinuousTensor.make()` + property
- C++ unit tests: sizeof, default, blob roundtrip, view_to_chip_storage
- Python unit tests: make, property, repr, ChipStorageTaskArgs
- L3 scene test: same `child_memory` weight across two kernel invocations
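The pass-through decision described in the commit message can be sketched as follows. This is a minimal stand-in, not the project's code: `ContinuousTensor`'s real field set is not shown in the PR, and host `malloc`/`memcpy` simulate `device_malloc` and the H2D copy.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical, stripped-down stand-in for ContinuousTensor (real fields differ).
struct ContinuousTensor {
    uint64_t ptr;           // host pointer, or device pointer when child_memory is set
    uint64_t size;          // bytes
    uint8_t  child_memory;  // 1 = buffer already device-resident; skip the copy path
    bool is_child_memory() const { return child_memory != 0; }
};

// Sketch of the init decision: child_memory tensors pass through as-is;
// everything else gets an allocation + copy (host malloc/memcpy stand in for
// device_malloc + the H2D copy; the caller would also record_tensor_pair).
uint64_t init_tensor(const ContinuousTensor& t) {
    if (t.is_child_memory())
        return t.ptr;  // pass through untouched; nothing to free per task
    void* dev = std::malloc(t.size);
    std::memcpy(dev, reinterpret_cast<const void*>(t.ptr), t.size);
    return reinterpret_cast<uint64_t>(dev);
}
```

For a `child_memory` tensor the returned address is exactly the input address, which is why the runtime must also skip it when freeing per-task buffers; ordinary tensors still get a fresh copy.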
bdb2863 to bdf71d4
Summary
Enable device-resident tensor buffers to flow through the L3 task pipeline without redundant H2D copies. Adds memory allocation on next-level workers via `orch.malloc`, scheduler worker affinity, and TensorMap compound keys for cross-NPU address disambiguation.

**1. `child_memory` flag on `ContinuousTensor`**

A 1-byte field in existing tail padding (`sizeof` stays 40 B). When set, `init_runtime_impl` passes the tensor pointer through as-is instead of `device_malloc` + `copy_to_device` + `record_tensor_pair`.
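A minimal illustration of why `sizeof` can stay at 40 bytes: the real field layout of `ContinuousTensor` is not shown in the PR, so the fields below are hypothetical, chosen only so that the pre-change struct carries tail padding the new 1-byte flag can occupy.

```cpp
#include <cstdint>

// Illustrative layouts only. The point: the new 1-byte flag lands in bytes the
// compiler already inserted as tail padding, so the struct size (and any
// serialized blob of it) does not change.
struct TensorBefore {   // aligned to 8 bytes because of the uint64_t members
    uint64_t ptr;
    uint64_t size;
    uint64_t stride;    // hypothetical filler fields to reach 40 bytes
    uint64_t shape;
    int32_t  dtype;     // 4 bytes used, 4 bytes of tail padding
};                      // sizeof == 40

struct TensorAfter {
    uint64_t ptr;
    uint64_t size;
    uint64_t stride;
    uint64_t shape;
    int32_t  dtype;
    uint8_t  child_memory;  // occupies one of the former padding bytes
};                          // sizeof still == 40

static_assert(sizeof(TensorBefore) == 40, "baseline layout");
static_assert(sizeof(TensorAfter) == 40, "flag fits in padding; no growth");
```

Because the flag sits in previously uninitialized padding, old serialized blobs deserialize with `child_memory == 0` only if the writer zeroed the struct first; that is one reason the PR's blob-roundtrip unit test matters.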
**2. `device_malloc_ctx` C API**

Export `device_malloc_ctx` / `device_free_ctx` / `copy_to_device_ctx` / `copy_from_device_ctx` from `pto_runtime_c_api` (all 4 platforms). Wired through `ChipWorker.malloc/free/copy_to/copy_from`.

**3. `orch.malloc(worker_id, size)`: memory allocation on next-level workers**

Full path: Orchestrator → WorkerManager → WorkerThread → ChipWorker.
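The delegation chain could look roughly like this. The class names mirror the path above, but the method signatures are assumptions, and host `malloc` stands in for the exported `device_malloc_ctx`:

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of the Orchestrator → WorkerManager → WorkerThread
// → ChipWorker path; real signatures are not in the PR text.
struct ChipWorker {
    uint64_t malloc_on_device(size_t size) {
        // Stand-in for the device_malloc_ctx C API call.
        return reinterpret_cast<uint64_t>(std::malloc(size));
    }
};
struct WorkerThread {
    ChipWorker chip;
    uint64_t malloc_on_device(size_t size) { return chip.malloc_on_device(size); }
};
struct WorkerManager {
    std::vector<WorkerThread> workers;
    uint64_t malloc_on_worker(int worker_id, size_t size) {
        return workers.at(worker_id).malloc_on_device(size);  // route by worker id
    }
};
struct Orchestrator {
    WorkerManager mgr;
    // Python-side orch.malloc(worker_id, size) would bottom out here.
    uint64_t malloc_on_worker(int worker_id, size_t size) {
        return mgr.malloc_on_worker(worker_id, size);
    }
};
```

The returned device address is what a caller would then place into a `ContinuousTensor` with `child_memory` set, so the runtime treats it as already resident.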
**4. `TensorKey` compound key for TensorMap**

Replace the `uint64_t` key with a `{uint64_t ptr, int8_t worker}` struct. Disambiguates identical device addresses across different NPUs.

**5. Scheduler worker affinity**
`submit_next_level(worker=0)` / `submit_next_level_group(workers=[0,1])` store per-args affinities. `Scheduler::dispatch_ready` is two-pass: satisfy affinity constraints first, then fill unconstrained slots from the idle pool.

**6. Thread-safe MemoryAllocator**
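The two-pass policy can be sketched as follows; `ReadyArgs` and the return shape are invented for illustration, not the scheduler's real types:

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Invented shape: each ready TaskArgs may carry a worker affinity set by
// submit_next_level(worker=...). dispatch returns (args_id, worker_id) pairs
// and marks chosen workers busy in `idle`.
struct ReadyArgs {
    int id;
    std::optional<int> affinity;
    bool placed = false;
};

std::vector<std::pair<int, int>> dispatch_ready(std::vector<ReadyArgs> ready,
                                                std::vector<bool>& idle) {
    std::vector<std::pair<int, int>> out;
    // Pass 1: satisfy affinity constraints first.
    for (auto& a : ready) {
        if (a.affinity && idle.at(*a.affinity)) {
            idle[*a.affinity] = false;
            out.push_back({a.id, *a.affinity});
            a.placed = true;
        }
    }
    // Pass 2: fill unconstrained args from whatever idle workers remain.
    // Constrained-but-unplaced args simply wait for their worker to free up.
    for (auto& a : ready) {
        if (a.placed || a.affinity) continue;
        for (std::size_t w = 0; w < idle.size(); ++w) {
            if (idle[w]) {
                idle[w] = false;
                out.push_back({a.id, static_cast<int>(w)});
                break;
            }
        }
    }
    return out;
}
```

Running the affinity pass first guarantees an unconstrained task can never steal the one worker a constrained task is pinned to.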
Add a `std::mutex` to MemoryAllocator (all 4 platform implementations) so `orch.malloc` from the orch thread can run concurrently with `init_runtime_impl` on the worker thread.

**Testing**
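A minimal picture of the locking, with a fake bump allocator standing in for the platform allocators (the real bookkeeping is not shown in the PR):

```cpp
#include <cstdint>
#include <mutex>

// Sketch: one lock around the allocator's bookkeeping so orch-thread mallocs
// and worker-thread init can interleave safely. The bump allocator over a
// fake address range is illustrative only.
class MemoryAllocator {
    std::mutex mu_;
    uint64_t next_ = 0x100000;  // fake device address space
public:
    uint64_t alloc(uint64_t size) {
        std::lock_guard<std::mutex> lock(mu_);      // serialize both threads
        uint64_t p = next_;
        next_ += (size + 63) & ~uint64_t{63};        // 64-byte-aligned bump
        return p;
    }
};
```

Without the mutex, two threads bumping `next_` concurrently could hand out overlapping ranges; the lock makes each reserve-then-advance step atomic.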
`orch.malloc` → `orch.copy_to` → `child_memory=True` → `submit_next_level(worker=0)` × 2

Related: #571