Add manual-scope v0 to tensormap runtime #568
uv-xiao wants to merge 8 commits into hw-native-sys:main
Code Review
This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.
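The guard problem flagged above is the usual C++ temporary-lifetime pitfall. The sketch below illustrates it with a hypothetical `ScopeGuard` class; it is not the real `PTO2_SCOPE_GUARD` macro, whose expansion is not shown in this PR. An unnamed guard is destroyed at the end of its own statement, while a named guard lives until the end of the enclosing block.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical RAII guard (an assumption for illustration only).
// It records when the scope is entered and when it is left.
class ScopeGuard {
public:
    explicit ScopeGuard(std::vector<std::string>& log) : log_(log) { log_.push_back("enter"); }
    ~ScopeGuard() { log_.push_back("exit"); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::vector<std::string>& log_;
};

// Bug pattern: the guard is an unnamed temporary, so its destructor runs
// at the end of the statement and the work that follows is unprotected.
std::vector<std::string> run_with_temporary_guard() {
    std::vector<std::string> log;
    ScopeGuard{log};
    log.push_back("work");
    return log;
}

// Fix: give the guard a name so it lives until the end of the block.
std::vector<std::string> run_with_named_guard() {
    std::vector<std::string> log;
    {
        ScopeGuard guard{log};
        log.push_back("work");
    }
    return log;
}
```

With the temporary, the log reads enter, exit, work; with the named guard it reads enter, work, exit.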
Alignment update against
I also benchmarked the current branch and poursoul’s branch directly in an isolated worktree on the same device (
So the current state is:
5e4de6a to 84763d8
Update after rerunning the full batch on real device.
Cleaned state
This PR is still the same narrow v0 scope only:
The branch is kept as 3 logical commits:
Fresh real-device benchmark
Rerun on
Golden status:
Reading of the current batch:
TensorMap lookup / insert comparison
Profiling comparison from the current manual-scope implementation on non-unroll
What this still shows:
84763d8 to 0e0d3ee
Small follow-up after aligning the two manual-scope paged-attention examples.
What changed
The non-unroll manual example now uses the same update-chain dependency shape as the unroll manual example:
This removes the older conservative pattern where non-unroll attached
Targeted real-device rerun
Rerun only the affected non-unroll manual example on
Golden:
100-round trimmed benchmark for the affected rows:
So this is mostly a consistency cleanup, with a small measured improvement on the non-unroll manual path.
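For readers unfamiliar with trimmed benchmarks, here is a minimal sketch of a trimmed mean over per-round timings. The PR does not state its exact trimming rule, so symmetric trimming by count and the `trimmed_mean` helper are both assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean of the samples after dropping the `trim` smallest and the `trim`
// largest values. Assumption: symmetric trimming by count; the benchmark
// script used in this PR may trim differently.
double trimmed_mean(std::vector<double> samples, std::size_t trim) {
    assert(samples.size() > 2 * trim && "at least one sample must remain");
    std::sort(samples.begin(), samples.end());
    auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim);
    auto last = samples.end() - static_cast<std::ptrdiff_t>(trim);
    const double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

Trimming keeps a single slow or fast outlier round from dominating the reported figure, which matters when the measured difference between AUTO and manual rows is small.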
efab669 to 0a8680c
Latest pushed update is commit
What changed since the last PR update
Fresh matched 100-round benchmark
Golden status for the TMR AUTO/manual rows above: all PASS.
Readout:
0a8680c to b0d7a2d
de48b2b to c0d01e1
- Add manual scope mode and explicit Arg dependency plumbing
- Attach submit-result task ids independently from output tensors
- Bypass TensorMap lookup and insert while manual scope is active
- Keep the runtime/examples/docs scope without adding test changes
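The bypass described in this commit can be pictured with a toy model. Everything below (`ToyRuntime`, `SubmitArgs`, the last-writer map) is a hypothetical stand-in, not the actual tensormap runtime: in AUTO mode dependencies are discovered by looking up each input's last writer, while in MANUAL mode only the explicitly added task ids are used and the map is neither read nor updated.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

using TaskId = int;
using TensorId = int;

enum class ScopeMode { AUTO, MANUAL };

struct SubmitArgs {
    std::vector<TensorId> inputs;
    std::vector<TensorId> outputs;
    std::vector<TaskId> explicit_deps;  // what add_dep would have appended
};

struct Submitted {
    TaskId id;
    std::vector<TaskId> deps;
};

// Toy runtime (assumption): tracks only the last writer of each tensor.
class ToyRuntime {
public:
    ScopeMode mode = ScopeMode::AUTO;

    Submitted submit(const SubmitArgs& args) {
        // Explicit deps are taken first, before any map handling.
        Submitted task{next_id_++, args.explicit_deps};
        if (mode == ScopeMode::AUTO) {
            for (TensorId t : args.inputs) {  // lookup: find producers
                auto it = last_writer_.find(t);
                if (it != last_writer_.end()) task.deps.push_back(it->second);
            }
            for (TensorId t : args.outputs) last_writer_[t] = task.id;  // insert
        }
        return task;  // in MANUAL mode the map was neither read nor written
    }

private:
    TaskId next_id_ = 0;
    std::unordered_map<TensorId, TaskId> last_writer_;
};

// AUTO: the consumer of tensor 7 picks up its producer (task 0) automatically.
std::vector<TaskId> auto_consumer_deps() {
    ToyRuntime rt;
    rt.submit(SubmitArgs{{}, {7}, {}});
    return rt.submit(SubmitArgs{{7}, {}, {}}).deps;
}

// MANUAL: the consumer depends on the producer only if the caller says so.
std::vector<TaskId> manual_consumer_deps(bool annotate) {
    ToyRuntime rt;
    rt.mode = ScopeMode::MANUAL;
    const TaskId producer = rt.submit(SubmitArgs{{}, {7}, {}}).id;
    SubmitArgs consumer{{7}, {}, {}};
    if (annotate) consumer.explicit_deps.push_back(producer);
    return rt.submit(consumer).deps;
}
```

The trade-off shown here is that manual mode skips the per-tensor lookup and insert work, but a forgotten annotation silently drops an ordering edge.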
- Add non-unroll and unroll manual-scope examples for a2a3 TMR
- Wire task ids through Arg.add_dep at submit time
- Keep AUTO paged-attention available as the comparison path

- Document v0 API constraints and submit-time dependency model
- Record TensorMap bypass behavior and boundary-edge rules
- Include the current device benchmark and validation notes

- Make non-unroll manual paged-attention use the same update-chain dependency shape as the unroll manual path
- Gate alloc-task retention with is_first/is_last instead of attaching it on every update
- Verified with fresh hardware golden and 100-round reruns on device 9

- shrink the runtime diff back toward upstream in the validated manual-scope hot path
- keep TensorMap bypass only for current-manual-local tensors while preserving boundary TensorMap behavior
- refresh the v0 design doc with the rebased branch state plus fresh 100-round hardware benchmarks and golden-check results

- wrap Arg explicit dependency storage behind a private helper while keeping the add_dep API unchanged
- reject AUTO scopes nested under an active manual scope and keep current-scope dep validation explicit
- refresh docs/manual-scope-v0-design.md with fresh matched 100-round hardware benchmark results from device 10
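The nesting rule in the last commit above can be sketched as a small scope stack. `ScopeStack` and its exception-based rejection are hypothetical; the PR does not show how the runtime reports the error, and allowing MANUAL inside MANUAL is an assumption of this sketch.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

enum class ScopeMode { AUTO, MANUAL };

// Hypothetical scope tracker (not the runtime's real data structure).
class ScopeStack {
public:
    // Opens a scope; refuses AUTO while any enclosing scope is MANUAL.
    void begin(ScopeMode mode) {
        if (mode == ScopeMode::AUTO && manual_depth_ > 0) {
            throw std::logic_error("AUTO scope nested under an active manual scope");
        }
        stack_.push_back(mode);
        if (mode == ScopeMode::MANUAL) ++manual_depth_;
    }

    // Closes the innermost scope.
    void end() {
        assert(!stack_.empty());
        if (stack_.back() == ScopeMode::MANUAL) --manual_depth_;
        stack_.pop_back();
    }

private:
    std::vector<ScopeMode> stack_;
    int manual_depth_ = 0;
};

// Returns true if `inner` may be opened while `outer` is active.
bool nesting_allowed(ScopeMode outer, ScopeMode inner) {
    ScopeStack scopes;
    scopes.begin(outer);
    try {
        scopes.begin(inner);
    } catch (const std::logic_error&) {
        return false;
    }
    scopes.end();
    scopes.end();
    return true;
}
```

Rejecting the combination up front avoids having to define how automatic TensorMap tracking would interact with tensors that the enclosing manual scope deliberately left untracked.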
c0d01e1 to 6205db0
6205db0 to 54ddc4f
- validate explicit deps by task liveness and skip retired producers
- reset manual-scope state in existing orchestrator init/done paths
- make Arg.add_dep variadic with atomic capacity checking
- refresh the design doc and golden file to satisfy current hooks
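One way to read "variadic with atomic capacity checking" is all-or-nothing insertion: the capacity is checked once for the whole batch before any id is stored. The sketch below assumes that reading. `ToyArg`, its capacity of 4, and the `bool` return value are hypothetical; only the names `add_dep`, `explicit_dep_count`, and `explicit_dep` come from the PR, and their real signatures may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using TaskId = int;

// Hypothetical stand-in for Arg; the capacity is an assumption.
class ToyArg {
public:
    static constexpr std::size_t kMaxDeps = 4;

    // Appends every id, or none of them if the batch would not fit.
    template <typename... Rest>
    bool add_dep(TaskId first, Rest... rest) {
        constexpr std::size_t n = 1 + sizeof...(Rest);
        if (count_ + n > kMaxDeps) return false;  // reject before writing anything
        const TaskId batch[] = {first, static_cast<TaskId>(rest)...};
        for (TaskId id : batch) deps_[count_++] = id;
        return true;
    }

    std::size_t explicit_dep_count() const { return count_; }

    TaskId explicit_dep(std::size_t i) const {
        assert(i < count_);
        return deps_[i];
    }

private:
    std::array<TaskId, kMaxDeps> deps_{};
    std::size_t count_ = 0;
};

// Usage sketch: the second call would need 5 slots, so it stores nothing.
std::size_t count_after_overflow() {
    ToyArg arg;
    arg.add_dep(1, 2, 3);
    arg.add_dep(4, 5);
    return arg.explicit_dep_count();
}
```

Checking the whole batch first means a failed call never leaves a partially extended dependency list behind.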
54ddc4f to e976f7f
626ce5e to 8309ddc
Summary
- manual_scope v0 mode for a2a3/tensormap_and_ringbuffer.
- Arg.add_dep(task_id).
- alloc_tensors(...) producers can be depended on explicitly.
- docs/manual-scope-v0-design.md.
Scope
This PR is intentionally smaller than the older manual-dep branch.
In scope:
- PTO2_SCOPE() remains AUTO by default.
- PTO2_SCOPE(PTO2ScopeMode::MANUAL) enters manual mode.
- pto2_rt_submit_aic_task(...)
- pto2_rt_submit_aiv_task(...)
- pto2_rt_submit_task(...)
- Arg.add_dep(task_id) appends explicit producer task ids to the normal Arg object.
- alloc_tensors(...) remains output-only and exposes its producer task id.
Out of scope for v0:
- *_manual(...) submit APIs
- scope_end()
Runtime Model
Manual scope uses the normal submit path with two extra rules:
- Arg.add_dep(...) are consumed before TensorMap handling
Dependency behavior:
- Arg.add_dep(...); no TensorMap lookup/insert
- Arg.add_dep(...) outside manual scope when ordering is required
Review-alignment cleanup in the latest push:
- Arg keeps explicit dependency storage private while preserving add_dep(), explicit_dep_count(), and explicit_dep().
- AUTO scopes nested under an active manual scope are rejected explicitly.
Minimal Example
For repeated zero-output updater chains, thread the returned task id explicitly:
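The PR's own snippet is not preserved here, so the following is only a stand-in sketch of the pattern using a toy runtime. `ToyRuntime`, `ToyArg`, `kNoTask`, and `submit_update_chain` are hypothetical; the real calls would be the `pto2_rt_submit_*` functions inside a `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` block. Each updater produces no output tensor, so the only ordering edge is the task id returned by the previous submit.

```cpp
#include <cassert>
#include <vector>

using TaskId = int;
constexpr TaskId kNoTask = -1;

// Hypothetical argument object holding only explicit dependencies.
struct ToyArg {
    std::vector<TaskId> deps;
    void add_dep(TaskId id) { deps.push_back(id); }
};

// Toy runtime (assumption): records each task's deps and returns its id.
class ToyRuntime {
public:
    TaskId submit(const ToyArg& arg) {
        recorded_.push_back(arg.deps);
        return static_cast<TaskId>(recorded_.size()) - 1;
    }
    const std::vector<std::vector<TaskId>>& recorded() const { return recorded_; }

private:
    std::vector<std::vector<TaskId>> recorded_;
};

// Submits `steps` in-place updaters; returns the deps recorded per task.
std::vector<std::vector<TaskId>> submit_update_chain(int steps) {
    ToyRuntime rt;
    TaskId prev = kNoTask;
    for (int i = 0; i < steps; ++i) {
        ToyArg arg;
        if (prev != kNoTask) arg.add_dep(prev);  // wait for the previous update
        prev = rt.submit(arg);                   // thread the returned id forward
    }
    return rt.recorded();
}
```

For three steps the recorded dependencies are empty for task 0, task 0 for task 1, and task 1 for task 2, which is the serial chain the updaters need.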
Validation And Benchmarks
Primary design note:
- docs/manual-scope-v0-design.md
Latest real-device validation was rerun on a2a3, device 10, PTO-ISA 478daadb, on commit 0a8680c.
Golden status:
- paged_attention: Case1 PASS, Case2 PASS
- paged_attention_manual_scope: Case1 PASS, Case2 PASS
- paged_attention_unroll: Case1 PASS, Case2 PASS
- paged_attention_unroll_manual_scope: Case1 PASS, Case2 PASS
- paged_attention_unroll: Case1 PASS, Case2 FAIL
Fresh matched 100-round trimmed benchmark:
- paged_attention Case1
- paged_attention Case2
- paged_attention_unroll Case1
- paged_attention_unroll Case2
Current readout:
- Case2 is not correctness-clean in this rerun and should not be used as a correctness-clean target
Testing
Latest matched hardware run:
task-submit --timeout 3600 --max-time 3600 --device 10 --run \
  "bash /tmp/manual_scope_v0_hw_eval_matched.sh 10"
This run performs golden validation without --skip-golden, then runs the matched 100-round benchmark for the TMR AUTO/manual pairs above.