Add manual-scope v0 to tensormap runtime by uv-xiao · Pull Request #568 · hw-native-sys/simpler

uv-xiao · 2026-04-15T08:01:32Z

Summary

Add the narrow manual_scope v0 mode for a2a3/tensormap_and_ringbuffer.
Keep the same submit API family as AUTO mode; manual ordering is expressed through Arg.add_dep(task_id).
Publish tasks at submit time; there is no delayed wiring, delayed linking, or scope-end replay barrier.
Carry standalone submit-result task ids so zero-output updater chains and alloc_tensors(...) producers can be depended on explicitly.
Skip TensorMap lookup/insert for tasks submitted inside manual scope; dependency correctness in manual scope comes from explicit deps.
Add manual-scope paged-attention examples and keep the detailed design/benchmark note in docs/manual-scope-v0-design.md.

Scope

This PR is intentionally smaller than the older manual-dep branch.

In scope:

PTO2_SCOPE() remains AUTO by default.
PTO2_SCOPE(PTO2ScopeMode::MANUAL) enters manual mode.
Submit APIs stay unchanged:
- pto2_rt_submit_aic_task(...)
- pto2_rt_submit_aiv_task(...)
- pto2_rt_submit_task(...)
Arg.add_dep(task_id) appends explicit producer task ids to the normal Arg object.
Explicit deps are validated and materialized as ordinary fanins before submit-time publish.
alloc_tensors(...) remains output-only and exposes its producer task id.

Out of scope for v0:

no separate *_manual(...) submit APIs
no post-submit dependency API
no delayed dependency wiring/linking
no batch publish barrier at scope_end()
no nested scopes under an active manual scope
no implicit TensorMap fallback for tasks submitted inside manual scope

Runtime Model

Manual scope uses the normal submit path with two extra rules:

explicit deps from Arg.add_dep(...) are consumed before TensorMap handling
if the consumer task is inside manual scope, TensorMap lookup and insert are skipped for that submit

Dependency behavior:

manual producer -> manual consumer: use Arg.add_dep(...); no TensorMap lookup/insert
external producer -> manual consumer: not supported directly in v0; put the producer in the same manual scope or keep the consumer outside manual scope
manual producer -> later external consumer: use Arg.add_dep(...) outside manual scope when ordering is required
AUTO / outside manual scope: keep normal TensorMap behavior
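
The two rules and the dependency cases above can be illustrated with a toy model. This is only a sketch under stated assumptions: `MiniArg`, `MiniRuntime`, and the tensor-id-to-producer map are hypothetical stand-ins, not the real runtime types. It mirrors the described ordering: explicit deps become fanins first, and TensorMap lookup/insert runs only for submits outside manual scope.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical toy model; none of these types are the real runtime's.
struct MiniArg {
    std::vector<int> inputs;     // tensor ids read by the task
    std::vector<int> outputs;    // tensor ids written by the task
    std::vector<int32_t> deps;   // explicit producer task ids
    void add_dep(int32_t task_id) { deps.push_back(task_id); }
};

struct MiniRuntime {
    bool manual_scope = false;                     // true while a manual scope is active
    std::unordered_map<int, int32_t> tensor_map;   // tensor id -> last producer task id
    std::vector<std::vector<int32_t>> fanins;      // fanins[task_id]
    int lookups = 0;
    int inserts = 0;

    int32_t submit(const MiniArg &arg) {
        const int32_t task_id = static_cast<int32_t>(fanins.size());
        // Rule 1: explicit deps are consumed first and become ordinary fanins.
        std::vector<int32_t> fanin = arg.deps;
        // Rule 2: TensorMap lookup/insert is skipped for submits inside manual scope.
        if (!manual_scope) {
            for (int t : arg.inputs) {
                ++lookups;
                auto it = tensor_map.find(t);
                if (it != tensor_map.end()) fanin.push_back(it->second);
            }
            for (int t : arg.outputs) {
                ++inserts;
                tensor_map[t] = task_id;
            }
        }
        fanins.push_back(fanin);  // published at submit time, no delayed wiring
        return task_id;
    }
};
```

In this toy, a task submitted in AUTO mode after the manual scope finds no TensorMap entry for a tensor written inside it, which is why the "manual producer -> later external consumer" case asks for an explicit Arg.add_dep(...).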

Review-alignment cleanup in the latest push:

Arg keeps explicit dependency storage private while preserving add_dep(), explicit_dep_count(), and explicit_dep().
AUTO scopes nested under an active manual scope are rejected explicitly.
The design doc now reflects the actual submit-scoped TensorMap policy and current hardware data.
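
A later commit message in this PR describes Arg.add_dep as variadic with atomic capacity checking. A minimal sketch of that idea follows, assuming "atomic" means all-or-nothing. The class name `DepList`, the capacity of 4, and the `bool` return value are hypothetical; only add_dep(), explicit_dep_count(), and explicit_dep() are names taken from the PR. It needs C++17 for the fold expression.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical Arg-like holder; the capacity value is made up for illustration.
class DepList {
public:
    static constexpr std::size_t kMaxDeps = 4;

    // Appends all ids or none: capacity is checked once, before any write.
    template <typename... Ids>
    bool add_dep(Ids... ids) {
        constexpr std::size_t n = sizeof...(Ids);
        if (count_ + n > kMaxDeps) return false;
        ((deps_[count_++] = static_cast<int32_t>(ids)), ...);
        return true;
    }

    std::size_t explicit_dep_count() const { return count_; }
    int32_t explicit_dep(std::size_t i) const { return deps_[i]; }

private:
    // Storage stays private; callers only see the three accessors above.
    std::array<int32_t, kMaxDeps> deps_{};
    std::size_t count_ = 0;
};
```

With this shape, a call that would overflow the capacity leaves the stored deps untouched instead of appending a partial prefix.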

Minimal Example

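PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
    // alloc_tensors(...) is output-only; its producer task id is used as a dep below.
    auto alloc = alloc_tensors(tmp_ci);

    // QK matmul: ordered after the allocation task through an explicit dep.
    Arg qk;
    qk.add_input(qi, kj);
    qk.add_output(sij_ci);
    qk.add_dep(alloc.task_id());
    auto qk_out = pto2_rt_submit_aic_task(FUNC_QK_MATMUL, qk);

    // Softmax prepare: reads the QK output and depends on the QK task explicitly.
    Arg sf;
    sf.add_input(qk_out.get_ref(0));
    sf.add_output(pij_ci, li_ci, mi_ci);
    sf.add_dep(qk_out.task_id());
    auto sf_out = pto2_rt_submit_aiv_task(FUNC_SOFTMAX_PREPARE, sf);

    // Online update: no new outputs, so ordering comes only from the explicit dep.
    Arg up;
    up.add_input(sf_out.get_ref(1), sf_out.get_ref(2));
    up.add_inout(mi, li, out_view);
    up.add_dep(sf_out.task_id());
    (void)pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
}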

For repeated zero-output updater chains, thread the returned task id explicitly:

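// No update has been submitted yet, so start from the invalid id.
PTO2TaskId prev_update = PTO2TaskId::invalid();
for (...) {
    Arg up = make_update_args(...);
    if (prev_update.is_valid()) {
        up.add_dep(prev_update);
    }
    // Keep the returned task id so the next iteration can depend on this update.
    prev_update = pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up).task_id();
}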

Validation And Benchmarks

Primary design note:

docs/manual-scope-v0-design.md

Latest real-device validation was rerun on a2a3, device 10, PTO-ISA 478daadb, on commit 0a8680c.

Golden status:

TMR AUTO paged_attention: Case1 PASS, Case2 PASS
TMR manual paged_attention_manual_scope: Case1 PASS, Case2 PASS
TMR AUTO paged_attention_unroll: Case1 PASS, Case2 PASS
TMR manual paged_attention_unroll_manual_scope: Case1 PASS, Case2 PASS
ABG paged_attention_unroll: Case1 PASS, Case2 FAIL

Fresh matched 100-round trimmed benchmark:

Example	Case	TMR AUTO Elapsed (us)	TMR AUTO Orch (us)	TMR Manual Elapsed (us)	TMR Manual Orch (us)	Manual Delta Elapsed (us)	Manual Delta Orch (us)
`paged_attention`	`Case1`	72.613	54.655	81.358	62.287	+8.745	+7.632
`paged_attention`	`Case2`	91.682	63.958	101.647	71.885	+9.965	+7.927
`paged_attention_unroll`	`Case1`	1140.067	710.711	1131.544	614.427	-8.523	-96.284
`paged_attention_unroll`	`Case2`	513.079	274.306	491.192	229.887	-21.887	-44.419

Current readout:

non-unroll manual scope is still slower than TMR AUTO, but the gap is now single-digit microseconds in orchestration time
unroll manual scope is faster than TMR AUTO on both kept cases
ABG unroll Case2 fails golden in this rerun and should not be used as a correctness-clean baseline
in-tree ABG non-unroll production cases are different/larger shapes, so they are not apples-to-apples with the kept small TMR non-unroll cases

Testing

Latest matched hardware run:

task-submit --timeout 3600 --max-time 3600 --device 10 --run \
  "bash /tmp/manual_scope_v0_hw_eval_matched.sh 10"

This run performs golden validation without --skip-golden, then runs the matched 100-round benchmark for the TMR AUTO/manual pairs above.
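
For readers unfamiliar with trimmed averages, here is a minimal sketch of one common definition: sort the per-round samples, drop the same number of lowest and highest values, and average the rest. The function name and the `trim` parameter are hypothetical, and the trim count actually used by the benchmark script is not stated in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean after dropping the `trim` smallest and `trim` largest samples.
// The trim count used by the real benchmark script is not given in this PR.
inline double trimmed_mean(std::vector<double> samples, std::size_t trim) {
    assert(samples.size() > 2 * trim);  // at least one sample must remain
    std::sort(samples.begin(), samples.end());
    auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim);
    auto last = samples.end() - static_cast<std::ptrdiff_t>(trim);
    return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
}
```

Trimming this way keeps a single slow or fast outlier round from shifting the reported elapsed and orchestration times.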

gemini-code-assist

Code Review

This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.
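
The guard issue flagged in the review is a general C++ lifetime pitfall. The sketch below shows one common way it can arise, an unnamed temporary guard. The `ScopeGuard` type is hypothetical, and the real expansion of PTO2_SCOPE_GUARD is not shown in this PR text, so whether it fails in exactly this way is an assumption.

```cpp
#include <cassert>

// Hypothetical RAII guard; the real PTO2_SCOPE_GUARD expansion is not shown here.
struct ScopeGuard {
    int &depth;
    explicit ScopeGuard(int &d) : depth(d) { ++depth; }
    ~ScopeGuard() { --depth; }
    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard &operator=(const ScopeGuard &) = delete;
};

// Unnamed temporary: destroyed at the end of the full expression,
// so the statements that follow run outside the scope.
inline int depth_after_temporary_guard() {
    int depth = 0;
    ScopeGuard{depth};
    return depth;  // the guard is already gone here
}

// Named guard: lives until the end of the enclosing block.
inline int depth_with_named_guard() {
    int depth = 0;
    ScopeGuard guard{depth};
    return depth;  // the guard is still alive here
}
```

The first function returns with the depth already back at zero, which is the "unprotected loop body" shape the review describes; the second keeps the scope open for the rest of the block.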

uv-xiao · 2026-04-20T05:46:44Z

Alignment update against poursoul/manual_scope (dd76880):

The current manual_scope_v0 branch now follows the same hot-path structure in the main places that matter:
- pto2_prepare_task() prebinds next_task_id() / next_task_slot() and zero-inits slot state with memset
- TaskOutputTensors carries task_id() directly
- payload->init(...) materializes outputs from PTO2TaskAllocResult / PTO2OutputLayout
- manual-scope submit skips TensorMap lookup/insert and uses explicit deps directly
- the unroll manual example uses the same narrowed update dependency chain shape
The intentional difference we keep is allocator-failure signaling:
- this branch keeps the old negative sentinel {-1, -1, nullptr, nullptr}
- poursoul’s visible patch returned {0, 0, nullptr, nullptr} while failed() still checked task_id < 0
- we kept the negative sentinel because task id 0 is valid, so the zero-sentinel form is internally inconsistent unless the failure check also changes
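
The sentinel inconsistency can be shown in a few lines. This is a sketch only: the struct name and field names are hypothetical, and only the four-value sentinel shapes and the `task_id < 0` check come from the discussion above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout; field names are illustrative.
struct AllocResult {
    int32_t task_id;
    int32_t slot;
    void *task;
    void *payload;
    // Failure check as described above: only a negative task id counts as failure.
    bool failed() const { return task_id < 0; }
};

// Sentinel kept on this branch.
inline AllocResult negative_sentinel() { return {-1, -1, nullptr, nullptr}; }

// Sentinel from the other patch: task id 0 is a valid id, so failed() misses it.
inline AllocResult zero_sentinel() { return {0, 0, nullptr, nullptr}; }
```

With the zero sentinel, a failed allocation looks like a successful allocation of task 0 unless the failure check is changed as well, for example to test the pointers.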

I also benchmarked the current branch and poursoul’s branch directly in an isolated worktree on the same device (a2a3, device 9, 100 rounds, PTO-ISA d96c8784). The two are materially the same; there is no clear evidence that our current branch is slower because we mis-merged poursoul’s design.

Example	Case	Current Elapsed / Orch	Poursoul Elapsed / Orch	Poursoul vs Current
`paged_attention`	`Case1`	`74.6 / 59.2`	`77.3 / 61.4`	`+3.6% / +3.7%`
`paged_attention`	`Case2`	`93.2 / 72.3`	`93.1 / 71.9`	`-0.1% / -0.6%`
`paged_attention_manual_scope`	`Case1`	`118.1 / 101.3`	`120.9 / 105.6`	`+2.4% / +4.2%`
`paged_attention_manual_scope`	`Case2`	`138.6 / 115.3`	`128.7 / 115.7`	`-7.1% / +0.4%`
`paged_attention_unroll`	`Case1`	`1135.2 / 774.1`	`1140.1 / 766.9`	`+0.4% / -0.9%`
`paged_attention_unroll`	`Case2`	`516.0 / 305.5`	`520.2 / 319.1`	`+0.8% / +4.5%`
`paged_attention_unroll_manual_scope`	`Case1`	`1129.6 / 651.7`	`1128.9 / 646.2`	`-0.1% / -0.8%`
`paged_attention_unroll_manual_scope`	`Case2`	`494.3 / 253.6`	`495.4 / 252.5`	`+0.2% / -0.4%`
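
The "Poursoul vs Current" column appears to be the relative change of poursoul's number against the current branch's number, rounded to one decimal place. That reading is inferred from the table values, not stated in the comment. A small sketch with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Relative change of `other` against `base`, in percent.
inline double percent_vs(double base, double other) {
    return (other - base) / base * 100.0;
}

// Round to one decimal place, as the table does.
inline double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

For example, the paged_attention Case1 elapsed pair 74.6 and 77.3 gives +3.6, and the paged_attention_manual_scope Case2 elapsed pair 138.6 and 128.7 gives -7.1, matching the table.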

So the current state is:

alignment with poursoul’s design is in place for the main runtime/example paths
performance is effectively the same between the two branches within normal run-to-run noise for most rows
the remaining non-unroll gap vs AUTO is a real manual-scope-v0 issue, not just a divergence from poursoul’s branch

uv-xiao · 2026-04-20T17:42:31Z

Update after rerunning the full batch on real device.

Cleaned state

This PR is still the same narrow v0 scope only:

runtime/API support for PTO2_SCOPE(PTO2ScopeMode::MANUAL)
explicit deps through Arg.add_dep(task_id)
submit-time publish only
standalone submit-result task_id for zero-output updater chaining
TensorMap lookup/insert bypass for current-manual-scope-local tensors
two a2a3 manual-scope examples:
- paged_attention_manual_scope
- paged_attention_unroll_manual_scope
unit / sim validation for the v0 rules

The branch is kept as 3 logical commits:

runtime support
paged-attention examples
design / benchmark doc

Fresh real-device benchmark

Rerun on a2a3, device 9, PTO-ISA d96c8784, 100 rounds, trimmed average.

Golden status:

TMR AUTO/manual all PASS on the two paged-attention workloads
ABG paged_attention PASS
ABG paged_attention_unroll Case2 FAIL again, so that ABG row is still not correctness-clean

Example	Case	TMR AUTO Elapsed (us)	TMR AUTO Orch (us)	TMR Manual Elapsed (us)	TMR Manual Orch (us)	ABG Elapsed (us)	Notes
`paged_attention`	`Case1`	73.4	60.2	119.6	104.8	31385.1	all correctness checks pass
`paged_attention`	`Case2`	93.9	73.3	137.1	114.6	16429.4	all correctness checks pass
`paged_attention_unroll`	`Case1`	1137.0	772.7	1132.2	647.4	1383.3	all correctness checks pass
`paged_attention_unroll`	`Case2`	523.0	317.7	492.6	251.2	676.7	ABG golden fails

Reading of the current batch:

non-unroll manual scope is still slower than TMR AUTO, and the gap is mostly orchestration time
unroll manual scope is still slightly faster than TMR AUTO on both kept cases
ABG paged_attention_unroll Case2 remains an unstable / not correctness-clean baseline in reruns

TensorMap lookup / insert comparison

Profiling comparison from the current manual-scope implementation on non-unroll paged_attention:

Case	Mode	`lookup+dep` Trim (us)	`tensormap_ins` Trim (us)	TensorMap Lookups Avg	TensorMap Inserts Avg	Full Orch Trim (us)
`Case1`	AUTO	4.132	1.842	40.0	12.0	194.508
`Case1`	MANUAL	1.944	1.414	16.0	3.0	259.318
`Case2`	AUTO	6.320	2.638	105.0	32.0	210.274
`Case2`	MANUAL	2.598	1.728	41.0	8.0	285.182

What this still shows:

the manual-local TensorMap bypass is working
lookup / insert traffic is materially reduced in manual mode
the remaining non-unroll gap is not explained by TensorMap lookup / insert alone
the remaining cost is in the explicit-dep / orchestration path, not the old TensorMap path

uv-xiao · 2026-04-21T04:49:20Z

Small follow-up after aligning the two manual-scope paged-attention examples.

What changed

The non-unroll manual example now uses the same update-chain dependency shape as the unroll manual example:

every update keeps the direct pv_outs.task_id() producer edge
the first update depends on the allocation task
later updates depend on the previous update task
the last non-first update also retains the allocation task, matching the unroll path

This removes the older conservative pattern where non-unroll attached alloc_task to every update.
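
The update-chain shape described above can be written as a small selection function. This is a sketch, not the example's code: `update_deps` and its integer task ids are hypothetical, and the is_first / is_last gating follows the wording of the related commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: explicit deps for update number `i` out of `n` updates.
inline std::vector<int32_t> update_deps(int i, int n, int32_t pv_task,
                                        int32_t alloc_task, int32_t prev_update) {
    const bool is_first = (i == 0);
    const bool is_last = (i == n - 1);
    std::vector<int32_t> deps{pv_task};  // direct producer edge, always kept
    if (is_first) {
        deps.push_back(alloc_task);      // first update depends on the allocation task
    } else {
        deps.push_back(prev_update);     // later updates chain on the previous update
        if (is_last) {
            deps.push_back(alloc_task);  // last non-first update also retains the allocation task
        }
    }
    return deps;
}
```

Compared with attaching the allocation task to every update, middle updates in this shape carry two deps instead of three.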

Targeted real-device rerun

Reran only the affected non-unroll manual example on a2a3, device 9, PTO-ISA d96c8784, with --build.

Golden:

paged_attention_manual_scope Case1: PASS
paged_attention_manual_scope Case2: PASS

100-round trimmed benchmark for the affected rows:

Example	Case	Before Manual Elapsed (us)	Before Manual Orch (us)	After Manual Elapsed (us)	After Manual Orch (us)
`paged_attention_manual_scope`	`Case1`	119.6	104.8	117.3	102.8
`paged_attention_manual_scope`	`Case2`	137.1	114.6	133.8	112.4

So this is mostly a consistency cleanup, with a small measured improvement on the non-unroll manual path.

uv-xiao · 2026-04-23T06:14:25Z

Latest pushed update is commit 0a8680c.

What changed since the last PR update

Rebased/forward-ported branch state was pushed onto uv-xiao:manual_scope_v0.
Arg.add_dep(...) now stores explicit deps behind private Arg storage while keeping the public API unchanged.
AUTO scopes nested under an active manual scope are rejected explicitly.
The design doc was refreshed to match the actual v0 semantics: tasks submitted inside manual scope skip TensorMap lookup/insert; dependencies must be explicit.
Fresh matched real-device validation/benchmark was rerun on a2a3, device 10, PTO-ISA 478daadb.

Fresh matched 100-round benchmark

Example	Case	TMR AUTO Elapsed (us)	TMR AUTO Orch (us)	TMR Manual Elapsed (us)	TMR Manual Orch (us)	Manual Delta Elapsed (us)	Manual Delta Orch (us)
`paged_attention`	`Case1`	72.613	54.655	81.358	62.287	+8.745	+7.632
`paged_attention`	`Case2`	91.682	63.958	101.647	71.885	+9.965	+7.927
`paged_attention_unroll`	`Case1`	1140.067	710.711	1131.544	614.427	-8.523	-96.284
`paged_attention_unroll`	`Case2`	513.079	274.306	491.192	229.887	-21.887	-44.419

Golden status for the TMR AUTO/manual rows above: all PASS.

Readout:

non-unroll manual scope still has a small orchestration overhead vs AUTO
unroll manual scope improves orchestration time on both kept cases
ABG unroll Case2 still fails golden and is not a correctness-clean baseline

- Add manual scope mode and explicit Arg dependency plumbing
- Attach submit-result task ids independently from output tensors
- Bypass TensorMap lookup and insert while manual scope is active
- Keep the runtime/examples/docs scope without adding test changes

- Add non-unroll and unroll manual-scope examples for a2a3 TMR
- Wire task ids through Arg.add_dep at submit time
- Keep AUTO paged-attention available as the comparison path

- Document v0 API constraints and submit-time dependency model
- Record TensorMap bypass behavior and boundary-edge rules
- Include the current device benchmark and validation notes

- Make non-unroll manual paged-attention use the same update-chain dependency shape as the unroll manual path
- Gate alloc-task retention with is_first/is_last instead of attaching it on every update
- Verified with fresh hardware golden and 100-round reruns on device 9

- shrink the runtime diff back toward upstream in the validated manual-scope hot path
- keep TensorMap bypass only for current-manual-local tensors while preserving boundary TensorMap behavior
- refresh the v0 design doc with the rebased branch state plus fresh 100-round hardware benchmarks and golden-check results

- wrap Arg explicit dependency storage behind a private helper while keeping the add_dep API unchanged
- reject AUTO scopes nested under an active manual scope and keep current-scope dep validation explicit
- refresh docs/manual-scope-v0-design.md with fresh matched 100-round hardware benchmark results from device 10

- validate explicit deps by task liveness and skip retired producers
- reset manual-scope state in existing orchestrator init/done paths
- make Arg.add_dep variadic with atomic capacity checking
- refresh the design doc and golden file to satisfy current hooks

uv-xiao requested a review from poursoul April 15, 2026 08:03

uv-xiao marked this pull request as ready for review April 15, 2026 08:03

gemini-code-assist Bot reviewed Apr 15, 2026

Comment thread ...a2a3/tensormap_and_ringbuffer/paged_attention/kernels/orchestration/paged_attention_orch.cpp

Comment thread docs/manual-scope-v0-design.md Outdated

uv-xiao force-pushed the manual_scope_v0 branch from 5e4de6a to 84763d8 Compare April 20, 2026 09:17

uv-xiao force-pushed the manual_scope_v0 branch from 84763d8 to 0e0d3ee Compare April 20, 2026 17:48

poursoul reviewed Apr 22, 2026

uv-xiao force-pushed the manual_scope_v0 branch from efab669 to 0a8680c Compare April 23, 2026 06:09

uv-xiao force-pushed the manual_scope_v0 branch from 0a8680c to b0d7a2d Compare April 23, 2026 07:45

uv-xiao requested a review from poursoul April 23, 2026 13:35

uv-xiao force-pushed the manual_scope_v0 branch 3 times, most recently from de48b2b to c0d01e1 Compare April 24, 2026 07:04

uv-xiao added 6 commits April 24, 2026 15:18

Add: manual scope paged-attention examples

5f993a2

docs: add manual scope v0 design

ba0f7a6

uv-xiao force-pushed the manual_scope_v0 branch from c0d01e1 to 6205db0 Compare April 24, 2026 07:38

poursoul reviewed Apr 24, 2026

uv-xiao force-pushed the manual_scope_v0 branch from 6205db0 to 54ddc4f Compare April 24, 2026 08:42

uv-xiao force-pushed the manual_scope_v0 branch from 54ddc4f to e976f7f Compare April 24, 2026 08:51

Fix: allow nested manual scope in a2a3

8309ddc

uv-xiao force-pushed the manual_scope_v0 branch from 626ce5e to 8309ddc Compare April 24, 2026 11:41
