
Add manual-scope v0 to tensormap runtime #568

Open
uv-xiao wants to merge 8 commits into hw-native-sys:main from uv-xiao:manual_scope_v0


@uv-xiao uv-xiao commented Apr 15, 2026

Summary

  • Add the narrow manual_scope v0 mode for a2a3/tensormap_and_ringbuffer.
  • Keep the same submit API family as AUTO mode; manual ordering is expressed through Arg.add_dep(task_id).
  • Publish tasks at submit time; there is no delayed wiring, delayed linking, or scope-end replay barrier.
  • Carry standalone submit-result task ids so zero-output updater chains and alloc_tensors(...) producers can be depended on explicitly.
  • Skip TensorMap lookup/insert for tasks submitted inside manual scope; dependency correctness in manual scope comes from explicit deps.
  • Add manual-scope paged-attention examples and keep the detailed design/benchmark note in docs/manual-scope-v0-design.md.

Scope

This PR is intentionally smaller than the older manual-dep branch.

In scope:

  • PTO2_SCOPE() remains AUTO by default.
  • PTO2_SCOPE(PTO2ScopeMode::MANUAL) enters manual mode.
  • Submit APIs stay unchanged:
    • pto2_rt_submit_aic_task(...)
    • pto2_rt_submit_aiv_task(...)
    • pto2_rt_submit_task(...)
  • Arg.add_dep(task_id) appends explicit producer task ids to the normal Arg object.
  • Explicit deps are validated and materialized as ordinary fanins before submit-time publish.
  • alloc_tensors(...) remains output-only and exposes its producer task id.

Out of scope for v0:

  • no separate *_manual(...) submit APIs
  • no post-submit dependency API
  • no delayed dependency wiring/linking
  • no batch publish barrier at scope_end()
  • no nested scopes under an active manual scope
  • no implicit TensorMap fallback for tasks submitted inside manual scope

Runtime Model

Manual scope uses the normal submit path with two extra rules:

  1. explicit deps from Arg.add_dep(...) are consumed before TensorMap handling
  2. if the consumer task is inside manual scope, TensorMap lookup and insert are skipped for that submit
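
As a rough illustration of the two rules above, the sketch below models a submit path with a toy TensorMap. `MiniRuntime`, `Submitted`, `TensorKey`, and the counters are hypothetical; the `Arg` here is a simplified stand-in and not the runtime's real `Arg` type.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; not the real PTO2 runtime types.
using TaskId = std::int32_t;
using TensorKey = int;

struct Arg {
    std::vector<TensorKey> inputs;
    std::vector<TensorKey> outputs;
    std::vector<TaskId> explicit_deps;  // what add_dep(...) would collect
};

struct Submitted {
    TaskId id;
    std::vector<TaskId> fanins;
};

struct MiniRuntime {
    bool in_manual_scope = false;
    TaskId next_id = 0;
    std::unordered_map<TensorKey, TaskId> tensor_map;  // tensor -> latest producer
    int lookups = 0;
    int inserts = 0;

    Submitted submit(const Arg& arg) {
        Submitted task{next_id++, {}};
        // Rule 1: explicit deps become ordinary fanins before any TensorMap work.
        task.fanins = arg.explicit_deps;
        // Rule 2: a submit inside manual scope skips TensorMap lookup and insert.
        if (!in_manual_scope) {
            for (TensorKey t : arg.inputs) {
                ++lookups;
                auto it = tensor_map.find(t);
                if (it != tensor_map.end()) task.fanins.push_back(it->second);
            }
            for (TensorKey t : arg.outputs) {
                ++inserts;
                tensor_map[t] = task.id;
            }
        }
        return task;
    }
};
```

In this model a manual-scope consumer with no `explicit_deps` gets an empty fanin list even when it reads a tensor another task wrote, which is why ordering inside manual scope has to be stated explicitly.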

Dependency behavior:

  • manual producer -> manual consumer: use Arg.add_dep(...); no TensorMap lookup/insert
  • external producer -> manual consumer: not supported directly in v0; put the producer in the same manual scope or keep the consumer outside manual scope
  • manual producer -> later external consumer: use Arg.add_dep(...) outside manual scope when ordering is required
  • AUTO / outside manual scope: keep normal TensorMap behavior

Review-alignment cleanup in the latest push:

  • Arg keeps explicit dependency storage private while preserving add_dep(), explicit_dep_count(), and explicit_dep().
  • AUTO scopes nested under an active manual scope are rejected explicitly.
  • The design doc now reflects the actual submit-scoped TensorMap policy and current hardware data.

Minimal Example

PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
    auto alloc = alloc_tensors(tmp_ci);

    Arg qk;
    qk.add_input(qi, kj);
    qk.add_output(sij_ci);
    qk.add_dep(alloc.task_id());
    auto qk_out = pto2_rt_submit_aic_task(FUNC_QK_MATMUL, qk);

    Arg sf;
    sf.add_input(qk_out.get_ref(0));
    sf.add_output(pij_ci, li_ci, mi_ci);
    sf.add_dep(qk_out.task_id());
    auto sf_out = pto2_rt_submit_aiv_task(FUNC_SOFTMAX_PREPARE, sf);

    Arg up;
    up.add_input(sf_out.get_ref(1), sf_out.get_ref(2));
    up.add_inout(mi, li, out_view);
    up.add_dep(sf_out.task_id());
    (void)pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up);
}

For repeated zero-output updater chains, thread the returned task id explicitly:

PTO2TaskId prev_update = PTO2TaskId::invalid();
for (...) {
    Arg up = make_update_args(...);
    if (prev_update.is_valid()) {
        up.add_dep(prev_update);
    }
    prev_update = pto2_rt_submit_aiv_task(FUNC_ONLINE_UPDATE, up).task_id();
}

Validation And Benchmarks

Primary design note:

  • docs/manual-scope-v0-design.md

Latest real-device validation was rerun on a2a3, device 10, PTO-ISA 478daadb, on commit 0a8680c.

Golden status:

  • TMR AUTO paged_attention: Case1 PASS, Case2 PASS
  • TMR manual paged_attention_manual_scope: Case1 PASS, Case2 PASS
  • TMR AUTO paged_attention_unroll: Case1 PASS, Case2 PASS
  • TMR manual paged_attention_unroll_manual_scope: Case1 PASS, Case2 PASS
  • ABG paged_attention_unroll: Case1 PASS, Case2 FAIL

Fresh matched 100-round trimmed benchmark:

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | Manual Delta Elapsed (us) | Manual Delta Orch (us) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 72.613 | 54.655 | 81.358 | 62.287 | +8.745 | +7.632 |
| paged_attention | Case2 | 91.682 | 63.958 | 101.647 | 71.885 | +9.965 | +7.927 |
| paged_attention_unroll | Case1 | 1140.067 | 710.711 | 1131.544 | 614.427 | -8.523 | -96.284 |
| paged_attention_unroll | Case2 | 513.079 | 274.306 | 491.192 | 229.887 | -21.887 | -44.419 |

Current readout:

  • non-unroll manual scope is still slower than TMR AUTO, but the gap is now single-digit microseconds in orchestration time
  • unroll manual scope is faster than TMR AUTO on both kept cases
  • ABG unroll Case2 fails golden in this rerun and should not be used as a correctness-clean baseline
  • in-tree ABG non-unroll production cases are different/larger shapes, so they are not apples-to-apples with the kept small TMR non-unroll cases

Testing

Latest matched hardware run:

task-submit --timeout 3600 --max-time 3600 --device 10 --run \
  "bash /tmp/manual_scope_v0_hw_eval_matched.sh 10"

This run performs golden validation without --skip-golden, then runs the matched 100-round benchmark for the TMR AUTO/manual pairs above.
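
For readers unfamiliar with the "trimmed" figures reported above, here is a minimal sketch of a symmetric trimmed mean. The PR does not state how the 100 rounds are trimmed, so the `trim_fraction` parameter and the symmetric policy are assumptions, and `trimmed_mean` is a hypothetical helper rather than part of the benchmark script.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Symmetric trimmed mean: sort, drop floor(n * trim_fraction) samples from
// each end, and average what is left.
// ASSUMPTION: the trimming policy is illustrative; the PR does not specify it.
double trimmed_mean(std::vector<double> samples, double trim_fraction) {
    assert(!samples.empty());
    assert(trim_fraction >= 0.0 && trim_fraction < 0.5);
    std::sort(samples.begin(), samples.end());
    const auto drop = static_cast<std::ptrdiff_t>(
        static_cast<std::size_t>(samples.size() * trim_fraction));
    const auto first = samples.begin() + drop;
    const auto last = samples.end() - drop;
    // trim_fraction < 0.5 guarantees at least one sample remains.
    return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
}
```

Trimming both tails keeps a single slow round (for example a cold start) from dominating a mean over short, microsecond-scale runs.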

@uv-xiao uv-xiao requested a review from poursoul April 15, 2026 08:03
@uv-xiao uv-xiao marked this pull request as ready for review April 15, 2026 08:03

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a lighter 'manual-scope' mode for the a2a3/tensormap_and_ringbuffer service, allowing for explicit task dependency management while maintaining the existing AUTO-mode submit API. The design avoids complex delayed wiring, opting for explicit Arg.add_dep(task_id) annotations within a PTO2_SCOPE(PTO2ScopeMode::MANUAL) block. I have identified a critical issue where PTO2_SCOPE_GUARD(); is used incorrectly, resulting in an immediate destruction of the guard and leaving the loop body unprotected. Additionally, the design document contains a future date that should be corrected.
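
The guard issue flagged in the review is a general C++ RAII pitfall: an unnamed guard object is a temporary and is destroyed at the end of its own statement. The sketch below shows that pattern in isolation. `ScopeGuard` and both helper functions are hypothetical; the actual expansion of `PTO2_SCOPE_GUARD` is not shown in this conversation.

```cpp
#include <cassert>

// Hypothetical RAII guard that tracks scope depth through a counter.
struct ScopeGuard {
    explicit ScopeGuard(int& depth) : depth_(depth) { ++depth_; }
    ~ScopeGuard() { --depth_; }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    int& depth_;
};

// Unnamed temporary: the guard is destroyed at the end of its own statement,
// so the code that follows runs with the scope already closed.
int depth_seen_with_temporary() {
    int depth = 0;
    int seen = -1;
    {
        ScopeGuard{depth};
        seen = depth;  // observes 0
    }
    return seen;
}

// Named guard: it lives until the closing brace, so the body is covered.
int depth_seen_with_named_guard() {
    int depth = 0;
    int seen = -1;
    {
        ScopeGuard guard{depth};
        seen = depth;  // observes 1
    }
    return seen;
}
```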

Comment thread docs/manual-scope-v0-design.md Outdated

uv-xiao commented Apr 20, 2026

Alignment update against poursoul/manual_scope (dd76880):

  • The current manual_scope_v0 branch now follows the same hot-path structure in the main places that matter:
    • pto2_prepare_task() prebinds next_task_id() / next_task_slot() and zero-inits slot state with memset
    • TaskOutputTensors carries task_id() directly
    • payload->init(...) materializes outputs from PTO2TaskAllocResult / PTO2OutputLayout
    • manual-scope submit skips TensorMap lookup/insert and uses explicit deps directly
    • the unroll manual example uses the same narrowed update dependency chain shape
  • The intentional difference we keep is allocator-failure signaling:
    • this branch keeps the old negative sentinel {-1, -1, nullptr, nullptr}
    • poursoul’s visible patch returned {0, 0, nullptr, nullptr} while failed() still checked task_id < 0
    • we kept the negative sentinel because task id 0 is valid, so the zero-sentinel form is internally inconsistent unless the failure check also changes
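
The sentinel argument above can be checked with a small model. The field names in `AllocResult` are hypothetical (only the four-value shape and the `task_id < 0` check come from the discussion), so treat this as a sketch of the reasoning rather than the runtime's actual struct.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the four-field allocation result discussed above.
struct AllocResult {
    std::int32_t task_id;
    std::int32_t slot;
    void* base;
    void* end;

    // Failure check as described in the comment: negative task id means failure.
    bool failed() const { return task_id < 0; }
};

// Sentinel kept on this branch.
inline AllocResult negative_sentinel() { return {-1, -1, nullptr, nullptr}; }

// Sentinel from the other patch, paired with the same failed() check.
inline AllocResult zero_sentinel() { return {0, 0, nullptr, nullptr}; }
```

With `failed()` unchanged, `zero_sentinel()` is reported as a success, and its ids collide with the first valid task (id 0), so the caller cannot tell an allocation failure apart from that task by id.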

I also benchmarked the current branch and poursoul’s branch directly in an isolated worktree on the same device (a2a3, device 9, 100 rounds, PTO-ISA d96c8784). The two are materially the same; there is no clear evidence that the current branch is slower because of a mis-merge of poursoul’s design.

| Example | Case | Current Elapsed / Orch (us) | Poursoul Elapsed / Orch (us) | Poursoul vs Current |
| --- | --- | --- | --- | --- |
| paged_attention | Case1 | 74.6 / 59.2 | 77.3 / 61.4 | +3.6% / +3.7% |
| paged_attention | Case2 | 93.2 / 72.3 | 93.1 / 71.9 | -0.1% / -0.6% |
| paged_attention_manual_scope | Case1 | 118.1 / 101.3 | 120.9 / 105.6 | +2.4% / +4.2% |
| paged_attention_manual_scope | Case2 | 138.6 / 115.3 | 128.7 / 115.7 | -7.1% / +0.4% |
| paged_attention_unroll | Case1 | 1135.2 / 774.1 | 1140.1 / 766.9 | +0.4% / -0.9% |
| paged_attention_unroll | Case2 | 516.0 / 305.5 | 520.2 / 319.1 | +0.8% / +4.5% |
| paged_attention_unroll_manual_scope | Case1 | 1129.6 / 651.7 | 1128.9 / 646.2 | -0.1% / -0.8% |
| paged_attention_unroll_manual_scope | Case2 | 494.3 / 253.6 | 495.4 / 252.5 | +0.2% / -0.4% |

So the current state is:

  • alignment with poursoul’s design is in place for the main runtime/example paths
  • performance is effectively the same between the two branches within normal run-to-run noise for most rows
  • the remaining non-unroll gap vs AUTO is a real manual-scope-v0 issue, not just a divergence from poursoul’s branch
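
The "Poursoul vs Current" percentages in the table above appear to be plain relative changes against the current branch. A minimal sketch of that arithmetic, with `percent_vs` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative change of `other` against `baseline`, in percent.
// Positive means `other` took longer than `baseline`.
double percent_vs(double baseline, double other) {
    return (other - baseline) / baseline * 100.0;
}
```

For example, `percent_vs(74.6, 77.3)` is about 3.62, which matches the +3.6% elapsed entry for paged_attention Case1. Small mismatches in the last digit are expected when recomputing from the rounded table values.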


uv-xiao commented Apr 20, 2026

Update after rerunning the full batch on real device.

Cleaned state

This PR still covers only the narrow v0 scope:

  • runtime/API support for PTO2_SCOPE(PTO2ScopeMode::MANUAL)
  • explicit deps through Arg.add_dep(task_id)
  • submit-time publish only
  • standalone submit-result task_id for zero-output updater chaining
  • TensorMap lookup/insert bypass for current-manual-scope-local tensors
  • two a2a3 manual-scope examples:
    • paged_attention_manual_scope
    • paged_attention_unroll_manual_scope
  • unit / sim validation for the v0 rules

The branch is kept as 3 logical commits:

  1. runtime support
  2. paged-attention examples
  3. design / benchmark doc

Fresh real-device benchmark

Rerun on a2a3, device 9, PTO-ISA d96c8784, 100 rounds, trimmed average.

Golden status:

  • TMR AUTO/manual all PASS on the two paged-attention workloads
  • ABG paged_attention PASS
  • ABG paged_attention_unroll Case2 FAIL again, so that ABG row is still not correctness-clean

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | ABG Elapsed (us) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 73.4 | 60.2 | 119.6 | 104.8 | 31385.1 | all correctness checks pass |
| paged_attention | Case2 | 93.9 | 73.3 | 137.1 | 114.6 | 16429.4 | all correctness checks pass |
| paged_attention_unroll | Case1 | 1137.0 | 772.7 | 1132.2 | 647.4 | 1383.3 | all correctness checks pass |
| paged_attention_unroll | Case2 | 523.0 | 317.7 | 492.6 | 251.2 | 676.7 | ABG golden fails |

Reading of the current batch:

  • non-unroll manual scope is still slower than TMR AUTO, and the gap is mostly orchestration time
  • unroll manual scope is still slightly faster than TMR AUTO on both kept cases
  • ABG paged_attention_unroll Case2 remains an unstable / not correctness-clean baseline in reruns

TensorMap lookup / insert comparison

Profiling comparison from the current manual-scope implementation on non-unroll paged_attention:

| Case | Mode | lookup+dep Trim (us) | tensormap_ins Trim (us) | TensorMap Lookups Avg | TensorMap Inserts Avg | Full Orch Trim (us) |
| --- | --- | --- | --- | --- | --- | --- |
| Case1 | AUTO | 4.132 | 1.842 | 40.0 | 12.0 | 194.508 |
| Case1 | MANUAL | 1.944 | 1.414 | 16.0 | 3.0 | 259.318 |
| Case2 | AUTO | 6.320 | 2.638 | 105.0 | 32.0 | 210.274 |
| Case2 | MANUAL | 2.598 | 1.728 | 41.0 | 8.0 | 285.182 |

What this still shows:

  • the manual-local TensorMap bypass is working
  • lookup / insert traffic is materially reduced in manual mode
  • the remaining non-unroll gap is not explained by TensorMap lookup / insert alone
  • the remaining cost is in the explicit-dep / orchestration path, not the old TensorMap path


uv-xiao commented Apr 21, 2026

Small follow-up after aligning the two manual-scope paged-attention examples.

What changed

The non-unroll manual example now uses the same update-chain dependency shape as the unroll manual example:

  • every update keeps the direct pv_outs.task_id() producer edge
  • the first update depends on the allocation task
  • later updates depend on the previous update task
  • the last non-first update also retains the allocation task, matching the unroll path

This removes the older conservative pattern where non-unroll attached alloc_task to every update.
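
The dependency shape described in the list above can be written as a small helper that picks the explicit deps for one update. `update_deps` and its parameters are hypothetical names used for illustration; in the real examples the ids would be passed to `Arg.add_dep(...)` before submit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using TaskId = std::int32_t;  // stand-in for the runtime's task id type

// Explicit deps for one update in the chain, following the four bullets above.
// ASSUMPTION: a single-iteration chain (first and last at once) is treated as
// "first" and takes the allocation edge exactly once.
std::vector<TaskId> update_deps(TaskId pv_task, TaskId alloc_task,
                                TaskId prev_update, bool is_first, bool is_last) {
    std::vector<TaskId> deps{pv_task};  // every update keeps the direct producer edge
    if (is_first) {
        deps.push_back(alloc_task);     // first update depends on the allocation task
    } else {
        deps.push_back(prev_update);    // later updates chain on the previous update
        if (is_last) {
            deps.push_back(alloc_task); // last non-first update retains the allocation
        }
    }
    return deps;
}
```

Compared with attaching the allocation task to every update, middle iterations in this shape carry two explicit deps instead of three.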

Targeted real-device rerun

Reran only the affected non-unroll manual example on a2a3, device 9, PTO-ISA d96c8784, with --build.

Golden:

  • paged_attention_manual_scope Case1: PASS
  • paged_attention_manual_scope Case2: PASS

100-round trimmed benchmark for the affected rows:

| Example | Case | Before Manual Elapsed (us) | Before Manual Orch (us) | After Manual Elapsed (us) | After Manual Orch (us) |
| --- | --- | --- | --- | --- | --- |
| paged_attention_manual_scope | Case1 | 119.6 | 104.8 | 117.3 | 102.8 |
| paged_attention_manual_scope | Case2 | 137.1 | 114.6 | 133.8 | 112.4 |

So this is mostly a consistency cleanup, with a small measured improvement on the non-unroll manual path.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h Outdated

uv-xiao commented Apr 23, 2026

Latest pushed update is commit 0a8680c.

What changed since the last PR update

  • Rebased/forward-ported branch state was pushed onto uv-xiao:manual_scope_v0.
  • Arg.add_dep(...) now stores explicit deps behind private Arg storage while keeping the public API unchanged.
  • AUTO scopes nested under an active manual scope are rejected explicitly.
  • The design doc was refreshed to match the actual v0 semantics: tasks submitted inside manual scope skip TensorMap lookup/insert; dependencies must be explicit.
  • Fresh matched real-device validation/benchmark was rerun on a2a3, device 10, PTO-ISA 478daadb.

Fresh matched 100-round benchmark

| Example | Case | TMR AUTO Elapsed (us) | TMR AUTO Orch (us) | TMR Manual Elapsed (us) | TMR Manual Orch (us) | Manual Delta Elapsed (us) | Manual Delta Orch (us) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| paged_attention | Case1 | 72.613 | 54.655 | 81.358 | 62.287 | +8.745 | +7.632 |
| paged_attention | Case2 | 91.682 | 63.958 | 101.647 | 71.885 | +9.965 | +7.927 |
| paged_attention_unroll | Case1 | 1140.067 | 710.711 | 1131.544 | 614.427 | -8.523 | -96.284 |
| paged_attention_unroll | Case2 | 513.079 | 274.306 | 491.192 | 229.887 | -21.887 | -44.419 |

Golden status for the TMR AUTO/manual rows above: all PASS.

Readout:

  • non-unroll manual scope still has a small orchestration overhead vs AUTO
  • unroll manual scope improves orchestration time on both kept cases
  • ABG unroll Case2 still fails golden and is not a correctness-clean baseline

@uv-xiao uv-xiao requested a review from poursoul April 23, 2026 13:35
@uv-xiao uv-xiao force-pushed the manual_scope_v0 branch 3 times, most recently from de48b2b to c0d01e1 on April 24, 2026 07:04
uv-xiao added 6 commits April 24, 2026 15:18
- Add manual scope mode and explicit Arg dependency plumbing
- Attach submit-result task ids independently from output tensors
- Bypass TensorMap lookup and insert while manual scope is active
- Keep the runtime/examples/docs scope without adding test changes
- Add non-unroll and unroll manual-scope examples for a2a3 TMR
- Wire task ids through Arg.add_dep at submit time
- Keep AUTO paged-attention available as the comparison path
- Document v0 API constraints and submit-time dependency model
- Record TensorMap bypass behavior and boundary-edge rules
- Include the current device benchmark and validation notes
- Make non-unroll manual paged-attention use the same update-chain dependency shape as the unroll manual path
- Gate alloc-task retention with is_first/is_last instead of attaching it on every update
- Verified with fresh hardware golden and 100-round reruns on device 9
- shrink the runtime diff back toward upstream in the validated manual-scope hot path
- keep TensorMap bypass only for current-manual-local tensors while preserving boundary TensorMap behavior
- refresh the v0 design doc with the rebased branch state plus fresh 100-round hardware benchmarks and golden-check results
- wrap Arg explicit dependency storage behind a private helper while keeping the add_dep API unchanged
- reject AUTO scopes nested under an active manual scope and keep current-scope dep validation explicit
- refresh docs/manual-scope-v0-design.md with fresh matched 100-round hardware benchmark results from device 10
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h Outdated
- validate explicit deps by task liveness and skip retired producers
- reset manual-scope state in existing orchestrator init/done paths
- make Arg.add_dep variadic with atomic capacity checking
- refresh the design doc and golden file to satisfy current hooks
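
One way to read "variadic with atomic capacity checking" in the commit notes above is all-or-nothing: capacity is checked once for the whole pack, and either every id is appended or none is. The sketch below follows that reading. It is not the PR's implementation: this `Arg` is a simplified stand-in, `kMaxExplicitDeps` is a made-up capacity, and the `bool` return value is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using TaskId = std::int32_t;  // stand-in for the runtime's task id type

class Arg {
public:
    static constexpr std::size_t kMaxExplicitDeps = 4;  // hypothetical capacity

    // Append every id or none: capacity is checked once for the whole pack.
    template <typename... Ids>
    bool add_dep(Ids... ids) {
        static_assert(sizeof...(Ids) > 0, "add_dep needs at least one task id");
        if (count_ + sizeof...(Ids) > kMaxExplicitDeps) {
            return false;  // would overflow, so leave the stored deps untouched
        }
        const TaskId pack[] = {static_cast<TaskId>(ids)...};
        for (TaskId id : pack) {
            deps_[count_++] = id;
        }
        return true;
    }

    std::size_t explicit_dep_count() const { return count_; }

    TaskId explicit_dep(std::size_t i) const {
        assert(i < count_);
        return deps_[i];
    }

private:
    // Storage stays private; callers only see the three accessors above.
    std::array<TaskId, kMaxExplicitDeps> deps_{};
    std::size_t count_ = 0;
};
```

Checking before writing avoids a half-applied call where some ids of a pack are stored and the rest are dropped.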