Skip to content

InsertSync: support explicit subview multibuffer autosync#517

Open
TaoTao-real wants to merge 22 commits intohw-native-sys:mainfrom
TaoTao-real:codex/multibuffer-pingpong-slotaware
Open

InsertSync: support explicit subview multibuffer autosync#517
TaoTao-real wants to merge 22 commits intohw-native-sys:mainfrom
TaoTao-real:codex/multibuffer-pingpong-slotaware

Conversation

@TaoTao-real
Copy link
Copy Markdown
Contributor

@TaoTao-real TaoTao-real commented Apr 20, 2026

Summary

  • Support autosync multibuffer only for explicitly annotated pto.subview / lowered memref.subview leaf buffers.
  • Bind event lanes to explicit (root, group, slot) metadata so ping/pong and N-slot selector patterns reuse the correct event id across iterations.
  • Add positive/negative guard samples, regression tests, and an A3-friendly executable sample for the explicit-subview path.

Motivation

  • Architect request: keep this PR clean and correctness-first by narrowing scope to explicit subview multibuffer.
  • Avoid mixing in root-buffer pto.multi_buffer intent, EnableMultiBuffer auto expansion/materialization, or heuristic double-alloc inference.
  • Make autosync depend on proven slot semantics instead of legacy multi-address heuristics.

Design

  • Design note: docs/designs/ptoas-auto-sync-subview-multibuffer-v1.md
  • User-facing example: docs/designs/multibuffer-root-group-slot-demo.pto
  • Key invariants / behaviors:
    • Only explicitly annotated pto.subview / memref.subview leaves participate in slot-aware multibuffer.
    • Required metadata live on leaf subviews: pto.multi_buffer_factor, pto.multi_buffer_slot, optional pto.multi_buffer_group.
    • The common root workspace is implicit family identity; this PR does not support root-buffer annotations.
    • Selector/control-flow is written by the user in IR; PTOAS analyzes it and binds event ids to slots.
    • Unsupported geometry, missing attrs, overlap, or inconsistent group/slot layouts fall back to conservative normal autosync.
    • No EnableMultiBuffer-style pass is included in this PR.

Testing

  • Local build:
    • source ~/Workspace/PTO/env.sh && ninja -C build ptoas
  • Local targeted validation after rebasing onto latest main:
    • bash test/samples/runop.sh -t Sync
    • bash test/samples/runop.sh --enablebc -t Sync
    • test/basic/issue517_subset_slot_binding_regression.pto
    • test/lit/pto/issue428_cube_sync_regression.pto
    • test/lit/pto/issue454_nested_loop_same_pipe_pair_regression.pto
  • Covered new/updated sample cases include:
    • explicit ping/pong selector
    • three-slot selector
    • multi-group selector
    • pure if/else selector
    • negative cases for overlap / missing attrs / mixed selectors / dynamic offsets
  • A3 note:
    • the explicit-subview A3 sample is included in this PR (multibuffer_subset_pingpong_a3.py), but A3 remote validation has not been rerun after the final scope-cleanup + rebase step.

Risk / Rollback

  • Risk: if slot/group inference accepts an invalid annotated layout, autosync could under-synchronize or allocate incorrect lane bundles.
  • Mitigation: the translator validates annotated subview groups/root geometry and falls back to normal autosync whenever proofs are incomplete.
  • Rollback: revert this PR to restore the pre-existing non-slot-aware autosync behavior.

Review Focus

  • PTOIRTranslator propagation/validation of explicit subview multibuffer metadata.
  • MemoryDependentAnalyzer / InsertSyncAnalysis use of (root, group, slot) for dependency pruning and event-lane decisions.
  • SyncEventIdAllocation / SyncCodegen correctness for selector-mode dynamic event-id binding.
  • Sample guards that prove positive selector cases and reject negative cases from accidentally widening into multibuffer lanes.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements auto-sync support for ping/pong buffers derived from a single root workspace via pto.subset or memref.subview. It introduces the PTOEnableMultiBuffer pass to materialize buffer selection and adds dynamic synchronization operations (SetFlagDynOp, WaitFlagDynOp) to handle loop-carried dependencies. The InsertSync analysis is enhanced with slot-aware logic to prove non-overlap between buffer partitions. Review feedback suggests deduplicating interval overlap logic, removing a redundant (void) cast, and optimizing SmallVector capacity in the loop nest traversal.

Comment on lines +535 to +544
auto isOverlap = [](const BaseMemInfo *a, const BaseMemInfo *b, int i,
int j) -> bool {
uint64_t aStart = a->baseAddresses[static_cast<size_t>(i)];
uint64_t bStart = b->baseAddresses[static_cast<size_t>(j)];
uint64_t aEnd = aStart + a->allocateSize;
uint64_t bEnd = bStart + b->allocateSize;
uint64_t maxStart = std::max(aStart, bStart);
uint64_t minEnd = std::min(aEnd, bEnd);
return maxStart < minEnd;
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for checking interval overlap is duplicated in IsSlotAwareMultibufferPair (lines 639-646). Consider extracting this into a common helper function to improve maintainability and ensure consistency across different analysis paths.


bool InsertSyncAnalysis::AreSlotwiseNonOverlapping(
const DepBaseMemInfoPairVec &depBaseMemInfosVec, int factor) const {
(void)factor;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The (void)factor; statement is unnecessary as factor is used in the loop on line 613. Removing it would clean up the code.

Value one = rewriter.create<arith::ConstantIndexOp>(loc, 1);

// Collect loop nest from inner to outer (baseLoop, parent, ...).
SmallVector<scf::ForOp> loops;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The loop nest depth is typically small. Using a small size for SmallVector (e.g., SmallVector<scf::ForOp, 4>) can avoid unnecessary heap allocations.

Suggested change
SmallVector<scf::ForOp> loops;
SmallVector<scf::ForOp, 4> loops;

@reedhecre
Copy link
Copy Markdown

reedhecre commented Apr 20, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: InsertSync: support explicit subview multibuffer autosync #517 InsertSync: support explicit subview multibuffer autosync
  • Author: TaoTao-real
  • Base/Head: main / codex/multibuffer-pingpong-slotaware
  • Head SHA: e6bf5d53b9a6
  • Trigger: PR 有新提交
  • Generated At: 2026-04-23T00:35:31Z
  • Previous Head SHA: dbbe3d026699
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

 lib/PTO/Transforms/InsertSync/PTOInsertSync.cpp    |   4 +-
 lib/PTO/Transforms/InsertSync/SyncCodegen.cpp      | 134 ++++--
 lib/PTO/Transforms/InsertSync/SyncCommon.cpp       |  35 ++
 .../InsertSync/SyncEventIdAllocation.cpp           |  67 +++
 lib/PTO/Transforms/PTOToEmitC.cpp                  |  11 +-
 lib/PTO/Transforms/PTOViewToMemref.cpp             |   6 +
 .../issue517_subset_slot_binding_regression.pto    |  67 +++
 ...517_subset_unrolled_slot_binding_regression.pto |  67 +++
 test/lit/pto/issue428_cube_sync_regression.pto     |  43 +-
 ...ue454_nested_loop_same_pipe_pair_regression.pto |  29 +-
 .../Sync/multibuffer_subset_pingpong_a3-pto.cpp    |  99 ++++
 .../samples/Sync/multibuffer_subset_pingpong_a3.py | 116 +++++
 ...t_inject_sync_multibuf_subset_dynamic_offset.py |  86 ++++
 ...st_inject_sync_multibuf_subset_group_if_else.py | 131 ++++++
 ...ct_sync_multibuf_subset_group_mixed_selector.py | 116 +++++
 ...t_inject_sync_multibuf_subset_group_selector.py | 133 ++++++
 .../test_inject_sync_multibuf_subset_no_attr.py    |  87 ++++
 .../test_inject_sync_multibuf_subset_overlap.py    |  99 ++++
 .../test_inject_sync_multibuf_subset_pingpong.py   | 101 ++++
 ...ect_sync_multibuf_subset_three_slot_selector.py | 115 +++++
 test/samples/runop.sh                              | 480 +++++++++++++++++++
 36 files changed, 3694 insertions(+), 149 deletions(-)
===== END STAGE clone rc=0 @ 2026-04-23 08:35:06 =====

===== STAGE codex-review @ 2026-04-23 08:35:06 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/review_prompt.txt'
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019db7c3-0c9a-7de3-98d4-19c4e9eb6dcb
--------
user
你现在在审查 GitHub PR。

仓库:hw-native-sys/PTOAS
PR:#517 InsertSync: support explicit subview multibuffer autosync
作者:TaoTao-real
base branch:origin/main
head branch:HEAD(当前已 checkout 到 PR head)

要求:
1. 只审查这个 PR 相对 origin/main 的改动,必要时可以看上下文文件。
2. 重点找真实的 correctness / regression / contract mismatch / CI / runtime / compatibility 问题。
3. 不要提纯风格建议,不要提低价值猜测。
4. 严格按优先级输出:
   - P1:高概率会导致错误结果、编译/运行失败、严重回归、发布阻断
   - P2:重要缺陷、行为回归、遗漏校验/测试、较大兼容性问题
   - P3:次要但明确可改的问题
5. 如果没有问题,summary 直接写:未检查到 PR #517 存在问题,并返回 findings=[]。
6. 如果有问题,summary 简洁概括,findings 里每条都要给出:
   - severity
   - title
   - body(说明为什么是问题,尽量具体)
   - file(尽量给相对路径)
   - line(能确定就填整数,否则 null)

建议先查看:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

最终输出必须严格匹配 JSON schema。

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 503 Service Unavailable: Service temporarily unavailable, url: https://codex.0u0o.com/responses, request id: 9588a58d-022c-4a2e-8934-5400858af818)
Reconnecting... 2/5 (unexpected status 503 Service Unavailable: Service temporarily unavailable, url: https://codex.0u0o.com/responses, request id: 7a70b155-26f5-496e-ac62-da7617d9060e)
Reconnecting... 3/5 (unexpected status 503 Service Unavailable: Service temporarily unavailable, url: https://codex.0u0o.com/responses, request id: 56f3f3ea-f1db-410b-9aac-53f892a11f35)
Reconnecting... 4/5 (unexpected status 503 Service Unavailable: Service temporarily unavailable, url: https://codex.0u0o.com/responses, request id: ee1c702a-d2e1-4c05-9f5c-d13a0223702d)
Reconnecting... 5/5 (unexpected status 503 Service Unavailable: Service temporarily unavailable, url: https://codex.0u0o.com/responses, request id: 559adf34-5447-44c5-8548-8da8c6bcee9c)
ERROR: unexpected status 503 Service Unavailable: Service temporarily unavailable, url: https://codex.0u0o.com/responses, request id: 3eb275ad-c11e-4bae-91b0-cc072811a27c
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260423_083502_pr517/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-04-23 08:35:31 =====

@TaoTao-real TaoTao-real force-pushed the codex/multibuffer-pingpong-slotaware branch from 8bb9f55 to 5c0c3d6 Compare April 20, 2026 02:32
- Add pto.multi_buffer=2 attr plumbing into PlanMemory (alloc_tile -> memref.alloc).
- Detect ping/pong via planned address overlap-matrix and emit dynamic event-id set/wait.
- Add EnableMultiBuffer pass to materialize loop-local ping/pong selection.
- Add Sync sample + runop guard; fix PTOViewToMemref typed accessor crash for bitcast/treshape.
emitc.call_opaque requires an IntegerAttr placeholder to print SSA operands.
Add the operand placeholder for event_id so set_flag/wait_flag receive the dynamic event argument, and extend the Sync multibuf runop guard to catch missing 3rd argument.
- Track view-like alias closures (bind_tile/subview/casts) from multi-address pointer_cast.
- Build ping/pong selector in the LCA loop and rematerialize loop-local alias ops so tile allocations also switch ping/pong.
- Update multibuf pingpong sample to use alloc_tile + TLOAD/TSTORE so the generated C++ builds on A2/A3 pto-isa.
Materialize ping/pong by selecting between i64 addresses and building a loop-local PointerCastOp.
This keeps bind_tile lowering able to trace the defining PointerCastOp and avoids generating C++ that casts Tile<> to __ubuf__ pointers (which breaks A2/A3 compilation).
Update the multibuf runop guard accordingly.
The ping/pong base addresses are an implementation detail of PlanMemory.
Check for >=2 distinct int64 constants + ternary address selection, rather than hard-coding 0/512.
@TaoTao-real TaoTao-real force-pushed the codex/multibuffer-pingpong-slotaware branch from a800533 to 2d1e338 Compare April 22, 2026 01:34
@TaoTao-real TaoTao-real changed the title InsertSync: support subset pingpong multibuffer autosync InsertSync: support explicit subview multibuffer autosync Apr 22, 2026
@TaoTao-real
Copy link
Copy Markdown
Contributor Author

补充说明:这轮我重新逐个检查了当前所有 multibuffer 相关 case,不只是看 runop.sh 通过与否,而是同时核对了:

  • insert-sync debug 中的 slotMode
  • eventIdNum / eventIds=[...]
  • 最终生成代码里到底是动态 wait_flag/set_flag(..., vN),还是静态 EVENT_IDx
  • 是否形成了真正按 slot 绑定的 lane,而不是普通 autosync 的常规 event 分配

结论

  • 当前正向 multibuffer case 没有发现“只是语法正确,但并行化同步没有真正生效”的情况。
  • 当前负向 case 也没有发现误触发 multibuffer 并行化的情况,都会保守回退到普通 autosync。

已真正形成并行化同步的 case

  • test_inject_sync_multibuf_subset_pingpong.py
    • slotMode=SELECTOR
    • eventIdNum=2
    • lane bundle 为 [0,1]
    • 最终生成动态 wait_flag/set_flag(..., vN)
  • multibuffer_subset_pingpong_a3.py
    • 与上面一致,A3 样例也是真正的 2-lane selector 并行
  • test_inject_sync_multibuf_subset_three_slot_selector.py
    • slotMode=SELECTOR
    • eventIdNum=3
    • lane bundle 为 [0,1,2]
  • test_inject_sync_multibuf_subset_group_selector.py
    • 两个 group 各自独立分配 lane
    • bundle 分别为 [0,1][2,3]
  • test_inject_sync_multibuf_subset_group_if_else.py
    • 两个 group 的 triple-buffer lane 各自独立
    • bundle 分别为 [0,1,2][3,4,5]
  • test/basic/issue517_subset_slot_binding_regression.pto
    • selector-mode 2-lane 回归仍然成立
  • test/basic/issue517_subset_unrolled_slot_binding_regression.pto
    • 这类不是 dynamic selector,而是 loop body 内顺序使用 ping/pong
    • 当前会形成两条独立单槽位回边
    • 分别固定绑定 event lane [0][1]
    • 这同样是有效的并行化同步,只是不是 *_dyn 形式

正确回退、没有形成 multibuffer 并行化的 case

  • test_inject_sync_multibuf_subset_group_mixed_selector.py
    • slotMode=NONE
    • max eventIdNum=1
  • test_inject_sync_multibuf_subset_overlap.py
    • slotMode=NONE
    • max eventIdNum=1
  • test_inject_sync_multibuf_subset_no_attr.py
    • slotMode=NONE
    • max eventIdNum=1
  • test_inject_sync_multibuf_subset_dynamic_offset.py
    • slotMode=NONE
    • max eventIdNum=1

一个容易混淆的点

  • 负向 case 的最终 C++ 里即使出现了多个 EVENT_IDx,也不等于 multibuffer 并行化生效。
  • 那些可能只是普通 autosync 给不同同步边分配的常规 event。
  • 真正的 multibuffer 证据还是要看:
    • debug 中是否有 slotMode=SELECTOR,或像 unrolled case 那样形成独立 SINGLE slot-local 链
    • 是否出现按 slot 绑定的 bundle,比如 [0,1][0,1,2]
    • 最终 wait/set 是否按该 lane 复用

这次检查覆盖的是当前 PR 范围内所有 multibuffer 相关 basic/sample case,结论可以概括为:

  • 正向 case:并行化同步已真正落地
  • 负向 case:未误判成 multibuffer
  • 未发现“只过语法、不生效”的正向 case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants