[TeleViz] Add QuadLayer + textured-quad pipeline + VizCudaArray#466
Conversation
QuadLayer is the first concrete layer type: a fullscreen-blit textured
quad sampling a CUDA-fed DeviceImage. Two submission paths:
- submit(VizBuffer) Mode A: caller's CUDA buffer is copied
into the layer's DeviceImage.
- acquire() / release() Mode B: caller writes into the layer's
tiled CUDA-Vulkan image directly. Zero
copy. acquire() returns VizCudaArray.
Two image-shape view types now sit side-by-side in viz/core:
- VizBuffer — linear pointer-backed memory (CPU bytes or CUDA
device pointer). Exposes __cuda_array_interface__
in Python for kDevice.
- VizCudaArray — opaque CUDA cudaArray_t (tiled GPU memory). Does
NOT expose __cuda_array_interface__ — that protocol
means pointer-backed memory and a cudaArray_t isn't
one. Used as Mode B return.
Pipeline machinery (sampler, descriptor set layout, pipeline layout,
VkPipeline, descriptor pool/set) lives inside QuadLayer. Pipelines
build using a process-wide VkPipelineCache added to VkContext for
driver-side compilation reuse.
Sync today is heavyweight: vkDeviceWaitIdle (wait for prior frame's
Vulkan reads) + cudaDeviceSynchronize (wait for our writes) inside
submit / release. Fine-grained CUDA-Vulkan binary semaphores ship
later when multi-frame parallelism actually matters.
VizSession::get_vk_context() exposes the underlying context (nullptr
after destroy) so layers can reach pipeline_cache(), cuda_device_id(),
etc. from session-driven code.
Validation rejects zero dimensions, null render pass, and non-kRGBA8
formats (kD32F would create a depth-aspect view that the textured-
quad pipeline can't sample as color).
Milestone tests (test_quad_milestone.cpp) run end-to-end CUDA →
Vulkan → readback round-trip in BOTH modes (Mode A submit + Mode B
acquire/release). 4-quadrant {0, 255}-only RGBA pattern survives the
sRGB attachment encoding because curve endpoints map to themselves.
40 unit + 34 GPU tests pass; unit tests also pass under ASAN+UBSAN.
Signed-off-by: Farbod Motlagh <fmotlagh@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds a new viz::QuadLayer that renders a CUDA-backed 2D texture as a fullscreen quad and supports two ingestion modes: copy-based submit(VizBuffer, cudaStream_t) and zero-copy acquire(cudaStream_t)/release(cudaStream_t) returning VizCudaArray. Removes DeviceImage::view(), introduces VizCudaArray (opaque cudaArray_t view), adds Vulkan/CUDA timeline-semaphore interop and CUDA-side wait/signal helpers on DeviceImage, exposes VkContext::pipeline_cache(), converts viz_layers to a STATIC library, and wires per-layer timeline semaphores into VizCompositor submission. Extensive unit and end-to-end tests and test helpers for CUDA–Vulkan interop were added. Sequence Diagram(s)sequenceDiagram
participant App as Application
participant Session as VizSession
participant Layer as QuadLayer
participant DevImg as DeviceImage
participant CUDA as CUDA Runtime
participant VK as Vulkan
Note over App,Layer: Mode A — submit(VizBuffer, stream)
App->>Layer: submit(VizBuffer, stream)
Layer->>CUDA: cudaSetDevice(ctx.cuda_id)
Layer->>DevImg: cuda_wait_for_vk_read(stream)
Layer->>CUDA: cudaMemcpy2DToArrayAsync(..., stream)
Layer->>DevImg: cuda_signal_write_done(stream)
Note over App,Layer: Mode B — acquire/release(stream)
App->>Layer: acquire(stream)
Layer->>CUDA: cudaSetDevice(ctx.cuda_id)
Layer->>DevImg: cuda_wait_for_vk_read(stream)
Layer-->>App: VizCudaArray (cudaArray_t)
App->>CUDA: cudaMemcpy2DToArray(..., stream)
App->>Layer: release(stream)
Layer->>DevImg: cuda_signal_write_done(stream)
Note over Session,Layer: Render + timeline semaphore submit
Session->>Layer: get_wait_semaphores()
Layer->>DevImg: provide cuda_done_writing() + value
Session->>VK: build VkTimelineSemaphoreSubmitInfo (waits/signals)
Session->>VK: vkQueueSubmit(...)
VK-->>Layer: submission success
Layer->>Layer: commit_pending_signals() (advance vk_done_reading value)
VK->>DevImg: Vulkan reads shader-sampled image during render
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/viz/layers_tests/cpp/test_quad_layer.cpp`:
- Around line 48-76: The three tests in test_quad_layer.cpp titled about
render_pass, resolution, and format never reach those checks because QuadLayer's
constructor first checks VkContext::is_initialized() on the default-constructed
VkContext; move the argument validation (render_pass, cfg.resolution,
cfg.format) to occur before the context initialization check in
QuadLayer::QuadLayer (so the ctor throws for invalid
render_pass/resolution/format even with an uninitialized VkContext), or
alternatively initialize a real VkContext in those TEST_CASEs and mark them as
[gpu] so the intended validations run; update references to QuadLayer::Config,
QuadLayer::QuadLayer, and VkContext::is_initialized() accordingly.
In `@src/viz/layers/cpp/quad_layer.cpp`:
- Around line 99-139: QuadLayer::destroy currently resets device_image_ but
leaves ctx_ and other state live, allowing later public methods (submit,
acquire, record) to operate on null resources; update destroy() to mark the
instance as destroyed (e.g., clear ctx_ or set a dedicated bool destroyed_),
ensure all Vulkan handles (descriptor_pool_, pipeline_, pipeline_layout_,
descriptor_set_layout_, sampler_) and device_image_ are nulled/cleared, and make
public entry points (submit, acquire, record, any public getters) check the
destroyed flag or ctx_ and return an explicit error or no-op instead of
dereferencing resources; also make destroy() idempotent so repeated calls are
safe and add tests for invalid input rejection, state-machine transitions,
exception recovery, and idempotent destroy as described.
- Around line 188-199: The acquire() method (QuadLayer::acquire) must set the
CUDA device before returning the VizCudaArray so subsequent CUDA ops run on the
correct context: call cudaSetDevice(ctx_->cuda_device_id()) at the start (or
immediately before returning) of QuadLayer::acquire(), mirroring the device
pinning done in submit() and release(); ensure you do this prior to exposing
device_image_->cuda_array() in the returned VizCudaArray and handle any
cudaSetDevice error consistently with existing error handling.
In `@src/viz/session_tests/cpp/test_quad_milestone.cpp`:
- Around line 35-49: The current gpu_available() helper inspects only Vulkan
devices via viz::VkContext::enumerate_physical_devices() which can falsely pass
when CUDA is missing; update the tests to use the canonical check
viz::testing::is_gpu_available() and early-exit with SKIP() when it returns
false (add the SKIP() call before any CUDA/Vulkan calls), or replace
gpu_available() usages with viz::testing::is_gpu_available(); ensure the test
functions that currently call gpu_available() (e.g., the functions referencing
gpu_available(), and any code paths that invoke CUDA after the gate) perform
SKIP() if viz::testing::is_gpu_available() is false.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: de5981b8-20a5-463c-8ffb-f3f63e900044
📒 Files selected for processing (17)
src/viz/AGENTS.mdsrc/viz/core/cpp/device_image.cppsrc/viz/core/cpp/inc/viz/core/device_image.hppsrc/viz/core/cpp/inc/viz/core/viz_types.hppsrc/viz/core/cpp/inc/viz/core/vk_context.hppsrc/viz/core/cpp/vk_context.cppsrc/viz/core_tests/cpp/test_device_image.cppsrc/viz/layers/cpp/CMakeLists.txtsrc/viz/layers/cpp/inc/viz/layers/quad_layer.hppsrc/viz/layers/cpp/quad_layer.cppsrc/viz/layers_tests/cpp/CMakeLists.txtsrc/viz/layers_tests/cpp/test_quad_layer.cppsrc/viz/session/cpp/inc/viz/session/viz_session.hppsrc/viz/session/cpp/viz_session.cppsrc/viz/session_tests/cpp/CMakeLists.txtsrc/viz/session_tests/cpp/test_quad_milestone.cppsrc/viz/shaders/cpp/textured_quad.frag
…DA in acquire Three review-driven fixes: 1. Constructor validation order: arg-shape checks (format, resolution, render_pass) now run before VkContext::is_initialized(). The unit tests for those rejection paths previously short-circuited on the ctx check and never reached their named check; with the new order each unit test exercises exactly the path it claims. 2. Use-after-destroy guard: submit / acquire / release / record now throw std::logic_error if device_image_ has been reset. Previously they would deref null resources and crash. Idempotent destroy stays a no-op. Added a [gpu] test covering the guard. 3. acquire() now calls cudaSetDevice on the calling thread before exposing the cudaArray_t, mirroring submit / release. Without this, a worker-thread caller's CUDA writes to the returned handle would target whatever device CUDA defaulted to. Also tightened the gpu_available() helpers in test_quad_layer and test_quad_milestone to probe CUDA, not just Vulkan — so a Vulkan- only machine skips cleanly instead of crashing in cudaImportExternalMemory. (The canonical viz::testing::is_gpu_available stays Vulkan-only since not all viz tests need CUDA.) 40 unit + 35 GPU tests pass; unit tests pass under ASAN+UBSAN. Signed-off-by: Farbod Motlagh <fmotlagh@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
…shared test helpers Folds the deferred fine-grained sync work into m3b plus the substantive review fixes that surfaced from it. DeviceImage: - Vulkan timeline semaphores (VK_KHR_timeline_semaphore, initial value 0) imported into CUDA as TimelineSemaphoreFd. No first-frame primer; timeline waits for >= 0 are trivially satisfied. - Counters split into reserve_*() / commit_*() pairs. reserve_* returns the next monotonic value; commit_* advances the public value via monotonic-max only after the signal has been queued successfully. A failed cuda*Async or vkQueueSubmit no longer poisons the timeline. - DeviceImage exposes small CUDA-side primitives (cuda_wait_for_vk_read, cuda_signal_write_done) so QuadLayer orchestrates them rather than reimplementing the calls. QuadLayer: - acquire/release/submit/record state machine via atomic acquired_. Single-producer-thread contract documented in the header (multi- producer feeding the same layer is undefined; use multiple layers). Catches double-acquire, release-without-acquire, submit- while-acquired, record-while-acquired as std::logic_error. - submit / acquire / release accept an optional cudaStream_t (default 0). Producers can pass their own stream so the wait/copy/signal sequence is correctly ordered after the producer's prior work on that stream. LayerBase + VizCompositor: - Layers reserve signal values via get_signal_semaphores() but don't commit until VizCompositor calls commit_pending_signals() after vkQueueSubmit returns success. A failed submit leaves counters un-advanced. - VizCompositor snapshots the visible-layer set ONCE at the top of render() and uses that snapshot for record / sema-collect / commit. A mid-frame is_visible() toggle would otherwise let a layer record draws but skip semaphore wiring (or vice versa) and desync the timeline counters. Tests: - Canonical viz::testing::is_cuda_vulkan_interop_available() probe in test_helpers.hpp checks Vulkan-CUDA UUID overlap (mirroring VkContext::init's requirement). New viz::test_support INTERFACE CMake target so layers_tests / session_tests share one implementation; the duplicated gpu_available() helpers are gone. - New tests: state-machine rejections (double-acquire, release-without-acquire, submit-while-acquired, release-after- release), non-default-stream submit, multi-frame submit/render/ readback loop (16 frames, varying solid colors) verifying pipelining correctness. 40 unit + 38 GPU tests pass; unit tests pass under ASAN+UBSAN. Signed-off-by: Farbod Motlagh <fmotlagh@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Lands the M3c review findings that were merged into the m3b commit
description but not actually present in the code: the producer side
of QuadLayer was still a single atomic<bool>, DeviceImage was still
single-format-per-image, and the milestone tests didn't exercise the
sRGB color-space round-trip.
QuadLayer:
- Replace acquired_ with std::atomic<ProducerState> (Idle / Submitting
/ Acquired). submit() and acquire() use compare_exchange_strong to
enter their states; submit() takes an RAII guard so an exception
doesn't strand the state machine. record() rejects unless Idle, so
a Mode A submit in flight or a Mode B acquire-without-release can't
race with sampling. Error messages quote the observed state so
misuse is easy to triage.
- get_signal_semaphores() now refuses to reserve a new vk_done_reading
value if the previous reservation hasn't been committed. That would
orphan the first reservation and leave Vulkan signaling a value
whose public counter never advances — silently breaking future
CUDA waits.
DeviceImage:
- Storage VkImage is created with VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT
in UNORM, sampling view in SRGB. CUDA writes raw bytes; the sampler
decodes sRGB->linear; the SRGB color attachment encodes on write;
net round-trip is identity for arbitrary RGBA byte values.
- create() now hard-rejects PixelFormat::kD32F. The depth-image
interop contract (sample semantics, layout transitions, color-space
view) isn't designed yet — better to refuse than half-implement it
until ProjectionLayer ships.
- vk_format() now returns the SRGB sampling view format (the format
layers and tests actually care about); storage format is internal.
Tests:
- test_quad_layer: rename "Mode B state machine" → "producer state
machine" and add a record-while-Acquired case using a real cmd
buffer (the state check fires before any vk command is issued).
Use a real cudaMalloc'd device pointer so the submit() rejection
path is reached for state reasons, not arg-validation reasons.
- test_quad_milestone: add a midtone-RGBA round-trip case (64, 128,
200, 255). The {0, 255}-only end-to-end tests can't catch a wrong
color-space wiring because those endpoints are fixed points of any
gamma curve.
- test_device_image: vk_format() check updated for the SRGB view
format with a comment pointing at the storage/view split.
Build / sanitizers:
- All 79 viz tests (40 unit + 39 gpu) pass locally.
- All 40 viz unit tests pass under ASAN+UBSAN (CI's sanitizer gate).
Co-authored-by: Cursor <cursoragent@cursor.com>
Mode A's submit() now routes into one of three CUDA-Vulkan interop slots; record() promotes the most recent publish at frame start. The producer never blocks and the renderer always samples the latest finished frame, regardless of how producer rate compares to render rate. The bidirectional vk_done_reading sync is gone — with three slots the producer never targets a slot the renderer is currently sampling, so it isn't needed. DeviceImage: - Drop vk_done_reading semaphore + reserve/commit/cuda_wait_for_vk_read. - Keep cuda_done_writing as the one producer->consumer signal. - Inline the reserve/commit dance into cuda_signal_write_done() (no external callers anymore). - Comments reflect the one-direction sync. LayerBase: - Drop SignalSemaphore, get_signal_semaphores(), commit_pending_signals(). No layer needs the compositor to signal a producer-facing semaphore anymore. - get_wait_semaphores() stays — that's how QuadLayer hands the fragment-shader wait on cuda_done_writing[in_use] to the compositor. QuadLayer: - 3 DeviceImage slots, 3 descriptor sets (one per slot's image view), one shared sampler / pipeline / pipeline layout. - Two atomic uint8_t indices (latest_, in_use_) with a kSlotNone sentinel for the pre-first-publish frame; record() skips the draw cleanly when nothing has been published. - submit() picks the unique slot that is neither latest_ nor in_use_, cudaMemcpyAsync into it, signal cuda_done_writing, store latest_. No state machine, no RAII guards, no producer/render reconciliation. - record() promotes latest_ -> in_use_ at frame start; previous in_use_ becomes free for the next submit(). get_wait_semaphores() reads in_use_ (set by record(), which the compositor calls first). - VizCudaArray gone. cudaArray_t is internal to the mailbox now. Producers always present pixels through VizBuffer. VizCompositor: - Drop the signal-semaphore collection + commit_pending_signals fan-out. The submit-info side only needs waits. - visible_layers snapshot stays — still important so a mid-frame visibility toggle doesn't desync record() from get_wait_semaphores(). Tests: - test_quad_layer: drop the Mode B / ProducerState / acquire-release fixtures. Add a "back-to-back submits cycle through mailbox slots" test that confirms cuda_done_writing counters advance once per submit across the three slots. Add a kSlotCount out-of-range check on the diagnostic device_image(slot) accessor. - test_quad_milestone: drop the Mode B end-to-end test. Mode A submit/render/readback, multi-frame palette loop, and midtone sRGB round-trip all stay (and now exercise the mailbox path). - test_device_image: no changes needed — vk_format(), idempotent destroy, and the byte-pattern round-trip don't depend on the dropped APIs. Net diff: -311 lines across QuadLayer + DeviceImage + LayerBase + tests + viz/AGENTS.md. A future zero-copy acquire/release variant is documented under "Future: zero-copy acquire/release" in DESIGN.md for revisit when a real producer (NVDEC, custom CUDA kernel) wants it. Co-authored-by: Cursor <cursoragent@cursor.com>
Three follow-ups from the triple-buffer mailbox review. Documentation: - quad_layer.hpp: note that mailbox correctness depends on VizCompositor::render() being synchronous (frame_sync_->wait() at end of frame). Multi-frame-in-flight would require in_use_ to become per-in-flight-frame. - viz_compositor.cpp: cross-reference the dependency on the trailing frame_sync_->wait() so anyone touching that wait sees what depends on it. DeviceImage: - Drop the dead CAS loop in cuda_signal_write_done. Single producer per DeviceImage means reserved is always > the previously committed _value_, so a release-store suffices. The reserve/commit split via cuda_done_writing_next_ stays — it's still what isolates a failed signal from advancing the public counter (and avoids reusing a timeline value on retry, which is UB). Tests (test_quad_milestone.cpp): - "with no submit yet renders the clear color" — pins the kSlotNone short-circuit in record() / get_wait_semaphores(). Configures a green clear and confirms readback is green when no submit() has run. - "re-renders the same publish when no new submit arrives" — pins in_use_ stability across frames when latest_ doesn't change. - "fast producer: render samples only the latest publish" — pins the core mailbox guarantee: 5 back-to-back submits with distinct colors, one render, readback equals only the last submit's color. All 40 unit tests pass; the 3 new [gpu] tests register and skip cleanly on a no-CUDA-Vulkan-interop machine via the existing is_cuda_vulkan_interop_available() gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… scrub Mode B refs
Follow-up review-pass on the triple-buffer mailbox MR.
device_image.cpp:
- Inline the create_one lambda in create_interop_semaphores. It was
factored out to dedup two semaphores (cuda_done_writing +
vk_done_reading); after Mode B and the back-channel sync were
dropped, only one semaphore is created, so the closure adds nothing.
test_helpers.hpp:
- Collapse is_gpu_available() (Vulkan-only) and
is_cuda_vulkan_interop_available() (Vulkan + CUDA UUID-overlap)
into a single strict is_gpu_available(). The "Vulkan-only" gate
was actually too loose: VkContext::init() requires the UUID match,
so any [gpu] test that uses GpuFixture would pass the loose check
and then throw inside match_cuda_device_to_vulkan(). One canonical
probe matches the actual init-time contract.
test_quad_layer.cpp / test_quad_milestone.cpp / test_offscreen_render.cpp:
- Drop the per-file `gpu_available()` wrappers that just forwarded
to one of the two helpers. Call sites now use a `using
viz::testing::is_gpu_available;` declaration plus
`is_gpu_available()` directly. Behavior is unchanged; the offscreen
tests get a slightly stricter gate (CUDA UUID required), which
matches what GpuFixture already enforced at init.
CMakeLists.txt (core_tests/cpp):
- Update the helper-list comment to reflect the single probe.
viz_buffer.hpp:
- Comment talked about "Mode B submission (acquire/release)" — that
API is gone. Reword to describe submit() (caller-owned source
buffer copied into the layer's interop slot) and host-readback
views.
test_quad_milestone.cpp:
- Comment in the midtone test still said "Mode A / Mode B tests".
Reword to plain "{0,255}-only round-trip tests".
Net: -47 lines across 7 files. 40/40 unit tests pass; [gpu] tests
register and skip cleanly without GPU.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Do we need to change the MR summary? I see acquire() and release() was removed from QuadLayer |
QuadLayer is the first concrete layer type: a fullscreen-blit textured quad sampling a CUDA-fed DeviceImage. Two submission paths:
into the layer's DeviceImage.
tiled CUDA-Vulkan image directly. Zero
copy. acquire() returns VizCudaArray.
Two image-shape view types now sit side-by-side in viz/core:
device pointer). Exposes cuda_array_interface
in Python for kDevice.
NOT expose cuda_array_interface — that protocol
means pointer-backed memory and a cudaArray_t isn't
one. Used as Mode B return.
Pipeline machinery (sampler, descriptor set layout, pipeline layout, VkPipeline, descriptor pool/set) lives inside QuadLayer. Pipelines build using a process-wide VkPipelineCache added to VkContext for driver-side compilation reuse.
Summary by CodeRabbit
New Features
Tests
Documentation