WebGPU RT: complete the wavefront rewrite (single deliverable — remaining phases) #3

New issue

Closed

opened 2026-05-31 17:52:56 +02:00 by jorijnvdgraaf · 0 comments

jorijnvdgraaf commented

2026-05-31 17:52:56 +02:00

Owner

Scope: remaining wavefront rewrite — to be delivered as ONE PR

This tracks the remainder of the WebGPU RT wavefront rewrite planned in #1. PR #2 landed
only Phase 3 (TLAS coherence / bitonic sort) in isolation. Everything below is what is left.

Important

This is a single deliverable. All work items below must land together in one PR. Do
not split this into per-phase PRs, do not open a PR that implements "Phase 4 only" or
"the benchmark only," and do not mark this issue done until the wavefront tracer is fully
wired end-to-end and the many-instance benchmark renders correctly through the new pipeline.
The phases below are an implementation order to de-risk the work, not a list of separately
shippable chunks. Partial delivery is explicitly out of scope — the megakernel is not removed
and nothing is faster until the whole chain (GENERATE→TRACE→SHADE→RESOLVE + bounce loop) is in
place, so a half-rewrite leaves the codebase in a worse, dual-path state than it is today.

The goal is unchanged from #1: replace the single megakernel software ray tracer with a
wavefront / streaming tracer (GENERATE → TRACE → SHADE → RESOLVE, GPU-driven indirect
bounce loop, TRACE kernel containing zero user code for high SM occupancy). Target: 60fps on
a 4090 for a many-instance (3DForts-style) scene with primary + shadow rays.

Already done (PR #2 — do not redo)

Phase 3 — TLAS coherence. LBVH spatial sort re-enabled via a data-oblivious workgroup
bitonic network in lbvhBuildMain. The TLAS now has Morton spatial coherence.

Remaining work — all of the following, in one PR

Measurement harness (plan Phase 0) — build first.
- New examples/RTStress/: an N×N×N grid of a small mesh, instance count a compile/runtime
  knob (512 → ~8000), primary + shadow ray. This is the standing many-instance benchmark.
- GPU timestamp queries (timestamp-query feature) around each pass + a frame-time
  HUD/console line. Capture the baseline megakernel number before the rewrite so the
  before/after delta is quantified in the PR.
Indirect-dispatch plumbing (plan Phase 2). dispatchWorkgroupsIndirect + INDIRECT
buffer usage + a toy "emit N → dispatch N" round-trip to prove cross-pass atomic visibility
on the target Dawn build. No precedent in the repo — this is the highest-uncertainty piece;
prove it before building real kernels on it.
Megakernel split — GENERATE / TRACE / SHADE / RESOLVE (plan Phase 4). Split the one
@compute megakernel into four entry-point modules in
implementations/Crafter.Graphics-PipelineRTWebGPU.cpp sharing the SBT switches. TRACE
contains zero user code (traversal + intersection only). Bring up at maxDepth=1
(primary only, no emit) and validate against VulkanTriangle. Move the Payload-typed storage
binding into the codegen region (after the user's struct Payload). Define the ray/hit/
payload buffers (2×W·H double-buffered, capacity guard = 1 continuation ray/pixel).
GPU-driven bounce loop + emit/accumulate API break (plan Phase 5). Wire the unrolled
GENERATE; PREP; (TRACE; SHADE; PREP)×maxDepth; RESOLVE chain with rtEmitRay /
rtAccumulate / rtEmitPrimaryRay. Each kernel its own beginComputePass. Breaking API
change (preferred): raygen emits a primary ray; closesthit/miss run in SHADE and emit
continuation/shadow rays + accumulate. Port the Sponza shaders (raygen.wgsl,
closesthit.wgsl, miss.wgsl) to the new model and validate the image matches.
Ordered (nearest-child-first) traversal (plan Phase 6). Add _rtAabbT (entry-t); in
both _rtTraverseBlas and _rtTraverseTlas, descend the nearer child and push the farther
only if t < bestT. Measure the delta on RTStress.
Dynamic TLAS tree depth (deferred in PR #2). N_PADDED = next_pow2(N_real) per build so
descent depth tracks real instance count instead of fixed 14. Couples the build and trace
shaders — include it here.
Binding packing (plan Phase 7) — conditional. Only if a target device reports <12
storage buffers in TRACE/SHADE: merge tlasEntryOrder into BVH pad words and unify
vertices/indices/primRemap into one u32 heap. Skip if the 4090/Dawn target reports ≥12.

Device limits

Extend the clamp(...) block in additional/dom-webgpu.js to also request maxBufferSize,
maxStorageBufferBindingSize (payloadStore ≈ 130 MB at 1080p, over the 128 MB baseline), and
maxComputeWorkgroupsPerDimension (4K 1D dispatch; or de-linearize to 2D in PREP).

Definition of done (all required, in the one PR)

The megakernel RT path is replaced, not duplicated — no dead dual path left behind.
RTStress renders correctly through the wavefront pipeline with no flicker as instance count
scales, and the PR reports fps vs instance count on the 4090, before vs after, plus the
TRACE-kernel occupancy/register delta (the core justification).
Sponza and VulkanTriangle render bit-comparable (within tonemap tolerance) to the pre-rewrite
megakernel output.
Timestamp-query per-pass breakdown is included in the PR description.

Refs #1. Supersedes the "remaining phases" follow-up noted in PR #2's scope note.

## Scope: remaining wavefront rewrite — to be delivered as ONE PR This tracks the **remainder** of the WebGPU RT wavefront rewrite planned in #1. PR #2 landed only **Phase 3 (TLAS coherence / bitonic sort)** in isolation. Everything below is what is left. > [!IMPORTANT] > **This is a single deliverable.** All work items below must land **together in one PR**. Do > **not** split this into per-phase PRs, do **not** open a PR that implements "Phase 4 only" or > "the benchmark only," and do **not** mark this issue done until the wavefront tracer is fully > wired end-to-end and the many-instance benchmark renders correctly through the new pipeline. > The phases below are an *implementation order to de-risk the work*, **not** a list of separately > shippable chunks. Partial delivery is explicitly out of scope — the megakernel is not removed > and nothing is faster until the whole chain (GENERATE→TRACE→SHADE→RESOLVE + bounce loop) is in > place, so a half-rewrite leaves the codebase in a worse, dual-path state than it is today. The goal is unchanged from #1: replace the single megakernel software ray tracer with a **wavefront / streaming** tracer (`GENERATE → TRACE → SHADE → RESOLVE`, GPU-driven indirect bounce loop, TRACE kernel containing zero user code for high SM occupancy). Target: **60fps on a 4090 for a many-instance (3DForts-style) scene** with primary + shadow rays. ## Already done (PR #2 — do not redo) - **Phase 3 — TLAS coherence.** LBVH spatial sort re-enabled via a data-oblivious workgroup bitonic network in `lbvhBuildMain`. The TLAS now has Morton spatial coherence. ## Remaining work — all of the following, in one PR 1. **Measurement harness (plan Phase 0) — build first.** - New `examples/RTStress/`: an N×N×N grid of a small mesh, instance count a compile/runtime knob (512 → ~8000), primary + shadow ray. This is the standing many-instance benchmark. - GPU **timestamp queries** (`timestamp-query` feature) around each pass + a frame-time HUD/console line. Capture the **baseline megakernel** number before the rewrite so the before/after delta is quantified in the PR. 2. **Indirect-dispatch plumbing (plan Phase 2).** `dispatchWorkgroupsIndirect` + `INDIRECT` buffer usage + a toy "emit N → dispatch N" round-trip to prove cross-pass atomic visibility on the target Dawn build. No precedent in the repo — this is the highest-uncertainty piece; prove it before building real kernels on it. 3. **Megakernel split — GENERATE / TRACE / SHADE / RESOLVE (plan Phase 4).** Split the one `@compute` megakernel into four entry-point modules in `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp` sharing the SBT switches. TRACE contains **zero user code** (traversal + intersection only). Bring up at `maxDepth=1` (primary only, no emit) and validate against VulkanTriangle. Move the `Payload`-typed storage binding into the codegen region (after the user's `struct Payload`). Define the ray/hit/ payload buffers (`2×W·H` double-buffered, capacity guard = 1 continuation ray/pixel). 4. **GPU-driven bounce loop + emit/accumulate API break (plan Phase 5).** Wire the unrolled `GENERATE; PREP; (TRACE; SHADE; PREP)×maxDepth; RESOLVE` chain with `rtEmitRay` / `rtAccumulate` / `rtEmitPrimaryRay`. Each kernel its own `beginComputePass`. **Breaking API change** (preferred): raygen emits a primary ray; closesthit/miss run in SHADE and emit continuation/shadow rays + accumulate. **Port the Sponza shaders** (`raygen.wgsl`, `closesthit.wgsl`, `miss.wgsl`) to the new model and validate the image matches. 5. **Ordered (nearest-child-first) traversal (plan Phase 6).** Add `_rtAabbT` (entry-t); in both `_rtTraverseBlas` and `_rtTraverseTlas`, descend the nearer child and push the farther only if `t < bestT`. Measure the delta on RTStress. 6. **Dynamic TLAS tree depth (deferred in PR #2).** `N_PADDED = next_pow2(N_real)` per build so descent depth tracks real instance count instead of fixed 14. Couples the build and trace shaders — include it here. 7. **Binding packing (plan Phase 7) — conditional.** Only if a target device reports <12 storage buffers in TRACE/SHADE: merge `tlasEntryOrder` into BVH pad words and unify vertices/indices/primRemap into one `u32` heap. Skip if the 4090/Dawn target reports ≥12. ### Device limits Extend the `clamp(...)` block in `additional/dom-webgpu.js` to also request `maxBufferSize`, `maxStorageBufferBindingSize` (payloadStore ≈ 130 MB at 1080p, over the 128 MB baseline), and `maxComputeWorkgroupsPerDimension` (4K 1D dispatch; or de-linearize to 2D in PREP). ## Definition of done (all required, in the one PR) - The megakernel RT path is **replaced**, not duplicated — no dead dual path left behind. - RTStress renders correctly through the wavefront pipeline with no flicker as instance count scales, and the PR reports **fps vs instance count on the 4090, before vs after**, plus the TRACE-kernel occupancy/register delta (the core justification). - Sponza and VulkanTriangle render bit-comparable (within tonemap tolerance) to the pre-rewrite megakernel output. - Timestamp-query per-pass breakdown is included in the PR description. Refs #1. Supersedes the "remaining phases" follow-up noted in PR #2's scope note.

jorijnvdgraaf added the

claude:ready

label

2026-05-31 17:53:35 +02:00