WebGPU RT: wavefront rewrite of the software ray tracer #1

Closed
opened 2026-05-31 16:53:15 +02:00 by jorijnvdgraaf · 1 comment

Plan: Wavefront rewrite of the WebGPU software ray tracer

Context

Crafter.Graphics' WebGPU backend is a software ray tracer (native Vulkan uses the
4090's hardware RT cores; WebGPU has none). It currently runs as a single megakernel
compute shader — one @compute @workgroup_size(8,8,1) kernel, one thread per pixel, that
inlines raygen → TLAS traverse → BLAS traverse → the entire SBT (runClosestHit/runAnyHit/
runMiss) → shadow rays. This is "several orders of magnitude slower than native," far worse
than the gap physics requires.

Honest target assessment: a well-tuned software tracer on a 4090 typically lands ~5–20×
behind hardware-RT, not the ~100× seen now. The extra ~50× is recoverable engineering. 60fps
on a 4090 for a many-instance (3DForts-style) scene with primary + shadow rays is achievable

— but only by restructuring away from the megakernel. The user has chosen a full wavefront
rewrite now
, benchmarked against a many-instance scene (the exact case the disabled LBVH
sort breaks at >4000 instances).

Root causes of the current gap (verified in code):

  1. Megakernel occupancy collapse — the kernel's register footprint is the max over all
    paths (two in-register traversal stacks + full Payload + every closesthit's locals + the
    SBT switch). High register pressure → low SM occupancy → memory latency can't be hidden.
    This is the dominant multiplier. (PipelineRTWebGPU.cpp:180-185, traversal at
    dom-webgpu.js:1479-1672).
  2. Unordered BVH traversal_rtTraverseBlas/_rtTraverseTlas descend left-first, never
    nearest-child-first, so bestT tightens slowly → excess triangle tests
    (dom-webgpu.js:1546-1550, :1636-1642).
  3. LBVH sort disabled — the TLAS radix sort is gated behind if (false)
    (dom-webgpu.js:2203, TODO-lbvh-sort.md),
    so TLAS leaves have no spatial coherence (loose AABBs). Fatal at thousands of instances.
  4. Padded TLAS tree — implicit perfect binary tree over TLAS_BVH_N_PADDED=16384 leaves;
    ~thousands of real instances means full 14-level descent through mostly-sentinel structure
    (dom-webgpu.js:1405-1406).
  5. No measurement — there is no benchmark scene and no GPU timing. 3DForts is not in this
    repo (external consumer). "Optimize to the limit" is meaningless without a harness.

The vkQueueWaitIdle-per-frame finding from early exploration is on the native Vulkan
window path, not WebGPU. The WebGPU frame is just queue.submit. Not in scope.

Goal

Replace the megakernel with a wavefront / streaming path tracer: separate GENERATE → TRACE → SHADE → RESOLVE kernels connected by GPU-resident ray/hit/payload buffers, with a GPU-driven
bounce loop (indirect dispatch, no CPU readback). The TRACE kernel contains zero user code
— traversal + intersection only — for high SM occupancy (the core win). Fix TLAS coherence
(bitonic sort) and add ordered traversal so the many-instance benchmark is meaningful.

This is a breaking change to the WebGPU RT shader API (preferred per project policy): the
synchronous "raygen calls traceRay, inspects the hit, traces a shadow ray" pattern cannot
survive without continuations. Raygen now emits a primary ray; closesthit/miss run in SHADE
and may emit continuation/shadow rays + accumulate color — the standard DXR/wavefront idiom.
(Native Vulkan RT path is untouched.)

Critical files

Design

Stages (all compute, recorded once per frame on state.encoder)

  • GENERATE — 2D (⌈W/8⌉,⌈H/8⌉), one thread/pixel. User raygen_main(pixel) calls
    rtEmitPrimaryRay(...) + inits payload; writes RayState/RayMeta/payload slot y*W+x,
    zeroes pixelAccum.
  • TRACE — 1D over live ray count via dispatchWorkgroupsIndirect, @workgroup_size(64).
    No user code. Reads compacted ray list + accel buffers, writes HitState. Holds only the
    traversal stacks + working ray → minimal registers → high occupancy. v1 = opaque-only
    (current Sponza is all-opaque; defer anyhit to a later variant).
  • SHADE — 1D over live ray count. Hosts all user code (the runClosestHit/runMiss
    switches move here). Reads HitState+payload, dispatches by hit.t < tMax, rtAccumulates
    color, and rtEmitRays continuation/shadow rays (atomic-append + compaction into the
    next-bounce buffer).
  • PREP-INDIRECT — 1D, single workgroup. Converts the atomic ray counter written by
    GENERATE/SHADE into dispatchWorkgroupsIndirect args (⌈count/64⌉,1,1) and resets the
    consumed counter. Must be its own kernel: WGSL atomics are only globally coherent across a
    dispatch boundary, so count→args conversion can't happen inside SHADE.
  • RESOLVE — 2D per pixel. pixelAccum → ping/pong output texture (preserving
    state.outIsPing); tonemap/gamma move here from raygen.

GPU-driven bounce loop (no per-bounce CPU readback)

Unrolled at command-record time to a fixed maxDepth (Sponza needs 2 = primary + shadow):

GENERATE; PREP(0)
for b in 0..maxDepth-1:
    TRACE  <indirect args[b%2]>
    SHADE  <indirect args[b%2]>     // appends rays to (b+1)%2, atomicAdd count[(b+1)%2]
    PREP(b+1)                       // count[(b+1)%2] -> args; reset count[b%2]=0
RESOLVE

Each kernel is its own beginComputePass/end (indirect args from one pass must be visible to
the next — the same reason TLAS build ends/reopens the pass at
dom-webgpu.js:2602). Zero-ray rounds dispatch (0,1,1) → near-free.

Buffers

  • RayState (32B) + RayMeta (16B: pixelIndex, flags, missIndex, payloadSlot) — double-buffered (ping/pong per bounce), sized 2×W·H only.
  • HitState (96B, matches HitInfo at dom-webgpu.js:1293) — single buffer reused per bounce.
  • payloadStore — bound as array<Payload> (the user's type), so size/layout are
    compiler-enforced; no fixed-size cap or bitcast. Requires moving the binding text into the
    codegen region after the user's Payload decl.
  • rayCount (2× atomic<u32>) + indirectArgs (2× 3-u32, usage STORAGE | INDIRECT).
  • Capacity guard: rtEmitRay enforces a 1 continuation ray per pixel budget (shadow
    rays excepted), so peak live rays fit 2×W·H. Overflow rays are dropped (graceful).

API break — Sponza rewritten

  • Payload gains throughput: vec3<f32>, rayKind: u32 (PRIMARY/SHADOW), pixelIndex: u32.
  • raygen_main: build pinhole ray, set throughput=1, rtEmitPrimaryRay(...).
  • closesthit_main (PRIMARY): existing UV/normal/albedo work; rtAccumulate(ambient); if
    nDotL>0, rtEmitRay(shadow ray, SKIP_CLOSEST_HIT|TERMINATE_ON_FIRST_HIT, throughput=lit term).
  • miss_main: SHADOW → rtAccumulate(throughput) (sun visible); PRIMARY → rtAccumulate(sky).
  • Occluded shadow rays hit with SKIP_CLOSEST_HIT → contribute nothing → correct shadow.

Coherence (mandatory for the many-instance benchmark)

  • Replace the dead radix sort with a workgroup bitonic sort (TODO strategy #5). Bitonic is
    data-oblivious — its barriers/access pattern are independent of the histogram distribution,
    which is exactly the "count-dependent" axis the TODO bisected the bug to.
    ~105 compare-exchange passes over 16384 keys; in-budget. Same (morton16<<16)|index16 keys;
    0xFFFFFFFF sentinels sort to the end.
  • Dynamic tree depth: set N_PADDED = next_pow2(N_real) per build so descent depth tracks
    real instance count instead of fixed 14. (Packed Karras BVH is a later refinement iff
    profiling still shows TLAS dominating.)

Ordered traversal (TRACE)

Add _rtAabbT returning entry-t. At each internal node compute both children's entry-t, descend
the nearer child, push the farther only if t < bestT. Tightens bestT faster → fewer triangle
tests. Applies to both _rtTraverseBlas and _rtTraverseTlas.

Device limits (extend the existing clamp(...) block at dom-webgpu.js:131)

maxStorageBuffersPerShaderStage=16 is already requested (TRACE/SHADE need 12 — fits). Add
requests for
maxBufferSize, maxStorageBufferBindingSize (payloadStore at 1080p ≈ 130 MB,
over the 128 MB baseline), and maxComputeWorkgroupsPerDimension (needed for 4K 1D dispatch;
or de-linearize to 2D in PREP).

Implementation order (de-risk highest-uncertainty first)

  1. Phase 0 — Measurement harness (do first; gates everything).
    • New examples/RTStress/: N×N×N grid of a small mesh, instance count a compile/runtime knob
      (512 → ~8000), primary + shadow ray. This is the standing many-instance benchmark.
    • Add GPU timestamp queries (timestamp-query feature) around each pass + a frame-time
      HUD/console line. Record the baseline megakernel number to quantify the gap.
  2. Indirect-dispatch plumbing in isolationdispatchWorkgroupsIndirect + INDIRECT
    usage + a toy "emit N → dispatch N" round-trip. Biggest unknown (no precedent in repo); prove
    cross-pass atomic visibility on the target Dawn build before building real kernels on it.
  3. TLAS coherence — bitonic sort + dynamic depth. Independent of the rewrite; validate against
    the existing megakernel image, measure TLAS-traversal fraction before/after.
  4. Split megakernel → GENERATE/TRACE/SHADE/RESOLVE at maxDepth=1 (primary only, no emit).
    Validate stage decomposition + binding budget against VulkanTriangle (single trace).
  5. GPU-driven bounce loop + rtEmitRay/rtAccumulate; port Sponza to the emit model.
    Validate against the current Sponza image.
  6. Nearest-child-first traversal — pure optimization; measure delta.
  7. Pack bindings (merge tlasEntryOrder into BVH pad words; unify vertices/indices/primRemap
    into one u32 heap) only if a target device reports <12 storage buffers.

Optional, secondary (UI — user is skeptical it's the bottleneck)

The UI compute path is O(W·H·N) (each tile thread iterates every item,
UI.cpp:110-131). Cheap win: per-tile item
binning so each tile iterates only overlapping items (~O(W·H·k)). Defer until RT numbers land.

Verification

  • Correctness: Sponza and VulkanTriangle render bit-comparable (within tonemap tolerance) to
    the current megakernel output after Phase 5. RTStress shows no flicker as instance count scales
    (the TODO acceptance criterion: CPU-oracle sort vs GPU output for
    all-uniform, all-one-bucket, and "small object next to tight cluster" distributions).
  • Performance: chart fps vs instance count on the 4090 for RTStress, before (megakernel) and
    after each phase. Primary metric: 60fps at the target instance count. Secondary: TRACE
    kernel register count / occupancy (the rewrite's core justification), per-bounce live-ray
    counts (validate the 2×W·H budget + emit-drop rate), per-pass GPU time via timestamp queries.
  • How to run: build the wasm32 target (project.cpp selects DOM backend on wasm32-*),
    serve examples/RTStress, read the timing HUD in-browser on the 4090.
# Plan: Wavefront rewrite of the WebGPU software ray tracer ## Context Crafter.Graphics' WebGPU backend is a **software ray tracer** (native Vulkan uses the 4090's hardware RT cores; WebGPU has none). It currently runs as a **single megakernel** compute shader — one `@compute @workgroup_size(8,8,1)` kernel, one thread per pixel, that inlines raygen → TLAS traverse → BLAS traverse → the entire SBT (`runClosestHit`/`runAnyHit`/ `runMiss`) → shadow rays. This is "several orders of magnitude slower than native," far worse than the gap physics requires. **Honest target assessment:** a well-tuned *software* tracer on a 4090 typically lands ~5–20× behind hardware-RT, not the ~100× seen now. The extra ~50× is recoverable engineering. **60fps on a 4090 for a many-instance (3DForts-style) scene with primary + shadow rays is achievable** — but only by restructuring away from the megakernel. The user has chosen a **full wavefront rewrite now**, benchmarked against a **many-instance scene** (the exact case the disabled LBVH sort breaks at >4000 instances). **Root causes of the current gap (verified in code):** 1. **Megakernel occupancy collapse** — the kernel's register footprint is the max over *all* paths (two in-register traversal stacks + full `Payload` + every closesthit's locals + the SBT switch). High register pressure → low SM occupancy → memory latency can't be hidden. This is the dominant multiplier. (`PipelineRTWebGPU.cpp:180-185`, traversal at [dom-webgpu.js:1479-1672](additional/dom-webgpu.js#L1479-L1672)). 2. **Unordered BVH traversal** — `_rtTraverseBlas`/`_rtTraverseTlas` descend left-first, never nearest-child-first, so `bestT` tightens slowly → excess triangle tests ([dom-webgpu.js:1546-1550](additional/dom-webgpu.js#L1546-L1550), [:1636-1642](additional/dom-webgpu.js#L1636-L1642)). 3. **LBVH sort disabled** — the TLAS radix sort is gated behind `if (false)` ([dom-webgpu.js:2203](additional/dom-webgpu.js#L2203), [TODO-lbvh-sort.md](TODO-lbvh-sort.md)), so TLAS leaves have **no spatial coherence** (loose AABBs). Fatal at thousands of instances. 4. **Padded TLAS tree** — implicit perfect binary tree over `TLAS_BVH_N_PADDED=16384` leaves; ~thousands of real instances means full 14-level descent through mostly-sentinel structure ([dom-webgpu.js:1405-1406](additional/dom-webgpu.js#L1405-L1406)). 5. **No measurement** — there is no benchmark scene and no GPU timing. 3DForts is *not* in this repo (external consumer). "Optimize to the limit" is meaningless without a harness. > The `vkQueueWaitIdle`-per-frame finding from early exploration is on the **native Vulkan** > window path, not WebGPU. The WebGPU frame is just `queue.submit`. Not in scope. ## Goal Replace the megakernel with a **wavefront / streaming** path tracer: separate `GENERATE → TRACE → SHADE → RESOLVE` kernels connected by GPU-resident ray/hit/payload buffers, with a GPU-driven bounce loop (indirect dispatch, no CPU readback). The **TRACE** kernel contains *zero* user code — traversal + intersection only — for high SM occupancy (the core win). Fix TLAS coherence (bitonic sort) and add ordered traversal so the many-instance benchmark is meaningful. This is a **breaking change** to the WebGPU RT shader API (preferred per project policy): the synchronous "raygen calls `traceRay`, inspects the hit, traces a shadow ray" pattern cannot survive without continuations. Raygen now *emits* a primary ray; closesthit/miss run in SHADE and may *emit* continuation/shadow rays + *accumulate* color — the standard DXR/wavefront idiom. (Native Vulkan RT path is untouched.) ## Critical files - [implementations/Crafter.Graphics-PipelineRTWebGPU.cpp](implementations/Crafter.Graphics-PipelineRTWebGPU.cpp) — codegen. Splits the one megakernel into **four entry-point modules** (GENERATE, TRACE, SHADE, RESOLVE) sharing the SBT switches; moves the `Payload`-typed storage binding into the codegen region (after the user's `struct Payload` declaration). - [additional/dom-webgpu.js](additional/dom-webgpu.js) — all library WGSL ([:1278-1672](additional/dom-webgpu.js#L1278)), RT dispatch ([:2743-2821](additional/dom-webgpu.js#L2743)), TLAS/LBVH build ([:1931-2638](additional/dom-webgpu.js#L1931)), device limits ([:118-139](additional/dom-webgpu.js#L118)). New indirect plumbing, wavefront dispatch chain, bitonic sort, and the new kernels live here. - [interfaces/Crafter.Graphics-WebGPU.cppm](interfaces/Crafter.Graphics-WebGPU.cppm) — add `wgpuDispatchWavefrontRT` + indirect-dispatch import decls. - [interfaces/Crafter.Graphics-RTPass.cppm](interfaces/Crafter.Graphics-RTPass.cppm) — per-frame driver switches to the wavefront chain. - [examples/Sponza/raygen.wgsl](examples/Sponza/raygen.wgsl), `closesthit.wgsl`, `miss.wgsl` — rewritten to the emit/accumulate model (reference for the API break). - **New** `examples/RTStress/` — scalable many-instance benchmark + timing HUD (Phase 0). ## Design ### Stages (all compute, recorded once per frame on `state.encoder`) - **GENERATE** — 2D `(⌈W/8⌉,⌈H/8⌉)`, one thread/pixel. User `raygen_main(pixel)` calls `rtEmitPrimaryRay(...)` + inits payload; writes `RayState`/`RayMeta`/payload slot `y*W+x`, zeroes `pixelAccum`. - **TRACE** — 1D over live ray count via `dispatchWorkgroupsIndirect`, `@workgroup_size(64)`. **No user code.** Reads compacted ray list + accel buffers, writes `HitState`. Holds only the traversal stacks + working ray → minimal registers → high occupancy. **v1 = opaque-only** (current Sponza is all-opaque; defer anyhit to a later variant). - **SHADE** — 1D over live ray count. Hosts all user code (the `runClosestHit`/`runMiss` switches move here). Reads `HitState`+payload, dispatches by `hit.t < tMax`, `rtAccumulate`s color, and `rtEmitRay`s continuation/shadow rays (atomic-append + compaction into the next-bounce buffer). - **PREP-INDIRECT** — 1D, single workgroup. Converts the atomic ray counter written by GENERATE/SHADE into `dispatchWorkgroupsIndirect` args `(⌈count/64⌉,1,1)` and resets the consumed counter. Must be its own kernel: WGSL atomics are only globally coherent across a *dispatch boundary*, so count→args conversion can't happen inside SHADE. - **RESOLVE** — 2D per pixel. `pixelAccum` → ping/pong output texture (preserving `state.outIsPing`); tonemap/gamma move here from raygen. ### GPU-driven bounce loop (no per-bounce CPU readback) Unrolled at command-record time to a fixed `maxDepth` (Sponza needs 2 = primary + shadow): ``` GENERATE; PREP(0) for b in 0..maxDepth-1: TRACE <indirect args[b%2]> SHADE <indirect args[b%2]> // appends rays to (b+1)%2, atomicAdd count[(b+1)%2] PREP(b+1) // count[(b+1)%2] -> args; reset count[b%2]=0 RESOLVE ``` Each kernel is its own `beginComputePass`/`end` (indirect args from one pass must be visible to the next — the same reason TLAS build ends/reopens the pass at [dom-webgpu.js:2602](additional/dom-webgpu.js#L2602)). Zero-ray rounds dispatch `(0,1,1)` → near-free. ### Buffers - `RayState` (32B) + `RayMeta` (16B: pixelIndex, flags, missIndex, payloadSlot) — **double-buffered** (ping/pong per bounce), sized **`2×W·H`** only. - `HitState` (96B, matches `HitInfo` at [dom-webgpu.js:1293](additional/dom-webgpu.js#L1293)) — single buffer reused per bounce. - `payloadStore` — bound as `array<Payload>` (the **user's** type), so size/layout are compiler-enforced; no fixed-size cap or bitcast. Requires moving the binding text into the codegen region after the user's `Payload` decl. - `rayCount` (2× `atomic<u32>`) + `indirectArgs` (2× 3-u32, usage `STORAGE | INDIRECT`). - **Capacity guard:** `rtEmitRay` enforces a **1 continuation ray per pixel** budget (shadow rays excepted), so peak live rays fit `2×W·H`. Overflow rays are dropped (graceful). ### API break — Sponza rewritten - `Payload` gains `throughput: vec3<f32>`, `rayKind: u32` (PRIMARY/SHADOW), `pixelIndex: u32`. - `raygen_main`: build pinhole ray, set throughput=1, `rtEmitPrimaryRay(...)`. - `closesthit_main` (PRIMARY): existing UV/normal/albedo work; `rtAccumulate(ambient)`; if `nDotL>0`, `rtEmitRay(shadow ray, SKIP_CLOSEST_HIT|TERMINATE_ON_FIRST_HIT, throughput=lit term)`. - `miss_main`: SHADOW → `rtAccumulate(throughput)` (sun visible); PRIMARY → `rtAccumulate(sky)`. - Occluded shadow rays hit with `SKIP_CLOSEST_HIT` → contribute nothing → correct shadow. ### Coherence (mandatory for the many-instance benchmark) - **Replace the dead radix sort with a workgroup bitonic sort** (TODO strategy #5). Bitonic is data-oblivious — its barriers/access pattern are independent of the histogram distribution, which is exactly the "count-dependent" axis the [TODO](TODO-lbvh-sort.md) bisected the bug to. ~105 compare-exchange passes over 16384 keys; in-budget. Same `(morton16<<16)|index16` keys; `0xFFFFFFFF` sentinels sort to the end. - **Dynamic tree depth:** set `N_PADDED = next_pow2(N_real)` per build so descent depth tracks real instance count instead of fixed 14. (Packed Karras BVH is a later refinement *iff* profiling still shows TLAS dominating.) ### Ordered traversal (TRACE) Add `_rtAabbT` returning entry-t. At each internal node compute both children's entry-t, descend the nearer child, push the farther only if `t < bestT`. Tightens `bestT` faster → fewer triangle tests. Applies to both `_rtTraverseBlas` and `_rtTraverseTlas`. ### Device limits (extend the existing `clamp(...)` block at [dom-webgpu.js:131](additional/dom-webgpu.js#L131)) `maxStorageBuffersPerShaderStage=16` is already requested (TRACE/SHADE need 12 — fits). **Add requests for** `maxBufferSize`, `maxStorageBufferBindingSize` (payloadStore at 1080p ≈ 130 MB, over the 128 MB baseline), and `maxComputeWorkgroupsPerDimension` (needed for 4K 1D dispatch; or de-linearize to 2D in PREP). ## Implementation order (de-risk highest-uncertainty first) 1. **Phase 0 — Measurement harness (do first; gates everything).** - New `examples/RTStress/`: N×N×N grid of a small mesh, instance count a compile/runtime knob (512 → ~8000), primary + shadow ray. This is the standing many-instance benchmark. - Add **GPU timestamp queries** (`timestamp-query` feature) around each pass + a frame-time HUD/console line. Record the **baseline megakernel** number to quantify the gap. 2. **Indirect-dispatch plumbing in isolation** — `dispatchWorkgroupsIndirect` + `INDIRECT` usage + a toy "emit N → dispatch N" round-trip. Biggest unknown (no precedent in repo); prove cross-pass atomic visibility on the target Dawn build before building real kernels on it. 3. **TLAS coherence** — bitonic sort + dynamic depth. Independent of the rewrite; validate against the *existing* megakernel image, measure TLAS-traversal fraction before/after. 4. **Split megakernel → GENERATE/TRACE/SHADE/RESOLVE at maxDepth=1** (primary only, no emit). Validate stage decomposition + binding budget against VulkanTriangle (single trace). 5. **GPU-driven bounce loop + `rtEmitRay`/`rtAccumulate`**; port Sponza to the emit model. Validate against the current Sponza image. 6. **Nearest-child-first traversal** — pure optimization; measure delta. 7. **Pack bindings** (merge `tlasEntryOrder` into BVH pad words; unify vertices/indices/primRemap into one u32 heap) **only if** a target device reports <12 storage buffers. ### Optional, secondary (UI — user is skeptical it's the bottleneck) The UI compute path is `O(W·H·N)` (each tile thread iterates every item, [UI.cpp:110-131](implementations/Crafter.Graphics-UI.cpp#L110)). Cheap win: per-tile item binning so each tile iterates only overlapping items (~`O(W·H·k)`). Defer until RT numbers land. ## Verification - **Correctness:** Sponza and VulkanTriangle render bit-comparable (within tonemap tolerance) to the current megakernel output after Phase 5. RTStress shows no flicker as instance count scales (the [TODO](TODO-lbvh-sort.md) acceptance criterion: CPU-oracle sort vs GPU output for all-uniform, all-one-bucket, and "small object next to tight cluster" distributions). - **Performance:** chart fps vs instance count on the 4090 for RTStress, before (megakernel) and after each phase. Primary metric: **60fps at the target instance count**. Secondary: TRACE kernel register count / occupancy (the rewrite's core justification), per-bounce live-ray counts (validate the `2×W·H` budget + emit-drop rate), per-pass GPU time via timestamp queries. - **How to run:** build the wasm32 target (`project.cpp` selects DOM backend on `wasm32-*`), serve `examples/RTStress`, read the timing HUD in-browser on the 4090.
Member

Phase 3 (TLAS coherence) landed — merged in #2.

Re-enabled the TLAS LBVH spatial sort by replacing the disabled, buggy LSD radix scatter with a data-oblivious workgroup bitonic sorting network in lbvhBuildMain, and turned it on. This fixes the documented TODO-lbvh-sort.md corruption (the if (false) guard is gone) and restores Morton spatial coherence to TLAS BVH leaves — the prerequisite for the many-instance benchmark to be meaningful.

Bitonic is data-oblivious (its compare-exchange schedule depends only on N_PADDED, not the keys), so it structurally cannot have the count/distribution-dependent race the radix sort did. This was strategy #5 in the TODO.

Verified:

  • GPU unit test against a CPU oracle on the real Firefox/Dawn WebGPU stack — all three required distributions (all-uniform, all-one-bucket, small-object-next-to-cluster) plus random/reverse/empty inputs match bit-for-bit with a valid index permutation, zero GPU errors.
  • Sponza (25 TLAS instances) renders correctly with the sort live — no flicker/corruption (screenshot in #2).
  • crafter-build test is green (library declares no tests; verification was direct).

Still open (intentionally not in #2) — the rest of the wavefront rewrite from the plan:

  • Phase 0: RTStress many-instance benchmark + GPU timestamp timing HUD (the measurement harness).
  • Phase 2: indirect-dispatch plumbing (dispatchWorkgroupsIndirect, cross-pass atomic visibility).
  • Phases 4–5: megakernel → GENERATE/TRACE/SHADE/RESOLVE split + GPU-driven bounce loop + the rtEmitRay/rtAccumulate API break (Sponza ported to emit/accumulate).
  • Phase 6: nearest-child-first ordered traversal.
  • Phase 7: binding packing (only if a target reports <12 storage buffers).
  • Deferred: dynamic TLAS tree depth (N_PADDED = next_pow2(N_real)) — couples build + trace shaders; secondary to coherence.

Landing the coherence fix first (as the plan recommends — it's independent of the kernel rewrite and validatable against the existing renderer) de-risks the benchmark the remaining phases will be measured against. Leaving this issue open to track them.

**Phase 3 (TLAS coherence) landed — merged in #2.** Re-enabled the TLAS LBVH spatial sort by replacing the disabled, buggy LSD radix scatter with a data-oblivious workgroup **bitonic sorting network** in `lbvhBuildMain`, and turned it on. This fixes the documented `TODO-lbvh-sort.md` corruption (the `if (false)` guard is gone) and restores Morton spatial coherence to TLAS BVH leaves — the prerequisite for the many-instance benchmark to be meaningful. Bitonic is data-oblivious (its compare-exchange schedule depends only on `N_PADDED`, not the keys), so it structurally cannot have the count/distribution-dependent race the radix sort did. This was strategy #5 in the TODO. **Verified:** - GPU unit test against a CPU oracle on the real Firefox/Dawn WebGPU stack — all three required distributions (all-uniform, all-one-bucket, small-object-next-to-cluster) plus random/reverse/empty inputs match bit-for-bit with a valid index permutation, zero GPU errors. - Sponza (25 TLAS instances) renders correctly with the sort live — no flicker/corruption (screenshot in #2). - `crafter-build test` is green (library declares no tests; verification was direct). **Still open (intentionally not in #2)** — the rest of the wavefront rewrite from the plan: - Phase 0: `RTStress` many-instance benchmark + GPU timestamp timing HUD (the measurement harness). - Phase 2: indirect-dispatch plumbing (`dispatchWorkgroupsIndirect`, cross-pass atomic visibility). - Phases 4–5: megakernel → GENERATE/TRACE/SHADE/RESOLVE split + GPU-driven bounce loop + the `rtEmitRay`/`rtAccumulate` API break (Sponza ported to emit/accumulate). - Phase 6: nearest-child-first ordered traversal. - Phase 7: binding packing (only if a target reports <12 storage buffers). - Deferred: dynamic TLAS tree depth (`N_PADDED = next_pow2(N_real)`) — couples build + trace shaders; secondary to coherence. Landing the coherence fix first (as the plan recommends — it's independent of the kernel rewrite and validatable against the existing renderer) de-risks the benchmark the remaining phases will be measured against. Leaving this issue open to track them.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
Catcrafts/Crafter.Graphics#1
No description provided.