WebGPU RT: wavefront tracer core (GENERATE/PREP/TRACE/SHADE/RESOLVE)

Replace the megakernel @compute entry with five wavefront kernels sharing one module, connected by GPU ray/hit/payload buffers and a GPU-driven indirect bounce loop: GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE - TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only). - PREP publishes dispatchWorkgroupsIndirect args from the live ray count; the indirect-args buffer lives in its own bind group so it is never bound read-write in the same dispatch that consumes it as INDIRECT. - New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate, plus an optional user Resolve stage (tonemap hook; identity by default). - Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary between passes within one submit). - Payload-typed wfPayload binding emitted in the codegen region after the user's struct Payload; payload travels with each ray (2*W*H slots). - Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups PerDimension so the W*H-sized work buffers fit past the 128MB baseline. VulkanTriangle ported to the new API and renders bit-identical to the megakernel baseline at maxDepth=1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 16:24:41 +00:00 · 2026-05-31 16:24:41 +00:00 · 4e42d663a6
commit 4e42d663a6
parent e0d72f57f2
9 changed files with 755 additions and 101 deletions
--- a/WAVEFRONT-DESIGN.md
+++ b/WAVEFRONT-DESIGN.md
@ -0,0 +1,69 @@
+# WebGPU wavefront RT rewrite — design & progress (issue #3)
+
+Replaces the single megakernel (`main`, 8×8 tile, per-pixel
+raygen→traceRay→CH/miss→store) with a streaming wavefront tracer:
+`GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own
+compute pass, dispatch sizes driven by `dispatchWorkgroupsIndirect`.
+
+## Kernels (all generated/assembled the same megakernel way, just split)
+- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)` which calls
+  `rtEmitPrimaryRay(...)`. Clears accum slot + payload slot for the pixel.
+- **PREP** (1 thread): reads emit counter for the just-filled ray buffer,
+  writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps
+  cur/next ray buffer, resets next emit counter. One PREP before first TRACE
+  and one after each SHADE.
+- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i,
+  runs `_rtTraverseTlas`, writes `HitResult` i (t/instanceId/primId/hg/attribs
+  /objToWorld/customIndex/missFlag).
+- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload
+  slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User
+  code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`.
+- **RESOLVE** (1 thread/pixel, 8×8): reads accum slot, runs user `resolve_main`
+  if present else passthrough; writes outImage.
+
+## Buffers (rtState, sized to 2*W*H rays)
+- `wfRaysA`,`wfRaysB`: array<WfRay>, ping/pong. WfRay = origin,tMin,dir,tMax,
+  pixel,flags,cullMask,missIndex,sbtOffset,payloadSlot,kind,_pad.
+- `wfHits`: array<HitResult> (sized = ray capacity).
+- `wfPayload`: array<Payload> — declared in CODEGEN region after user Payload.
+- `wfAccum`: array<vec4<f32>> per pixel (W*H).
+- `wfCounters`: atomic counters: emitA, emitB, trace dispatch args, etc.
+- `wfIndirect`: INDIRECT dispatch-args buffer.
+
+## API (new, breaking)
+- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)`
+  → allocates payloadSlot=pixel, writes ray to current buffer (atomic bump).
+- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)`
+  spawns into NEXT buffer carrying a payload slot; `rtAccumulate(pixel,rgb)`.
+- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot.
+
+## Tonemap / resolve
+Accum buffer is linear. Optional user `WebGPURTStage::Resolve` entry
+`resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>`. None → passthrough.
+VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.
+
+## Indirect dispatch (Phase 2 de-risk)
+Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy
+"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit
+barrier between compute passes in one submit, so atomics written in PREP are
+visible to TRACE.
+
+## maxDepth
+Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
+maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
+
+## Status / progress
+- [x] baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png
+- [ ] wavefront prelude + codegen
+- [ ] VulkanTriangle on wavefront (maxDepth=1)
+- [ ] bounce loop + indirect + Sponza shadow port
+- [ ] RTStress example + timestamp queries
+- [ ] ordered traversal, dynamic TLAS depth, device limits
+- [ ] remove megakernel dual path; final validation; PR
+
+## Files
+- `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,
+  `wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131).
+- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp` — assembles user
+  WGSL + entry glue; must emit 5 entry points + payloadStore binding.
+- examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl + main.cpp.