# WebGPU wavefront RT rewrite: design & progress (issue #3)

Replaces the single megakernel (`main`, 8×8 tile, per-pixel raygen→traceRay→CH/miss→store) with a streaming wavefront tracer: `GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own compute pass, dispatch sizes driven by `dispatchWorkgroupsIndirect`.

## Kernels (all generated/assembled the same megakernel way, just split)

- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)` which calls `rtEmitPrimaryRay(...)`. Clears accum slot + payload slot for the pixel.
- **PREP** (1 thread): reads emit counter for the just-filled ray buffer, writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps cur/next ray buffer, resets next emit counter. One PREP before first TRACE and one after each SHADE.
- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i, runs `_rtTraverseTlas`, writes `HitResult` i (t/instanceId/primId/hg/attribs/objToWorld/customIndex/missFlag).
- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`.
- **RESOLVE** (1 thread/pixel, 8×8): reads accum slot, runs user `resolve_main` if present else passthrough; writes outImage.

## Buffers (rtState, sized to 2*W*H rays)

- `wfRaysA`,`wfRaysB`: `array<WfRay>`, ping/pong. WfRay = origin,tMin,dir,tMax, pixel,flags,cullMask,missIndex,sbtOffset,payloadSlot,kind,_pad.
- `wfHits`: `array<HitResult>` (sized = ray capacity).
- `wfPayload`: `array<Payload>`, declared in CODEGEN region after user Payload.
- `wfAccum`: one array element per pixel (W*H).
- `wfCounters`: atomic counters: emitA, emitB, trace dispatch args, etc.
- `wfIndirect`: INDIRECT dispatch-args buffer.

## API (new, breaking)

- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)` → allocates payloadSlot=pixel, writes ray to current buffer (atomic bump).
- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)` spawns into NEXT buffer carrying a payload slot; `rtAccumulate(pixel,rgb)`.
- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot.

## Tonemap / resolve

Accum buffer is linear. Optional user `WebGPURTStage::Resolve` entry `resolve_main(coord:vec2, hdr:vec4)->vec4`. None → passthrough. VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.

## Indirect dispatch (Phase 2 de-risk)

Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy "emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit barrier between compute passes in one submit, so atomics written in PREP are visible to TRACE.

## maxDepth

Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).

## Status / progress

- [x] baseline VulkanTriangle renders (megakernel): /tmp/baseline-triangle.png
- [ ] wavefront prelude + codegen
- [ ] VulkanTriangle on wavefront (maxDepth=1)
- [ ] bounce loop + indirect + Sponza shadow port
- [ ] RTStress example + timestamp queries
- [ ] ordered traversal, dynamic TLAS depth, device limits
- [ ] remove megakernel dual path; final validation; PR

## Files

- `additional/dom-webgpu.js`: prelude (`rtWgsl*`), `wgpuLoadRTPipeline`, `wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131).
- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp`: assembles user WGSL + entry glue; must emit 5 entry points + payloadStore binding.
- `examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl` + `main.cpp`.
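
## Sketch: pass order and PREP bookkeeping

The pass chain and the PREP step described in the kernel list can be modelled on the CPU before any GPU code exists. The following is a minimal sketch, not the code in `dom-webgpu.js`. It assumes one particular reading of the cur/next swap: the buffer that emits append to becomes the trace input, and the other buffer becomes the new emit target with its counter reset. Every function and field name here (`buildPassSchedule`, `makeState`, `emit`, `prep`, `fillBuffer`, `traceBuffer`) is hypothetical, and no WebGPU API is called.

```javascript
// CPU-side model of the wavefront bookkeeping. Hypothetical names throughout.

const TRACE_WORKGROUP_SIZE = 64; // TRACE and SHADE are 64-wide in the design

// Pass order: GENERATE → PREP → (TRACE → SHADE → PREP) × maxDepth → RESOLVE.
function buildPassSchedule(maxDepth) {
  const passes = ["GENERATE", "PREP"];
  for (let depth = 0; depth < maxDepth; depth++) {
    passes.push("TRACE", "SHADE", "PREP");
  }
  passes.push("RESOLVE");
  return passes;
}

// Assumed state: one emit counter per ping/pong buffer, the index of the
// buffer that emits append to, and the values PREP publishes.
function makeState() {
  return {
    emitCount: [0, 0],       // stands in for emitA, emitB
    fillBuffer: 0,           // buffer receiving rtEmitPrimaryRay / rtEmitRay writes
    traceBuffer: 0,          // buffer that TRACE and SHADE read
    traceCount: 0,
    indirectArgs: [0, 1, 1], // what dispatchWorkgroupsIndirect would consume
  };
}

// Stand-in for the atomic bump: returns the slot index the ray is written to.
function emit(state) {
  return state.emitCount[state.fillBuffer]++;
}

// PREP: a single thread turns "rays emitted" into "workgroups to dispatch".
function prep(state) {
  const n = state.emitCount[state.fillBuffer];
  state.indirectArgs = [Math.ceil(n / TRACE_WORKGROUP_SIZE), 1, 1];
  state.traceCount = n;
  state.traceBuffer = state.fillBuffer;    // just-filled buffer becomes trace input
  state.fillBuffer = 1 - state.fillBuffer; // swap ping/pong
  state.emitCount[state.fillBuffer] = 0;   // reset counter for the next wave
  return state;
}

// Usage: 100 primary rays, then 30 secondary rays emitted during SHADE.
const wave = makeState();
for (let i = 0; i < 100; i++) emit(wave); // GENERATE
prep(wave);                               // indirectArgs [2,1,1], traceCount 100
for (let i = 0; i < 30; i++) emit(wave);  // SHADE
prep(wave);                               // indirectArgs [1,1,1], traceCount 30
```

Keeping `traceCount` separate from the workgroup count matters because the last workgroup is usually partially full: under this model a TRACE or SHADE thread would compare its index against `traceCount` and exit early when it is out of range.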