Replace the megakernel @compute entry with five wavefront kernels sharing one module, connected by GPU ray/hit/payload buffers and a GPU-driven indirect bounce loop: GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE - TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only). - PREP publishes dispatchWorkgroupsIndirect args from the live ray count; the indirect-args buffer lives in its own bind group so it is never bound read-write in the same dispatch that consumes it as INDIRECT. - New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate, plus an optional user Resolve stage (tonemap hook; identity by default). - Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary between passes within one submit). - Payload-typed wfPayload binding emitted in the codegen region after the user's struct Payload; payload travels with each ray (2*W*H slots). - Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups PerDimension so the W*H-sized work buffers fit past the 128MB baseline. VulkanTriangle ported to the new API and renders bit-identical to the megakernel baseline at maxDepth=1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3.7 KiB
WebGPU wavefront RT rewrite — design & progress (issue #3)
Replaces the single megakernel (main, 8×8 tile, per-pixel
raygen→traceRay→CH/miss→store) with a streaming wavefront tracer:
GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE, each its own
compute pass, dispatch sizes driven by dispatchWorkgroupsIndirect.
Kernels (all generated/assembled the same megakernel way, just split)
- GENERATE (1 thread/pixel, 8×8): runs user
raygen_main(gid)which callsrtEmitPrimaryRay(...). Clears accum slot + payload slot for the pixel. - PREP (1 thread): reads emit counter for the just-filled ray buffer,
writes indirect args
[ceil(n/64),1,1], publishestraceCount=n, swaps cur/next ray buffer, resets next emit counter. One PREP before first TRACE and one after each SHADE. - TRACE (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i,
runs
_rtTraverseTlas, writesHitResulti (t/instanceId/primId/hg/attribs /objToWorld/customIndex/missFlag). - SHADE (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload
slot p. miss→
runMiss, hit→runClosestHit(unless SKIP_CLOSEST_HIT). User code callsrtAccumulate(pixel,rgb)andrtEmitRay(...). - RESOLVE (1 thread/pixel, 8×8): reads accum slot, runs user
resolve_mainif present else passthrough; writes outImage.
Buffers (rtState, sized to 2WH rays)
wfRaysA,wfRaysB: array, ping/pong. WfRay = origin,tMin,dir,tMax, pixel,flags,cullMask,missIndex,sbtOffset,payloadSlot,kind,_pad.wfHits: array (sized = ray capacity).wfPayload: array — declared in CODEGEN region after user Payload.wfAccum: array<vec4> per pixel (W*H).wfCounters: atomic counters: emitA, emitB, trace dispatch args, etc.wfIndirect: INDIRECT dispatch-args buffer.
API (new, breaking)
- raygen:
rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)→ allocates payloadSlot=pixel, writes ray to current buffer (atomic bump). - CH/miss:
rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)spawns into NEXT buffer carrying a payload slot;rtAccumulate(pixel,rgb). rtGetPayload(slot)/ payload passed by value into CH/miss via slot.
Tonemap / resolve
Accum buffer is linear. Optional user WebGPURTStage::Resolve entry
resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>. None → passthrough.
VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.
Indirect dispatch (Phase 2 de-risk)
Prove dispatchWorkgroupsIndirect + cross-pass atomic visibility with a toy
"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit
barrier between compute passes in one submit, so atomics written in PREP are
visible to TRACE.
maxDepth
Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
Status / progress
- baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png
- wavefront prelude + codegen
- VulkanTriangle on wavefront (maxDepth=1)
- bounce loop + indirect + Sponza shadow port
- RTStress example + timestamp queries
- ordered traversal, dynamic TLAS depth, device limits
- remove megakernel dual path; final validation; PR
Files
additional/dom-webgpu.js— prelude (rtWgsl*),wgpuLoadRTPipeline,wgpuDispatchRT, LBVH build, rtState/buffers, device-limit clamp (~L131).implementations/Crafter.Graphics-PipelineRTWebGPU.cpp— assembles user WGSL + entry glue; must emit 5 entry points + payloadStore binding.- examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl + main.cpp.