Crafter.Graphics/WAVEFRONT-DESIGN.md
catbot 4e42d663a6 WebGPU RT: wavefront tracer core (GENERATE/PREP/TRACE/SHADE/RESOLVE)
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:

  GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE

- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
  the indirect-args buffer lives in its own bind group so it is never
  bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
  plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
  between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
  user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
  PerDimension so the W*H-sized work buffers fit past the 128MB baseline.

VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 16:24:41 +00:00

3.7 KiB
Raw Blame History

WebGPU wavefront RT rewrite — design & progress (issue #3)

Replaces the single megakernel (main, 8×8 tile, per-pixel raygen→traceRay→CH/miss→store) with a streaming wavefront tracer: GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE, each its own compute pass, dispatch sizes driven by dispatchWorkgroupsIndirect.

Kernels (all generated/assembled the same megakernel way, just split)

  • GENERATE (1 thread/pixel, 8×8): runs user raygen_main(gid) which calls rtEmitPrimaryRay(...). Clears accum slot + payload slot for the pixel.
  • PREP (1 thread): reads emit counter for the just-filled ray buffer, writes indirect args [ceil(n/64),1,1], publishes traceCount=n, swaps cur/next ray buffer, resets next emit counter. One PREP before first TRACE and one after each SHADE.
  • TRACE (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i, runs _rtTraverseTlas, writes HitResult i (t/instanceId/primId/hg/attribs /objToWorld/customIndex/missFlag).
  • SHADE (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload slot p. miss→runMiss, hit→runClosestHit (unless SKIP_CLOSEST_HIT). User code calls rtAccumulate(pixel,rgb) and rtEmitRay(...).
  • RESOLVE (1 thread/pixel, 8×8): reads accum slot, runs user resolve_main if present else passthrough; writes outImage.

Buffers (rtState, sized to 2WH rays)

  • wfRaysA,wfRaysB: array, ping/pong. WfRay = origin,tMin,dir,tMax, pixel,flags,cullMask,missIndex,sbtOffset,payloadSlot,kind,_pad.
  • wfHits: array (sized = ray capacity).
  • wfPayload: array — declared in CODEGEN region after user Payload.
  • wfAccum: array<vec4> per pixel (W*H).
  • wfCounters: atomic counters: emitA, emitB, trace dispatch args, etc.
  • wfIndirect: INDIRECT dispatch-args buffer.

API (new, breaking)

  • raygen: rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx) → allocates payloadSlot=pixel, writes ray to current buffer (atomic bump).
  • CH/miss: rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload) spawns into NEXT buffer carrying a payload slot; rtAccumulate(pixel,rgb).
  • rtGetPayload(slot) / payload passed by value into CH/miss via slot.

Tonemap / resolve

Accum buffer is linear. Optional user WebGPURTStage::Resolve entry resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>. None → passthrough. VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.

Indirect dispatch (Phase 2 de-risk)

Prove dispatchWorkgroupsIndirect + cross-pass atomic visibility with a toy "emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit barrier between compute passes in one submit, so atomics written in PREP are visible to TRACE.

maxDepth

Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).

Status / progress

  • baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png
  • wavefront prelude + codegen
  • VulkanTriangle on wavefront (maxDepth=1)
  • bounce loop + indirect + Sponza shadow port
  • RTStress example + timestamp queries
  • ordered traversal, dynamic TLAS depth, device limits
  • remove megakernel dual path; final validation; PR

Files

  • additional/dom-webgpu.js — prelude (rtWgsl*), wgpuLoadRTPipeline, wgpuDispatchRT, LBVH build, rtState/buffers, device-limit clamp (~L131).
  • implementations/Crafter.Graphics-PipelineRTWebGPU.cpp — assembles user WGSL + entry glue; must emit 5 entry points + payloadStore binding.
  • examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl + main.cpp.