Crafter.Graphics/WAVEFRONT-DESIGN.md

# WebGPU wavefront RT rewrite — design & progress (issue #3)

Replaces the single megakernel (`main`, 8×8 tile, per-pixel
raygen→traceRay→CH/miss→store) with a streaming wavefront tracer:
`GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own
compute pass, dispatch sizes driven by `dispatchWorkgroupsIndirect`.

## Kernels (all generated/assembled the same way as the megakernel, just split)
- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)` which calls
  `rtEmitPrimaryRay(...)`. Clears accum slot + payload slot for the pixel.
- **PREP** (1 thread): reads emit counter for the just-filled ray buffer,
  writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps
  cur/next ray buffer, resets next emit counter. One PREP before first TRACE
  and one after each SHADE.
- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i,
  runs `_rtTraverseTlas`, writes `HitResult` i (t/instanceId/primId/hg/attribs
  /objToWorld/customIndex/missFlag).
- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload
  slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User
  code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`.
- **RESOLVE** (1 thread/pixel, 8×8): reads accum slot, runs user `resolve_main`
  if present else passthrough; writes outImage.
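
The PREP bookkeeping described above can be modelled on the CPU. The following is a minimal sketch, not the actual kernel: `prepStep` and the state fields (`emit`, `fill`, `read`, `traceCount`, `indirect`) are hypothetical names, and the exact swap convention (the just-filled buffer becomes the one TRACE/SHADE read, the other becomes the emit target) is an assumption about how "swaps cur/next" is meant.

```javascript
const TRACE_WORKGROUP_SIZE = 64; // TRACE and SHADE are 64-wide per the kernel list

// CPU model of one PREP invocation. `state` is a hypothetical mirror of the
// GPU-side counters: emit[0]/emit[1] stand in for emitA/emitB.
function prepStep(state) {
  const filled = state.fill;        // buffer the previous pass emitted into
  const n = state.emit[filled];     // live ray count
  state.indirect = [Math.ceil(n / TRACE_WORKGROUP_SIZE), 1, 1];
  state.traceCount = n;             // published so TRACE/SHADE can bounds-check
  state.read = filled;              // assumed: consumers read the just-filled buffer
  state.fill = 1 - filled;          // swap: later emits target the other buffer
  state.emit[state.fill] = 0;       // reset the next emit counter
  return state;
}

const state = { emit: [1000, 777], fill: 0, read: 1, traceCount: 0, indirect: [0, 1, 1] };
prepStep(state);
// state.indirect is now [16, 1, 1] and state.traceCount is 1000
```

Because the workgroup count is rounded up, the last workgroup can contain threads with index ≥ `traceCount`; in this model those threads would be expected to exit early.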

## Buffers (rtState, sized to 2*W*H rays)
- `wfRaysA`, `wfRaysB`: `array<WfRay>`, ping/pong. `WfRay` = origin, tMin, dir, tMax,
  pixel, flags, cullMask, missIndex, sbtOffset, payloadSlot, kind, _pad.
- `wfHits`: `array<HitResult>` (sized = ray capacity).
- `wfPayload`: `array<Payload>`, declared in CODEGEN region after user Payload.
- `wfAccum`: `array<vec4<f32>>` per pixel (W*H).
- `wfCounters`: atomic counters: emitA, emitB, trace dispatch args, etc.
- `wfIndirect`: INDIRECT dispatch-args buffer.
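
To get a feel for the sizes involved, here is a rough sizing sketch. It assumes a 64-byte `WfRay` (origin+tMin and dir+tMax each packed into 16 bytes, then eight 32-bit fields); the field types are not stated in this document, so treat the byte counts as an assumption. `wavefrontBufferSizes` is a hypothetical helper, and `wfHits`/`wfPayload` are left out because their element sizes depend on structs not shown here.

```javascript
// Assumed layout: vec3<f32> + f32 twice (32 bytes), then eight 32-bit fields.
const WF_RAY_BYTES = 16 + 16 + 8 * 4; // 64
const ACCUM_BYTES = 16;               // one vec4<f32> per pixel

function wavefrontBufferSizes(width, height) {
  const pixels = width * height;
  const rayCapacity = 2 * pixels;     // "sized to 2*W*H rays"
  return {
    rayCapacity,
    wfRaysBytes: rayCapacity * WF_RAY_BYTES, // per ping/pong buffer
    wfAccumBytes: pixels * ACCUM_BYTES,
  };
}

const sizes = wavefrontBufferSizes(1920, 995);
console.log(sizes.wfRaysBytes);                      // 244531200
console.log(sizes.wfRaysBytes > 128 * 1024 * 1024);  // true
```

Under this assumed layout a single ray buffer at 1920×995 is about 233 MiB, which is above WebGPU's default `maxStorageBufferBindingSize` of 128 MiB; that is consistent with the device-limit request listed under Status.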

## API (new, breaking)
- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)`
  → allocates payloadSlot=pixel, writes ray to current buffer (atomic bump).
- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)`
  spawns into NEXT buffer carrying a payload slot; `rtAccumulate(pixel,rgb)`.
- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot.
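
The difference between the two emit paths can be illustrated with a single-threaded CPU model. This is only a sketch: `makeQueues`, `emitInto`, `emitPrimaryRay` and `emitRay` are hypothetical JavaScript stand-ins, not the WGSL functions, and the overflow guard is an assumption that is not described in this document.

```javascript
// CPU model of the two ray queues. On the GPU the counter bump would be an
// atomic add; a plain increment is enough for a single-threaded illustration.
function makeQueues(capacity) {
  return { rays: [new Array(capacity), new Array(capacity)], emit: [0, 0], cur: 0, capacity };
}

function emitInto(q, which, ray) {
  const slot = q.emit[which]++;        // the "atomic bump"
  if (slot >= q.capacity) return -1;   // overflow guard (assumption)
  q.rays[which][slot] = ray;
  return slot;
}

// raygen path: payload slot is the pixel index, ray goes to the CURRENT buffer.
function emitPrimaryRay(q, pixel, ray) {
  return emitInto(q, q.cur, { ...ray, pixel, payloadSlot: pixel });
}

// CH/miss path: ray goes to the NEXT buffer and carries an explicit payload slot.
function emitRay(q, pixel, payloadSlot, ray) {
  return emitInto(q, 1 - q.cur, { ...ray, pixel, payloadSlot });
}
```

Writing secondary rays into the other buffer means SHADE never appends to the queue it is currently reading.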

## Tonemap / resolve
Accum buffer is linear. Optional user `WebGPURTStage::Resolve` entry
`resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>`. None → passthrough.
VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.
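
For reference, a CPU equivalent of a Reinhard+gamma resolve might look like the sketch below. The exact operator used by the Sponza example is not given here, so this assumes the simple Reinhard curve `c / (1 + c)` followed by a 1/2.2 gamma; `resolveReinhardGamma` and `passthrough` are hypothetical names.

```javascript
// Hypothetical CPU equivalent of a resolve_main tonemap.
// Assumes simple Reinhard (c / (1 + c)) and gamma 2.2; alpha is passed through.
function resolveReinhardGamma(hdr, gamma = 2.2) {
  const [r, g, b, a] = hdr;
  const map = (c) => {
    const v = Math.max(c, 0);            // guard against negative radiance
    return Math.pow(v / (1 + v), 1 / gamma);
  };
  return [map(r), map(g), map(b), a];
}

// With no Resolve stage the accumulated value is written unchanged.
const passthrough = (hdr) => hdr;
```

The passthrough default is what allows VulkanTriangle to match the megakernel output exactly, since no extra arithmetic touches the accumulated colour.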

## Indirect dispatch (Phase 2 de-risk)
Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy
"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit
barrier between compute passes in one submit, so atomics written in PREP are
visible to TRACE.
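
A minimal sketch of the host-side encoding for the "emit N → dispatch N" experiment is shown below. Only the encoder and pass methods are standard WebGPU; `encodePrepThenTrace` and every field of `res` (pipelines, bind groups, `indirectBuffer`) are hypothetical names, and the buffer is assumed to have been created with `GPUBufferUsage.INDIRECT` in addition to storage usage.

```javascript
// Encodes PREP followed by an indirectly dispatched TRACE, each in its own
// compute pass. `res` holds hypothetical, already-created GPU objects.
function encodePrepThenTrace(encoder, res) {
  const prep = encoder.beginComputePass();
  prep.setPipeline(res.prepPipeline);
  prep.setBindGroup(0, res.prepBindGroup);
  prep.dispatchWorkgroups(1);            // PREP is a single thread
  prep.end();

  const trace = encoder.beginComputePass();
  trace.setPipeline(res.tracePipeline);
  trace.setBindGroup(0, res.traceBindGroup);
  // Workgroup counts come from the three u32 values PREP wrote at offset 0.
  trace.dispatchWorkgroupsIndirect(res.indirectBuffer, 0);
  trace.end();
}
```

Both passes are recorded into the same command encoder, so the CPU never reads the ray count back.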

## maxDepth
Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
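
The unrolling can be pictured as building a flat list of pass names. The sketch below follows the ordering stated at the top of this document (one PREP before the first TRACE and one after each SHADE); `buildPassList` is a hypothetical helper, not a function from `dom-webgpu.js`.

```javascript
// Unrolls the pass chain for a given maxDepth.
function buildPassList(maxDepth) {
  const passes = ["GENERATE", "PREP"];
  for (let bounce = 0; bounce < maxDepth; bounce++) {
    passes.push("TRACE", "SHADE", "PREP");
  }
  passes.push("RESOLVE");
  return passes;
}

console.log(buildPassList(1).join(" → "));
// GENERATE → PREP → TRACE → SHADE → PREP → RESOLVE
```

With this ordering the pass count is `3 * maxDepth + 3`, so maxDepth=2 (Sponza) encodes nine passes per frame.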

## Status / progress
- [x] baseline VulkanTriangle renders (megakernel)
- [x] wavefront prelude + codegen (5 entry points share one module)
- [x] VulkanTriangle on wavefront (maxDepth=1) — bit-identical to baseline
- [x] indirect-dispatch bounce loop + PREP (cross-pass atomics proven)
- [x] RTStress example (N³ cube grid) + GPU timestamp-query per-pass HUD
- [x] Sponza port (shadow ray in SHADE) — renders the atrium correctly
- [x] ordered (nearest-child-first) traversal
- [x] dynamic TLAS sweep-tree depth (next_pow2 instances)
- [x] device limits (maxBufferSize / maxStorageBufferBindingSize /
      maxComputeWorkgroupsPerDimension) + timestamp-query feature
- [x] megakernel dead path removed (RT pipeline builds only wavefront)
- [~] binding packing (Phase 7): SKIPPED — target device reports 64 storage
      buffers/stage (≥12), so the merge is unnecessary (issue makes it
      conditional on <12).
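
The device-limit item above could be requested along these lines. This is a sketch under the assumption that the adapter's own maximum is requested for each limit; the actual clamp in `dom-webgpu.js` may differ, and `buildDeviceDescriptor` is a hypothetical helper. `requiredLimits`, `requiredFeatures`, `adapter.limits` and `adapter.features.has` are standard WebGPU.

```javascript
const WANTED_LIMITS = [
  "maxBufferSize",
  "maxStorageBufferBindingSize",
  "maxComputeWorkgroupsPerDimension",
];

// Builds a device descriptor that asks for the adapter's maximum of each
// limit and enables timestamp queries only when the adapter supports them.
function buildDeviceDescriptor(adapter) {
  const requiredLimits = {};
  for (const name of WANTED_LIMITS) {
    requiredLimits[name] = adapter.limits[name];
  }
  const requiredFeatures = adapter.features.has("timestamp-query")
    ? ["timestamp-query"]
    : [];
  return { requiredFeatures, requiredLimits };
}

// Usage in a browser (not executed here):
//   const adapter = await navigator.gpu.requestAdapter();
//   const device = await adapter.requestDevice(buildDeviceDescriptor(adapter));
```

Requesting a limit above what the adapter reports makes `requestDevice` reject, which is why the values are taken from `adapter.limits` rather than hard-coded.
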

### Measured (this container's GPU, via timestamp-query; NOT a 4090)
Per-pass GPU time, 1920×995, primary+shadow (maxDepth=2):
- RTStress 512 inst:  GEN ~0.80ms TRACE ~1.63ms SHADE ~1.00ms total ~3.52ms (~280 fps)
- RTStress 4096 inst: GEN ~0.80ms TRACE ~1.95ms SHADE ~1.00ms total ~3.85ms (~260 fps)
- Sponza:             GEN ~0.79ms TRACE ~1.81ms SHADE ~1.00ms total ~3.69ms
8× the instances costs only ~20% more TRACE time (1.63 ms → 1.95 ms): the spatial
TLAS and ordered descent scale sub-linearly. NOTE: 4090 numbers and the TRACE-kernel
register/occupancy delta require hardware and a profiler that are not available in
this CI container. The architectural win is structural: TRACE carries zero user
code, so its register footprint is that of the traversal loop alone.
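
The frame-rate and scaling figures follow from the RTStress rows by simple arithmetic. The sketch below re-derives them so the same check can be repeated when new measurements are taken; the helper names are hypothetical.

```javascript
// Figures copied from the RTStress rows above (milliseconds).
const rows = {
  inst512:  { trace: 1.63, total: 3.52 },
  inst4096: { trace: 1.95, total: 3.85 },
};

const fps = (totalMs) => 1000 / totalMs;
const relativeIncrease = (from, to) => (to - from) / from;

console.log(fps(rows.inst512.total).toFixed(0));   // 284
console.log(fps(rows.inst4096.total).toFixed(0));  // 260
console.log(
  (100 * relativeIncrease(rows.inst512.trace, rows.inst4096.trace)).toFixed(1)
); // 19.6
```

Measured against the 512-instance baseline the TRACE increase is about 20%; the same 0.32 ms difference is about 16% of the 4096-instance figure.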

## Files
- `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,
  `wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131).
- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp` — assembles user
  WGSL + entry glue; must emit 5 entry points + payloadStore binding.
- examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl + main.cpp.

WebGPU RT: wavefront tracer core (GENERATE/PREP/TRACE/SHADE/RESOLVE)

Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:

  GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE

- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
  the indirect-args buffer lives in its own bind group so it is never
  bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
  plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
  between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
  user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
  PerDimension so the W*H-sized work buffers fit past the 128MB baseline.

VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-31 16:24:41 +00:00

docs: wavefront RT in README + design-doc status; add RTStress to examples

2026-05-31 20:29:12 +00:00