Crafter.Graphics/WAVEFRONT-DESIGN.md
catbot 23780d83a8 fix(webgpu): request adapter's storage-buffer limit, not hardcoded 16
dom-webgpu.js capped maxStorageBuffersPerShaderStage at 16 even when the
adapter reports far more (64 in our test env). The wavefront SHADE kernel
already binds ~16 storage buffers before any user binding, so any RT
pipeline declaring 2+ user storage buffers at @group(3) overflowed the
limit and failed to build with "Too many bindings of type StorageBuffers".

Request the adapter's reported maxStorageBuffersPerShaderStage /
maxStorageBuffersInPipelineLayout instead of a fixed 16. `clamp` already
mins against the adapter cap, so baseline-only devices still get a valid
request, and the `|| 16` fallback + the `typeof cap === "number"` guard
handle limit names a browser doesn't expose (Firefox returns null for
maxStorageBuffersInPipelineLayout).

Verified in-browser: a 17-storage-buffer compute pipeline fails with the
exact reported error on a device clamped to 16, and builds cleanly on a
device requesting the adapter's 64. RTStress renders correctly.

Resolves #8

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 21:55:42 +00:00

# WebGPU wavefront RT rewrite — design & progress (issue #3)
Replaces the single megakernel (`main`, 8×8 tile, per-pixel
raygen→traceRay→CH/miss→store) with a streaming wavefront tracer:
`GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own
compute pass, dispatch sizes driven by `dispatchWorkgroupsIndirect`.
## Kernels (all generated/assembled the same way as the megakernel, just split)
- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)` which calls
`rtEmitPrimaryRay(...)`. Clears accum slot + payload slot for the pixel.
- **PREP** (1 thread): reads emit counter for the just-filled ray buffer,
writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps
cur/next ray buffer, resets next emit counter. One PREP before first TRACE
and one after each SHADE.
- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i,
runs `_rtTraverseTlas`, writes `HitResult` i (t/instanceId/primId/hg/attribs
/objToWorld/customIndex/missFlag).
- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload
slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User
code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`.
- **RESOLVE** (1 thread/pixel, 8×8): reads accum slot, runs user `resolve_main`
if present else passthrough; writes outImage.
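
The PREP bookkeeping described above can be modelled on the CPU. This is a minimal sketch, not the code in `dom-webgpu.js`: the state object and its field names (`emit`, `write`, `read`, `traceCount`, `indirect`) are hypothetical stand-ins for `wfCounters` and `wfIndirect`, and it assumes `write` indexes the ray buffer that the previous pass emitted into.

```javascript
// CPU model of one PREP invocation (hypothetical state layout).
const RAY_WORKGROUP_SIZE = 64; // TRACE and SHADE are 64-wide per the kernel list

function prep(state) {
  const filled = state.write;  // buffer the previous pass just filled
  const n = state.emit[filled]; // number of rays waiting to be traced

  // Indirect dispatch args: one 64-wide workgroup per 64 rays, rounded up.
  state.indirect = [Math.ceil(n / RAY_WORKGROUP_SIZE), 1, 1];
  state.traceCount = n;         // published so TRACE/SHADE can bounds-check

  // Swap: the filled buffer becomes the one TRACE/SHADE read,
  // the other one becomes the target for newly emitted rays.
  state.read = filled;
  state.write = 1 - filled;
  state.emit[state.write] = 0;  // reset the emit counter of the new target
  return state;
}
```

With 130 rays waiting, this model writes `[3, 1, 1]` as the indirect args; with none waiting, it writes `[0, 1, 1]`.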
## Buffers (rtState, sized to `2*W*H` rays)
- `wfRaysA`, `wfRaysB`: `array<WfRay>`, ping/pong. `WfRay` = origin, tMin, dir, tMax,
pixel, flags, cullMask, missIndex, sbtOffset, payloadSlot, kind, _pad.
- `wfHits`: `array<HitResult>` (sized to the ray capacity).
- `wfPayload`: `array<Payload>`, declared in the CODEGEN region after the user `Payload`.
- `wfAccum`: `array<vec4<f32>>`, one entry per pixel (`W*H`).
- `wfCounters`: atomic counters: emitA, emitB, trace dispatch args, etc.
- `wfIndirect`: INDIRECT dispatch-args buffer.
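
To make the sizing rule concrete, here is a rough sketch of the arithmetic. It rests on two assumptions that the text above does not confirm: that each of the two ray buffers holds `2*W*H` rays, and that `WfRay` packs into 64 bytes (16 four-byte scalars, counting origin and dir as three floats each). `HitResult` and `Payload` sizes are left out because they are not given here. The function name is hypothetical.

```javascript
const WF_RAY_BYTES = 64; // ASSUMPTION: inferred from the WfRay field list
const ACCUM_BYTES = 16;  // vec4<f32> = 4 floats * 4 bytes

function wavefrontBufferSizes(width, height) {
  const pixels = width * height;
  const rayCapacity = 2 * pixels; // "sized to 2*W*H rays"
  return {
    rayCapacity,
    wfRaysBytesEach: rayCapacity * WF_RAY_BYTES, // wfRaysA and wfRaysB each
    wfAccumBytes: pixels * ACCUM_BYTES,
  };
}

// Example at the resolution used for the measurements in this document.
console.log(wavefrontBufferSizes(1920, 995));
```

Under those assumptions, 1920×995 gives a capacity of 3,820,800 rays, 244,531,200 bytes per ray buffer, and 30,566,400 bytes of accumulation storage.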
## API (new, breaking)
- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)`
→ allocates payloadSlot=pixel, writes ray to current buffer (atomic bump).
- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)`
spawns into NEXT buffer carrying a payload slot; `rtAccumulate(pixel,rgb)`.
- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot.
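
The "atomic bump" allocation used by the emit calls can be sketched as a CPU model. The helper names below are hypothetical, the plain increment stands in for a WGSL `atomicAdd` on the emit counter, and dropping rays once the buffer is full is an assumed overflow policy rather than something this document specifies.

```javascript
function makeRayBuffer(capacity) {
  return { rays: new Array(capacity), emitCount: 0, capacity };
}

// Returns the slot the ray was written to, or -1 if the buffer was full.
function emitRay(buffer, ray) {
  // GPU equivalent: atomicAdd returns the previous counter value,
  // which gives every emitting thread a unique slot.
  const slot = buffer.emitCount;
  buffer.emitCount += 1;
  if (slot >= buffer.capacity) return -1; // ASSUMED policy: drop on overflow
  buffer.rays[slot] = ray;
  return slot;
}

// The counter may run past capacity, so readers clamp it.
function liveRayCount(buffer) {
  return Math.min(buffer.emitCount, buffer.capacity);
}
```

In this model the counter keeps counting after overflow, so whatever consumes it (PREP in the pipeline above) would need to clamp it before sizing the next dispatch.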
## Tonemap / resolve
Accum buffer is linear. Optional user `WebGPURTStage::Resolve` entry
`resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>`. None → passthrough.
VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.
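
As an illustration of what a Reinhard+gamma resolve computes, here is a per-pixel sketch in JavaScript rather than WGSL. It assumes the basic Reinhard operator `c / (1 + c)` and a gamma exponent of 2.2; neither the exact operator variant nor the exponent is stated here, and the function names are hypothetical.

```javascript
const GAMMA = 2.2; // ASSUMPTION: exponent not given in this document

// hdr is [r, g, b, a] in linear space, as read from the accum buffer.
function resolveReinhardGamma(hdr) {
  const [r, g, b, a] = hdr;
  const map = (c) => {
    const x = Math.max(c, 0);                  // guard against negative input
    const tonemapped = x / (1 + x);            // Reinhard: maps [0, inf) to [0, 1)
    return Math.pow(tonemapped, 1 / GAMMA);    // gamma encode for display
  };
  return [map(r), map(g), map(b), a];          // alpha passes through unchanged
}

// With no resolve stage the accumulated value is written as-is.
function resolvePassthrough(hdr) {
  return hdr;
}
```

The passthrough variant leaves values untouched, which is what allows an exact comparison against a reference image.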
## Indirect dispatch (Phase 2 de-risk)
Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy
"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit
barrier between compute passes in one submit, so atomics written in PREP are
visible to TRACE.
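
The toy could be encoded roughly as follows. This is a sketch under assumptions: the function and parameter names are hypothetical, and pipelines, bind groups, and the indirect buffer (created with `INDIRECT` and `STORAGE` usage) are supplied by the caller. It uses two bind groups so that the consuming pass does not bind the indirect buffer as writable storage while also dispatching from it.

```javascript
// encoder: a GPUCommandEncoder. Pass 1 writes the workgroup count,
// pass 2 launches exactly that many workgroups.
function encodeEmitThenDispatch(encoder, res) {
  // Pass 1: a single workgroup writes [x, 1, 1] into the indirect buffer.
  const prep = encoder.beginComputePass();
  prep.setPipeline(res.prepPipeline);
  prep.setBindGroup(0, res.prepBindGroup); // binds indirect buffer as storage
  prep.dispatchWorkgroups(1);
  prep.end();

  // Pass 2: workgroup count is read from the buffer at byte offset 0.
  const work = encoder.beginComputePass();
  work.setPipeline(res.workPipeline);
  work.setBindGroup(0, res.workBindGroup); // must not bind it as writable storage
  work.dispatchWorkgroupsIndirect(res.indirectBuffer, 0);
  work.end();
}
```

Both passes go into the same command encoder and therefore the same submit, which is the situation the visibility argument above relies on.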
## maxDepth
Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
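
The unrolled chain follows directly from the pipeline shape `GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`. A minimal sketch of that unrolling, with a hypothetical function name and pass names used as plain strings:

```javascript
function buildPassChain(maxDepth) {
  if (!Number.isInteger(maxDepth) || maxDepth < 1) {
    throw new RangeError("maxDepth must be an integer >= 1");
  }
  const passes = ["GENERATE", "PREP"]; // one PREP before the first TRACE
  for (let depth = 0; depth < maxDepth; depth++) {
    passes.push("TRACE", "SHADE", "PREP"); // one PREP after each SHADE
  }
  passes.push("RESOLVE");
  return passes;
}

console.log(buildPassChain(1).join(" -> "));
// GENERATE -> PREP -> TRACE -> SHADE -> PREP -> RESOLVE
```

Each extra level of depth adds three passes, so `maxDepth=2` yields nine passes in total.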
## Status / progress
- [x] baseline VulkanTriangle renders (megakernel)
- [x] wavefront prelude + codegen (5 entry points share one module)
- [x] VulkanTriangle on wavefront (maxDepth=1) — bit-identical to baseline
- [x] indirect-dispatch bounce loop + PREP (cross-pass atomics proven)
- [x] RTStress example (N³ cube grid) + GPU timestamp-query per-pass HUD
- [x] Sponza port (shadow ray in SHADE) — renders the atrium correctly
- [x] ordered (nearest-child-first) traversal
- [x] dynamic TLAS sweep-tree depth (next_pow2 instances)
- [x] device limits (maxBufferSize / maxStorageBufferBindingSize /
maxComputeWorkgroupsPerDimension) + timestamp-query feature
- [x] megakernel dead path removed (RT pipeline builds only wavefront)
- [~] binding packing (Phase 7): SKIPPED — target device reports 64 storage
buffers/stage (≥12), so the merge is unnecessary (issue makes it
conditional on <12). NOTE: this only holds because dom-webgpu.js now
requests the adapter's reported maxStorageBuffersPerShaderStage at
device creation (was hardcoded to 16, which left room for ~1 user
storage buffer and broke RT pipelines with ≥2). Devices that genuinely
report <12 storage buffers/stage still need this packing.
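
The limit request and the packing condition from the Phase 7 note can be sketched as below. This is not the code in `dom-webgpu.js`: the function names are hypothetical, and the sketch simply requests whatever the adapter reports, skipping limit names that the browser does not expose as numbers.

```javascript
const STORAGE_LIMIT_NAMES = [
  "maxStorageBuffersPerShaderStage",
  "maxStorageBuffersInPipelineLayout",
];
const PACKING_THRESHOLD = 12; // packing is only needed below this, per the note

// Build the requiredLimits entries from the adapter's reported limits.
function storageBufferLimitsFor(adapterLimits) {
  const requiredLimits = {};
  for (const name of STORAGE_LIMIT_NAMES) {
    const cap = adapterLimits[name];
    // Skip names that come back as null/undefined instead of a number.
    if (typeof cap === "number") requiredLimits[name] = cap;
  }
  return requiredLimits;
}

// True when the device genuinely reports too few storage buffers per stage.
function needsBindingPacking(adapterLimits) {
  const cap = adapterLimits.maxStorageBuffersPerShaderStage;
  return typeof cap === "number" && cap < PACKING_THRESHOLD;
}
```

In a browser the result would be passed as `adapter.requestDevice({ requiredLimits: storageBufferLimitsFor(adapter.limits) })`. Requesting the reported value rather than a fixed number keeps the request valid on devices that report less.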
### Measured (this container's GPU, via timestamp-query; NOT a 4090)
Per-pass GPU time, 1920×995, primary+shadow (maxDepth=2):
- RTStress 512 inst: GEN ~0.80ms TRACE ~1.63ms SHADE ~1.00ms total ~3.52ms (~280 fps)
- RTStress 4096 inst: GEN ~0.80ms TRACE ~1.95ms SHADE ~1.00ms total ~3.85ms (~260 fps)
- Sponza: GEN ~0.79ms TRACE ~1.81ms SHADE ~1.00ms total ~3.69ms
8× the instances costs only ~20% more TRACE (1.63ms → 1.95ms): the spatial TLAS + ordered
descent scale sub-linearly. NOTE: a 4090 number and the TRACE-kernel
register/occupancy delta require hardware + a profiler not available in
this CI container; the architectural win (TRACE carries zero user code, so
its register footprint is the traversal loop alone) is structural.
## Files
- `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,
`wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131).
- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp` — assembles user
WGSL + entry glue; must emit 5 entry points + payloadStore binding.
- `examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl` + `main.cpp`.