dom-webgpu.js capped maxStorageBuffersPerShaderStage at 16 even when the adapter reports far more (64 in our test env). The wavefront SHADE kernel already binds ~16 storage buffers before any user binding, so any RT pipeline declaring 2+ user storage buffers at @group(3) overflowed the limit and failed to build with "Too many bindings of type StorageBuffers".

Request the adapter's reported maxStorageBuffersPerShaderStage / maxStorageBuffersInPipelineLayout instead of a fixed 16. `clamp` already mins against the adapter cap, so baseline-only devices still get a valid request, and the `|| 16` fallback + the `typeof cap === "number"` guard handle limit names a browser doesn't expose (Firefox returns null for maxStorageBuffersInPipelineLayout).

Verified in-browser: a 17-storage-buffer compute pipeline fails with the exact reported error on a device clamped to 16, and builds cleanly on a device requesting the adapter's 64. RTStress renders correctly.

Resolves #8

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
91 lines
5.2 KiB
Markdown
# WebGPU wavefront RT rewrite: design & progress (issue #3)

Replaces the single megakernel (`main`, 8×8 tile, per-pixel
raygen→traceRay→CH/miss→store) with a streaming wavefront tracer:
`GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own
compute pass, with dispatch sizes driven by `dispatchWorkgroupsIndirect`.

## Kernels (all generated/assembled the same way as the megakernel, just split)
- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)`, which calls
  `rtEmitPrimaryRay(...)`. Clears the accum slot and payload slot for the pixel.
- **PREP** (1 thread): reads the emit counter for the just-filled ray buffer,
  writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps the
  cur/next ray buffers, resets the next emit counter. One PREP runs before the
  first TRACE and one after each SHADE.
- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i,
  runs `_rtTraverseTlas`, writes `HitResult` i
  (t/instanceId/primId/hg/attribs/objToWorld/customIndex/missFlag).
- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload
  slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User
  code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`.
- **RESOLVE** (1 thread/pixel, 8×8): reads the accum slot, runs user
  `resolve_main` if present, else passthrough; writes outImage.

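A CPU-side sketch of the PREP bookkeeping listed above, written in JavaScript for illustration only (the real kernel is WGSL). The `state` fields (`emitCount`, `writeIdx`, `readIdx`, `indirectArgs`) are hypothetical names, and modeling the cur/next swap as two buffer indices is an assumption, not the actual layout of `wfCounters`.

```javascript
// Hypothetical CPU model of PREP; names and layout are assumptions.
const WAVE_WIDTH = 64; // TRACE and SHADE are 64-wide per the kernel list

function prepStep(state) {
  // n = rays emitted into the buffer that was just filled.
  const n = state.emitCount[state.writeIdx];

  // Indirect args for the following TRACE/SHADE dispatches: [ceil(n/64), 1, 1].
  state.indirectArgs = [Math.ceil(n / WAVE_WIDTH), 1, 1];
  state.traceCount = n;

  // Swap: the filled buffer becomes the one TRACE/SHADE read from,
  // the other buffer becomes the emit target for SHADE.
  state.readIdx = state.writeIdx;
  state.writeIdx = 1 - state.writeIdx;

  // Reset the emit counter of the new emit target.
  state.emitCount[state.writeIdx] = 0;
  return state;
}
```

With 130 emitted rays this yields three workgroups of 64 threads, so the kernels presumably compare the thread index against `traceCount` to skip the 62 surplus threads.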
## Buffers (rtState, sized to 2*W*H rays)
- `wfRaysA`, `wfRaysB`: `array<WfRay>`, ping/pong. WfRay = origin, tMin, dir,
  tMax, pixel, flags, cullMask, missIndex, sbtOffset, payloadSlot, kind, _pad.
- `wfHits`: `array<HitResult>` (sized to the ray capacity).
- `wfPayload`: `array<Payload>`, declared in the CODEGEN region after the user
  `Payload`.
- `wfAccum`: `array<vec4<f32>>`, one entry per pixel (W*H).
- `wfCounters`: atomic counters (emitA, emitB, trace dispatch args, etc.).
- `wfIndirect`: INDIRECT dispatch-args buffer.

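A rough sizing sketch for the buffers above. It assumes every `WfRay` field other than `origin` and `dir` is a single 32-bit word, which gives 16 words = 64 bytes per ray; the real struct may differ. The `HitResult` stride is passed in because its layout is not spelled out here, and `wfPayload` is left out for the same reason. `wavefrontBufferSizes` is a hypothetical helper, not a function in `dom-webgpu.js`.

```javascript
// Assumption: origin(3) + tMin(1) + dir(3) + tMax(1) + 8 scalar words = 16 words.
const WF_RAY_BYTES = 16 * 4;
const ACCUM_BYTES = 16; // vec4<f32>

function wavefrontBufferSizes(width, height, hitBytes) {
  const pixels = width * height;
  const rayCapacity = 2 * pixels; // "sized to 2*W*H rays"
  return {
    rayCapacity,
    wfRaysA: rayCapacity * WF_RAY_BYTES,
    wfRaysB: rayCapacity * WF_RAY_BYTES,
    wfHits: rayCapacity * hitBytes, // one HitResult per ray slot
    wfAccum: pixels * ACCUM_BYTES,  // one vec4<f32> per pixel
  };
}
```

Under these assumptions a 1920×995 target needs 244,531,200 bytes per ray buffer, which is above WebGPU's default `maxStorageBufferBindingSize` of 128 MiB, so raised device limits would be required at that resolution.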
## API (new, breaking)
- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)`
  → allocates payloadSlot=pixel, writes the ray to the current buffer (atomic
  bump).
- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)`
  spawns into the NEXT buffer, carrying a payload slot. Also available:
  `rtAccumulate(pixel,rgb)`.
- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot.

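The "atomic bump" allocation can be modeled on the CPU as below. This is only an illustration of the pattern: `emitRay` and `emitPrimaryRay` here are hypothetical JavaScript stand-ins, not the WGSL functions of the same spirit, and dropping rays that land past the buffer capacity is an assumption about overflow handling.

```javascript
// `counter` is a one-element Uint32Array; Atomics.add stands in for WGSL atomicAdd.
function emitRay(rayBuffer, counter, ray) {
  // Reserve the next free index; Atomics.add returns the value before the add.
  const slot = Atomics.add(counter, 0, 1);
  if (slot >= rayBuffer.length) {
    return -1; // assumption: rays past capacity are dropped
  }
  rayBuffer[slot] = ray;
  return slot;
}

// Primary rays use the pixel index as their payload slot (payloadSlot=pixel).
function emitPrimaryRay(rayBuffer, counter, pixelIndex, ray) {
  return emitRay(rayBuffer, counter, { ...ray, pixel: pixelIndex, payloadSlot: pixelIndex });
}
```

In this model the counter keeps growing even when a ray is dropped, so a consumer of the count would have to clamp it to the buffer capacity before dispatching.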
## Tonemap / resolve
The accum buffer is linear. An optional user `WebGPURTStage::Resolve` entry,
`resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>`, maps it to the output
color. None → passthrough.
VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.

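A minimal JavaScript sketch of a Reinhard + gamma resolve, to make the operator concrete. It is not the Sponza shader: the gamma value of 2.2, the clamp of negative inputs, and the untouched alpha channel are all assumptions, and `resolvePixel` is a hypothetical helper that mirrors the "None → passthrough" rule.

```javascript
// hdr is [r, g, b, a] in linear space.
function resolveReinhardGamma(hdr, gamma = 2.2) {
  const map = (c) => {
    const linear = Math.max(c, 0);           // guard against negative radiance
    const toneMapped = linear / (1 + linear); // Reinhard: maps [0, inf) into [0, 1)
    return Math.pow(toneMapped, 1 / gamma);   // gamma encode
  };
  return [map(hdr[0]), map(hdr[1]), map(hdr[2]), hdr[3]];
}

// No user resolve entry means the linear value is written unchanged.
function resolvePixel(hdr, userResolve) {
  return userResolve ? userResolve(hdr) : hdr;
}
```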
## Indirect dispatch (Phase 2 de-risk)
Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy
"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit
barrier between compute passes in one submit, so atomics written in PREP are
visible to TRACE.

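The toy test could be encoded roughly as follows. This is a sketch under assumptions: `encodePrepThenIndirect` and the `{ pipeline, bindGroup }` argument shape are hypothetical, while the encoder and pass calls are standard WebGPU. The indirect buffer must be created with `GPUBufferUsage.INDIRECT` (plus `STORAGE` so the first pass can write it).

```javascript
// prep / consumer: { pipeline: GPUComputePipeline, bindGroup: GPUBindGroup }
function encodePrepThenIndirect(device, prep, consumer, indirectBuffer) {
  const encoder = device.createCommandEncoder();

  // Pass 1: a single workgroup writes the [x, y, z] workgroup counts.
  const prepPass = encoder.beginComputePass();
  prepPass.setPipeline(prep.pipeline);
  prepPass.setBindGroup(0, prep.bindGroup);
  prepPass.dispatchWorkgroups(1);
  prepPass.end();

  // Pass 2: the counts are read from the buffer at byte offset 0 on the GPU,
  // so the CPU never needs to know how many rays were emitted.
  const consumerPass = encoder.beginComputePass();
  consumerPass.setPipeline(consumer.pipeline);
  consumerPass.setBindGroup(0, consumer.bindGroup);
  consumerPass.dispatchWorkgroupsIndirect(indirectBuffer, 0);
  consumerPass.end();

  // Both passes go out in one submit.
  device.queue.submit([encoder.finish()]);
}
```

Keeping the writer and the indirect consumer in separate passes is what the note above relies on for visibility of the written counts.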
## maxDepth
Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).

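The unrolled chain for a given maxDepth can be written out as a list of pass names. `buildPassChain` is a hypothetical helper for illustration; it follows the `GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE` order from the introduction.

```javascript
function buildPassChain(maxDepth) {
  const passes = ["GENERATE", "PREP"]; // one PREP before the first TRACE
  for (let depth = 0; depth < maxDepth; depth++) {
    passes.push("TRACE", "SHADE", "PREP"); // one PREP after each SHADE
  }
  passes.push("RESOLVE");
  return passes;
}
```

So maxDepth=1 gives 6 passes and maxDepth=2 gives 9; each additional bounce adds three.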
## Status / progress
- [x] baseline VulkanTriangle renders (megakernel)
- [x] wavefront prelude + codegen (5 entry points share one module)
- [x] VulkanTriangle on wavefront (maxDepth=1): bit-identical to baseline
- [x] indirect-dispatch bounce loop + PREP (cross-pass atomics proven)
- [x] RTStress example (N³ cube grid) + GPU timestamp-query per-pass HUD
- [x] Sponza port (shadow ray in SHADE): renders the atrium correctly
- [x] ordered (nearest-child-first) traversal
- [x] dynamic TLAS sweep-tree depth (next_pow2 instances)
- [x] device limits (maxBufferSize / maxStorageBufferBindingSize /
  maxComputeWorkgroupsPerDimension) + timestamp-query feature
- [x] megakernel dead path removed (RT pipeline builds only wavefront)
- [~] binding packing (Phase 7): SKIPPED. The target device reports 64 storage
  buffers/stage (≥12), so the merge is unnecessary (the issue makes it
  conditional on <12). NOTE: this only holds because dom-webgpu.js now
  requests the adapter's reported maxStorageBuffersPerShaderStage at
  device creation (it was hardcoded to 16, which left room for ~1 user
  storage buffer and broke RT pipelines with ≥2). Devices that genuinely
  report <12 storage buffers/stage still need this packing.

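The limit request behind the binding-packing note can be sketched as below. This is not the code in `dom-webgpu.js`: `clampToAdapter` and `storageBufferRequiredLimits` are hypothetical helpers, and leaving an unexposed limit name out of the request entirely is an assumption about how the guard behaves.

```javascript
const STORAGE_LIMIT_NAMES = [
  "maxStorageBuffersPerShaderStage",
  "maxStorageBuffersInPipelineLayout",
];

// Never ask for more than the adapter reports; undefined if the name is not exposed.
function clampToAdapter(adapterLimits, name, wanted) {
  const cap = adapterLimits[name];
  return typeof cap === "number" ? Math.min(wanted, cap) : undefined;
}

function storageBufferRequiredLimits(adapterLimits) {
  const requiredLimits = {};
  for (const name of STORAGE_LIMIT_NAMES) {
    // Ask for whatever the adapter reports, falling back to 16 when it reports nothing.
    const wanted = adapterLimits[name] || 16;
    const value = clampToAdapter(adapterLimits, name, wanted);
    if (value !== undefined) requiredLimits[name] = value; // skip unexposed names
  }
  return requiredLimits;
}
```

In a browser the result would be passed as `adapter.requestDevice({ requiredLimits })`, with `adapter.limits` as the input object.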
### Measured (this container's GPU, via timestamp-query; NOT a 4090)
Per-pass GPU time, 1920×995, primary+shadow (maxDepth=2):
- RTStress 512 inst: GEN ~0.80ms, TRACE ~1.63ms, SHADE ~1.00ms, total ~3.52ms (~280 fps)
- RTStress 4096 inst: GEN ~0.80ms, TRACE ~1.95ms, SHADE ~1.00ms, total ~3.85ms (~260 fps)
- Sponza: GEN ~0.79ms, TRACE ~1.81ms, SHADE ~1.00ms, total ~3.69ms

8× the instances costs only ~20% more TRACE (1.63ms → 1.95ms): the spatial
TLAS + ordered descent scale sub-linearly. NOTE: a 4090 number and the
TRACE-kernel register/occupancy delta require hardware + a profiler not
available in this CI container; the architectural win (TRACE carries zero user
code, so its register footprint is the traversal loop alone) is structural.

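The derived figures can be rechecked from the per-pass numbers with two one-line helpers (hypothetical names, plain arithmetic). The TRACE increase is taken relative to the 512-instance time, and the fps values are simply 1000 divided by the total GPU time in milliseconds, so they ignore any CPU or present overhead.

```javascript
// Fractional increase from `before` to `after`, relative to `before`.
function relativeIncrease(before, after) {
  return (after - before) / before;
}

// Frames per second implied by a per-frame GPU time in milliseconds.
function fpsFromFrameMs(ms) {
  return 1000 / ms;
}

const traceIncrease = relativeIncrease(1.63, 1.95); // 512 → 4096 instances
const fps512 = fpsFromFrameMs(3.52);
const fps4096 = fpsFromFrameMs(3.85);
```

Here `traceIncrease` comes out just under 0.20, against an 8× (4096/512) growth in instance count.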
## Files
- `additional/dom-webgpu.js`: prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,
  `wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131).
- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp`: assembles user
  WGSL + entry glue; must emit 5 entry points + payloadStore binding.
- `examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl` + `main.cpp`.