docs: wavefront RT in README + design-doc status; add RTStress to examples

catbot 2026-05-31 20:29:12 +00:00
commit 358084185a
2 changed files with 39 additions and 10 deletions


@ -53,13 +53,31 @@ Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
## Status / progress
- [x] baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png
- [ ] wavefront prelude + codegen
- [ ] VulkanTriangle on wavefront (maxDepth=1)
- [ ] bounce loop + indirect + Sponza shadow port
- [ ] RTStress example + timestamp queries
- [ ] ordered traversal, dynamic TLAS depth, device limits
- [ ] remove megakernel dual path; final validation; PR
- [x] baseline VulkanTriangle renders (megakernel)
- [x] wavefront prelude + codegen (5 entry points share one module)
- [x] VulkanTriangle on wavefront (maxDepth=1) — bit-identical to baseline
- [x] indirect-dispatch bounce loop + PREP (cross-pass atomics proven)
- [x] RTStress example (N³ cube grid) + GPU timestamp-query per-pass HUD
- [x] Sponza port (shadow ray in SHADE) — renders the atrium correctly
- [x] ordered (nearest-child-first) traversal
- [x] dynamic TLAS sweep-tree depth (next_pow2 instances)
- [x] device limits (maxBufferSize / maxStorageBufferBindingSize /
maxComputeWorkgroupsPerDimension) + timestamp-query feature
- [x] megakernel dead path removed (RT pipeline builds only wavefront)
- [~] binding packing (Phase 7): SKIPPED — target device reports 64 storage
buffers/stage (≥12), so the merge is unnecessary (issue makes it
conditional on <12).
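A minimal sketch of how the device-limit, timestamp-query, and Phase 7 items above could be wired together. The limit and feature names are standard WebGPU; the helper names (`buildDeviceDescriptor`, `needsBindingPacking`) and the exact request logic are hypothetical and are not taken from `dom-webgpu.js`.

```javascript
// Sketch only: helper names are hypothetical, not the project's actual code.
const WANTED_LIMITS = [
  "maxBufferSize",
  "maxStorageBufferBindingSize",
  "maxComputeWorkgroupsPerDimension",
];

// Build a GPUDeviceDescriptor-shaped object that requests the adapter's own
// values for the limits listed above, plus timestamp-query when supported.
function buildDeviceDescriptor(adapter) {
  const requiredLimits = {};
  for (const name of WANTED_LIMITS) {
    const value = adapter.limits[name];
    if (value !== undefined) requiredLimits[name] = value;
  }
  const requiredFeatures = adapter.features.has("timestamp-query")
    ? ["timestamp-query"]
    : [];
  return { requiredFeatures, requiredLimits };
}

// Phase 7 gate as described in the status list: binding packing is only
// needed when the device exposes fewer than 12 storage buffers per stage.
function needsBindingPacking(limits, needed = 12) {
  return limits.maxStorageBuffersPerShaderStage < needed;
}
```

In a browser this would be used roughly as `const device = await adapter.requestDevice(buildDeviceDescriptor(adapter));`. With the 64 storage buffers per stage reported above, `needsBindingPacking` returns `false`, which matches the decision to skip the merge.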
### Measured (this container's GPU, via timestamp-query; NOT a 4090)
Per-pass GPU time, 1920×995, primary+shadow (maxDepth=2):
- RTStress 512 inst: GEN ~0.80ms TRACE ~1.63ms SHADE ~1.00ms total ~3.52ms (~280 fps)
- RTStress 4096 inst: GEN ~0.80ms TRACE ~1.95ms SHADE ~1.00ms total ~3.85ms (~260 fps)
- Sponza: GEN ~0.79ms TRACE ~1.81ms SHADE ~1.00ms total ~3.69ms
8× the instances costs only ~20% more TRACE time (1.63ms → 1.95ms): the
spatial TLAS + ordered descent scale sub-linearly. NOTE: a 4090 measurement
and the TRACE-kernel register/occupancy delta require hardware and a profiler
that are not available in this CI container. The architectural win is
structural: TRACE carries zero user code, so its register footprint is the
traversal loop alone.
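The relative figures can be recomputed from the per-pass numbers with a few lines of JavaScript. This is a sketch for checking the arithmetic, not part of the project. The helper names are hypothetical, and `assumedTreeDepth` assumes a complete binary tree with next_pow2(instances) leaves, which the notes above suggest but do not spell out.

```javascript
// Figures copied from the measurements above (milliseconds).
const runs = {
  rtStress512: { instances: 512, trace: 1.63, total: 3.52 },
  rtStress4096: { instances: 4096, trace: 1.95, total: 3.85 },
};

// Relative growth from a to b, as a fraction (0.5 means +50%).
function relativeIncrease(a, b) {
  return (b - a) / a;
}

// Frame rate implied by a total per-frame GPU time in milliseconds.
function fpsFromMs(ms) {
  return 1000 / ms;
}

// ASSUMPTION: depth of a complete binary tree whose leaf count is the
// instance count rounded up to the next power of two.
function assumedTreeDepth(instances) {
  let depth = 0;
  for (let leaves = 1; leaves < instances; leaves *= 2) depth++;
  return depth;
}

const small = runs.rtStress512;
const large = runs.rtStress4096;
const traceGrowth = relativeIncrease(small.trace, large.trace);

console.log(`instances: x${large.instances / small.instances}`);
console.log(`TRACE growth: +${(traceGrowth * 100).toFixed(1)}%`);
console.log(`fps: ${fpsFromMs(small.total).toFixed(0)} vs ${fpsFromMs(large.total).toFixed(0)}`);
console.log(`assumed depth: ${assumedTreeDepth(small.instances)} vs ${assumedTreeDepth(large.instances)}`);
```

Under the binary-tree assumption the 8× instance increase adds only three levels to the descent, which is one plausible reading of why TRACE time grows so much more slowly than the instance count.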
## Files
- `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,