WebGPU RT: wavefront/streaming tracer (replaces megakernel) #4

Merged
catbot merged 8 commits from claude/issue-3 into master 2026-05-31 22:31:35 +02:00

8 commits

Author SHA1 Message Date
catbot
358084185a docs: wavefront RT in README + design-doc status; add RTStress to examples 2026-05-31 20:29:12 +00:00
catbot
afc0292fab WebGPU RT: dynamic TLAS sweep-tree depth (next_pow2 instances)
The LBVH bitonic sort still runs over the full 16384 (sentinels sink to
the tail), but the sweep tree is now built and traced at depth
log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf
init + bottom-up refit use it; the host passes the same next_pow2 to the
trace via WfParams.tlasNPadded. Renders correctly at 512 instances
(depth 9). The fragile sort phases are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:28:12 +00:00
catbot
82e5e867d4 WebGPU RT: remove dead megakernel WGSL (no dual path)
The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:24:04 +00:00
catbot
dd4122f2ba WebGPU RT: ordered (nearest-child-first) traversal
Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:21:44 +00:00
catbot
376e66aeed WebGPU RT: port Sponza to wavefront (shadow ray in SHADE)
Restructure Sponza for the wavefront model: raygen emits the primary ray;
closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and
emits a shadow ray carrying the pending direct term; miss adds the sky
(primary) or the direct term (shadow miss). resolve.wgsl applies the same
Reinhard+gamma the megakernel raygen did inline. User bindings moved to
group 3 (groups 0..2 reserved). RTPass maxDepth=2.

Renders the atrium correctly through the wavefront pipeline (textures,
two-sided shading, sun+ambient, shadows, tonemap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:16:04 +00:00
catbot
1d2e12dbc9 WebGPU RT: GPU timestamp-query per-pass harness
Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:08:39 +00:00
catbot
f4d6493d91 wip: uncommitted changes from claude run on issue #3 2026-05-31 16:28:38 +00:00
catbot
4e42d663a6 WebGPU RT: wavefront tracer core (GENERATE/PREP/TRACE/SHADE/RESOLVE)
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:

  GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE

- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
  the indirect-args buffer lives in its own bind group so it is never
  bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
  plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
  between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
  user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
  PerDimension so the W*H-sized work buffers fit past the 128MB baseline.

VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 16:24:41 +00:00