The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:
GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE
- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
the indirect-args buffer lives in its own bind group so it is never
bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
PerDimension so the W*H-sized work buffers fit past the 128MB baseline.
VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the disabled LSD radix sort in lbvhBuildMain with a data-oblivious
workgroup bitonic sorting network and enable it. The radix scatter was gated
behind `if (false)` because it produced count/distribution-dependent
corruption (TODO-lbvh-sort.md) — a memory-ordering bug in the Hillis-Steele
scan / parallel scatter that surfaced only for certain Morton distributions
(a small object beside a tight cluster), making geometry flicker.
A bitonic network's compare-exchange schedule depends only on N_PADDED, never
on key values, so it sidesteps that entire class of distribution-dependent
races (TODO strategy #5). 105 sub-stages over 2^14 keys, single workgroup of
1024 threads, 8 compare-exchanges/thread/sub-stage, operating in-place on
sortA with a storageBarrier between sub-stages. Sentinel keys (0xFFFFFFFF)
compare largest and settle at the tail, exactly where Phase 4 expects them.
Restores Morton (Z-order) spatial coherence to TLAS BVH leaves, which the
many-instance case needs. Removes the now-dead radix histogram/scan workgroup
memory and constants.
Verified on the Firefox/Dawn WebGPU stack: a GPU unit test diffs the kernel
output against a CPU oracle across all three required distributions
(all-uniform, all-one-bucket, small-object-next-to-cluster) plus random,
reverse, and empty inputs — all match bit-for-bit with a valid index
permutation. Sponza renders correctly with the sort live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>