WebGPU RT: wavefront rewrite of the software ray tracer #1
Labels
No labels
claude:done
claude:in-progress
claude:ready
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
Catcrafts/Crafter.Graphics#1
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Plan: Wavefront rewrite of the WebGPU software ray tracer
Context
Crafter.Graphics' WebGPU backend is a software ray tracer (native Vulkan uses the
4090's hardware RT cores; WebGPU has none). It currently runs as a single megakernel
compute shader — one
@compute @workgroup_size(8,8,1)kernel, one thread per pixel, thatinlines raygen → TLAS traverse → BLAS traverse → the entire SBT (
runClosestHit/runAnyHit/runMiss) → shadow rays. This is "several orders of magnitude slower than native," far worsethan the gap physics requires.
Honest target assessment: a well-tuned software tracer on a 4090 typically lands ~5–20×
behind hardware-RT, not the ~100× seen now. The extra ~50× is recoverable engineering. 60fps
on a 4090 for a many-instance (3DForts-style) scene with primary + shadow rays is achievable
— but only by restructuring away from the megakernel. The user has chosen a full wavefront
rewrite now, benchmarked against a many-instance scene (the exact case the disabled LBVH
sort breaks at >4000 instances).
Root causes of the current gap (verified in code):
paths (two in-register traversal stacks + full
Payload+ every closesthit's locals + theSBT switch). High register pressure → low SM occupancy → memory latency can't be hidden.
This is the dominant multiplier. (
PipelineRTWebGPU.cpp:180-185, traversal atdom-webgpu.js:1479-1672).
_rtTraverseBlas/_rtTraverseTlasdescend left-first, nevernearest-child-first, so
bestTtightens slowly → excess triangle tests(dom-webgpu.js:1546-1550, :1636-1642).
if (false)(dom-webgpu.js:2203, TODO-lbvh-sort.md),
so TLAS leaves have no spatial coherence (loose AABBs). Fatal at thousands of instances.
TLAS_BVH_N_PADDED=16384leaves;~thousands of real instances means full 14-level descent through mostly-sentinel structure
(dom-webgpu.js:1405-1406).
repo (external consumer). "Optimize to the limit" is meaningless without a harness.
Goal
Replace the megakernel with a wavefront / streaming path tracer: separate
GENERATE → TRACE → SHADE → RESOLVEkernels connected by GPU-resident ray/hit/payload buffers, with a GPU-drivenbounce loop (indirect dispatch, no CPU readback). The TRACE kernel contains zero user code
— traversal + intersection only — for high SM occupancy (the core win). Fix TLAS coherence
(bitonic sort) and add ordered traversal so the many-instance benchmark is meaningful.
This is a breaking change to the WebGPU RT shader API (preferred per project policy): the
synchronous "raygen calls
traceRay, inspects the hit, traces a shadow ray" pattern cannotsurvive without continuations. Raygen now emits a primary ray; closesthit/miss run in SHADE
and may emit continuation/shadow rays + accumulate color — the standard DXR/wavefront idiom.
(Native Vulkan RT path is untouched.)
Critical files
— codegen. Splits the one megakernel into four entry-point modules (GENERATE, TRACE,
SHADE, RESOLVE) sharing the SBT switches; moves the
Payload-typed storage binding into thecodegen region (after the user's
struct Payloaddeclaration).(:1278-1672), RT dispatch
(:2743-2821), TLAS/LBVH build
(:1931-2638), device limits
(:118-139). New indirect plumbing, wavefront dispatch chain,
bitonic sort, and the new kernels live here.
wgpuDispatchWavefrontRT+ indirect-dispatch import decls.driver switches to the wavefront chain.
closesthit.wgsl,miss.wgsl—rewritten to the emit/accumulate model (reference for the API break).
examples/RTStress/— scalable many-instance benchmark + timing HUD (Phase 0).Design
Stages (all compute, recorded once per frame on
state.encoder)(⌈W/8⌉,⌈H/8⌉), one thread/pixel. Userraygen_main(pixel)callsrtEmitPrimaryRay(...)+ inits payload; writesRayState/RayMeta/payload sloty*W+x,zeroes
pixelAccum.dispatchWorkgroupsIndirect,@workgroup_size(64).No user code. Reads compacted ray list + accel buffers, writes
HitState. Holds only thetraversal stacks + working ray → minimal registers → high occupancy. v1 = opaque-only
(current Sponza is all-opaque; defer anyhit to a later variant).
runClosestHit/runMissswitches move here). Reads
HitState+payload, dispatches byhit.t < tMax,rtAccumulatescolor, and
rtEmitRays continuation/shadow rays (atomic-append + compaction into thenext-bounce buffer).
GENERATE/SHADE into
dispatchWorkgroupsIndirectargs(⌈count/64⌉,1,1)and resets theconsumed counter. Must be its own kernel: WGSL atomics are only globally coherent across a
dispatch boundary, so count→args conversion can't happen inside SHADE.
pixelAccum→ ping/pong output texture (preservingstate.outIsPing); tonemap/gamma move here from raygen.GPU-driven bounce loop (no per-bounce CPU readback)
Unrolled at command-record time to a fixed
maxDepth(Sponza needs 2 = primary + shadow):Each kernel is its own
beginComputePass/end(indirect args from one pass must be visible tothe next — the same reason TLAS build ends/reopens the pass at
dom-webgpu.js:2602). Zero-ray rounds dispatch
(0,1,1)→ near-free.Buffers
RayState(32B) +RayMeta(16B: pixelIndex, flags, missIndex, payloadSlot) — double-buffered (ping/pong per bounce), sized2×W·Honly.HitState(96B, matchesHitInfoat dom-webgpu.js:1293) — single buffer reused per bounce.payloadStore— bound asarray<Payload>(the user's type), so size/layout arecompiler-enforced; no fixed-size cap or bitcast. Requires moving the binding text into the
codegen region after the user's
Payloaddecl.rayCount(2×atomic<u32>) +indirectArgs(2× 3-u32, usageSTORAGE | INDIRECT).rtEmitRayenforces a 1 continuation ray per pixel budget (shadowrays excepted), so peak live rays fit
2×W·H. Overflow rays are dropped (graceful).API break — Sponza rewritten
Payloadgainsthroughput: vec3<f32>,rayKind: u32(PRIMARY/SHADOW),pixelIndex: u32.raygen_main: build pinhole ray, set throughput=1,rtEmitPrimaryRay(...).closesthit_main(PRIMARY): existing UV/normal/albedo work;rtAccumulate(ambient); ifnDotL>0,rtEmitRay(shadow ray, SKIP_CLOSEST_HIT|TERMINATE_ON_FIRST_HIT, throughput=lit term).miss_main: SHADOW →rtAccumulate(throughput)(sun visible); PRIMARY →rtAccumulate(sky).SKIP_CLOSEST_HIT→ contribute nothing → correct shadow.Coherence (mandatory for the many-instance benchmark)
data-oblivious — its barriers/access pattern are independent of the histogram distribution,
which is exactly the "count-dependent" axis the TODO bisected the bug to.
~105 compare-exchange passes over 16384 keys; in-budget. Same
(morton16<<16)|index16keys;0xFFFFFFFFsentinels sort to the end.N_PADDED = next_pow2(N_real)per build so descent depth tracksreal instance count instead of fixed 14. (Packed Karras BVH is a later refinement iff
profiling still shows TLAS dominating.)
Ordered traversal (TRACE)
Add
_rtAabbTreturning entry-t. At each internal node compute both children's entry-t, descendthe nearer child, push the farther only if
t < bestT. TightensbestTfaster → fewer triangletests. Applies to both
_rtTraverseBlasand_rtTraverseTlas.Device limits (extend the existing
clamp(...)block at dom-webgpu.js:131)maxStorageBuffersPerShaderStage=16is already requested (TRACE/SHADE need 12 — fits). Addrequests for
maxBufferSize,maxStorageBufferBindingSize(payloadStore at 1080p ≈ 130 MB,over the 128 MB baseline), and
maxComputeWorkgroupsPerDimension(needed for 4K 1D dispatch;or de-linearize to 2D in PREP).
Implementation order (de-risk highest-uncertainty first)
examples/RTStress/: N×N×N grid of a small mesh, instance count a compile/runtime knob(512 → ~8000), primary + shadow ray. This is the standing many-instance benchmark.
timestamp-queryfeature) around each pass + a frame-timeHUD/console line. Record the baseline megakernel number to quantify the gap.
dispatchWorkgroupsIndirect+INDIRECTusage + a toy "emit N → dispatch N" round-trip. Biggest unknown (no precedent in repo); prove
cross-pass atomic visibility on the target Dawn build before building real kernels on it.
the existing megakernel image, measure TLAS-traversal fraction before/after.
Validate stage decomposition + binding budget against VulkanTriangle (single trace).
rtEmitRay/rtAccumulate; port Sponza to the emit model.Validate against the current Sponza image.
tlasEntryOrderinto BVH pad words; unify vertices/indices/primRemapinto one u32 heap) only if a target device reports <12 storage buffers.
Optional, secondary (UI — user is skeptical it's the bottleneck)
The UI compute path is
O(W·H·N)(each tile thread iterates every item,UI.cpp:110-131). Cheap win: per-tile item
binning so each tile iterates only overlapping items (~
O(W·H·k)). Defer until RT numbers land.Verification
the current megakernel output after Phase 5. RTStress shows no flicker as instance count scales
(the TODO acceptance criterion: CPU-oracle sort vs GPU output for
all-uniform, all-one-bucket, and "small object next to tight cluster" distributions).
after each phase. Primary metric: 60fps at the target instance count. Secondary: TRACE
kernel register count / occupancy (the rewrite's core justification), per-bounce live-ray
counts (validate the
2×W·Hbudget + emit-drop rate), per-pass GPU time via timestamp queries.project.cppselects DOM backend onwasm32-*),serve
examples/RTStress, read the timing HUD in-browser on the 4090.Phase 3 (TLAS coherence) landed — merged in #2.
Re-enabled the TLAS LBVH spatial sort by replacing the disabled, buggy LSD radix scatter with a data-oblivious workgroup bitonic sorting network in
lbvhBuildMain, and turned it on. This fixes the documentedTODO-lbvh-sort.mdcorruption (theif (false)guard is gone) and restores Morton spatial coherence to TLAS BVH leaves — the prerequisite for the many-instance benchmark to be meaningful.Bitonic is data-oblivious (its compare-exchange schedule depends only on
N_PADDED, not the keys), so it structurally cannot have the count/distribution-dependent race the radix sort did. This was strategy #5 in the TODO.Verified:
crafter-build testis green (library declares no tests; verification was direct).Still open (intentionally not in #2) — the rest of the wavefront rewrite from the plan:
RTStressmany-instance benchmark + GPU timestamp timing HUD (the measurement harness).dispatchWorkgroupsIndirect, cross-pass atomic visibility).rtEmitRay/rtAccumulateAPI break (Sponza ported to emit/accumulate).N_PADDED = next_pow2(N_real)) — couples build + trace shaders; secondary to coherence.Landing the coherence fix first (as the plan recommends — it's independent of the kernel rewrite and validatable against the existing renderer) de-risks the benchmark the remaining phases will be measured against. Leaving this issue open to track them.