WebGPU RT: complete the wavefront rewrite (single deliverable — remaining phases) #3
Reference
Catcrafts/Crafter.Graphics#3
Scope: remaining wavefront rewrite — to be delivered as ONE PR
This tracks the remainder of the WebGPU RT wavefront rewrite planned in #1. PR #2 landed
only Phase 3 (TLAS coherence / bitonic sort) in isolation. Everything below is what is left.
The goal is unchanged from #1: replace the single megakernel software ray tracer with a
wavefront / streaming tracer (GENERATE → TRACE → SHADE → RESOLVE, GPU-driven indirect
bounce loop, TRACE kernel containing zero user code for high SM occupancy). Target: 60 fps on
a 4090 for a many-instance (3DForts-style) scene with primary + shadow rays.
Already done (PR #2 — do not redo)
bitonic network in lbvhBuildMain. The TLAS now has Morton spatial coherence.
Remaining work: all of the following, in one PR
Measurement harness (plan Phase 0) — build first.
examples/RTStress/: an N×N×N grid of a small mesh, instance count a compile/runtime knob (512 → ~8000), primary + shadow ray. This is the standing many-instance benchmark.
GPU timestamps (timestamp-query feature) around each pass + a frame-time HUD/console line. Capture the baseline megakernel number before the rewrite so the
before/after delta is quantified in the PR.
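A minimal sketch of how the RTStress instance grid could be generated on the host. The names `Vec3` and `makeGridTranslations` and the `spacing` parameter are hypothetical, not existing Crafter.Graphics API. It only shows that the instance-count knob reduces to one integer N (N=8 gives 512 instances, N=20 gives 8000):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type; the real example would use the engine's own math types.
struct Vec3 { float x, y, z; };

// Returns n*n*n translations on a regular lattice centred on the origin.
// Each translation would become one TLAS instance of the same small mesh.
std::vector<Vec3> makeGridTranslations(std::uint32_t n, float spacing) {
    assert(n > 0);
    std::vector<Vec3> out;
    out.reserve(static_cast<std::size_t>(n) * n * n);
    // Shift so the lattice spans [-offset, +offset] on every axis.
    const float offset = 0.5f * spacing * static_cast<float>(n - 1);
    for (std::uint32_t z = 0; z < n; ++z) {
        for (std::uint32_t y = 0; y < n; ++y) {
            for (std::uint32_t x = 0; x < n; ++x) {
                out.push_back({static_cast<float>(x) * spacing - offset,
                               static_cast<float>(y) * spacing - offset,
                               static_cast<float>(z) * spacing - offset});
            }
        }
    }
    return out;
}
```

Sweeping N from 8 to 20 covers the 512 → ~8000 range named above with a single parameter, which keeps the fps-vs-instance-count plot easy to reproduce.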
Indirect-dispatch plumbing (plan Phase 2).
dispatchWorkgroupsIndirect + INDIRECT buffer usage + a toy "emit N → dispatch N" round-trip to prove cross-pass atomic visibility
on the target Dawn build. No precedent in the repo — this is the highest-uncertainty piece;
prove it before building real kernels on it.
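As a CPU-side reference for the toy round-trip, the arithmetic that turns an emitted ray count into indirect dispatch arguments might look like the sketch below. WebGPU's indirect dispatch buffer holds three u32 workgroup counts (x, y, z). The function name, the workgroup size of 64, and the 2D fallback are assumptions for illustration; on the GPU this logic would live in a WGSL kernel, and any kernel dispatched this way would need to bounds-check its linear index because x·y can exceed the exact group count.

```cpp
#include <cassert>
#include <cstdint>

// Layout of the three u32 values read by dispatchWorkgroupsIndirect.
struct DispatchIndirectArgs { std::uint32_t x, y, z; };

// Hypothetical helper: converts a ray count into workgroup counts.
// Falls back to a 2D grid when a 1D dispatch would exceed the per-dimension limit.
DispatchIndirectArgs makeDispatchArgs(std::uint32_t rayCount,
                                      std::uint32_t workgroupSize,
                                      std::uint32_t maxGroupsPerDim) {
    assert(workgroupSize > 0 && maxGroupsPerDim > 0);
    // 64-bit intermediate so the round-up cannot overflow.
    const std::uint64_t groups =
        (static_cast<std::uint64_t>(rayCount) + workgroupSize - 1) / workgroupSize;
    if (groups <= maxGroupsPerDim) {
        return {static_cast<std::uint32_t>(groups), 1u, 1u};
    }
    // De-linearize: pick the smallest row count y, then the narrowest x covering all groups.
    const std::uint64_t y = (groups + maxGroupsPerDim - 1) / maxGroupsPerDim;
    const std::uint64_t x = (groups + y - 1) / y;
    assert(y <= maxGroupsPerDim);
    return {static_cast<std::uint32_t>(x), static_cast<std::uint32_t>(y), 1u};
}
```

For example, one ray per pixel at 3840×2160 with 64 threads per group needs 129,600 groups, which is above WebGPU's default per-dimension limit of 65,535, so this sketch would split it into a 64,800 × 2 grid. Comparing GPU-written arguments against this reference is one way to check the round-trip on the target Dawn build.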
Megakernel split — GENERATE / TRACE / SHADE / RESOLVE (plan Phase 4). Split the one
@compute megakernel into four entry-point modules in implementations/Crafter.Graphics-PipelineRTWebGPU.cpp sharing the SBT switches. TRACE contains zero user code (traversal + intersection only). Bring up at
maxDepth=1 (primary only, no emit) and validate against VulkanTriangle. Move the
Payload-typed storage binding into the codegen region (after the user's
struct Payload). Define the ray/hit/payload buffers (2×W·H double-buffered, capacity guard = 1 continuation ray/pixel).
GPU-driven bounce loop + emit/accumulate API break (plan Phase 5). Wire the unrolled
GENERATE; PREP; (TRACE; SHADE; PREP)×maxDepth; RESOLVE chain with rtEmitRay/rtAccumulate/rtEmitPrimaryRay. Each kernel gets its own beginComputePass. Breaking API change (preferred): raygen emits a primary ray; closesthit/miss run in SHADE and emit
continuation/shadow rays + accumulate. Port the Sponza shaders (raygen.wgsl, closesthit.wgsl, miss.wgsl) to the new model and validate the image matches.
Ordered (nearest-child-first) traversal (plan Phase 6). Add
_rtAabbT (entry-t); in both
_rtTraverseBlas and _rtTraverseTlas, descend the nearer child and push the farther only if
t < bestT. Measure the delta on RTStress.
Dynamic TLAS tree depth (deferred in PR #2).
N_PADDED = next_pow2(N_real) per build so descent depth tracks real instance count instead of fixed 14. Couples the build and trace
shaders — include it here.
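To make the dynamic-depth item concrete, here is a small host-side sketch of the padding and depth calculation. It assumes the TLAS is a complete binary tree over the padded leaves, so descent depth equals log2(N_PADDED); that is an assumption about the build, not something confirmed here. `nextPow2` and `tlasDepth` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two >= n (n must fit without overflowing 32 bits).
std::uint32_t nextPow2(std::uint32_t n) {
    assert(n > 0 && n <= 0x80000000u);
    std::uint32_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Depth of a complete binary tree whose leaf count is nextPow2(nReal).
// Both the build shader and the trace shader would need to agree on this value.
std::uint32_t tlasDepth(std::uint32_t nReal) {
    const std::uint32_t padded = nextPow2(nReal);
    std::uint32_t depth = 0;
    while ((1u << depth) < padded) {
        ++depth;
    }
    return depth;
}
```

Under that assumption, 512 instances would descend 9 levels and 8000 instances 13 levels, compared with 14 levels for a fixed 16,384-leaf padding.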
Binding packing (plan Phase 7) — conditional. Only if a target device reports <12
storage buffers in TRACE/SHADE: merge tlasEntryOrder into BVH pad words and unify vertices/indices/primRemap into one u32 heap. Skip if the 4090/Dawn target reports ≥12.
Device limits
Extend the clamp(...) block in additional/dom-webgpu.js to also request maxBufferSize, maxStorageBufferBindingSize (payloadStore ≈ 130 MB at 1080p, over the 128 MB baseline), and maxComputeWorkgroupsPerDimension (4K 1D dispatch; or de-linearize to 2D in PREP).
Definition of done (all required, in the one PR)
scales, and the PR reports fps vs instance count on the 4090, before vs after, plus the
TRACE-kernel occupancy/register delta (the core justification).
megakernel output.
Refs #1. Supersedes the "remaining phases" follow-up noted in PR #2's scope note.