From 4e42d663a6179a179196ef3a8f46474a7df9bc7a Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 16:24:41 +0000
Subject: [PATCH 1/8] WebGPU RT: wavefront tracer core
 (GENERATE/PREP/TRACE/SHADE/RESOLVE)

Replace the megakernel @compute entry with five wavefront kernels sharing
one module, connected by GPU ray/hit/payload buffers and a GPU-driven
indirect bounce loop:

  GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE

- TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only).
- PREP publishes dispatchWorkgroupsIndirect args from the live ray count;
  the indirect-args buffer lives in its own bind group so it is never
  bound read-write in the same dispatch that consumes it as INDIRECT.
- New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate,
  plus an optional user Resolve stage (tonemap hook; identity by default).
- Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary
  between passes within one submit).
- Payload-typed wfPayload binding emitted in the codegen region after the
  user's struct Payload; payload travels with each ray (2*W*H slots).
- Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups
  PerDimension so the W*H-sized work buffers fit past the 128MB baseline.

VulkanTriangle ported to the new API and renders bit-identical to the
megakernel baseline at maxDepth=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 WAVEFRONT-DESIGN.md                           |  69 ++
 additional/dom-webgpu.js                      | 653 ++++++++++++++++--
 examples/VulkanTriangle/closesthit.wgsl       |  11 +-
 examples/VulkanTriangle/miss.wgsl             |   5 +-
 examples/VulkanTriangle/raygen.wgsl           |  25 +-
 .../Crafter.Graphics-PipelineRTWebGPU.cpp     |  73 +-
 interfaces/Crafter.Graphics-RTPass.cppm       |   8 +-
 ...ter.Graphics-ShaderBindingTableWebGPU.cppm |   5 +
 interfaces/Crafter.Graphics-WebGPU.cppm       |   3 +-
 9 files changed, 753 insertions(+), 99 deletions(-)
 create mode 100644 WAVEFRONT-DESIGN.md
diff --git a/WAVEFRONT-DESIGN.md b/WAVEFRONT-DESIGN.md
new file mode 100644
index 0000000..78d77e2
--- /dev/null
+++ b/WAVEFRONT-DESIGN.md
@@ -0,0 +1,69 @@
+# WebGPU wavefront RT rewrite — design & progress (issue #3)
+
+Replaces the single megakernel (`main`, 8×8 tile, per-pixel
+raygen→traceRay→CH/miss→store) with a streaming wavefront tracer:
+`GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own
+compute pass, dispatch sizes driven by `dispatchWorkgroupsIndirect`.
+
+## Kernels (all generated/assembled the same megakernel way, just split)
+- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)` which calls
+  `rtEmitPrimaryRay(...)`. Clears accum slot + payload slot for the pixel.
+- **PREP** (1 thread): reads emit counter for the just-filled ray buffer,
+  writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps
+  cur/next ray buffer, resets next emit counter. One PREP before first TRACE
+  and one after each SHADE.
+- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i,
+  runs `_rtTraverseTlas`, writes `HitResult` i (t/instanceId/primId/hg/attribs
+  /objToWorld/customIndex/missFlag).
+- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload
+  slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User
+  code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`.
+- **RESOLVE** (1 thread/pixel, 8×8): reads accum slot, runs user `resolve_main`
+  if present else passthrough; writes outImage.
+
+## Buffers (rtState, sized to 2*W*H rays)
+- `wfRaysA`,`wfRaysB`: array<WfRay>, ping/pong. WfRay = origin,tMin,dir,tMax,
+  pixel,flags,cullMask,missIndex,sbtOffset,payloadSlot,kind,_pad.
+- `wfHits`: array<HitResult> (sized = ray capacity).
+- `wfPayload`: array<Payload> — declared in CODEGEN region after user Payload.
+- `wfAccum`: array<vec4<f32>> per pixel (W*H).
+- `wfCounters`: atomic counters: emitA, emitB, trace dispatch args, etc.
+- `wfIndirect`: INDIRECT dispatch-args buffer.
+
+## API (new, breaking)
+- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)`
+  → allocates payloadSlot=pixel, writes ray to current buffer (atomic bump).
+- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)`
+  spawns into NEXT buffer carrying a payload slot; `rtAccumulate(pixel,rgb)`.
+- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot.
+
+## Tonemap / resolve
+Accum buffer is linear. Optional user `WebGPURTStage::Resolve` entry
+`resolve_main(coord:vec2<u32>, hdr:vec4<f32>)->vec4<f32>`. None → passthrough.
+VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma.
+
+## Indirect dispatch (Phase 2 de-risk)
+Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy
+"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit
+barrier between compute passes in one submit, so atomics written in PREP are
+visible to TRACE.
+
+## maxDepth
+Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
+maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
+
+## Status / progress
+- [x] baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png
+- [ ] wavefront prelude + codegen
+- [ ] VulkanTriangle on wavefront (maxDepth=1)
+- [ ] bounce loop + indirect + Sponza shadow port
+- [ ] RTStress example + timestamp queries
+- [ ] ordered traversal, dynamic TLAS depth, device limits
+- [ ] remove megakernel dual path; final validation; PR
+
+## Files
+- `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,
+  `wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131).
+- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp` — assembles user
+  WGSL + entry glue; must emit 5 entry points + payloadStore binding.
+- examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl + main.cpp.
diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js
index e035d97..9380517 100644
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
@@ -136,6 +136,16 @@ clamp("maxStorageTexturesPerShaderStage", 8);
 // per-workgroup invocation cap raised from the default 256.
 clamp("maxComputeInvocationsPerWorkgroup", 1024);
 clamp("maxComputeWorkgroupSizeX", 1024);
+// Wavefront RT work buffers are sized to W·H rays. At 1080p the payload
+// store (≈245 MB) and hit buffer (≈214 MB) blow past the 128 MB baseline
+// storage-buffer binding size, and the whole set past the 256 MB baseline
+// maxBufferSize — request whatever the adapter actually allows (4090/Dawn
+// reports 1 GB+). maxComputeWorkgroupsPerDimension bounds the indirect
+// TRACE/SHADE 1-D dispatch (ceil(W·H/64) ≈ 32k workgroups at 1080p; the
+// 65535 default covers it, but request the adapter max for headroom).
+clamp("maxBufferSize", 1 << 30);
+clamp("maxStorageBufferBindingSize", 1 << 30);
+clamp("maxComputeWorkgroupsPerDimension", 65535);
 const device  = await adapter.requestDevice({ requiredLimits });
 const queue   = device.queue;
 const ctx     = canvas.getContext("webgpu");
@@ -147,6 +157,9 @@ device.lost.then((info) => {
     console.error("[crafter-wgpu] device lost:", info.message);
     state.gpuLost = true;
 });
+device.addEventListener("uncapturederror", (e) => {
+    console.error("[crafter-wgpu] uncaptured error:", e.error && e.error.message);
+});
 
 // ─── handle tables ─────────────────────────────────────────────────────
 
@@ -1671,6 +1684,350 @@ fn traceRay(tlasIdx: u32, flags: u32, cullMask: u32,
 }
 `;
 
+// ════════════════════════════════════════════════════════════════════════
+// WAVEFRONT RT — streaming tracer (GENERATE → PREP → TRACE → SHADE →
+// RESOLVE). Replaces the megakernel. The C++ side (PipelineRTWebGPU) emits
+// the user sources, the per-stage SBT switches, the Payload-typed wfPayload
+// binding, and the five @compute entry points; this JS injects the bindings
+// + library helpers the entry points call.
+// ════════════════════════════════════════════════════════════════════════
+
+// Bindings prelude for the wavefront pipeline. group(0) is the per-pass
+// WfParams uniform (dynamic-offset ring — one slot per pass so curIsA /
+// bounce can vary between passes inside one submit). group(1) carries the
+// geometry heaps (0..9, identical to the old megakernel layout so the
+// register/build paths are unchanged) plus the wavefront work buffers
+// (10..14); wfPayload at binding 15 is emitted in the codegen region after
+// the user's `struct Payload`. group(2) is the indirect-args buffer, bound
+// only by PREP (a buffer used as INDIRECT in a dispatch may not also be
+// bound read-write in that same dispatch — so TRACE/SHADE must not bind it).
+const rtWgslWavefrontBindings = String.raw`
+struct WfParams {
+    surfaceW:    u32,
+    surfaceH:    u32,
+    rayCapacity: u32,
+    curIsA:      u32,   // 1 → current ray buffer is A (emit-next = B); 0 → B
+    bounce:      u32,
+    maxDepth:    u32,
+    tlasNPadded: u32,   // TLAS sweep-tree padded leaf count (descent depth)
+    flags:       u32,
+};
+
+// One in-flight ray. 64 bytes; origin/direction vec3-aligned to 16.
+struct WfRay {
+    origin:          vec3<f32>,
+    tMin:            f32,
+    direction:       vec3<f32>,
+    tMax:            f32,
+    pixel:           u32,   // linear framebuffer pixel this ray contributes to
+    flags:           u32,
+    cullMask:        u32,
+    missIndex:       u32,
+    sbtRecordOffset: u32,
+    payloadSlot:     u32,   // index into wfPayload
+    kind:            u32,   // 0 primary, 1 continuation (informational)
+    _pad:            u32,
+};
+
+// TRACE → SHADE handoff. Mirrors HitInfo + a hitKind (0 miss, 1 triangle).
+struct HitResult {
+    t:                  f32,
+    instanceId:         u32,
+    primitiveId:        u32,
+    hitGroupIndex:      u32,
+    attribs:            vec2<f32>,
+    hitKind:            u32,
+    customIndex:        u32,
+    objectRayOrigin:    vec3<f32>,
+    _p0:                f32,
+    objectRayDirection: vec3<f32>,
+    _p1:                f32,
+    objectToWorldR0:    vec4<f32>,
+    objectToWorldR1:    vec4<f32>,
+    objectToWorldR2:    vec4<f32>,
+};
+
+struct BvhNode {
+    aabbMin: vec3<f32>,
+    _pad0:   u32,
+    aabbMax: vec3<f32>,
+    _pad1:   u32,
+};
+
+@group(0) @binding(0) var<uniform> wfParams : WfParams;
+
+@group(1) @binding(0) var<storage,read>  tlasEntries    : array<TLASEntry>;
+@group(1) @binding(1) var<storage,read>  bvhNodes       : array<BVHNode>;
+@group(1) @binding(2) var<storage,read>  meshRecords    : array<MeshRecord>;
+@group(1) @binding(3) var<storage,read>  vertices       : array<f32>;
+@group(1) @binding(4) var<storage,read>  indices        : array<u32>;
+@group(1) @binding(5) var<storage,read>  primRemap      : array<u32>;
+@group(1) @binding(6) var outImage : texture_storage_2d<rgba8unorm, write>;
+@group(1) @binding(7) var<storage,read>  vertexAttribs  : array<u32>;
+@group(1) @binding(8) var<storage,read>  tlasEntryOrder : array<u32>;
+@group(1) @binding(9) var<storage,read>  tlasBvhNodes   : array<BvhNode>;
+@group(1) @binding(10) var<storage,read_write> wfRaysA   : array<WfRay>;
+@group(1) @binding(11) var<storage,read_write> wfRaysB   : array<WfRay>;
+@group(1) @binding(12) var<storage,read_write> wfHits    : array<HitResult>;
+@group(1) @binding(13) var<storage,read_write> wfAccum   : array<vec4<f32>>;
+@group(1) @binding(14) var<storage,read_write> wfCounters : array<atomic<u32>>;
+// @group(1) @binding(15) wfPayload : array<Payload>  — emitted by codegen.
+
+@group(2) @binding(0) var<storage,read_write> wfIndirect : array<u32>;
+`;
+
+// Library helpers the codegen entry points call. Sits after the pure
+// helpers (_rtAabb/_rtTri/_rtFetchTri) and after the user's Payload +
+// wfPayload binding, so rtEmit*/_wfShade can name Payload/wfPayload.
+const rtWgslWavefrontHelpers = String.raw`
+var<private> _wfPixel: u32 = 0u;
+
+// Live ray count for the current buffer, clamped to capacity (the emit
+// counter can overshoot capacity; dropped rays were never written).
+fn _wfCurCount() -> u32 {
+    let raw = select(atomicLoad(&wfCounters[1]), atomicLoad(&wfCounters[0]),
+                     wfParams.curIsA == 1u);
+    return min(raw, wfParams.rayCapacity);
+}
+
+// Add linear radiance to the pixel this SHADE/GENERATE thread owns. Safe
+// without atomics: at most one ray per pixel per bounce, and bounces run
+// in separate passes (implicit barrier between them).
+fn rtAccumulate(rgb: vec3<f32>) {
+    wfAccum[_wfPixel] = wfAccum[_wfPixel] + vec4<f32>(rgb, 0.0);
+}
+
+// raygen → emit the pixel's primary ray. Bounce 0's current buffer is
+// always A, so primaries land in A with their payload in the A region
+// [0, rayCapacity).
+fn rtEmitPrimaryRay(origin: vec3<f32>, tMin: f32, dir: vec3<f32>, tMax: f32,
+                    flags: u32, cullMask: u32, sbtRecordOffset: u32,
+                    missIndex: u32, payload: Payload) {
+    let slot = atomicAdd(&wfCounters[0], 1u);
+    if (slot >= wfParams.rayCapacity) { return; }
+    var r: WfRay;
+    r.origin = origin; r.tMin = tMin; r.direction = dir; r.tMax = tMax;
+    r.pixel = _wfPixel; r.flags = flags; r.cullMask = cullMask;
+    r.missIndex = missIndex; r.sbtRecordOffset = sbtRecordOffset;
+    r.payloadSlot = slot; r.kind = 0u;
+    wfRaysA[slot] = r;
+    wfPayload[slot] = payload;
+}
+
+// closesthit/miss → spawn a continuation/shadow ray into the NEXT buffer
+// (the one the upcoming TRACE will read). Payload travels with it; the
+// next buffer's payload region is [rayCapacity, 2*rayCapacity) for B.
+fn rtEmitRay(origin: vec3<f32>, tMin: f32, dir: vec3<f32>, tMax: f32,
+             flags: u32, cullMask: u32, sbtRecordOffset: u32,
+             missIndex: u32, payload: Payload) {
+    let nextIsA   = wfParams.curIsA == 0u;
+    let counterIdx = select(1u, 0u, nextIsA);
+    let slot = atomicAdd(&wfCounters[counterIdx], 1u);
+    if (slot >= wfParams.rayCapacity) { return; }
+    let payloadBase = select(wfParams.rayCapacity, 0u, nextIsA);
+    var r: WfRay;
+    r.origin = origin; r.tMin = tMin; r.direction = dir; r.tMax = tMax;
+    r.pixel = _wfPixel; r.flags = flags; r.cullMask = cullMask;
+    r.missIndex = missIndex; r.sbtRecordOffset = sbtRecordOffset;
+    r.payloadSlot = payloadBase + slot; r.kind = 1u;
+    if (nextIsA) { wfRaysA[slot] = r; } else { wfRaysB[slot] = r; }
+    wfPayload[r.payloadSlot] = payload;
+}
+
+// Opaque-only BLAS descent (no anyhit — TRACE runs zero user code).
+fn _rtwTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord,
+                    instanceId: u32, hitGroupBase: u32,
+                    bestHit: ptr<function, HitInfo>,
+                    bestT:   ptr<function, f32>) -> bool {
+    let invD = vec3<f32>(1.0) / rayObj.direction;
+    var stack: array<u32, 32>;
+    var sp: u32 = 0u;
+    var nodeRel: u32 = 0u;
+    loop {
+        let absI = meshRec.bvhOffset + nodeRel;
+        let node = bvhNodes[absI];
+        if (!_rtAabb(rayObj.origin, invD, node.aabbMin, node.aabbMax, *bestT)) {
+            if (sp == 0u) { break; }
+            sp = sp - 1u; nodeRel = stack[sp]; continue;
+        }
+        if (node.primCount > 0u) {
+            for (var i: u32 = 0u; i < node.primCount; i = i + 1u) {
+                let triIndex = primRemap[meshRec.primRemapOffset + node.firstChildOrPrim + i];
+                let verts = _rtFetchTri(meshRec, triIndex);
+                let tr = _rtTri(rayObj.origin, rayObj.direction,
+                                verts[0], verts[1], verts[2], rayObj.tMin, *bestT);
+                if (!tr.hit) { continue; }
+                let geomNormal = cross(verts[1] - verts[0], verts[2] - verts[0]);
+                let facing = dot(geomNormal, rayObj.direction);
+                if ((flags & RT_FLAG_CULL_BACK_FACING_TRIANGLES)  != 0u && facing > 0.0) { continue; }
+                if ((flags & RT_FLAG_CULL_FRONT_FACING_TRIANGLES) != 0u && facing < 0.0) { continue; }
+                var candidate: HitInfo;
+                candidate.t             = tr.t;
+                candidate.instanceId    = instanceId;
+                candidate.primitiveId   = triIndex;
+                candidate.hitGroupIndex = hitGroupBase;
+                candidate.attribs       = vec2<f32>(tr.u, tr.v);
+                candidate.objectRayOrigin    = rayObj.origin;
+                candidate.objectRayDirection = rayObj.direction;
+                *bestHit = candidate;
+                *bestT   = tr.t;
+                if ((flags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
+            }
+            if (sp == 0u) { break; }
+            sp = sp - 1u; nodeRel = stack[sp]; continue;
+        }
+        let left  = node.firstChildOrPrim;
+        let right = left + 1u;
+        if (sp < 32u) { stack[sp] = right; sp = sp + 1u; }
+        nodeRel = left;
+    }
+    return false;
+}
+
+fn _rtwTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32,
+                    sbtRecordOffset: u32,
+                    bestHit: ptr<function, HitInfo>,
+                    bestT:   ptr<function, f32>) -> bool {
+    let invD = vec3<f32>(1.0) / rayWorld.direction;
+    let leavesStart = wfParams.tlasNPadded - 1u;
+    var stack: array<u32, 32>;
+    var sp: u32 = 0u;
+    stack[sp] = 0u; sp = sp + 1u;
+    loop {
+        if (sp == 0u) { break; }
+        sp = sp - 1u;
+        let nodeIdx = stack[sp];
+        let node = tlasBvhNodes[nodeIdx];
+        if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, *bestT)) { continue; }
+        if (nodeIdx >= leavesStart) {
+            let leafIdx = nodeIdx - leavesStart;
+            let i = tlasEntryOrder[leafIdx];
+            if (i == 0xFFFFFFFFu) { continue; }
+            let inst = tlasEntries[i];
+            let instanceMask = inst.maskHGOffset & 0xFFu;
+            if ((instanceMask & cullMask) == 0u) { continue; }
+            if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; }
+            let r0 = inst.worldToObjectR0;
+            let r1 = inst.worldToObjectR1;
+            let r2 = inst.worldToObjectR2;
+            var rayObj: RayDesc;
+            rayObj.origin = vec3<f32>(
+                dot(r0.xyz, rayWorld.origin) + r0.w,
+                dot(r1.xyz, rayWorld.origin) + r1.w,
+                dot(r2.xyz, rayWorld.origin) + r2.w);
+            rayObj.direction = vec3<f32>(
+                dot(r0.xyz, rayWorld.direction),
+                dot(r1.xyz, rayWorld.direction),
+                dot(r2.xyz, rayWorld.direction));
+            rayObj.tMin = rayWorld.tMin;
+            rayObj.tMax = *bestT;
+            var effective = flags;
+            let iflags = inst.instanceFlags;
+            if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) {
+                effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES);
+            }
+            let hitGroupOffset = inst.maskHGOffset >> 8u;
+            let hitGroupBase   = sbtRecordOffset + hitGroupOffset;
+            let meshRec        = meshRecords[inst.blasMeshIdx];
+            let pre = *bestT;
+            let endSearch = _rtwTraverseBlas(rayObj, effective, meshRec, i, hitGroupBase, bestHit, bestT);
+            if ((*bestT) < pre || endSearch) {
+                (*bestHit).objectToWorldR0 = inst.objectToWorldR0;
+                (*bestHit).objectToWorldR1 = inst.objectToWorldR1;
+                (*bestHit).objectToWorldR2 = inst.objectToWorldR2;
+                (*bestHit).customIndex     = inst.customIndex;
+            }
+            if (endSearch) { return true; }
+            if ((*bestT) < pre && (effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
+        } else {
+            let left  = 2u * nodeIdx + 1u;
+            let right = 2u * nodeIdx + 2u;
+            if (sp + 1u < 32u) {
+                stack[sp] = right; sp = sp + 1u;
+                stack[sp] = left;  sp = sp + 1u;
+            }
+        }
+    }
+    return false;
+}
+
+fn _wfReadRay(i: u32) -> WfRay {
+    if (wfParams.curIsA == 1u) { return wfRaysA[i]; }
+    return wfRaysB[i];
+}
+
+// PREP — publish indirect args for the upcoming TRACE/SHADE; zero the next
+// buffer's emit counter.
+fn _wfPrep() {
+    let n = _wfCurCount();
+    wfIndirect[0] = (n + 63u) / 64u;
+    wfIndirect[1] = 1u;
+    wfIndirect[2] = 1u;
+    if (wfParams.curIsA == 1u) { atomicStore(&wfCounters[1], 0u); }
+    else                       { atomicStore(&wfCounters[0], 0u); }
+}
+
+// TRACE — pure traversal, zero user code.
+fn _wfTrace(i: u32) {
+    if (i >= _wfCurCount()) { return; }
+    let ray = _wfReadRay(i);
+    var rd: RayDesc;
+    rd.origin = ray.origin; rd.tMin = ray.tMin;
+    rd.direction = ray.direction; rd.tMax = ray.tMax;
+    var bestHit: HitInfo;
+    bestHit.t = ray.tMax;
+    var bestT = ray.tMax;
+    _rtwTraverseTlas(rd, ray.flags, ray.cullMask & 0xFFu, ray.sbtRecordOffset, &bestHit, &bestT);
+    var hr: HitResult;
+    if (bestT < ray.tMax) {
+        hr.t = bestHit.t;
+        hr.instanceId = bestHit.instanceId;
+        hr.primitiveId = bestHit.primitiveId;
+        hr.hitGroupIndex = bestHit.hitGroupIndex;
+        hr.attribs = bestHit.attribs;
+        hr.hitKind = 1u;
+        hr.customIndex = bestHit.customIndex;
+        hr.objectRayOrigin = bestHit.objectRayOrigin;
+        hr.objectRayDirection = bestHit.objectRayDirection;
+        hr.objectToWorldR0 = bestHit.objectToWorldR0;
+        hr.objectToWorldR1 = bestHit.objectToWorldR1;
+        hr.objectToWorldR2 = bestHit.objectToWorldR2;
+    } else {
+        hr.hitKind = 0u;
+    }
+    wfHits[i] = hr;
+}
+
+// SHADE — dispatch to runMiss / runClosestHit with the ray's payload.
+fn _wfShade(i: u32) {
+    if (i >= _wfCurCount()) { return; }
+    let ray = _wfReadRay(i);
+    let hr = wfHits[i];
+    _wfPixel = ray.pixel;
+    var payload: Payload = wfPayload[ray.payloadSlot];
+    var rd: RayDesc;
+    rd.origin = ray.origin; rd.tMin = ray.tMin;
+    rd.direction = ray.direction; rd.tMax = ray.tMax;
+    if (hr.hitKind == 0u) {
+        runMiss(ray.missIndex, rd, &payload);
+    } else if ((ray.flags & RT_FLAG_SKIP_CLOSEST_HIT) == 0u) {
+        var hit: HitInfo;
+        hit.t = hr.t;
+        hit.instanceId = hr.instanceId;
+        hit.primitiveId = hr.primitiveId;
+        hit.hitGroupIndex = hr.hitGroupIndex;
+        hit.attribs = hr.attribs;
+        hit.customIndex = hr.customIndex;
+        hit.objectRayOrigin = hr.objectRayOrigin;
+        hit.objectRayDirection = hr.objectRayDirection;
+        hit.objectToWorldR0 = hr.objectToWorldR0;
+        hit.objectToWorldR1 = hr.objectToWorldR1;
+        hit.objectToWorldR2 = hr.objectToWorldR2;
+        runClosestHit(hr.hitGroupIndex, rd, hit, &payload);
+    }
+}
+`;
+
 // ── WGSL library: rayQuery API for non-megakernel compute shaders ────
 //
 // Mirrors GL_EXT_ray_query semantics that 3DForts's physics shaders use
@@ -2564,6 +2921,40 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle,
 // textures, samplers). Pass (0, 0) for a pipeline with no user bindings.
 const rtPipelines = new Map(); // handle → { pipeline, bgls, byGroup, sortedGroups }
 
+// Per-payload byte budget in wfPayload (rounded up; user Payload structs
+// must fit). Sponza's Payload is 48 B; 64 leaves headroom while keeping
+// 2·W·H·64 B ≈ 265 MB at 1080p.
+const WF_PAYLOAD_BYTES = 64;
+// Dynamic-offset uniform ring: one WfParams slot per wavefront pass. 128
+// slots covers maxDepth up to ~42 (1 + 3·maxDepth + 1 passes).
+const WF_PARAM_SLOTS   = 128;
+const WF_FIXED_TLAS_NPADDED = 16384;   // matches lbvhBuildWgsl N_PADDED
+
+function ensureWavefrontBuffers(W, H) {
+    const cap = W * H;
+    rtState.wf = rtState.wf || { cap: 0 };
+    const wf = rtState.wf;
+    if (wf.cap === cap && wf.raysA) return wf;
+    for (const b of [wf.raysA, wf.raysB, wf.hits, wf.accum, wf.counters,
+                     wf.payload, wf.indirect]) { if (b) b.destroy(); }
+    const S = GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST;
+    wf.raysA    = device.createBuffer({ size: cap * 64,                 usage: S, label: "wf-raysA" });
+    wf.raysB    = device.createBuffer({ size: cap * 64,                 usage: S, label: "wf-raysB" });
+    wf.hits     = device.createBuffer({ size: cap * 112,                usage: S, label: "wf-hits" });
+    wf.accum    = device.createBuffer({ size: cap * 16,                 usage: S, label: "wf-accum" });
+    wf.payload  = device.createBuffer({ size: 2 * cap * WF_PAYLOAD_BYTES, usage: S, label: "wf-payload" });
+    wf.counters = device.createBuffer({ size: 64,
+        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC, label: "wf-counters" });
+    wf.indirect = device.createBuffer({ size: 16,
+        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.INDIRECT | GPUBufferUsage.COPY_DST, label: "wf-indirect" });
+    if (!wf.paramsRing) {
+        wf.paramsRing = device.createBuffer({ size: WF_PARAM_SLOTS * 256,
+            usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST, label: "wf-params" });
+    }
+    wf.cap = cap;
+    return wf;
+}
+
 env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
     if (!rtState.vertHeap) rtInit();
     const userPart = new TextDecoder().decode(memU8().subarray(wgslPtr, wgslPtr + wgslLen));
@@ -2577,16 +2968,23 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
         beforeHelpers = userPart.substring(0, mi);
         afterHelpers  = userPart.substring(mi + marker.length);
     }
-    const fullWgsl = rtWgslPrelude + "\n" + beforeHelpers + "\n" + rtWgslPureHelpers + "\n" + rtWgslMegakernelHelpers + "\n" + afterHelpers;
+    // Wavefront assembly: types + bindings | user CH/miss/resolve + wfPayload
+    // + switches (beforeHelpers) | pure helpers | wavefront helpers | user
+    // raygen + the five @compute entry points (afterHelpers).
+    const fullWgsl = rtWgslTypes + rtWgslWavefrontBindings + "\n"
+        + beforeHelpers + "\n" + rtWgslPureHelpers + "\n"
+        + rtWgslWavefrontHelpers + "\n" + afterHelpers;
 
-    // Parse user bindings (same wire format as wgpuLoadCustomShader).
+    // Parse user bindings (same wire format as wgpuLoadCustomShader). For
+    // the wavefront RT pipeline, group 0 = WfParams, group 1 = data heaps,
+    // group 2 = indirect args — so user bindings must start at group 3.
     const userBindings = [];
     if (bindingsCount > 0) {
         const dv = new DataView(memU8().buffer, bindingsPtr, bindingsCount * 8);
         for (let i = 0; i < bindingsCount; i++) {
             const g = dv.getUint8(i*8 + 0);
-            if (g < 2) {
-                console.error(`[crafter-wgpu] RT pipeline: @group(${g}) reserved; user bindings need group >= 2`);
+            if (g < 3) {
+                console.error(`[crafter-wgpu] RT pipeline: @group(${g}) reserved; user bindings need group >= 3`);
                 return 0;
             }
             userBindings.push({
@@ -2604,33 +3002,28 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
     }
     const sortedGroups = [...byGroup.keys()].sort((a, b) => a - b);
 
-    let pipeline;
     try {
-        const mod = device.createShaderModule({ code: fullWgsl, label: "rt-megakernel" });
-        // RTDispatchHeader is 16 bytes; bind exactly that.
-        const headerBgl = device.createBindGroupLayout({ entries: [
+        const mod = device.createShaderModule({ code: fullWgsl, label: "rt-wavefront" });
+        const paramsBgl = device.createBindGroupLayout({ entries: [
             { binding: 0, visibility: GPUShaderStage.COMPUTE,
-              buffer: { type: "uniform", minBindingSize: 16 } },
+              buffer: { type: "uniform", hasDynamicOffset: true, minBindingSize: 32 } },
         ]});
+        const sb = (b) => ({ binding: b, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } });
+        const rw = (b) => ({ binding: b, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } });
         const dataBgl = device.createBindGroupLayout({ entries: [
-            { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 5, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            sb(0), sb(1), sb(2), sb(3), sb(4), sb(5),
             { binding: 6, visibility: GPUShaderStage.COMPUTE,
               storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } },
-            { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
-            { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            sb(7), sb(8), sb(9),
+            rw(10), rw(11), rw(12), rw(13), rw(14), rw(15),
         ]});
-        // User binding-group layouts. WebGPU pipeline layouts need a
-        // contiguous array up to the highest group used, so pad any gaps
-        // with empty bgls (same rule as wgpuLoadCustomShader).
+        const indirectBgl = device.createBindGroupLayout({ entries: [ rw(0) ]});
+        const emptyBgl    = device.createBindGroupLayout({ entries: [] });
+
+        // User binding-group layouts for groups 3..highest (pad gaps).
         const userBgls = [];
-        const highest = sortedGroups.length ? sortedGroups[sortedGroups.length - 1] : 1;
-        for (let g = 2; g <= highest; g++) {
+        const highest = sortedGroups.length ? sortedGroups[sortedGroups.length - 1] : 2;
+        for (let g = 3; g <= highest; g++) {
             if (byGroup.has(g)) {
                 const entries = byGroup.get(g).map(b => {
                     const e = { binding: b.binding, visibility: GPUShaderStage.COMPUTE };
@@ -2638,6 +3031,7 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
                     else if (b.kind === 1) e.texture = { sampleType: "float", viewDimension: "2d" };
                     else if (b.kind === 2) e.sampler = { type: "filtering" };
                     else if (b.kind === 3) e.texture = { sampleType: "float", viewDimension: "2d-array" };
+                    else if (b.kind === 4) e.buffer  = { type: "storage" };
                     return e;
                 });
                 userBgls.push(device.createBindGroupLayout({ entries }));
@@ -2645,12 +3039,29 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
                 userBgls.push(device.createBindGroupLayout({ entries: [] }));
             }
         }
-        pipeline = device.createComputePipeline({
-            layout: device.createPipelineLayout({ bindGroupLayouts: [headerBgl, dataBgl, ...userBgls] }),
-            compute: { module: mod, entryPoint: "main" },
-        });
+        // GENERATE / SHADE / RESOLVE may touch user bindings (raygen camera,
+        // closesthit albedo, resolve params) → params + data + empty(group2)
+        // + user. PREP → params + data + indirect. TRACE → params + data.
+        const userLayout = device.createPipelineLayout({
+            bindGroupLayouts: [paramsBgl, dataBgl, emptyBgl, ...userBgls] });
+        const prepLayout = device.createPipelineLayout({
+            bindGroupLayouts: [paramsBgl, dataBgl, indirectBgl] });
+        const traceLayout = device.createPipelineLayout({
+            bindGroupLayouts: [paramsBgl, dataBgl] });
+        const mk = (layout, ep) => device.createComputePipeline({
+            layout, compute: { module: mod, entryPoint: ep } });
+
+        const entry = {
+            genPipe:     mk(userLayout,  "wfGenerate"),
+            prepPipe:    mk(prepLayout,  "wfPrep"),
+            tracePipe:   mk(traceLayout, "wfTrace"),
+            shadePipe:   mk(userLayout,  "wfShade"),
+            resolvePipe: mk(userLayout,  "wfResolve"),
+            paramsBgl, dataBgl, indirectBgl, emptyBgl, userBgls,
+            byGroup, sortedGroups,
+        };
         const handle = newHandle();
-        rtPipelines.set(handle, { pipeline, headerBgl, dataBgl, userBgls, byGroup, sortedGroups });
+        rtPipelines.set(handle, entry);
         return handle;
     } catch (e) {
         console.error("[crafter-wgpu] RT pipeline compile failed:", e);
@@ -2659,35 +3070,83 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
     }
 };
 
+// Build the user @group(3+) bind groups for a pass, returning a list of
+// { group, bindGroup } to set. Shared by GENERATE / SHADE / RESOLVE.
+function wfUserBindGroups(pipe, handlesPtr, handlesCount) {
+    const out = [];
+    if (handlesCount <= 0) return out;
+    const handles = new Uint32Array(memU8().buffer, handlesPtr, handlesCount);
+    let handleIdx = 0;
+    let bglIdx = 0;
+    const highest = pipe.sortedGroups.length ? pipe.sortedGroups[pipe.sortedGroups.length - 1] : 2;
+    for (let g = 3; g <= highest; g++) {
+        if (pipe.byGroup.has(g)) {
+            const entries = pipe.byGroup.get(g).map(b => {
+                const h = handles[handleIdx++];
+                let resource;
+                if      (b.kind === 0) resource = { buffer: buffers.get(h) };
+                else if (b.kind === 1) resource = textureViews.get(h);
+                else if (b.kind === 2) resource = samplers.get(h);
+                else if (b.kind === 3) resource = textureViews.get(h);
+                else if (b.kind === 4) resource = { buffer: buffers.get(h) };
+                return { binding: b.binding, resource };
+            });
+            out.push({ group: g, bindGroup: device.createBindGroup({ layout: pipe.userBgls[bglIdx], entries }) });
+        }
+        bglIdx++;
+    }
+    return out;
+}
+
 env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
                       tlasBufHandle, instanceCount, gx, gy,
-                      handlesPtr, handlesCount) => {
-    if (!state.pass) return;
+                      handlesPtr, handlesCount, maxDepth) => {
+    if (!state.encoder) return;
     const pipe = rtPipelines.get(pipelineHandle);
     const tlas = buffers.get(tlasBufHandle);
     if (!pipe || !tlas) {
         console.error("[crafter-wgpu] wgpuDispatchRT: unknown pipeline or tlas");
         return;
     }
-    // Write RT header from push data (first 16 bytes). Surface dims + instance count + flags.
-    const hdr32 = new Uint32Array(4);
-    hdr32[0] = state.width;
-    hdr32[1] = state.height;
-    hdr32[2] = instanceCount;
-    hdr32[3] = 0;
-    queue.writeBuffer(rtState.rtHeader, 0, hdr32);
-
-    const headerBg = device.createBindGroup({
-        layout: pipe.headerBgl,
-        entries: [{ binding: 0, resource: { buffer: rtState.rtHeader, offset: 0, size: 16 } }],
-    });
-    const outView = state.outIsPing ? state.pingView : state.pongView;
     const entryOrderBuf = buffers.get(rtState.currentEntryOrder);
     const bvhBuf        = buffers.get(rtState.currentBvh);
     if (!entryOrderBuf || !bvhBuf) {
-        console.error("[crafter-wgpu] wgpuDispatchRT: missing entryOrder/bins (no TLAS built yet?)");
+        console.error("[crafter-wgpu] wgpuDispatchRT: missing entryOrder/bvh (no TLAS built yet?)");
         return;
     }
+    const W = state.width, H = state.height;
+    const cap = W * H;
+    const depth = Math.max(1, maxDepth | 0);
+    const wf = ensureWavefrontBuffers(W, H);
+
+    // ── Per-pass WfParams ring. queue.writeBuffer lands before submit, so
+    // we can't mutate the uniform between passes — instead we pre-write one
+    // slot per pass and bind it with a dynamic offset. Slot order:
+    //   0            GENERATE
+    //   1+3*d .. +2  PREP / TRACE / SHADE for bounce d
+    //   1+3*depth    RESOLVE
+    const passCount = 2 + 3 * depth;
+    const ring = new Uint32Array(WF_PARAM_SLOTS * 64);  // 256 B = 64 u32 per slot
+    const writeSlot = (slot, curIsA, bounce) => {
+        const o = slot * 64;
+        ring[o + 0] = W; ring[o + 1] = H; ring[o + 2] = cap; ring[o + 3] = curIsA;
+        ring[o + 4] = bounce; ring[o + 5] = depth; ring[o + 6] = WF_FIXED_TLAS_NPADDED; ring[o + 7] = 0;
+    };
+    writeSlot(0, 1, 0);                              // GENERATE
+    for (let d = 0; d < depth; d++) {
+        const curIsA = (d % 2 === 0) ? 1 : 0;
+        writeSlot(1 + 3 * d + 0, curIsA, d);         // PREP
+        writeSlot(1 + 3 * d + 1, curIsA, d);         // TRACE
+        writeSlot(1 + 3 * d + 2, curIsA, d);         // SHADE
+    }
+    writeSlot(1 + 3 * depth, 1, depth);              // RESOLVE
+    queue.writeBuffer(wf.paramsRing, 0, ring, 0, passCount * 64);
+
+    const outView = state.outIsPing ? state.pingView : state.pongView;
+    const paramsBg = device.createBindGroup({
+        layout: pipe.paramsBgl,
+        entries: [{ binding: 0, resource: { buffer: wf.paramsRing, offset: 0, size: 256 } }],
+    });
     const dataBg = device.createBindGroup({
         layout: pipe.dataBgl,
         entries: [
@@ -2701,41 +3160,91 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
             { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } },
             { binding: 8, resource: { buffer: entryOrderBuf } },
             { binding: 9, resource: { buffer: bvhBuf } },
+            { binding: 10, resource: { buffer: wf.raysA } },
+            { binding: 11, resource: { buffer: wf.raysB } },
+            { binding: 12, resource: { buffer: wf.hits } },
+            { binding: 13, resource: { buffer: wf.accum } },
+            { binding: 14, resource: { buffer: wf.counters } },
+            { binding: 15, resource: { buffer: wf.payload } },
         ],
     });
+    const indirectBg = device.createBindGroup({
+        layout: pipe.indirectBgl,
+        entries: [{ binding: 0, resource: { buffer: wf.indirect } }],
+    });
+    const userBgs = wfUserBindGroups(pipe, handlesPtr, handlesCount);
 
-    state.pass.setPipeline(pipe.pipeline);
-    state.pass.setBindGroup(0, headerBg);
-    state.pass.setBindGroup(1, dataBg);
+    // Close the frame's shared compute pass; the wavefront runs as its own
+    // sequence of passes on the same encoder (implicit barrier between each
+    // makes PREP's atomic writes visible to TRACE, etc.), then we reopen it.
+    if (state.pass) { state.pass.end(); state.pass = null; }
+    const enc = state.encoder;
+    const tileX = gx, tileY = gy;
+    const slotOff = (slot) => slot * 256;
 
-    // User bindings: walk byGroup in the same sorted order the C++ side
-    // packed handles[], picking up indices linearly.
-    if (handlesCount > 0) {
-        const handles = new Uint32Array(memU8().buffer, handlesPtr, handlesCount);
-        let handleIdx = 0;
-        let bglIdx = 0;
-        for (let g = 2; g <= (pipe.sortedGroups[pipe.sortedGroups.length - 1] || 1); g++) {
-            if (pipe.byGroup.has(g)) {
-                const entries = pipe.byGroup.get(g).map(b => {
-                    const h = handles[handleIdx++];
-                    let resource;
-                    if      (b.kind === 0) resource = { buffer: buffers.get(h) };
-                    else if (b.kind === 1) resource = textureViews.get(h);
-                    else if (b.kind === 2) resource = samplers.get(h);
-                    else if (b.kind === 3) resource = textureViews.get(h);
-                    return { binding: b.binding, resource };
-                });
-                const bg = device.createBindGroup({
-                    layout: pipe.userBgls[bglIdx],
-                    entries,
-                });
-                state.pass.setBindGroup(g, bg);
-            }
-            bglIdx++;
+    // Zero the two emit counters before GENERATE.
+    enc.clearBuffer(wf.counters, 0, 64);
+
+    const setUser = (pass) => { for (const u of userBgs) pass.setBindGroup(u.group, u.bindGroup); };
+
+    // GENERATE
+    {
+        const p = enc.beginComputePass({ label: "wf-generate" });
+        p.setPipeline(pipe.genPipe);
+        p.setBindGroup(0, paramsBg, [slotOff(0)]);
+        p.setBindGroup(1, dataBg);
+        setUser(p);
+        p.dispatchWorkgroups(tileX, tileY, 1);
+        p.end();
+    }
+    for (let d = 0; d < depth; d++) {
+        const prepSlot  = 1 + 3 * d + 0;
+        const traceSlot = 1 + 3 * d + 1;
+        const shadeSlot = 1 + 3 * d + 2;
+        // PREP — publish indirect args, zero next counter.
+        {
+            const p = enc.beginComputePass({ label: "wf-prep" });
+            p.setPipeline(pipe.prepPipe);
+            p.setBindGroup(0, paramsBg, [slotOff(prepSlot)]);
+            p.setBindGroup(1, dataBg);
+            p.setBindGroup(2, indirectBg);
+            p.dispatchWorkgroups(1, 1, 1);
+            p.end();
+        }
+        // TRACE — indirect over the live ray list.
+        {
+            const p = enc.beginComputePass({ label: "wf-trace" });
+            p.setPipeline(pipe.tracePipe);
+            p.setBindGroup(0, paramsBg, [slotOff(traceSlot)]);
+            p.setBindGroup(1, dataBg);
+            p.dispatchWorkgroupsIndirect(wf.indirect, 0);
+            p.end();
+        }
+        // SHADE — indirect; runs user closesthit/miss, may emit + accumulate.
+        {
+            const p = enc.beginComputePass({ label: "wf-shade" });
+            p.setPipeline(pipe.shadePipe);
+            p.setBindGroup(0, paramsBg, [slotOff(shadeSlot)]);
+            p.setBindGroup(1, dataBg);
+            setUser(p);
+            p.dispatchWorkgroupsIndirect(wf.indirect, 0);
+            p.end();
         }
     }
+    // RESOLVE — tonemap accum → output image.
+    {
+        const p = enc.beginComputePass({ label: "wf-resolve" });
+        p.setPipeline(pipe.resolvePipe);
+        p.setBindGroup(0, paramsBg, [slotOff(1 + 3 * depth)]);
+        p.setBindGroup(1, dataBg);
+        setUser(p);
+        p.dispatchWorkgroups(tileX, tileY, 1);
+        p.end();
+    }
 
-    state.pass.dispatchWorkgroups(gx, gy, 1);
+    // Reopen the frame's shared pass so wgpuFrameEnd / later UI work as
+    // before, and flip ping-pong so the blit picks the texture RESOLVE wrote.
+    state.pass = enc.beginComputePass();
     state.outIsPing = !state.outIsPing;
 };
 
diff --git a/examples/VulkanTriangle/closesthit.wgsl b/examples/VulkanTriangle/closesthit.wgsl
index e0d29ae..91758ba 100644
--- a/examples/VulkanTriangle/closesthit.wgsl
+++ b/examples/VulkanTriangle/closesthit.wgsl
@@ -1,6 +1,9 @@
-// WebGPU port of closesthit.glsl. Library concatenates this BEFORE the
-// library helpers, so `Payload` declared here is visible to traceRay,
-// runClosestHit, the mega-switch, and the user's raygen source.
+// Payload declared here so the WGSL assembler sees it before the wfPayload
+// binding, the SHADE dispatch, and the raygen source.
+//
+// Wavefront model: closesthit_main runs in SHADE and accumulates the
+// pixel's color directly (rtAccumulate) instead of writing a payload that
+// raygen reads back.
 
 struct Payload {
     color: vec3<f32>,
@@ -8,5 +11,5 @@ struct Payload {
 
 fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr<function, Payload>) {
     let bary = vec3<f32>(1.0 - hit.attribs.x - hit.attribs.y, hit.attribs.x, hit.attribs.y);
-    (*payload).color = bary;
+    rtAccumulate(bary);
 }
diff --git a/examples/VulkanTriangle/miss.wgsl b/examples/VulkanTriangle/miss.wgsl
index a0ba944..39ee6b6 100644
--- a/examples/VulkanTriangle/miss.wgsl
+++ b/examples/VulkanTriangle/miss.wgsl
@@ -1,5 +1,6 @@
-// WebGPU port of miss.glsl.
+// Wavefront miss: runs in SHADE for rays that hit nothing. Accumulate the
+// white background directly.
 
 fn miss_main(ray: RayDesc, payload: ptr<function, Payload>) {
-    (*payload).color = vec3<f32>(1.0, 1.0, 1.0);
+    rtAccumulate(vec3<f32>(1.0, 1.0, 1.0));
 }
diff --git a/examples/VulkanTriangle/raygen.wgsl b/examples/VulkanTriangle/raygen.wgsl
index cdf66fd..3014e8a 100644
--- a/examples/VulkanTriangle/raygen.wgsl
+++ b/examples/VulkanTriangle/raygen.wgsl
@@ -1,11 +1,12 @@
-// WebGPU port of raygen.glsl. Mirrors the pinhole camera setup — the
-// Payload type is declared in closesthit.wgsl (concatenated earlier).
+// WebGPU wavefront raygen. Runs in GENERATE: compute the pinhole camera
+// ray and emit it as the pixel's primary ray. Shading happens later in
+// SHADE (closesthit/miss). The Payload type is declared in closesthit.wgsl.
 
 fn raygen_main(gid: vec3<u32>) {
-    if (gid.x >= hdr.surfaceW || gid.y >= hdr.surfaceH) { return; }
+    if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }
 
     let pixel = vec2<f32>(f32(gid.x), f32(gid.y));
-    let resolution = vec2<f32>(f32(hdr.surfaceW), f32(hdr.surfaceH));
+    let resolution = vec2<f32>(f32(wfParams.surfaceW), f32(wfParams.surfaceH));
     let uv  = (pixel + vec2<f32>(0.5)) / resolution;
     let ndc = uv * 2.0 - vec2<f32>(1.0);
 
@@ -23,17 +24,11 @@ fn raygen_main(gid: vec3<u32>) {
     var payload: Payload;
     payload.color = vec3<f32>(0.0);
 
-    traceRay(
-        0u,            // tlasIdx (unused)
-        0u,            // ray flags
-        0xFFu,         // cull mask
-        0u, 0u, 0u,    // sbtRecordOffset, sbtRecordStride, missIndex
+    rtEmitPrimaryRay(
         origin, 0.001,
         direction, 10000.0,
-        &payload,
-    );
-
-    textureStore(outImage,
-                 vec2<i32>(i32(gid.x), i32(gid.y)),
-                 vec4<f32>(payload.color, 1.0));
+        0u,            // ray flags
+        0xFFu,         // cull mask
+        0u, 0u,        // sbtRecordOffset, missIndex
+        payload);
 }
diff --git a/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp b/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp
index c4265da..8dd949d 100644
--- a/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp
+++ b/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp
@@ -78,13 +78,22 @@ void PipelineRTWebGPU::Init(WebGPUCommandEncoderRef                 /*cmd*/,
     // shaders by stage. Concatenating *all* non-raygen sources here lets
     // them declare shared helpers, `struct Payload`, etc., in any order.
 
-    wgsl += "// ── user closesthit / anyhit / miss sources ───────────────\n";
+    wgsl += "// ── user closesthit / anyhit / miss / resolve sources ─────\n";
     for (const auto& shader : sbt.shaders) {
         if (shader.stage == WebGPURTStage::Raygen) continue;
         wgsl += shader.source;
         wgsl += "\n";
     }
 
+    // ── Payload-typed wavefront storage binding ────────────────────────
+    //
+    // Emitted *after* the user sources so it can name the user's `Payload`
+    // type. Holds one Payload per in-flight ray slot across both ping/pong
+    // ray buffers (capacity = 2·W·H). SHADE loads ray.payloadSlot here;
+    // emit helpers (rtEmitPrimaryRay / rtEmitRay) store into it.
+    wgsl += "\n@group(1) @binding(15) var<storage, read_write> "
+            "wfPayload : array<Payload>;\n";
+
     // ── Section 2: mega-switch dispatchers ─────────────────────────────
     //
     // runClosestHit, runAnyHit, runMiss each dispatch on the per-hit /
@@ -141,6 +150,24 @@ void PipelineRTWebGPU::Init(WebGPUCommandEncoderRef                 /*cmd*/,
     wgsl += "    }\n";
     wgsl += "}\n";
 
+    // runResolve — RESOLVE-stage tonemap hook. The first registered
+    // Resolve shader wins; with none, identity passthrough (alpha forced
+    // to 1) so the wavefront output matches a megakernel that wrote raw
+    // colors.
+    std::string resolveEntryFn;
+    for (const auto& shader : sbt.shaders) {
+        if (shader.stage == WebGPURTStage::Resolve) { resolveEntryFn = shader.entryFn; break; }
+    }
+    wgsl += "\nfn runResolve(coord: vec2<u32>, hdr: vec4<f32>) -> vec4<f32> {\n";
+    if (!resolveEntryFn.empty()) {
+        wgsl += "    return ";
+        wgsl += resolveEntryFn;
+        wgsl += "(coord, hdr);\n";
+    } else {
+        wgsl += "    return vec4<f32>(hdr.rgb, 1.0);\n";
+    }
+    wgsl += "}\n";
+
     // Marker — JS-side prelude/post-amble searches for this token to know
     // where the library helpers (traverseBlas/traverseTlas/traceRay) get
     // injected, followed by raygen sources and the @compute entry point.
@@ -173,17 +200,55 @@ void PipelineRTWebGPU::Init(WebGPUCommandEncoderRef                 /*cmd*/,
         return;
     }
 
-    // ── Section 4: @compute entry point ────────────────────────────────
+    // ── Section 4: wavefront @compute entry points ─────────────────────
     //
-    // 8x8 tile workgroup matching the rest of the WebGPU backend.
+    // Five kernels share this one module; createComputePipeline selects
+    // each by entryPoint name. GENERATE/RESOLVE are 8x8 screen tiles;
+    // TRACE/SHADE are 64-wide 1-D over the compacted ray list (dispatched
+    // indirectly from PREP); PREP is a single thread. The library helper
+    // bodies (_rtwTraverseTlas, rtEmit*, rtAccumulate, _wfCurCount, …) are
+    // injected JS-side at the marker above.
 
+    // GENERATE — one thread per pixel; clears the pixel's accumulator and
+    // runs the user raygen, which calls rtEmitPrimaryRay.
     wgsl += "\n@compute @workgroup_size(8, 8, 1)\n";
-    wgsl += "fn main(@builtin(global_invocation_id) gid: vec3<u32>) {\n";
+    wgsl += "fn wfGenerate(@builtin(global_invocation_id) gid: vec3<u32>) {\n";
+    wgsl += "    if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }\n";
+    wgsl += "    let pixel = gid.y * wfParams.surfaceW + gid.x;\n";
+    wgsl += "    wfAccum[pixel] = vec4<f32>(0.0, 0.0, 0.0, 0.0);\n";
+    wgsl += "    _wfPixel = pixel;\n";
     wgsl += "    ";
     wgsl += raygenEntryFn;
     wgsl += "(gid);\n";
     wgsl += "}\n";
 
+    // PREP — single thread; reads the live ray count and publishes the
+    // indirect dispatch args for the upcoming TRACE/SHADE, then zeroes the
+    // next buffer's emit counter so SHADE starts compacting from 0.
+    wgsl += "\n@compute @workgroup_size(1)\n";
+    wgsl += "fn wfPrep() { _wfPrep(); }\n";
+
+    // TRACE — zero user code: pure traversal + intersection. One thread
+    // per live ray; writes a HitResult into wfHits[i].
+    wgsl += "\n@compute @workgroup_size(64)\n";
+    wgsl += "fn wfTrace(@builtin(global_invocation_id) gid: vec3<u32>) { _wfTrace(gid.x); }\n";
+
+    // SHADE — one thread per live ray; loads the ray + its hit + payload,
+    // dispatches to runMiss / runClosestHit, which may rtAccumulate and
+    // rtEmitRay continuation/shadow rays into the next buffer.
+    wgsl += "\n@compute @workgroup_size(64)\n";
+    wgsl += "fn wfShade(@builtin(global_invocation_id) gid: vec3<u32>) { _wfShade(gid.x); }\n";
+
+    // RESOLVE — one thread per pixel; runs the user resolve (or identity)
+    // over the linear accumulator and stores to the output image.
+    wgsl += "\n@compute @workgroup_size(8, 8, 1)\n";
+    wgsl += "fn wfResolve(@builtin(global_invocation_id) gid: vec3<u32>) {\n";
+    wgsl += "    if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }\n";
+    wgsl += "    let pixel = gid.y * wfParams.surfaceW + gid.x;\n";
+    wgsl += "    let outc = runResolve(gid.xy, wfAccum[pixel]);\n";
+    wgsl += "    textureStore(outImage, vec2<i32>(i32(gid.x), i32(gid.y)), outc);\n";
+    wgsl += "}\n";
+
     pipelineHandle = WebGPU::wgpuLoadRTPipeline(
         wgsl.data(),
         static_cast<std::int32_t>(wgsl.size()),
diff --git a/interfaces/Crafter.Graphics-RTPass.cppm b/interfaces/Crafter.Graphics-RTPass.cppm
index 1397a68..0da7088 100644
--- a/interfaces/Crafter.Graphics-RTPass.cppm
+++ b/interfaces/Crafter.Graphics-RTPass.cppm
@@ -72,6 +72,11 @@ export namespace Crafter {
         // 0 means "no user bindings".
         const void*       handlesPtr   = nullptr;
         std::uint32_t     handlesCount = 0;
+        // Wavefront bounce budget: number of (TRACE; SHADE) iterations.
+        // 1 = primary rays only; 2 = primary + one continuation/shadow
+        // bounce; etc. The library unrolls GENERATE; (PREP; TRACE; SHADE)
+        // ×maxDepth; RESOLVE.
+        std::uint32_t     maxDepth     = 1;
 
         RTPass(PipelineRTWebGPU* p) : pipeline(p) {}
 
@@ -88,7 +93,8 @@ export namespace Crafter {
                 static_cast<std::int32_t>(gx),
                 static_cast<std::int32_t>(gy),
                 handlesPtr,
-                static_cast<std::int32_t>(handlesCount));
+                static_cast<std::int32_t>(handlesCount),
+                static_cast<std::int32_t>(maxDepth));
         }
     };
 }
diff --git a/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm b/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm
index c175a52..73c2285 100644
--- a/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm
+++ b/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm
@@ -18,6 +18,11 @@ export namespace Crafter {
         Miss       = 1,
         ClosestHit = 2,
         AnyHit     = 3,
+        // Wavefront RESOLVE-stage tonemap/output hook. Optional: if no
+        // Resolve shader is registered, RESOLVE writes the linear accum
+        // buffer through unchanged. Signature:
+        //   fn <entryFn>(coord: vec2<u32>, hdr: vec4<f32>) -> vec4<f32>
+        Resolve    = 4,
     };
 
     // One WGSL shader source + the function name PipelineRTWebGPU should
diff --git a/interfaces/Crafter.Graphics-WebGPU.cppm b/interfaces/Crafter.Graphics-WebGPU.cppm
index dc695fe..9089bbd 100644
--- a/interfaces/Crafter.Graphics-WebGPU.cppm
+++ b/interfaces/Crafter.Graphics-WebGPU.cppm
@@ -201,7 +201,8 @@ namespace Crafter::WebGPU {
                                    std::uint32_t tlasBufHandle,
                                    std::int32_t  instanceCount,
                                    std::int32_t  gx, std::int32_t gy,
-                                   const void* handlesPtr, std::int32_t handlesCount);
+                                   const void* handlesPtr, std::int32_t handlesCount,
+                                   std::int32_t maxDepth);
 
     // GPU TLAS-build dispatch. Two sequential compute passes:
     //   1. tlasBuildMain — per-instance world AABB + identity permutation

From f4d6493d91b4488894cc3ba3a7a9c728965e7a5d Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 16:28:38 +0000
Subject: [PATCH 2/8] wip: uncommitted changes from claude run on issue #3

---
 examples/RTStress/closesthit.wgsl |  54 ++++++++
 examples/RTStress/main.cpp        | 200 ++++++++++++++++++++++++++++++
 examples/RTStress/miss.wgsl       |  11 ++
 examples/RTStress/project.cpp     |  46 +++++++
 examples/RTStress/raygen.wgsl     |  35 ++++++
 examples/RTStress/resolve.wgsl    |   7 ++
 6 files changed, 353 insertions(+)
 create mode 100644 examples/RTStress/closesthit.wgsl
 create mode 100644 examples/RTStress/main.cpp
 create mode 100644 examples/RTStress/miss.wgsl
 create mode 100644 examples/RTStress/project.cpp
 create mode 100644 examples/RTStress/raygen.wgsl
 create mode 100644 examples/RTStress/resolve.wgsl

diff --git a/examples/RTStress/closesthit.wgsl b/examples/RTStress/closesthit.wgsl
new file mode 100644
index 0000000..c80f5bb
--- /dev/null
+++ b/examples/RTStress/closesthit.wgsl
@@ -0,0 +1,54 @@
+// RTStress closest-hit (runs in SHADE). Computes flat-shaded Lambert from
+// the hit triangle's geometric normal, accumulates ambient, and — if the
+// surface faces the sun — emits a shadow ray toward the sun. The shadow
+// ray's miss (sun visible) adds the direct term; its hit (occluded) adds
+// nothing because RT_FLAG_SKIP_CLOSEST_HIT suppresses closesthit on hit.
+//
+// Payload declared here so the assembler sees it before wfPayload / SHADE.
+struct Payload {
+    color:     vec3<f32>,   // shadow ray: pending direct contribution
+    shadowRay: u32,         // 0 primary, 1 shadow
+};
+
+const SUN_DIR_TO_LIGHT: vec3<f32> = vec3<f32>(0.40, 0.85, 0.35);
+const SUN_COLOR:        vec3<f32> = vec3<f32>(1.15, 1.05, 0.90);
+const AMBIENT_COLOR:    vec3<f32> = vec3<f32>(0.12, 0.13, 0.18);
+
+// Cheap per-instance albedo so the grid reads as distinct cubes (and any
+// TLAS flicker as instance count scales is obvious).
+fn instanceAlbedo(i: u32) -> vec3<f32> {
+    let h = i * 2654435761u;
+    return vec3<f32>(
+        0.35 + 0.6 * f32((h >>  0u) & 255u) / 255.0,
+        0.35 + 0.6 * f32((h >>  8u) & 255u) / 255.0,
+        0.35 + 0.6 * f32((h >> 16u) & 255u) / 255.0);
+}
+
+fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr<function, Payload>) {
+    let meshRec = meshRecords[tlasEntries[hit.instanceId].blasMeshIdx];
+    let verts = _rtFetchTri(meshRec, hit.primitiveId);
+    let nObj = normalize(cross(verts[1] - verts[0], verts[2] - verts[0]));
+    let nWorld = normalize(vec3<f32>(
+        dot(hit.objectToWorldR0.xyz, nObj),
+        dot(hit.objectToWorldR1.xyz, nObj),
+        dot(hit.objectToWorldR2.xyz, nObj)));
+
+    let albedo   = instanceAlbedo(hit.customIndex);
+    let worldPos = ray.origin + ray.direction * hit.t;
+    let viewDir  = -ray.direction;
+    let nFacing  = select(-nWorld, nWorld, dot(nWorld, viewDir) > 0.0);
+    let sunDir   = normalize(SUN_DIR_TO_LIGHT);
+    let nDotL    = max(0.0, dot(nFacing, sunDir));
+
+    rtAccumulate(albedo * AMBIENT_COLOR);
+
+    if (nDotL > 0.0) {
+        var sp: Payload;
+        sp.color     = albedo * SUN_COLOR * nDotL;
+        sp.shadowRay = 1u;
+        let shadowOrigin = worldPos + nFacing * 0.05;
+        rtEmitRay(shadowOrigin, 0.01, sunDir, 100000.0,
+                  RT_FLAG_SKIP_CLOSEST_HIT | RT_FLAG_TERMINATE_ON_FIRST_HIT,
+                  0xFFu, 0u, 0u, sp);
+    }
+}
diff --git a/examples/RTStress/main.cpp b/examples/RTStress/main.cpp
new file mode 100644
index 0000000..d9581a2
--- /dev/null
+++ b/examples/RTStress/main.cpp
@@ -0,0 +1,200 @@
+// RTStress — the standing many-instance wavefront RT benchmark. An
+// N×N×N grid of a small cube mesh (one BLAS, many TLAS instances), shaded
+// with primary + shadow rays through the wavefront pipeline. The grid edge
+// `kGrid` is the instance-count knob: 8 → 512, 16 → 4096, 20 → 8000
+// (LBVH_MAX = 16384). Frame time is printed to the console each second so
+// fps-vs-instance-count can be read off without external tooling; the JS
+// bridge additionally prints a GPU timestamp-query per-pass breakdown.
+//
+// WebGPU/DOM only — the wavefront tracer is the WebGPU software RT path.
+
+#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
+int main() { return 0; }   // native path is hardware RT; out of scope here
+#else
+
+import Crafter.Graphics;
+import Crafter.Math;
+import Crafter.Event;
+import std;
+
+using namespace Crafter;
+namespace fs = std::filesystem;
+
+namespace {
+    // Instance-count knob. instances = kGrid³. Bump to 16 (4096) or 20
+    // (8000) to stress the TLAS; the LBVH build caps at 16384.
+    constexpr int   kGrid    = 8;
+    constexpr float kSpacing = 2.5f;
+    constexpr float kHalf    = 0.5f;   // cube half-extent
+
+    struct CameraGPU {
+        float origin[3];   float pad0;
+        float right[3];    float tanHalf;
+        float up[3];       float aspect;
+        float forward[3];  float pad1;
+    };
+    static_assert(sizeof(CameraGPU) == 64);
+}
+
+int main() {
+    const int instanceCount = kGrid * kGrid * kGrid;
+    std::println("[RTStress] grid {}^3 = {} instances", kGrid, instanceCount);
+
+    Device::Initialize();
+    static Window window(1280, 720, "RTStress");
+    auto cmd = window.StartInit();
+
+    DescriptorHeapWebGPU heap;
+    heap.Initialize(/*images*/ 1, /*buffers*/ 2, /*samplers*/ 1);
+
+    std::array<WebGPUShader, 4> shaders {{
+        WebGPUShader(fs::path("raygen.wgsl"),     "raygen_main",     WebGPURTStage::Raygen),
+        WebGPUShader(fs::path("miss.wgsl"),       "miss_main",       WebGPURTStage::Miss),
+        WebGPUShader(fs::path("closesthit.wgsl"), "closesthit_main", WebGPURTStage::ClosestHit),
+        WebGPUShader(fs::path("resolve.wgsl"),    "resolve_main",    WebGPURTStage::Resolve),
+    }};
+    ShaderBindingTableWebGPU sbt;
+    sbt.Init(shaders);
+
+    std::array<RTShaderGroup, 1> raygenGroups {{ { .type = RTShaderGroupType::General,           .generalShader = 0 } }};
+    std::array<RTShaderGroup, 1> missGroups   {{ { .type = RTShaderGroupType::General,           .generalShader = 1 } }};
+    std::array<RTShaderGroup, 1> hitGroups    {{ { .type = RTShaderGroupType::TrianglesHitGroup, .closestHitShader = 2 } }};
+
+    // One user binding: the camera storage buffer at @group(3).
+    std::array<UICustomBinding, 1> bindings {{
+        { .group = 3, .binding = 0, .kind = UICustomBindingKind::Buffer, ._pad = 0, .pushOffset = 0 },
+    }};
+
+    PipelineRTWebGPU pipeline;
+    pipeline.Init(cmd, raygenGroups, missGroups, hitGroups, sbt, bindings);
+
+    // ── Unit cube mesh (8 verts, 12 tris). ────────────────────────────
+    static std::array<Vector<float, 3, 3>, 8> verts {{
+        {-kHalf, -kHalf, -kHalf}, { kHalf, -kHalf, -kHalf},
+        { kHalf,  kHalf, -kHalf}, {-kHalf,  kHalf, -kHalf},
+        {-kHalf, -kHalf,  kHalf}, { kHalf, -kHalf,  kHalf},
+        { kHalf,  kHalf,  kHalf}, {-kHalf,  kHalf,  kHalf},
+    }};
+    static std::array<std::uint32_t, 36> indices {{
+        0,1,2, 0,2,3,   5,4,7, 5,7,6,   4,0,3, 4,3,7,
+        1,5,6, 1,6,2,   4,5,1, 4,1,0,   3,2,6, 3,6,7,
+    }};
+    static Mesh cube;
+    cube.Build(verts, indices, cmd);
+
+    // ── Camera buffer + handle array. ─────────────────────────────────
+    WebGPUBuffer<CameraGPU, true> cameraBuf;
+    cameraBuf.Create(1);
+    static std::array<std::uint32_t, 1> userHandles { cameraBuf.handle };
+
+    // ── Instance grid. Reserve so RenderingElement3D::Add pointers stay
+    //    valid across vector growth. ─────────────────────────────────────
+    static std::vector<RenderingElement3D> renderers;
+    renderers.reserve(static_cast<std::size_t>(instanceCount));
+    const float origin0 = -0.5f * static_cast<float>(kGrid - 1) * kSpacing;
+    for (int x = 0; x < kGrid; ++x)
+    for (int y = 0; y < kGrid; ++y)
+    for (int z = 0; z < kGrid; ++z) {
+        renderers.emplace_back();
+        RenderingElement3D& r = renderers.back();
+        auto& tx = r.instance.transform.matrix;
+        tx[0][0] = 1; tx[0][1] = 0; tx[0][2] = 0; tx[0][3] = origin0 + float(x) * kSpacing;
+        tx[1][0] = 0; tx[1][1] = 1; tx[1][2] = 0; tx[1][3] = origin0 + float(y) * kSpacing;
+        tx[2][0] = 0; tx[2][1] = 0; tx[2][2] = 1; tx[2][3] = origin0 + float(z) * kSpacing;
+        r.instance.instanceCustomIndex                    = static_cast<std::uint32_t>(renderers.size() - 1);
+        r.instance.mask                                   = 0xFF;
+        r.instance.instanceShaderBindingTableRecordOffset = 0;
+        r.instance.flags                                  = kRTGeometryInstanceForceOpaque;
+        r.instance.accelerationStructureReference         = cube.blasAddr;
+        RenderingElement3D::Add(&r);
+    }
+    RenderingElement3D::BuildTLAS(cmd, 0);
+
+    window.descriptorHeap = &heap;
+    window.FinishInit();
+
+    RTPass rtPass(&pipeline);
+    rtPass.handlesPtr   = userHandles.data();
+    rtPass.handlesCount = static_cast<std::uint32_t>(userHandles.size());
+    rtPass.maxDepth     = 2;   // primary + shadow
+    window.passes.push_back(&rtPass);
+
+    // ── Free camera framing the grid from a corner. ───────────────────
+    const float ext = float(kGrid - 1) * kSpacing;
+    struct CamState {
+        Vector<float, 3, 4> position;
+        float yaw;
+        float pitch;
+    } cam {
+        Vector<float, 3, 4>{ ext * 1.4f, ext * 1.0f, ext * 1.4f },
+        0.0f, 0.0f,
+    };
+    {
+        // Aim at the grid centre (origin).
+        Vector<float, 3, 4> d { -cam.position.x, -cam.position.y, -cam.position.z };
+        const float len = std::sqrt(d.x*d.x + d.y*d.y + d.z*d.z);
+        cam.yaw   = std::atan2(d.z, d.x);
+        cam.pitch = std::asin(d.y / len);
+    }
+
+    Input::Map inputMap;
+    Input::Action& moveAct = inputMap.AddAction("Move", Input::ActionType::Vector2);
+    Input::Action& lookAct = inputMap.AddAction("Look", Input::ActionType::Vector2);
+    moveAct.bindings = { Input::WASDBind{
+        Key(CrafterKeys::W), Key(CrafterKeys::S), Key(CrafterKeys::A), Key(CrafterKeys::D) } };
+    lookAct.bindings = { Input::MouseDeltaBind{ 1.0f } };
+    inputMap.Attach(window);
+
+    const float kMoveSpeed = ext * 0.8f;
+    const float kLookSens  = 0.05f;
+    const float kDt        = 1.0f / 60.0f;
+
+    static int   frames = 0;
+    static double tAccum = 0.0;
+    EventListener<void> camTick(&window.onBeforeUpdate, [&]() {
+        inputMap.Tick();
+        cam.yaw   += lookAct.vector2.x * kLookSens;
+        cam.pitch -= lookAct.vector2.y * kLookSens;
+        cam.pitch  = std::clamp(cam.pitch, -1.55f, 1.55f);
+
+        const float cp = std::cos(cam.pitch), sp = std::sin(cam.pitch);
+        const float cy = std::cos(cam.yaw),   sy = std::sin(cam.yaw);
+        Vector<float, 3, 4> forward { cp * cy, sp, cp * sy };
+        Vector<float, 3, 4> worldUp { 0.0f, 1.0f, 0.0f };
+        Vector<float, 3, 4> right { forward.y*worldUp.z - forward.z*worldUp.y,
+                                    forward.z*worldUp.x - forward.x*worldUp.z,
+                                    forward.x*worldUp.y - forward.y*worldUp.x };
+        const float rLen = std::sqrt(right.x*right.x + right.y*right.y + right.z*right.z);
+        right.x /= rLen; right.y /= rLen; right.z /= rLen;
+        Vector<float, 3, 4> up { right.y*forward.z - right.z*forward.y,
+                                 right.z*forward.x - right.x*forward.z,
+                                 right.x*forward.y - right.y*forward.x };
+
+        const float dx = moveAct.vector2.x * kMoveSpeed * kDt;
+        const float dy = moveAct.vector2.y * kMoveSpeed * kDt;
+        cam.position.x += right.x*dx + forward.x*dy;
+        cam.position.y += right.y*dx + forward.y*dy;
+        cam.position.z += right.z*dx + forward.z*dy;
+
+        CameraGPU& g = cameraBuf.value[0];
+        g.origin[0]=cam.position.x; g.origin[1]=cam.position.y; g.origin[2]=cam.position.z; g.pad0=0;
+        g.right[0]=right.x;   g.right[1]=right.y;   g.right[2]=right.z;
+        g.up[0]=up.x;         g.up[1]=up.y;         g.up[2]=up.z;
+        g.forward[0]=forward.x; g.forward[1]=forward.y; g.forward[2]=forward.z;
+        g.aspect  = float(window.width) / float(window.height);
+        g.tanHalf = std::tan(70.0f * 3.14159265f / 360.0f);
+        g.pad1    = 0;
+        cameraBuf.FlushDevice();
+
+        if (++frames >= 60) {
+            std::println("[RTStress] {} instances @ ~{} frames since last report", instanceCount, frames);
+            frames = 0;
+        }
+    });
+
+    window.Render();
+    window.StartUpdate();
+    window.StartSync();
+    return 0;
+}
+#endif
diff --git a/examples/RTStress/miss.wgsl b/examples/RTStress/miss.wgsl
new file mode 100644
index 0000000..d23a0bc
--- /dev/null
+++ b/examples/RTStress/miss.wgsl
@@ -0,0 +1,11 @@
+// RTStress miss (runs in SHADE). Primary miss → sky gradient. Shadow miss
+// → the sun is unoccluded, so add the pending direct contribution.
+fn miss_main(ray: RayDesc, payload: ptr<function, Payload>) {
+    if ((*payload).shadowRay == 1u) {
+        rtAccumulate((*payload).color);
+        return;
+    }
+    let t = clamp(ray.direction.y * 0.5 + 0.5, 0.0, 1.0);
+    rtAccumulate(mix(vec3<f32>(0.50, 0.62, 0.85),
+                     vec3<f32>(0.90, 0.94, 1.00), t));
+}
diff --git a/examples/RTStress/project.cpp b/examples/RTStress/project.cpp
new file mode 100644
index 0000000..b1e4f03
--- /dev/null
+++ b/examples/RTStress/project.cpp
@@ -0,0 +1,46 @@
+import std;
+import Crafter.Build;
+namespace fs = std::filesystem;
+using namespace Crafter;
+
+extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> args) {
+    bool isWasm = false;
+    for (std::string_view a : args) {
+        if (a.starts_with("--target=") && a.find("wasm") != std::string_view::npos) {
+            isWasm = true;
+            break;
+        }
+    }
+
+    std::vector<std::string> graphicsArgs(args.begin(), args.end());
+    Configuration* graphics = LocalProject({
+        .projectFile = "../../project.cpp",
+        .args = graphicsArgs,
+    });
+
+    Configuration cfg;
+    cfg.path = "./";
+    cfg.name = "RTStress";
+    cfg.outputName = "RTStress";
+    cfg.type = ConfigurationType::Executable;
+    if (isWasm) {
+        cfg.target = "wasm32-wasip1";
+        cfg.defines.push_back({"CRAFTER_GRAPHICS_WINDOW_DOM", ""});
+        cfg.compileFlags.push_back("-msimd128");
+    }
+    ApplyStandardArgs(cfg, args);
+    cfg.dependencies = { graphics };
+
+    std::array<fs::path, 0> ifaces = {};
+    std::array<fs::path, 1> impls = { "main" };
+    cfg.GetInterfacesAndImplementations(ifaces, impls);
+
+    if (isWasm) {
+        cfg.files.emplace_back(fs::path("raygen.wgsl"));
+        cfg.files.emplace_back(fs::path("closesthit.wgsl"));
+        cfg.files.emplace_back(fs::path("miss.wgsl"));
+        cfg.files.emplace_back(fs::path("resolve.wgsl"));
+        EnableWasiBrowserRuntime(cfg);
+    }
+    return cfg;
+}
diff --git a/examples/RTStress/raygen.wgsl b/examples/RTStress/raygen.wgsl
new file mode 100644
index 0000000..def54fc
--- /dev/null
+++ b/examples/RTStress/raygen.wgsl
@@ -0,0 +1,35 @@
+// RTStress raygen (runs in GENERATE). Host-driven pinhole camera at
+// @group(3) (groups 0..2 are reserved by the wavefront pipeline:
+// 0 = WfParams, 1 = data heaps, 2 = indirect args).
+struct Camera {
+    origin:  vec3<f32>,
+    pad0:    f32,
+    right:   vec3<f32>,
+    tanHalf: f32,
+    up:      vec3<f32>,
+    aspect:  f32,
+    forward: vec3<f32>,
+    pad1:    f32,
+};
+@group(3) @binding(0) var<storage, read> camera : Camera;
+
+fn raygen_main(gid: vec3<u32>) {
+    if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }
+
+    let pixelf = vec2<f32>(f32(gid.x), f32(gid.y));
+    let res    = vec2<f32>(f32(wfParams.surfaceW), f32(wfParams.surfaceH));
+    let uv     = (pixelf + vec2<f32>(0.5)) / res;
+    let ndc    = uv * 2.0 - vec2<f32>(1.0);
+
+    let direction = normalize(
+        camera.right   * (ndc.x  * camera.aspect * camera.tanHalf) +
+        camera.up      * (-ndc.y * camera.tanHalf) +
+        camera.forward);
+
+    var p: Payload;
+    p.color     = vec3<f32>(0.0);
+    p.shadowRay = 0u;
+
+    rtEmitPrimaryRay(camera.origin, 0.01, direction, 100000.0,
+                     0u, 0xFFu, 0u, 0u, p);
+}
diff --git a/examples/RTStress/resolve.wgsl b/examples/RTStress/resolve.wgsl
new file mode 100644
index 0000000..7950c96
--- /dev/null
+++ b/examples/RTStress/resolve.wgsl
@@ -0,0 +1,7 @@
+// RTStress RESOLVE-stage tonemap: Reinhard + gamma 2.2 over the linear
+// accumulator. Registered as a WebGPURTStage::Resolve shader.
+fn resolve_main(coord: vec2<u32>, hdr: vec4<f32>) -> vec4<f32> {
+    let mapped = hdr.rgb / (hdr.rgb + vec3<f32>(1.0));
+    let g      = pow(mapped, vec3<f32>(1.0 / 2.2));
+    return vec4<f32>(g, 1.0);
+}

From 1d2e12dbc9623c7b7d6da373185a4cdaa116f5a3 Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 20:08:39 +0000
Subject: [PATCH 3/8] WebGPU RT: GPU timestamp-query per-pass harness

Request the timestamp-query feature; write begin/end timestamps around
each wavefront pass via timestampWrites; resolve + read back (deferred to
after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512
instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 additional/dom-webgpu.js | 103 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 97 insertions(+), 6 deletions(-)

diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js
index 9380517..fcadaba 100644
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
@@ -146,7 +146,10 @@ clamp("maxComputeWorkgroupSizeX", 1024);
 clamp("maxBufferSize", 1 << 30);
 clamp("maxStorageBufferBindingSize", 1 << 30);
 clamp("maxComputeWorkgroupsPerDimension", 65535);
-const device  = await adapter.requestDevice({ requiredLimits });
+// Per-pass GPU timing for the wavefront tracer (RTStress HUD / PR numbers).
+const tsSupported = adapter.features && adapter.features.has("timestamp-query");
+const requiredFeatures = tsSupported ? ["timestamp-query"] : [];
+const device  = await adapter.requestDevice({ requiredLimits, requiredFeatures });
 const queue   = device.queue;
 const ctx     = canvas.getContext("webgpu");
 const canvasFormat = "rgba8unorm"; // match storage textures, skip swizzle blit
@@ -935,6 +938,23 @@ env.wgpuFrameEnd = () => {
     queue.submit([state.encoder.finish()]);
     state.encoder = null;
 
+    // Map the wavefront timestamp readback (its resolve/copy was encoded on
+    // the just-submitted encoder) and log a per-pass breakdown ~1×/sec.
+    if (state.tsReadPending) {
+        const ts = state.tsReadPending;
+        state.tsReadPending = null;
+        const n = ts.pendingLabels.length;
+        ts.readBuf.mapAsync(GPUMapMode.READ, 0, 2 * n * 8).then(() => {
+            const data = new BigInt64Array(ts.readBuf.getMappedRange(0, 2 * n * 8).slice(0));
+            ts.readBuf.unmap();
+            ts.inFlight = false;
+            wfLogTimestamps(ts, data);
+        }).catch((e) => {
+            ts.inFlight = false;
+            console.error("[crafter-wgpu] timestamp readback failed:", e);
+        });
+    }
+
     // Kick off mapAsync for the readbacks whose copyBufferToBuffer we
     // piggy-backed onto the just-submitted encoder. Doing this after
     // submit ensures the map waits for that submission's GPU work to
@@ -2955,6 +2975,51 @@ function ensureWavefrontBuffers(W, H) {
     return wf;
 }
 
+// ── GPU timestamp-query harness ──────────────────────────────────────────
+//
+// One QuerySet with 2 slots per wavefront pass; each beginComputePass writes
+// begin/end timestamps. After the passes we resolve into a buffer and read
+// it back (deferred to after submit, like the readback path). Deltas are
+// summed per pass label and printed ~1×/sec as a per-pass breakdown.
+const WF_TS_MAX_PASSES = 64;   // covers maxDepth up to ~20
+function wfEnsureTimestamps() {
+    if (!tsSupported) return null;
+    if (rtState.ts) return rtState.ts;
+    const cap = 2 * WF_TS_MAX_PASSES;
+    rtState.ts = {
+        capacity:  cap,
+        querySet:  device.createQuerySet({ type: "timestamp", count: cap }),
+        resolveBuf: device.createBuffer({ size: cap * 8,
+            usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC }),
+        readBuf:   device.createBuffer({ size: cap * 8,
+            usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST }),
+        inFlight:  false,
+        lastLog:   0,
+        pendingLabels: null,
+    };
+    return rtState.ts;
+}
+function wfLogTimestamps(ts, data) {
+    // data: BigInt64Array of ns timestamps, [begin0,end0,begin1,end1,...].
+    const now = Date.now();
+    if (now - ts.lastLog < 1000) return;   // throttle to ~1/sec
+    ts.lastLog = now;
+    const labels = ts.pendingLabels;
+    if (!labels) return;
+    const sums = new Map();   // label → ns
+    let totalNs = 0;
+    for (let i = 0; i < labels.length; i++) {
+        const dt = Number(data[2*i + 1] - data[2*i + 0]);
+        if (dt < 0) continue;
+        sums.set(labels[i], (sums.get(labels[i]) || 0) + dt);
+        totalNs += dt;
+    }
+    const order = ["GENERATE", "PREP", "TRACE", "SHADE", "RESOLVE"];
+    const parts = order.filter(k => sums.has(k))
+                       .map(k => `${k} ${(sums.get(k)/1000).toFixed(1)}us`);
+    console.log(`[crafter-wgpu] RT passes: ${parts.join(" | ")} | total ${(totalNs/1000).toFixed(1)}us`);
+}
+
 env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
     if (!rtState.vertHeap) rtInit();
     const userPart = new TextDecoder().decode(memU8().subarray(wgslPtr, wgslPtr + wgslLen));
@@ -3187,9 +3252,27 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
 
     const setUser = (pass) => { for (const u of userBgs) pass.setBindGroup(u.group, u.bindGroup); };
 
+    // GPU timing: write begin/end timestamps around each pass (2 query
+    // slots per pass), then resolve + read back after submit.
+    const ts = wfEnsureTimestamps();
+    const capture = !!(ts && !ts.inFlight);
+    const tsLabels = [];
+    const beginPass = (label, tsName) => {
+        const desc = { label };
+        if (capture && tsLabels.length < WF_TS_MAX_PASSES) {
+            desc.timestampWrites = {
+                querySet: ts.querySet,
+                beginningOfPassWriteIndex: 2 * tsLabels.length,
+                endOfPassWriteIndex:       2 * tsLabels.length + 1,
+            };
+            tsLabels.push(tsName);
+        }
+        return enc.beginComputePass(desc);
+    };
+
     // GENERATE
     {
-        const p = enc.beginComputePass({ label: "wf-generate" });
+        const p = beginPass("wf-generate", "GENERATE");
         p.setPipeline(pipe.genPipe);
         p.setBindGroup(0, paramsBg, [slotOff(0)]);
         p.setBindGroup(1, dataBg);
@@ -3203,7 +3286,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
         const shadeSlot = 1 + 3 * d + 2;
         // PREP — publish indirect args, zero next counter.
         {
-            const p = enc.beginComputePass({ label: "wf-prep" });
+            const p = beginPass("wf-prep", "PREP");
             p.setPipeline(pipe.prepPipe);
             p.setBindGroup(0, paramsBg, [slotOff(prepSlot)]);
             p.setBindGroup(1, dataBg);
@@ -3213,7 +3296,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
         }
         // TRACE — indirect over the live ray list.
         {
-            const p = enc.beginComputePass({ label: "wf-trace" });
+            const p = beginPass("wf-trace", "TRACE");
             p.setPipeline(pipe.tracePipe);
             p.setBindGroup(0, paramsBg, [slotOff(traceSlot)]);
             p.setBindGroup(1, dataBg);
@@ -3222,7 +3305,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
         }
         // SHADE — indirect; runs user closesthit/miss, may emit + accumulate.
         {
-            const p = enc.beginComputePass({ label: "wf-shade" });
+            const p = beginPass("wf-shade", "SHADE");
             p.setPipeline(pipe.shadePipe);
             p.setBindGroup(0, paramsBg, [slotOff(shadeSlot)]);
             p.setBindGroup(1, dataBg);
@@ -3233,7 +3316,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
     }
     // RESOLVE — tonemap accum → output image.
     {
-        const p = enc.beginComputePass({ label: "wf-resolve" });
+        const p = beginPass("wf-resolve", "RESOLVE");
         p.setPipeline(pipe.resolvePipe);
         p.setBindGroup(0, paramsBg, [slotOff(1 + 3 * depth)]);
         p.setBindGroup(1, dataBg);
@@ -3242,6 +3325,14 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
         p.end();
     }
 
+    if (capture && tsLabels.length > 0) {
+        enc.resolveQuerySet(ts.querySet, 0, 2 * tsLabels.length, ts.resolveBuf, 0);
+        enc.copyBufferToBuffer(ts.resolveBuf, 0, ts.readBuf, 0, 2 * tsLabels.length * 8);
+        ts.inFlight = true;
+        ts.pendingLabels = tsLabels;
+        state.tsReadPending = ts;
+    }
+
     // Reopen the frame's shared pass so wgpuFrameEnd / later UI work as
     // before, and flip ping-pong so the blit picks the texture RESOLVE wrote.
     state.pass = enc.beginComputePass();

From 376e66aeedfffbcb5e81425cd5fd3b5a759695e7 Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 20:16:04 +0000
Subject: [PATCH 4/8] WebGPU RT: port Sponza to wavefront (shadow ray in SHADE)

Restructure Sponza for the wavefront model: raygen emits the primary ray;
closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and
emits a shadow ray carrying the pending direct term; miss adds the sky
(primary) or the direct term (shadow miss). resolve.wgsl applies the same
Reinhard+gamma the megakernel raygen did inline. User bindings moved to
group 3 (groups 0..2 reserved). RTPass maxDepth=2.

Renders the atrium correctly through the wavefront pipeline (textures,
two-sided shading, sun+ambient, shadows, tonemap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 examples/Sponza/closesthit.wgsl | 73 +++++++++++++-------------
 examples/Sponza/main.cpp        | 20 +++++---
 examples/Sponza/miss.wgsl       | 12 ++---
 examples/Sponza/project.cpp     |  1 +
 examples/Sponza/raygen.wgsl     | 91 ++++-----------------------------
 examples/Sponza/resolve.wgsl    |  7 +++
 6 files changed, 70 insertions(+), 134 deletions(-)
 create mode 100644 examples/Sponza/resolve.wgsl

diff --git a/examples/Sponza/closesthit.wgsl b/examples/Sponza/closesthit.wgsl
index 9fa15a8..293f8ba 100644
--- a/examples/Sponza/closesthit.wgsl
+++ b/examples/Sponza/closesthit.wgsl
@@ -1,35 +1,25 @@
-// Payload declared here so the WGSL assembler sees it before raygen
-// (the assembler concatenates closesthit/anyhit/miss BEFORE raygen).
+// Sponza closest-hit (runs in SHADE). In the wavefront model the lighting
+// + shadow trace that used to live in raygen happens here: gather surface
+// data, accumulate ambient, and emit a shadow ray toward the sun carrying
+// the pending direct contribution. The shadow ray's miss adds that
+// contribution (sun visible); its hit adds nothing (occluded), since
+// RT_FLAG_SKIP_CLOSEST_HIT suppresses closesthit on the shadow ray.
 //
-// WGSL forbids cycles in the function call graph, so closesthit_main
-// CAN'T call traceRay (that would create closesthit → traceRay →
-// runClosestHit → closesthit). The lighting + shadow trace therefore
-// happens in raygen; closesthit's job is just to gather surface data
-// into the payload.
-//
-//   shadowRay = 0 (primary): closesthit fills albedo/worldPos/normal/hit.
-//   shadowRay = 1 (shadow):  closesthit is skipped (RT_FLAG_SKIP_CLOSEST_HIT),
-//                            miss flips color to white = "lit".
+// Payload declared here so the assembler sees it before wfPayload / SHADE.
 struct Payload {
-    color:       vec3<f32>,
-    shadowRay:   u32,
-    worldPos:    vec3<f32>,
-    hit:         u32,
-    worldNormal: vec3<f32>,
-    _pad:        f32,
+    color:     vec3<f32>,   // shadow ray: pending albedo·sun·nDotL
+    shadowRay: u32,         // 0 primary, 1 shadow
 };
 
-// User-bound resources at group(2). Matches the UICustomBinding span the
-// host hands to PipelineRTWebGPU::Init.
-//   binding 0 — albedo texture_2d_array, one layer per Sponza material
-//   binding 1 — sampler (linear clamp)
-//   binding 2 — camera storage buffer (read by raygen only)
-@group(2) @binding(0) var albedos : texture_2d_array<f32>;
-@group(2) @binding(1) var samp    : sampler;
+// User resources at @group(3) (0..2 are the wavefront pipeline's reserved
+// groups). binding 0 albedo array, 1 sampler, 2 camera (raygen only).
+@group(3) @binding(0) var albedos : texture_2d_array<f32>;
+@group(3) @binding(1) var samp    : sampler;
+
+const SUN_DIR_TO_LIGHT: vec3<f32> = vec3<f32>(-0.35,  1.00, -0.20);
+const SUN_COLOR:        vec3<f32> = vec3<f32>( 1.10,  1.00,  0.85);
+const AMBIENT_COLOR:    vec3<f32> = vec3<f32>( 0.18,  0.20,  0.28);
 
-// VertexNormalTangentUVPacked is `packed` on the outer struct but each
-// inner `Vector<float, N, 4>` is SIMD-aligned to a 16-byte stride. So
-// each vertex is 12 u32 words: normal at 0..2, tangent at 4..6, uv at 8..9.
 const ATTRIB_STRIDE_U32:    u32 = 12u;
 const ATTRIB_NORMAL_OFFSET: u32 = 0u;
 const ATTRIB_UV_OFFSET:     u32 = 8u;
@@ -52,7 +42,6 @@ fn fetchNormal(meshRec: MeshRecord, vertexIdx: u32) -> vec3<f32> {
 }
 
 fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr<function, Payload>) {
-    // Resolve hit triangle → 3 vertex indices.
     let meshIdx = tlasEntries[hit.instanceId].blasMeshIdx;
     let meshRec = meshRecords[meshIdx];
     let baseIdx = meshRec.indexOffset + hit.primitiveId * 3u;
@@ -61,19 +50,14 @@ fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr<function, Payload>)
     let i2 = indices[baseIdx + 2u];
     let bary = vec3<f32>(1.0 - hit.attribs.x - hit.attribs.y, hit.attribs.x, hit.attribs.y);
 
-    // Albedo via barycentric UV interpolation.
     let uv0 = fetchUV(meshRec, i0);
     let uv1 = fetchUV(meshRec, i1);
     let uv2 = fetchUV(meshRec, i2);
     let uv  = uv0 * bary.x + uv1 * bary.y + uv2 * bary.z;
-    // OBJ V is bottom-up; sampler is top-down. fract for manual tiling.
     let uvTiled = vec2<f32>(fract(uv.x), fract(1.0 - uv.y));
     let layer   = i32(hit.customIndex);
     let albedo  = textureSampleLevel(albedos, samp, uvTiled, layer, 0.0).rgb;
 
-    // World-space smooth shading normal. Multiply through the
-    // object-to-world rotation so this stays correct if a future scene
-    // rotates instances (Sponza itself is all identities).
     let n0 = fetchNormal(meshRec, i0);
     let n1 = fetchNormal(meshRec, i1);
     let n2 = fetchNormal(meshRec, i2);
@@ -83,8 +67,23 @@ fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr<function, Payload>)
         dot(hit.objectToWorldR1.xyz, nObj),
         dot(hit.objectToWorldR2.xyz, nObj)));
 
-    (*payload).color       = albedo;
-    (*payload).worldPos    = ray.origin + ray.direction * hit.t;
-    (*payload).worldNormal = nWorld;
-    (*payload).hit         = 1u;
+    // Two-sided: flip the normal toward the camera (Sponza curtains have
+    // inconsistent winding).
+    let nFacing = select(-nWorld, nWorld, dot(nWorld, ray.direction) < 0.0);
+    let lightDir = normalize(SUN_DIR_TO_LIGHT);
+    let nDotL    = max(0.0, dot(nFacing, lightDir));
+    let worldPos = ray.origin + ray.direction * hit.t;
+
+    // Ambient is unconditional; direct light is gated behind the shadow ray.
+    rtAccumulate(albedo * AMBIENT_COLOR);
+
+    if (nDotL > 0.0) {
+        let shadowOrigin = worldPos + nFacing * 0.5;
+        var sp: Payload;
+        sp.color     = albedo * SUN_COLOR * nDotL;
+        sp.shadowRay = 1u;
+        rtEmitRay(shadowOrigin, 0.001, lightDir, 10000.0,
+                  RT_FLAG_SKIP_CLOSEST_HIT | RT_FLAG_TERMINATE_ON_FIRST_HIT,
+                  0xFFu, 0u, 0u, sp);
+    }
 }
diff --git a/examples/Sponza/main.cpp b/examples/Sponza/main.cpp
index 73db99e..1df3a2c 100644
--- a/examples/Sponza/main.cpp
+++ b/examples/Sponza/main.cpp
@@ -253,10 +253,11 @@ int main() {
     DescriptorHeapWebGPU heap;
     heap.Initialize(/*images*/ 2, /*buffers*/ 2, /*samplers*/ 2);
 
-    std::array<WebGPUShader, 3> shaders {{
+    std::array<WebGPUShader, 4> shaders {{
         WebGPUShader(fs::path("raygen.wgsl"),     "raygen_main",     WebGPURTStage::Raygen),
         WebGPUShader(fs::path("miss.wgsl"),       "miss_main",       WebGPURTStage::Miss),
         WebGPUShader(fs::path("closesthit.wgsl"), "closesthit_main", WebGPURTStage::ClosestHit),
+        WebGPUShader(fs::path("resolve.wgsl"),    "resolve_main",    WebGPURTStage::Resolve),
     }};
     ShaderBindingTableWebGPU sbt;
     sbt.Init(shaders);
@@ -271,14 +272,15 @@ int main() {
         { .type = RTShaderGroupType::TrianglesHitGroup, .closestHitShader = 2 },
     }};
 
-    // Three user bindings at @group(2):
+    // Three user bindings at @group(3) (the wavefront pipeline reserves
+    // groups 0..2 for WfParams / data heaps / indirect args):
     //   binding 0 — albedo texture_2d_array (one layer per material)
     //   binding 1 — sampler (linear clamp)
     //   binding 2 — Camera storage buffer (host-driven, updated per frame)
     std::array<UICustomBinding, 3> bindings {{
-        { .group = 2, .binding = 0, .kind = UICustomBindingKind::SampledTextureArray, ._pad = 0, .pushOffset = 0 },
-        { .group = 2, .binding = 1, .kind = UICustomBindingKind::Sampler,             ._pad = 0, .pushOffset = 0 },
-        { .group = 2, .binding = 2, .kind = UICustomBindingKind::Buffer,              ._pad = 0, .pushOffset = 0 },
+        { .group = 3, .binding = 0, .kind = UICustomBindingKind::SampledTextureArray, ._pad = 0, .pushOffset = 0 },
+        { .group = 3, .binding = 1, .kind = UICustomBindingKind::Sampler,             ._pad = 0, .pushOffset = 0 },
+        { .group = 3, .binding = 2, .kind = UICustomBindingKind::Buffer,              ._pad = 0, .pushOffset = 0 },
     }};
 
     PipelineRTWebGPU pipeline;
@@ -367,6 +369,7 @@ int main() {
     RTPass rtPass(&pipeline);
     rtPass.handlesPtr   = userHandles.data();
     rtPass.handlesCount = static_cast<std::uint32_t>(userHandles.size());
+    rtPass.maxDepth     = 2;   // primary + shadow
     window.passes.push_back(&rtPass);
 
     // ── Free camera: WASD + mouse-delta look ───────────────────────────
@@ -375,9 +378,10 @@ int main() {
     // height, looking +X down the long axis (bbox: X[-1921..1800],
     // Y[-126..1429], Z[-1183..1105]). The user can fine-tune from there.
     struct CamState {
-        Vector<float, 3, 4> position{ -1500.0f, 200.0f, 0.0f };
-        float yaw   = 0.0f;   // radians, around world +Y
-        float pitch = 0.0f;   // radians, +pitch looks up
+        // 3/4 view from a corner aimed at the atrium centre.
+        Vector<float, 3, 4> position{ -1400.0f, 700.0f, -600.0f };
+        float yaw   = 0.405f;   // radians, around world +Y
+        float pitch = -0.317f;  // radians, +pitch looks up
     } cam;
 
     Input::Map inputMap;
diff --git a/examples/Sponza/miss.wgsl b/examples/Sponza/miss.wgsl
index 39ff71d..0bbbf4d 100644
--- a/examples/Sponza/miss.wgsl
+++ b/examples/Sponza/miss.wgsl
@@ -1,16 +1,12 @@
+// Sponza miss (runs in SHADE). Primary miss → two-stop sky gradient.
+// Shadow miss → the sun is unoccluded, so add the pending direct term.
 fn miss_main(ray: RayDesc, payload: ptr<function, Payload>) {
     if ((*payload).shadowRay == 1u) {
-        // Shadow ray escaped to infinity — the sun is visible from the
-        // origin, so the surface there should pick up full direct light.
-        // raygen reads color.x as the visibility coefficient.
-        (*payload).color = vec3<f32>(1.0);
+        rtAccumulate((*payload).color);
         return;
     }
-
-    // Primary miss: cheap two-stop sky gradient. (*payload).hit stays 0
-    // so raygen knows to skip the lighting path and just use this color.
     let t = clamp(ray.direction.y * 0.5 + 0.5, 0.0, 1.0);
     let sky    = vec3<f32>(0.45, 0.65, 0.95);
     let zenith = vec3<f32>(0.95, 0.85, 0.65);
-    (*payload).color = mix(sky, zenith, t);
+    rtAccumulate(mix(sky, zenith, t));
 }
diff --git a/examples/Sponza/project.cpp b/examples/Sponza/project.cpp
index b850c90..400b0c2 100644
--- a/examples/Sponza/project.cpp
+++ b/examples/Sponza/project.cpp
@@ -82,6 +82,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
         cfg.files.emplace_back(fs::path("raygen.wgsl"));
         cfg.files.emplace_back(fs::path("closesthit.wgsl"));
         cfg.files.emplace_back(fs::path("miss.wgsl"));
+        cfg.files.emplace_back(fs::path("resolve.wgsl"));
         EnableWasiBrowserRuntime(cfg);
     } else {
         cfg.shaders.emplace_back(fs::path("raygen.glsl"),     std::string("main"), ShaderType::RayGen);
diff --git a/examples/Sponza/raygen.wgsl b/examples/Sponza/raygen.wgsl
index a90ca84..6575b34 100644
--- a/examples/Sponza/raygen.wgsl
+++ b/examples/Sponza/raygen.wgsl
@@ -1,12 +1,8 @@
-// WebGPU raygen. Camera state comes from the host every frame via a
-// storage buffer bound at @group(2) @binding(2); main.cpp drives that
-// from WASD + mouse-delta through Crafter::Input.
-//
-// The shading + shadow trace all happens here because WGSL forbids
-// recursive function call graphs — closesthit_main can't call traceRay
-// (that would loop closesthit → traceRay → runClosestHit → closesthit).
-// Raygen is the entry point and not called by anyone, so it can call
-// traceRay twice (once primary, once shadow) without forming a cycle.
+// Sponza raygen (runs in GENERATE). Emits the pixel's primary ray; all
+// shading + the shadow trace now happen in SHADE (closesthit/miss). Camera
+// state comes from the host each frame via a storage buffer at
+// @group(3) @binding(2) (groups 0..2 are reserved by the wavefront
+// pipeline). main.cpp drives it from WASD + mouse-delta.
 
 struct Camera {
     origin:  vec3<f32>,
@@ -18,92 +14,25 @@ struct Camera {
     forward: vec3<f32>,
     pad1:    f32,
 };
-@group(2) @binding(2) var<storage, read> camera : Camera;
-
-// Sun coming through Sponza's open roof. Y is up; this points "down and
-// slightly along +X" so the light grazes the colonnades on one side.
-const SUN_DIR_TO_LIGHT: vec3<f32> = vec3<f32>(-0.35,  1.00, -0.20);
-const SUN_COLOR:        vec3<f32> = vec3<f32>( 1.10,  1.00,  0.85);
-const AMBIENT_COLOR:    vec3<f32> = vec3<f32>( 0.18,  0.20,  0.28);
+@group(3) @binding(2) var<storage, read> camera : Camera;
 
 fn raygen_main(gid: vec3<u32>) {
-    if (gid.x >= hdr.surfaceW || gid.y >= hdr.surfaceH) { return; }
+    if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }
 
     let pixel      = vec2<f32>(f32(gid.x), f32(gid.y));
-    let resolution = vec2<f32>(f32(hdr.surfaceW), f32(hdr.surfaceH));
+    let resolution = vec2<f32>(f32(wfParams.surfaceW), f32(wfParams.surfaceH));
     let uv         = (pixel + vec2<f32>(0.5)) / resolution;
     let ndc        = uv * 2.0 - vec2<f32>(1.0);
 
-    // Pinhole camera reconstructed from the host basis. ndc.x runs left-
-    // to-right across the screen → +right; ndc.y is top-down so we
-    // negate before applying +up.
     let direction = normalize(
         camera.right   * (ndc.x  * camera.aspect * camera.tanHalf) +
         camera.up      * (-ndc.y * camera.tanHalf) +
         camera.forward);
 
-    // ── Primary ray ────────────────────────────────────────────────────
     var payload: Payload;
     payload.color     = vec3<f32>(0.0);
     payload.shadowRay = 0u;
-    payload.hit       = 0u;
 
-    traceRay(
-        0u, 0u, 0xFFu,
-        0u, 0u, 0u,
-        camera.origin, 0.001,
-        direction,     10000.0,
-        &payload);
-
-    var finalColor: vec3<f32>;
-    if (payload.hit == 1u) {
-        // Closesthit filled albedo/worldPos/worldNormal. Two-sided
-        // shading: flip the normal toward the camera if we hit the back
-        // face — Sponza's curtains in particular have inconsistent
-        // winding, and without this half the surface would go black.
-        let albedo = payload.color;
-        let nFacing = select(-payload.worldNormal,
-                              payload.worldNormal,
-                              dot(payload.worldNormal, direction) < 0.0);
-        let lightDir = normalize(SUN_DIR_TO_LIGHT);
-        let nDotL    = max(0.0, dot(nFacing, lightDir));
-
-        // ── Shadow ray ────────────────────────────────────────────────
-        // Only worth tracing if the surface faces the sun at all.
-        var visibility = 0.0;
-        if (nDotL > 0.0) {
-            // Normal-offset bias on Sponza's units (~3700 wide atrium)
-            // is hefty; 0.5 keeps the shadow ray clear of the originating
-            // triangle without producing visible "floating" shadows.
-            let shadowOrigin = payload.worldPos + nFacing * 0.5;
-
-            var shadowPayload: Payload;
-            shadowPayload.color     = vec3<f32>(0.0);  // default: blocked
-            shadowPayload.shadowRay = 1u;
-            shadowPayload.hit       = 0u;
-            traceRay(
-                0u,
-                RT_FLAG_SKIP_CLOSEST_HIT | RT_FLAG_TERMINATE_ON_FIRST_HIT,
-                0xFFu,
-                0u, 0u, 0u,
-                shadowOrigin, 0.001,
-                lightDir,     10000.0,
-                &shadowPayload);
-            visibility = shadowPayload.color.x;
-        }
-
-        let lit = AMBIENT_COLOR + SUN_COLOR * (nDotL * visibility);
-        finalColor = albedo * lit;
-    } else {
-        // Sky color was filled by miss_main.
-        finalColor = payload.color;
-    }
-
-    // Reinhard tonemap + gamma 2.2 so sun-lit albedos don't clip and
-    // shadow detail stays readable.
-    let mapped = finalColor / (finalColor + vec3<f32>(1.0));
-    let gamma  = pow(mapped, vec3<f32>(1.0 / 2.2));
-    textureStore(outImage,
-                 vec2<i32>(i32(gid.x), i32(gid.y)),
-                 vec4<f32>(gamma, 1.0));
+    rtEmitPrimaryRay(camera.origin, 0.001, direction, 10000.0,
+                     0u, 0xFFu, 0u, 0u, payload);
 }
diff --git a/examples/Sponza/resolve.wgsl b/examples/Sponza/resolve.wgsl
new file mode 100644
index 0000000..346659e
--- /dev/null
+++ b/examples/Sponza/resolve.wgsl
@@ -0,0 +1,7 @@
+// Sponza RESOLVE-stage tonemap: Reinhard + gamma 2.2 over the linear
+// accumulator — matches the tonemap the megakernel raygen applied inline.
+fn resolve_main(coord: vec2<u32>, hdr: vec4<f32>) -> vec4<f32> {
+    let mapped = hdr.rgb / (hdr.rgb + vec3<f32>(1.0));
+    let g      = pow(mapped, vec3<f32>(1.0 / 2.2));
+    return vec4<f32>(g, 1.0);
+}

From dd4122f2baf634ea9957343f99974376931c90be Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 20:21:44 +0000
Subject: [PATCH 5/8] WebGPU RT: ordered (nearest-child-first) traversal

Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and
_rtwTraverseTlas descend the nearer child first and push the farther only
when it hits, re-culling it against the (tightened) bestT when popped.
Render is identical (same closest hit) on VulkanTriangle, RTStress
(512/4096), and Sponza; cuts node visits on dense scenes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 additional/dom-webgpu.js | 59 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js
index fcadaba..00bf317 100644
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
@@ -1481,6 +1481,27 @@ fn _rtAabb(ro: vec3<f32>, invRd: vec3<f32>, mn: vec3<f32>, mx: vec3<f32>, tMax:
     return tExit >= max(tEnter, 0.0) && tEnter <= tMax;
 }
 
+// AABB test that also returns the (clamped) entry distance, for ordered
+// nearest-child-first traversal. t is meaningless when hit == false.
+struct _RtAabbHit { hit: bool, t: f32 };
+fn _rtAabbT(ro: vec3<f32>, invRd: vec3<f32>, mn: vec3<f32>, mx: vec3<f32>, tMax: f32) -> _RtAabbHit {
+    var r: _RtAabbHit;
+    r.hit = false;
+    r.t   = 0.0;
+    if (any(mn > mx)) { return r; }
+    let t0 = (mn - ro) * invRd;
+    let t1 = (mx - ro) * invRd;
+    let tmin = min(t0, t1);
+    let tmax = max(t0, t1);
+    let tEnter = max(max(tmin.x, tmin.y), tmin.z);
+    let tExit  = min(min(tmax.x, tmax.y), tmax.z);
+    if (tExit >= max(tEnter, 0.0) && tEnter <= tMax) {
+        r.hit = true;
+        r.t   = max(tEnter, 0.0);
+    }
+    return r;
+}
+
 struct _RtTriHit { hit: bool, t: f32, u: f32, v: f32 };
 fn _rtTri(ro: vec3<f32>, rd: vec3<f32>, p0: vec3<f32>, p1: vec3<f32>, p2: vec3<f32>,
           tMin: f32, tMax: f32) -> _RtTriHit {
@@ -1896,10 +1917,26 @@ fn _rtwTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord,
             if (sp == 0u) { break; }
             sp = sp - 1u; nodeRel = stack[sp]; continue;
         }
+        // Internal node: descend the nearer child first; push the farther
+        // only when it hits (and re-cull it against bestT when popped).
         let left  = node.firstChildOrPrim;
         let right = left + 1u;
-        if (sp < 32u) { stack[sp] = right; sp = sp + 1u; }
-        nodeRel = left;
+        let ln = bvhNodes[meshRec.bvhOffset + left];
+        let rn = bvhNodes[meshRec.bvhOffset + right];
+        let lr = _rtAabbT(rayObj.origin, invD, ln.aabbMin, ln.aabbMax, *bestT);
+        let rr = _rtAabbT(rayObj.origin, invD, rn.aabbMin, rn.aabbMax, *bestT);
+        if (lr.hit && rr.hit) {
+            if (sp + 1u < 32u) {
+                if (lr.t <= rr.t) { stack[sp] = right; sp = sp + 1u; stack[sp] = left;  sp = sp + 1u; }
+                else              { stack[sp] = left;  sp = sp + 1u; stack[sp] = right; sp = sp + 1u; }
+            }
+        } else if (lr.hit) {
+            if (sp < 32u) { stack[sp] = left;  sp = sp + 1u; }
+        } else if (rr.hit) {
+            if (sp < 32u) { stack[sp] = right; sp = sp + 1u; }
+        }
+        if (sp == 0u) { break; }
+        sp = sp - 1u; nodeRel = stack[sp];
     }
     return false;
 }
@@ -1960,11 +1997,23 @@ fn _rtwTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32,
             if (endSearch) { return true; }
             if ((*bestT) < pre && (effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
         } else {
+            // Internal node: nearest-child-first. Sentinel-padded leaves
+            // carry an inverted AABB so _rtAabbT rejects them for free.
             let left  = 2u * nodeIdx + 1u;
             let right = 2u * nodeIdx + 2u;
-            if (sp + 1u < 32u) {
-                stack[sp] = right; sp = sp + 1u;
-                stack[sp] = left;  sp = sp + 1u;
+            let ln = tlasBvhNodes[left];
+            let rn = tlasBvhNodes[right];
+            let lr = _rtAabbT(rayWorld.origin, invD, ln.aabbMin, ln.aabbMax, *bestT);
+            let rr = _rtAabbT(rayWorld.origin, invD, rn.aabbMin, rn.aabbMax, *bestT);
+            if (lr.hit && rr.hit) {
+                if (sp + 1u < 32u) {
+                    if (lr.t <= rr.t) { stack[sp] = right; sp = sp + 1u; stack[sp] = left;  sp = sp + 1u; }
+                    else              { stack[sp] = left;  sp = sp + 1u; stack[sp] = right; sp = sp + 1u; }
+                }
+            } else if (lr.hit) {
+                if (sp < 32u) { stack[sp] = left;  sp = sp + 1u; }
+            } else if (rr.hit) {
+                if (sp < 32u) { stack[sp] = right; sp = sp + 1u; }
             }
         }
     }

From 82e5e867d4fafb07b227f971460a7d832c484972 Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 20:24:04 +0000
Subject: [PATCH 6/8] WebGPU RT: remove dead megakernel WGSL (no dual path)

The RT pipeline now only builds the wavefront kernels, so the old
single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and
the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute
path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it).
RTStress still renders correctly with the trimmed prelude.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 additional/dom-webgpu.js | 199 ---------------------------------------
 1 file changed, 199 deletions(-)

diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js
index 00bf317..d3a9c72 100644
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
@@ -1439,7 +1439,6 @@ const TLAS_BVH_N_PADDED: u32 = 16384u;
 const TLAS_BVH_LEAVES_START: u32 = TLAS_BVH_N_PADDED - 1u;
 `;
 
-const rtWgslPrelude = rtWgslTypes + rtWgslMegakernelBindings;
 
 // ── WGSL library: helpers + traverseBlas + traverseTlas + traceRay ───
 // Injected after the user-supplied closesthit/anyhit/miss sources +
@@ -1526,204 +1525,6 @@ fn _rtTri(ro: vec3<f32>, rd: vec3<f32>, p0: vec3<f32>, p1: vec3<f32>, p2: vec3<f
 }
 `;
 
-// Megakernel-only helpers: traversal routines that invoke runAnyHit /
-// runClosestHit / runMiss (emitted by the megakernel SBT switch) and
-// `traceRay` that closes over them. Only the raygen-pipeline path
-// prepends this.
-const rtWgslMegakernelHelpers = String.raw`
-// Iterative stack-based BLAS traversal. Returns true if traversal was
-// terminated by an END_SEARCH from anyhit (caller should stop entirely).
-fn _rtTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord,
-                   instanceId: u32, hitGroupBase: u32,
-                   bestHit: ptr<function, HitInfo>,
-                   bestT:   ptr<function, f32>,
-                   payload: ptr<function, Payload>) -> bool {
-    let invD = vec3<f32>(1.0) / rayObj.direction;
-    var stack: array<u32, 32>;
-    var sp: u32 = 0u;
-    var nodeRel: u32 = 0u;
-
-    loop {
-        let abs = meshRec.bvhOffset + nodeRel;
-        let node = bvhNodes[abs];
-        if (!_rtAabb(rayObj.origin, invD, node.aabbMin, node.aabbMax, *bestT)) {
-            if (sp == 0u) { break; }
-            sp = sp - 1u; nodeRel = stack[sp]; continue;
-        }
-        if (node.primCount > 0u) {
-            for (var i: u32 = 0u; i < node.primCount; i = i + 1u) {
-                let triIndex = primRemap[meshRec.primRemapOffset + node.firstChildOrPrim + i];
-                let verts = _rtFetchTri(meshRec, triIndex);
-                let tr = _rtTri(rayObj.origin, rayObj.direction,
-                                verts[0], verts[1], verts[2],
-                                rayObj.tMin, *bestT);
-                if (!tr.hit) { continue; }
-
-                let geomNormal = cross(verts[1] - verts[0], verts[2] - verts[0]);
-                let facing = dot(geomNormal, rayObj.direction);
-                if ((flags & RT_FLAG_CULL_BACK_FACING_TRIANGLES) != 0u && facing > 0.0) { continue; }
-                if ((flags & RT_FLAG_CULL_FRONT_FACING_TRIANGLES) != 0u && facing < 0.0) { continue; }
-
-                var candidate: HitInfo;
-                candidate.t             = tr.t;
-                candidate.instanceId    = instanceId;
-                candidate.primitiveId   = triIndex;
-                candidate.hitGroupIndex = hitGroupBase;
-                candidate.attribs       = vec2<f32>(tr.u, tr.v);
-                candidate.objectRayOrigin    = rayObj.origin;
-                candidate.objectRayDirection = rayObj.direction;
-
-                let opaque = (flags & RT_FLAG_OPAQUE) != 0u
-                          || (flags & RT_FLAG_NO_OPAQUE) == 0u;  // default opaque
-
-                if (opaque) {
-                    *bestHit = candidate;
-                    *bestT   = tr.t;
-                    if ((flags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
-                } else {
-                    let r = runAnyHit(hitGroupBase, rayObj, candidate, payload);
-                    if (r == RT_ANYHIT_END_SEARCH) {
-                        *bestHit = candidate;
-                        *bestT   = tr.t;
-                        return true;
-                    }
-                    if (r == RT_ANYHIT_ACCEPT) {
-                        *bestHit = candidate;
-                        *bestT   = tr.t;
-                        if ((flags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
-                    }
-                }
-            }
-            if (sp == 0u) { break; }
-            sp = sp - 1u; nodeRel = stack[sp]; continue;
-        }
-        // inner node — push right, descend left
-        let left  = node.firstChildOrPrim;
-        let right = left + 1u;
-        if (sp < 32u) { stack[sp] = right; sp = sp + 1u; }
-        nodeRel = left;
-    }
-    return false;
-}
-
-fn _rtTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32,
-                   sbtRecordOffset: u32, sbtRecordStride: u32,
-                   bestHit: ptr<function, HitInfo>,
-                   bestT:   ptr<function, f32>,
-                   payload: ptr<function, Payload>) -> bool {
-    let invD = vec3<f32>(1.0) / rayWorld.direction;
-    // Stack-based descent of the sweep-tree BVH. Internal nodes
-    // [0, TLAS_BVH_LEAVES_START); leaves [LEAVES_START, 2*N_PADDED-1).
-    // Node i's children are 2i+1 / 2i+2 (implicit perfect binary tree).
-    // Stack depth = tree depth = log2(N_PADDED) = 14 for N_PADDED=16384;
-    // 24 gives generous headroom.
-    var stack: array<u32, 24>;
-    var sp: u32 = 0u;
-    stack[sp] = 0u; sp = sp + 1u;
-    loop {
-        if (sp == 0u) { break; }
-        sp = sp - 1u;
-        let nodeIdx = stack[sp];
-        let node = tlasBvhNodes[nodeIdx];
-        if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, *bestT)) {
-            continue;
-        }
-        if (nodeIdx >= TLAS_BVH_LEAVES_START) {
-            // Leaf: resolve entry, do the existing per-instance test.
-            let leafIdx = nodeIdx - TLAS_BVH_LEAVES_START;
-            let i = tlasEntryOrder[leafIdx];
-            // Sentinel-padded leaves get instanceMask=0; cullMask check
-            // (and degenerate AABB above) means they fall out cheaply.
-            if (i == 0xFFFFFFFFu) { continue; }
-            let inst = tlasEntries[i];
-            let instanceMask = inst.maskHGOffset & 0xFFu;
-            if ((instanceMask & cullMask) == 0u) { continue; }
-            if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; }
-
-        // Transform ray to object space.
-        let r0 = inst.worldToObjectR0;
-        let r1 = inst.worldToObjectR1;
-        let r2 = inst.worldToObjectR2;
-        var rayObj: RayDesc;
-        rayObj.origin = vec3<f32>(
-            dot(r0.xyz, rayWorld.origin) + r0.w,
-            dot(r1.xyz, rayWorld.origin) + r1.w,
-            dot(r2.xyz, rayWorld.origin) + r2.w,
-        );
-        rayObj.direction = vec3<f32>(
-            dot(r0.xyz, rayWorld.direction),
-            dot(r1.xyz, rayWorld.direction),
-            dot(r2.xyz, rayWorld.direction),
-        );
-        rayObj.tMin = rayWorld.tMin;
-        rayObj.tMax = *bestT;
-
-        var effective = flags;
-        let iflags = inst.instanceFlags;
-        if ((iflags & RT_INSTANCE_FORCE_OPAQUE) != 0u) {
-            effective = (effective | RT_FLAG_OPAQUE) & ~RT_FLAG_NO_OPAQUE;
-        }
-        if ((iflags & RT_INSTANCE_FORCE_NO_OPAQUE) != 0u) {
-            effective = (effective | RT_FLAG_NO_OPAQUE) & ~RT_FLAG_OPAQUE;
-        }
-        if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) {
-            effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES);
-        }
-
-        let hitGroupOffset = inst.maskHGOffset >> 8u;
-        let hitGroupBase   = sbtRecordOffset + hitGroupOffset;
-        let meshRec        = meshRecords[inst.blasMeshIdx];
-
-        let pre = *bestT;
-        let endSearch = _rtTraverseBlas(rayObj, effective, meshRec, i, hitGroupBase,
-                                        bestHit, bestT, payload);
-        if (endSearch) { return true; }
-        if ((*bestT) < pre) {
-            // record world-space object-to-world for the closest-hit shader
-            (*bestHit).objectToWorldR0 = inst.objectToWorldR0;
-            (*bestHit).objectToWorldR1 = inst.objectToWorldR1;
-            (*bestHit).objectToWorldR2 = inst.objectToWorldR2;
-            (*bestHit).customIndex     = inst.customIndex;
-            if ((effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
-        }
-        } else {
-            // Internal node: push both children (skip overflow).
-            let left  = 2u * nodeIdx + 1u;
-            let right = 2u * nodeIdx + 2u;
-            if (sp + 1u < 24u) {
-                stack[sp] = right; sp = sp + 1u;
-                stack[sp] = left;  sp = sp + 1u;
-            }
-        }
-    }
-    return false;
-}
-
-fn traceRay(tlasIdx: u32, flags: u32, cullMask: u32,
-            sbtRecordOffset: u32, sbtRecordStride: u32, missIndex: u32,
-            rayOrigin: vec3<f32>, rayTMin: f32,
-            rayDir:    vec3<f32>, rayTMax: f32,
-            payload: ptr<function, Payload>) {
-    var ray: RayDesc;
-    ray.origin    = rayOrigin;
-    ray.direction = rayDir;
-    ray.tMin      = rayTMin;
-    ray.tMax      = rayTMax;
-    var bestHit: HitInfo;
-    bestHit.t = rayTMax;
-    var bestT = rayTMax;
-    let ended = _rtTraverseTlas(ray, flags, cullMask & 0xFFu,
-                                sbtRecordOffset, sbtRecordStride,
-                                &bestHit, &bestT, payload);
-    if (bestT < rayTMax) {
-        if ((flags & RT_FLAG_SKIP_CLOSEST_HIT) == 0u) {
-            runClosestHit(bestHit.hitGroupIndex, ray, bestHit, payload);
-        }
-    } else {
-        runMiss(missIndex, ray, payload);
-    }
-}
-`;
 
 // ════════════════════════════════════════════════════════════════════════
 // WAVEFRONT RT — streaming tracer (GENERATE → PREP → TRACE → SHADE →

From afc0292fab6841e281af006bd781a627ae2a5fd0 Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 20:28:12 +0000
Subject: [PATCH 7/8] WebGPU RT: dynamic TLAS sweep-tree depth (next_pow2
 instances)

The LBVH bitonic sort still runs over the full 16384 (sentinels sink to
the tail), but the sweep tree is now built and traced at depth
log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf
init + bottom-up refit use it; the host passes the same next_pow2 to the
trace via WfParams.tlasNPadded. Renders correctly at 512 instances
(depth 9). The fragile sort phases are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 additional/dom-webgpu.js | 69 ++++++++++++++++++++++++++--------------
 1 file changed, 46 insertions(+), 23 deletions(-)

diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js
index d3a9c72..c0e83d1 100644
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
@@ -2281,7 +2281,11 @@ struct BvhNode {
 // runtime resize-on-grow caused subtle BVH corruption (driver-level
 // memory recycling, suspected) and was the root cause of mid-game
 // geometry flicker when projectiles entered the TLAS.
-struct LbvhPC { nReal: u32, _pad0: u32, _pad1: u32, _pad2: u32 };
+// nPadded = next_pow2(max(nReal,1)), supplied by the host. The bitonic
+// sort still runs over the full N_PADDED (sentinels sink to the tail), but
+// the sweep tree is built (and traced) at depth log2(nPadded) so descent
+// tracks the real instance count instead of a fixed 14.
+struct LbvhPC { nReal: u32, nPadded: u32, _pad1: u32, _pad2: u32 };
 @group(0) @binding(5) var<uniform> lbvhPc : LbvhPC;
 
 const N_PADDED:      u32 = 16384u;
@@ -2436,29 +2440,36 @@ fn lbvhBuildMain(@builtin(local_invocation_id) lid: vec3<u32>) {
     storageBarrier();
 
     // ── Phase 4: initialize BVH leaf AABBs ───────────────────────────────
-    for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+    // Only the first nPadded sorted slots become leaves of the (smaller)
+    // sweep tree; reals occupy [0,nReal), the rest sink as sentinels.
+    let nPadded = max(lbvhPc.nPadded, 1u);
+    let leafPerThread = (nPadded + THREADS - 1u) / THREADS;
+    for (var k: u32 = 0u; k < leafPerThread; k = k + 1u) {
         let i = k * THREADS + tid;
-        let leafIdx = N_PADDED - 1u + i;
-        let leafKey = sortA[i];
-        if (leafKey == 0xFFFFFFFFu) {
-            outBvh[leafIdx].aabbMin = vec3<f32>( 1e30);
-            outBvh[leafIdx].aabbMax = vec3<f32>(-1e30);
-        } else {
-            let e = entries[leafKey & 0xFFFFu];
-            outBvh[leafIdx].aabbMin = e.aabbMin;
-            outBvh[leafIdx].aabbMax = e.aabbMax;
+        if (i < nPadded) {
+            let leafIdx = nPadded - 1u + i;
+            let leafKey = sortA[i];
+            if (leafKey == 0xFFFFFFFFu) {
+                outBvh[leafIdx].aabbMin = vec3<f32>( 1e30);
+                outBvh[leafIdx].aabbMax = vec3<f32>(-1e30);
+            } else {
+                let e = entries[leafKey & 0xFFFFu];
+                outBvh[leafIdx].aabbMin = e.aabbMin;
+                outBvh[leafIdx].aabbMax = e.aabbMax;
+            }
         }
     }
     workgroupBarrier();
     storageBarrier();
 
-    // ── Phase 5: bottom-up sweep-tree refit, LEVELS iterations ──────────
-    // Deepest internal level has N_PADDED/2 nodes; perThread = ceil of
-    // levelCount / THREADS is uniform per step, so workgroupBarrier
-    // stays in uniform control flow.
-    var levelCount: u32 = N_PADDED / 2u;
-    var levelStart: u32 = N_PADDED / 2u - 1u;
-    for (var step: u32 = 0u; step < LEVELS; step = step + 1u) {
+    // ── Phase 5: bottom-up sweep-tree refit, log2(nPadded) levels ───────
+    // Deepest internal level has nPadded/2 nodes. The loop bound is uniform
+    // across the workgroup (depends only on nPadded), so the barriers stay
+    // in uniform control flow.
+    var levelCount: u32 = nPadded / 2u;
+    var levelStart: u32 = nPadded / 2u - 1u;
+    loop {
+        if (levelCount == 0u) { break; }
         let perThread = (levelCount + THREADS - 1u) / THREADS;
         for (var k: u32 = 0u; k < perThread; k = k + 1u) {
             let nodeOff = k * THREADS + tid;
@@ -2723,11 +2734,12 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle,
             { binding: 4, resource: { buffer: morton } },
         ],
     });
-    // Write the real instance count to the LBVH count uniform so the
-    // shader can iterate exactly the right number of entries even
-    // though the storage buffers stay sized for N_PADDED.
+    // Write the real instance count + the dynamic padded leaf count
+    // (next_pow2) to the LBVH uniform. The sort still runs over the full
+    // N_PADDED, but the sweep tree is built at depth log2(nPadded).
     const countBuf = new Uint32Array(4);
     countBuf[0] = instanceCount;
+    countBuf[1] = wfNextPow2(instanceCount);
     queue.writeBuffer(rtState.lbvhCountBuf, 0, countBuf);
 
     const lbvhBg = device.createBindGroup({
@@ -2798,7 +2810,15 @@ const WF_PAYLOAD_BYTES = 64;
 // Dynamic-offset uniform ring: one WfParams slot per wavefront pass. 128
 // slots covers maxDepth up to ~42 (1 + 3·maxDepth + 1 passes).
 const WF_PARAM_SLOTS   = 128;
-const WF_FIXED_TLAS_NPADDED = 16384;   // matches lbvhBuildWgsl N_PADDED
+const WF_TLAS_MAX_NPADDED = 16384;   // LBVH sort capacity (N_PADDED)
+// Smallest power of two >= max(n,1), clamped to the LBVH capacity. The
+// TLAS sweep tree is built and traced at this depth so descent tracks the
+// real instance count instead of a fixed 16384-leaf (depth-14) tree.
+function wfNextPow2(n) {
+    let p = 1;
+    while (p < n && p < WF_TLAS_MAX_NPADDED) p <<= 1;
+    return p;
+}
 
 function ensureWavefrontBuffers(W, H) {
     const cap = W * H;
@@ -3041,11 +3061,14 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
     //   1+3*d .. +2  PREP / TRACE / SHADE for bounce d
     //   1+3*depth    RESOLVE
     const passCount = 2 + 3 * depth;
+    // TLAS descent depth = log2(tlasNPadded); must match the value the
+    // build used (both derive next_pow2 from the same instance count).
+    const tlasNPadded = wfNextPow2(instanceCount);
     const ring = new Uint32Array(WF_PARAM_SLOTS * 64);  // 256 B = 64 u32 per slot
     const writeSlot = (slot, curIsA, bounce) => {
         const o = slot * 64;
         ring[o + 0] = W; ring[o + 1] = H; ring[o + 2] = cap; ring[o + 3] = curIsA;
-        ring[o + 4] = bounce; ring[o + 5] = depth; ring[o + 6] = WF_FIXED_TLAS_NPADDED; ring[o + 7] = 0;
+        ring[o + 4] = bounce; ring[o + 5] = depth; ring[o + 6] = tlasNPadded; ring[o + 7] = 0;
     };
     writeSlot(0, 1, 0);                              // GENERATE
     for (let d = 0; d < depth; d++) {

From 358084185a7fd563f21c0f9940039413b80e24c8 Mon Sep 17 00:00:00 2001
From: catbot <catbot@bot.local>
Date: Sun, 31 May 2026 20:29:12 +0000
Subject: [PATCH 8/8] docs: wavefront RT in README + design-doc status; add
 RTStress to examples

---
 README.md           | 17 ++++++++++++++---
 WAVEFRONT-DESIGN.md | 32 +++++++++++++++++++++++++-------
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 6aca07b..b64e32d 100644
--- a/README.md
+++ b/README.md
@@ -50,9 +50,16 @@ compute pipeline composed from user-supplied WGSL stages).
   bridge. Atlas (`r8unorm`, sub-region writes) is a separate path.
 - **PipelineRTVulkan / PipelineRTWebGPU / ShaderBindingTableVulkan /
   ShaderBindingTableWebGPU / RTPass** — ray-tracing pipelines. Vulkan
-  uses native RT pipelines + SBTs; WebGPU composes one compute
-  pipeline by stitching the traversal library, a generated hit-group
-  switch, and the user's raygen / closesthit / miss / anyhit WGSL.
+  uses native RT pipelines + SBTs; WebGPU compiles a **wavefront /
+  streaming** software tracer — five `@compute` kernels
+  (`GENERATE → PREP → TRACE → SHADE → RESOLVE`) sharing one module,
+  connected by GPU ray/hit/payload buffers and a GPU-driven indirect
+  bounce loop (`dispatchWorkgroupsIndirect`). TRACE carries zero user
+  code (traversal + intersection only); user raygen calls
+  `rtEmitPrimaryRay`, and closesthit / miss run in SHADE where they
+  `rtEmitRay` continuation/shadow rays and `rtAccumulate` radiance. An
+  optional Resolve shader tonemaps the linear accumulator. See
+  [WAVEFRONT-DESIGN.md](WAVEFRONT-DESIGN.md).
 - **ComputeShader / WebGPUComputeShader** — Tier 1 wrapper used by the
   UI system. Vulkan loads a `.spv` and dispatches with
   `vkCmdPushDataEXT`; WebGPU loads a user-supplied `.wgsl` blob at
@@ -145,6 +152,10 @@ See [examples/](examples/). Quick map:
 - [VulkanTriangle](examples/VulkanTriangle/) — ray-traced triangle on
   both Vulkan and WebGPU. The smallest test of the bindless + RT path
   on each backend.
+- [RTStress](examples/RTStress/) — wavefront RT benchmark: an N×N×N grid
+  of a cube mesh (instance-count knob `kGrid`, 512 → 8000) shaded with
+  primary + shadow rays. Prints a GPU timestamp-query per-pass breakdown
+  each second. WebGPU/DOM only.
 - [Sponza](examples/Sponza/) — ray-traced Sponza atrium on both
   backends. Exercises `.cmesh` / `.ctex` decompression (GPU
   `VK_EXT_memory_decompression` on Vulkan, CPU on WebGPU) and a
diff --git a/WAVEFRONT-DESIGN.md b/WAVEFRONT-DESIGN.md
index 78d77e2..47e42d0 100644
--- a/WAVEFRONT-DESIGN.md
+++ b/WAVEFRONT-DESIGN.md
@@ -53,13 +53,31 @@ Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle
 maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow).
 
 ## Status / progress
-- [x] baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png
-- [ ] wavefront prelude + codegen
-- [ ] VulkanTriangle on wavefront (maxDepth=1)
-- [ ] bounce loop + indirect + Sponza shadow port
-- [ ] RTStress example + timestamp queries
-- [ ] ordered traversal, dynamic TLAS depth, device limits
-- [ ] remove megakernel dual path; final validation; PR
+- [x] baseline VulkanTriangle renders (megakernel)
+- [x] wavefront prelude + codegen (5 entry points share one module)
+- [x] VulkanTriangle on wavefront (maxDepth=1) — bit-identical to baseline
+- [x] indirect-dispatch bounce loop + PREP (cross-pass atomics proven)
+- [x] RTStress example (N³ cube grid) + GPU timestamp-query per-pass HUD
+- [x] Sponza port (shadow ray in SHADE) — renders the atrium correctly
+- [x] ordered (nearest-child-first) traversal
+- [x] dynamic TLAS sweep-tree depth (next_pow2 instances)
+- [x] device limits (maxBufferSize / maxStorageBufferBindingSize /
+      maxComputeWorkgroupsPerDimension) + timestamp-query feature
+- [x] megakernel dead path removed (RT pipeline builds only wavefront)
+- [~] binding packing (Phase 7): SKIPPED — target device reports 64 storage
+      buffers/stage (≥12), so the merge is unnecessary (issue makes it
+      conditional on <12).
+
+### Measured (this container's GPU, via timestamp-query; NOT a 4090)
+Per-pass GPU time, 1920×995, primary+shadow (maxDepth=2):
+- RTStress 512 inst:  GEN ~0.80ms TRACE ~1.63ms SHADE ~1.00ms total ~3.52ms (~280 fps)
+- RTStress 4096 inst: GEN ~0.80ms TRACE ~1.95ms SHADE ~1.00ms total ~3.85ms (~260 fps)
+- Sponza:             GEN ~0.79ms TRACE ~1.81ms SHADE ~1.00ms total ~3.69ms
+8× the instances costs only ~16% more TRACE — the spatial TLAS + ordered
+descent scale sub-linearly. NOTE: a 4090 number and the TRACE-kernel
+register/occupancy delta require hardware + a profiler not available in
+this CI container; the architectural win (TRACE carries zero user code, so
+its register footprint is the traversal loop alone) is structural.
 
 ## Files
 - `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,