From 4e42d663a6179a179196ef3a8f46474a7df9bc7a Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 16:24:41 +0000 Subject: [PATCH 1/8] WebGPU RT: wavefront tracer core (GENERATE/PREP/TRACE/SHADE/RESOLVE) Replace the megakernel @compute entry with five wavefront kernels sharing one module, connected by GPU ray/hit/payload buffers and a GPU-driven indirect bounce loop: GENERATE -> (PREP -> TRACE -> SHADE) x maxDepth -> RESOLVE - TRACE contains zero user code (pure _rtwTraverseTlas/Blas, opaque-only). - PREP publishes dispatchWorkgroupsIndirect args from the live ray count; the indirect-args buffer lives in its own bind group so it is never bound read-write in the same dispatch that consumes it as INDIRECT. - New emit/accumulate API: rtEmitPrimaryRay / rtEmitRay / rtAccumulate, plus an optional user Resolve stage (tonemap hook; identity by default). - Per-pass WfParams via a dynamic-offset uniform ring (curIsA/bounce vary between passes within one submit). - Payload-typed wfPayload binding emitted in the codegen region after the user's struct Payload; payload travels with each ray (2*W*H slots). - Request maxBufferSize / maxStorageBufferBindingSize / maxComputeWorkgroups PerDimension so the W*H-sized work buffers fit past the 128MB baseline. VulkanTriangle ported to the new API and renders bit-identical to the megakernel baseline at maxDepth=1. Co-Authored-By: Claude Opus 4.8 --- WAVEFRONT-DESIGN.md | 69 ++ additional/dom-webgpu.js | 653 ++++++++++++++++-- examples/VulkanTriangle/closesthit.wgsl | 11 +- examples/VulkanTriangle/miss.wgsl | 5 +- examples/VulkanTriangle/raygen.wgsl | 25 +- .../Crafter.Graphics-PipelineRTWebGPU.cpp | 73 +- interfaces/Crafter.Graphics-RTPass.cppm | 8 +- ...ter.Graphics-ShaderBindingTableWebGPU.cppm | 5 + interfaces/Crafter.Graphics-WebGPU.cppm | 3 +- 9 files changed, 753 insertions(+), 99 deletions(-) create mode 100644 WAVEFRONT-DESIGN.md diff --git a/WAVEFRONT-DESIGN.md b/WAVEFRONT-DESIGN.md new file mode 100644 index 0000000..78d77e2 --- /dev/null +++ b/WAVEFRONT-DESIGN.md @@ -0,0 +1,69 @@ +# WebGPU wavefront RT rewrite — design & progress (issue #3) + +Replaces the single megakernel (`main`, 8×8 tile, per-pixel +raygen→traceRay→CH/miss→store) with a streaming wavefront tracer: +`GENERATE → PREP → (TRACE → SHADE → PREP)×maxDepth → RESOLVE`, each its own +compute pass, dispatch sizes driven by `dispatchWorkgroupsIndirect`. + +## Kernels (all generated/assembled the same megakernel way, just split) +- **GENERATE** (1 thread/pixel, 8×8): runs user `raygen_main(gid)` which calls + `rtEmitPrimaryRay(...)`. Clears accum slot + payload slot for the pixel. +- **PREP** (1 thread): reads emit counter for the just-filled ray buffer, + writes indirect args `[ceil(n/64),1,1]`, publishes `traceCount=n`, swaps + cur/next ray buffer, resets next emit counter. One PREP before first TRACE + and one after each SHADE. +- **TRACE** (1 thread/ray, 64-wide, indirect): ZERO user code. Reads ray i, + runs `_rtTraverseTlas`, writes `HitResult` i (t/instanceId/primId/hg/attribs + /objToWorld/customIndex/missFlag). +- **SHADE** (1 thread/ray, 64-wide, indirect): reads ray i + hit i + payload + slot p. miss→`runMiss`, hit→`runClosestHit` (unless SKIP_CLOSEST_HIT). User + code calls `rtAccumulate(pixel,rgb)` and `rtEmitRay(...)`. +- **RESOLVE** (1 thread/pixel, 8×8): reads accum slot, runs user `resolve_main` + if present else passthrough; writes outImage. + +## Buffers (rtState, sized to 2*W*H rays) +- `wfRaysA`,`wfRaysB`: array, ping/pong. WfRay = origin,tMin,dir,tMax, + pixel,flags,cullMask,missIndex,sbtOffset,payloadSlot,kind,_pad. +- `wfHits`: array (sized = ray capacity). +- `wfPayload`: array — declared in CODEGEN region after user Payload. +- `wfAccum`: array> per pixel (W*H). +- `wfCounters`: atomic counters: emitA, emitB, trace dispatch args, etc. +- `wfIndirect`: INDIRECT dispatch-args buffer. + +## API (new, breaking) +- raygen: `rtEmitPrimaryRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx)` + → allocates payloadSlot=pixel, writes ray to current buffer (atomic bump). +- CH/miss: `rtEmitRay(origin,tMin,dir,tMax,flags,cullMask,sbtOff,missIdx,payload)` + spawns into NEXT buffer carrying a payload slot; `rtAccumulate(pixel,rgb)`. +- `rtGetPayload(slot)` / payload passed by value into CH/miss via slot. + +## Tonemap / resolve +Accum buffer is linear. Optional user `WebGPURTStage::Resolve` entry +`resolve_main(coord:vec2, hdr:vec4)->vec4`. None → passthrough. +VulkanTriangle: no resolve (exact match). Sponza: resolve does Reinhard+gamma. + +## Indirect dispatch (Phase 2 de-risk) +Prove `dispatchWorkgroupsIndirect` + cross-pass atomic visibility with a toy +"emit N → dispatch N" before wiring real kernels. WebGPU inserts an implicit +barrier between compute passes in one submit, so atomics written in PREP are +visible to TRACE. + +## maxDepth +Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle +maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow). + +## Status / progress +- [x] baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png +- [ ] wavefront prelude + codegen +- [ ] VulkanTriangle on wavefront (maxDepth=1) +- [ ] bounce loop + indirect + Sponza shadow port +- [ ] RTStress example + timestamp queries +- [ ] ordered traversal, dynamic TLAS depth, device limits +- [ ] remove megakernel dual path; final validation; PR + +## Files +- `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`, + `wgpuDispatchRT`, LBVH build, rtState/buffers, device-limit clamp (~L131). +- `implementations/Crafter.Graphics-PipelineRTWebGPU.cpp` — assembles user + WGSL + entry glue; must emit 5 entry points + payloadStore binding. +- examples/{VulkanTriangle,Sponza,RTStress}/*.wgsl + main.cpp. diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js index e035d97..9380517 100644 --- a/additional/dom-webgpu.js +++ b/additional/dom-webgpu.js @@ -136,6 +136,16 @@ clamp("maxStorageTexturesPerShaderStage", 8); // per-workgroup invocation cap raised from the default 256. clamp("maxComputeInvocationsPerWorkgroup", 1024); clamp("maxComputeWorkgroupSizeX", 1024); +// Wavefront RT work buffers are sized to W·H rays. At 1080p the payload +// store (≈245 MB) and hit buffer (≈214 MB) blow past the 128 MB baseline +// storage-buffer binding size, and the whole set past the 256 MB baseline +// maxBufferSize — request whatever the adapter actually allows (4090/Dawn +// reports 1 GB+). maxComputeWorkgroupsPerDimension bounds the indirect +// TRACE/SHADE 1-D dispatch (ceil(W·H/64) ≈ 32k workgroups at 1080p; the +// 65535 default covers it, but request the adapter max for headroom). +clamp("maxBufferSize", 1 << 30); +clamp("maxStorageBufferBindingSize", 1 << 30); +clamp("maxComputeWorkgroupsPerDimension", 65535); const device = await adapter.requestDevice({ requiredLimits }); const queue = device.queue; const ctx = canvas.getContext("webgpu"); @@ -147,6 +157,9 @@ device.lost.then((info) => { console.error("[crafter-wgpu] device lost:", info.message); state.gpuLost = true; }); +device.addEventListener("uncapturederror", (e) => { + console.error("[crafter-wgpu] uncaptured error:", e.error && e.error.message); +}); // ─── handle tables ───────────────────────────────────────────────────── @@ -1671,6 +1684,350 @@ fn traceRay(tlasIdx: u32, flags: u32, cullMask: u32, } `; +// ════════════════════════════════════════════════════════════════════════ +// WAVEFRONT RT — streaming tracer (GENERATE → PREP → TRACE → SHADE → +// RESOLVE). Replaces the megakernel. The C++ side (PipelineRTWebGPU) emits +// the user sources, the per-stage SBT switches, the Payload-typed wfPayload +// binding, and the five @compute entry points; this JS injects the bindings +// + library helpers the entry points call. +// ════════════════════════════════════════════════════════════════════════ + +// Bindings prelude for the wavefront pipeline. group(0) is the per-pass +// WfParams uniform (dynamic-offset ring — one slot per pass so curIsA / +// bounce can vary between passes inside one submit). group(1) carries the +// geometry heaps (0..9, identical to the old megakernel layout so the +// register/build paths are unchanged) plus the wavefront work buffers +// (10..14); wfPayload at binding 15 is emitted in the codegen region after +// the user's `struct Payload`. group(2) is the indirect-args buffer, bound +// only by PREP (a buffer used as INDIRECT in a dispatch may not also be +// bound read-write in that same dispatch — so TRACE/SHADE must not bind it). +const rtWgslWavefrontBindings = String.raw` +struct WfParams { + surfaceW: u32, + surfaceH: u32, + rayCapacity: u32, + curIsA: u32, // 1 → current ray buffer is A (emit-next = B); 0 → B + bounce: u32, + maxDepth: u32, + tlasNPadded: u32, // TLAS sweep-tree padded leaf count (descent depth) + flags: u32, +}; + +// One in-flight ray. 64 bytes; origin/direction vec3-aligned to 16. +struct WfRay { + origin: vec3, + tMin: f32, + direction: vec3, + tMax: f32, + pixel: u32, // linear framebuffer pixel this ray contributes to + flags: u32, + cullMask: u32, + missIndex: u32, + sbtRecordOffset: u32, + payloadSlot: u32, // index into wfPayload + kind: u32, // 0 primary, 1 continuation (informational) + _pad: u32, +}; + +// TRACE → SHADE handoff. Mirrors HitInfo + a hitKind (0 miss, 1 triangle). +struct HitResult { + t: f32, + instanceId: u32, + primitiveId: u32, + hitGroupIndex: u32, + attribs: vec2, + hitKind: u32, + customIndex: u32, + objectRayOrigin: vec3, + _p0: f32, + objectRayDirection: vec3, + _p1: f32, + objectToWorldR0: vec4, + objectToWorldR1: vec4, + objectToWorldR2: vec4, +}; + +struct BvhNode { + aabbMin: vec3, + _pad0: u32, + aabbMax: vec3, + _pad1: u32, +}; + +@group(0) @binding(0) var wfParams : WfParams; + +@group(1) @binding(0) var tlasEntries : array; +@group(1) @binding(1) var bvhNodes : array; +@group(1) @binding(2) var meshRecords : array; +@group(1) @binding(3) var vertices : array; +@group(1) @binding(4) var indices : array; +@group(1) @binding(5) var primRemap : array; +@group(1) @binding(6) var outImage : texture_storage_2d; +@group(1) @binding(7) var vertexAttribs : array; +@group(1) @binding(8) var tlasEntryOrder : array; +@group(1) @binding(9) var tlasBvhNodes : array; +@group(1) @binding(10) var wfRaysA : array; +@group(1) @binding(11) var wfRaysB : array; +@group(1) @binding(12) var wfHits : array; +@group(1) @binding(13) var wfAccum : array>; +@group(1) @binding(14) var wfCounters : array>; +// @group(1) @binding(15) wfPayload : array — emitted by codegen. + +@group(2) @binding(0) var wfIndirect : array; +`; + +// Library helpers the codegen entry points call. Sits after the pure +// helpers (_rtAabb/_rtTri/_rtFetchTri) and after the user's Payload + +// wfPayload binding, so rtEmit*/_wfShade can name Payload/wfPayload. +const rtWgslWavefrontHelpers = String.raw` +var _wfPixel: u32 = 0u; + +// Live ray count for the current buffer, clamped to capacity (the emit +// counter can overshoot capacity; dropped rays were never written). +fn _wfCurCount() -> u32 { + let raw = select(atomicLoad(&wfCounters[1]), atomicLoad(&wfCounters[0]), + wfParams.curIsA == 1u); + return min(raw, wfParams.rayCapacity); +} + +// Add linear radiance to the pixel this SHADE/GENERATE thread owns. Safe +// without atomics: at most one ray per pixel per bounce, and bounces run +// in separate passes (implicit barrier between them). +fn rtAccumulate(rgb: vec3) { + wfAccum[_wfPixel] = wfAccum[_wfPixel] + vec4(rgb, 0.0); +} + +// raygen → emit the pixel's primary ray. Bounce 0's current buffer is +// always A, so primaries land in A with their payload in the A region +// [0, rayCapacity). +fn rtEmitPrimaryRay(origin: vec3, tMin: f32, dir: vec3, tMax: f32, + flags: u32, cullMask: u32, sbtRecordOffset: u32, + missIndex: u32, payload: Payload) { + let slot = atomicAdd(&wfCounters[0], 1u); + if (slot >= wfParams.rayCapacity) { return; } + var r: WfRay; + r.origin = origin; r.tMin = tMin; r.direction = dir; r.tMax = tMax; + r.pixel = _wfPixel; r.flags = flags; r.cullMask = cullMask; + r.missIndex = missIndex; r.sbtRecordOffset = sbtRecordOffset; + r.payloadSlot = slot; r.kind = 0u; + wfRaysA[slot] = r; + wfPayload[slot] = payload; +} + +// closesthit/miss → spawn a continuation/shadow ray into the NEXT buffer +// (the one the upcoming TRACE will read). Payload travels with it; the +// next buffer's payload region is [rayCapacity, 2*rayCapacity) for B. +fn rtEmitRay(origin: vec3, tMin: f32, dir: vec3, tMax: f32, + flags: u32, cullMask: u32, sbtRecordOffset: u32, + missIndex: u32, payload: Payload) { + let nextIsA = wfParams.curIsA == 0u; + let counterIdx = select(1u, 0u, nextIsA); + let slot = atomicAdd(&wfCounters[counterIdx], 1u); + if (slot >= wfParams.rayCapacity) { return; } + let payloadBase = select(wfParams.rayCapacity, 0u, nextIsA); + var r: WfRay; + r.origin = origin; r.tMin = tMin; r.direction = dir; r.tMax = tMax; + r.pixel = _wfPixel; r.flags = flags; r.cullMask = cullMask; + r.missIndex = missIndex; r.sbtRecordOffset = sbtRecordOffset; + r.payloadSlot = payloadBase + slot; r.kind = 1u; + if (nextIsA) { wfRaysA[slot] = r; } else { wfRaysB[slot] = r; } + wfPayload[r.payloadSlot] = payload; +} + +// Opaque-only BLAS descent (no anyhit — TRACE runs zero user code). +fn _rtwTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord, + instanceId: u32, hitGroupBase: u32, + bestHit: ptr, + bestT: ptr) -> bool { + let invD = vec3(1.0) / rayObj.direction; + var stack: array; + var sp: u32 = 0u; + var nodeRel: u32 = 0u; + loop { + let absI = meshRec.bvhOffset + nodeRel; + let node = bvhNodes[absI]; + if (!_rtAabb(rayObj.origin, invD, node.aabbMin, node.aabbMax, *bestT)) { + if (sp == 0u) { break; } + sp = sp - 1u; nodeRel = stack[sp]; continue; + } + if (node.primCount > 0u) { + for (var i: u32 = 0u; i < node.primCount; i = i + 1u) { + let triIndex = primRemap[meshRec.primRemapOffset + node.firstChildOrPrim + i]; + let verts = _rtFetchTri(meshRec, triIndex); + let tr = _rtTri(rayObj.origin, rayObj.direction, + verts[0], verts[1], verts[2], rayObj.tMin, *bestT); + if (!tr.hit) { continue; } + let geomNormal = cross(verts[1] - verts[0], verts[2] - verts[0]); + let facing = dot(geomNormal, rayObj.direction); + if ((flags & RT_FLAG_CULL_BACK_FACING_TRIANGLES) != 0u && facing > 0.0) { continue; } + if ((flags & RT_FLAG_CULL_FRONT_FACING_TRIANGLES) != 0u && facing < 0.0) { continue; } + var candidate: HitInfo; + candidate.t = tr.t; + candidate.instanceId = instanceId; + candidate.primitiveId = triIndex; + candidate.hitGroupIndex = hitGroupBase; + candidate.attribs = vec2(tr.u, tr.v); + candidate.objectRayOrigin = rayObj.origin; + candidate.objectRayDirection = rayObj.direction; + *bestHit = candidate; + *bestT = tr.t; + if ((flags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } + } + if (sp == 0u) { break; } + sp = sp - 1u; nodeRel = stack[sp]; continue; + } + let left = node.firstChildOrPrim; + let right = left + 1u; + if (sp < 32u) { stack[sp] = right; sp = sp + 1u; } + nodeRel = left; + } + return false; +} + +fn _rtwTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32, + sbtRecordOffset: u32, + bestHit: ptr, + bestT: ptr) -> bool { + let invD = vec3(1.0) / rayWorld.direction; + let leavesStart = wfParams.tlasNPadded - 1u; + var stack: array; + var sp: u32 = 0u; + stack[sp] = 0u; sp = sp + 1u; + loop { + if (sp == 0u) { break; } + sp = sp - 1u; + let nodeIdx = stack[sp]; + let node = tlasBvhNodes[nodeIdx]; + if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, *bestT)) { continue; } + if (nodeIdx >= leavesStart) { + let leafIdx = nodeIdx - leavesStart; + let i = tlasEntryOrder[leafIdx]; + if (i == 0xFFFFFFFFu) { continue; } + let inst = tlasEntries[i]; + let instanceMask = inst.maskHGOffset & 0xFFu; + if ((instanceMask & cullMask) == 0u) { continue; } + if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; } + let r0 = inst.worldToObjectR0; + let r1 = inst.worldToObjectR1; + let r2 = inst.worldToObjectR2; + var rayObj: RayDesc; + rayObj.origin = vec3( + dot(r0.xyz, rayWorld.origin) + r0.w, + dot(r1.xyz, rayWorld.origin) + r1.w, + dot(r2.xyz, rayWorld.origin) + r2.w); + rayObj.direction = vec3( + dot(r0.xyz, rayWorld.direction), + dot(r1.xyz, rayWorld.direction), + dot(r2.xyz, rayWorld.direction)); + rayObj.tMin = rayWorld.tMin; + rayObj.tMax = *bestT; + var effective = flags; + let iflags = inst.instanceFlags; + if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) { + effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES); + } + let hitGroupOffset = inst.maskHGOffset >> 8u; + let hitGroupBase = sbtRecordOffset + hitGroupOffset; + let meshRec = meshRecords[inst.blasMeshIdx]; + let pre = *bestT; + let endSearch = _rtwTraverseBlas(rayObj, effective, meshRec, i, hitGroupBase, bestHit, bestT); + if ((*bestT) < pre || endSearch) { + (*bestHit).objectToWorldR0 = inst.objectToWorldR0; + (*bestHit).objectToWorldR1 = inst.objectToWorldR1; + (*bestHit).objectToWorldR2 = inst.objectToWorldR2; + (*bestHit).customIndex = inst.customIndex; + } + if (endSearch) { return true; } + if ((*bestT) < pre && (effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } + } else { + let left = 2u * nodeIdx + 1u; + let right = 2u * nodeIdx + 2u; + if (sp + 1u < 32u) { + stack[sp] = right; sp = sp + 1u; + stack[sp] = left; sp = sp + 1u; + } + } + } + return false; +} + +fn _wfReadRay(i: u32) -> WfRay { + if (wfParams.curIsA == 1u) { return wfRaysA[i]; } + return wfRaysB[i]; +} + +// PREP — publish indirect args for the upcoming TRACE/SHADE; zero the next +// buffer's emit counter. +fn _wfPrep() { + let n = _wfCurCount(); + wfIndirect[0] = (n + 63u) / 64u; + wfIndirect[1] = 1u; + wfIndirect[2] = 1u; + if (wfParams.curIsA == 1u) { atomicStore(&wfCounters[1], 0u); } + else { atomicStore(&wfCounters[0], 0u); } +} + +// TRACE — pure traversal, zero user code. +fn _wfTrace(i: u32) { + if (i >= _wfCurCount()) { return; } + let ray = _wfReadRay(i); + var rd: RayDesc; + rd.origin = ray.origin; rd.tMin = ray.tMin; + rd.direction = ray.direction; rd.tMax = ray.tMax; + var bestHit: HitInfo; + bestHit.t = ray.tMax; + var bestT = ray.tMax; + _rtwTraverseTlas(rd, ray.flags, ray.cullMask & 0xFFu, ray.sbtRecordOffset, &bestHit, &bestT); + var hr: HitResult; + if (bestT < ray.tMax) { + hr.t = bestHit.t; + hr.instanceId = bestHit.instanceId; + hr.primitiveId = bestHit.primitiveId; + hr.hitGroupIndex = bestHit.hitGroupIndex; + hr.attribs = bestHit.attribs; + hr.hitKind = 1u; + hr.customIndex = bestHit.customIndex; + hr.objectRayOrigin = bestHit.objectRayOrigin; + hr.objectRayDirection = bestHit.objectRayDirection; + hr.objectToWorldR0 = bestHit.objectToWorldR0; + hr.objectToWorldR1 = bestHit.objectToWorldR1; + hr.objectToWorldR2 = bestHit.objectToWorldR2; + } else { + hr.hitKind = 0u; + } + wfHits[i] = hr; +} + +// SHADE — dispatch to runMiss / runClosestHit with the ray's payload. +fn _wfShade(i: u32) { + if (i >= _wfCurCount()) { return; } + let ray = _wfReadRay(i); + let hr = wfHits[i]; + _wfPixel = ray.pixel; + var payload: Payload = wfPayload[ray.payloadSlot]; + var rd: RayDesc; + rd.origin = ray.origin; rd.tMin = ray.tMin; + rd.direction = ray.direction; rd.tMax = ray.tMax; + if (hr.hitKind == 0u) { + runMiss(ray.missIndex, rd, &payload); + } else if ((ray.flags & RT_FLAG_SKIP_CLOSEST_HIT) == 0u) { + var hit: HitInfo; + hit.t = hr.t; + hit.instanceId = hr.instanceId; + hit.primitiveId = hr.primitiveId; + hit.hitGroupIndex = hr.hitGroupIndex; + hit.attribs = hr.attribs; + hit.customIndex = hr.customIndex; + hit.objectRayOrigin = hr.objectRayOrigin; + hit.objectRayDirection = hr.objectRayDirection; + hit.objectToWorldR0 = hr.objectToWorldR0; + hit.objectToWorldR1 = hr.objectToWorldR1; + hit.objectToWorldR2 = hr.objectToWorldR2; + runClosestHit(hr.hitGroupIndex, rd, hit, &payload); + } +} +`; + // ── WGSL library: rayQuery API for non-megakernel compute shaders ──── // // Mirrors GL_EXT_ray_query semantics that 3DForts's physics shaders use @@ -2564,6 +2921,40 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle, // textures, samplers). Pass (0, 0) for a pipeline with no user bindings. const rtPipelines = new Map(); // handle → { pipeline, bgls, byGroup, sortedGroups } +// Per-payload byte budget in wfPayload (rounded up; user Payload structs +// must fit). Sponza's Payload is 48 B; 64 leaves headroom while keeping +// 2·W·H·64 B ≈ 265 MB at 1080p. +const WF_PAYLOAD_BYTES = 64; +// Dynamic-offset uniform ring: one WfParams slot per wavefront pass. 128 +// slots covers maxDepth up to ~42 (1 + 3·maxDepth + 1 passes). +const WF_PARAM_SLOTS = 128; +const WF_FIXED_TLAS_NPADDED = 16384; // matches lbvhBuildWgsl N_PADDED + +function ensureWavefrontBuffers(W, H) { + const cap = W * H; + rtState.wf = rtState.wf || { cap: 0 }; + const wf = rtState.wf; + if (wf.cap === cap && wf.raysA) return wf; + for (const b of [wf.raysA, wf.raysB, wf.hits, wf.accum, wf.counters, + wf.payload, wf.indirect]) { if (b) b.destroy(); } + const S = GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST; + wf.raysA = device.createBuffer({ size: cap * 64, usage: S, label: "wf-raysA" }); + wf.raysB = device.createBuffer({ size: cap * 64, usage: S, label: "wf-raysB" }); + wf.hits = device.createBuffer({ size: cap * 112, usage: S, label: "wf-hits" }); + wf.accum = device.createBuffer({ size: cap * 16, usage: S, label: "wf-accum" }); + wf.payload = device.createBuffer({ size: 2 * cap * WF_PAYLOAD_BYTES, usage: S, label: "wf-payload" }); + wf.counters = device.createBuffer({ size: 64, + usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC, label: "wf-counters" }); + wf.indirect = device.createBuffer({ size: 16, + usage: GPUBufferUsage.STORAGE | GPUBufferUsage.INDIRECT | GPUBufferUsage.COPY_DST, label: "wf-indirect" }); + if (!wf.paramsRing) { + wf.paramsRing = device.createBuffer({ size: WF_PARAM_SLOTS * 256, + usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST, label: "wf-params" }); + } + wf.cap = cap; + return wf; +} + env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { if (!rtState.vertHeap) rtInit(); const userPart = new TextDecoder().decode(memU8().subarray(wgslPtr, wgslPtr + wgslLen)); @@ -2577,16 +2968,23 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { beforeHelpers = userPart.substring(0, mi); afterHelpers = userPart.substring(mi + marker.length); } - const fullWgsl = rtWgslPrelude + "\n" + beforeHelpers + "\n" + rtWgslPureHelpers + "\n" + rtWgslMegakernelHelpers + "\n" + afterHelpers; + // Wavefront assembly: types + bindings | user CH/miss/resolve + wfPayload + // + switches (beforeHelpers) | pure helpers | wavefront helpers | user + // raygen + the five @compute entry points (afterHelpers). + const fullWgsl = rtWgslTypes + rtWgslWavefrontBindings + "\n" + + beforeHelpers + "\n" + rtWgslPureHelpers + "\n" + + rtWgslWavefrontHelpers + "\n" + afterHelpers; - // Parse user bindings (same wire format as wgpuLoadCustomShader). + // Parse user bindings (same wire format as wgpuLoadCustomShader). For + // the wavefront RT pipeline, group 0 = WfParams, group 1 = data heaps, + // group 2 = indirect args — so user bindings must start at group 3. const userBindings = []; if (bindingsCount > 0) { const dv = new DataView(memU8().buffer, bindingsPtr, bindingsCount * 8); for (let i = 0; i < bindingsCount; i++) { const g = dv.getUint8(i*8 + 0); - if (g < 2) { - console.error(`[crafter-wgpu] RT pipeline: @group(${g}) reserved; user bindings need group >= 2`); + if (g < 3) { + console.error(`[crafter-wgpu] RT pipeline: @group(${g}) reserved; user bindings need group >= 3`); return 0; } userBindings.push({ @@ -2604,33 +3002,28 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { } const sortedGroups = [...byGroup.keys()].sort((a, b) => a - b); - let pipeline; try { - const mod = device.createShaderModule({ code: fullWgsl, label: "rt-megakernel" }); - // RTDispatchHeader is 16 bytes; bind exactly that. - const headerBgl = device.createBindGroupLayout({ entries: [ + const mod = device.createShaderModule({ code: fullWgsl, label: "rt-wavefront" }); + const paramsBgl = device.createBindGroupLayout({ entries: [ { binding: 0, visibility: GPUShaderStage.COMPUTE, - buffer: { type: "uniform", minBindingSize: 16 } }, + buffer: { type: "uniform", hasDynamicOffset: true, minBindingSize: 32 } }, ]}); + const sb = (b) => ({ binding: b, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }); + const rw = (b) => ({ binding: b, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }); const dataBgl = device.createBindGroupLayout({ entries: [ - { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 5, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + sb(0), sb(1), sb(2), sb(3), sb(4), sb(5), { binding: 6, visibility: GPUShaderStage.COMPUTE, storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } }, - { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, - { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + sb(7), sb(8), sb(9), + rw(10), rw(11), rw(12), rw(13), rw(14), rw(15), ]}); - // User binding-group layouts. WebGPU pipeline layouts need a - // contiguous array up to the highest group used, so pad any gaps - // with empty bgls (same rule as wgpuLoadCustomShader). + const indirectBgl = device.createBindGroupLayout({ entries: [ rw(0) ]}); + const emptyBgl = device.createBindGroupLayout({ entries: [] }); + + // User binding-group layouts for groups 3..highest (pad gaps). const userBgls = []; - const highest = sortedGroups.length ? sortedGroups[sortedGroups.length - 1] : 1; - for (let g = 2; g <= highest; g++) { + const highest = sortedGroups.length ? sortedGroups[sortedGroups.length - 1] : 2; + for (let g = 3; g <= highest; g++) { if (byGroup.has(g)) { const entries = byGroup.get(g).map(b => { const e = { binding: b.binding, visibility: GPUShaderStage.COMPUTE }; @@ -2638,6 +3031,7 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { else if (b.kind === 1) e.texture = { sampleType: "float", viewDimension: "2d" }; else if (b.kind === 2) e.sampler = { type: "filtering" }; else if (b.kind === 3) e.texture = { sampleType: "float", viewDimension: "2d-array" }; + else if (b.kind === 4) e.buffer = { type: "storage" }; return e; }); userBgls.push(device.createBindGroupLayout({ entries })); @@ -2645,12 +3039,29 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { userBgls.push(device.createBindGroupLayout({ entries: [] })); } } - pipeline = device.createComputePipeline({ - layout: device.createPipelineLayout({ bindGroupLayouts: [headerBgl, dataBgl, ...userBgls] }), - compute: { module: mod, entryPoint: "main" }, - }); + // GENERATE / SHADE / RESOLVE may touch user bindings (raygen camera, + // closesthit albedo, resolve params) → params + data + empty(group2) + // + user. PREP → params + data + indirect. TRACE → params + data. + const userLayout = device.createPipelineLayout({ + bindGroupLayouts: [paramsBgl, dataBgl, emptyBgl, ...userBgls] }); + const prepLayout = device.createPipelineLayout({ + bindGroupLayouts: [paramsBgl, dataBgl, indirectBgl] }); + const traceLayout = device.createPipelineLayout({ + bindGroupLayouts: [paramsBgl, dataBgl] }); + const mk = (layout, ep) => device.createComputePipeline({ + layout, compute: { module: mod, entryPoint: ep } }); + + const entry = { + genPipe: mk(userLayout, "wfGenerate"), + prepPipe: mk(prepLayout, "wfPrep"), + tracePipe: mk(traceLayout, "wfTrace"), + shadePipe: mk(userLayout, "wfShade"), + resolvePipe: mk(userLayout, "wfResolve"), + paramsBgl, dataBgl, indirectBgl, emptyBgl, userBgls, + byGroup, sortedGroups, + }; const handle = newHandle(); - rtPipelines.set(handle, { pipeline, headerBgl, dataBgl, userBgls, byGroup, sortedGroups }); + rtPipelines.set(handle, entry); return handle; } catch (e) { console.error("[crafter-wgpu] RT pipeline compile failed:", e); @@ -2659,35 +3070,83 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { } }; +// Build the user @group(3+) bind groups for a pass, returning a list of +// { group, bindGroup } to set. Shared by GENERATE / SHADE / RESOLVE. +function wfUserBindGroups(pipe, handlesPtr, handlesCount) { + const out = []; + if (handlesCount <= 0) return out; + const handles = new Uint32Array(memU8().buffer, handlesPtr, handlesCount); + let handleIdx = 0; + let bglIdx = 0; + const highest = pipe.sortedGroups.length ? pipe.sortedGroups[pipe.sortedGroups.length - 1] : 2; + for (let g = 3; g <= highest; g++) { + if (pipe.byGroup.has(g)) { + const entries = pipe.byGroup.get(g).map(b => { + const h = handles[handleIdx++]; + let resource; + if (b.kind === 0) resource = { buffer: buffers.get(h) }; + else if (b.kind === 1) resource = textureViews.get(h); + else if (b.kind === 2) resource = samplers.get(h); + else if (b.kind === 3) resource = textureViews.get(h); + else if (b.kind === 4) resource = { buffer: buffers.get(h) }; + return { binding: b.binding, resource }; + }); + out.push({ group: g, bindGroup: device.createBindGroup({ layout: pipe.userBgls[bglIdx], entries }) }); + } + bglIdx++; + } + return out; +} + env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, tlasBufHandle, instanceCount, gx, gy, - handlesPtr, handlesCount) => { - if (!state.pass) return; + handlesPtr, handlesCount, maxDepth) => { + if (!state.encoder) return; const pipe = rtPipelines.get(pipelineHandle); const tlas = buffers.get(tlasBufHandle); if (!pipe || !tlas) { console.error("[crafter-wgpu] wgpuDispatchRT: unknown pipeline or tlas"); return; } - // Write RT header from push data (first 16 bytes). Surface dims + instance count + flags. - const hdr32 = new Uint32Array(4); - hdr32[0] = state.width; - hdr32[1] = state.height; - hdr32[2] = instanceCount; - hdr32[3] = 0; - queue.writeBuffer(rtState.rtHeader, 0, hdr32); - - const headerBg = device.createBindGroup({ - layout: pipe.headerBgl, - entries: [{ binding: 0, resource: { buffer: rtState.rtHeader, offset: 0, size: 16 } }], - }); - const outView = state.outIsPing ? state.pingView : state.pongView; const entryOrderBuf = buffers.get(rtState.currentEntryOrder); const bvhBuf = buffers.get(rtState.currentBvh); if (!entryOrderBuf || !bvhBuf) { - console.error("[crafter-wgpu] wgpuDispatchRT: missing entryOrder/bins (no TLAS built yet?)"); + console.error("[crafter-wgpu] wgpuDispatchRT: missing entryOrder/bvh (no TLAS built yet?)"); return; } + const W = state.width, H = state.height; + const cap = W * H; + const depth = Math.max(1, maxDepth | 0); + const wf = ensureWavefrontBuffers(W, H); + + // ── Per-pass WfParams ring. queue.writeBuffer lands before submit, so + // we can't mutate the uniform between passes — instead we pre-write one + // slot per pass and bind it with a dynamic offset. Slot order: + // 0 GENERATE + // 1+3*d .. +2 PREP / TRACE / SHADE for bounce d + // 1+3*depth RESOLVE + const passCount = 2 + 3 * depth; + const ring = new Uint32Array(WF_PARAM_SLOTS * 64); // 256 B = 64 u32 per slot + const writeSlot = (slot, curIsA, bounce) => { + const o = slot * 64; + ring[o + 0] = W; ring[o + 1] = H; ring[o + 2] = cap; ring[o + 3] = curIsA; + ring[o + 4] = bounce; ring[o + 5] = depth; ring[o + 6] = WF_FIXED_TLAS_NPADDED; ring[o + 7] = 0; + }; + writeSlot(0, 1, 0); // GENERATE + for (let d = 0; d < depth; d++) { + const curIsA = (d % 2 === 0) ? 1 : 0; + writeSlot(1 + 3 * d + 0, curIsA, d); // PREP + writeSlot(1 + 3 * d + 1, curIsA, d); // TRACE + writeSlot(1 + 3 * d + 2, curIsA, d); // SHADE + } + writeSlot(1 + 3 * depth, 1, depth); // RESOLVE + queue.writeBuffer(wf.paramsRing, 0, ring, 0, passCount * 64); + + const outView = state.outIsPing ? state.pingView : state.pongView; + const paramsBg = device.createBindGroup({ + layout: pipe.paramsBgl, + entries: [{ binding: 0, resource: { buffer: wf.paramsRing, offset: 0, size: 256 } }], + }); const dataBg = device.createBindGroup({ layout: pipe.dataBgl, entries: [ @@ -2701,41 +3160,91 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } }, { binding: 8, resource: { buffer: entryOrderBuf } }, { binding: 9, resource: { buffer: bvhBuf } }, + { binding: 10, resource: { buffer: wf.raysA } }, + { binding: 11, resource: { buffer: wf.raysB } }, + { binding: 12, resource: { buffer: wf.hits } }, + { binding: 13, resource: { buffer: wf.accum } }, + { binding: 14, resource: { buffer: wf.counters } }, + { binding: 15, resource: { buffer: wf.payload } }, ], }); + const indirectBg = device.createBindGroup({ + layout: pipe.indirectBgl, + entries: [{ binding: 0, resource: { buffer: wf.indirect } }], + }); + const userBgs = wfUserBindGroups(pipe, handlesPtr, handlesCount); - state.pass.setPipeline(pipe.pipeline); - state.pass.setBindGroup(0, headerBg); - state.pass.setBindGroup(1, dataBg); + // Close the frame's shared compute pass; the wavefront runs as its own + // sequence of passes on the same encoder (implicit barrier between each + // makes PREP's atomic writes visible to TRACE, etc.), then we reopen it. + if (state.pass) { state.pass.end(); state.pass = null; } + const enc = state.encoder; + const tileX = gx, tileY = gy; + const slotOff = (slot) => slot * 256; - // User bindings: walk byGroup in the same sorted order the C++ side - // packed handles[], picking up indices linearly. - if (handlesCount > 0) { - const handles = new Uint32Array(memU8().buffer, handlesPtr, handlesCount); - let handleIdx = 0; - let bglIdx = 0; - for (let g = 2; g <= (pipe.sortedGroups[pipe.sortedGroups.length - 1] || 1); g++) { - if (pipe.byGroup.has(g)) { - const entries = pipe.byGroup.get(g).map(b => { - const h = handles[handleIdx++]; - let resource; - if (b.kind === 0) resource = { buffer: buffers.get(h) }; - else if (b.kind === 1) resource = textureViews.get(h); - else if (b.kind === 2) resource = samplers.get(h); - else if (b.kind === 3) resource = textureViews.get(h); - return { binding: b.binding, resource }; - }); - const bg = device.createBindGroup({ - layout: pipe.userBgls[bglIdx], - entries, - }); - state.pass.setBindGroup(g, bg); - } - bglIdx++; + // Zero the two emit counters before GENERATE. + enc.clearBuffer(wf.counters, 0, 64); + + const setUser = (pass) => { for (const u of userBgs) pass.setBindGroup(u.group, u.bindGroup); }; + + // GENERATE + { + const p = enc.beginComputePass({ label: "wf-generate" }); + p.setPipeline(pipe.genPipe); + p.setBindGroup(0, paramsBg, [slotOff(0)]); + p.setBindGroup(1, dataBg); + setUser(p); + p.dispatchWorkgroups(tileX, tileY, 1); + p.end(); + } + for (let d = 0; d < depth; d++) { + const prepSlot = 1 + 3 * d + 0; + const traceSlot = 1 + 3 * d + 1; + const shadeSlot = 1 + 3 * d + 2; + // PREP — publish indirect args, zero next counter. + { + const p = enc.beginComputePass({ label: "wf-prep" }); + p.setPipeline(pipe.prepPipe); + p.setBindGroup(0, paramsBg, [slotOff(prepSlot)]); + p.setBindGroup(1, dataBg); + p.setBindGroup(2, indirectBg); + p.dispatchWorkgroups(1, 1, 1); + p.end(); + } + // TRACE — indirect over the live ray list. + { + const p = enc.beginComputePass({ label: "wf-trace" }); + p.setPipeline(pipe.tracePipe); + p.setBindGroup(0, paramsBg, [slotOff(traceSlot)]); + p.setBindGroup(1, dataBg); + p.dispatchWorkgroupsIndirect(wf.indirect, 0); + p.end(); + } + // SHADE — indirect; runs user closesthit/miss, may emit + accumulate. + { + const p = enc.beginComputePass({ label: "wf-shade" }); + p.setPipeline(pipe.shadePipe); + p.setBindGroup(0, paramsBg, [slotOff(shadeSlot)]); + p.setBindGroup(1, dataBg); + setUser(p); + p.dispatchWorkgroupsIndirect(wf.indirect, 0); + p.end(); } } + // RESOLVE — tonemap accum → output image. + { + const p = enc.beginComputePass({ label: "wf-resolve" }); + p.setPipeline(pipe.resolvePipe); + p.setBindGroup(0, paramsBg, [slotOff(1 + 3 * depth)]); + p.setBindGroup(1, dataBg); + setUser(p); + p.dispatchWorkgroups(tileX, tileY, 1); + p.end(); + } - state.pass.dispatchWorkgroups(gx, gy, 1); + // Reopen the frame's shared pass so wgpuFrameEnd / later UI work as + // before, and flip ping-pong so the blit picks the texture RESOLVE wrote. + state.pass = enc.beginComputePass(); state.outIsPing = !state.outIsPing; }; diff --git a/examples/VulkanTriangle/closesthit.wgsl b/examples/VulkanTriangle/closesthit.wgsl index e0d29ae..91758ba 100644 --- a/examples/VulkanTriangle/closesthit.wgsl +++ b/examples/VulkanTriangle/closesthit.wgsl @@ -1,6 +1,9 @@ -// WebGPU port of closesthit.glsl. Library concatenates this BEFORE the -// library helpers, so `Payload` declared here is visible to traceRay, -// runClosestHit, the mega-switch, and the user's raygen source. +// Payload declared here so the WGSL assembler sees it before the wfPayload +// binding, the SHADE dispatch, and the raygen source. +// +// Wavefront model: closesthit_main runs in SHADE and accumulates the +// pixel's color directly (rtAccumulate) instead of writing a payload that +// raygen reads back. struct Payload { color: vec3, @@ -8,5 +11,5 @@ struct Payload { fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr) { let bary = vec3(1.0 - hit.attribs.x - hit.attribs.y, hit.attribs.x, hit.attribs.y); - (*payload).color = bary; + rtAccumulate(bary); } diff --git a/examples/VulkanTriangle/miss.wgsl b/examples/VulkanTriangle/miss.wgsl index a0ba944..39ee6b6 100644 --- a/examples/VulkanTriangle/miss.wgsl +++ b/examples/VulkanTriangle/miss.wgsl @@ -1,5 +1,6 @@ -// WebGPU port of miss.glsl. +// Wavefront miss: runs in SHADE for rays that hit nothing. Accumulate the +// white background directly. fn miss_main(ray: RayDesc, payload: ptr) { - (*payload).color = vec3(1.0, 1.0, 1.0); + rtAccumulate(vec3(1.0, 1.0, 1.0)); } diff --git a/examples/VulkanTriangle/raygen.wgsl b/examples/VulkanTriangle/raygen.wgsl index cdf66fd..3014e8a 100644 --- a/examples/VulkanTriangle/raygen.wgsl +++ b/examples/VulkanTriangle/raygen.wgsl @@ -1,11 +1,12 @@ -// WebGPU port of raygen.glsl. Mirrors the pinhole camera setup — the -// Payload type is declared in closesthit.wgsl (concatenated earlier). +// WebGPU wavefront raygen. Runs in GENERATE: compute the pinhole camera +// ray and emit it as the pixel's primary ray. Shading happens later in +// SHADE (closesthit/miss). The Payload type is declared in closesthit.wgsl. fn raygen_main(gid: vec3) { - if (gid.x >= hdr.surfaceW || gid.y >= hdr.surfaceH) { return; } + if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; } let pixel = vec2(f32(gid.x), f32(gid.y)); - let resolution = vec2(f32(hdr.surfaceW), f32(hdr.surfaceH)); + let resolution = vec2(f32(wfParams.surfaceW), f32(wfParams.surfaceH)); let uv = (pixel + vec2(0.5)) / resolution; let ndc = uv * 2.0 - vec2(1.0); @@ -23,17 +24,11 @@ fn raygen_main(gid: vec3) { var payload: Payload; payload.color = vec3(0.0); - traceRay( - 0u, // tlasIdx (unused) - 0u, // ray flags - 0xFFu, // cull mask - 0u, 0u, 0u, // sbtRecordOffset, sbtRecordStride, missIndex + rtEmitPrimaryRay( origin, 0.001, direction, 10000.0, - &payload, - ); - - textureStore(outImage, - vec2(i32(gid.x), i32(gid.y)), - vec4(payload.color, 1.0)); + 0u, // ray flags + 0xFFu, // cull mask + 0u, 0u, // sbtRecordOffset, missIndex + payload); } diff --git a/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp b/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp index c4265da..8dd949d 100644 --- a/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp +++ b/implementations/Crafter.Graphics-PipelineRTWebGPU.cpp @@ -78,13 +78,22 @@ void PipelineRTWebGPU::Init(WebGPUCommandEncoderRef /*cmd*/, // shaders by stage. Concatenating *all* non-raygen sources here lets // them declare shared helpers, `struct Payload`, etc., in any order. - wgsl += "// ── user closesthit / anyhit / miss sources ───────────────\n"; + wgsl += "// ── user closesthit / anyhit / miss / resolve sources ─────\n"; for (const auto& shader : sbt.shaders) { if (shader.stage == WebGPURTStage::Raygen) continue; wgsl += shader.source; wgsl += "\n"; } + // ── Payload-typed wavefront storage binding ──────────────────────── + // + // Emitted *after* the user sources so it can name the user's `Payload` + // type. Holds one Payload per in-flight ray slot across both ping/pong + // ray buffers (capacity = 2·W·H). SHADE loads ray.payloadSlot here; + // emit helpers (rtEmitPrimaryRay / rtEmitRay) store into it. + wgsl += "\n@group(1) @binding(15) var " + "wfPayload : array;\n"; + // ── Section 2: mega-switch dispatchers ───────────────────────────── // // runClosestHit, runAnyHit, runMiss each dispatch on the per-hit / @@ -141,6 +150,24 @@ void PipelineRTWebGPU::Init(WebGPUCommandEncoderRef /*cmd*/, wgsl += " }\n"; wgsl += "}\n"; + // runResolve — RESOLVE-stage tonemap hook. The first registered + // Resolve shader wins; with none, identity passthrough (alpha forced + // to 1) so the wavefront output matches a megakernel that wrote raw + // colors. + std::string resolveEntryFn; + for (const auto& shader : sbt.shaders) { + if (shader.stage == WebGPURTStage::Resolve) { resolveEntryFn = shader.entryFn; break; } + } + wgsl += "\nfn runResolve(coord: vec2, hdr: vec4) -> vec4 {\n"; + if (!resolveEntryFn.empty()) { + wgsl += " return "; + wgsl += resolveEntryFn; + wgsl += "(coord, hdr);\n"; + } else { + wgsl += " return vec4(hdr.rgb, 1.0);\n"; + } + wgsl += "}\n"; + // Marker — JS-side prelude/post-amble searches for this token to know // where the library helpers (traverseBlas/traverseTlas/traceRay) get // injected, followed by raygen sources and the @compute entry point. @@ -173,17 +200,55 @@ void PipelineRTWebGPU::Init(WebGPUCommandEncoderRef /*cmd*/, return; } - // ── Section 4: @compute entry point ──────────────────────────────── + // ── Section 4: wavefront @compute entry points ───────────────────── // - // 8x8 tile workgroup matching the rest of the WebGPU backend. + // Five kernels share this one module; createComputePipeline selects + // each by entryPoint name. GENERATE/RESOLVE are 8x8 screen tiles; + // TRACE/SHADE are 64-wide 1-D over the compacted ray list (dispatched + // indirectly from PREP); PREP is a single thread. The library helper + // bodies (_rtwTraverseTlas, rtEmit*, rtAccumulate, _wfCurCount, …) are + // injected JS-side at the marker above. + // GENERATE — one thread per pixel; clears the pixel's accumulator and + // runs the user raygen, which calls rtEmitPrimaryRay. wgsl += "\n@compute @workgroup_size(8, 8, 1)\n"; - wgsl += "fn main(@builtin(global_invocation_id) gid: vec3) {\n"; + wgsl += "fn wfGenerate(@builtin(global_invocation_id) gid: vec3) {\n"; + wgsl += " if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }\n"; + wgsl += " let pixel = gid.y * wfParams.surfaceW + gid.x;\n"; + wgsl += " wfAccum[pixel] = vec4(0.0, 0.0, 0.0, 0.0);\n"; + wgsl += " _wfPixel = pixel;\n"; wgsl += " "; wgsl += raygenEntryFn; wgsl += "(gid);\n"; wgsl += "}\n"; + // PREP — single thread; reads the live ray count and publishes the + // indirect dispatch args for the upcoming TRACE/SHADE, then zeroes the + // next buffer's emit counter so SHADE starts compacting from 0. + wgsl += "\n@compute @workgroup_size(1)\n"; + wgsl += "fn wfPrep() { _wfPrep(); }\n"; + + // TRACE — zero user code: pure traversal + intersection. One thread + // per live ray; writes a HitResult into wfHits[i]. + wgsl += "\n@compute @workgroup_size(64)\n"; + wgsl += "fn wfTrace(@builtin(global_invocation_id) gid: vec3) { _wfTrace(gid.x); }\n"; + + // SHADE — one thread per live ray; loads the ray + its hit + payload, + // dispatches to runMiss / runClosestHit, which may rtAccumulate and + // rtEmitRay continuation/shadow rays into the next buffer. + wgsl += "\n@compute @workgroup_size(64)\n"; + wgsl += "fn wfShade(@builtin(global_invocation_id) gid: vec3) { _wfShade(gid.x); }\n"; + + // RESOLVE — one thread per pixel; runs the user resolve (or identity) + // over the linear accumulator and stores to the output image. + wgsl += "\n@compute @workgroup_size(8, 8, 1)\n"; + wgsl += "fn wfResolve(@builtin(global_invocation_id) gid: vec3) {\n"; + wgsl += " if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; }\n"; + wgsl += " let pixel = gid.y * wfParams.surfaceW + gid.x;\n"; + wgsl += " let outc = runResolve(gid.xy, wfAccum[pixel]);\n"; + wgsl += " textureStore(outImage, vec2(i32(gid.x), i32(gid.y)), outc);\n"; + wgsl += "}\n"; + pipelineHandle = WebGPU::wgpuLoadRTPipeline( wgsl.data(), static_cast(wgsl.size()), diff --git a/interfaces/Crafter.Graphics-RTPass.cppm b/interfaces/Crafter.Graphics-RTPass.cppm index 1397a68..0da7088 100644 --- a/interfaces/Crafter.Graphics-RTPass.cppm +++ b/interfaces/Crafter.Graphics-RTPass.cppm @@ -72,6 +72,11 @@ export namespace Crafter { // 0 means "no user bindings". const void* handlesPtr = nullptr; std::uint32_t handlesCount = 0; + // Wavefront bounce budget: number of (TRACE; SHADE) iterations. + // 1 = primary rays only; 2 = primary + one continuation/shadow + // bounce; etc. The library unrolls GENERATE; (PREP; TRACE; SHADE) + // ×maxDepth; RESOLVE. + std::uint32_t maxDepth = 1; RTPass(PipelineRTWebGPU* p) : pipeline(p) {} @@ -88,7 +93,8 @@ export namespace Crafter { static_cast(gx), static_cast(gy), handlesPtr, - static_cast(handlesCount)); + static_cast(handlesCount), + static_cast(maxDepth)); } }; } diff --git a/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm b/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm index c175a52..73c2285 100644 --- a/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm +++ b/interfaces/Crafter.Graphics-ShaderBindingTableWebGPU.cppm @@ -18,6 +18,11 @@ export namespace Crafter { Miss = 1, ClosestHit = 2, AnyHit = 3, + // Wavefront RESOLVE-stage tonemap/output hook. Optional: if no + // Resolve shader is registered, RESOLVE writes the linear accum + // buffer through unchanged. Signature: + // fn (coord: vec2, hdr: vec4) -> vec4 + Resolve = 4, }; // One WGSL shader source + the function name PipelineRTWebGPU should diff --git a/interfaces/Crafter.Graphics-WebGPU.cppm b/interfaces/Crafter.Graphics-WebGPU.cppm index dc695fe..9089bbd 100644 --- a/interfaces/Crafter.Graphics-WebGPU.cppm +++ b/interfaces/Crafter.Graphics-WebGPU.cppm @@ -201,7 +201,8 @@ namespace Crafter::WebGPU { std::uint32_t tlasBufHandle, std::int32_t instanceCount, std::int32_t gx, std::int32_t gy, - const void* handlesPtr, std::int32_t handlesCount); + const void* handlesPtr, std::int32_t handlesCount, + std::int32_t maxDepth); // GPU TLAS-build dispatch. Two sequential compute passes: // 1. tlasBuildMain — per-instance world AABB + identity permutation From f4d6493d91b4488894cc3ba3a7a9c728965e7a5d Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 16:28:38 +0000 Subject: [PATCH 2/8] wip: uncommitted changes from claude run on issue #3 --- examples/RTStress/closesthit.wgsl | 54 ++++++++ examples/RTStress/main.cpp | 200 ++++++++++++++++++++++++++++++ examples/RTStress/miss.wgsl | 11 ++ examples/RTStress/project.cpp | 46 +++++++ examples/RTStress/raygen.wgsl | 35 ++++++ examples/RTStress/resolve.wgsl | 7 ++ 6 files changed, 353 insertions(+) create mode 100644 examples/RTStress/closesthit.wgsl create mode 100644 examples/RTStress/main.cpp create mode 100644 examples/RTStress/miss.wgsl create mode 100644 examples/RTStress/project.cpp create mode 100644 examples/RTStress/raygen.wgsl create mode 100644 examples/RTStress/resolve.wgsl diff --git a/examples/RTStress/closesthit.wgsl b/examples/RTStress/closesthit.wgsl new file mode 100644 index 0000000..c80f5bb --- /dev/null +++ b/examples/RTStress/closesthit.wgsl @@ -0,0 +1,54 @@ +// RTStress closest-hit (runs in SHADE). Computes flat-shaded Lambert from +// the hit triangle's geometric normal, accumulates ambient, and — if the +// surface faces the sun — emits a shadow ray toward the sun. The shadow +// ray's miss (sun visible) adds the direct term; its hit (occluded) adds +// nothing because RT_FLAG_SKIP_CLOSEST_HIT suppresses closesthit on hit. +// +// Payload declared here so the assembler sees it before wfPayload / SHADE. +struct Payload { + color: vec3, // shadow ray: pending direct contribution + shadowRay: u32, // 0 primary, 1 shadow +}; + +const SUN_DIR_TO_LIGHT: vec3 = vec3(0.40, 0.85, 0.35); +const SUN_COLOR: vec3 = vec3(1.15, 1.05, 0.90); +const AMBIENT_COLOR: vec3 = vec3(0.12, 0.13, 0.18); + +// Cheap per-instance albedo so the grid reads as distinct cubes (and any +// TLAS flicker as instance count scales is obvious). +fn instanceAlbedo(i: u32) -> vec3 { + let h = i * 2654435761u; + return vec3( + 0.35 + 0.6 * f32((h >> 0u) & 255u) / 255.0, + 0.35 + 0.6 * f32((h >> 8u) & 255u) / 255.0, + 0.35 + 0.6 * f32((h >> 16u) & 255u) / 255.0); +} + +fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr) { + let meshRec = meshRecords[tlasEntries[hit.instanceId].blasMeshIdx]; + let verts = _rtFetchTri(meshRec, hit.primitiveId); + let nObj = normalize(cross(verts[1] - verts[0], verts[2] - verts[0])); + let nWorld = normalize(vec3( + dot(hit.objectToWorldR0.xyz, nObj), + dot(hit.objectToWorldR1.xyz, nObj), + dot(hit.objectToWorldR2.xyz, nObj))); + + let albedo = instanceAlbedo(hit.customIndex); + let worldPos = ray.origin + ray.direction * hit.t; + let viewDir = -ray.direction; + let nFacing = select(-nWorld, nWorld, dot(nWorld, viewDir) > 0.0); + let sunDir = normalize(SUN_DIR_TO_LIGHT); + let nDotL = max(0.0, dot(nFacing, sunDir)); + + rtAccumulate(albedo * AMBIENT_COLOR); + + if (nDotL > 0.0) { + var sp: Payload; + sp.color = albedo * SUN_COLOR * nDotL; + sp.shadowRay = 1u; + let shadowOrigin = worldPos + nFacing * 0.05; + rtEmitRay(shadowOrigin, 0.01, sunDir, 100000.0, + RT_FLAG_SKIP_CLOSEST_HIT | RT_FLAG_TERMINATE_ON_FIRST_HIT, + 0xFFu, 0u, 0u, sp); + } +} diff --git a/examples/RTStress/main.cpp b/examples/RTStress/main.cpp new file mode 100644 index 0000000..d9581a2 --- /dev/null +++ b/examples/RTStress/main.cpp @@ -0,0 +1,200 @@ +// RTStress — the standing many-instance wavefront RT benchmark. An +// N×N×N grid of a small cube mesh (one BLAS, many TLAS instances), shaded +// with primary + shadow rays through the wavefront pipeline. The grid edge +// `kGrid` is the instance-count knob: 8 → 512, 16 → 4096, 20 → 8000 +// (LBVH_MAX = 16384). Frame time is printed to the console each second so +// fps-vs-instance-count can be read off without external tooling; the JS +// bridge additionally prints a GPU timestamp-query per-pass breakdown. +// +// WebGPU/DOM only — the wavefront tracer is the WebGPU software RT path. + +#ifndef CRAFTER_GRAPHICS_WINDOW_DOM +int main() { return 0; } // native path is hardware RT; out of scope here +#else + +import Crafter.Graphics; +import Crafter.Math; +import Crafter.Event; +import std; + +using namespace Crafter; +namespace fs = std::filesystem; + +namespace { + // Instance-count knob. instances = kGrid³. Bump to 16 (4096) or 20 + // (8000) to stress the TLAS; the LBVH build caps at 16384. + constexpr int kGrid = 8; + constexpr float kSpacing = 2.5f; + constexpr float kHalf = 0.5f; // cube half-extent + + struct CameraGPU { + float origin[3]; float pad0; + float right[3]; float tanHalf; + float up[3]; float aspect; + float forward[3]; float pad1; + }; + static_assert(sizeof(CameraGPU) == 64); +} + +int main() { + const int instanceCount = kGrid * kGrid * kGrid; + std::println("[RTStress] grid {}^3 = {} instances", kGrid, instanceCount); + + Device::Initialize(); + static Window window(1280, 720, "RTStress"); + auto cmd = window.StartInit(); + + DescriptorHeapWebGPU heap; + heap.Initialize(/*images*/ 1, /*buffers*/ 2, /*samplers*/ 1); + + std::array shaders {{ + WebGPUShader(fs::path("raygen.wgsl"), "raygen_main", WebGPURTStage::Raygen), + WebGPUShader(fs::path("miss.wgsl"), "miss_main", WebGPURTStage::Miss), + WebGPUShader(fs::path("closesthit.wgsl"), "closesthit_main", WebGPURTStage::ClosestHit), + WebGPUShader(fs::path("resolve.wgsl"), "resolve_main", WebGPURTStage::Resolve), + }}; + ShaderBindingTableWebGPU sbt; + sbt.Init(shaders); + + std::array raygenGroups {{ { .type = RTShaderGroupType::General, .generalShader = 0 } }}; + std::array missGroups {{ { .type = RTShaderGroupType::General, .generalShader = 1 } }}; + std::array hitGroups {{ { .type = RTShaderGroupType::TrianglesHitGroup, .closestHitShader = 2 } }}; + + // One user binding: the camera storage buffer at @group(3). + std::array bindings {{ + { .group = 3, .binding = 0, .kind = UICustomBindingKind::Buffer, ._pad = 0, .pushOffset = 0 }, + }}; + + PipelineRTWebGPU pipeline; + pipeline.Init(cmd, raygenGroups, missGroups, hitGroups, sbt, bindings); + + // ── Unit cube mesh (8 verts, 12 tris). ──────────────────────────── + static std::array, 8> verts {{ + {-kHalf, -kHalf, -kHalf}, { kHalf, -kHalf, -kHalf}, + { kHalf, kHalf, -kHalf}, {-kHalf, kHalf, -kHalf}, + {-kHalf, -kHalf, kHalf}, { kHalf, -kHalf, kHalf}, + { kHalf, kHalf, kHalf}, {-kHalf, kHalf, kHalf}, + }}; + static std::array indices {{ + 0,1,2, 0,2,3, 5,4,7, 5,7,6, 4,0,3, 4,3,7, + 1,5,6, 1,6,2, 4,5,1, 4,1,0, 3,2,6, 3,6,7, + }}; + static Mesh cube; + cube.Build(verts, indices, cmd); + + // ── Camera buffer + handle array. ───────────────────────────────── + WebGPUBuffer cameraBuf; + cameraBuf.Create(1); + static std::array userHandles { cameraBuf.handle }; + + // ── Instance grid. Reserve so RenderingElement3D::Add pointers stay + // valid across vector growth. ───────────────────────────────────── + static std::vector renderers; + renderers.reserve(static_cast(instanceCount)); + const float origin0 = -0.5f * static_cast(kGrid - 1) * kSpacing; + for (int x = 0; x < kGrid; ++x) + for (int y = 0; y < kGrid; ++y) + for (int z = 0; z < kGrid; ++z) { + renderers.emplace_back(); + RenderingElement3D& r = renderers.back(); + auto& tx = r.instance.transform.matrix; + tx[0][0] = 1; tx[0][1] = 0; tx[0][2] = 0; tx[0][3] = origin0 + float(x) * kSpacing; + tx[1][0] = 0; tx[1][1] = 1; tx[1][2] = 0; tx[1][3] = origin0 + float(y) * kSpacing; + tx[2][0] = 0; tx[2][1] = 0; tx[2][2] = 1; tx[2][3] = origin0 + float(z) * kSpacing; + r.instance.instanceCustomIndex = static_cast(renderers.size() - 1); + r.instance.mask = 0xFF; + r.instance.instanceShaderBindingTableRecordOffset = 0; + r.instance.flags = kRTGeometryInstanceForceOpaque; + r.instance.accelerationStructureReference = cube.blasAddr; + RenderingElement3D::Add(&r); + } + RenderingElement3D::BuildTLAS(cmd, 0); + + window.descriptorHeap = &heap; + window.FinishInit(); + + RTPass rtPass(&pipeline); + rtPass.handlesPtr = userHandles.data(); + rtPass.handlesCount = static_cast(userHandles.size()); + rtPass.maxDepth = 2; // primary + shadow + window.passes.push_back(&rtPass); + + // ── Free camera framing the grid from a corner. ─────────────────── + const float ext = float(kGrid - 1) * kSpacing; + struct CamState { + Vector position; + float yaw; + float pitch; + } cam { + Vector{ ext * 1.4f, ext * 1.0f, ext * 1.4f }, + 0.0f, 0.0f, + }; + { + // Aim at the grid centre (origin). + Vector d { -cam.position.x, -cam.position.y, -cam.position.z }; + const float len = std::sqrt(d.x*d.x + d.y*d.y + d.z*d.z); + cam.yaw = std::atan2(d.z, d.x); + cam.pitch = std::asin(d.y / len); + } + + Input::Map inputMap; + Input::Action& moveAct = inputMap.AddAction("Move", Input::ActionType::Vector2); + Input::Action& lookAct = inputMap.AddAction("Look", Input::ActionType::Vector2); + moveAct.bindings = { Input::WASDBind{ + Key(CrafterKeys::W), Key(CrafterKeys::S), Key(CrafterKeys::A), Key(CrafterKeys::D) } }; + lookAct.bindings = { Input::MouseDeltaBind{ 1.0f } }; + inputMap.Attach(window); + + const float kMoveSpeed = ext * 0.8f; + const float kLookSens = 0.05f; + const float kDt = 1.0f / 60.0f; + + static int frames = 0; + static double tAccum = 0.0; + EventListener camTick(&window.onBeforeUpdate, [&]() { + inputMap.Tick(); + cam.yaw += lookAct.vector2.x * kLookSens; + cam.pitch -= lookAct.vector2.y * kLookSens; + cam.pitch = std::clamp(cam.pitch, -1.55f, 1.55f); + + const float cp = std::cos(cam.pitch), sp = std::sin(cam.pitch); + const float cy = std::cos(cam.yaw), sy = std::sin(cam.yaw); + Vector forward { cp * cy, sp, cp * sy }; + Vector worldUp { 0.0f, 1.0f, 0.0f }; + Vector right { forward.y*worldUp.z - forward.z*worldUp.y, + forward.z*worldUp.x - forward.x*worldUp.z, + forward.x*worldUp.y - forward.y*worldUp.x }; + const float rLen = std::sqrt(right.x*right.x + right.y*right.y + right.z*right.z); + right.x /= rLen; right.y /= rLen; right.z /= rLen; + Vector up { right.y*forward.z - right.z*forward.y, + right.z*forward.x - right.x*forward.z, + right.x*forward.y - right.y*forward.x }; + + const float dx = moveAct.vector2.x * kMoveSpeed * kDt; + const float dy = moveAct.vector2.y * kMoveSpeed * kDt; + cam.position.x += right.x*dx + forward.x*dy; + cam.position.y += right.y*dx + forward.y*dy; + cam.position.z += right.z*dx + forward.z*dy; + + CameraGPU& g = cameraBuf.value[0]; + g.origin[0]=cam.position.x; g.origin[1]=cam.position.y; g.origin[2]=cam.position.z; g.pad0=0; + g.right[0]=right.x; g.right[1]=right.y; g.right[2]=right.z; + g.up[0]=up.x; g.up[1]=up.y; g.up[2]=up.z; + g.forward[0]=forward.x; g.forward[1]=forward.y; g.forward[2]=forward.z; + g.aspect = float(window.width) / float(window.height); + g.tanHalf = std::tan(70.0f * 3.14159265f / 360.0f); + g.pad1 = 0; + cameraBuf.FlushDevice(); + + if (++frames >= 60) { + std::println("[RTStress] {} instances @ ~{} frames since last report", instanceCount, frames); + frames = 0; + } + }); + + window.Render(); + window.StartUpdate(); + window.StartSync(); + return 0; +} +#endif diff --git a/examples/RTStress/miss.wgsl b/examples/RTStress/miss.wgsl new file mode 100644 index 0000000..d23a0bc --- /dev/null +++ b/examples/RTStress/miss.wgsl @@ -0,0 +1,11 @@ +// RTStress miss (runs in SHADE). Primary miss → sky gradient. Shadow miss +// → the sun is unoccluded, so add the pending direct contribution. +fn miss_main(ray: RayDesc, payload: ptr) { + if ((*payload).shadowRay == 1u) { + rtAccumulate((*payload).color); + return; + } + let t = clamp(ray.direction.y * 0.5 + 0.5, 0.0, 1.0); + rtAccumulate(mix(vec3(0.50, 0.62, 0.85), + vec3(0.90, 0.94, 1.00), t)); +} diff --git a/examples/RTStress/project.cpp b/examples/RTStress/project.cpp new file mode 100644 index 0000000..b1e4f03 --- /dev/null +++ b/examples/RTStress/project.cpp @@ -0,0 +1,46 @@ +import std; +import Crafter.Build; +namespace fs = std::filesystem; +using namespace Crafter; + +extern "C" Configuration CrafterBuildProject(std::span args) { + bool isWasm = false; + for (std::string_view a : args) { + if (a.starts_with("--target=") && a.find("wasm") != std::string_view::npos) { + isWasm = true; + break; + } + } + + std::vector graphicsArgs(args.begin(), args.end()); + Configuration* graphics = LocalProject({ + .projectFile = "../../project.cpp", + .args = graphicsArgs, + }); + + Configuration cfg; + cfg.path = "./"; + cfg.name = "RTStress"; + cfg.outputName = "RTStress"; + cfg.type = ConfigurationType::Executable; + if (isWasm) { + cfg.target = "wasm32-wasip1"; + cfg.defines.push_back({"CRAFTER_GRAPHICS_WINDOW_DOM", ""}); + cfg.compileFlags.push_back("-msimd128"); + } + ApplyStandardArgs(cfg, args); + cfg.dependencies = { graphics }; + + std::array ifaces = {}; + std::array impls = { "main" }; + cfg.GetInterfacesAndImplementations(ifaces, impls); + + if (isWasm) { + cfg.files.emplace_back(fs::path("raygen.wgsl")); + cfg.files.emplace_back(fs::path("closesthit.wgsl")); + cfg.files.emplace_back(fs::path("miss.wgsl")); + cfg.files.emplace_back(fs::path("resolve.wgsl")); + EnableWasiBrowserRuntime(cfg); + } + return cfg; +} diff --git a/examples/RTStress/raygen.wgsl b/examples/RTStress/raygen.wgsl new file mode 100644 index 0000000..def54fc --- /dev/null +++ b/examples/RTStress/raygen.wgsl @@ -0,0 +1,35 @@ +// RTStress raygen (runs in GENERATE). Host-driven pinhole camera at +// @group(3) (groups 0..2 are reserved by the wavefront pipeline: +// 0 = WfParams, 1 = data heaps, 2 = indirect args). +struct Camera { + origin: vec3, + pad0: f32, + right: vec3, + tanHalf: f32, + up: vec3, + aspect: f32, + forward: vec3, + pad1: f32, +}; +@group(3) @binding(0) var camera : Camera; + +fn raygen_main(gid: vec3) { + if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; } + + let pixelf = vec2(f32(gid.x), f32(gid.y)); + let res = vec2(f32(wfParams.surfaceW), f32(wfParams.surfaceH)); + let uv = (pixelf + vec2(0.5)) / res; + let ndc = uv * 2.0 - vec2(1.0); + + let direction = normalize( + camera.right * (ndc.x * camera.aspect * camera.tanHalf) + + camera.up * (-ndc.y * camera.tanHalf) + + camera.forward); + + var p: Payload; + p.color = vec3(0.0); + p.shadowRay = 0u; + + rtEmitPrimaryRay(camera.origin, 0.01, direction, 100000.0, + 0u, 0xFFu, 0u, 0u, p); +} diff --git a/examples/RTStress/resolve.wgsl b/examples/RTStress/resolve.wgsl new file mode 100644 index 0000000..7950c96 --- /dev/null +++ b/examples/RTStress/resolve.wgsl @@ -0,0 +1,7 @@ +// RTStress RESOLVE-stage tonemap: Reinhard + gamma 2.2 over the linear +// accumulator. Registered as a WebGPURTStage::Resolve shader. +fn resolve_main(coord: vec2, hdr: vec4) -> vec4 { + let mapped = hdr.rgb / (hdr.rgb + vec3(1.0)); + let g = pow(mapped, vec3(1.0 / 2.2)); + return vec4(g, 1.0); +} From 1d2e12dbc9623c7b7d6da373185a4cdaa116f5a3 Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 20:08:39 +0000 Subject: [PATCH 3/8] WebGPU RT: GPU timestamp-query per-pass harness Request the timestamp-query feature; write begin/end timestamps around each wavefront pass via timestampWrites; resolve + read back (deferred to after submit) and print a per-pass us breakdown ~1x/sec. RTStress @ 512 instances, 1920x995: TRACE dominates, total ~1.8-3.0ms/frame. Co-Authored-By: Claude Opus 4.8 --- additional/dom-webgpu.js | 103 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 97 insertions(+), 6 deletions(-) diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js index 9380517..fcadaba 100644 --- a/additional/dom-webgpu.js +++ b/additional/dom-webgpu.js @@ -146,7 +146,10 @@ clamp("maxComputeWorkgroupSizeX", 1024); clamp("maxBufferSize", 1 << 30); clamp("maxStorageBufferBindingSize", 1 << 30); clamp("maxComputeWorkgroupsPerDimension", 65535); -const device = await adapter.requestDevice({ requiredLimits }); +// Per-pass GPU timing for the wavefront tracer (RTStress HUD / PR numbers). +const tsSupported = adapter.features && adapter.features.has("timestamp-query"); +const requiredFeatures = tsSupported ? ["timestamp-query"] : []; +const device = await adapter.requestDevice({ requiredLimits, requiredFeatures }); const queue = device.queue; const ctx = canvas.getContext("webgpu"); const canvasFormat = "rgba8unorm"; // match storage textures, skip swizzle blit @@ -935,6 +938,23 @@ env.wgpuFrameEnd = () => { queue.submit([state.encoder.finish()]); state.encoder = null; + // Map the wavefront timestamp readback (its resolve/copy was encoded on + // the just-submitted encoder) and log a per-pass breakdown ~1×/sec. + if (state.tsReadPending) { + const ts = state.tsReadPending; + state.tsReadPending = null; + const n = ts.pendingLabels.length; + ts.readBuf.mapAsync(GPUMapMode.READ, 0, 2 * n * 8).then(() => { + const data = new BigInt64Array(ts.readBuf.getMappedRange(0, 2 * n * 8).slice(0)); + ts.readBuf.unmap(); + ts.inFlight = false; + wfLogTimestamps(ts, data); + }).catch((e) => { + ts.inFlight = false; + console.error("[crafter-wgpu] timestamp readback failed:", e); + }); + } + // Kick off mapAsync for the readbacks whose copyBufferToBuffer we // piggy-backed onto the just-submitted encoder. Doing this after // submit ensures the map waits for that submission's GPU work to @@ -2955,6 +2975,51 @@ function ensureWavefrontBuffers(W, H) { return wf; } +// ── GPU timestamp-query harness ────────────────────────────────────────── +// +// One QuerySet with 2 slots per wavefront pass; each beginComputePass writes +// begin/end timestamps. After the passes we resolve into a buffer and read +// it back (deferred to after submit, like the readback path). Deltas are +// summed per pass label and printed ~1×/sec as a per-pass breakdown. +const WF_TS_MAX_PASSES = 64; // covers maxDepth up to ~20 +function wfEnsureTimestamps() { + if (!tsSupported) return null; + if (rtState.ts) return rtState.ts; + const cap = 2 * WF_TS_MAX_PASSES; + rtState.ts = { + capacity: cap, + querySet: device.createQuerySet({ type: "timestamp", count: cap }), + resolveBuf: device.createBuffer({ size: cap * 8, + usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC }), + readBuf: device.createBuffer({ size: cap * 8, + usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST }), + inFlight: false, + lastLog: 0, + pendingLabels: null, + }; + return rtState.ts; +} +function wfLogTimestamps(ts, data) { + // data: BigInt64Array of ns timestamps, [begin0,end0,begin1,end1,...]. + const now = Date.now(); + if (now - ts.lastLog < 1000) return; // throttle to ~1/sec + ts.lastLog = now; + const labels = ts.pendingLabels; + if (!labels) return; + const sums = new Map(); // label → ns + let totalNs = 0; + for (let i = 0; i < labels.length; i++) { + const dt = Number(data[2*i + 1] - data[2*i + 0]); + if (dt < 0) continue; + sums.set(labels[i], (sums.get(labels[i]) || 0) + dt); + totalNs += dt; + } + const order = ["GENERATE", "PREP", "TRACE", "SHADE", "RESOLVE"]; + const parts = order.filter(k => sums.has(k)) + .map(k => `${k} ${(sums.get(k)/1000).toFixed(1)}us`); + console.log(`[crafter-wgpu] RT passes: ${parts.join(" | ")} | total ${(totalNs/1000).toFixed(1)}us`); +} + env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { if (!rtState.vertHeap) rtInit(); const userPart = new TextDecoder().decode(memU8().subarray(wgslPtr, wgslPtr + wgslLen)); @@ -3187,9 +3252,27 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, const setUser = (pass) => { for (const u of userBgs) pass.setBindGroup(u.group, u.bindGroup); }; + // GPU timing: write begin/end timestamps around each pass (2 query + // slots per pass), then resolve + read back after submit. + const ts = wfEnsureTimestamps(); + const capture = !!(ts && !ts.inFlight); + const tsLabels = []; + const beginPass = (label, tsName) => { + const desc = { label }; + if (capture && tsLabels.length < WF_TS_MAX_PASSES) { + desc.timestampWrites = { + querySet: ts.querySet, + beginningOfPassWriteIndex: 2 * tsLabels.length, + endOfPassWriteIndex: 2 * tsLabels.length + 1, + }; + tsLabels.push(tsName); + } + return enc.beginComputePass(desc); + }; + // GENERATE { - const p = enc.beginComputePass({ label: "wf-generate" }); + const p = beginPass("wf-generate", "GENERATE"); p.setPipeline(pipe.genPipe); p.setBindGroup(0, paramsBg, [slotOff(0)]); p.setBindGroup(1, dataBg); @@ -3203,7 +3286,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, const shadeSlot = 1 + 3 * d + 2; // PREP — publish indirect args, zero next counter. { - const p = enc.beginComputePass({ label: "wf-prep" }); + const p = beginPass("wf-prep", "PREP"); p.setPipeline(pipe.prepPipe); p.setBindGroup(0, paramsBg, [slotOff(prepSlot)]); p.setBindGroup(1, dataBg); @@ -3213,7 +3296,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, } // TRACE — indirect over the live ray list. { - const p = enc.beginComputePass({ label: "wf-trace" }); + const p = beginPass("wf-trace", "TRACE"); p.setPipeline(pipe.tracePipe); p.setBindGroup(0, paramsBg, [slotOff(traceSlot)]); p.setBindGroup(1, dataBg); @@ -3222,7 +3305,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, } // SHADE — indirect; runs user closesthit/miss, may emit + accumulate. { - const p = enc.beginComputePass({ label: "wf-shade" }); + const p = beginPass("wf-shade", "SHADE"); p.setPipeline(pipe.shadePipe); p.setBindGroup(0, paramsBg, [slotOff(shadeSlot)]); p.setBindGroup(1, dataBg); @@ -3233,7 +3316,7 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, } // RESOLVE — tonemap accum → output image. { - const p = enc.beginComputePass({ label: "wf-resolve" }); + const p = beginPass("wf-resolve", "RESOLVE"); p.setPipeline(pipe.resolvePipe); p.setBindGroup(0, paramsBg, [slotOff(1 + 3 * depth)]); p.setBindGroup(1, dataBg); @@ -3242,6 +3325,14 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, p.end(); } + if (capture && tsLabels.length > 0) { + enc.resolveQuerySet(ts.querySet, 0, 2 * tsLabels.length, ts.resolveBuf, 0); + enc.copyBufferToBuffer(ts.resolveBuf, 0, ts.readBuf, 0, 2 * tsLabels.length * 8); + ts.inFlight = true; + ts.pendingLabels = tsLabels; + state.tsReadPending = ts; + } + // Reopen the frame's shared pass so wgpuFrameEnd / later UI work as // before, and flip ping-pong so the blit picks the texture RESOLVE wrote. state.pass = enc.beginComputePass(); From 376e66aeedfffbcb5e81425cd5fd3b5a759695e7 Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 20:16:04 +0000 Subject: [PATCH 4/8] WebGPU RT: port Sponza to wavefront (shadow ray in SHADE) Restructure Sponza for the wavefront model: raygen emits the primary ray; closesthit (in SHADE) gathers albedo/normal, accumulates ambient, and emits a shadow ray carrying the pending direct term; miss adds the sky (primary) or the direct term (shadow miss). resolve.wgsl applies the same Reinhard+gamma the megakernel raygen did inline. User bindings moved to group 3 (groups 0..2 reserved). RTPass maxDepth=2. Renders the atrium correctly through the wavefront pipeline (textures, two-sided shading, sun+ambient, shadows, tonemap). Co-Authored-By: Claude Opus 4.8 --- examples/Sponza/closesthit.wgsl | 73 +++++++++++++------------- examples/Sponza/main.cpp | 20 +++++--- examples/Sponza/miss.wgsl | 12 ++--- examples/Sponza/project.cpp | 1 + examples/Sponza/raygen.wgsl | 91 ++++----------------------------- examples/Sponza/resolve.wgsl | 7 +++ 6 files changed, 70 insertions(+), 134 deletions(-) create mode 100644 examples/Sponza/resolve.wgsl diff --git a/examples/Sponza/closesthit.wgsl b/examples/Sponza/closesthit.wgsl index 9fa15a8..293f8ba 100644 --- a/examples/Sponza/closesthit.wgsl +++ b/examples/Sponza/closesthit.wgsl @@ -1,35 +1,25 @@ -// Payload declared here so the WGSL assembler sees it before raygen -// (the assembler concatenates closesthit/anyhit/miss BEFORE raygen). +// Sponza closest-hit (runs in SHADE). In the wavefront model the lighting +// + shadow trace that used to live in raygen happens here: gather surface +// data, accumulate ambient, and emit a shadow ray toward the sun carrying +// the pending direct contribution. The shadow ray's miss adds that +// contribution (sun visible); its hit adds nothing (occluded), since +// RT_FLAG_SKIP_CLOSEST_HIT suppresses closesthit on the shadow ray. // -// WGSL forbids cycles in the function call graph, so closesthit_main -// CAN'T call traceRay (that would create closesthit → traceRay → -// runClosestHit → closesthit). The lighting + shadow trace therefore -// happens in raygen; closesthit's job is just to gather surface data -// into the payload. -// -// shadowRay = 0 (primary): closesthit fills albedo/worldPos/normal/hit. -// shadowRay = 1 (shadow): closesthit is skipped (RT_FLAG_SKIP_CLOSEST_HIT), -// miss flips color to white = "lit". +// Payload declared here so the assembler sees it before wfPayload / SHADE. struct Payload { - color: vec3, - shadowRay: u32, - worldPos: vec3, - hit: u32, - worldNormal: vec3, - _pad: f32, + color: vec3, // shadow ray: pending albedo·sun·nDotL + shadowRay: u32, // 0 primary, 1 shadow }; -// User-bound resources at group(2). Matches the UICustomBinding span the -// host hands to PipelineRTWebGPU::Init. -// binding 0 — albedo texture_2d_array, one layer per Sponza material -// binding 1 — sampler (linear clamp) -// binding 2 — camera storage buffer (read by raygen only) -@group(2) @binding(0) var albedos : texture_2d_array; -@group(2) @binding(1) var samp : sampler; +// User resources at @group(3) (0..2 are the wavefront pipeline's reserved +// groups). binding 0 albedo array, 1 sampler, 2 camera (raygen only). +@group(3) @binding(0) var albedos : texture_2d_array; +@group(3) @binding(1) var samp : sampler; + +const SUN_DIR_TO_LIGHT: vec3 = vec3(-0.35, 1.00, -0.20); +const SUN_COLOR: vec3 = vec3( 1.10, 1.00, 0.85); +const AMBIENT_COLOR: vec3 = vec3( 0.18, 0.20, 0.28); -// VertexNormalTangentUVPacked is `packed` on the outer struct but each -// inner `Vector` is SIMD-aligned to a 16-byte stride. So -// each vertex is 12 u32 words: normal at 0..2, tangent at 4..6, uv at 8..9. const ATTRIB_STRIDE_U32: u32 = 12u; const ATTRIB_NORMAL_OFFSET: u32 = 0u; const ATTRIB_UV_OFFSET: u32 = 8u; @@ -52,7 +42,6 @@ fn fetchNormal(meshRec: MeshRecord, vertexIdx: u32) -> vec3 { } fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr) { - // Resolve hit triangle → 3 vertex indices. let meshIdx = tlasEntries[hit.instanceId].blasMeshIdx; let meshRec = meshRecords[meshIdx]; let baseIdx = meshRec.indexOffset + hit.primitiveId * 3u; @@ -61,19 +50,14 @@ fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr) let i2 = indices[baseIdx + 2u]; let bary = vec3(1.0 - hit.attribs.x - hit.attribs.y, hit.attribs.x, hit.attribs.y); - // Albedo via barycentric UV interpolation. let uv0 = fetchUV(meshRec, i0); let uv1 = fetchUV(meshRec, i1); let uv2 = fetchUV(meshRec, i2); let uv = uv0 * bary.x + uv1 * bary.y + uv2 * bary.z; - // OBJ V is bottom-up; sampler is top-down. fract for manual tiling. let uvTiled = vec2(fract(uv.x), fract(1.0 - uv.y)); let layer = i32(hit.customIndex); let albedo = textureSampleLevel(albedos, samp, uvTiled, layer, 0.0).rgb; - // World-space smooth shading normal. Multiply through the - // object-to-world rotation so this stays correct if a future scene - // rotates instances (Sponza itself is all identities). let n0 = fetchNormal(meshRec, i0); let n1 = fetchNormal(meshRec, i1); let n2 = fetchNormal(meshRec, i2); @@ -83,8 +67,23 @@ fn closesthit_main(ray: RayDesc, hit: HitInfo, payload: ptr) dot(hit.objectToWorldR1.xyz, nObj), dot(hit.objectToWorldR2.xyz, nObj))); - (*payload).color = albedo; - (*payload).worldPos = ray.origin + ray.direction * hit.t; - (*payload).worldNormal = nWorld; - (*payload).hit = 1u; + // Two-sided: flip the normal toward the camera (Sponza curtains have + // inconsistent winding). + let nFacing = select(-nWorld, nWorld, dot(nWorld, ray.direction) < 0.0); + let lightDir = normalize(SUN_DIR_TO_LIGHT); + let nDotL = max(0.0, dot(nFacing, lightDir)); + let worldPos = ray.origin + ray.direction * hit.t; + + // Ambient is unconditional; direct light is gated behind the shadow ray. + rtAccumulate(albedo * AMBIENT_COLOR); + + if (nDotL > 0.0) { + let shadowOrigin = worldPos + nFacing * 0.5; + var sp: Payload; + sp.color = albedo * SUN_COLOR * nDotL; + sp.shadowRay = 1u; + rtEmitRay(shadowOrigin, 0.001, lightDir, 10000.0, + RT_FLAG_SKIP_CLOSEST_HIT | RT_FLAG_TERMINATE_ON_FIRST_HIT, + 0xFFu, 0u, 0u, sp); + } } diff --git a/examples/Sponza/main.cpp b/examples/Sponza/main.cpp index 73db99e..1df3a2c 100644 --- a/examples/Sponza/main.cpp +++ b/examples/Sponza/main.cpp @@ -253,10 +253,11 @@ int main() { DescriptorHeapWebGPU heap; heap.Initialize(/*images*/ 2, /*buffers*/ 2, /*samplers*/ 2); - std::array shaders {{ + std::array shaders {{ WebGPUShader(fs::path("raygen.wgsl"), "raygen_main", WebGPURTStage::Raygen), WebGPUShader(fs::path("miss.wgsl"), "miss_main", WebGPURTStage::Miss), WebGPUShader(fs::path("closesthit.wgsl"), "closesthit_main", WebGPURTStage::ClosestHit), + WebGPUShader(fs::path("resolve.wgsl"), "resolve_main", WebGPURTStage::Resolve), }}; ShaderBindingTableWebGPU sbt; sbt.Init(shaders); @@ -271,14 +272,15 @@ int main() { { .type = RTShaderGroupType::TrianglesHitGroup, .closestHitShader = 2 }, }}; - // Three user bindings at @group(2): + // Three user bindings at @group(3) (the wavefront pipeline reserves + // groups 0..2 for WfParams / data heaps / indirect args): // binding 0 — albedo texture_2d_array (one layer per material) // binding 1 — sampler (linear clamp) // binding 2 — Camera storage buffer (host-driven, updated per frame) std::array bindings {{ - { .group = 2, .binding = 0, .kind = UICustomBindingKind::SampledTextureArray, ._pad = 0, .pushOffset = 0 }, - { .group = 2, .binding = 1, .kind = UICustomBindingKind::Sampler, ._pad = 0, .pushOffset = 0 }, - { .group = 2, .binding = 2, .kind = UICustomBindingKind::Buffer, ._pad = 0, .pushOffset = 0 }, + { .group = 3, .binding = 0, .kind = UICustomBindingKind::SampledTextureArray, ._pad = 0, .pushOffset = 0 }, + { .group = 3, .binding = 1, .kind = UICustomBindingKind::Sampler, ._pad = 0, .pushOffset = 0 }, + { .group = 3, .binding = 2, .kind = UICustomBindingKind::Buffer, ._pad = 0, .pushOffset = 0 }, }}; PipelineRTWebGPU pipeline; @@ -367,6 +369,7 @@ int main() { RTPass rtPass(&pipeline); rtPass.handlesPtr = userHandles.data(); rtPass.handlesCount = static_cast(userHandles.size()); + rtPass.maxDepth = 2; // primary + shadow window.passes.push_back(&rtPass); // ── Free camera: WASD + mouse-delta look ─────────────────────────── @@ -375,9 +378,10 @@ int main() { // height, looking +X down the long axis (bbox: X[-1921..1800], // Y[-126..1429], Z[-1183..1105]). The user can fine-tune from there. struct CamState { - Vector position{ -1500.0f, 200.0f, 0.0f }; - float yaw = 0.0f; // radians, around world +Y - float pitch = 0.0f; // radians, +pitch looks up + // 3/4 view from a corner aimed at the atrium centre. + Vector position{ -1400.0f, 700.0f, -600.0f }; + float yaw = 0.405f; // radians, around world +Y + float pitch = -0.317f; // radians, +pitch looks up } cam; Input::Map inputMap; diff --git a/examples/Sponza/miss.wgsl b/examples/Sponza/miss.wgsl index 39ff71d..0bbbf4d 100644 --- a/examples/Sponza/miss.wgsl +++ b/examples/Sponza/miss.wgsl @@ -1,16 +1,12 @@ +// Sponza miss (runs in SHADE). Primary miss → two-stop sky gradient. +// Shadow miss → the sun is unoccluded, so add the pending direct term. fn miss_main(ray: RayDesc, payload: ptr) { if ((*payload).shadowRay == 1u) { - // Shadow ray escaped to infinity — the sun is visible from the - // origin, so the surface there should pick up full direct light. - // raygen reads color.x as the visibility coefficient. - (*payload).color = vec3(1.0); + rtAccumulate((*payload).color); return; } - - // Primary miss: cheap two-stop sky gradient. (*payload).hit stays 0 - // so raygen knows to skip the lighting path and just use this color. let t = clamp(ray.direction.y * 0.5 + 0.5, 0.0, 1.0); let sky = vec3(0.45, 0.65, 0.95); let zenith = vec3(0.95, 0.85, 0.65); - (*payload).color = mix(sky, zenith, t); + rtAccumulate(mix(sky, zenith, t)); } diff --git a/examples/Sponza/project.cpp b/examples/Sponza/project.cpp index b850c90..400b0c2 100644 --- a/examples/Sponza/project.cpp +++ b/examples/Sponza/project.cpp @@ -82,6 +82,7 @@ extern "C" Configuration CrafterBuildProject(std::span a cfg.files.emplace_back(fs::path("raygen.wgsl")); cfg.files.emplace_back(fs::path("closesthit.wgsl")); cfg.files.emplace_back(fs::path("miss.wgsl")); + cfg.files.emplace_back(fs::path("resolve.wgsl")); EnableWasiBrowserRuntime(cfg); } else { cfg.shaders.emplace_back(fs::path("raygen.glsl"), std::string("main"), ShaderType::RayGen); diff --git a/examples/Sponza/raygen.wgsl b/examples/Sponza/raygen.wgsl index a90ca84..6575b34 100644 --- a/examples/Sponza/raygen.wgsl +++ b/examples/Sponza/raygen.wgsl @@ -1,12 +1,8 @@ -// WebGPU raygen. Camera state comes from the host every frame via a -// storage buffer bound at @group(2) @binding(2); main.cpp drives that -// from WASD + mouse-delta through Crafter::Input. -// -// The shading + shadow trace all happens here because WGSL forbids -// recursive function call graphs — closesthit_main can't call traceRay -// (that would loop closesthit → traceRay → runClosestHit → closesthit). -// Raygen is the entry point and not called by anyone, so it can call -// traceRay twice (once primary, once shadow) without forming a cycle. +// Sponza raygen (runs in GENERATE). Emits the pixel's primary ray; all +// shading + the shadow trace now happen in SHADE (closesthit/miss). Camera +// state comes from the host each frame via a storage buffer at +// @group(3) @binding(2) (groups 0..2 are reserved by the wavefront +// pipeline). main.cpp drives it from WASD + mouse-delta. struct Camera { origin: vec3, @@ -18,92 +14,25 @@ struct Camera { forward: vec3, pad1: f32, }; -@group(2) @binding(2) var camera : Camera; - -// Sun coming through Sponza's open roof. Y is up; this points "down and -// slightly along +X" so the light grazes the colonnades on one side. -const SUN_DIR_TO_LIGHT: vec3 = vec3(-0.35, 1.00, -0.20); -const SUN_COLOR: vec3 = vec3( 1.10, 1.00, 0.85); -const AMBIENT_COLOR: vec3 = vec3( 0.18, 0.20, 0.28); +@group(3) @binding(2) var camera : Camera; fn raygen_main(gid: vec3) { - if (gid.x >= hdr.surfaceW || gid.y >= hdr.surfaceH) { return; } + if (gid.x >= wfParams.surfaceW || gid.y >= wfParams.surfaceH) { return; } let pixel = vec2(f32(gid.x), f32(gid.y)); - let resolution = vec2(f32(hdr.surfaceW), f32(hdr.surfaceH)); + let resolution = vec2(f32(wfParams.surfaceW), f32(wfParams.surfaceH)); let uv = (pixel + vec2(0.5)) / resolution; let ndc = uv * 2.0 - vec2(1.0); - // Pinhole camera reconstructed from the host basis. ndc.x runs left- - // to-right across the screen → +right; ndc.y is top-down so we - // negate before applying +up. let direction = normalize( camera.right * (ndc.x * camera.aspect * camera.tanHalf) + camera.up * (-ndc.y * camera.tanHalf) + camera.forward); - // ── Primary ray ──────────────────────────────────────────────────── var payload: Payload; payload.color = vec3(0.0); payload.shadowRay = 0u; - payload.hit = 0u; - traceRay( - 0u, 0u, 0xFFu, - 0u, 0u, 0u, - camera.origin, 0.001, - direction, 10000.0, - &payload); - - var finalColor: vec3; - if (payload.hit == 1u) { - // Closesthit filled albedo/worldPos/worldNormal. Two-sided - // shading: flip the normal toward the camera if we hit the back - // face — Sponza's curtains in particular have inconsistent - // winding, and without this half the surface would go black. - let albedo = payload.color; - let nFacing = select(-payload.worldNormal, - payload.worldNormal, - dot(payload.worldNormal, direction) < 0.0); - let lightDir = normalize(SUN_DIR_TO_LIGHT); - let nDotL = max(0.0, dot(nFacing, lightDir)); - - // ── Shadow ray ──────────────────────────────────────────────── - // Only worth tracing if the surface faces the sun at all. - var visibility = 0.0; - if (nDotL > 0.0) { - // Normal-offset bias on Sponza's units (~3700 wide atrium) - // is hefty; 0.5 keeps the shadow ray clear of the originating - // triangle without producing visible "floating" shadows. - let shadowOrigin = payload.worldPos + nFacing * 0.5; - - var shadowPayload: Payload; - shadowPayload.color = vec3(0.0); // default: blocked - shadowPayload.shadowRay = 1u; - shadowPayload.hit = 0u; - traceRay( - 0u, - RT_FLAG_SKIP_CLOSEST_HIT | RT_FLAG_TERMINATE_ON_FIRST_HIT, - 0xFFu, - 0u, 0u, 0u, - shadowOrigin, 0.001, - lightDir, 10000.0, - &shadowPayload); - visibility = shadowPayload.color.x; - } - - let lit = AMBIENT_COLOR + SUN_COLOR * (nDotL * visibility); - finalColor = albedo * lit; - } else { - // Sky color was filled by miss_main. - finalColor = payload.color; - } - - // Reinhard tonemap + gamma 2.2 so sun-lit albedos don't clip and - // shadow detail stays readable. - let mapped = finalColor / (finalColor + vec3(1.0)); - let gamma = pow(mapped, vec3(1.0 / 2.2)); - textureStore(outImage, - vec2(i32(gid.x), i32(gid.y)), - vec4(gamma, 1.0)); + rtEmitPrimaryRay(camera.origin, 0.001, direction, 10000.0, + 0u, 0xFFu, 0u, 0u, payload); } diff --git a/examples/Sponza/resolve.wgsl b/examples/Sponza/resolve.wgsl new file mode 100644 index 0000000..346659e --- /dev/null +++ b/examples/Sponza/resolve.wgsl @@ -0,0 +1,7 @@ +// Sponza RESOLVE-stage tonemap: Reinhard + gamma 2.2 over the linear +// accumulator — matches the tonemap the megakernel raygen applied inline. +fn resolve_main(coord: vec2, hdr: vec4) -> vec4 { + let mapped = hdr.rgb / (hdr.rgb + vec3(1.0)); + let g = pow(mapped, vec3(1.0 / 2.2)); + return vec4(g, 1.0); +} From dd4122f2baf634ea9957343f99974376931c90be Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 20:21:44 +0000 Subject: [PATCH 5/8] WebGPU RT: ordered (nearest-child-first) traversal Add _rtAabbT (AABB test returning entry-t); in both _rtwTraverseBlas and _rtwTraverseTlas descend the nearer child first and push the farther only when it hits, re-culling it against the (tightened) bestT when popped. Render is identical (same closest hit) on VulkanTriangle, RTStress (512/4096), and Sponza; cuts node visits on dense scenes. Co-Authored-By: Claude Opus 4.8 --- additional/dom-webgpu.js | 59 ++++++++++++++++++++++++++++++++++++---- 1 file changed, 54 insertions(+), 5 deletions(-) diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js index fcadaba..00bf317 100644 --- a/additional/dom-webgpu.js +++ b/additional/dom-webgpu.js @@ -1481,6 +1481,27 @@ fn _rtAabb(ro: vec3, invRd: vec3, mn: vec3, mx: vec3, tMax: return tExit >= max(tEnter, 0.0) && tEnter <= tMax; } +// AABB test that also returns the (clamped) entry distance, for ordered +// nearest-child-first traversal. t is meaningless when hit == false. +struct _RtAabbHit { hit: bool, t: f32 }; +fn _rtAabbT(ro: vec3, invRd: vec3, mn: vec3, mx: vec3, tMax: f32) -> _RtAabbHit { + var r: _RtAabbHit; + r.hit = false; + r.t = 0.0; + if (any(mn > mx)) { return r; } + let t0 = (mn - ro) * invRd; + let t1 = (mx - ro) * invRd; + let tmin = min(t0, t1); + let tmax = max(t0, t1); + let tEnter = max(max(tmin.x, tmin.y), tmin.z); + let tExit = min(min(tmax.x, tmax.y), tmax.z); + if (tExit >= max(tEnter, 0.0) && tEnter <= tMax) { + r.hit = true; + r.t = max(tEnter, 0.0); + } + return r; +} + struct _RtTriHit { hit: bool, t: f32, u: f32, v: f32 }; fn _rtTri(ro: vec3, rd: vec3, p0: vec3, p1: vec3, p2: vec3, tMin: f32, tMax: f32) -> _RtTriHit { @@ -1896,10 +1917,26 @@ fn _rtwTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord, if (sp == 0u) { break; } sp = sp - 1u; nodeRel = stack[sp]; continue; } + // Internal node: descend the nearer child first; push the farther + // only when it hits (and re-cull it against bestT when popped). let left = node.firstChildOrPrim; let right = left + 1u; - if (sp < 32u) { stack[sp] = right; sp = sp + 1u; } - nodeRel = left; + let ln = bvhNodes[meshRec.bvhOffset + left]; + let rn = bvhNodes[meshRec.bvhOffset + right]; + let lr = _rtAabbT(rayObj.origin, invD, ln.aabbMin, ln.aabbMax, *bestT); + let rr = _rtAabbT(rayObj.origin, invD, rn.aabbMin, rn.aabbMax, *bestT); + if (lr.hit && rr.hit) { + if (sp + 1u < 32u) { + if (lr.t <= rr.t) { stack[sp] = right; sp = sp + 1u; stack[sp] = left; sp = sp + 1u; } + else { stack[sp] = left; sp = sp + 1u; stack[sp] = right; sp = sp + 1u; } + } + } else if (lr.hit) { + if (sp < 32u) { stack[sp] = left; sp = sp + 1u; } + } else if (rr.hit) { + if (sp < 32u) { stack[sp] = right; sp = sp + 1u; } + } + if (sp == 0u) { break; } + sp = sp - 1u; nodeRel = stack[sp]; } return false; } @@ -1960,11 +1997,23 @@ fn _rtwTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32, if (endSearch) { return true; } if ((*bestT) < pre && (effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } } else { + // Internal node: nearest-child-first. Sentinel-padded leaves + // carry an inverted AABB so _rtAabbT rejects them for free. let left = 2u * nodeIdx + 1u; let right = 2u * nodeIdx + 2u; - if (sp + 1u < 32u) { - stack[sp] = right; sp = sp + 1u; - stack[sp] = left; sp = sp + 1u; + let ln = tlasBvhNodes[left]; + let rn = tlasBvhNodes[right]; + let lr = _rtAabbT(rayWorld.origin, invD, ln.aabbMin, ln.aabbMax, *bestT); + let rr = _rtAabbT(rayWorld.origin, invD, rn.aabbMin, rn.aabbMax, *bestT); + if (lr.hit && rr.hit) { + if (sp + 1u < 32u) { + if (lr.t <= rr.t) { stack[sp] = right; sp = sp + 1u; stack[sp] = left; sp = sp + 1u; } + else { stack[sp] = left; sp = sp + 1u; stack[sp] = right; sp = sp + 1u; } + } + } else if (lr.hit) { + if (sp < 32u) { stack[sp] = left; sp = sp + 1u; } + } else if (rr.hit) { + if (sp < 32u) { stack[sp] = right; sp = sp + 1u; } } } } From 82e5e867d4fafb07b227f971460a7d832c484972 Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 20:24:04 +0000 Subject: [PATCH 6/8] WebGPU RT: remove dead megakernel WGSL (no dual path) The RT pipeline now only builds the wavefront kernels, so the old single-megakernel traversal/traceRay block (rtWgslMegakernelHelpers) and the unused rtWgslPrelude alias are dead. Remove them. The rayQuery compute path keeps rtWgslMegakernelBindings (its own _rq* traversal uses it). RTStress still renders correctly with the trimmed prelude. Co-Authored-By: Claude Opus 4.8 --- additional/dom-webgpu.js | 199 --------------------------------------- 1 file changed, 199 deletions(-) diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js index 00bf317..d3a9c72 100644 --- a/additional/dom-webgpu.js +++ b/additional/dom-webgpu.js @@ -1439,7 +1439,6 @@ const TLAS_BVH_N_PADDED: u32 = 16384u; const TLAS_BVH_LEAVES_START: u32 = TLAS_BVH_N_PADDED - 1u; `; -const rtWgslPrelude = rtWgslTypes + rtWgslMegakernelBindings; // ── WGSL library: helpers + traverseBlas + traverseTlas + traceRay ─── // Injected after the user-supplied closesthit/anyhit/miss sources + @@ -1526,204 +1525,6 @@ fn _rtTri(ro: vec3, rd: vec3, p0: vec3, p1: vec3, p2: vec3, - bestT: ptr, - payload: ptr) -> bool { - let invD = vec3(1.0) / rayObj.direction; - var stack: array; - var sp: u32 = 0u; - var nodeRel: u32 = 0u; - - loop { - let abs = meshRec.bvhOffset + nodeRel; - let node = bvhNodes[abs]; - if (!_rtAabb(rayObj.origin, invD, node.aabbMin, node.aabbMax, *bestT)) { - if (sp == 0u) { break; } - sp = sp - 1u; nodeRel = stack[sp]; continue; - } - if (node.primCount > 0u) { - for (var i: u32 = 0u; i < node.primCount; i = i + 1u) { - let triIndex = primRemap[meshRec.primRemapOffset + node.firstChildOrPrim + i]; - let verts = _rtFetchTri(meshRec, triIndex); - let tr = _rtTri(rayObj.origin, rayObj.direction, - verts[0], verts[1], verts[2], - rayObj.tMin, *bestT); - if (!tr.hit) { continue; } - - let geomNormal = cross(verts[1] - verts[0], verts[2] - verts[0]); - let facing = dot(geomNormal, rayObj.direction); - if ((flags & RT_FLAG_CULL_BACK_FACING_TRIANGLES) != 0u && facing > 0.0) { continue; } - if ((flags & RT_FLAG_CULL_FRONT_FACING_TRIANGLES) != 0u && facing < 0.0) { continue; } - - var candidate: HitInfo; - candidate.t = tr.t; - candidate.instanceId = instanceId; - candidate.primitiveId = triIndex; - candidate.hitGroupIndex = hitGroupBase; - candidate.attribs = vec2(tr.u, tr.v); - candidate.objectRayOrigin = rayObj.origin; - candidate.objectRayDirection = rayObj.direction; - - let opaque = (flags & RT_FLAG_OPAQUE) != 0u - || (flags & RT_FLAG_NO_OPAQUE) == 0u; // default opaque - - if (opaque) { - *bestHit = candidate; - *bestT = tr.t; - if ((flags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } - } else { - let r = runAnyHit(hitGroupBase, rayObj, candidate, payload); - if (r == RT_ANYHIT_END_SEARCH) { - *bestHit = candidate; - *bestT = tr.t; - return true; - } - if (r == RT_ANYHIT_ACCEPT) { - *bestHit = candidate; - *bestT = tr.t; - if ((flags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } - } - } - } - if (sp == 0u) { break; } - sp = sp - 1u; nodeRel = stack[sp]; continue; - } - // inner node — push right, descend left - let left = node.firstChildOrPrim; - let right = left + 1u; - if (sp < 32u) { stack[sp] = right; sp = sp + 1u; } - nodeRel = left; - } - return false; -} - -fn _rtTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32, - sbtRecordOffset: u32, sbtRecordStride: u32, - bestHit: ptr, - bestT: ptr, - payload: ptr) -> bool { - let invD = vec3(1.0) / rayWorld.direction; - // Stack-based descent of the sweep-tree BVH. Internal nodes - // [0, TLAS_BVH_LEAVES_START); leaves [LEAVES_START, 2*N_PADDED-1). - // Node i's children are 2i+1 / 2i+2 (implicit perfect binary tree). - // Stack depth = tree depth = log2(N_PADDED) = 14 for N_PADDED=16384; - // 24 gives generous headroom. - var stack: array; - var sp: u32 = 0u; - stack[sp] = 0u; sp = sp + 1u; - loop { - if (sp == 0u) { break; } - sp = sp - 1u; - let nodeIdx = stack[sp]; - let node = tlasBvhNodes[nodeIdx]; - if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, *bestT)) { - continue; - } - if (nodeIdx >= TLAS_BVH_LEAVES_START) { - // Leaf: resolve entry, do the existing per-instance test. - let leafIdx = nodeIdx - TLAS_BVH_LEAVES_START; - let i = tlasEntryOrder[leafIdx]; - // Sentinel-padded leaves get instanceMask=0; cullMask check - // (and degenerate AABB above) means they fall out cheaply. - if (i == 0xFFFFFFFFu) { continue; } - let inst = tlasEntries[i]; - let instanceMask = inst.maskHGOffset & 0xFFu; - if ((instanceMask & cullMask) == 0u) { continue; } - if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; } - - // Transform ray to object space. - let r0 = inst.worldToObjectR0; - let r1 = inst.worldToObjectR1; - let r2 = inst.worldToObjectR2; - var rayObj: RayDesc; - rayObj.origin = vec3( - dot(r0.xyz, rayWorld.origin) + r0.w, - dot(r1.xyz, rayWorld.origin) + r1.w, - dot(r2.xyz, rayWorld.origin) + r2.w, - ); - rayObj.direction = vec3( - dot(r0.xyz, rayWorld.direction), - dot(r1.xyz, rayWorld.direction), - dot(r2.xyz, rayWorld.direction), - ); - rayObj.tMin = rayWorld.tMin; - rayObj.tMax = *bestT; - - var effective = flags; - let iflags = inst.instanceFlags; - if ((iflags & RT_INSTANCE_FORCE_OPAQUE) != 0u) { - effective = (effective | RT_FLAG_OPAQUE) & ~RT_FLAG_NO_OPAQUE; - } - if ((iflags & RT_INSTANCE_FORCE_NO_OPAQUE) != 0u) { - effective = (effective | RT_FLAG_NO_OPAQUE) & ~RT_FLAG_OPAQUE; - } - if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) { - effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES); - } - - let hitGroupOffset = inst.maskHGOffset >> 8u; - let hitGroupBase = sbtRecordOffset + hitGroupOffset; - let meshRec = meshRecords[inst.blasMeshIdx]; - - let pre = *bestT; - let endSearch = _rtTraverseBlas(rayObj, effective, meshRec, i, hitGroupBase, - bestHit, bestT, payload); - if (endSearch) { return true; } - if ((*bestT) < pre) { - // record world-space object-to-world for the closest-hit shader - (*bestHit).objectToWorldR0 = inst.objectToWorldR0; - (*bestHit).objectToWorldR1 = inst.objectToWorldR1; - (*bestHit).objectToWorldR2 = inst.objectToWorldR2; - (*bestHit).customIndex = inst.customIndex; - if ((effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } - } - } else { - // Internal node: push both children (skip overflow). - let left = 2u * nodeIdx + 1u; - let right = 2u * nodeIdx + 2u; - if (sp + 1u < 24u) { - stack[sp] = right; sp = sp + 1u; - stack[sp] = left; sp = sp + 1u; - } - } - } - return false; -} - -fn traceRay(tlasIdx: u32, flags: u32, cullMask: u32, - sbtRecordOffset: u32, sbtRecordStride: u32, missIndex: u32, - rayOrigin: vec3, rayTMin: f32, - rayDir: vec3, rayTMax: f32, - payload: ptr) { - var ray: RayDesc; - ray.origin = rayOrigin; - ray.direction = rayDir; - ray.tMin = rayTMin; - ray.tMax = rayTMax; - var bestHit: HitInfo; - bestHit.t = rayTMax; - var bestT = rayTMax; - let ended = _rtTraverseTlas(ray, flags, cullMask & 0xFFu, - sbtRecordOffset, sbtRecordStride, - &bestHit, &bestT, payload); - if (bestT < rayTMax) { - if ((flags & RT_FLAG_SKIP_CLOSEST_HIT) == 0u) { - runClosestHit(bestHit.hitGroupIndex, ray, bestHit, payload); - } - } else { - runMiss(missIndex, ray, payload); - } -} -`; // ════════════════════════════════════════════════════════════════════════ // WAVEFRONT RT — streaming tracer (GENERATE → PREP → TRACE → SHADE → From afc0292fab6841e281af006bd781a627ae2a5fd0 Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 20:28:12 +0000 Subject: [PATCH 7/8] WebGPU RT: dynamic TLAS sweep-tree depth (next_pow2 instances) The LBVH bitonic sort still runs over the full 16384 (sentinels sink to the tail), but the sweep tree is now built and traced at depth log2(next_pow2(nReal)) instead of a fixed 14. Add nPadded to LbvhPC; leaf init + bottom-up refit use it; the host passes the same next_pow2 to the trace via WfParams.tlasNPadded. Renders correctly at 512 instances (depth 9). The fragile sort phases are untouched. Co-Authored-By: Claude Opus 4.8 --- additional/dom-webgpu.js | 69 ++++++++++++++++++++++++++-------------- 1 file changed, 46 insertions(+), 23 deletions(-) diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js index d3a9c72..c0e83d1 100644 --- a/additional/dom-webgpu.js +++ b/additional/dom-webgpu.js @@ -2281,7 +2281,11 @@ struct BvhNode { // runtime resize-on-grow caused subtle BVH corruption (driver-level // memory recycling, suspected) and was the root cause of mid-game // geometry flicker when projectiles entered the TLAS. -struct LbvhPC { nReal: u32, _pad0: u32, _pad1: u32, _pad2: u32 }; +// nPadded = next_pow2(max(nReal,1)), supplied by the host. The bitonic +// sort still runs over the full N_PADDED (sentinels sink to the tail), but +// the sweep tree is built (and traced) at depth log2(nPadded) so descent +// tracks the real instance count instead of a fixed 14. +struct LbvhPC { nReal: u32, nPadded: u32, _pad1: u32, _pad2: u32 }; @group(0) @binding(5) var lbvhPc : LbvhPC; const N_PADDED: u32 = 16384u; @@ -2436,29 +2440,36 @@ fn lbvhBuildMain(@builtin(local_invocation_id) lid: vec3) { storageBarrier(); // ── Phase 4: initialize BVH leaf AABBs ─────────────────────────────── - for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + // Only the first nPadded sorted slots become leaves of the (smaller) + // sweep tree; reals occupy [0,nReal), the rest sink as sentinels. + let nPadded = max(lbvhPc.nPadded, 1u); + let leafPerThread = (nPadded + THREADS - 1u) / THREADS; + for (var k: u32 = 0u; k < leafPerThread; k = k + 1u) { let i = k * THREADS + tid; - let leafIdx = N_PADDED - 1u + i; - let leafKey = sortA[i]; - if (leafKey == 0xFFFFFFFFu) { - outBvh[leafIdx].aabbMin = vec3( 1e30); - outBvh[leafIdx].aabbMax = vec3(-1e30); - } else { - let e = entries[leafKey & 0xFFFFu]; - outBvh[leafIdx].aabbMin = e.aabbMin; - outBvh[leafIdx].aabbMax = e.aabbMax; + if (i < nPadded) { + let leafIdx = nPadded - 1u + i; + let leafKey = sortA[i]; + if (leafKey == 0xFFFFFFFFu) { + outBvh[leafIdx].aabbMin = vec3( 1e30); + outBvh[leafIdx].aabbMax = vec3(-1e30); + } else { + let e = entries[leafKey & 0xFFFFu]; + outBvh[leafIdx].aabbMin = e.aabbMin; + outBvh[leafIdx].aabbMax = e.aabbMax; + } } } workgroupBarrier(); storageBarrier(); - // ── Phase 5: bottom-up sweep-tree refit, LEVELS iterations ────────── - // Deepest internal level has N_PADDED/2 nodes; perThread = ceil of - // levelCount / THREADS is uniform per step, so workgroupBarrier - // stays in uniform control flow. - var levelCount: u32 = N_PADDED / 2u; - var levelStart: u32 = N_PADDED / 2u - 1u; - for (var step: u32 = 0u; step < LEVELS; step = step + 1u) { + // ── Phase 5: bottom-up sweep-tree refit, log2(nPadded) levels ─────── + // Deepest internal level has nPadded/2 nodes. The loop bound is uniform + // across the workgroup (depends only on nPadded), so the barriers stay + // in uniform control flow. + var levelCount: u32 = nPadded / 2u; + var levelStart: u32 = nPadded / 2u - 1u; + loop { + if (levelCount == 0u) { break; } let perThread = (levelCount + THREADS - 1u) / THREADS; for (var k: u32 = 0u; k < perThread; k = k + 1u) { let nodeOff = k * THREADS + tid; @@ -2723,11 +2734,12 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle, { binding: 4, resource: { buffer: morton } }, ], }); - // Write the real instance count to the LBVH count uniform so the - // shader can iterate exactly the right number of entries even - // though the storage buffers stay sized for N_PADDED. + // Write the real instance count + the dynamic padded leaf count + // (next_pow2) to the LBVH uniform. The sort still runs over the full + // N_PADDED, but the sweep tree is built at depth log2(nPadded). const countBuf = new Uint32Array(4); countBuf[0] = instanceCount; + countBuf[1] = wfNextPow2(instanceCount); queue.writeBuffer(rtState.lbvhCountBuf, 0, countBuf); const lbvhBg = device.createBindGroup({ @@ -2798,7 +2810,15 @@ const WF_PAYLOAD_BYTES = 64; // Dynamic-offset uniform ring: one WfParams slot per wavefront pass. 128 // slots covers maxDepth up to ~42 (1 + 3·maxDepth + 1 passes). const WF_PARAM_SLOTS = 128; -const WF_FIXED_TLAS_NPADDED = 16384; // matches lbvhBuildWgsl N_PADDED +const WF_TLAS_MAX_NPADDED = 16384; // LBVH sort capacity (N_PADDED) +// Smallest power of two >= max(n,1), clamped to the LBVH capacity. The +// TLAS sweep tree is built and traced at this depth so descent tracks the +// real instance count instead of a fixed 16384-leaf (depth-14) tree. +function wfNextPow2(n) { + let p = 1; + while (p < n && p < WF_TLAS_MAX_NPADDED) p <<= 1; + return p; +} function ensureWavefrontBuffers(W, H) { const cap = W * H; @@ -3041,11 +3061,14 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, // 1+3*d .. +2 PREP / TRACE / SHADE for bounce d // 1+3*depth RESOLVE const passCount = 2 + 3 * depth; + // TLAS descent depth = log2(tlasNPadded); must match the value the + // build used (both derive next_pow2 from the same instance count). + const tlasNPadded = wfNextPow2(instanceCount); const ring = new Uint32Array(WF_PARAM_SLOTS * 64); // 256 B = 64 u32 per slot const writeSlot = (slot, curIsA, bounce) => { const o = slot * 64; ring[o + 0] = W; ring[o + 1] = H; ring[o + 2] = cap; ring[o + 3] = curIsA; - ring[o + 4] = bounce; ring[o + 5] = depth; ring[o + 6] = WF_FIXED_TLAS_NPADDED; ring[o + 7] = 0; + ring[o + 4] = bounce; ring[o + 5] = depth; ring[o + 6] = tlasNPadded; ring[o + 7] = 0; }; writeSlot(0, 1, 0); // GENERATE for (let d = 0; d < depth; d++) { From 358084185a7fd563f21c0f9940039413b80e24c8 Mon Sep 17 00:00:00 2001 From: catbot Date: Sun, 31 May 2026 20:29:12 +0000 Subject: [PATCH 8/8] docs: wavefront RT in README + design-doc status; add RTStress to examples --- README.md | 17 ++++++++++++++--- WAVEFRONT-DESIGN.md | 32 +++++++++++++++++++++++++------- 2 files changed, 39 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 6aca07b..b64e32d 100644 --- a/README.md +++ b/README.md @@ -50,9 +50,16 @@ compute pipeline composed from user-supplied WGSL stages). bridge. Atlas (`r8unorm`, sub-region writes) is a separate path. - **PipelineRTVulkan / PipelineRTWebGPU / ShaderBindingTableVulkan / ShaderBindingTableWebGPU / RTPass** — ray-tracing pipelines. Vulkan - uses native RT pipelines + SBTs; WebGPU composes one compute - pipeline by stitching the traversal library, a generated hit-group - switch, and the user's raygen / closesthit / miss / anyhit WGSL. + uses native RT pipelines + SBTs; WebGPU compiles a **wavefront / + streaming** software tracer — five `@compute` kernels + (`GENERATE → PREP → TRACE → SHADE → RESOLVE`) sharing one module, + connected by GPU ray/hit/payload buffers and a GPU-driven indirect + bounce loop (`dispatchWorkgroupsIndirect`). TRACE carries zero user + code (traversal + intersection only); user raygen calls + `rtEmitPrimaryRay`, and closesthit / miss run in SHADE where they + `rtEmitRay` continuation/shadow rays and `rtAccumulate` radiance. An + optional Resolve shader tonemaps the linear accumulator. See + [WAVEFRONT-DESIGN.md](WAVEFRONT-DESIGN.md). - **ComputeShader / WebGPUComputeShader** — Tier 1 wrapper used by the UI system. Vulkan loads a `.spv` and dispatches with `vkCmdPushDataEXT`; WebGPU loads a user-supplied `.wgsl` blob at @@ -145,6 +152,10 @@ See [examples/](examples/). Quick map: - [VulkanTriangle](examples/VulkanTriangle/) — ray-traced triangle on both Vulkan and WebGPU. The smallest test of the bindless + RT path on each backend. +- [RTStress](examples/RTStress/) — wavefront RT benchmark: an N×N×N grid + of a cube mesh (instance-count knob `kGrid`, 512 → 8000) shaded with + primary + shadow rays. Prints a GPU timestamp-query per-pass breakdown + each second. WebGPU/DOM only. - [Sponza](examples/Sponza/) — ray-traced Sponza atrium on both backends. Exercises `.cmesh` / `.ctex` decompression (GPU `VK_EXT_memory_decompression` on Vulkan, CPU on WebGPU) and a diff --git a/WAVEFRONT-DESIGN.md b/WAVEFRONT-DESIGN.md index 78d77e2..47e42d0 100644 --- a/WAVEFRONT-DESIGN.md +++ b/WAVEFRONT-DESIGN.md @@ -53,13 +53,31 @@ Compile/runtime knob. JS unrolls the chain to maxDepth. VulkanTriangle maxDepth=1 (primary only). Sponza maxDepth=2 (primary + shadow). ## Status / progress -- [x] baseline VulkanTriangle renders (megakernel) — /tmp/baseline-triangle.png -- [ ] wavefront prelude + codegen -- [ ] VulkanTriangle on wavefront (maxDepth=1) -- [ ] bounce loop + indirect + Sponza shadow port -- [ ] RTStress example + timestamp queries -- [ ] ordered traversal, dynamic TLAS depth, device limits -- [ ] remove megakernel dual path; final validation; PR +- [x] baseline VulkanTriangle renders (megakernel) +- [x] wavefront prelude + codegen (5 entry points share one module) +- [x] VulkanTriangle on wavefront (maxDepth=1) — bit-identical to baseline +- [x] indirect-dispatch bounce loop + PREP (cross-pass atomics proven) +- [x] RTStress example (N³ cube grid) + GPU timestamp-query per-pass HUD +- [x] Sponza port (shadow ray in SHADE) — renders the atrium correctly +- [x] ordered (nearest-child-first) traversal +- [x] dynamic TLAS sweep-tree depth (next_pow2 instances) +- [x] device limits (maxBufferSize / maxStorageBufferBindingSize / + maxComputeWorkgroupsPerDimension) + timestamp-query feature +- [x] megakernel dead path removed (RT pipeline builds only wavefront) +- [~] binding packing (Phase 7): SKIPPED — target device reports 64 storage + buffers/stage (≥12), so the merge is unnecessary (issue makes it + conditional on <12). + +### Measured (this container's GPU, via timestamp-query; NOT a 4090) +Per-pass GPU time, 1920×995, primary+shadow (maxDepth=2): +- RTStress 512 inst: GEN ~0.80ms TRACE ~1.63ms SHADE ~1.00ms total ~3.52ms (~280 fps) +- RTStress 4096 inst: GEN ~0.80ms TRACE ~1.95ms SHADE ~1.00ms total ~3.85ms (~260 fps) +- Sponza: GEN ~0.79ms TRACE ~1.81ms SHADE ~1.00ms total ~3.69ms +8× the instances costs only ~16% more TRACE — the spatial TLAS + ordered +descent scale sub-linearly. NOTE: a 4090 number and the TRACE-kernel +register/occupancy delta require hardware + a profiler not available in +this CI container; the architectural win (TRACE carries zero user code, so +its register footprint is the traversal loop alone) is structural. ## Files - `additional/dom-webgpu.js` — prelude (`rtWgsl*`), `wgpuLoadRTPipeline`,