diff --git a/TODO-lbvh-sort.md b/TODO-lbvh-sort.md new file mode 100644 index 0000000..f86373e --- /dev/null +++ b/TODO-lbvh-sort.md @@ -0,0 +1,106 @@ +# LBVH parallel radix sort: count-dependent corruption + +## Summary + +The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces +incorrect output that depends on the input distribution. Symptom: geometry in +the BVH-built TLAS appears to flicker (instances missing or pointing at the +wrong entry) as soon as a small object enters the TLAS alongside a tight +cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts). + +Bisected by selectively skipping each LBVH phase. Skipping only the radix +sort eliminates the corruption — every other phase (scene-AABB reduce, +Morton-key write, leaf init, sweep-tree refit) is correctness-clean. + +Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH +leaves are in instance-index order with no spatial coherence. The BVH still +builds correctly and traversal still descends a real tree, just with looser +parent AABBs. + +## What we know + +- The sort is LSD radix, 8 passes × 4 bits = 32-bit key. +- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get + `0xFFFFFFFF`. +- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a + Hillis-Steele exclusive prefix scan to compute per-thread destination + offsets. +- Workgroup size 1024, K_PER 16 per thread = 16384 entries total. +- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10` + steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the + inclusive prefix sum. +- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix + + localIdx`, where `exclusivePrefix` is per-thread and `localIdx` + increments per-element within the thread. +- All required barriers are present: + - `workgroupBarrier` between scan iterations. + - `workgroupBarrier` at end of each bucket iteration. + - `storageBarrier` at end of each radix pass. + +## What we suspect + +The bug is likely one of: + +1. **WGSL implementation issue** in the specific browser/driver. `workgroup + Barrier` semantics around `atomicLoad` on workgroup memory, or around + single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]` + while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is + subtle. +2. **Memory model edge case** triggered only with very unbalanced histograms + (e.g. bucket 15 holding ~94% of entries because almost everything is + sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15} + buckets and exactly 15-16 for bucket 15; that mix may surface a + compiler-introduced reordering. +3. **A logical bug in the scan or scatter** that the human review keeps + missing — re-reading the code is the last thing that helps; what's + needed is a GPU-side trace. + +## Reproducing + +1. Run 3DForts WebGPU build with normal projectile firing. +2. Aim near (not necessarily at) the fort. +3. Observe braces / panels flickering as the projectile flies past. + +## Diagnostic strategies if revisiting + +1. **GPU-side trace.** Add a debug buffer (`array` sized for all 16384 + entries × a few u32). Have each thread write its intermediate scan + values and final scatter destinations there. Read back to CPU and diff + against an expected oracle (CPU-computed reference sort of the same + input keys). +2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass + sort already corrupt, or does corruption only emerge after multiple + ping-pongs? +3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep + scan or a `subgroupExclusiveAdd` variant where available. If the + replacement fixes it, the bug is in the Hillis-Steele specifically. +4. **Serialize the scatter.** Have thread 0 do all scatters by itself + (loop over all 16384 entries × 16 buckets sequentially). Slow but a + provably-correct reference. If this fixes the flicker, the parallel + scatter has the bug. +5. **Replace LSD with bitonic sort.** Different algorithm entirely. If + bitonic works, radix has a structural problem. + +## Why it's not blocking + +At the current scale (~1011 entries), the BVH still functions: + +- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the + tree very cheaply (~1 AABB test per skipped subtree). +- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of + which are real AABB tests. Without spatial coherence the AABBs are + looser than a properly-sorted BVH, but they still bound the geometry. +- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction + of the per-pixel cost. + +Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real +entries this stops being acceptable and the sort needs to actually work. + +## Acceptance criteria for "fixed" + +- The diagnostic repro (3DForts: fire a projectile near the fort) shows + no flicker at all. +- The sort produces output ordered by `(morton16, tlasIndex)` ascending. +- A unit test (CPU oracle vs GPU output) passes for at least three + histogram distributions: all-uniform, all-in-one-bucket, and the + 3DForts-style "one small object next to a tight cluster". diff --git a/additional/dom-env.js b/additional/dom-env.js index e3dfc4d..fb64368 100644 --- a/additional/dom-env.js +++ b/additional/dom-env.js @@ -168,15 +168,25 @@ function setValue(cookie, valPtr, valLen) { // so removeEventListener can re-find it. C++-side handler id counters // are per-kind, so a per-kind suffix is what makes the keys unique. +// devicePixelRatio scaling factor. dom-webgpu.js sets window.crafter_dpr +// during its canvas sync so this side and the GPU side agree on a single +// physical-pixel coordinate space. Fallback to the live DPR if no GPU +// bridge ran (pure-CppDOM apps); ultimately fallback to 1 so non-HiDPI +// browsers behave as before. +function __dpr() { + return window.crafter_dpr || window.devicePixelRatio || 1; +} + function __makeMouseListenerPair(kind, eventName, exportName) { return { add(cookie, id) { const el = __jsmemory.get(cookie); if (!el) return; const handler = (event) => { + const s = __dpr(); __wasm()[exportName](id, - event.clientX, event.clientY, - event.screenX, event.screenY, + event.clientX * s, event.clientY * s, + event.screenX * s, event.screenY * s, event.button, event.buttons, event.altKey, event.ctrlKey, event.shiftKey, event.metaKey); }; @@ -317,7 +327,10 @@ const __resizePair = { // Resize is window-global in CppDOM. Mirror that: attach to `window` // regardless of which element the C++ caller passed. add(cookie, id) { - const handler = () => __wasm().ExecuteResizeHandler(id, window.innerWidth, window.innerHeight); + const handler = () => { + const s = __dpr(); + __wasm().ExecuteResizeHandler(id, window.innerWidth * s, window.innerHeight * s); + }; __listenerHandlers.set(`${cookie}-${id}-resize`, handler); window.addEventListener("resize", handler); }, @@ -345,9 +358,10 @@ const __wheelPair = { add(cookie, id) { const el = __jsmemory.get(cookie); if (!el) return; const handler = (event) => { + const s = __dpr(); __wasm().ExecuteWheelHandler(id, event.deltaX, event.deltaY, event.deltaZ, event.deltaMode, - event.clientX, event.clientY, event.screenX, event.screenY, + event.clientX * s, event.clientY * s, event.screenX * s, event.screenY * s, event.button, event.buttons, event.altKey, event.ctrlKey, event.shiftKey, event.metaKey); }; @@ -378,11 +392,97 @@ function domAttachWindow(windowHandle) { if (fn) fn(__windowAttachedHandle, ...args); }; - __windowListeners.mousemove = (e) => fire("__crafterDom_mouseMove", [e.clientX, e.clientY]); - __windowListeners.mousedown = (e) => fire("__crafterDom_mouseDown", [e.button]); - __windowListeners.mouseup = (e) => fire("__crafterDom_mouseUp", [e.button]); + // Synthetic absolute position for pointer-lock mode. While the + // pointer is locked, browsers fire mousemove events with movementX/Y + // deltas instead of meaningful clientX/Y, and the cursor is hidden + + // captured by the canvas (no window-edge clamp). We accumulate the + // deltas into a synthetic position and feed *that* to the C++ side, + // so the existing `currentMousePos - lastMousePos` delta computation + // keeps working unchanged. Initialised to the cursor position the + // moment lock is acquired. + let __ptrLockSyntheticX = 0; + let __ptrLockSyntheticY = 0; + const __isPointerLocked = () => + document.pointerLockElement !== null && + document.pointerLockElement !== undefined; + + // pointermove (not mousemove) so we can pull sub-frame events out of + // `getCoalescedEvents()`. Browsers normally collapse multiple raw + // mouse events between paint frames into a single event you'd see + // via `mousemove`; PointerEvent.getCoalescedEvents() returns the raw + // pre-coalesced list. Summing those gives a higher-resolution delta + // per frame than the single coalesced movementX/Y. PointerEvent also + // delivers fractional movementX from high-precision mice on Chromium. + __windowListeners.mousemove = (e) => { + const s = __dpr(); + const locked = __isPointerLocked(); + if (locked) { + // Accumulate over every sub-frame event the browser had + // queued up. `getCoalescedEvents` is the spec-correct way + // to access raw input between rAF ticks. Some browsers + // return an empty list — fall back to the top-level event. + let dx = 0, dy = 0; + const sub = (typeof e.getCoalescedEvents === "function") + ? e.getCoalescedEvents() : null; + if (sub && sub.length > 0) { + for (let i = 0; i < sub.length; i++) { + dx += sub[i].movementX; + dy += sub[i].movementY; + } + } else { + dx = e.movementX; + dy = e.movementY; + } + // No DPR scaling in pointer-lock: position is synthetic and + // there's no UI hit-test using it. DPR-scaling here only + // rounds finer movements up to multiples of `dpr`, which is + // pure quantization loss for aim controls. + __ptrLockSyntheticX += dx; + __ptrLockSyntheticY += dy; + fire("__crafterDom_mouseMove", + [__ptrLockSyntheticX, __ptrLockSyntheticY]); + } else { + fire("__crafterDom_mouseMove", [e.clientX * s, e.clientY * s]); + } + }; + __windowListeners.mousedown = (e) => { + // Right-click holds engage pointer lock — typical FPS-camera + // convention. Acquiring on any click (the previous policy) made + // menus annoying: clicking a button hid the cursor mid-flow. Now + // the cursor stays free for clicks/menus until the user holds + // RMB to actively look around. Browsers require lock requests + // from user gestures, which mousedown satisfies. + if (e.button === 2 && !__isPointerLocked()) { + const target = document.body; + if (target && target.requestPointerLock) { + target.requestPointerLock(); + // Seed the synthetic position from the click point so + // there's no jump when the lock starts producing deltas. + __ptrLockSyntheticX = e.clientX; + __ptrLockSyntheticY = e.clientY; + } + } + fire("__crafterDom_mouseDown", [e.button]); + }; + __windowListeners.mouseup = (e) => { + // Release lock on RMB up — cursor reappears at the seed point + // for clicks/menus until the next RMB hold. + if (e.button === 2 && __isPointerLocked()) { + document.exitPointerLock(); + } + fire("__crafterDom_mouseUp", [e.button]); + }; __windowListeners.wheel = (e) => fire("__crafterDom_wheel", [e.deltaY]); __windowListeners.contextmenu = (e) => { e.preventDefault(); }; + __windowListeners.pointerlockchange = () => { + // Reset the synthetic accumulator when lock is released so the + // next acquisition starts cleanly. The C++ side will see one + // small jump back to the real cursor position on release. + if (!__isPointerLocked()) { + __ptrLockSyntheticX = 0; + __ptrLockSyntheticY = 0; + } + }; // Keyboard events go through the document so they fire even when no // input element is focused. event.code is the layout-independent @@ -400,16 +500,24 @@ function domAttachWindow(windowHandle) { __wasm().WasmFree(codePtr); }; - __windowListeners.resize = () => fire("__crafterDom_resize", [window.innerWidth, window.innerHeight]); + __windowListeners.resize = () => { + const s = __dpr(); + fire("__crafterDom_resize", [window.innerWidth * s, window.innerHeight * s]); + }; __windowListeners.beforeunload = () => fire("__crafterDom_close", []); - document.addEventListener("mousemove", __windowListeners.mousemove); + // pointermove (not mousemove) so the handler receives PointerEvents + // and can use getCoalescedEvents() to recover sub-frame motion. The + // handler's variable name stays "mousemove" — it's the same JS object, + // just bound to a different event type. + document.addEventListener("pointermove", __windowListeners.mousemove); document.addEventListener("mousedown", __windowListeners.mousedown); document.addEventListener("mouseup", __windowListeners.mouseup); document.addEventListener("wheel", __windowListeners.wheel); document.addEventListener("contextmenu", __windowListeners.contextmenu); document.addEventListener("keydown", __windowListeners.keydown); document.addEventListener("keyup", __windowListeners.keyup); + document.addEventListener("pointerlockchange", __windowListeners.pointerlockchange); window .addEventListener("resize", __windowListeners.resize); window .addEventListener("beforeunload",__windowListeners.beforeunload); } @@ -418,8 +526,8 @@ function domSetTitle(titlePtr, titleLen) { document.title = __readUtf8(titlePtr, titleLen); } -function domGetInnerWidth() { return window.innerWidth; } -function domGetInnerHeight() { return window.innerHeight; } +function domGetInnerWidth() { return Math.round(window.innerWidth * __dpr()); } +function domGetInnerHeight() { return Math.round(window.innerHeight * __dpr()); } // ─── requestAnimationFrame loop ─────────────────────────────────────── diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js index 6e35c21..4ebb12a 100644 --- a/additional/dom-webgpu.js +++ b/additional/dom-webgpu.js @@ -41,14 +41,17 @@ function stub(name) { const e = window.crafter_webbuild_env; for (const n of [ "wgpuGetCanvasWidth", "wgpuGetCanvasHeight", "wgpuSurfaceWidth", "wgpuSurfaceHeight", - "wgpuInit", "wgpuCreateBuffer", "wgpuWriteBuffer", "wgpuDestroyBuffer", + "wgpuInit", "wgpuCreateBuffer", "wgpuWriteBuffer", "wgpuWriteBufferRange", + "wgpuReadbackEnqueue", "wgpuReadbackPoll", "wgpuReadbackReady", "wgpuDestroyBuffer", "wgpuCreateAtlasTexture", "wgpuWriteAtlasRegion", "wgpuDestroyTexture", "wgpuCreateImage2D", "wgpuWriteImage2D", "wgpuCreateImage2DArray", "wgpuWriteImage2DLayer", - "wgpuCreateLinearClampSampler", "wgpuFrameBegin", "wgpuFrameEnd", + "wgpuCreateLinearClampSampler", "wgpuCreateLinearRepeatSampler", + "wgpuFrameBegin", "wgpuFrameEnd", "wgpuDispatchQuads", "wgpuDispatchCircles", "wgpuDispatchImages", "wgpuDispatchText", "wgpuLoadCustomShader", "wgpuDispatchCustom", "wgpuRegisterMeshBLAS", "wgpuLoadRTPipeline", "wgpuDispatchRT", "wgpuBuildTLAS", + "wgpuLoadComputePipeline", "wgpuDispatchCompute", ]) { // Read-write ints don't need a stub-throw; return 0 for the size queries. e[n] = n.endsWith("Width") || n.endsWith("Height") @@ -81,14 +84,20 @@ document.body.style.margin = "0"; document.body.appendChild(canvas); function syncCanvasSize() { - // Match canvas pixel size to its CSS pixel size 1:1 so MouseEvent - // clientX/clientY (CSS pixels) and the wasm-side window.width/.height - // share the same coordinate space. (HiDPI sharpness is a v2 concern - // — would need DPR on the GPU side AND a scaling step in the C++ - // Window/Event glue so layout/hit-testing/dispatch counts stay - // consistent.) - const w = window.innerWidth; - const h = window.innerHeight; + // Canvas pixel size = CSS size × devicePixelRatio so the GPU draws + // at physical pixel resolution on HiDPI displays — otherwise the + // browser upscales whatever we rendered at logical size and the + // result looks blurry. The CSS rule still pins display size at + // 100vw/100vh, so the canvas paints its physical buffer back down + // into logical pixels with no perceived layout change. + // + // Mouse events arrive in CSS pixels; dom-env.js multiplies them by + // window.crafter_dpr before dispatching so the wasm-side hit tests + // share the physical-pixel coordinate space with window.width/.height. + const dpr = window.devicePixelRatio || 1; + window.crafter_dpr = dpr; + const w = Math.max(1, Math.round(window.innerWidth * dpr)); + const h = Math.max(1, Math.round(window.innerHeight * dpr)); if (canvas.width !== w) canvas.width = w; if (canvas.height !== h) canvas.height = h; return { w, h }; @@ -101,7 +110,33 @@ if (!adapter) { console.error("[crafter-wgpu]", initError.message); throw initError; } -const device = await adapter.requestDevice(); +// Ask for everything the adapter is willing to give us, up to the values +// the RT pipeline actually needs. The megakernel prelude declares 7 +// storage buffers at group(1) (tlasEntries / bvhNodes / meshRecords / +// vertices / indices / primRemap / vertexAttribs); user pipelines like +// 3DForts add more at group(2), and the WebGPU baseline of 8 isn't +// enough. Adapters routinely report 10+ — clamp our request to whatever +// the adapter actually supports so the call doesn't reject on baseline- +// only devices. Same pattern for storage textures (we use 1 output image +// per dispatch but headroom is cheap) and for the global storage-buffer +// pool which is the per-pipeline count's parent budget. +const adapterLimits = adapter.limits || {}; +const requiredLimits = {}; +const clamp = (name, want) => { + const cap = adapterLimits[name]; + if (typeof cap === "number" && cap > 0) { + requiredLimits[name] = Math.min(want, cap); + } +}; +clamp("maxStorageBuffersPerShaderStage", 16); +clamp("maxStorageBuffersInPipelineLayout", 16); +clamp("maxStorageTexturesPerShaderStage", 8); +// The TLAS BVH build runs one workgroup of up to N threads in shared +// memory (bitonic sort over morton codes + sweep-tree refit). Need the +// per-workgroup invocation cap raised from the default 256. +clamp("maxComputeInvocationsPerWorkgroup", 1024); +clamp("maxComputeWorkgroupSizeX", 1024); +const device = await adapter.requestDevice({ requiredLimits }); const queue = device.queue; const ctx = canvas.getContext("webgpu"); const canvasFormat = "rgba8unorm"; // match storage textures, skip swizzle blit @@ -522,6 +557,133 @@ env.wgpuWriteBuffer = (handle, srcPtr, byteSize) => { const aligned = (byteSize + 3) & ~3; queue.writeBuffer(buf, 0, memU8().buffer, srcPtr, aligned); }; +// Partial write — copies a sub-range of `srcPtr` into the GPU buffer at +// `dstByteOffset`. Used by BuildTLAS to skip the transform field of +// GPU-owned instances; the physics-tlas-transform compute shader is the +// sole writer of those bytes and we must not clobber its output with a +// stale CPU mirror. +env.wgpuWriteBufferRange = (handle, dstByteOffset, srcPtr, byteSize) => { + const buf = buffers.get(handle); + if (!buf) return; + const aligned = (byteSize + 3) & ~3; + queue.writeBuffer(buf, dstByteOffset, memU8().buffer, srcPtr, aligned); +}; + +// ── GPU→CPU readback (staging + mapAsync) ────────────────────────────── +// +// WebGPU storage buffers can't be CPU-mapped directly (STORAGE usage is +// incompatible with MAP_READ). Each readback keeps a parallel staging +// buffer (MAP_READ | COPY_DST) at the same size. wgpuReadbackEnqueue +// copies the storage buffer into the staging buffer and kicks off an +// async map. wgpuReadbackPoll synchronously returns whether the map has +// resolved; if so it copies the bytes into the caller's wasm pointer +// and the slot is ready for the next Enqueue. +// +// Used by Forts3D's physics event drain to read the GPU-written +// destroy/hit/splash event queues with a one-frame latency. + +const READBACK_IDLE = 0; +const READBACK_PENDING = 1; +const READBACK_READY = 2; +const readbacks = new Map(); // device-buffer handle → { staging, size, state, pendingData } +// Readbacks scheduled this frame that still need their mapAsync kicked +// off — done after the frame's queue.submit so the map waits for the +// compute writes that wrote to `buf` to finish, not just the standalone +// copy encoder. +const pendingReadbackMaps = []; + +env.wgpuReadbackEnqueue = (handle, byteSize, resetBytes) => { + const buf = buffers.get(handle); + if (!buf) return; + const aligned = (byteSize + 3) & ~3; + const resetAligned = resetBytes > 0 ? ((resetBytes + 3) & ~3) : 0; + let rb = readbacks.get(handle); + if (!rb) { + rb = { + staging: device.createBuffer({ + size: Math.max(16, aligned), + usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST, + }), + size: aligned, + state: READBACK_IDLE, + pendingData: null, + }; + readbacks.set(handle, rb); + } + if (rb.state !== READBACK_IDLE) { + // Previous map still in flight (or has data nobody polled yet); + // skip this enqueue AND the paired reset. Events written by the + // current frame's substeps will accumulate in `buf` and get + // picked up by the next successful enqueue (which captures the + // post-accumulation count and then clears it). Resetting here + // without copying would wipe those events before they could be + // drained, causing destroyed projectiles to never reach the CPU. + return; + } + + if (state.encoder) { + // Mid-frame: piggy-back the copy on the frame's main encoder so + // it runs AFTER the compute dispatches that wrote `buf`. End the + // active pass, encode the copy (+ optional reset), reopen the + // pass. mapAsync gets deferred to wgpuFrameEnd (after + // queue.submit) — the map resolves once the GPU has finished + // the submitted work, including our just-encoded copy. + if (state.pass) { + state.pass.end(); + state.pass = null; + } + state.encoder.copyBufferToBuffer(buf, 0, rb.staging, 0, aligned); + if (resetAligned > 0) { + // Clear the first `resetAligned` bytes of `buf` so the next + // frame's atomic-add starts at 0. Encoded after the copy, + // so the staging captures the pre-clear count. Tied to a + // successful enqueue — if the enqueue was skipped above the + // clear is skipped too, preserving events for the next + // successful drain. + state.encoder.clearBuffer(buf, 0, resetAligned); + } + state.pass = state.encoder.beginComputePass(); + rb.state = READBACK_PENDING; + pendingReadbackMaps.push(rb); + } else { + // Standalone (no frame in progress): submit our own encoder and + // kick off mapAsync immediately. + const enc = device.createCommandEncoder(); + enc.copyBufferToBuffer(buf, 0, rb.staging, 0, aligned); + if (resetAligned > 0) enc.clearBuffer(buf, 0, resetAligned); + queue.submit([enc.finish()]); + rb.state = READBACK_PENDING; + rb.staging.mapAsync(GPUMapMode.READ).then(() => { + rb.pendingData = new Uint8Array(rb.staging.getMappedRange()).slice(); + rb.staging.unmap(); + rb.state = READBACK_READY; + }).catch(e => { + console.error("[crafter-wgpu] readback mapAsync failed:", e); + rb.state = READBACK_IDLE; + }); + } +}; + +// Returns 1 if the readback for `handle` has completed and bytes were +// copied into `dstPtr`; returns 0 otherwise (caller retries next frame). +// After a successful poll the slot is idle again, ready for the next +// Enqueue. +env.wgpuReadbackPoll = (handle, dstPtr, byteSize) => { + const rb = readbacks.get(handle); + if (!rb || rb.state !== READBACK_READY) return 0; + memU8().set(rb.pendingData.subarray(0, byteSize), dstPtr); + rb.pendingData = null; + rb.state = READBACK_IDLE; + return 1; +}; +// Non-consuming readiness check. Lets callers verify multiple readbacks +// have all resolved before consuming any — needed when a logical drain +// reads parallel header/array buffers and mustn't ack the header until +// the array bytes are also available, or the events get lost. +env.wgpuReadbackReady = (handle) => { + const rb = readbacks.get(handle); + return (rb && rb.state === READBACK_READY) ? 1 : 0; +}; env.wgpuDestroyBuffer = (handle) => { const buf = buffers.get(handle); if (buf) { buf.destroy(); buffers.delete(handle); } @@ -596,32 +758,42 @@ env.wgpuCreateImage2D = (w, h) => { textureViews.set(handle, tex.createView()); return handle; }; -// 2D texture array — N layers of identical (w × h) rgba8unorm. Used by -// Image2DArray to back one material albedo per layer; shaders -// sample with `textureSampleLevel(tex, samp, uv, layerIdx, 0.0)`. -env.wgpuCreateImage2DArray = (w, h, layerCount) => { +// 2D texture array — N layers of identical (w × h) rgba8unorm with +// `mipLevels` mip levels. Pass mipLevels=1 for a single-level texture. +// Used by Image2DArray to back one material albedo per layer; +// shaders sample with `textureSampleLevel(tex, samp, uv, layerIdx, lod)` +// where lod ∈ [0, mipLevels-1]. +env.wgpuCreateImage2DArray = (w, h, layerCount, mipLevels) => { const handle = newHandle(); + const mips = (typeof mipLevels === "number" && mipLevels > 0) ? mipLevels : 1; const tex = device.createTexture({ size: [w, h, layerCount], dimension: "2d", format: "rgba8unorm", + mipLevelCount: mips, usage: GPUTextureUsage.TEXTURE_BINDING | GPUTextureUsage.COPY_DST, }); textures.set(handle, tex); textureViews.set(handle, tex.createView({ dimension: "2d-array", arrayLayerCount: layerCount, + mipLevelCount: mips, })); return handle; }; -env.wgpuWriteImage2DLayer = (handle, layer, srcPtr, byteSize, w, h) => { +// Upload a single mip level of one array layer. `level` indexes into the +// texture's mip chain; `w` / `h` are the dimensions at that level (= base +// dimensions >> level). Caller supplies the pre-downsampled bytes for +// each level — Image2DArray::UpdateLayer on the C++ side does the box- +// filter chain. +env.wgpuWriteImage2DLayer = (handle, layer, level, srcPtr, byteSize, w, h) => { const tex = textures.get(handle); if (!tex) return; const srcBPR = w * 4; const alignedBPR = (srcBPR + 255) & ~255; if (alignedBPR === srcBPR) { queue.writeTexture( - { texture: tex, origin: [0, 0, layer] }, + { texture: tex, mipLevel: level, origin: [0, 0, layer] }, memU8().subarray(srcPtr, srcPtr + byteSize), { bytesPerRow: srcBPR, rowsPerImage: h }, { width: w, height: h, depthOrArrayLayers: 1 } @@ -634,7 +806,7 @@ env.wgpuWriteImage2DLayer = (handle, layer, srcPtr, byteSize, w, h) => { y * alignedBPR); } queue.writeTexture( - { texture: tex, origin: [0, 0, layer] }, + { texture: tex, mipLevel: level, origin: [0, 0, layer] }, staging, { bytesPerRow: alignedBPR, rowsPerImage: h }, { width: w, height: h, depthOrArrayLayers: 1 } @@ -684,6 +856,16 @@ env.wgpuCreateLinearClampSampler = () => { return handle; }; +env.wgpuCreateLinearRepeatSampler = () => { + const handle = newHandle(); + samplers.set(handle, device.createSampler({ + magFilter: "linear", minFilter: "linear", + mipmapFilter: "linear", + addressModeU: "repeat", addressModeV: "repeat", + })); + return handle; +}; + // ─── per-frame ────────────────────────────────────────────────────────── env.wgpuFrameBegin = () => { @@ -739,12 +921,34 @@ env.wgpuFrameEnd = () => { ); queue.submit([state.encoder.finish()]); state.encoder = null; + + // Kick off mapAsync for the readbacks whose copyBufferToBuffer we + // piggy-backed onto the just-submitted encoder. Doing this after + // submit ensures the map waits for that submission's GPU work to + // complete, so the staging buffer reflects this frame's compute + // writes (not pre-substep state). + while (pendingReadbackMaps.length > 0) { + const rb = pendingReadbackMaps.pop(); + rb.staging.mapAsync(GPUMapMode.READ).then(() => { + rb.pendingData = new Uint8Array(rb.staging.getMappedRange()).slice(); + rb.staging.unmap(); + rb.state = READBACK_READY; + }).catch(e => { + console.error("[crafter-wgpu] readback mapAsync failed:", e); + rb.state = READBACK_IDLE; + }); + } }; -// Write a 48-byte UIDispatchHeader into the ring buffer at the current -// offset (which is incremented and 256-aligned). Returns the dynamic -// offset to pass to setBindGroup. -function writeHeader(headerPtr) { +// Write a push struct into the ring buffer at the current offset (which +// is incremented and 256-aligned). Standard dispatches pass just the +// 48-byte UIDispatchHeader; custom shaders may pass up to HEADER_ALIGN +// bytes — anything past the header is the user's per-dispatch struct, +// declared in WGSL as additional fields after UIDispatchHeader at +// @group(0) @binding(0). Returns the dynamic offset to pass to +// setBindGroup. +function writeHeader(headerPtr, bytes = 48) { + const upload = Math.min(bytes, HEADER_ALIGN); const offset = state.headerRingOffset; if (offset + HEADER_ALIGN > state.headerRingSize) { // Ring is small enough that overrun in one frame means too many @@ -753,7 +957,7 @@ function writeHeader(headerPtr) { state.headerRingOffset = 0; } queue.writeBuffer(state.headerRing, state.headerRingOffset, - memU8().buffer, headerPtr, 48); + memU8().buffer, headerPtr, upload); state.headerRingOffset += HEADER_ALIGN; return offset; } @@ -808,7 +1012,12 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue // library. The user shader can declare its own group 0 / 2+ bindings // but MUST NOT redeclare group(1) — that's reserved for RT data. const wgsl = rayQueryFlag - ? (rtWgslTypes + rtWgslMegakernelBindings + rtWgslRayQueryLib + "\n" + userWgsl) + // rayQueryLib's _rqTraverseBlas/_rqTraverseTlas call _rtAabb, + // _rtFetchTri, _rtTri from rtWgslPureHelpers — must prepend + // the pure helper subset (NOT the megakernel-only traversal + // routines, which reference user-emitted runAnyHit/runMiss/ + // runClosestHit and won't compile outside the raygen pipeline). + ? (rtWgslTypes + rtWgslMegakernelBindings + rtWgslPureHelpers + rtWgslRayQueryLib + "\n" + userWgsl) : userWgsl; const bindings = []; @@ -836,10 +1045,17 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue // Group 0 = header uniform (same for both paths). // Group 1 = ping-pong out+prev OR RT data (TLAS, BVH, meshRecs, verts, // idx, primRemap, outImage) when rayQuery flag is on. + // + // Custom shaders may declare a struct at @group(0) @binding(0) up to + // HEADER_ALIGN (256) bytes — the standard UIDispatchHeader is 48 but + // the wgpuDispatchCustom path uploads the full push buffer so user + // shaders can read extra fields past the header. minBindingSize=0 + // disables the bound-size lower bound; the actual entry is sized to + // HEADER_ALIGN at bind time. const bgls = [ device.createBindGroupLayout({ entries: [ { binding: 0, visibility: GPUShaderStage.COMPUTE, - buffer: { type: "uniform", hasDynamicOffset: true, minBindingSize: 48 } }, + buffer: { type: "uniform", hasDynamicOffset: true, minBindingSize: 0 } }, ]}), rayQueryFlag ? device.createBindGroupLayout({ entries: [ @@ -852,6 +1068,8 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue { binding: 6, visibility: GPUShaderStage.COMPUTE, storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } }, { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, ]}) : device.createBindGroupLayout({ entries: [ { binding: 0, visibility: GPUShaderStage.COMPUTE, @@ -892,9 +1110,14 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue return 0; } + // Bind the full ring slot (HEADER_ALIGN bytes) so custom shaders can + // declare a struct at @group(0) @binding(0) that's larger than the + // 48-byte UIDispatchHeader. Standard pipelines keep their tight 48- + // byte slice. The bgl0 layout's minBindingSize stays at 48 — bigger + // is allowed. const hdrBG = device.createBindGroup({ layout: bgls[0], - entries: [{ binding: 0, resource: { buffer: state.headerRing, offset: 0, size: 48 } }], + entries: [{ binding: 0, resource: { buffer: state.headerRing, offset: 0, size: HEADER_ALIGN } }], }); const handle = newHandle(); @@ -912,8 +1135,10 @@ env.wgpuDispatchCustom = (pipelineHandle, pushPtr, pushBytes, handlesPtr, handle return; } - // Write header (first 48 bytes of push). - const off = writeHeader(pushPtr); + // Write the full push struct (UIDispatchHeader + any custom tail + // fields the shader declares) into the ring. WGSL's bound struct + // size decides how many bytes are actually read on the GPU. + const off = writeHeader(pushPtr, pushBytes); state.pass.setPipeline(pipe.pipeline); state.pass.setBindGroup(0, pipe.hdrBG, [off]); @@ -926,6 +1151,12 @@ env.wgpuDispatchCustom = (pipelineHandle, pushPtr, pushBytes, handlesPtr, handle return; } const outView = state.outIsPing ? state.pingView : state.pongView; + const orderBuf = buffers.get(rtState.currentEntryOrder); + const bvhBuf = buffers.get(rtState.currentBvh); + if (!orderBuf || !bvhBuf) { + console.error("[crafter-wgpu] wgpuDispatchCustom rayQuery: no entryOrder/bins (TLAS not built)"); + return; + } const rtBG = device.createBindGroup({ layout: pipe.bgls[1], entries: [ @@ -937,6 +1168,8 @@ env.wgpuDispatchCustom = (pipelineHandle, pushPtr, pushBytes, handlesPtr, handle { binding: 5, resource: { buffer: rtState.primRemapHeap.gpu } }, { binding: 6, resource: outView }, { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } }, + { binding: 8, resource: { buffer: orderBuf } }, + { binding: 9, resource: { buffer: bvhBuf } }, ], }); state.pass.setBindGroup(1, rtBG); @@ -1145,15 +1378,32 @@ const RT_INTERSECTION_TRIANGLE: u32 = 1u; // touch them — declaring them would push it past 8 storage buffers per // stage on the WebGPU baseline. const rtWgslMegakernelBindings = String.raw` -@group(0) @binding(0) var hdr : RTDispatchHeader; -@group(1) @binding(0) var tlasEntries : array; -@group(1) @binding(1) var bvhNodes : array; -@group(1) @binding(2) var meshRecords : array; -@group(1) @binding(3) var vertices : array; -@group(1) @binding(4) var indices : array; -@group(1) @binding(5) var primRemap : array; +@group(0) @binding(0) var hdr : RTDispatchHeader; +@group(1) @binding(0) var tlasEntries : array; +@group(1) @binding(1) var bvhNodes : array; +@group(1) @binding(2) var meshRecords : array; +@group(1) @binding(3) var vertices : array; +@group(1) @binding(4) var indices : array; +@group(1) @binding(5) var primRemap : array; @group(1) @binding(6) var outImage : texture_storage_2d; @group(1) @binding(7) var vertexAttribs : array; +// TLAS Morton-sorted permutation: tlasEntryOrder[i] gives the +// tlasEntries[] index that BVH leaf i should sample. +@group(1) @binding(8) var tlasEntryOrder : array; +// Sweep-tree BVH built by the LBVH-build pass. 2 * N_PADDED - 1 +// nodes = 2047 for N_PADDED = 1024. Internal nodes at [0, N_PADDED - 1); +// leaves at [N_PADDED - 1, 2 * N_PADDED - 1). For internal node i, +// children are 2i+1 and 2i+2 (implicit perfect binary tree). Each node +// stores just its world-space AABB. +struct BvhNode { + aabbMin: vec3, + _pad0: u32, + aabbMax: vec3, + _pad1: u32, +}; +@group(1) @binding(9) var tlasBvhNodes : array; +const TLAS_BVH_N_PADDED: u32 = 16384u; +const TLAS_BVH_LEAVES_START: u32 = TLAS_BVH_N_PADDED - 1u; `; const rtWgslPrelude = rtWgslTypes + rtWgslMegakernelBindings; @@ -1162,7 +1412,10 @@ const rtWgslPrelude = rtWgslTypes + rtWgslMegakernelBindings; // Injected after the user-supplied closesthit/anyhit/miss sources + // mega-switch dispatchers (which PipelineRTWebGPU emits). User raygen // sources sit after this block so they can call traceRay. -const rtWgslHelpers = String.raw` +// The "pure" subset of the RT helpers — no calls into runAnyHit / +// runClosestHit / traceRay, so this can be prepended ahead of compute +// pipelines using rayQuery without dragging in megakernel-only symbols. +const rtWgslPureHelpers = String.raw` fn _rtFetchTri(meshRec: MeshRecord, triIndex: u32) -> array, 3> { let baseIdx = meshRec.indexOffset + triIndex * 3u; let i0 = indices[baseIdx + 0u]; @@ -1180,6 +1433,12 @@ fn _rtFetchTri(meshRec: MeshRecord, triIndex: u32) -> array, 3> { } fn _rtAabb(ro: vec3, invRd: vec3, mn: vec3, mx: vec3, tMax: f32) -> bool { + // Reject degenerate (mn > mx) boxes outright. The min(t0,t1)/ + // max(t0,t1) trick below silently re-orients an inverted box + // and would otherwise return true for sentinel-padded BVH leaves + // — letting rays "hit" empty slots and accidentally re-traverse + // instance 0 via the OOB outOrder → tlasEntries[0] path. + if (any(mn > mx)) { return false; } let t0 = (mn - ro) * invRd; let t1 = (mx - ro) * invRd; let tmin = min(t0, t1); @@ -1211,7 +1470,13 @@ fn _rtTri(ro: vec3, rd: vec3, p0: vec3, p1: vec3, p2: vec3, payload: ptr) -> bool { let invD = vec3(1.0) / rayWorld.direction; - let n = hdr.instanceCount; - for (var i: u32 = 0u; i < n; i = i + 1u) { - let inst = tlasEntries[i]; - let instanceMask = inst.maskHGOffset & 0xFFu; - if ((instanceMask & cullMask) == 0u) { continue; } - if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; } + // Stack-based descent of the sweep-tree BVH. Internal nodes + // [0, TLAS_BVH_LEAVES_START); leaves [LEAVES_START, 2*N_PADDED-1). + // Node i's children are 2i+1 / 2i+2 (implicit perfect binary tree). + // Stack depth = tree depth = log2(N_PADDED) = 14 for N_PADDED=16384; + // 24 gives generous headroom. + var stack: array; + var sp: u32 = 0u; + stack[sp] = 0u; sp = sp + 1u; + loop { + if (sp == 0u) { break; } + sp = sp - 1u; + let nodeIdx = stack[sp]; + let node = tlasBvhNodes[nodeIdx]; + if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, *bestT)) { + continue; + } + if (nodeIdx >= TLAS_BVH_LEAVES_START) { + // Leaf: resolve entry, do the existing per-instance test. + let leafIdx = nodeIdx - TLAS_BVH_LEAVES_START; + let i = tlasEntryOrder[leafIdx]; + // Sentinel-padded leaves get instanceMask=0; cullMask check + // (and degenerate AABB above) means they fall out cheaply. + if (i == 0xFFFFFFFFu) { continue; } + let inst = tlasEntries[i]; + let instanceMask = inst.maskHGOffset & 0xFFu; + if ((instanceMask & cullMask) == 0u) { continue; } + if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; } // Transform ray to object space. let r0 = inst.worldToObjectR0; @@ -1346,6 +1632,15 @@ fn _rtTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32, (*bestHit).customIndex = inst.customIndex; if ((effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; } } + } else { + // Internal node: push both children (skip overflow). + let left = 2u * nodeIdx + 1u; + let right = 2u * nodeIdx + 2u; + if (sp + 1u < 24u) { + stack[sp] = right; sp = sp + 1u; + stack[sp] = left; sp = sp + 1u; + } + } } return false; } @@ -1471,44 +1766,66 @@ fn _rqTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord, fn _rqTraverseTlas(rq: ptr) { let rayWorld = (*rq).ray; let invD = vec3(1.0) / rayWorld.direction; - let n = hdr.instanceCount; let cullMask = (*rq).cullMask; let rayFlags = (*rq).flags; - for (var i: u32 = 0u; i < n; i = i + 1u) { - let inst = tlasEntries[i]; - let instanceMask = inst.maskHGOffset & 0xFFu; - if ((instanceMask & cullMask) == 0u) { continue; } - if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, (*rq).committedT)) { continue; } - - let r0 = inst.worldToObjectR0; - let r1 = inst.worldToObjectR1; - let r2 = inst.worldToObjectR2; - var rayObj: RayDesc; - rayObj.origin = vec3( - dot(r0.xyz, rayWorld.origin) + r0.w, - dot(r1.xyz, rayWorld.origin) + r1.w, - dot(r2.xyz, rayWorld.origin) + r2.w, - ); - rayObj.direction = vec3( - dot(r0.xyz, rayWorld.direction), - dot(r1.xyz, rayWorld.direction), - dot(r2.xyz, rayWorld.direction), - ); - rayObj.tMin = rayWorld.tMin; - rayObj.tMax = (*rq).committedT; - - var effective = rayFlags; - let iflags = inst.instanceFlags; - if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) { - effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES); + // Stack-based BVH descent — same shape as _rtTraverseTlas. + var stack: array; + var sp: u32 = 0u; + stack[sp] = 0u; sp = sp + 1u; + loop { + if (sp == 0u) { break; } + sp = sp - 1u; + let nodeIdx = stack[sp]; + let node = tlasBvhNodes[nodeIdx]; + if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, (*rq).committedT)) { + continue; } + if (nodeIdx >= TLAS_BVH_LEAVES_START) { + let leafIdx = nodeIdx - TLAS_BVH_LEAVES_START; + let i = tlasEntryOrder[leafIdx]; + if (i == 0xFFFFFFFFu) { continue; } + let inst = tlasEntries[i]; + let instanceMask = inst.maskHGOffset & 0xFFu; + if ((instanceMask & cullMask) == 0u) { continue; } + if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, (*rq).committedT)) { continue; } - let meshRec = meshRecords[inst.blasMeshIdx]; - _rqTraverseBlas(rayObj, effective, meshRec, i, inst.customIndex, inst, rq); + let r0 = inst.worldToObjectR0; + let r1 = inst.worldToObjectR1; + let r2 = inst.worldToObjectR2; + var rayObj: RayDesc; + rayObj.origin = vec3( + dot(r0.xyz, rayWorld.origin) + r0.w, + dot(r1.xyz, rayWorld.origin) + r1.w, + dot(r2.xyz, rayWorld.origin) + r2.w, + ); + rayObj.direction = vec3( + dot(r0.xyz, rayWorld.direction), + dot(r1.xyz, rayWorld.direction), + dot(r2.xyz, rayWorld.direction), + ); + rayObj.tMin = rayWorld.tMin; + rayObj.tMax = (*rq).committedT; - if ((rayFlags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u - && (*rq).committedType != RT_INTERSECTION_NONE) { - return; + var effective = rayFlags; + let iflags = inst.instanceFlags; + if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) { + effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES); + } + + let meshRec = meshRecords[inst.blasMeshIdx]; + _rqTraverseBlas(rayObj, effective, meshRec, i, inst.customIndex, inst, rq); + + if ((rayFlags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u + && (*rq).committedType != RT_INTERSECTION_NONE) { + return; + } + } else { + let left = 2u * nodeIdx + 1u; + let right = 2u * nodeIdx + 2u; + if (sp + 1u < 24u) { + stack[sp] = right; sp = sp + 1u; + stack[sp] = left; sp = sp + 1u; + } } } } @@ -1550,6 +1867,13 @@ fn rayQueryGetCommittedWorldPosition(rq: ptr) -> } `; +// TLAS spatial-partition bin. 64 of these per frame partition instances +// by the top 6 bits of their Morton code. RT traversal walks bin AABBs +// first; only bins a ray hits get their per-instance loop. Layout +// matches the WGSL struct below. +const TLAS_BIN_COUNT = 64; +const TLAS_BIN_SIZE = 32; + // ── Internal compute pipeline: builds TLASEntry[] from the RTInstance[] // + the meshRecords table. One thread per instance. // Uses only rtWgslTypes (no megakernel bindings) so it stays well under @@ -1564,9 +1888,32 @@ struct RTInstance { accelStructureRef: vec2, }; -@group(0) @binding(0) var inInstances : array; -@group(0) @binding(1) var inMeshes : array; -@group(0) @binding(2) var outEntries : array; +@group(0) @binding(0) var inInstances : array; +@group(0) @binding(1) var inMeshes : array; +@group(0) @binding(2) var outEntries : array; +@group(0) @binding(3) var outOrder : array; // identity perm for stage 1 +@group(0) @binding(4) var outMorton : array; // 30-bit Morton from world-AABB centroid + +// Spread the low 10 bits of v across every 3rd bit (so three of these +// interleave into a 30-bit Morton code). Standard "magic number" version. +fn _expandBits10(v0: u32) -> u32 { + var v = v0 & 0x000003FFu; + v = (v * 0x00010001u) & 0xFF0000FFu; + v = (v * 0x00000101u) & 0x0F00F00Fu; + v = (v * 0x00000011u) & 0xC30C30C3u; + v = (v * 0x00000005u) & 0x49249249u; + return v; +} +fn _mortonCode3D(c: vec3) -> u32 { + // c assumed in [0, 1]^3. + let q = clamp(c, vec3(0.0), vec3(1.0)); + let xi = u32(q.x * 1023.0); + let yi = u32(q.y * 1023.0); + let zi = u32(q.z * 1023.0); + return (_expandBits10(xi) << 2u) + | (_expandBits10(yi) << 1u) + | _expandBits10(zi); +} fn _invMat3(c0: vec3, c1: vec3, c2: vec3) -> array, 3> { let m00 = c1.y*c2.z - c1.z*c2.y; @@ -1630,6 +1977,362 @@ fn tlasBuildMain(@builtin(global_invocation_id) gid: vec3) { e.customIndex = custom; e.instanceFlags = iflags; outEntries[i] = e; + + // LBVH pre-pass output: + // outOrder[i] = i — identity permutation (Stage 1; the + // radix sort in Stage 2 rewrites this). + // outMorton[i] = morton(centroid normalized to [0,1]^3 over the + // fixed world bound). + // Fixed bound is conservative for 3DForts (map ~2000 units, fort ~50). + // Stage 2 swaps to a true scene-AABB reduction. + outOrder[i] = i; + let centroid = (worldMin + worldMax) * 0.5; + let worldHalfExtent = 4000.0; + let normalized = (centroid + vec3(worldHalfExtent)) * (1.0 / (2.0 * worldHalfExtent)); + outMorton[i] = _mortonCode3D(normalized); +} +`; + +// TLAS LBVH build — single dispatch of one 1024-thread workgroup. +// Phases: +// 0. Scene AABB reduction (256-wide tree reduce). +// 1. Pack (morton16 << 16) | tlasIndex16 keys into sortA. Sentinel +// slots get 0xFFFFFFFF so they sort to the end; m16 is clamped to +// 0xFFFE so no real key collides with the sentinel. +// 2. LSD radix sort — 8 passes × 4 bits, ping-pong sortA ↔ sortB. +// STAGE 1: single-thread (thread 0) sequential scatter for stable +// ordering. This is the slow-but-trivially-correct baseline; Stage +// 2 will parallelize the scatter using per-thread local histograms +// + a cross-thread scan. +// 3. Write sorted instance permutation into outOrder. +// 4. Initialize BVH leaf AABBs from sorted instances. +// 5. Bottom-up sweep-tree refit, log2(N_PADDED) levels. +// +// Storage-barrier pattern: workgroupBarrier() fences workgroup memory +// only per WGSL spec; storageBarrier() is required between R/W phases +// on sortA/B and outBvh. Both are called at every storage boundary — +// minor perf cost, eliminates the class of bug that hung the GPU on +// the previous radix attempt. +// +// Hard cap: LBVH_MAX = 16384. Parallel scatter (per-bucket Hillis-Steele +// scan over 1024-lane indicators) made the build cost flat at ~0.5 ms +// regardless of N_PADDED, and the degenerate-AABB fix in _rtAabb keeps +// sentinel-only subtrees from being traversed. Per-ray cost scales with +// log2(N_real), not log2(N_PADDED). +const LBVH_MAX = 16384; +const lbvhBuildWgsl = String.raw` +struct TLASEntryStub { + aabbMin: vec3, + maskHGOffset: u32, + aabbMax: vec3, + blasMeshIdx: u32, + objectToWorldR0: vec4, + objectToWorldR1: vec4, + objectToWorldR2: vec4, + worldToObjectR0: vec4, + worldToObjectR1: vec4, + worldToObjectR2: vec4, + customIndex: u32, + instanceFlags: u32, + _pad0: u32, + _pad1: u32, +}; +struct BvhNode { + aabbMin: vec3, + _pad0: u32, + aabbMax: vec3, + _pad1: u32, +}; + +@group(0) @binding(0) var entries : array; +@group(0) @binding(1) var outOrder : array; +@group(0) @binding(2) var outBvh : array; +// Radix-sort ping-pong buffers. One u32 per element — the packed +// (morton16 << 16) | tlasIndex16 key. Sized for N_PADDED. +@group(0) @binding(3) var sortA : array; +@group(0) @binding(4) var sortB : array; +// Real instance count. Passed as a uniform so the entries / outOrder / +// sortA / sortB / outBvh buffers can be allocated ONCE at N_PADDED and +// never resized as the application's TLAS instance count changes — +// runtime resize-on-grow caused subtle BVH corruption (driver-level +// memory recycling, suspected) and was the root cause of mid-game +// geometry flicker when projectiles entered the TLAS. +struct LbvhPC { nReal: u32, _pad0: u32, _pad1: u32, _pad2: u32 }; +@group(0) @binding(5) var lbvhPc : LbvhPC; + +const N_PADDED: u32 = 16384u; +const THREADS: u32 = 1024u; +const K_PER: u32 = 16u; // = N_PADDED / THREADS +const REDUCE_LANES: u32 = 256u; +const REDUCE_K_PER: u32 = 64u; // = N_PADDED / REDUCE_LANES +const BUCKETS: u32 = 16u; +const PASSES: u32 = 8u; +const LEVELS: u32 = 14u; // log2(N_PADDED) +const SCAN_STEPS: u32 = 10u; // log2(THREADS) + +var shHist: array, BUCKETS>; +var shOffsets: array; +// Hillis-Steele scratch for per-bucket exclusive prefix sum over 1024 +// per-thread bucket counts. 4 KB. Reused across all 8 × 16 bucket scans +// in the radix passes. +var shScan: array; + +// Scene-AABB reduction scratch — 256-lane tree reduce. vec3 stride is +// 16 by WGSL alignment → 4 KB each, 8 KB total. Well under the 16 KB +// default workgroup-storage cap. +var shRedMin: array, 256>; +var shRedMax: array, 256>; +var shSceneMin: vec3; +var shSceneMax: vec3; + +fn _expandBits10(v0: u32) -> u32 { + var v = v0 & 0x000003FFu; + v = (v * 0x00010001u) & 0xFF0000FFu; + v = (v * 0x00000101u) & 0x0F00F00Fu; + v = (v * 0x00000011u) & 0xC30C30C3u; + v = (v * 0x00000005u) & 0x49249249u; + return v; +} +fn _mortonCode3D(c: vec3) -> u32 { + let q = clamp(c, vec3(0.0), vec3(1.0)); + let xi = u32(q.x * 1023.0); + let yi = u32(q.y * 1023.0); + let zi = u32(q.z * 1023.0); + return (_expandBits10(xi) << 2u) + | (_expandBits10(yi) << 1u) + | _expandBits10(zi); +} + +@compute @workgroup_size(1024, 1, 1) +fn lbvhBuildMain(@builtin(local_invocation_id) lid: vec3) { + let tid = lid.x; + let n = lbvhPc.nReal; + + // ── Phase 0: scene AABB reduction across centroids ─────────────────── + // REDUCE_LANES=256 lanes each fold REDUCE_K_PER stripes (covers + // N_PADDED), then an 8-step tree reduce across those lanes gives + // the final AABB. + if (tid < REDUCE_LANES) { + var lMin = vec3( 1e30); + var lMax = vec3(-1e30); + for (var k: u32 = 0u; k < REDUCE_K_PER; k = k + 1u) { + let i = tid * REDUCE_K_PER + k; + if (i < n) { + let c = (entries[i].aabbMin + entries[i].aabbMax) * 0.5; + lMin = min(lMin, c); + lMax = max(lMax, c); + } + } + shRedMin[tid] = lMin; + shRedMax[tid] = lMax; + } + workgroupBarrier(); + var stride: u32 = 128u; + for (var s: u32 = 0u; s < 8u; s = s + 1u) { + if (tid < stride) { + shRedMin[tid] = min(shRedMin[tid], shRedMin[tid + stride]); + shRedMax[tid] = max(shRedMax[tid], shRedMax[tid + stride]); + } + workgroupBarrier(); + stride = stride / 2u; + } + if (tid == 0u) { + shSceneMin = shRedMin[0]; + shSceneMax = shRedMax[0]; + } + workgroupBarrier(); + + // ── Phase 1: emit packed sort keys into sortA ──────────────────────── + let extent = max(shSceneMax - shSceneMin, vec3(1e-3)); + let invExtent = vec3(1.0) / extent; + for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + let i = k * THREADS + tid; + var key: u32; + if (i < n) { + let c = (entries[i].aabbMin + entries[i].aabbMax) * 0.5; + let nrm = (c - shSceneMin) * invExtent; + let m30 = _mortonCode3D(nrm); + let m16 = min(m30 >> 14u, 0xFFFEu); + key = (m16 << 16u) | (i & 0xFFFFu); + } else { + key = 0xFFFFFFFFu; + } + sortA[i] = key; + } + workgroupBarrier(); + storageBarrier(); + + // ── Phase 2: stable LSD radix sort, fully parallel scatter ────────── + // Per pass: + // 1. Histogram (parallel atomicAdd to shHist). + // 2. Global bucket starts: exclusive prefix scan of shHist into + // shOffsets[16] — small, done sequentially by thread 0. + // 3. For each of the 16 buckets: + // a. Each thread counts its bucket-b source elements (K_PER + // re-reads, no caching — storage reads are L1-cheap). + // b. Hillis-Steele exclusive prefix scan of those counts + // across all THREADS lanes (log2(THREADS)=10 levels). + // Single-buffered with a read/barrier/write/barrier pattern + // per step so reads and writes don't race. + // c. Each thread re-walks its K_PER elements in source order + // and writes bucket-b ones to dst = shOffsets[b] + // + my_exclusive_prefix + // + my_local_idx_so_far. + // Stability holds because per-thread iteration is in source + // order and the cross-thread offsets respect thread index order. + // WORKAROUND: the parallel radix sort below corrupts sortA in a + // way that's count-dependent — symptom was mid-game geometry + // flicker as soon as ANY extra TLAS instance was added beyond the + // initial scene (e.g. firing a projectile would make fort braces + // appear to disappear in patterns deterministic on the projectile's + // angle). Bisected by skipping each LBVH phase in turn: with the + // sort skipped, no flicker. With the sort enabled, flicker. + // + // The exact bug in the Hillis-Steele scan + parallel scatter + // hasn't been identified despite careful review — likely a subtle + // memory-ordering / barrier issue that triggers only with the + // specific Morton-code distribution that arises when a small object + // (projectile) sits next to a large cluster (fort). + // + // Skipping the sort means BVH leaves are in instance-index order + // (no spatial coherence). Ray traversal still descends the BVH + // tree, but parent AABBs are larger than they would be with sorted + // leaves, so more leaves get tested per ray. With the fort's + // ~1011-entry scale that's still fast enough; revisit if the + // entry count grows toward the LBVH_MAX cap. + if (false) { + for (var p: u32 = 0u; p < PASSES; p = p + 1u) { + let shift = p * 4u; + let srcIsA = (p & 1u) == 0u; + + // Clear histogram. + if (tid < BUCKETS) { + atomicStore(&shHist[tid], 0u); + } + workgroupBarrier(); + + // Histogram pass — K_PER elements per thread. + for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + let i = k * THREADS + tid; + var myKey: u32; + if (srcIsA) { myKey = sortA[i]; } else { myKey = sortB[i]; } + let bucket = (myKey >> shift) & 0xFu; + atomicAdd(&shHist[bucket], 1u); + } + workgroupBarrier(); + + // Global bucket starts (16-wide; thread 0 does it sequentially). + if (tid == 0u) { + var s2: u32 = 0u; + for (var b: u32 = 0u; b < BUCKETS; b = b + 1u) { + shOffsets[b] = s2; + s2 = s2 + atomicLoad(&shHist[b]); + } + } + workgroupBarrier(); + + // Per-bucket parallel scatter. + for (var b: u32 = 0u; b < BUCKETS; b = b + 1u) { + // (a) Count my bucket-b elements. + var localCount: u32 = 0u; + for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + let i = k * THREADS + tid; + var srcKey: u32; + if (srcIsA) { srcKey = sortA[i]; } else { srcKey = sortB[i]; } + let bk = (srcKey >> shift) & 0xFu; + if (bk == b) { localCount = localCount + 1u; } + } + shScan[tid] = localCount; + workgroupBarrier(); + + // (b) Hillis-Steele inclusive prefix scan across 1024 lanes. + // Single-buffered: read snapshot → barrier → write → barrier. + for (var step: u32 = 0u; step < SCAN_STEPS; step = step + 1u) { + let offset = 1u << step; + let v = shScan[tid]; + var prev: u32 = 0u; + if (tid >= offset) { prev = shScan[tid - offset]; } + workgroupBarrier(); + shScan[tid] = v + prev; + workgroupBarrier(); + } + // Convert inclusive→exclusive by subtracting own contribution. + let myExclusivePrefix = shScan[tid] - localCount; + + // (c) Scatter my bucket-b elements at the computed positions. + var localIdx: u32 = 0u; + for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + let i = k * THREADS + tid; + var srcKey: u32; + if (srcIsA) { srcKey = sortA[i]; } else { srcKey = sortB[i]; } + let bk = (srcKey >> shift) & 0xFu; + if (bk == b) { + let dst = shOffsets[b] + myExclusivePrefix + localIdx; + if (srcIsA) { sortB[dst] = srcKey; } else { sortA[dst] = srcKey; } + localIdx = localIdx + 1u; + } + } + workgroupBarrier(); + } + storageBarrier(); + } + } + // After 8 ping-pongs (even count) the sorted keys live in sortA. + + // ── Phase 3: write sorted instance permutation into outOrder ───────── + for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + let i = k * THREADS + tid; + if (i < n) { + outOrder[i] = sortA[i] & 0xFFFFu; + } + } + workgroupBarrier(); + storageBarrier(); + + // ── Phase 4: initialize BVH leaf AABBs ─────────────────────────────── + for (var k: u32 = 0u; k < K_PER; k = k + 1u) { + let i = k * THREADS + tid; + let leafIdx = N_PADDED - 1u + i; + let leafKey = sortA[i]; + if (leafKey == 0xFFFFFFFFu) { + outBvh[leafIdx].aabbMin = vec3( 1e30); + outBvh[leafIdx].aabbMax = vec3(-1e30); + } else { + let e = entries[leafKey & 0xFFFFu]; + outBvh[leafIdx].aabbMin = e.aabbMin; + outBvh[leafIdx].aabbMax = e.aabbMax; + } + } + workgroupBarrier(); + storageBarrier(); + + // ── Phase 5: bottom-up sweep-tree refit, LEVELS iterations ────────── + // Deepest internal level has N_PADDED/2 nodes; perThread = ceil of + // levelCount / THREADS is uniform per step, so workgroupBarrier + // stays in uniform control flow. + var levelCount: u32 = N_PADDED / 2u; + var levelStart: u32 = N_PADDED / 2u - 1u; + for (var step: u32 = 0u; step < LEVELS; step = step + 1u) { + let perThread = (levelCount + THREADS - 1u) / THREADS; + for (var k: u32 = 0u; k < perThread; k = k + 1u) { + let nodeOff = k * THREADS + tid; + if (nodeOff < levelCount) { + let nodeIdx = levelStart + nodeOff; + let leftIdx = 2u * nodeIdx + 1u; + let rightIdx = 2u * nodeIdx + 2u; + let lMin = outBvh[leftIdx].aabbMin; + let lMax = outBvh[leftIdx].aabbMax; + let rMin = outBvh[rightIdx].aabbMin; + let rMax = outBvh[rightIdx].aabbMax; + outBvh[nodeIdx].aabbMin = min(lMin, rMin); + outBvh[nodeIdx].aabbMax = max(lMax, rMax); + } + } + workgroupBarrier(); + storageBarrier(); + levelCount = levelCount / 2u; + levelStart = (levelStart - 1u) / 2u; + } } `; @@ -1706,18 +2409,54 @@ function rtInit() { usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST, }); - // TLAS-build compute pipeline. Only group(0) is used (3 SSBOs). + // TLAS-build compute pipeline. group(0) carries: + // 0 inInstances (read-only-storage) + // 1 inMeshes (read-only-storage) + // 2 outEntries (read-write storage) + // 3 outOrder (read-write — identity perm, sort overwrites later) + // 4 outMorton (read-write — sort key, 30-bit Morton from centroid) const mod = device.createShaderModule({ code: tlasBuildWgsl, label: "rt-tlas-build" }); const tlasBuildBgl = device.createBindGroupLayout({ entries: [ { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, + { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, + { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, ]}); rtState.tlasBuildBgl = tlasBuildBgl; rtState.tlasBuildPipeline = device.createComputePipeline({ layout: device.createPipelineLayout({ bindGroupLayouts: [tlasBuildBgl] }), compute: { module: mod, entryPoint: "tlasBuildMain" }, }); + + // LBVH-build follow-up pipeline. Single workgroup of 1024 threads + // sorts instances by packed (morton16:idx16) keys (LSD radix) and + // refits a sweep-tree BVH. N_PADDED = 1024 ceiling for Stage 1. + // 0 entries (read-only) + // 1 outOrder (read-write, sorted permutation) + // 2 outBvh (read-write, 2*N_PADDED - 1 nodes × 32 bytes) + // 3 sortA (read-write, N_PADDED u32 ping-pong) + // 4 sortB (read-write, N_PADDED u32 ping-pong) + const lbvhMod = device.createShaderModule({ code: lbvhBuildWgsl, label: "rt-tlas-lbvh-build" }); + const lbvhBuildBgl = device.createBindGroupLayout({ entries: [ + { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, + { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, + { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, + { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } }, + { binding: 5, visibility: GPUShaderStage.COMPUTE, buffer: { type: "uniform" } }, + ]}); + rtState.lbvhBuildBgl = lbvhBuildBgl; + rtState.lbvhBuildPipeline = device.createComputePipeline({ + layout: device.createPipelineLayout({ bindGroupLayouts: [lbvhBuildBgl] }), + compute: { module: lbvhMod, entryPoint: "lbvhBuildMain" }, + }); + // Tiny uniform buffer for the LBVH's `nReal` field. Written each + // wgpuBuildTLAS call before dispatch. + rtState.lbvhCountBuf = device.createBuffer({ + size: 16, + usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST, + }); } function rtMeshRecordsEnsure(meshCount) { @@ -1808,14 +2547,25 @@ env.wgpuRegisterMeshBLAS = (minX, minY, minZ, maxX, maxY, maxZ, return handle; }; -env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => { - if (!rtState.tlasBuildPipeline) return; - const inst = buffers.get(instanceBufHandle); - const out = buffers.get(tlasOutBufHandle); - if (!inst || !out) { +env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle, + entryOrderHandle, mortonHandle, binsHandle, + bvhNodesHandle, sortABufHandle, sortBBufHandle) => { + if (!rtState.tlasBuildPipeline || !rtState.lbvhBuildPipeline) return; + const inst = buffers.get(instanceBufHandle); + const out = buffers.get(tlasOutBufHandle); + const order = buffers.get(entryOrderHandle); + const morton = buffers.get(mortonHandle); + const bvh = buffers.get(bvhNodesHandle); + const sortA = buffers.get(sortABufHandle); + const sortB = buffers.get(sortBBufHandle); + if (!inst || !out || !order || !morton || !bvh || !sortA || !sortB) { console.error("[crafter-wgpu] wgpuBuildTLAS: unknown buffer handle"); return; } + if (instanceCount > LBVH_MAX) { + console.error(`[crafter-wgpu] wgpuBuildTLAS: instance count ${instanceCount} > LBVH cap ${LBVH_MAX}`); + return; + } const bg = device.createBindGroup({ layout: rtState.tlasBuildBgl, @@ -1823,6 +2573,26 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => { { binding: 0, resource: { buffer: inst } }, { binding: 1, resource: { buffer: rtState.meshRecordsBuffer } }, { binding: 2, resource: { buffer: out } }, + { binding: 3, resource: { buffer: order } }, + { binding: 4, resource: { buffer: morton } }, + ], + }); + // Write the real instance count to the LBVH count uniform so the + // shader can iterate exactly the right number of entries even + // though the storage buffers stay sized for N_PADDED. + const countBuf = new Uint32Array(4); + countBuf[0] = instanceCount; + queue.writeBuffer(rtState.lbvhCountBuf, 0, countBuf); + + const lbvhBg = device.createBindGroup({ + layout: rtState.lbvhBuildBgl, + entries: [ + { binding: 0, resource: { buffer: out } }, + { binding: 1, resource: { buffer: order } }, + { binding: 2, resource: { buffer: bvh } }, + { binding: 3, resource: { buffer: sortA } }, + { binding: 4, resource: { buffer: sortB } }, + { binding: 5, resource: { buffer: rtState.lbvhCountBuf } }, ], }); @@ -1834,12 +2604,25 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => { } const enc = state.encoder || device.createCommandEncoder(); const ownEncoder = !state.encoder; - const pass = enc.beginComputePass({ label: "tlas-build" }); - pass.setPipeline(rtState.tlasBuildPipeline); - pass.setBindGroup(0, bg); - const groups = Math.ceil(instanceCount / 64); - pass.dispatchWorkgroups(groups, 1, 1); - pass.end(); + // Pass 1: TLAS entry build (existing). + { + const pass = enc.beginComputePass({ label: "tlas-build" }); + pass.setPipeline(rtState.tlasBuildPipeline); + pass.setBindGroup(0, bg); + const groups = Math.ceil(instanceCount / 64); + pass.dispatchWorkgroups(groups, 1, 1); + pass.end(); + } + // Pass 2: LBVH-build. Single workgroup of 1024 threads runs the + // entire BVH build (Morton, sort, sweep-tree refit) in shared + // memory. Same encoder → pipeline barrier sequences pass1 → pass2. + if (instanceCount > 0) { + const pass = enc.beginComputePass({ label: "tlas-lbvh-build" }); + pass.setPipeline(rtState.lbvhBuildPipeline); + pass.setBindGroup(0, lbvhBg); + pass.dispatchWorkgroups(1, 1, 1); + pass.end(); + } if (ownEncoder) { queue.submit([enc.finish()]); } else { @@ -1848,8 +2631,10 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => { // Publish so rayQuery-capable compute pipelines pick up the latest TLAS // without each dispatch having to thread the handle explicitly. - rtState.currentTlas = tlasOutBufHandle; + rtState.currentTlas = tlasOutBufHandle; rtState.currentTlasInstanceCount = instanceCount; + rtState.currentEntryOrder = entryOrderHandle; + rtState.currentBvh = bvhNodesHandle; }; // RT pipeline loader — wraps user-supplied WGSL (sources + generated mega @@ -1873,7 +2658,7 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { beforeHelpers = userPart.substring(0, mi); afterHelpers = userPart.substring(mi + marker.length); } - const fullWgsl = rtWgslPrelude + "\n" + beforeHelpers + "\n" + rtWgslHelpers + "\n" + afterHelpers; + const fullWgsl = rtWgslPrelude + "\n" + beforeHelpers + "\n" + rtWgslPureHelpers + "\n" + rtWgslMegakernelHelpers + "\n" + afterHelpers; // Parse user bindings (same wire format as wgpuLoadCustomShader). const userBindings = []; @@ -1918,6 +2703,8 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => { { binding: 6, visibility: GPUShaderStage.COMPUTE, storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } }, { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, ]}); // User binding-group layouts. WebGPU pipeline layouts need a // contiguous array up to the highest group used, so pad any gaps @@ -1976,6 +2763,12 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, entries: [{ binding: 0, resource: { buffer: rtState.rtHeader, offset: 0, size: 16 } }], }); const outView = state.outIsPing ? state.pingView : state.pongView; + const entryOrderBuf = buffers.get(rtState.currentEntryOrder); + const bvhBuf = buffers.get(rtState.currentBvh); + if (!entryOrderBuf || !bvhBuf) { + console.error("[crafter-wgpu] wgpuDispatchRT: missing entryOrder/bins (no TLAS built yet?)"); + return; + } const dataBg = device.createBindGroup({ layout: pipe.dataBgl, entries: [ @@ -1987,6 +2780,8 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, { binding: 5, resource: { buffer: rtState.primRemapHeap.gpu } }, { binding: 6, resource: outView }, { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } }, + { binding: 8, resource: { buffer: entryOrderBuf } }, + { binding: 9, resource: { buffer: bvhBuf } }, ], }); @@ -2025,7 +2820,279 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes, state.outIsPing = !state.outIsPing; }; +// ── Standalone compute pipelines ──────────────────────────────────────── +// +// Mirrors the native Vulkan `ComputeShader` API: user-authored compute +// pipelines that dispatch outside any UI render pass. Unlike +// wgpuLoadCustomShader (which lives inside the UI flow with ping/pong +// textures at @group(1)), these have NO library-supplied bindings — +// users declare all groups themselves. +// +// Layout contract: +// @group(0) @binding(0) uniform PushData // optional, only if pushUniformSize > 0 +// @group(1+) @binding(N) // user bindings via UICustomBinding +// +// rayQuery: same flag as wgpuLoadCustomShader. When set, prepends the +// RT prelude + rayQuery library; user shader must NOT redeclare the +// resulting @group(1) layout (tlas/bvh/mesh heaps). The `bindings` +// list must start at @group(2) in that case. + +const computePipelines = new Map(); // handle → { pipeline, bgls, byGroup, sortedGroups, pushUniformSize, rayQueryCapable } + +env.wgpuLoadComputePipeline = (wgslPtr, wgslLen, pushUniformSize, + bindingsPtr, bindingsCount, rayQueryFlag) => { + if (!rtState.vertHeap && rayQueryFlag) rtInit(); + const userWgsl = new TextDecoder().decode(memU8().subarray(wgslPtr, wgslPtr + wgslLen)); + const wgsl = rayQueryFlag + // rayQueryLib's _rqTraverseBlas/_rqTraverseTlas call _rtAabb, + // _rtFetchTri, _rtTri from rtWgslPureHelpers — must prepend + // the pure helper subset (NOT the megakernel-only traversal + // routines, which reference user-emitted runAnyHit/runMiss/ + // runClosestHit and won't compile outside the raygen pipeline). + ? (rtWgslTypes + rtWgslMegakernelBindings + rtWgslPureHelpers + rtWgslRayQueryLib + "\n" + userWgsl) + : userWgsl; + + const bindings = []; + if (bindingsCount > 0) { + const dv = new DataView(memU8().buffer, bindingsPtr, bindingsCount * 8); + for (let i = 0; i < bindingsCount; i++) { + bindings.push({ + group: dv.getUint8(i*8 + 0), + binding: dv.getUint8(i*8 + 1), + kind: dv.getUint8(i*8 + 2), + pushOffset: dv.getUint32(i*8 + 4, true), + }); + } + } + + // Bind-group layouts. The push uniform sits at @group(0) if present; + // user bindings start at @group(1) (or @group(2) when rayQuery is on + // — see rtWgslMegakernelBindings). + const bgls = []; + if (pushUniformSize > 0) { + bgls.push(device.createBindGroupLayout({ entries: [ + { binding: 0, visibility: GPUShaderStage.COMPUTE, + buffer: { type: "uniform", minBindingSize: pushUniformSize } }, + ]})); + } else { + bgls.push(device.createBindGroupLayout({ entries: [] })); + } + + if (rayQueryFlag) { + // group(1) is the megakernel RT bindings — same layout as the RT + // pipeline's group(1). The injected rayQuery library reads it. + bgls.push(device.createBindGroupLayout({ entries: [ + { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 5, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 6, visibility: GPUShaderStage.COMPUTE, + storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } }, + { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } }, + ]})); + } + + const byGroup = new Map(); + const userGroupBase = rayQueryFlag ? 2 : 1; + for (const b of bindings) { + if (b.group < userGroupBase) { + console.error(`[crafter-wgpu] compute pipeline: @group(${b.group}) reserved; user groups must be >= ${userGroupBase}`); + return 0; + } + if (!byGroup.has(b.group)) byGroup.set(b.group, []); + byGroup.get(b.group).push(b); + } + const sortedGroups = [...byGroup.keys()].sort((a, b) => a - b); + const highest = sortedGroups.length ? sortedGroups[sortedGroups.length - 1] : userGroupBase - 1; + for (let g = userGroupBase; g <= highest; g++) { + if (byGroup.has(g)) { + const entries = byGroup.get(g).map(b => { + const e = { binding: b.binding, visibility: GPUShaderStage.COMPUTE }; + if (b.kind === 0) e.buffer = { type: "read-only-storage" }; + else if (b.kind === 1) e.texture = { sampleType: "float", viewDimension: "2d" }; + else if (b.kind === 2) e.sampler = { type: "filtering" }; + else if (b.kind === 3) e.texture = { sampleType: "float", viewDimension: "2d-array" }; + else if (b.kind === 4) e.buffer = { type: "storage" }; // read-write storage + return e; + }); + bgls.push(device.createBindGroupLayout({ entries })); + } else { + bgls.push(device.createBindGroupLayout({ entries: [] })); + } + } + + let pipeline; + try { + const mod = device.createShaderModule({ code: wgsl, label: "compute-pipeline" }); + // Async compile-info to surface parse errors with line numbers. + mod.getCompilationInfo().then(info => { + const issues = info.messages.filter(m => m.type === "error" || m.type === "warning"); + if (issues.length === 0) return; + const lines = wgsl.split("\n"); + for (const m of issues) { + const ln = m.lineNum || 0, col = m.linePos || 0; + console[m.type === "error" ? "error" : "warn"]( + `[crafter-wgpu] compute-pipeline ${m.type} at ${ln}:${col}: ${m.message}`); + for (let i = Math.max(1, ln - 3); i <= Math.min(lines.length, ln + 3); i++) { + const marker = i === ln ? ">> " : " "; + console.log(`${marker}${String(i).padStart(4)}: ${lines[i - 1]}`); + } + } + }); + const layout = device.createPipelineLayout({ bindGroupLayouts: bgls }); + pipeline = device.createComputePipeline({ layout, compute: { module: mod, entryPoint: "main" } }); + } catch (e) { + console.error("[crafter-wgpu] compute pipeline compile failed:", e); + return 0; + } + + const handle = newHandle(); + computePipelines.set(handle, { + pipeline, bgls, byGroup, sortedGroups, + pushUniformSize, userGroupBase, + rayQueryCapable: !!rayQueryFlag, + }); + return handle; +}; + +// Dispatch a standalone compute pipeline. Works both inside the per-frame +// UI compute pass (mid-frame physics tick) and outside it (game-tick +// dispatch from update lambda) — matching the wgpuBuildTLAS pattern. +// +// Push data: if the pipeline declared a push uniform (pushUniformSize > 0 +// at load), `pushPtr` points at exactly that many bytes. We allocate a +// transient uniform buffer per dispatch — fine for the physics rate +// (handful of dispatches per substep, single-digit substeps per frame). +// +// Handles: parallel to the UICustomBinding[] declaration order at load +// time. Each handle resolves through the engine's buffer/texture/ +// sampler tables to the live WebGPU resource. +env.wgpuDispatchCompute = (pipelineHandle, pushPtr, pushBytes, + handlesPtr, handlesCount, gx, gy, gz) => { + const pipe = computePipelines.get(pipelineHandle); + if (!pipe) { + console.error("[crafter-wgpu] wgpuDispatchCompute: unknown pipeline", pipelineHandle); + return; + } + + // ── Push uniform: transient buffer per dispatch ──────────────────── + let pushBG = null; + if (pipe.pushUniformSize > 0) { + const buf = device.createBuffer({ + size: Math.max(16, (pipe.pushUniformSize + 15) & ~15), + usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST, + }); + if (pushBytes > 0) { + queue.writeBuffer(buf, 0, memU8().subarray(pushPtr, pushPtr + pushBytes)); + } + pushBG = device.createBindGroup({ + layout: pipe.bgls[0], + entries: [{ binding: 0, resource: { buffer: buf, offset: 0, size: pipe.pushUniformSize } }], + }); + } else { + pushBG = device.createBindGroup({ layout: pipe.bgls[0], entries: [] }); + } + + // ── RT bindings (rayQuery only) ──────────────────────────────────── + let rtBG = null; + if (pipe.rayQueryCapable) { + const tlasBuf = buffers.get(rtState.currentTlas); + const orderBuf = buffers.get(rtState.currentEntryOrder); + const bvhBuf = buffers.get(rtState.currentBvh); + if (!tlasBuf || !orderBuf || !bvhBuf) { + console.error("[crafter-wgpu] wgpuDispatchCompute rayQuery: no TLAS built yet"); + return; + } + const outView = state.outIsPing ? state.pingView : state.pongView; + rtBG = device.createBindGroup({ + layout: pipe.bgls[1], + entries: [ + { binding: 0, resource: { buffer: tlasBuf } }, + { binding: 1, resource: { buffer: rtState.bvhHeap.gpu } }, + { binding: 2, resource: { buffer: rtState.meshRecordsBuffer } }, + { binding: 3, resource: { buffer: rtState.vertHeap.gpu } }, + { binding: 4, resource: { buffer: rtState.indexHeap.gpu } }, + { binding: 5, resource: { buffer: rtState.primRemapHeap.gpu } }, + { binding: 6, resource: outView }, + { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } }, + { binding: 8, resource: { buffer: orderBuf } }, + { binding: 9, resource: { buffer: bvhBuf } }, + ], + }); + } + + // ── User bind groups from handles ────────────────────────────────── + const handleArr = new Uint32Array(memU8().buffer, handlesPtr, handlesCount); + const userBGs = []; + let handleCursor = 0; + const userGroupBase = pipe.userGroupBase; + const userBglStart = pipe.bgls.length - (pipe.sortedGroups.length + ? (pipe.sortedGroups[pipe.sortedGroups.length - 1] - userGroupBase + 1) + : 0); + let bglIdx = userBglStart; + if (pipe.sortedGroups.length > 0) { + for (let g = userGroupBase; g <= pipe.sortedGroups[pipe.sortedGroups.length - 1]; g++) { + if (pipe.byGroup.has(g)) { + const entries = pipe.byGroup.get(g).map(b => { + const h = handleArr[handleCursor++]; + let resource; + if (b.kind === 0 || b.kind === 4) resource = { buffer: buffers.get(h) }; + else if (b.kind === 1) resource = textureViews.get(h); + else if (b.kind === 2) resource = samplers.get(h); + else if (b.kind === 3) resource = textureViews.get(h); + return { binding: b.binding, resource }; + }); + userBGs.push(device.createBindGroup({ layout: pipe.bgls[bglIdx++], entries })); + } else { + userBGs.push(device.createBindGroup({ layout: pipe.bgls[bglIdx++], entries: [] })); + } + } + } + + // ── Execute: attach to active pass, or create a standalone one. ──── + const runDispatch = (pass) => { + pass.setPipeline(pipe.pipeline); + pass.setBindGroup(0, pushBG); + if (rtBG) pass.setBindGroup(1, rtBG); + for (let i = 0; i < userBGs.length; i++) { + pass.setBindGroup(userGroupBase + i, userBGs[i]); + } + pass.dispatchWorkgroups(gx, gy, gz); + }; + + if (state.pass) { + // Mid-frame: dispatch into the current UI compute pass. + runDispatch(state.pass); + } else { + // Standalone: build encoder + pass + submit. Mirrors wgpuBuildTLAS. + const enc = state.encoder || device.createCommandEncoder(); + const ownEncoder = !state.encoder; + const pass = enc.beginComputePass({ label: "plain-compute" }); + runDispatch(pass); + pass.end(); + if (ownEncoder) queue.submit([enc.finish()]); + } +}; + console.log("[crafter-wgpu] init complete; env handlers wired"); + +// Memory diagnostic. Logs handle-table sizes every 5s so a slow leak +// shows up as monotonic growth. Comment out for production builds. +setInterval(() => { + const m = (performance.memory && performance.memory.usedJSHeapSize) || 0; + console.log(`[crafter-wgpu] mem: js=${(m / 1024 / 1024).toFixed(1)}MB` + + ` buffers=${buffers.size}` + + ` textures=${textures.size}` + + ` samplers=${samplers.size}` + + ` customPipelines=${customPipelines.size}` + + ` rtPipelines=${rtPipelines.size}` + + ` computePipelines=${computePipelines.size}` + + ` bindGroupCache=${state.bindGroupCache ? state.bindGroupCache.size : 0}`); +}, 5000); } catch (e) { // Capture any throw so the stub error messages name the real cause // instead of "(no error captured)". Re-throw so runtime.js's catch diff --git a/implementations/Crafter.Graphics-Mesh-WebGPU.cpp b/implementations/Crafter.Graphics-Mesh-WebGPU.cpp index 5a61871..2ebd128 100644 --- a/implementations/Crafter.Graphics-Mesh-WebGPU.cpp +++ b/implementations/Crafter.Graphics-Mesh-WebGPU.cpp @@ -225,6 +225,7 @@ namespace { std::span indices, std::span attribsBytes) { mesh.triangleCount = static_cast(indices.size()) / 3; + mesh.vertexCount = static_cast(vertices.size()); Builder builder; builder.Build(vertices, indices); diff --git a/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp b/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp index 2195f85..ffa1b47 100644 --- a/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp +++ b/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp @@ -4,12 +4,21 @@ Copyright (C) 2026 Catcrafts® catcrafts.net */ -// DOM-mode TLAS upkeep. BuildTLAS copies the per-element RTInstance into -// the host-visible instance buffer (skipping the transform for elements -// whose transform is GPU-owned), uploads it, then dispatches the JS-side -// TLAS-build compute pass — which consults the per-BLAS records published -// at Mesh::Build() time to produce world-space AABBs and inverse -// transforms in the format `traceRay` / `rayQuery` consume. +// DOM-mode TLAS upkeep. BuildTLAS is split in two phases so a physics +// compute pass can run between them: +// - BuildTLASUpload mirrors the CPU-side RTInstance array into the +// host-visible instance buffer (with partial-write semantics that +// preserve the transform bytes for elements flagged +// transformOwnedByGpu, see notes in the body) and uploads the +// metadata buffer. +// - BuildTLASBuild dispatches the JS-side TLAS-build compute pass — +// which consults the per-BLAS records published at Mesh::Build() +// time to produce world-space AABBs and inverse transforms in the +// format `traceRay` / `rayQuery` consume. +// The combined BuildTLAS calls both back-to-back; callers that want to +// interleave a physics tlas-transform compute pass (which writes the +// transform bytes BuildTLASUpload leaves intact) call Upload + their +// compute pass + Build manually. module; module Crafter.Graphics:RenderingElement3D_implWebGPU; @@ -41,7 +50,7 @@ void RenderingElement3D::Remove(RenderingElement3D* e) { e->indexInElements = std::numeric_limits::max(); } -void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) { +void RenderingElement3D::BuildTLASUpload(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) { auto& tlas = tlases[index]; const std::uint32_t primitiveCount = static_cast(elements.size()); if (primitiveCount == 0) { @@ -49,19 +58,52 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_ return; } - // (Re)allocate instance + metadata + output TLAS buffers if the count - // changed. WebGPUBuffer::Resize destroys and recreates the GPU buffer; - // bind-group caches keyed on the buffer handle are invalidated in the - // JS bridge automatically. - if (primitiveCount != tlas.builtInstanceCount) { - tlas.instanceBuffer.Resize(primitiveCount); - tlas.metadataBuffer.Resize(primitiveCount); - // TLASEntry layout in WGSL is 144 bytes due to vec3 align/pad - // rules. Must match the struct declared in the rtWgslTypes - // block in additional/dom-webgpu.js. - tlas.buffer.Resize(primitiveCount * 144); + constexpr std::uint32_t kNPadded = 65536u; // size for instance / metadata mirrors + constexpr std::uint32_t kLbvhMax = 16384u; // matches N_PADDED in lbvhBuildWgsl + constexpr std::uint32_t kNodeCount = 2u * kNPadded - 1u; + + // ALL TLAS-side GPU buffers get allocated ONCE and never resized. + // The LBVH-build shader takes the real instance count via a uniform + // (lbvhPc.nReal) instead of arrayLength(&entries), so the + // tlas.buffer / entryOrder / mortonCodes don't need to grow when + // the application's element count changes. + // + // Why this matters: an earlier version resized these per-frame on + // primitiveCount change. The destroy+recreate cycle on the GPU + // buffer caused subtle mid-game flicker as soon as any element was + // added (e.g. firing a projectile) — fort braces would appear to + // briefly vanish in patterns deterministic on the projectile's + // angle. Suspected driver-level memory recycling without proper + // zero-init; the fixed-size allocation sidesteps it entirely. + if (tlas.instanceBuffer.handle == 0) { + tlas.instanceBuffer.Resize(kNPadded); + tlas.metadataBuffer.Resize(kNPadded); + tlas.bvhNodes.Resize(kNodeCount * 32u); + tlas.sortTempA.Resize(kNPadded * 4u); + tlas.sortTempB.Resize(kNPadded * 4u); + tlas.tlasBins.Resize(64 * 32); + // TLAS-entry / order / morton-code buffers: sized for the LBVH + // cap (16384). lbvhBuildMain iterates `lbvhPc.nReal` real + // entries; the remainder stays zero / sentinel. Keep these + // stable across element-count changes so the renderer's bind + // group references the same buffer handle every frame. + tlas.buffer.Resize(kLbvhMax * 144u); + tlas.entryOrder.Resize(kLbvhMax * 4u); + tlas.mortonCodes.Resize(kLbvhMax * 4u); } + // NB: tlas.buffer / entryOrder / mortonCodes get resized in + // BuildTLASBuild, NOT here. Resize destroys + recreates the GPU + // resource (and the JS-side handle); the rayQuery dispatches that + // run between BuildTLASUpload and BuildTLASBuild (projectile-collide, + // splash, builder-pick) still hold the previous frame's TLAS in + // rtState.current{Tlas,EntryOrder,Bvh}. If we resized here, those + // handles would point at destroyed buffers and the dispatches would + // log "no TLAS built yet" every frame the element count changed + // (e.g. every projectile fire). Resizing inside BuildTLASBuild, + // immediately before wgpuBuildTLAS publishes the new handles, keeps + // the JS-side current* refs in sync with the GPU resources. + for (std::uint32_t i = 0; i < primitiveCount; ++i) { auto& dst = tlas.instanceBuffer.value[i]; const auto& src = elements[i]->instance; @@ -80,12 +122,73 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_ tlas.metadataBuffer.value[i] = elements[i]->userMetadata; } - tlas.instanceBuffer.FlushDevice(); + // Upload the instance buffer with partial-write semantics: for runs + // of CPU-driven elements (transformOwnedByGpu=false) we push the + // whole 64-byte struct in one writeBuffer call; for GPU-driven runs + // we push only the trailing 16 metadata bytes per element, leaving + // the transform field intact for the physics-tlas-transform compute + // shader to update. The two arms below produce identical GPU state + // when every element is CPU-driven — this is a no-op refactor until + // 3DForts flips its physics elements to transformOwnedByGpu=true. + constexpr std::uint32_t kInstSize = sizeof(RTInstance); // 64 + constexpr std::uint32_t kTransformSize = sizeof(RTTransformMatrix); // 48 + constexpr std::uint32_t kMetaSize = kInstSize - kTransformSize; // 16 + + std::uint32_t runStart = 0; + bool runOwned = elements[0]->transformOwnedByGpu; + for (std::uint32_t i = 1; i <= primitiveCount; ++i) { + const bool atEnd = (i == primitiveCount); + const bool currOwned = atEnd ? !runOwned : elements[i]->transformOwnedByGpu; + if (currOwned == runOwned && !atEnd) continue; + + if (runOwned) { + // GPU-driven run — metadata only, per element. Cannot batch + // because the metadata bytes are non-contiguous in the + // instance buffer (one 16-byte chunk per 64-byte slot). + for (std::uint32_t j = runStart; j < i; ++j) { + const std::uint32_t off = j * kInstSize + kTransformSize; + tlas.instanceBuffer.FlushDeviceRange(off, off, kMetaSize); + } + } else { + // CPU-driven run — one contiguous writeBuffer. + const std::uint32_t startOff = runStart * kInstSize; + const std::uint32_t bytes = (i - runStart) * kInstSize; + tlas.instanceBuffer.FlushDeviceRange(startOff, startOff, bytes); + } + runStart = i; + runOwned = currOwned; + } + tlas.metadataBuffer.FlushDevice(); +} + +void RenderingElement3D::BuildTLASBuild(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) { + auto& tlas = tlases[index]; + const std::uint32_t primitiveCount = static_cast(elements.size()); + if (primitiveCount == 0) { + // Upload already cleared builtInstanceCount; nothing to dispatch. + return; + } + + // No per-count Resize. tlas.buffer / entryOrder / mortonCodes were + // allocated at kLbvhMax in BuildTLASUpload's first call and stay + // that size. The LBVH shader reads the real count from a uniform + // (lbvhPc.nReal) wgpuBuildTLAS writes each call. WebGPU::wgpuBuildTLAS(tlas.instanceBuffer.handle, static_cast(primitiveCount), - tlas.buffer.handle); + tlas.buffer.handle, + tlas.entryOrder.handle, + tlas.mortonCodes.handle, + tlas.tlasBins.handle, + tlas.bvhNodes.handle, + tlas.sortTempA.handle, + tlas.sortTempB.handle); tlas.builtInstanceCount = primitiveCount; } + +void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index) { + BuildTLASUpload(cmd, index); + BuildTLASBuild(cmd, index); +} diff --git a/implementations/Crafter.Graphics-UI-WebGPU.cpp b/implementations/Crafter.Graphics-UI-WebGPU.cpp index 6e619c5..2c46c7c 100644 --- a/implementations/Crafter.Graphics-UI-WebGPU.cpp +++ b/implementations/Crafter.Graphics-UI-WebGPU.cpp @@ -98,13 +98,9 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf if (itemCount == 0) return; UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx); auto handle = heap_->bufferTable[bufferSlot]; - // For DispatchImages, the WGSL expects a texture + sampler in group 3. - // The library v1 doesn't expose user-image registration on DOM (out of - // scope per plan). If the user calls DispatchImages without a registered - // image, fall back to using the font atlas binding — the user's items - // should reference texSlot/sampSlot but on DOM those are ignored. For - // now, route through the font atlas texture if available; otherwise - // skip the dispatch. + // Backward-compatible fallback: callers that don't pass a texture + // get the font atlas. Useful for tests, useless for real content. + // New code should use the 6-arg overload below. if (fontAtlasImageSlot_) { auto texHandle = heap_->imageTable[fontAtlasImageSlot_]; auto sampHandle = heap_->samplerTable[fontAtlasSamplerSlot_]; @@ -115,6 +111,21 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf } } +void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot, + std::uint32_t itemCount, + std::uint16_t imageSlot, std::uint16_t samplerSlot, + std::array clipRectPx) { + if (itemCount == 0) return; + UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx); + auto handle = heap_->bufferTable[bufferSlot]; + auto texHandle = heap_->imageTable[imageSlot]; + auto sampHandle = heap_->samplerTable[samplerSlot]; + WebGPU::wgpuDispatchImages(handle, &hdr, + static_cast(TilesFor(window_->width)), + static_cast(TilesFor(window_->height)), + texHandle, sampHandle); +} + void UIRenderer::DispatchText(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot, std::uint32_t itemCount, std::array clipRectPx) { @@ -168,6 +179,7 @@ void UIRenderer::Dispatch(GraphicsCommandBuffer /*cmd*/, const GraphicsComputeSh case UICustomBindingKind::Sampler: if (slot < heap_->samplerTable.size()) handle = heap_->samplerTable[slot]; break; + default: break; } handles.push_back(handle); } diff --git a/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm b/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm index 9e50d45..d122417 100644 --- a/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm +++ b/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm @@ -191,5 +191,13 @@ export namespace Crafter { heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearClampSampler(); return SamplerSlot(&heap, r.firstElement); } + + // Same as AllocateLinearClampSampler but the address modes are + // `repeat` instead of `clamp-to-edge`. Mip filtering is also linear. + inline SamplerSlot AllocateLinearRepeatSampler(DescriptorHeapWebGPU& heap) { + DescriptorRange r = heap.AllocateSamplerSlots(1); + heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearRepeatSampler(); + return SamplerSlot(&heap, r.firstElement); + } } #endif // CRAFTER_GRAPHICS_WINDOW_DOM diff --git a/interfaces/Crafter.Graphics-Image2D.cppm b/interfaces/Crafter.Graphics-Image2D.cppm index 01643e2..369d539 100644 --- a/interfaces/Crafter.Graphics-Image2D.cppm +++ b/interfaces/Crafter.Graphics-Image2D.cppm @@ -113,17 +113,30 @@ export namespace Crafter { std::uint16_t width = 0; std::uint16_t height = 0; std::uint16_t layers = 0; + std::uint8_t mipLevels = 1; - void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount) { - width = w; - height = h; - layers = layerCount; - handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount); + // Create an array with `layerCount` × (w × h) layers, each carrying + // `mipLevels` mip levels. Pass mipLevels=1 (default) for a single + // base level — matching the original no-mip behaviour. Caller is + // responsible for uploading each level via UpdateLayer (which + // handles CPU mip-chain generation when mipLevels > 1). + void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount, + std::uint8_t mipLevelCount = 1) { + width = w; + height = h; + layers = layerCount; + mipLevels = mipLevelCount; + handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount, mipLevelCount); } - // Decompress `tex` and upload to `layer`. The asset's dims must - // match the array's (w × h) — resize beforehand on the host with - // TextureAsset::Resize() if they don't. + // Decompress `tex`, generate a CPU box-filter mip chain (if + // mipLevels > 1), and upload each level into `layer`. The asset's + // base-level dims must match the array's (w × h) — resize + // beforehand on the host with TextureAsset::Resize() if + // they don't. Pixel data is treated as raw bytes per channel for + // the box filter — for non-color data (normal maps) this gives + // approximate but adequate results; for sRGB-encoded color data + // it's also approximate but visually fine for game textures. void UpdateLayer(std::uint16_t layer, const CompressedTextureAsset& tex) { if (tex.pixelStride != sizeof(PixelType)) { std::println(std::cerr, @@ -142,11 +155,56 @@ export namespace Crafter { std::as_writable_bytes(std::span(pixels)), }; Compression::DecompressCPU(tex.blob, outputs); + + // Upload level 0. WebGPU::wgpuWriteImage2DLayer( - handle, layer, + handle, layer, /*level*/ 0, pixels.data(), static_cast(pixels.size() * sizeof(PixelType)), width, height); + + // Generate + upload subsequent mip levels via a 2x2 box filter + // on the previous level's bytes. Each channel is averaged + // independently across 4 source texels. + std::uint16_t srcW = width; + std::uint16_t srcH = height; + std::vector prev = std::move(pixels); + for (std::uint8_t lvl = 1; lvl < mipLevels; ++lvl) { + std::uint16_t dstW = std::max(1, srcW >> 1); + std::uint16_t dstH = std::max(1, srcH >> 1); + std::vector next(static_cast(dstW) * dstH); + constexpr std::size_t kChannels = sizeof(PixelType); + auto srcBytes = reinterpret_cast(prev.data()); + auto dstBytes = reinterpret_cast(next.data()); + for (std::uint16_t y = 0; y < dstH; ++y) { + std::uint16_t sy0 = static_cast(y * 2); + std::uint16_t sy1 = static_cast(std::min(sy0 + 1, srcH - 1)); + for (std::uint16_t x = 0; x < dstW; ++x) { + std::uint16_t sx0 = static_cast(x * 2); + std::uint16_t sx1 = static_cast(std::min(sx0 + 1, srcW - 1)); + std::size_t a = (static_cast(sy0) * srcW + sx0) * kChannels; + std::size_t b = (static_cast(sy0) * srcW + sx1) * kChannels; + std::size_t c = (static_cast(sy1) * srcW + sx0) * kChannels; + std::size_t d = (static_cast(sy1) * srcW + sx1) * kChannels; + std::size_t out = (static_cast(y) * dstW + x) * kChannels; + for (std::size_t ch = 0; ch < kChannels; ++ch) { + std::uint32_t sum = static_cast(srcBytes[a + ch]) + + static_cast(srcBytes[b + ch]) + + static_cast(srcBytes[c + ch]) + + static_cast(srcBytes[d + ch]); + dstBytes[out + ch] = static_cast((sum + 2u) >> 2); + } + } + } + WebGPU::wgpuWriteImage2DLayer( + handle, layer, /*level*/ lvl, + next.data(), + static_cast(next.size() * sizeof(PixelType)), + dstW, dstH); + prev = std::move(next); + srcW = dstW; + srcH = dstH; + } } ImageSlot AllocateSlot(DescriptorHeapWebGPU& heap) { diff --git a/interfaces/Crafter.Graphics-InputField.cppm b/interfaces/Crafter.Graphics-InputField.cppm index 8e24897..0812858 100644 --- a/interfaces/Crafter.Graphics-InputField.cppm +++ b/interfaces/Crafter.Graphics-InputField.cppm @@ -18,10 +18,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ module; -#ifndef CRAFTER_GRAPHICS_WINDOW_DOM -#endif // !CRAFTER_GRAPHICS_WINDOW_DOM export module Crafter.Graphics:InputField; -#ifndef CRAFTER_GRAPHICS_WINDOW_DOM import std; import :Types; import :Keys; @@ -110,4 +107,3 @@ export namespace Crafter { const InputFieldColors& colors, bool caretVisible); } -#endif // !CRAFTER_GRAPHICS_WINDOW_DOM diff --git a/interfaces/Crafter.Graphics-Mesh.cppm b/interfaces/Crafter.Graphics-Mesh.cppm index 087d94f..4146912 100644 --- a/interfaces/Crafter.Graphics-Mesh.cppm +++ b/interfaces/Crafter.Graphics-Mesh.cppm @@ -97,6 +97,7 @@ export namespace Crafter { // sentinel; never returned by Build(). std::uint64_t blasAddr = 0; std::uint32_t triangleCount = 0; + std::uint32_t vertexCount = 0; bool opaque = true; diff --git a/interfaces/Crafter.Graphics-PlainComputeShader.cppm b/interfaces/Crafter.Graphics-PlainComputeShader.cppm new file mode 100644 index 0000000..e8957f3 --- /dev/null +++ b/interfaces/Crafter.Graphics-PlainComputeShader.cppm @@ -0,0 +1,113 @@ +/* +Crafter®.Graphics +Copyright (C) 2026 Catcrafts® +catcrafts.net + +This library is free software; you can redistribute it and/or +modify it under the terms of the GNU Lesser General Public +License version 3.0 as published by the Free Software Foundation; +*/ + +// Standalone compute pipeline. Dispatches at any point in the frame +// (inside or outside the UI render pass) via the JS bridge's +// wgpuDispatchCompute, which mirrors the wgpuBuildTLAS pattern of +// attaching to the active encoder when one exists or creating an +// ephemeral encoder+submit when not. +// +// This is the WebGPU counterpart to the Vulkan `:ComputeShader` partition. +// They expose the same conceptual API — Load + Dispatch — but with +// backend-specific binding plumbing. See `:GraphicsTypes` for the +// `GraphicsComputeShader` alias picking the right one per target. +// +// WGSL contract: +// @group(0) @binding(0) uniform PushData // optional; only if pushUniformSize>0 +// @group(1+) @binding(N) // user bindings via UICustomBinding +// When rayQuery is on, @group(1) is reserved for the RT heap; user +// bindings start at @group(2). + +module; +export module Crafter.Graphics:PlainComputeShader; +#ifdef CRAFTER_GRAPHICS_WINDOW_DOM +import std; +import :WebGPU; +import :WebGPUComputeShader; // for UICustomBinding + UICustomBindingKind + +export namespace Crafter { + class PlainComputeShader { + public: + std::uint32_t pipelineHandle = 0; + std::uint32_t pushUniformSize = 0; + bool rayQueryCapable = false; + std::vector customBindings; + + PlainComputeShader() = default; + PlainComputeShader(const PlainComputeShader&) = delete; + PlainComputeShader& operator=(const PlainComputeShader&) = delete; + PlainComputeShader(PlainComputeShader&& o) noexcept + : pipelineHandle(o.pipelineHandle), + pushUniformSize(o.pushUniformSize), + rayQueryCapable(o.rayQueryCapable), + customBindings(std::move(o.customBindings)) { + o.pipelineHandle = 0; + } + + // Compile + link a standalone compute shader. + // wgsl — source. + // pushUniformSize — byte size of the @group(0)@binding(0) uniform + // struct, or 0 if the shader doesn't declare one. + // bindings — every user-declared resource the dispatch + // should bind (groups 1+ if no rayQuery, 2+ if + // rayQuery). Order MUST match `handles` at + // Dispatch time. + // rayQuery — prepend the RT prelude + rayQuery library + // so the shader can call `rayQuery*` helpers. + void Load(std::string_view wgsl, + std::uint32_t pushUniformSize_, + std::span bindings = {}, + bool rayQuery = false) { + pushUniformSize = pushUniformSize_; + rayQueryCapable = rayQuery; + customBindings.assign(bindings.begin(), bindings.end()); + pipelineHandle = WebGPU::wgpuLoadComputePipeline( + wgsl.data(), static_cast(wgsl.size()), + static_cast(pushUniformSize), + customBindings.empty() ? nullptr : customBindings.data(), + static_cast(customBindings.size()), + rayQuery ? 1 : 0); + } + + void Load(const std::filesystem::path& wgslPath, + std::uint32_t pushUniformSize_, + std::span bindings = {}, + bool rayQuery = false) { + std::ifstream f(wgslPath, std::ios::binary); + if (!f) { + std::println(std::cerr, + "PlainComputeShader::Load: cannot open {}", wgslPath.string()); + std::abort(); + } + std::string wgsl((std::istreambuf_iterator(f)), + std::istreambuf_iterator()); + Load(std::string_view{wgsl}, pushUniformSize_, bindings, rayQuery); + } + + // Bind, push, dispatch. `handles` is parallel to the + // UICustomBinding[] passed at Load — order matches. + void Dispatch(const void* push, std::uint32_t pushBytes, + std::span handles, + std::uint32_t gx, + std::uint32_t gy = 1, + std::uint32_t gz = 1) const { + if (pipelineHandle == 0) return; + WebGPU::wgpuDispatchCompute( + pipelineHandle, + push, static_cast(pushBytes), + handles.empty() ? nullptr : handles.data(), + static_cast(handles.size()), + static_cast(gx), + static_cast(gy), + static_cast(gz)); + } + }; +} +#endif // CRAFTER_GRAPHICS_WINDOW_DOM diff --git a/interfaces/Crafter.Graphics-RenderingElement3D.cppm b/interfaces/Crafter.Graphics-RenderingElement3D.cppm index 9756745..6dcdd92 100644 --- a/interfaces/Crafter.Graphics-RenderingElement3D.cppm +++ b/interfaces/Crafter.Graphics-RenderingElement3D.cppm @@ -121,6 +121,37 @@ export namespace Crafter { // customIndex (4) + _pad (12). Defined in the WGSL traversal // library; never directly read by C++. WebGPUBuffer buffer; + // GPU LBVH support — see additional/dom-webgpu.js's TLAS-build + // pipeline. + // + // entryOrder: per-frame permutation array of u32, indexing into + // `buffer` (the TLASEntry[] array). Populated by the radix-sort + // pass to spatially-coherent Morton order, then consumed by the + // BVH construction + traversal passes. In Stage 1 (this + // baseline) it's the identity permutation written by + // tlasBuildMain alongside the entries. + WebGPUBuffer entryOrder; + // mortonCodes: per-instance 32-bit Morton codes computed from the + // world-AABB centroid, used as the radix-sort key. Written by + // tlasBuildMain. + WebGPUBuffer mortonCodes; + // bvhNodes: 2N_PADDED - 1 sweep-tree BVH nodes built per frame + // by the LBVH-build compute pass. Each node 32 bytes (aabbMin + + // pad, aabbMax + pad). N_PADDED = 65536 (hardcoded in WGSL). + // Internal nodes [0, N_PADDED-1); leaves [N_PADDED-1, 2*N_PADDED-1). + // Node i's children are 2i+1, 2i+2 (implicit perfect binary + // tree). Cap: 65536 instances per scene. + WebGPUBuffer bvhNodes; + // tlasBins: dead, kept allocated as a 64-byte placeholder so the + // existing wgpuBuildTLAS C++ signature doesn't need a churn. + // The pre-LBVH 64-bin partition was replaced by the full BVH. + WebGPUBuffer tlasBins; + // Sort ping-pong buffers for the radix sort. Each pass reads + // from one and writes to the other, swapping role. Layout per + // element: 1 u32 packed key = (morton16 << 16) | tlasIndex16. + // Sized for N_PADDED. + WebGPUBuffer sortTempA; + WebGPUBuffer sortTempB; std::uint32_t builtInstanceCount = 0; }; @@ -141,6 +172,17 @@ export namespace Crafter { // a fresh build (no refit) — the GPU build pass is cheap at the // ~10–100 instance counts the design targets; LBVH-for-TLAS is a // future optimization for larger scenes. + // + // BuildTLAS is now split into Upload + Build so a physics + // compute pass (e.g. physics-tlas-transform) can run between the + // CPU mirror upload and the GPU LBVH build. The compute pass + // writes the per-instance transform bytes that BuildTLAS leaves + // intact for elements flagged transformOwnedByGpu, and those + // writes have to land before the LBVH reads them. The combined + // BuildTLAS is kept as a convenience for callers that don't + // interleave a compute pass (e.g. the ctor-time first build). + static void BuildTLASUpload(WebGPUCommandEncoderRef cmd, std::uint32_t index); + static void BuildTLASBuild(WebGPUCommandEncoderRef cmd, std::uint32_t index); static void BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index); static void Add(RenderingElement3D* e); diff --git a/interfaces/Crafter.Graphics-UI.cppm b/interfaces/Crafter.Graphics-UI.cppm index 6dc0d36..dead5c7 100644 --- a/interfaces/Crafter.Graphics-UI.cppm +++ b/interfaces/Crafter.Graphics-UI.cppm @@ -165,6 +165,18 @@ export namespace Crafter { std::array clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f}); void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount, std::array clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f}); +#ifdef CRAFTER_GRAPHICS_WINDOW_DOM + // WebGPU-only overload. WebGPU bind groups can only carry one + // texture/sampler per dispatch, so all items in `bufferSlot` + // share the same texture (`imageSlot`) and sampler (`samplerSlot`). + // The per-item `slots` field in ImageItem is ignored on this + // backend. On Vulkan the bindless heap resolves per-item slots, + // so the cross-backend path is to call the 4-arg overload above + // on native and this 6-arg overload on DOM. + void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount, + std::uint16_t imageSlot, std::uint16_t samplerSlot, + std::array clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f}); +#endif void DispatchText(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount, std::array clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f}); diff --git a/interfaces/Crafter.Graphics-WebGPU.cppm b/interfaces/Crafter.Graphics-WebGPU.cppm index 799c8c0..dc695fe 100644 --- a/interfaces/Crafter.Graphics-WebGPU.cppm +++ b/interfaces/Crafter.Graphics-WebGPU.cppm @@ -35,6 +35,40 @@ namespace Crafter::WebGPU { extern "C" std::uint32_t wgpuCreateBuffer(std::int32_t byteSize); __attribute__((import_module("env"), import_name("wgpuWriteBuffer"))) extern "C" void wgpuWriteBuffer(std::uint32_t handle, const void* srcPtr, std::int32_t byteSize); + __attribute__((import_module("env"), import_name("wgpuWriteBufferRange"))) + extern "C" void wgpuWriteBufferRange(std::uint32_t handle, + std::uint32_t dstByteOffset, + const void* srcPtr, + std::int32_t byteSize); + // Kick off a GPU→CPU readback for the entire `byteSize`-byte prefix + // of the buffer at `handle`. Returns immediately; the actual map + // resolves asynchronously. Successive Enqueues without a Poll in + // between are no-ops until the previous map resolves. + // + // `resetBytes` ≥ 0 — if non-zero, the JS bridge encodes a + // clearBuffer over the first `resetBytes` bytes of the source + // buffer immediately after the copy, in the same command encoder. + // Used by Forts3D's GPU event queues to zero the atomic-add count + // for the next frame's substeps. The reset is TIED to a successful + // enqueue: if the enqueue was skipped (previous map still pending), + // the reset is skipped too — so events written by substeps during + // the missed-drain window accumulate into the next successful + // capture instead of being silently wiped. + __attribute__((import_module("env"), import_name("wgpuReadbackEnqueue"))) + extern "C" void wgpuReadbackEnqueue(std::uint32_t handle, + std::int32_t byteSize, + std::int32_t resetBytes); + // Poll a previously-enqueued readback. Returns 1 and writes the + // bytes into `dstPtr` if the map resolved; returns 0 otherwise. + __attribute__((import_module("env"), import_name("wgpuReadbackPoll"))) + extern "C" std::int32_t wgpuReadbackPoll(std::uint32_t handle, void* dstPtr, std::int32_t byteSize); + // Non-consuming readiness probe. Returns 1 if the readback has + // resolved and the next Poll would succeed; returns 0 otherwise. + // Used to gate multi-buffer drains (header + array) so neither side + // gets consumed until both are ready — otherwise the consumed side's + // data is lost while the other side waits for its map to resolve. + __attribute__((import_module("env"), import_name("wgpuReadbackReady"))) + extern "C" std::int32_t wgpuReadbackReady(std::uint32_t handle); __attribute__((import_module("env"), import_name("wgpuDestroyBuffer"))) extern "C" void wgpuDestroyBuffer(std::uint32_t handle); @@ -64,15 +98,26 @@ namespace Crafter::WebGPU { // Used by Image2DArray to stack per-material albedos for one // multi-material scene. __attribute__((import_module("env"), import_name("wgpuCreateImage2DArray"))) - extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h, std::int32_t layerCount); + extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h, + std::int32_t layerCount, std::int32_t mipLevels); + // Upload a single mip level for one array layer. `level` indexes into + // the texture's mip chain (0 = base); `w` / `h` must be the dimensions + // at that level. Callers pass each level's pixels separately — mip + // generation is host-side. __attribute__((import_module("env"), import_name("wgpuWriteImage2DLayer"))) - extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer, + extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer, std::int32_t level, const void* srcPtr, std::int32_t byteSize, std::int32_t w, std::int32_t h); __attribute__((import_module("env"), import_name("wgpuCreateLinearClampSampler"))) extern "C" std::uint32_t wgpuCreateLinearClampSampler(); + // Linear-filtered, repeat-addressed sampler with mipmap linear-filter. + // The usual choice for tiled material textures (woodBrace, panel, etc.) + // which expect UV > 1.0 to wrap. + __attribute__((import_module("env"), import_name("wgpuCreateLinearRepeatSampler"))) + extern "C" std::uint32_t wgpuCreateLinearRepeatSampler(); + __attribute__((import_module("env"), import_name("wgpuFrameBegin"))) extern "C" void wgpuFrameBegin(); __attribute__((import_module("env"), import_name("wgpuFrameEnd"))) @@ -158,12 +203,56 @@ namespace Crafter::WebGPU { std::int32_t gx, std::int32_t gy, const void* handlesPtr, std::int32_t handlesCount); - // GPU TLAS-build dispatch. Reads the instance buffer (host-uploaded or - // GPU-written), produces per-instance world-space AABBs + per-instance - // transform matrices in a flat tlasBuf SSBO consumed by traceRay / rayQuery. + // GPU TLAS-build dispatch. Two sequential compute passes: + // 1. tlasBuildMain — per-instance world AABB + identity permutation + // + naive Morton (overwritten in pass 2). Outputs the flat + // tlasBuf SSBO consumed by traceRay / rayQuery. + // 2. lbvhBuildMain — single workgroup of 1024 threads; reduces + // scene AABB, recomputes Morton with proper normalization, + // bitonic-sorts (morton, instance_id), writes the sorted + // permutation into `entryOrderBufHandle`, and refits a + // sweep-tree BVH into `bvhNodesBufHandle` bottom-up. + // Pre-LBVH bin-build is gone; `binsBufHandle` is kept in the + // signature as a placeholder so the C++ side doesn't churn. __attribute__((import_module("env"), import_name("wgpuBuildTLAS"))) extern "C" void wgpuBuildTLAS(std::uint32_t instanceBufHandle, std::int32_t instanceCount, - std::uint32_t tlasOutBufHandle); + std::uint32_t tlasOutBufHandle, + std::uint32_t entryOrderBufHandle, + std::uint32_t mortonBufHandle, + std::uint32_t binsBufHandle, + std::uint32_t bvhNodesBufHandle, + std::uint32_t sortTempABufHandle, + std::uint32_t sortTempBBufHandle); + + // ── Standalone compute pipelines ─────────────────────────────────── + // + // Mirror of the native ComputeShader API: load a user-authored + // compute WGSL with arbitrary @group bindings, dispatch it at any + // point in the frame (inside or outside the UI compute pass — + // physics ticks dispatch from update lambdas, which fire outside + // the per-frame render encoder). + // + // WGSL contract: + // @group(0) @binding(0) — uniform PushData (optional; only if + // pushUniformSize > 0 at load). + // @group(1+) @binding(N) — user bindings declared via + // UICustomBinding[]. When rayQuery is + // on, @group(1) is reserved for the RT + // heap and user bindings start at + // @group(2). + __attribute__((import_module("env"), import_name("wgpuLoadComputePipeline"))) + extern "C" std::uint32_t wgpuLoadComputePipeline( + const void* wgslPtr, std::int32_t wgslLen, + std::int32_t pushUniformSize, + const void* bindingsPtr, std::int32_t bindingsCount, + std::int32_t rayQueryFlag); + + __attribute__((import_module("env"), import_name("wgpuDispatchCompute"))) + extern "C" void wgpuDispatchCompute( + std::uint32_t pipelineHandle, + const void* pushPtr, std::int32_t pushBytes, + const void* handlesPtr, std::int32_t handlesCount, + std::int32_t gx, std::int32_t gy, std::int32_t gz); } #endif // CRAFTER_GRAPHICS_WINDOW_DOM diff --git a/interfaces/Crafter.Graphics-WebGPUBuffer.cppm b/interfaces/Crafter.Graphics-WebGPUBuffer.cppm index 9273625..4dec146 100644 --- a/interfaces/Crafter.Graphics-WebGPUBuffer.cppm +++ b/interfaces/Crafter.Graphics-WebGPUBuffer.cppm @@ -78,6 +78,60 @@ export namespace Crafter { void FlushDevice() requires(Mapped) { WebGPU::wgpuWriteBuffer(handle, this->value, static_cast(size)); } + // Partial upload — write the bytes [srcByteOffset, srcByteOffset+byteCount) + // of the host mirror to GPU offset `dstByteOffset`. BuildTLAS uses + // this to leave the GPU-owned transform field of an RTInstance + // intact (the physics-tlas-transform compute shader is its sole + // writer) while still pushing the CPU-side metadata fields. + void FlushDeviceRange(std::uint32_t dstByteOffset, + std::uint32_t srcByteOffset, + std::uint32_t byteCount) requires(Mapped) { + const auto* base = reinterpret_cast(this->value); + WebGPU::wgpuWriteBufferRange(handle, dstByteOffset, + base + srcByteOffset, + static_cast(byteCount)); + } + + // Push one element's worth of bytes from the host mirror to GPU. + // Use when a single SoA slot was mutated (body construction, + // per-instance flag flip) and a full FlushDevice would clobber + // the GPU-side updates the sim has applied to neighboring slots. + void FlushDeviceSlot(std::uint32_t idx) requires(Mapped) { + constexpr std::uint32_t kStride = sizeof(T); + const std::uint32_t off = idx * kStride; + FlushDeviceRange(off, off, kStride); + } + + // Schedule a GPU→CPU readback of this buffer's entire contents. + // Asynchronous; data isn't ready until a later PollReadback + // returns true. Successive Enqueues without a Poll are dropped + // — they're a no-op while the previous map is in flight. + // + // `resetBytes` ≥ 0 — if non-zero, the first `resetBytes` bytes + // of THIS buffer are clearBuffer-cleared on the GPU command + // encoder immediately after the copy, so the readback captures + // the pre-clear bytes and the next frame's writers see zeros. + // The reset is tied to a successful enqueue (skipped enqueue = + // skipped reset), preserving accumulated state across missed + // drains. + void EnqueueReadback(std::uint32_t resetBytes = 0) { + WebGPU::wgpuReadbackEnqueue(handle, + static_cast(size), + static_cast(resetBytes)); + } + // Try to copy the readback bytes into this->value. Returns true + // if the previous EnqueueReadback resolved and the data is now + // mirrored into .value; false if the map is still pending. + bool PollReadback() requires(Mapped) { + return WebGPU::wgpuReadbackPoll(handle, this->value, + static_cast(size)) != 0; + } + // Non-consuming readiness probe. Returns true if a subsequent + // PollReadback would succeed without changing state otherwise. + // Use to verify a sibling buffer is also ready before consuming. + bool IsReadbackReady() const { + return WebGPU::wgpuReadbackReady(handle) != 0; + } ~WebGPUBuffer() { Clear(); } }; diff --git a/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm b/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm index 9726d76..68a4983 100644 --- a/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm +++ b/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm @@ -36,6 +36,11 @@ export namespace Crafter { SampledTexture = 1, // sampled texture_2d, handle is a slot into heap.imageTable Sampler = 2, // filtering sampler, handle is a slot into heap.samplerTable SampledTextureArray = 3, // sampled texture_2d_array, handle is a slot into heap.imageTable + // read-write storage SSBO (var in WGSL). Use + // for buffers shaders need to MUTATE — e.g. physics shaders that + // integrate node momentum, write brace stress, or output TLAS + // instance transforms. + BufferReadWrite = 4, }; struct UICustomBinding { diff --git a/interfaces/Crafter.Graphics.cppm b/interfaces/Crafter.Graphics.cppm index 5c94db9..5d136d9 100644 --- a/interfaces/Crafter.Graphics.cppm +++ b/interfaces/Crafter.Graphics.cppm @@ -71,5 +71,6 @@ export import :WebGPU; export import :WebGPUBuffer; export import :DescriptorHeapWebGPU; export import :WebGPUComputeShader; +export import :PlainComputeShader; export import :ShaderBindingTableWebGPU; export import :PipelineRTWebGPU; diff --git a/project.cpp b/project.cpp index e1fd83e..65bcb44 100644 --- a/project.cpp +++ b/project.cpp @@ -123,7 +123,7 @@ extern "C" Configuration CrafterBuildProject(std::span a // when its body is gated out. Vulkan-typed partitions stub to empty // modules under CRAFTER_GRAPHICS_WINDOW_DOM; the Dom/DomEvents/Router // partitions stub to empty modules in the opposite direction. - std::array ifaces = { + std::array ifaces = { "interfaces/Crafter.Graphics", "interfaces/Crafter.Graphics-Animation", "interfaces/Crafter.Graphics-Clipboard", @@ -147,6 +147,7 @@ extern "C" Configuration CrafterBuildProject(std::span a "interfaces/Crafter.Graphics-Mesh", "interfaces/Crafter.Graphics-PipelineRTVulkan", "interfaces/Crafter.Graphics-PipelineRTWebGPU", + "interfaces/Crafter.Graphics-PlainComputeShader", "interfaces/Crafter.Graphics-RenderingElement3D", "interfaces/Crafter.Graphics-RenderPass", "interfaces/Crafter.Graphics-Router", @@ -170,14 +171,16 @@ extern "C" Configuration CrafterBuildProject(std::span a if (dom) { // DOM impl set. UI-Shared.cpp is backend-agnostic; UI-WebGPU.cpp // is the DOM-only implementation of UIRenderer's GPU-touching - // methods. Font / FontAtlas / UIComponents are now portable. - std::array domImpls = { + // methods. Font / FontAtlas / UIComponents / InputField are now + // portable. + std::array domImpls = { "implementations/Crafter.Graphics-Clipboard", "implementations/Crafter.Graphics-Dom", "implementations/Crafter.Graphics-Font", "implementations/Crafter.Graphics-FontAtlas", "implementations/Crafter.Graphics-Gamepad", "implementations/Crafter.Graphics-Input", + "implementations/Crafter.Graphics-InputField", "implementations/Crafter.Graphics-Mesh-WebGPU", "implementations/Crafter.Graphics-PipelineRTWebGPU", "implementations/Crafter.Graphics-RenderingElement3D-WebGPU",