webgpu improvements

2026-05-24 13:32:08 +02:00 · 8347467e1e
commit 8347467e1e
parent 5a75571ffd
18 changed files with 1932 additions and 153 deletions
--- a/TODO-lbvh-sort.md
+++ b/TODO-lbvh-sort.md
@ -0,0 +1,106 @@
+# LBVH parallel radix sort: count-dependent corruption
+
+## Summary
+
+The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
+incorrect output that depends on the input distribution. Symptom: geometry in
+the BVH-built TLAS appears to flicker (instances missing or pointing at the
+wrong entry) as soon as a small object enters the TLAS alongside a tight
+cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
+
+Bisected by selectively skipping each LBVH phase. Skipping only the radix
+sort eliminates the corruption — every other phase (scene-AABB reduce,
+Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
+
+Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
+leaves are in instance-index order with no spatial coherence. The BVH still
+builds correctly and traversal still descends a real tree, just with looser
+parent AABBs.
+
+## What we know
+
+- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
+- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
+  `0xFFFFFFFF`.
+- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
+  Hillis-Steele exclusive prefix scan to compute per-thread destination
+  offsets.
+- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
+- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
+  steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
+  inclusive prefix sum.
+- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
+  + localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
+  increments per-element within the thread.
+- All required barriers are present:
+  - `workgroupBarrier` between scan iterations.
+  - `workgroupBarrier` at end of each bucket iteration.
+  - `storageBarrier` at end of each radix pass.
+
+## What we suspect
+
+The bug is likely one of:
+
+1. **WGSL implementation issue** in the specific browser/driver:
+   `workgroupBarrier` semantics around `atomicLoad` on workgroup memory, or
+   around single-buffered Hillis-Steele, where one thread reads
+   `shScan[tid - offset]` while a neighbor writes `shScan[tid]`. This is a
+   standard pattern, but the spec is subtle.
+2. **Memory model edge case** triggered only with very unbalanced histograms
+   (e.g. bucket 15 holding ~94% of entries because almost everything is
+   sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
+   buckets and exactly 15-16 for bucket 15; that mix may surface a
+   compiler-introduced reordering.
+3. **A logical bug in the scan or scatter** that human review keeps
+   missing. Re-reading the code has stopped paying off; what's needed is
+   a GPU-side trace.
+
+## Reproducing
+
+1. Run 3DForts WebGPU build with normal projectile firing.
+2. Aim near (not necessarily at) the fort.
+3. Observe braces / panels flickering as the projectile flies past.
+
+## Diagnostic strategies if revisiting
+
+1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
+   entries × a few u32). Have each thread write its intermediate scan
+   values and final scatter destinations there. Read back to CPU and diff
+   against an expected oracle (CPU-computed reference sort of the same
+   input keys).
+2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
+   sort already corrupt, or does corruption only emerge after multiple
+   ping-pongs?
+3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
+   scan or a `subgroupExclusiveAdd` variant where available. If the
+   replacement fixes it, the bug is in the Hillis-Steele specifically.
+4. **Serialize the scatter.** Have thread 0 do all scatters by itself
+   (loop over all 16384 entries × 16 buckets sequentially). Slow but a
+   provably-correct reference. If this fixes the flicker, the parallel
+   scatter has the bug.
+5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
+   bitonic works, radix has a structural problem.
+
+## Why it's not blocking
+
+At the current scale (~1011 entries), the BVH still functions:
+
+- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
+  tree very cheaply (~1 AABB test per skipped subtree).
+- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
+  which are real AABB tests. Without spatial coherence the AABBs are
+  looser than a properly-sorted BVH, but they still bound the geometry.
+- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
+  of the per-pixel cost.
+
+Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
+entries this stops being acceptable and the sort needs to actually work.
+
+## Acceptance criteria for "fixed"
+
+- The diagnostic repro (3DForts: fire a projectile near the fort) shows
+  no flicker at all.
+- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
+- A unit test (CPU oracle vs GPU output) passes for at least three
+  histogram distributions: all-uniform, all-in-one-bucket, and the
+  3DForts-style "one small object next to a tight cluster".
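
Review note on TODO-lbvh-sort.md: a minimal sketch of the CPU oracle that diagnostic strategy 1 and the acceptance criteria call for, written in JavaScript to match additional/dom-webgpu.js. The key layout, the sentinel value, and the 8 × 4-bit pass structure come from the TODO; the names `packKey`, `radixSortReference`, and `firstMismatch` are hypothetical, and the GPU read-back step is left out.

```javascript
// Constants taken from the TODO; helper names are illustrative only.
const SENTINEL = 0xFFFFFFFF;      // key used for padded slots (i >= n)
const RADIX_BITS = 4;
const PASSES = 8;                 // 8 passes x 4 bits = 32-bit key
const BUCKETS = 1 << RADIX_BITS;

// Pack a 16-bit Morton code and a 16-bit TLAS index into one u32 key.
function packKey(morton16, tlasIndex16) {
    return (((morton16 & 0xFFFF) << 16) | (tlasIndex16 & 0xFFFF)) >>> 0;
}

// Stable LSD radix sort over u32 keys, done sequentially on the CPU.
function radixSortReference(keys) {
    let src = Uint32Array.from(keys);
    let dst = new Uint32Array(src.length);
    for (let pass = 0; pass < PASSES; pass++) {
        const shift = pass * RADIX_BITS;
        const counts = new Uint32Array(BUCKETS);
        for (const k of src) counts[(k >>> shift) & (BUCKETS - 1)]++;
        // Exclusive prefix sum turns per-bucket counts into start offsets.
        const offsets = new Uint32Array(BUCKETS);
        for (let b = 1; b < BUCKETS; b++) offsets[b] = offsets[b - 1] + counts[b - 1];
        // In-order scatter keeps equal digits in their previous order.
        for (const k of src) dst[offsets[(k >>> shift) & (BUCKETS - 1)]++] = k;
        [src, dst] = [dst, src];  // ping-pong, like the GPU passes
    }
    return src;
}

// Index of the first key where the GPU output disagrees, or -1 if none.
function firstMismatch(gpuKeys, inputKeys) {
    const expected = radixSortReference(inputKeys);
    for (let i = 0; i < expected.length; i++) {
        if (gpuKeys[i] !== expected[i]) return i;
    }
    return -1;
}
```

Running `firstMismatch` on the read-back key buffer for each of the three histogram distributions listed in the acceptance criteria would give the unit test a concrete failing index instead of a visual flicker.
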
--- a/additional/dom-env.js
+++ b/additional/dom-env.js
@ -168,15 +168,25 @@ function setValue(cookie, valPtr, valLen) {
 // so removeEventListener can re-find it. C++-side handler id counters
 // are per-kind, so a per-kind suffix is what makes the keys unique.

+// devicePixelRatio scaling factor. dom-webgpu.js sets window.crafter_dpr
+// during its canvas sync so this side and the GPU side agree on a single
+// physical-pixel coordinate space. Fall back to the live DPR if no GPU
+// bridge ran (pure-CppDOM apps), and finally to 1 so non-HiDPI
+// browsers behave as before.
+function __dpr() {
+    return window.crafter_dpr || window.devicePixelRatio || 1;
+}
+
 function __makeMouseListenerPair(kind, eventName, exportName) {
    return {
        add(cookie, id) {
            const el = __jsmemory.get(cookie);
            if (!el) return;
            const handler = (event) => {
+                const s = __dpr();
                __wasm()[exportName](id,
-                    event.clientX, event.clientY,
-                    event.screenX, event.screenY,
+                    event.clientX * s, event.clientY * s,
+                    event.screenX * s, event.screenY * s,
                    event.button, event.buttons,
                    event.altKey, event.ctrlKey, event.shiftKey, event.metaKey);
            };
@ -317,7 +327,10 @@ const __resizePair = {
    // Resize is window-global in CppDOM. Mirror that: attach to `window`
    // regardless of which element the C++ caller passed.
    add(cookie, id) {
-        const handler = () => __wasm().ExecuteResizeHandler(id, window.innerWidth, window.innerHeight);
+        const handler = () => {
+            const s = __dpr();
+            __wasm().ExecuteResizeHandler(id, window.innerWidth * s, window.innerHeight * s);
+        };
        __listenerHandlers.set(`${cookie}-${id}-resize`, handler);
        window.addEventListener("resize", handler);
    },
@ -345,9 +358,10 @@ const __wheelPair = {
    add(cookie, id) {
        const el = __jsmemory.get(cookie); if (!el) return;
        const handler = (event) => {
+            const s = __dpr();
            __wasm().ExecuteWheelHandler(id,
                event.deltaX, event.deltaY, event.deltaZ, event.deltaMode,
-                event.clientX, event.clientY, event.screenX, event.screenY,
+                event.clientX * s, event.clientY * s, event.screenX * s, event.screenY * s,
                event.button, event.buttons,
                event.altKey, event.ctrlKey, event.shiftKey, event.metaKey);
        };
@ -378,11 +392,97 @@ function domAttachWindow(windowHandle) {
        if (fn) fn(__windowAttachedHandle, ...args);
    };

-    __windowListeners.mousemove = (e) => fire("__crafterDom_mouseMove", [e.clientX, e.clientY]);
-    __windowListeners.mousedown = (e) => fire("__crafterDom_mouseDown", [e.button]);
-    __windowListeners.mouseup   = (e) => fire("__crafterDom_mouseUp",   [e.button]);
+    // Synthetic absolute position for pointer-lock mode. While the
+    // pointer is locked, browsers fire mousemove events with movementX/Y
+    // deltas instead of meaningful clientX/Y, and the cursor is hidden +
+    // captured by the canvas (no window-edge clamp). We accumulate the
+    // deltas into a synthetic position and feed *that* to the C++ side,
+    // so the existing `currentMousePos - lastMousePos` delta computation
+    // keeps working unchanged. Initialised to the cursor position the
+    // moment lock is acquired.
+    let __ptrLockSyntheticX = 0;
+    let __ptrLockSyntheticY = 0;
+    const __isPointerLocked = () =>
+        document.pointerLockElement !== null &&
+        document.pointerLockElement !== undefined;
+
+    // pointermove (not mousemove) so we can pull sub-frame events out of
+    // `getCoalescedEvents()`. Browsers normally collapse multiple raw
+    // mouse events between paint frames into a single event you'd see
+    // via `mousemove`; PointerEvent.getCoalescedEvents() returns the raw
+    // pre-coalesced list. Summing those gives a higher-resolution delta
+    // per frame than the single coalesced movementX/Y. PointerEvent also
+    // delivers fractional movementX from high-precision mice on Chromium.
+    __windowListeners.mousemove = (e) => {
+        const s = __dpr();
+        const locked = __isPointerLocked();
+        if (locked) {
+            // Accumulate over every sub-frame event the browser had
+            // queued up. `getCoalescedEvents` is the spec-correct way
+            // to access raw input between rAF ticks. Some browsers
+            // return an empty list — fall back to the top-level event.
+            let dx = 0, dy = 0;
+            const sub = (typeof e.getCoalescedEvents === "function")
+                ? e.getCoalescedEvents() : null;
+            if (sub && sub.length > 0) {
+                for (let i = 0; i < sub.length; i++) {
+                    dx += sub[i].movementX;
+                    dy += sub[i].movementY;
+                }
+            } else {
+                dx = e.movementX;
+                dy = e.movementY;
+            }
+            // No DPR scaling in pointer-lock: position is synthetic and
+            // there's no UI hit-test using it. DPR-scaling here only
+            // rounds finer movements up to multiples of `dpr`, which is
+            // pure quantization loss for aim controls.
+            __ptrLockSyntheticX += dx;
+            __ptrLockSyntheticY += dy;
+            fire("__crafterDom_mouseMove",
+                 [__ptrLockSyntheticX, __ptrLockSyntheticY]);
+        } else {
+            fire("__crafterDom_mouseMove", [e.clientX * s, e.clientY * s]);
+        }
+    };
+    __windowListeners.mousedown = (e) => {
+        // Right-click holds engage pointer lock — typical FPS-camera
+        // convention. Acquiring on any click (the previous policy) made
+        // menus annoying: clicking a button hid the cursor mid-flow. Now
+        // the cursor stays free for clicks/menus until the user holds
+        // RMB to actively look around. Browsers require lock requests
+        // from user gestures, which mousedown satisfies.
+        if (e.button === 2 && !__isPointerLocked()) {
+            const target = document.body;
+            if (target && target.requestPointerLock) {
+                target.requestPointerLock();
+                // Seed from the click point in the DPR-scaled space the
+                // unlocked path reports, so the lock start causes no jump.
+                __ptrLockSyntheticX = e.clientX * __dpr();
+                __ptrLockSyntheticY = e.clientY * __dpr();
+            }
+        }
+        fire("__crafterDom_mouseDown", [e.button]);
+    };
+    __windowListeners.mouseup = (e) => {
+        // Release lock on RMB up — cursor reappears at the seed point
+        // for clicks/menus until the next RMB hold.
+        if (e.button === 2 && __isPointerLocked()) {
+            document.exitPointerLock();
+        }
+        fire("__crafterDom_mouseUp", [e.button]);
+    };
    __windowListeners.wheel     = (e) => fire("__crafterDom_wheel",     [e.deltaY]);
    __windowListeners.contextmenu = (e) => { e.preventDefault(); };
+    __windowListeners.pointerlockchange = () => {
+        // Reset the synthetic accumulator when lock is released so the
+        // next acquisition starts cleanly. The C++ side will see one
+        // small jump back to the real cursor position on release.
+        if (!__isPointerLocked()) {
+            __ptrLockSyntheticX = 0;
+            __ptrLockSyntheticY = 0;
+        }
+    };

    // Keyboard events go through the document so they fire even when no
    // input element is focused. event.code is the layout-independent
@ -400,16 +500,24 @@ function domAttachWindow(windowHandle) {
        __wasm().WasmFree(codePtr);
    };

-    __windowListeners.resize = () => fire("__crafterDom_resize", [window.innerWidth, window.innerHeight]);
+    __windowListeners.resize = () => {
+        const s = __dpr();
+        fire("__crafterDom_resize", [window.innerWidth * s, window.innerHeight * s]);
+    };
    __windowListeners.beforeunload = () => fire("__crafterDom_close", []);

-    document.addEventListener("mousemove",   __windowListeners.mousemove);
+    // pointermove (not mousemove) so the handler receives PointerEvents
+    // and can use getCoalescedEvents() to recover sub-frame motion. The
+    // handler's variable name stays "mousemove" — it's the same JS object,
+    // just bound to a different event type.
+    document.addEventListener("pointermove", __windowListeners.mousemove);
    document.addEventListener("mousedown",   __windowListeners.mousedown);
    document.addEventListener("mouseup",     __windowListeners.mouseup);
    document.addEventListener("wheel",       __windowListeners.wheel);
    document.addEventListener("contextmenu", __windowListeners.contextmenu);
    document.addEventListener("keydown",     __windowListeners.keydown);
    document.addEventListener("keyup",       __windowListeners.keyup);
+    document.addEventListener("pointerlockchange", __windowListeners.pointerlockchange);
    window  .addEventListener("resize",      __windowListeners.resize);
    window  .addEventListener("beforeunload",__windowListeners.beforeunload);
 }
@ -418,8 +526,8 @@ function domSetTitle(titlePtr, titleLen) {
    document.title = __readUtf8(titlePtr, titleLen);
 }

-function domGetInnerWidth()  { return window.innerWidth;  }
-function domGetInnerHeight() { return window.innerHeight; }
+function domGetInnerWidth()  { return Math.round(window.innerWidth  * __dpr()); }
+function domGetInnerHeight() { return Math.round(window.innerHeight * __dpr()); }

 // ─── requestAnimationFrame loop ───────────────────────────────────────

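
Review note on additional/dom-env.js: the pointer-lock path above can be isolated into a small, DOM-free sketch for testing. It sums `movementX`/`movementY` over `getCoalescedEvents()` when the list is non-empty and otherwise falls back to the top-level event, as the handler does. The names `sumMovement` and `makePointerTracker` are hypothetical, and seeding in DPR-scaled coordinates is an assumption of this sketch, chosen so the locked and unlocked paths share one coordinate space.

```javascript
// Sum raw movement over the coalesced sub-events of a pointermove event,
// falling back to the event's own movementX/Y when none are available.
function sumMovement(e) {
    const sub = (typeof e.getCoalescedEvents === "function")
        ? e.getCoalescedEvents() : [];
    if (sub.length === 0) return { dx: e.movementX, dy: e.movementY };
    let dx = 0, dy = 0;
    for (const s of sub) {
        dx += s.movementX;
        dy += s.movementY;
    }
    return { dx, dy };
}

// Hypothetical tracker: returns the [x, y] that would be sent to the C++ side.
// getDpr and isLocked are injected so the logic runs without a browser.
function makePointerTracker(getDpr, isLocked) {
    let x = 0, y = 0;
    return {
        // Assumption: seed in the same scaled space as the unlocked path.
        seed(clientX, clientY) {
            x = clientX * getDpr();
            y = clientY * getDpr();
        },
        onMove(e) {
            if (isLocked()) {
                // Locked: accumulate unscaled deltas into the synthetic position.
                const { dx, dy } = sumMovement(e);
                x += dx;
                y += dy;
                return [x, y];
            }
            // Unlocked: report the real cursor position in physical pixels.
            const s = getDpr();
            return [e.clientX * s, e.clientY * s];
        },
    };
}
```

Injecting `getDpr` and `isLocked` keeps the accumulation logic independent of `window` and `document`, so it can be exercised with plain objects standing in for PointerEvents.
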
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
--- a/implementations/Crafter.Graphics-Mesh-WebGPU.cpp
+++ b/implementations/Crafter.Graphics-Mesh-WebGPU.cpp
@ -225,6 +225,7 @@ namespace {
                             std::span<const std::uint32_t>       indices,
                             std::span<const std::byte>           attribsBytes) {
        mesh.triangleCount = static_cast<std::uint32_t>(indices.size()) / 3;
+        mesh.vertexCount   = static_cast<std::uint32_t>(vertices.size());

        Builder builder;
        builder.Build(vertices, indices);
--- a/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp
+++ b/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp
@ -4,12 +4,21 @@ Copyright (C) 2026 Catcrafts®
 catcrafts.net
 */

-// DOM-mode TLAS upkeep. BuildTLAS copies the per-element RTInstance into
-// the host-visible instance buffer (skipping the transform for elements
-// whose transform is GPU-owned), uploads it, then dispatches the JS-side
-// TLAS-build compute pass — which consults the per-BLAS records published
-// at Mesh::Build() time to produce world-space AABBs and inverse
-// transforms in the format `traceRay` / `rayQuery` consume.
+// DOM-mode TLAS upkeep. BuildTLAS is split in two phases so a physics
+// compute pass can run between them:
+//   - BuildTLASUpload mirrors the CPU-side RTInstance array into the
+//     host-visible instance buffer (with partial-write semantics that
+//     preserve the transform bytes for elements flagged
+//     transformOwnedByGpu, see notes in the body) and uploads the
+//     metadata buffer.
+//   - BuildTLASBuild dispatches the JS-side TLAS-build compute pass —
+//     which consults the per-BLAS records published at Mesh::Build()
+//     time to produce world-space AABBs and inverse transforms in the
+//     format `traceRay` / `rayQuery` consume.
+// The combined BuildTLAS calls both back-to-back; callers that want to
+// interleave a physics tlas-transform compute pass (which writes the
+// transform bytes BuildTLASUpload leaves intact) call Upload + their
+// compute pass + Build manually.

 module;
 module Crafter.Graphics:RenderingElement3D_implWebGPU;
@ -41,7 +50,7 @@ void RenderingElement3D::Remove(RenderingElement3D* e) {
    e->indexInElements = std::numeric_limits<std::uint32_t>::max();
 }

-void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
+void RenderingElement3D::BuildTLASUpload(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
    auto& tlas = tlases[index];
    const std::uint32_t primitiveCount = static_cast<std::uint32_t>(elements.size());
    if (primitiveCount == 0) {
@ -49,19 +58,52 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_
        return;
    }

-    // (Re)allocate instance + metadata + output TLAS buffers if the count
-    // changed. WebGPUBuffer::Resize destroys and recreates the GPU buffer;
-    // bind-group caches keyed on the buffer handle are invalidated in the
-    // JS bridge automatically.
-    if (primitiveCount != tlas.builtInstanceCount) {
-        tlas.instanceBuffer.Resize(primitiveCount);
-        tlas.metadataBuffer.Resize(primitiveCount);
-        // TLASEntry layout in WGSL is 144 bytes due to vec3 align/pad
-        // rules. Must match the struct declared in the rtWgslTypes
-        // block in additional/dom-webgpu.js.
-        tlas.buffer.Resize(primitiveCount * 144);
+    constexpr std::uint32_t kNPadded   = 65536u;     // size for instance / metadata mirrors
+    constexpr std::uint32_t kLbvhMax   = 16384u;     // matches N_PADDED in lbvhBuildWgsl
+    constexpr std::uint32_t kNodeCount = 2u * kNPadded - 1u;
+
+    // ALL TLAS-side GPU buffers get allocated ONCE and never resized.
+    // The LBVH-build shader takes the real instance count via a uniform
+    // (lbvhPc.nReal) instead of arrayLength(&entries), so the
+    // tlas.buffer / entryOrder / mortonCodes don't need to grow when
+    // the application's element count changes.
+    //
+    // Why this matters: an earlier version resized these per-frame on
+    // primitiveCount change. The destroy+recreate cycle on the GPU
+    // buffer caused subtle mid-game flicker as soon as any element was
+    // added (e.g. firing a projectile) — fort braces would appear to
+    // briefly vanish in patterns deterministic on the projectile's
+    // angle. Suspected driver-level memory recycling without proper
+    // zero-init; the fixed-size allocation sidesteps it entirely.
+    if (tlas.instanceBuffer.handle == 0) {
+        tlas.instanceBuffer.Resize(kNPadded);
+        tlas.metadataBuffer.Resize(kNPadded);
+        tlas.bvhNodes.Resize(kNodeCount * 32u);
+        tlas.sortTempA.Resize(kNPadded * 4u);
+        tlas.sortTempB.Resize(kNPadded * 4u);
+        tlas.tlasBins.Resize(64 * 32);
+        // TLAS-entry / order / morton-code buffers: sized for the LBVH
+        // cap (16384). lbvhBuildMain iterates `lbvhPc.nReal` real
+        // entries; the remainder stays zero / sentinel. Keep these
+        // stable across element-count changes so the renderer's bind
+        // group references the same buffer handle every frame.
+        tlas.buffer.Resize(kLbvhMax * 144u);
+        tlas.entryOrder.Resize(kLbvhMax * 4u);
+        tlas.mortonCodes.Resize(kLbvhMax * 4u);
    }

+    // NB: tlas.buffer / entryOrder / mortonCodes are allocated once in the
+    // block above and never resized afterwards, neither here nor in
+    // BuildTLASBuild. Resize destroys + recreates the GPU resource (and
+    // the JS-side handle); the rayQuery dispatches that run between
+    // BuildTLASUpload and BuildTLASBuild (projectile-collide, splash,
+    // builder-pick) still hold the previous frame's TLAS in
+    // rtState.current{Tlas,EntryOrder,Bvh}. If we resized on a count
+    // change, those handles would point at destroyed buffers and the
+    // dispatches would log "no TLAS built yet" every frame the element
+    // count changed (e.g. every projectile fire). Fixed-size buffers keep
+    // the JS-side current* refs valid across frames.
+
    for (std::uint32_t i = 0; i < primitiveCount; ++i) {
        auto& dst = tlas.instanceBuffer.value[i];
        const auto& src = elements[i]->instance;
@ -80,12 +122,73 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_
        tlas.metadataBuffer.value[i] = elements[i]->userMetadata;
    }

-    tlas.instanceBuffer.FlushDevice();
+    // Upload the instance buffer with partial-write semantics: for runs
+    // of CPU-driven elements (transformOwnedByGpu=false) we push the
+    // whole 64-byte struct in one writeBuffer call; for GPU-driven runs
+    // we push only the trailing 16 metadata bytes per element, leaving
+    // the transform field intact for the physics-tlas-transform compute
+    // shader to update. The two arms below produce identical GPU state
+    // when every element is CPU-driven — this is a no-op refactor until
+    // 3DForts flips its physics elements to transformOwnedByGpu=true.
+    constexpr std::uint32_t kInstSize      = sizeof(RTInstance);          // 64
+    constexpr std::uint32_t kTransformSize = sizeof(RTTransformMatrix);   // 48
+    constexpr std::uint32_t kMetaSize      = kInstSize - kTransformSize;  // 16
+
+    std::uint32_t runStart = 0;
+    bool runOwned = elements[0]->transformOwnedByGpu;
+    for (std::uint32_t i = 1; i <= primitiveCount; ++i) {
+        const bool atEnd     = (i == primitiveCount);
+        const bool currOwned = atEnd ? !runOwned : elements[i]->transformOwnedByGpu;
+        if (currOwned == runOwned && !atEnd) continue;
+
+        if (runOwned) {
+            // GPU-driven run — metadata only, per element. Cannot batch
+            // because the metadata bytes are non-contiguous in the
+            // instance buffer (one 16-byte chunk per 64-byte slot).
+            for (std::uint32_t j = runStart; j < i; ++j) {
+                const std::uint32_t off = j * kInstSize + kTransformSize;
+                tlas.instanceBuffer.FlushDeviceRange(off, off, kMetaSize);
+            }
+        } else {
+            // CPU-driven run — one contiguous writeBuffer.
+            const std::uint32_t startOff = runStart * kInstSize;
+            const std::uint32_t bytes    = (i - runStart) * kInstSize;
+            tlas.instanceBuffer.FlushDeviceRange(startOff, startOff, bytes);
+        }
+        runStart = i;
+        runOwned = currOwned;
+    }
+
    tlas.metadataBuffer.FlushDevice();
+}
+
+void RenderingElement3D::BuildTLASBuild(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
+    auto& tlas = tlases[index];
+    const std::uint32_t primitiveCount = static_cast<std::uint32_t>(elements.size());
+    if (primitiveCount == 0) {
+        // Upload already cleared builtInstanceCount; nothing to dispatch.
+        return;
+    }
+
+    // No per-count Resize. tlas.buffer / entryOrder / mortonCodes were
+    // allocated at kLbvhMax in BuildTLASUpload's first call and stay
+    // that size. The LBVH shader reads the real count from a uniform
+    // (lbvhPc.nReal) wgpuBuildTLAS writes each call.

    WebGPU::wgpuBuildTLAS(tlas.instanceBuffer.handle,
                          static_cast<std::int32_t>(primitiveCount),
-                          tlas.buffer.handle);
+                          tlas.buffer.handle,
+                          tlas.entryOrder.handle,
+                          tlas.mortonCodes.handle,
+                          tlas.tlasBins.handle,
+                          tlas.bvhNodes.handle,
+                          tlas.sortTempA.handle,
+                          tlas.sortTempB.handle);

    tlas.builtInstanceCount = primitiveCount;
 }
+
+void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index) {
+    BuildTLASUpload(cmd, index);
+    BuildTLASBuild(cmd, index);
+}
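
Review note on BuildTLASUpload: the run-splitting upload loop can be checked in isolation by computing the list of byte ranges it would write. The sketch below is a JavaScript transcription for illustration only; the sizes 64 and 48 come from the `sizeof` comments in the diff, `planInstanceWrites` is a hypothetical name, and nothing is assumed about the real `FlushDeviceRange` signature.

```javascript
const INST_SIZE = 64;                            // sizeof(RTInstance) per the diff
const TRANSFORM_SIZE = 48;                       // sizeof(RTTransformMatrix) per the diff
const META_SIZE = INST_SIZE - TRANSFORM_SIZE;    // trailing 16 metadata bytes

// Given one transformOwnedByGpu flag per element, return the byte ranges
// that would be written to the instance buffer, in order.
function planInstanceWrites(ownedByGpu) {
    const writes = [];
    let runStart = 0;
    for (let i = 1; i <= ownedByGpu.length; i++) {
        // Keep extending the run while the flag is unchanged.
        if (i < ownedByGpu.length && ownedByGpu[i] === ownedByGpu[runStart]) continue;
        if (ownedByGpu[runStart]) {
            // GPU-driven run: metadata only, one small write per element,
            // so the transform bytes stay untouched.
            for (let j = runStart; j < i; j++) {
                writes.push({ offset: j * INST_SIZE + TRANSFORM_SIZE, bytes: META_SIZE });
            }
        } else {
            // CPU-driven run: one contiguous write covering whole structs.
            writes.push({ offset: runStart * INST_SIZE, bytes: (i - runStart) * INST_SIZE });
        }
        runStart = i;
    }
    return writes;
}
```

For flags `[false, false, true, true, false]` this yields one 128-byte write at offset 0, two 16-byte writes at offsets 176 and 240, and one 64-byte write at offset 256. With every flag `false` it collapses to a single full-buffer write, which matches the comment's claim that the refactor is a no-op for all-CPU scenes.
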
--- a/implementations/Crafter.Graphics-UI-WebGPU.cpp
+++ b/implementations/Crafter.Graphics-UI-WebGPU.cpp
@ -98,13 +98,9 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf
    if (itemCount == 0) return;
    UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx);
    auto handle = heap_->bufferTable[bufferSlot];
-    // For DispatchImages, the WGSL expects a texture + sampler in group 3.
-    // The library v1 doesn't expose user-image registration on DOM (out of
-    // scope per plan). If the user calls DispatchImages without a registered
-    // image, fall back to using the font atlas binding — the user's items
-    // should reference texSlot/sampSlot but on DOM those are ignored. For
-    // now, route through the font atlas texture if available; otherwise
-    // skip the dispatch.
+    // Backward-compatible fallback: callers that don't pass a texture
+    // get the font atlas. Useful for tests, useless for real content.
+    // New code should use the 6-arg overload below.
    if (fontAtlasImageSlot_) {
        auto texHandle  = heap_->imageTable[fontAtlasImageSlot_];
        auto sampHandle = heap_->samplerTable[fontAtlasSamplerSlot_];
@ -115,6 +111,21 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf
    }
 }

+void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot,
+                                std::uint32_t itemCount,
+                                std::uint16_t imageSlot, std::uint16_t samplerSlot,
+                                std::array<float,4> clipRectPx) {
+    if (itemCount == 0) return;
+    UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx);
+    auto handle     = heap_->bufferTable[bufferSlot];
+    auto texHandle  = heap_->imageTable[imageSlot];
+    auto sampHandle = heap_->samplerTable[samplerSlot];
+    WebGPU::wgpuDispatchImages(handle, &hdr,
+        static_cast<std::int32_t>(TilesFor(window_->width)),
+        static_cast<std::int32_t>(TilesFor(window_->height)),
+        texHandle, sampHandle);
+}
+
 void UIRenderer::DispatchText(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot,
                              std::uint32_t itemCount,
                              std::array<float,4> clipRectPx) {
@ -168,6 +179,7 @@ void UIRenderer::Dispatch(GraphicsCommandBuffer /*cmd*/, const GraphicsComputeSh
            case UICustomBindingKind::Sampler:
                if (slot < heap_->samplerTable.size()) handle = heap_->samplerTable[slot];
                break;
+            default: break;
        }
        handles.push_back(handle);
    }
--- a/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm
+++ b/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm
@ -191,5 +191,13 @@ export namespace Crafter {
        heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearClampSampler();
        return SamplerSlot(&heap, r.firstElement);
    }
+
+    // Same as AllocateLinearClampSampler but the address modes are
+    // `repeat` instead of `clamp-to-edge`. Mip filtering is also linear.
+    inline SamplerSlot AllocateLinearRepeatSampler(DescriptorHeapWebGPU& heap) {
+        DescriptorRange r = heap.AllocateSamplerSlots(1);
+        heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearRepeatSampler();
+        return SamplerSlot(&heap, r.firstElement);
+    }
 }
 #endif // CRAFTER_GRAPHICS_WINDOW_DOM
--- a/interfaces/Crafter.Graphics-Image2D.cppm
+++ b/interfaces/Crafter.Graphics-Image2D.cppm
@ -113,17 +113,30 @@ export namespace Crafter {
        std::uint16_t width  = 0;
        std::uint16_t height = 0;
        std::uint16_t layers = 0;
+        std::uint8_t  mipLevels = 1;

-        void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount) {
+        // Create an array of `layerCount` layers, each w × h texels with
+        // `mipLevelCount` mip levels. Pass mipLevelCount=1 (the default) for
+        // a single base level, matching the original no-mip behaviour. The
+        // caller is responsible for uploading each layer via UpdateLayer
+        // (which generates the CPU mip chain when mipLevels > 1).
+        void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount,
+                    std::uint8_t mipLevelCount = 1) {
            width     = w;
            height    = h;
            layers    = layerCount;
-            handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount);
+            mipLevels = mipLevelCount;
+            handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount, mipLevelCount);
        }

-        // Decompress `tex` and upload to `layer`. The asset's dims must
-        // match the array's (w × h) — resize beforehand on the host with
-        // TextureAsset<RGBA8>::Resize() if they don't.
+        // Decompress `tex`, generate a CPU box-filter mip chain (if
+        // mipLevels > 1), and upload each level into `layer`. The asset's
+        // base-level dims must match the array's (w × h) — resize
+        // beforehand on the host with TextureAsset<RGBA8>::Resize() if
+        // they don't. The box filter averages raw bytes, so it assumes
+        // 8-bit channels. For non-color data (normal maps) and for
+        // sRGB-encoded color data the result is approximate, but
+        // adequate for game textures.
        void UpdateLayer(std::uint16_t layer, const CompressedTextureAsset& tex) {
            if (tex.pixelStride != sizeof(PixelType)) {
                std::println(std::cerr,
@ -142,11 +155,56 @@ export namespace Crafter {
                std::as_writable_bytes(std::span(pixels)),
            };
            Compression::DecompressCPU(tex.blob, outputs);
+
+            // Upload level 0.
            WebGPU::wgpuWriteImage2DLayer(
-                handle, layer,
+                handle, layer, /*level*/ 0,
                pixels.data(),
                static_cast<std::int32_t>(pixels.size() * sizeof(PixelType)),
                width, height);
+
+            // Generate + upload subsequent mip levels via a 2x2 box filter
+            // on the previous level's bytes. Each channel is averaged
+            // independently across 4 source texels.
+            std::uint16_t srcW = width;
+            std::uint16_t srcH = height;
+            std::vector<PixelType> prev = std::move(pixels);
+            for (std::uint8_t lvl = 1; lvl < mipLevels; ++lvl) {
+                std::uint16_t dstW = std::max<std::uint16_t>(1, srcW >> 1);
+                std::uint16_t dstH = std::max<std::uint16_t>(1, srcH >> 1);
+                std::vector<PixelType> next(static_cast<std::size_t>(dstW) * dstH);
+                constexpr std::size_t kChannels = sizeof(PixelType);
+                auto srcBytes = reinterpret_cast<const std::uint8_t*>(prev.data());
+                auto dstBytes = reinterpret_cast<std::uint8_t*>(next.data());
+                for (std::uint16_t y = 0; y < dstH; ++y) {
+                    std::uint16_t sy0 = static_cast<std::uint16_t>(y * 2);
+                    std::uint16_t sy1 = static_cast<std::uint16_t>(std::min<std::int32_t>(sy0 + 1, srcH - 1));
+                    for (std::uint16_t x = 0; x < dstW; ++x) {
+                        std::uint16_t sx0 = static_cast<std::uint16_t>(x * 2);
+                        std::uint16_t sx1 = static_cast<std::uint16_t>(std::min<std::int32_t>(sx0 + 1, srcW - 1));
+                        std::size_t a = (static_cast<std::size_t>(sy0) * srcW + sx0) * kChannels;
+                        std::size_t b = (static_cast<std::size_t>(sy0) * srcW + sx1) * kChannels;
+                        std::size_t c = (static_cast<std::size_t>(sy1) * srcW + sx0) * kChannels;
+                        std::size_t d = (static_cast<std::size_t>(sy1) * srcW + sx1) * kChannels;
+                        std::size_t out = (static_cast<std::size_t>(y) * dstW + x) * kChannels;
+                        for (std::size_t ch = 0; ch < kChannels; ++ch) {
+                            std::uint32_t sum = static_cast<std::uint32_t>(srcBytes[a + ch])
+                                              + static_cast<std::uint32_t>(srcBytes[b + ch])
+                                              + static_cast<std::uint32_t>(srcBytes[c + ch])
+                                              + static_cast<std::uint32_t>(srcBytes[d + ch]);
+                            dstBytes[out + ch] = static_cast<std::uint8_t>((sum + 2u) >> 2);
+                        }
+                    }
+                }
+                WebGPU::wgpuWriteImage2DLayer(
+                    handle, layer, /*level*/ lvl,
+                    next.data(),
+                    static_cast<std::int32_t>(next.size() * sizeof(PixelType)),
+                    dstW, dstH);
+                prev = std::move(next);
+                srcW = dstW;
+                srcH = dstH;
+            }
        }

        ImageSlot AllocateSlot(DescriptorHeapWebGPU& heap) {
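Review note: the mip loop added to `UpdateLayer` can be read in isolation as the sketch below. It assumes tightly packed pixels with one byte per channel, as the diff's `kChannels = sizeof(PixelType)` does. `DownsampleBox2x2` is a hypothetical free function for illustration, not something this commit adds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the commit): produce the next mip level from
// tightly packed 8-bit-per-channel pixels using a 2x2 box filter.
inline std::vector<std::uint8_t> DownsampleBox2x2(const std::vector<std::uint8_t>& src,
                                                  std::size_t srcW, std::size_t srcH,
                                                  std::size_t channels) {
    assert(srcW > 0 && srcH > 0 && channels > 0);
    assert(src.size() == srcW * srcH * channels);
    // Each level halves both dimensions, clamped so a 1-wide or 1-tall
    // level stays at 1 instead of collapsing to 0.
    const std::size_t dstW = std::max<std::size_t>(1, srcW / 2);
    const std::size_t dstH = std::max<std::size_t>(1, srcH / 2);
    std::vector<std::uint8_t> dst(dstW * dstH * channels);
    for (std::size_t y = 0; y < dstH; ++y) {
        const std::size_t sy0 = y * 2;
        const std::size_t sy1 = std::min(sy0 + 1, srcH - 1);  // clamp at the edge
        for (std::size_t x = 0; x < dstW; ++x) {
            const std::size_t sx0 = x * 2;
            const std::size_t sx1 = std::min(sx0 + 1, srcW - 1);
            for (std::size_t ch = 0; ch < channels; ++ch) {
                const std::uint32_t sum = src[(sy0 * srcW + sx0) * channels + ch]
                                        + src[(sy0 * srcW + sx1) * channels + ch]
                                        + src[(sy1 * srcW + sx0) * channels + ch]
                                        + src[(sy1 * srcW + sx1) * channels + ch];
                // +2 before the divide rounds to nearest instead of truncating.
                dst[(y * dstW + x) * channels + ch] =
                    static_cast<std::uint8_t>((sum + 2u) / 4u);
            }
        }
    }
    return dst;
}
```

Calling it repeatedly on its own output, stopping once `mipLevels` levels exist, reproduces the chain the loop uploads. Averaging raw bytes ignores sRGB encoding, which is the approximation the comment on `UpdateLayer` already acknowledges.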
--- a/interfaces/Crafter.Graphics-InputField.cppm
+++ b/interfaces/Crafter.Graphics-InputField.cppm
@ -18,10 +18,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA
 */
 module;

-#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
-#endif // !CRAFTER_GRAPHICS_WINDOW_DOM
 export module Crafter.Graphics:InputField;
-#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
 import std;
 import :Types;
 import :Keys;
@ -110,4 +107,3 @@ export namespace Crafter {
                        const InputFieldColors& colors,
                        bool caretVisible);
 }
-#endif // !CRAFTER_GRAPHICS_WINDOW_DOM
--- a/interfaces/Crafter.Graphics-Mesh.cppm
+++ b/interfaces/Crafter.Graphics-Mesh.cppm
@ -97,6 +97,7 @@ export namespace Crafter {
        // sentinel; never returned by Build().
        std::uint64_t blasAddr = 0;
        std::uint32_t triangleCount = 0;
+        std::uint32_t vertexCount   = 0;

        bool opaque = true;

--- a/interfaces/Crafter.Graphics-PlainComputeShader.cppm
+++ b/interfaces/Crafter.Graphics-PlainComputeShader.cppm
@ -0,0 +1,113 @@
+/*
+Crafter®.Graphics
+Copyright (C) 2026 Catcrafts®
+catcrafts.net
+
+This library is free software; you can redistribute it and/or
+modify it under the terms of the GNU Lesser General Public
+License version 3.0 as published by the Free Software Foundation;
+*/
+
+// Standalone compute pipeline. Dispatches at any point in the frame
+// (inside or outside the UI render pass) via the JS bridge's
+// wgpuDispatchCompute, which mirrors the wgpuBuildTLAS pattern of
+// attaching to the active encoder when one exists or creating an
+// ephemeral encoder+submit when not.
+//
+// This is the WebGPU counterpart to the Vulkan `:ComputeShader` partition.
+// They expose the same conceptual API — Load + Dispatch — but with
+// backend-specific binding plumbing. See `:GraphicsTypes` for the
+// `GraphicsComputeShader` alias picking the right one per target.
+//
+// WGSL contract:
+//   @group(0) @binding(0) uniform PushData    // optional; only if pushUniformSize>0
+//   @group(1+) @binding(N)                    // user bindings via UICustomBinding
+// When rayQuery is on, @group(1) is reserved for the RT heap; user
+// bindings start at @group(2).
+
+module;
+export module Crafter.Graphics:PlainComputeShader;
+#ifdef CRAFTER_GRAPHICS_WINDOW_DOM
+import std;
+import :WebGPU;
+import :WebGPUComputeShader;  // for UICustomBinding + UICustomBindingKind
+
+export namespace Crafter {
+    class PlainComputeShader {
+    public:
+        std::uint32_t pipelineHandle  = 0;
+        std::uint32_t pushUniformSize = 0;
+        bool          rayQueryCapable = false;
+        std::vector<UICustomBinding> customBindings;
+
+        PlainComputeShader() = default;
+        PlainComputeShader(const PlainComputeShader&) = delete;
+        PlainComputeShader& operator=(const PlainComputeShader&) = delete;
+        PlainComputeShader(PlainComputeShader&& o) noexcept
+            : pipelineHandle(o.pipelineHandle),
+              pushUniformSize(o.pushUniformSize),
+              rayQueryCapable(o.rayQueryCapable),
+              customBindings(std::move(o.customBindings)) {
+            o.pipelineHandle = 0;
+        }
+
+        // Compile + link a standalone compute shader.
+        //   wgsl             — source.
+        //   pushUniformSize  — byte size of the @group(0)@binding(0) uniform
+        //                      struct, or 0 if the shader doesn't declare one.
+        //   bindings         — every user-declared resource the dispatch
+        //                      should bind (groups 1+ if no rayQuery, 2+ if
+        //                      rayQuery). Order MUST match `handles` at
+        //                      Dispatch time.
+        //   rayQuery         — prepend the RT prelude + rayQuery library
+        //                      so the shader can call `rayQuery*` helpers.
+        void Load(std::string_view wgsl,
+                  std::uint32_t pushUniformSize_,
+                  std::span<const UICustomBinding> bindings = {},
+                  bool rayQuery = false) {
+            pushUniformSize = pushUniformSize_;
+            rayQueryCapable = rayQuery;
+            customBindings.assign(bindings.begin(), bindings.end());
+            pipelineHandle = WebGPU::wgpuLoadComputePipeline(
+                wgsl.data(), static_cast<std::int32_t>(wgsl.size()),
+                static_cast<std::int32_t>(pushUniformSize),
+                customBindings.empty() ? nullptr : customBindings.data(),
+                static_cast<std::int32_t>(customBindings.size()),
+                rayQuery ? 1 : 0);
+        }
+
+        void Load(const std::filesystem::path& wgslPath,
+                  std::uint32_t pushUniformSize_,
+                  std::span<const UICustomBinding> bindings = {},
+                  bool rayQuery = false) {
+            std::ifstream f(wgslPath, std::ios::binary);
+            if (!f) {
+                std::println(std::cerr,
+                    "PlainComputeShader::Load: cannot open {}", wgslPath.string());
+                std::abort();
+            }
+            std::string wgsl((std::istreambuf_iterator<char>(f)),
+                              std::istreambuf_iterator<char>());
+            Load(std::string_view{wgsl}, pushUniformSize_, bindings, rayQuery);
+        }
+
+        // Bind, push, dispatch. `handles` is parallel to the
+        // UICustomBinding[] passed at Load — order matches.
+        void Dispatch(const void* push, std::uint32_t pushBytes,
+                      std::span<const std::uint32_t> handles,
+                      std::uint32_t gx,
+                      std::uint32_t gy = 1,
+                      std::uint32_t gz = 1) const {
+            if (pipelineHandle == 0) return;
+            WebGPU::wgpuDispatchCompute(
+                pipelineHandle,
+                push, static_cast<std::int32_t>(pushBytes),
+                handles.empty() ? nullptr : handles.data(),
+                static_cast<std::int32_t>(handles.size()),
+                static_cast<std::int32_t>(gx),
+                static_cast<std::int32_t>(gy),
+                static_cast<std::int32_t>(gz));
+        }
+    };
+}
+#endif // CRAFTER_GRAPHICS_WINDOW_DOM
--- a/interfaces/Crafter.Graphics-RenderingElement3D.cppm
+++ b/interfaces/Crafter.Graphics-RenderingElement3D.cppm
@ -121,6 +121,37 @@ export namespace Crafter {
        // customIndex (4) + _pad (12). Defined in the WGSL traversal
        // library; never directly read by C++.
        WebGPUBuffer<char, false>           buffer;
+        // GPU LBVH support — see additional/dom-webgpu.js's TLAS-build
+        // pipeline.
+        //
+        // entryOrder: per-frame permutation array of u32, indexing into
+        // `buffer` (the TLASEntry[] array). The radix-sort pass fills it
+        // in spatially coherent Morton order; the BVH construction and
+        // traversal passes then consume it. In Stage 1 (this baseline)
+        // it's the identity permutation written by tlasBuildMain
+        // alongside the entries.
+        WebGPUBuffer<char, false>           entryOrder;
+        // mortonCodes: per-instance 32-bit Morton codes computed from the
+        // world-AABB centroid, used as the radix-sort key. Written by
+        // tlasBuildMain.
+        WebGPUBuffer<char, false>           mortonCodes;
+        // bvhNodes: 2N_PADDED - 1 sweep-tree BVH nodes built per frame
+        // by the LBVH-build compute pass. Each node 32 bytes (aabbMin +
+        // pad, aabbMax + pad). N_PADDED = 65536 (hardcoded in WGSL).
+        // Internal nodes [0, N_PADDED-1); leaves [N_PADDED-1, 2*N_PADDED-1).
+        // Node i's children are 2i+1, 2i+2 (implicit perfect binary
+        // tree). Cap: 65536 instances per scene.
+        WebGPUBuffer<char, false>           bvhNodes;
+        // tlasBins: dead, kept allocated as a 64-byte placeholder so the
+        // existing wgpuBuildTLAS C++ signature doesn't have to change.
+        // The pre-LBVH 64-bin partition was replaced by the full BVH.
+        WebGPUBuffer<char, false>           tlasBins;
+        // Sort ping-pong buffers for the radix sort. Each pass reads
+        // from one and writes to the other, then the roles swap. Layout per
+        // element: 1 u32 packed key = (morton16 << 16) | tlasIndex16.
+        // Sized for N_PADDED.
+        WebGPUBuffer<char, false>           sortTempA;
+        WebGPUBuffer<char, false>           sortTempB;

        std::uint32_t builtInstanceCount = 0;
    };
@ -141,6 +172,17 @@ export namespace Crafter {
        // a fresh build (no refit) — the GPU build pass is cheap at the
        // ~10–100 instance counts the design targets; LBVH-for-TLAS is a
        // future optimization for larger scenes.
+        //
+        // BuildTLAS is now split into Upload + Build so a physics
+        // compute pass (e.g. physics-tlas-transform) can run between the
+        // CPU mirror upload and the GPU LBVH build. The compute pass
+        // writes the per-instance transform bytes that BuildTLAS leaves
+        // intact for elements flagged transformOwnedByGpu, and those
+        // writes have to land before the LBVH reads them. The combined
+        // BuildTLAS is kept as a convenience for callers that don't
+        // interleave a compute pass (e.g. the ctor-time first build).
+        static void BuildTLASUpload(WebGPUCommandEncoderRef cmd, std::uint32_t index);
+        static void BuildTLASBuild(WebGPUCommandEncoderRef cmd, std::uint32_t index);
        static void BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index);

        static void Add(RenderingElement3D* e);
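Review note: the `bvhNodes` and sort-buffer comments above describe an implicit perfect binary tree and a packed 32-bit key. The sketch below restates that arithmetic on the CPU. Every name in it is hypothetical, and the refit is a plain sequential loop, not the parallel WGSL pass that `lbvhBuildMain` runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Implicit perfect binary tree over nPadded leaves (names are hypothetical).
constexpr std::uint32_t LeftChild(std::uint32_t i)  { return 2 * i + 1; }
constexpr std::uint32_t RightChild(std::uint32_t i) { return 2 * i + 2; }
constexpr std::uint32_t Parent(std::uint32_t i)     { return (i - 1) / 2; }  // requires i > 0
constexpr std::uint32_t FirstLeaf(std::uint32_t nPadded) { return nPadded - 1; }
constexpr std::uint32_t NodeCount(std::uint32_t nPadded) { return 2 * nPadded - 1; }

// Packed sort key: Morton bits in the high half so they dominate the
// ordering, instance index in the low half as payload and tie-breaker.
constexpr std::uint32_t PackKey(std::uint16_t morton16, std::uint16_t tlasIndex16) {
    return (static_cast<std::uint32_t>(morton16) << 16) | tlasIndex16;
}
constexpr std::uint16_t KeyIndex(std::uint32_t key) {
    return static_cast<std::uint16_t>(key & 0xFFFFu);
}

struct Aabb { float min[3]; float max[3]; };

inline Aabb Union(const Aabb& a, const Aabb& b) {
    Aabb r{};
    for (int k = 0; k < 3; ++k) {
        r.min[k] = std::min(a.min[k], b.min[k]);
        r.max[k] = std::max(a.max[k], b.max[k]);
    }
    return r;
}

// Sequential bottom-up refit. Leaves [nPadded-1, 2*nPadded-1) must already
// hold their boxes; internal nodes are visited from the last one to the root.
inline void RefitBottomUp(std::vector<Aabb>& nodes, std::uint32_t nPadded) {
    assert(nodes.size() == NodeCount(nPadded));
    for (std::uint32_t i = FirstLeaf(nPadded); i-- > 0;) {
        nodes[i] = Union(nodes[LeftChild(i)], nodes[RightChild(i)]);
    }
}
```

With `N_PADDED = 65536` this gives 131071 nodes, and the 16-bit index field in the key is what limits a scene to 65536 instances.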
--- a/interfaces/Crafter.Graphics-UI.cppm
+++ b/interfaces/Crafter.Graphics-UI.cppm
@ -165,6 +165,18 @@ export namespace Crafter {
                             std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
        void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
                            std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
+#ifdef CRAFTER_GRAPHICS_WINDOW_DOM
+        // WebGPU-only overload. WebGPU bind groups can only carry one
+        // texture/sampler per dispatch, so all items in `bufferSlot`
+        // share the same texture (`imageSlot`) and sampler (`samplerSlot`).
+        // The per-item `slots` field in ImageItem is ignored on this
+        // backend. On Vulkan the bindless heap resolves per-item slots,
+        // so the cross-backend path is to call the 4-arg overload above
+        // on native and this 6-arg overload on DOM.
+        void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
+                            std::uint16_t imageSlot, std::uint16_t samplerSlot,
+                            std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
+#endif
        void DispatchText(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
                          std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});

--- a/interfaces/Crafter.Graphics-WebGPU.cppm
+++ b/interfaces/Crafter.Graphics-WebGPU.cppm
@ -35,6 +35,40 @@ namespace Crafter::WebGPU {
    extern "C" std::uint32_t wgpuCreateBuffer(std::int32_t byteSize);
    __attribute__((import_module("env"), import_name("wgpuWriteBuffer")))
    extern "C" void wgpuWriteBuffer(std::uint32_t handle, const void* srcPtr, std::int32_t byteSize);
+    __attribute__((import_module("env"), import_name("wgpuWriteBufferRange")))
+    extern "C" void wgpuWriteBufferRange(std::uint32_t handle,
+                                          std::uint32_t dstByteOffset,
+                                          const void* srcPtr,
+                                          std::int32_t byteSize);
+    // Kick off a GPU→CPU readback for the entire `byteSize`-byte prefix
+    // of the buffer at `handle`. Returns immediately; the actual map
+    // resolves asynchronously. Successive Enqueues without a Poll in
+    // between are no-ops until the previous map resolves.
+    //
+    // `resetBytes` ≥ 0 — if non-zero, the JS bridge encodes a
+    // clearBuffer over the first `resetBytes` bytes of the source
+    // buffer immediately after the copy, in the same command encoder.
+    // Used by Forts3D's GPU event queues to zero the atomic-add count
+    // for the next frame's substeps. The reset is TIED to a successful
+    // enqueue: if the enqueue was skipped (previous map still pending),
+    // the reset is skipped too — so events written by substeps during
+    // the missed-drain window accumulate into the next successful
+    // capture instead of being silently wiped.
+    __attribute__((import_module("env"), import_name("wgpuReadbackEnqueue")))
+    extern "C" void wgpuReadbackEnqueue(std::uint32_t handle,
+                                         std::int32_t byteSize,
+                                         std::int32_t resetBytes);
+    // Poll a previously-enqueued readback. Returns 1 and writes the
+    // bytes into `dstPtr` if the map resolved; returns 0 otherwise.
+    __attribute__((import_module("env"), import_name("wgpuReadbackPoll")))
+    extern "C" std::int32_t wgpuReadbackPoll(std::uint32_t handle, void* dstPtr, std::int32_t byteSize);
+    // Non-consuming readiness probe. Returns 1 if the readback has
+    // resolved and the next Poll would succeed; returns 0 otherwise.
+    // Used to gate multi-buffer drains (header + array) so neither side
+    // gets consumed until both are ready — otherwise the consumed side's
+    // data is lost while the other side waits for its map to resolve.
+    __attribute__((import_module("env"), import_name("wgpuReadbackReady")))
+    extern "C" std::int32_t wgpuReadbackReady(std::uint32_t handle);
    __attribute__((import_module("env"), import_name("wgpuDestroyBuffer")))
    extern "C" void wgpuDestroyBuffer(std::uint32_t handle);

@ -64,15 +98,26 @@ namespace Crafter::WebGPU {
    // Used by Image2DArray<RGBA8> to stack per-material albedos for one
    // multi-material scene.
    __attribute__((import_module("env"), import_name("wgpuCreateImage2DArray")))
-    extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h, std::int32_t layerCount);
+    extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h,
+                                                    std::int32_t layerCount, std::int32_t mipLevels);
+    // Upload a single mip level for one array layer. `level` indexes into
+    // the texture's mip chain (0 = base); `w` / `h` must be the dimensions
+    // at that level. Callers pass each level's pixels separately — mip
+    // generation is host-side.
    __attribute__((import_module("env"), import_name("wgpuWriteImage2DLayer")))
-    extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer,
+    extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer, std::int32_t level,
                                          const void* srcPtr, std::int32_t byteSize,
                                          std::int32_t w, std::int32_t h);

    __attribute__((import_module("env"), import_name("wgpuCreateLinearClampSampler")))
    extern "C" std::uint32_t wgpuCreateLinearClampSampler();

+    // Linear-filtered, repeat-addressed sampler with linear mipmap filtering.
+    // The usual choice for tiled material textures (woodBrace, panel, etc.)
+    // which expect UV > 1.0 to wrap.
+    __attribute__((import_module("env"), import_name("wgpuCreateLinearRepeatSampler")))
+    extern "C" std::uint32_t wgpuCreateLinearRepeatSampler();
+
    __attribute__((import_module("env"), import_name("wgpuFrameBegin")))
    extern "C" void wgpuFrameBegin();
    __attribute__((import_module("env"), import_name("wgpuFrameEnd")))
@ -158,12 +203,56 @@ namespace Crafter::WebGPU {
                                   std::int32_t  gx, std::int32_t gy,
                                   const void* handlesPtr, std::int32_t handlesCount);

-    // GPU TLAS-build dispatch. Reads the instance buffer (host-uploaded or
-    // GPU-written), produces per-instance world-space AABBs + per-instance
-    // transform matrices in a flat tlasBuf SSBO consumed by traceRay / rayQuery.
+    // GPU TLAS-build dispatch. Two sequential compute passes:
+    //   1. tlasBuildMain — per-instance world AABB + identity permutation
+    //      + naive Morton (overwritten in pass 2). Outputs the flat
+    //      tlasBuf SSBO consumed by traceRay / rayQuery.
+    //   2. lbvhBuildMain — single workgroup of 1024 threads; reduces
+    //      scene AABB, recomputes Morton with proper normalization,
+    //      radix-sorts the packed (morton, instance_id) keys, writes the sorted
+    //      permutation into `entryOrderBufHandle`, and refits a
+    //      sweep-tree BVH into `bvhNodesBufHandle` bottom-up.
+    // Pre-LBVH bin-build is gone; `binsBufHandle` is kept in the
+    // signature as a placeholder so the C++ side doesn't churn.
    __attribute__((import_module("env"), import_name("wgpuBuildTLAS")))
    extern "C" void wgpuBuildTLAS(std::uint32_t instanceBufHandle,
                                  std::int32_t  instanceCount,
-                                  std::uint32_t tlasOutBufHandle);
+                                  std::uint32_t tlasOutBufHandle,
+                                  std::uint32_t entryOrderBufHandle,
+                                  std::uint32_t mortonBufHandle,
+                                  std::uint32_t binsBufHandle,
+                                  std::uint32_t bvhNodesBufHandle,
+                                  std::uint32_t sortTempABufHandle,
+                                  std::uint32_t sortTempBBufHandle);
+
+    // ── Standalone compute pipelines ───────────────────────────────────
+    //
+    // Mirror of the native ComputeShader API: load a user-authored
+    // compute WGSL with arbitrary @group bindings, dispatch it at any
+    // point in the frame (inside or outside the UI compute pass —
+    // physics ticks dispatch from update lambdas, which fire outside
+    // the per-frame render encoder).
+    //
+    // WGSL contract:
+    //   @group(0) @binding(0) — uniform PushData (optional; only if
+    //                            pushUniformSize > 0 at load).
+    //   @group(1+) @binding(N) — user bindings declared via
+    //                            UICustomBinding[]. When rayQuery is
+    //                            on, @group(1) is reserved for the RT
+    //                            heap and user bindings start at
+    //                            @group(2).
+    __attribute__((import_module("env"), import_name("wgpuLoadComputePipeline")))
+    extern "C" std::uint32_t wgpuLoadComputePipeline(
+        const void* wgslPtr, std::int32_t wgslLen,
+        std::int32_t pushUniformSize,
+        const void* bindingsPtr, std::int32_t bindingsCount,
+        std::int32_t rayQueryFlag);
+
+    __attribute__((import_module("env"), import_name("wgpuDispatchCompute")))
+    extern "C" void wgpuDispatchCompute(
+        std::uint32_t pipelineHandle,
+        const void* pushPtr, std::int32_t pushBytes,
+        const void* handlesPtr, std::int32_t handlesCount,
+        std::int32_t gx, std::int32_t gy, std::int32_t gz);
 }
 #endif // CRAFTER_GRAPHICS_WINDOW_DOM
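Review note: the enqueue / ready / poll contract documented for `wgpuReadbackEnqueue`, `wgpuReadbackReady` and `wgpuReadbackPoll` can be modelled on the CPU as below. `MockReadbackSlot` and `DrainPair` are hypothetical stand-ins, not the JS bridge. The sketch assumes an enqueue is dropped only while a map is in flight, which is one reading of the comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of one buffer's readback slot.
struct MockReadbackSlot {
    std::vector<std::uint8_t> gpu;      // simulated GPU buffer contents
    std::vector<std::uint8_t> staging;  // snapshot captured at enqueue time
    bool inFlight = false;              // map requested, not yet resolved
    bool resolved = false;              // map resolved, waiting for Poll

    // Returns true if the enqueue was accepted. A dropped enqueue also
    // skips the reset, so bytes written meanwhile are not wiped.
    bool Enqueue(std::size_t resetBytes) {
        if (inFlight) return false;
        staging = gpu;  // the copy happens before the clear
        std::fill_n(gpu.begin(), std::min(resetBytes, gpu.size()), std::uint8_t{0});
        inFlight = true;
        resolved = false;
        return true;
    }
    // Stand-in for the asynchronous map completing.
    void ResolveMap() {
        if (inFlight) { inFlight = false; resolved = true; }
    }
    // Non-consuming probe.
    bool Ready() const { return resolved; }
    // Consuming read: hands out the snapshot once.
    bool Poll(std::vector<std::uint8_t>& dst) {
        if (!resolved) return false;
        dst = staging;
        resolved = false;
        return true;
    }
};

// Drain a header + array pair only when both sides are ready, so neither
// snapshot is consumed while the other is still pending.
inline bool DrainPair(MockReadbackSlot& header, MockReadbackSlot& array,
                      std::vector<std::uint8_t>& headerOut,
                      std::vector<std::uint8_t>& arrayOut) {
    if (!header.Ready() || !array.Ready()) return false;
    header.Poll(headerOut);
    array.Poll(arrayOut);
    return true;
}
```

The gate in `DrainPair` is the reason the non-consuming probe exists: polling the header alone would discard its snapshot before the array's map resolves.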
--- a/interfaces/Crafter.Graphics-WebGPUBuffer.cppm
+++ b/interfaces/Crafter.Graphics-WebGPUBuffer.cppm
@ -78,6 +78,60 @@ export namespace Crafter {
        void FlushDevice() requires(Mapped) {
            WebGPU::wgpuWriteBuffer(handle, this->value, static_cast<std::int32_t>(size));
        }
+        // Partial upload — write the bytes [srcByteOffset, srcByteOffset+byteCount)
+        // of the host mirror to GPU offset `dstByteOffset`. BuildTLAS uses
+        // this to leave the GPU-owned transform field of an RTInstance
+        // intact (the physics-tlas-transform compute shader is its sole
+        // writer) while still pushing the CPU-side metadata fields.
+        void FlushDeviceRange(std::uint32_t dstByteOffset,
+                              std::uint32_t srcByteOffset,
+                              std::uint32_t byteCount) requires(Mapped) {
+            const auto* base = reinterpret_cast<const char*>(this->value);
+            WebGPU::wgpuWriteBufferRange(handle, dstByteOffset,
+                                          base + srcByteOffset,
+                                          static_cast<std::int32_t>(byteCount));
+        }
+
+        // Push one element's worth of bytes from the host mirror to GPU.
+        // Use when a single SoA slot was mutated (body construction,
+        // per-instance flag flip) and a full FlushDevice would clobber
+        // the GPU-side updates the sim has applied to neighboring slots.
+        void FlushDeviceSlot(std::uint32_t idx) requires(Mapped) {
+            constexpr std::uint32_t kStride = sizeof(T);
+            const std::uint32_t off = idx * kStride;
+            FlushDeviceRange(off, off, kStride);
+        }
+
+        // Schedule a GPU→CPU readback of this buffer's entire contents.
+        // Asynchronous; data isn't ready until a later PollReadback
+        // returns true. Successive Enqueues without a Poll are dropped
+        // — they're a no-op while the previous map is in flight.
+        //
+        // `resetBytes`: if non-zero, the first `resetBytes` bytes
+        // of THIS buffer are clearBuffer-cleared on the GPU command
+        // encoder immediately after the copy, so the readback captures
+        // the pre-clear bytes and the next frame's writers see zeros.
+        // The reset is tied to a successful enqueue (skipped enqueue =
+        // skipped reset), preserving accumulated state across missed
+        // drains.
+        void EnqueueReadback(std::uint32_t resetBytes = 0) {
+            WebGPU::wgpuReadbackEnqueue(handle,
+                                         static_cast<std::int32_t>(size),
+                                         static_cast<std::int32_t>(resetBytes));
+        }
+        // Try to copy the readback bytes into this->value. Returns true
+        // if the previous EnqueueReadback resolved and the data is now
+        // mirrored into .value; false if the map is still pending.
+        bool PollReadback() requires(Mapped) {
+            return WebGPU::wgpuReadbackPoll(handle, this->value,
+                                             static_cast<std::int32_t>(size)) != 0;
+        }
+        // Non-consuming readiness probe. Returns true if a subsequent
+        // PollReadback would succeed; the probe itself changes no state.
+        // Use to verify a sibling buffer is also ready before consuming.
+        bool IsReadbackReady() const {
+            return WebGPU::wgpuReadbackReady(handle) != 0;
+        }

        ~WebGPUBuffer() { Clear(); }
    };
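Review note: the point of `FlushDeviceSlot` is that a one-element upload leaves neighbouring GPU-side values alone. The sketch below shows that with a hypothetical `MockMirroredBuffer` that keeps both copies in host memory. It mirrors the offset arithmetic of `FlushDeviceRange` / `FlushDeviceSlot` but is not the real `WebGPUBuffer`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical model: `host` plays the CPU mirror, `device` plays the GPU
// allocation that a simulation may have updated independently.
template <typename T>
struct MockMirroredBuffer {
    static_assert(std::is_trivially_copyable_v<T>, "byte-wise copy requires a trivial T");
    std::vector<T> host;
    std::vector<T> device;

    // Copy host bytes [srcByteOffset, srcByteOffset + byteCount) to the
    // device copy at dstByteOffset.
    void FlushRange(std::size_t dstByteOffset, std::size_t srcByteOffset,
                    std::size_t byteCount) {
        assert(srcByteOffset + byteCount <= host.size() * sizeof(T));
        assert(dstByteOffset + byteCount <= device.size() * sizeof(T));
        std::memcpy(reinterpret_cast<unsigned char*>(device.data()) + dstByteOffset,
                    reinterpret_cast<const unsigned char*>(host.data()) + srcByteOffset,
                    byteCount);
    }

    // Upload exactly one element; source and destination offsets coincide.
    void FlushSlot(std::size_t idx) {
        const std::size_t off = idx * sizeof(T);
        FlushRange(off, off, sizeof(T));
    }
};
```

A full flush would overwrite every device element with the possibly stale host values, so the slot variant is the safer choice once the GPU has become a writer of the same buffer.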
--- a/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm
+++ b/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm
@ -36,6 +36,11 @@ export namespace Crafter {
        SampledTexture      = 1,   // sampled texture_2d<f32>, handle is a slot into heap.imageTable
        Sampler             = 2,   // filtering sampler, handle is a slot into heap.samplerTable
        SampledTextureArray = 3,   // sampled texture_2d_array<f32>, handle is a slot into heap.imageTable
+        // read-write storage SSBO (var<storage, read_write> in WGSL). Use
+        // for buffers shaders need to MUTATE — e.g. physics shaders that
+        // integrate node momentum, write brace stress, or output TLAS
+        // instance transforms.
+        BufferReadWrite     = 4,
    };

    struct UICustomBinding {
--- a/interfaces/Crafter.Graphics.cppm
+++ b/interfaces/Crafter.Graphics.cppm
@ -71,5 +71,6 @@ export import :WebGPU;
 export import :WebGPUBuffer;
 export import :DescriptorHeapWebGPU;
 export import :WebGPUComputeShader;
+export import :PlainComputeShader;
 export import :ShaderBindingTableWebGPU;
 export import :PipelineRTWebGPU;
--- a/project.cpp
+++ b/project.cpp
@ -123,7 +123,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
    // when its body is gated out. Vulkan-typed partitions stub to empty
    // modules under CRAFTER_GRAPHICS_WINDOW_DOM; the Dom/DomEvents/Router
    // partitions stub to empty modules in the opposite direction.
-    std::array<fs::path, 41> ifaces = {
+    std::array<fs::path, 42> ifaces = {
        "interfaces/Crafter.Graphics",
        "interfaces/Crafter.Graphics-Animation",
        "interfaces/Crafter.Graphics-Clipboard",
@ -147,6 +147,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
        "interfaces/Crafter.Graphics-Mesh",
        "interfaces/Crafter.Graphics-PipelineRTVulkan",
        "interfaces/Crafter.Graphics-PipelineRTWebGPU",
+        "interfaces/Crafter.Graphics-PlainComputeShader",
        "interfaces/Crafter.Graphics-RenderingElement3D",
        "interfaces/Crafter.Graphics-RenderPass",
        "interfaces/Crafter.Graphics-Router",
@ -170,14 +171,16 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
    if (dom) {
        // DOM impl set. UI-Shared.cpp is backend-agnostic; UI-WebGPU.cpp
        // is the DOM-only implementation of UIRenderer's GPU-touching
-        // methods. Font / FontAtlas / UIComponents are now portable.
-        std::array<fs::path, 16> domImpls = {
+        // methods. Font / FontAtlas / UIComponents / InputField are now
+        // portable.
+        std::array<fs::path, 17> domImpls = {
            "implementations/Crafter.Graphics-Clipboard",
            "implementations/Crafter.Graphics-Dom",
            "implementations/Crafter.Graphics-Font",
            "implementations/Crafter.Graphics-FontAtlas",
            "implementations/Crafter.Graphics-Gamepad",
            "implementations/Crafter.Graphics-Input",
+            "implementations/Crafter.Graphics-InputField",
            "implementations/Crafter.Graphics-Mesh-WebGPU",
            "implementations/Crafter.Graphics-PipelineRTWebGPU",
            "implementations/Crafter.Graphics-RenderingElement3D-WebGPU",