diff --git a/TODO-lbvh-sort.md b/TODO-lbvh-sort.md
new file mode 100644
index 0000000..f86373e
--- /dev/null
+++ b/TODO-lbvh-sort.md
@@ -0,0 +1,106 @@
+# LBVH parallel radix sort: count-dependent corruption
+
+## Summary
+
+The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
+incorrect output that depends on the input distribution. Symptom: geometry in
+the BVH-built TLAS appears to flicker (instances missing or pointing at the
+wrong entry) as soon as a small object enters the TLAS alongside a tight
+cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).
+
+Bisected by selectively skipping each LBVH phase. Skipping only the radix
+sort eliminates the corruption — every other phase (scene-AABB reduce,
+Morton-key write, leaf init, sweep-tree refit) is correctness-clean.
+
+Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
+leaves are in instance-index order with no spatial coherence. The BVH still
+builds correctly and traversal still descends a real tree, just with looser
+parent AABBs.
+
+## What we know
+
+- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
+- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
+  `0xFFFFFFFF`.
+- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
+  Hillis-Steele exclusive prefix scan to compute per-thread destination
+  offsets.
+- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
+- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
+  steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
+  inclusive prefix sum.
+- Scatter destinations are provably unique: `shOffsets[b] + exclusivePrefix
+  + localIdx`, where `exclusivePrefix` is per-thread and `localIdx`
+  increments per-element within the thread.
+- All required barriers are present:
+  - `workgroupBarrier` between scan iterations.
+  - `workgroupBarrier` at end of each bucket iteration.
+  - `storageBarrier` at end of each radix pass.
+
+## What we suspect
+
+The bug is likely one of:
+
+1. **WGSL implementation issue** in the specific browser/driver. `workgroup
+   Barrier` semantics around `atomicLoad` on workgroup memory, or around
+   single-buffered Hillis-Steele where one thread reads `shScan[tid - offset]`
+   while a neighbor writes `shScan[tid]`. Standard pattern, but the spec is
+   subtle.
+2. **Memory model edge case** triggered only with very unbalanced histograms
+   (e.g. bucket 15 holding ~94% of entries because almost everything is
+   sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
+   buckets and exactly 15-16 for bucket 15; that mix may surface a
+   compiler-introduced reordering.
+3. **A logical bug in the scan or scatter** that the human review keeps
+   missing — re-reading the code is the last thing that helps; what's
+   needed is a GPU-side trace.
+
+## Reproducing
+
+1. Run 3DForts WebGPU build with normal projectile firing.
+2. Aim near (not necessarily at) the fort.
+3. Observe braces / panels flickering as the projectile flies past.
+
+## Diagnostic strategies if revisiting
+
+1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
+   entries × a few u32). Have each thread write its intermediate scan
+   values and final scatter destinations there. Read back to CPU and diff
+   against an expected oracle (CPU-computed reference sort of the same
+   input keys).
+2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
+   sort already corrupt, or does corruption only emerge after multiple
+   ping-pongs?
+3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
+   scan or a `subgroupExclusiveAdd` variant where available. If the
+   replacement fixes it, the bug is in the Hillis-Steele specifically.
+4. **Serialize the scatter.** Have thread 0 do all scatters by itself
+   (loop over all 16384 entries × 16 buckets sequentially). Slow but a
+   provably-correct reference. If this fixes the flicker, the parallel
+   scatter has the bug.
+5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
+   bitonic works, radix has a structural problem.
+
+## Why it's not blocking
+
+At the current scale (~1011 entries), the BVH still functions:
+
+- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
+  tree very cheaply (~1 AABB test per skipped subtree).
+- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
+  which are real AABB tests. Without spatial coherence the AABBs are
+  looser than a properly-sorted BVH, but they still bound the geometry.
+- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
+  of the per-pixel cost.
+
+Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
+entries this stops being acceptable and the sort needs to actually work.
+
+## Acceptance criteria for "fixed"
+
+- The diagnostic repro (3DForts: fire a projectile near the fort) shows
+  no flicker at all.
+- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
+- A unit test (CPU oracle vs GPU output) passes for at least three
+  histogram distributions: all-uniform, all-in-one-bucket, and the
+  3DForts-style "one small object next to a tight cluster".
diff --git a/additional/dom-env.js b/additional/dom-env.js
index e3dfc4d..fb64368 100644
--- a/additional/dom-env.js
+++ b/additional/dom-env.js
@@ -168,15 +168,25 @@ function setValue(cookie, valPtr, valLen) {
 // so removeEventListener can re-find it. C++-side handler id counters
 // are per-kind, so a per-kind suffix is what makes the keys unique.
 
+// devicePixelRatio scaling factor. dom-webgpu.js sets window.crafter_dpr
+// during its canvas sync so this side and the GPU side agree on a single
+// physical-pixel coordinate space. Fallback to the live DPR if no GPU
+// bridge ran (pure-CppDOM apps); ultimately fallback to 1 so non-HiDPI
+// browsers behave as before.
+function __dpr() {
+    return window.crafter_dpr || window.devicePixelRatio || 1;
+}
+
 function __makeMouseListenerPair(kind, eventName, exportName) {
     return {
         add(cookie, id) {
             const el = __jsmemory.get(cookie);
             if (!el) return;
             const handler = (event) => {
+                const s = __dpr();
                 __wasm()[exportName](id,
-                    event.clientX, event.clientY,
-                    event.screenX, event.screenY,
+                    event.clientX * s, event.clientY * s,
+                    event.screenX * s, event.screenY * s,
                     event.button, event.buttons,
                     event.altKey, event.ctrlKey, event.shiftKey, event.metaKey);
             };
@@ -317,7 +327,10 @@ const __resizePair = {
     // Resize is window-global in CppDOM. Mirror that: attach to `window`
     // regardless of which element the C++ caller passed.
     add(cookie, id) {
-        const handler = () => __wasm().ExecuteResizeHandler(id, window.innerWidth, window.innerHeight);
+        const handler = () => {
+            const s = __dpr();
+            __wasm().ExecuteResizeHandler(id, window.innerWidth * s, window.innerHeight * s);
+        };
         __listenerHandlers.set(`${cookie}-${id}-resize`, handler);
         window.addEventListener("resize", handler);
     },
@@ -345,9 +358,10 @@ const __wheelPair = {
     add(cookie, id) {
         const el = __jsmemory.get(cookie); if (!el) return;
         const handler = (event) => {
+            const s = __dpr();
             __wasm().ExecuteWheelHandler(id,
                 event.deltaX, event.deltaY, event.deltaZ, event.deltaMode,
-                event.clientX, event.clientY, event.screenX, event.screenY,
+                event.clientX * s, event.clientY * s, event.screenX * s, event.screenY * s,
                 event.button, event.buttons,
                 event.altKey, event.ctrlKey, event.shiftKey, event.metaKey);
         };
@@ -378,11 +392,97 @@ function domAttachWindow(windowHandle) {
         if (fn) fn(__windowAttachedHandle, ...args);
     };
 
-    __windowListeners.mousemove = (e) => fire("__crafterDom_mouseMove", [e.clientX, e.clientY]);
-    __windowListeners.mousedown = (e) => fire("__crafterDom_mouseDown", [e.button]);
-    __windowListeners.mouseup   = (e) => fire("__crafterDom_mouseUp",   [e.button]);
+    // Synthetic absolute position for pointer-lock mode. While the
+    // pointer is locked, browsers fire mousemove events with movementX/Y
+    // deltas instead of meaningful clientX/Y, and the cursor is hidden +
+    // captured by the canvas (no window-edge clamp). We accumulate the
+    // deltas into a synthetic position and feed *that* to the C++ side,
+    // so the existing `currentMousePos - lastMousePos` delta computation
+    // keeps working unchanged. Initialised to the cursor position the
+    // moment lock is acquired.
+    let __ptrLockSyntheticX = 0;
+    let __ptrLockSyntheticY = 0;
+    const __isPointerLocked = () =>
+        document.pointerLockElement !== null &&
+        document.pointerLockElement !== undefined;
+
+    // pointermove (not mousemove) so we can pull sub-frame events out of
+    // `getCoalescedEvents()`. Browsers normally collapse multiple raw
+    // mouse events between paint frames into a single event you'd see
+    // via `mousemove`; PointerEvent.getCoalescedEvents() returns the raw
+    // pre-coalesced list. Summing those gives a higher-resolution delta
+    // per frame than the single coalesced movementX/Y. PointerEvent also
+    // delivers fractional movementX from high-precision mice on Chromium.
+    __windowListeners.mousemove = (e) => {
+        const s = __dpr();
+        const locked = __isPointerLocked();
+        if (locked) {
+            // Accumulate over every sub-frame event the browser had
+            // queued up. `getCoalescedEvents` is the spec-correct way
+            // to access raw input between rAF ticks. Some browsers
+            // return an empty list — fall back to the top-level event.
+            let dx = 0, dy = 0;
+            const sub = (typeof e.getCoalescedEvents === "function")
+                ? e.getCoalescedEvents() : null;
+            if (sub && sub.length > 0) {
+                for (let i = 0; i < sub.length; i++) {
+                    dx += sub[i].movementX;
+                    dy += sub[i].movementY;
+                }
+            } else {
+                dx = e.movementX;
+                dy = e.movementY;
+            }
+            // No DPR scaling in pointer-lock: position is synthetic and
+            // there's no UI hit-test using it. DPR-scaling here only
+            // rounds finer movements up to multiples of `dpr`, which is
+            // pure quantization loss for aim controls.
+            __ptrLockSyntheticX += dx;
+            __ptrLockSyntheticY += dy;
+            fire("__crafterDom_mouseMove",
+                 [__ptrLockSyntheticX, __ptrLockSyntheticY]);
+        } else {
+            fire("__crafterDom_mouseMove", [e.clientX * s, e.clientY * s]);
+        }
+    };
+    __windowListeners.mousedown = (e) => {
+        // Right-click holds engage pointer lock — typical FPS-camera
+        // convention. Acquiring on any click (the previous policy) made
+        // menus annoying: clicking a button hid the cursor mid-flow. Now
+        // the cursor stays free for clicks/menus until the user holds
+        // RMB to actively look around. Browsers require lock requests
+        // from user gestures, which mousedown satisfies.
+        if (e.button === 2 && !__isPointerLocked()) {
+            const target = document.body;
+            if (target && target.requestPointerLock) {
+                target.requestPointerLock();
+                // Seed the synthetic position from the click point so
+                // there's no jump when the lock starts producing deltas.
+                __ptrLockSyntheticX = e.clientX;
+                __ptrLockSyntheticY = e.clientY;
+            }
+        }
+        fire("__crafterDom_mouseDown", [e.button]);
+    };
+    __windowListeners.mouseup = (e) => {
+        // Release lock on RMB up — cursor reappears at the seed point
+        // for clicks/menus until the next RMB hold.
+        if (e.button === 2 && __isPointerLocked()) {
+            document.exitPointerLock();
+        }
+        fire("__crafterDom_mouseUp", [e.button]);
+    };
     __windowListeners.wheel     = (e) => fire("__crafterDom_wheel",     [e.deltaY]);
     __windowListeners.contextmenu = (e) => { e.preventDefault(); };
+    __windowListeners.pointerlockchange = () => {
+        // Reset the synthetic accumulator when lock is released so the
+        // next acquisition starts cleanly. The C++ side will see one
+        // small jump back to the real cursor position on release.
+        if (!__isPointerLocked()) {
+            __ptrLockSyntheticX = 0;
+            __ptrLockSyntheticY = 0;
+        }
+    };
 
     // Keyboard events go through the document so they fire even when no
     // input element is focused. event.code is the layout-independent
@@ -400,16 +500,24 @@ function domAttachWindow(windowHandle) {
         __wasm().WasmFree(codePtr);
     };
 
-    __windowListeners.resize = () => fire("__crafterDom_resize", [window.innerWidth, window.innerHeight]);
+    __windowListeners.resize = () => {
+        const s = __dpr();
+        fire("__crafterDom_resize", [window.innerWidth * s, window.innerHeight * s]);
+    };
     __windowListeners.beforeunload = () => fire("__crafterDom_close", []);
 
-    document.addEventListener("mousemove",   __windowListeners.mousemove);
+    // pointermove (not mousemove) so the handler receives PointerEvents
+    // and can use getCoalescedEvents() to recover sub-frame motion. The
+    // handler's variable name stays "mousemove" — it's the same JS object,
+    // just bound to a different event type.
+    document.addEventListener("pointermove", __windowListeners.mousemove);
     document.addEventListener("mousedown",   __windowListeners.mousedown);
     document.addEventListener("mouseup",     __windowListeners.mouseup);
     document.addEventListener("wheel",       __windowListeners.wheel);
     document.addEventListener("contextmenu", __windowListeners.contextmenu);
     document.addEventListener("keydown",     __windowListeners.keydown);
     document.addEventListener("keyup",       __windowListeners.keyup);
+    document.addEventListener("pointerlockchange", __windowListeners.pointerlockchange);
     window  .addEventListener("resize",      __windowListeners.resize);
     window  .addEventListener("beforeunload",__windowListeners.beforeunload);
 }
@@ -418,8 +526,8 @@ function domSetTitle(titlePtr, titleLen) {
     document.title = __readUtf8(titlePtr, titleLen);
 }
 
-function domGetInnerWidth()  { return window.innerWidth;  }
-function domGetInnerHeight() { return window.innerHeight; }
+function domGetInnerWidth()  { return Math.round(window.innerWidth  * __dpr()); }
+function domGetInnerHeight() { return Math.round(window.innerHeight * __dpr()); }
 
 // ─── requestAnimationFrame loop ───────────────────────────────────────
 
diff --git a/additional/dom-webgpu.js b/additional/dom-webgpu.js
index 6e35c21..4ebb12a 100644
--- a/additional/dom-webgpu.js
+++ b/additional/dom-webgpu.js
@@ -41,14 +41,17 @@ function stub(name) {
     const e = window.crafter_webbuild_env;
     for (const n of [
         "wgpuGetCanvasWidth", "wgpuGetCanvasHeight", "wgpuSurfaceWidth", "wgpuSurfaceHeight",
-        "wgpuInit", "wgpuCreateBuffer", "wgpuWriteBuffer", "wgpuDestroyBuffer",
+        "wgpuInit", "wgpuCreateBuffer", "wgpuWriteBuffer", "wgpuWriteBufferRange",
+        "wgpuReadbackEnqueue", "wgpuReadbackPoll", "wgpuReadbackReady", "wgpuDestroyBuffer",
         "wgpuCreateAtlasTexture", "wgpuWriteAtlasRegion", "wgpuDestroyTexture",
         "wgpuCreateImage2D", "wgpuWriteImage2D",
         "wgpuCreateImage2DArray", "wgpuWriteImage2DLayer",
-        "wgpuCreateLinearClampSampler", "wgpuFrameBegin", "wgpuFrameEnd",
+        "wgpuCreateLinearClampSampler", "wgpuCreateLinearRepeatSampler",
+        "wgpuFrameBegin", "wgpuFrameEnd",
         "wgpuDispatchQuads", "wgpuDispatchCircles", "wgpuDispatchImages", "wgpuDispatchText",
         "wgpuLoadCustomShader", "wgpuDispatchCustom",
         "wgpuRegisterMeshBLAS", "wgpuLoadRTPipeline", "wgpuDispatchRT", "wgpuBuildTLAS",
+        "wgpuLoadComputePipeline", "wgpuDispatchCompute",
     ]) {
         // Read-write ints don't need a stub-throw; return 0 for the size queries.
         e[n] = n.endsWith("Width") || n.endsWith("Height")
@@ -81,14 +84,20 @@ document.body.style.margin = "0";
 document.body.appendChild(canvas);
 
 function syncCanvasSize() {
-    // Match canvas pixel size to its CSS pixel size 1:1 so MouseEvent
-    // clientX/clientY (CSS pixels) and the wasm-side window.width/.height
-    // share the same coordinate space. (HiDPI sharpness is a v2 concern
-    // — would need DPR on the GPU side AND a scaling step in the C++
-    // Window/Event glue so layout/hit-testing/dispatch counts stay
-    // consistent.)
-    const w = window.innerWidth;
-    const h = window.innerHeight;
+    // Canvas pixel size = CSS size × devicePixelRatio so the GPU draws
+    // at physical pixel resolution on HiDPI displays — otherwise the
+    // browser upscales whatever we rendered at logical size and the
+    // result looks blurry. The CSS rule still pins display size at
+    // 100vw/100vh, so the canvas paints its physical buffer back down
+    // into logical pixels with no perceived layout change.
+    //
+    // Mouse events arrive in CSS pixels; dom-env.js multiplies them by
+    // window.crafter_dpr before dispatching so the wasm-side hit tests
+    // share the physical-pixel coordinate space with window.width/.height.
+    const dpr = window.devicePixelRatio || 1;
+    window.crafter_dpr = dpr;
+    const w = Math.max(1, Math.round(window.innerWidth  * dpr));
+    const h = Math.max(1, Math.round(window.innerHeight * dpr));
     if (canvas.width !== w)  canvas.width  = w;
     if (canvas.height !== h) canvas.height = h;
     return { w, h };
@@ -101,7 +110,33 @@ if (!adapter) {
     console.error("[crafter-wgpu]", initError.message);
     throw initError;
 }
-const device  = await adapter.requestDevice();
+// Ask for everything the adapter is willing to give us, up to the values
+// the RT pipeline actually needs. The megakernel prelude declares 7
+// storage buffers at group(1) (tlasEntries / bvhNodes / meshRecords /
+// vertices / indices / primRemap / vertexAttribs); user pipelines like
+// 3DForts add more at group(2), and the WebGPU baseline of 8 isn't
+// enough. Adapters routinely report 10+ — clamp our request to whatever
+// the adapter actually supports so the call doesn't reject on baseline-
+// only devices. Same pattern for storage textures (we use 1 output image
+// per dispatch but headroom is cheap) and for the global storage-buffer
+// pool which is the per-pipeline count's parent budget.
+const adapterLimits = adapter.limits || {};
+const requiredLimits = {};
+const clamp = (name, want) => {
+    const cap = adapterLimits[name];
+    if (typeof cap === "number" && cap > 0) {
+        requiredLimits[name] = Math.min(want, cap);
+    }
+};
+clamp("maxStorageBuffersPerShaderStage", 16);
+clamp("maxStorageBuffersInPipelineLayout", 16);
+clamp("maxStorageTexturesPerShaderStage", 8);
+// The TLAS BVH build runs one workgroup of up to N threads in shared
+// memory (bitonic sort over morton codes + sweep-tree refit). Need the
+// per-workgroup invocation cap raised from the default 256.
+clamp("maxComputeInvocationsPerWorkgroup", 1024);
+clamp("maxComputeWorkgroupSizeX", 1024);
+const device  = await adapter.requestDevice({ requiredLimits });
 const queue   = device.queue;
 const ctx     = canvas.getContext("webgpu");
 const canvasFormat = "rgba8unorm"; // match storage textures, skip swizzle blit
@@ -522,6 +557,133 @@ env.wgpuWriteBuffer = (handle, srcPtr, byteSize) => {
     const aligned = (byteSize + 3) & ~3;
     queue.writeBuffer(buf, 0, memU8().buffer, srcPtr, aligned);
 };
+// Partial write — copies a sub-range of `srcPtr` into the GPU buffer at
+// `dstByteOffset`. Used by BuildTLAS to skip the transform field of
+// GPU-owned instances; the physics-tlas-transform compute shader is the
+// sole writer of those bytes and we must not clobber its output with a
+// stale CPU mirror.
+env.wgpuWriteBufferRange = (handle, dstByteOffset, srcPtr, byteSize) => {
+    const buf = buffers.get(handle);
+    if (!buf) return;
+    const aligned = (byteSize + 3) & ~3;
+    queue.writeBuffer(buf, dstByteOffset, memU8().buffer, srcPtr, aligned);
+};
+
+// ── GPU→CPU readback (staging + mapAsync) ──────────────────────────────
+//
+// WebGPU storage buffers can't be CPU-mapped directly (STORAGE usage is
+// incompatible with MAP_READ). Each readback keeps a parallel staging
+// buffer (MAP_READ | COPY_DST) at the same size. wgpuReadbackEnqueue
+// copies the storage buffer into the staging buffer and kicks off an
+// async map. wgpuReadbackPoll synchronously returns whether the map has
+// resolved; if so it copies the bytes into the caller's wasm pointer
+// and the slot is ready for the next Enqueue.
+//
+// Used by Forts3D's physics event drain to read the GPU-written
+// destroy/hit/splash event queues with a one-frame latency.
+
+const READBACK_IDLE    = 0;
+const READBACK_PENDING = 1;
+const READBACK_READY   = 2;
+const readbacks = new Map();  // device-buffer handle → { staging, size, state, pendingData }
+// Readbacks scheduled this frame that still need their mapAsync kicked
+// off — done after the frame's queue.submit so the map waits for the
+// compute writes that wrote to `buf` to finish, not just the standalone
+// copy encoder.
+const pendingReadbackMaps = [];
+
+env.wgpuReadbackEnqueue = (handle, byteSize, resetBytes) => {
+    const buf = buffers.get(handle);
+    if (!buf) return;
+    const aligned = (byteSize + 3) & ~3;
+    const resetAligned = resetBytes > 0 ? ((resetBytes + 3) & ~3) : 0;
+    let rb = readbacks.get(handle);
+    if (!rb) {
+        rb = {
+            staging: device.createBuffer({
+                size: Math.max(16, aligned),
+                usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
+            }),
+            size: aligned,
+            state: READBACK_IDLE,
+            pendingData: null,
+        };
+        readbacks.set(handle, rb);
+    }
+    if (rb.state !== READBACK_IDLE) {
+        // Previous map still in flight (or has data nobody polled yet);
+        // skip this enqueue AND the paired reset. Events written by the
+        // current frame's substeps will accumulate in `buf` and get
+        // picked up by the next successful enqueue (which captures the
+        // post-accumulation count and then clears it). Resetting here
+        // without copying would wipe those events before they could be
+        // drained, causing destroyed projectiles to never reach the CPU.
+        return;
+    }
+
+    if (state.encoder) {
+        // Mid-frame: piggy-back the copy on the frame's main encoder so
+        // it runs AFTER the compute dispatches that wrote `buf`. End the
+        // active pass, encode the copy (+ optional reset), reopen the
+        // pass. mapAsync gets deferred to wgpuFrameEnd (after
+        // queue.submit) — the map resolves once the GPU has finished
+        // the submitted work, including our just-encoded copy.
+        if (state.pass) {
+            state.pass.end();
+            state.pass = null;
+        }
+        state.encoder.copyBufferToBuffer(buf, 0, rb.staging, 0, aligned);
+        if (resetAligned > 0) {
+            // Clear the first `resetAligned` bytes of `buf` so the next
+            // frame's atomic-add starts at 0. Encoded after the copy,
+            // so the staging captures the pre-clear count. Tied to a
+            // successful enqueue — if the enqueue was skipped above the
+            // clear is skipped too, preserving events for the next
+            // successful drain.
+            state.encoder.clearBuffer(buf, 0, resetAligned);
+        }
+        state.pass = state.encoder.beginComputePass();
+        rb.state = READBACK_PENDING;
+        pendingReadbackMaps.push(rb);
+    } else {
+        // Standalone (no frame in progress): submit our own encoder and
+        // kick off mapAsync immediately.
+        const enc = device.createCommandEncoder();
+        enc.copyBufferToBuffer(buf, 0, rb.staging, 0, aligned);
+        if (resetAligned > 0) enc.clearBuffer(buf, 0, resetAligned);
+        queue.submit([enc.finish()]);
+        rb.state = READBACK_PENDING;
+        rb.staging.mapAsync(GPUMapMode.READ).then(() => {
+            rb.pendingData = new Uint8Array(rb.staging.getMappedRange()).slice();
+            rb.staging.unmap();
+            rb.state = READBACK_READY;
+        }).catch(e => {
+            console.error("[crafter-wgpu] readback mapAsync failed:", e);
+            rb.state = READBACK_IDLE;
+        });
+    }
+};
+
+// Returns 1 if the readback for `handle` has completed and bytes were
+// copied into `dstPtr`; returns 0 otherwise (caller retries next frame).
+// After a successful poll the slot is idle again, ready for the next
+// Enqueue.
+env.wgpuReadbackPoll = (handle, dstPtr, byteSize) => {
+    const rb = readbacks.get(handle);
+    if (!rb || rb.state !== READBACK_READY) return 0;
+    memU8().set(rb.pendingData.subarray(0, byteSize), dstPtr);
+    rb.pendingData = null;
+    rb.state = READBACK_IDLE;
+    return 1;
+};
+// Non-consuming readiness check. Lets callers verify multiple readbacks
+// have all resolved before consuming any — needed when a logical drain
+// reads parallel header/array buffers and mustn't ack the header until
+// the array bytes are also available, or the events get lost.
+env.wgpuReadbackReady = (handle) => {
+    const rb = readbacks.get(handle);
+    return (rb && rb.state === READBACK_READY) ? 1 : 0;
+};
 env.wgpuDestroyBuffer = (handle) => {
     const buf = buffers.get(handle);
     if (buf) { buf.destroy(); buffers.delete(handle); }
@@ -596,32 +758,42 @@ env.wgpuCreateImage2D = (w, h) => {
     textureViews.set(handle, tex.createView());
     return handle;
 };
-// 2D texture array — N layers of identical (w × h) rgba8unorm. Used by
-// Image2DArray<RGBA8> to back one material albedo per layer; shaders
-// sample with `textureSampleLevel(tex, samp, uv, layerIdx, 0.0)`.
-env.wgpuCreateImage2DArray = (w, h, layerCount) => {
+// 2D texture array — N layers of identical (w × h) rgba8unorm with
+// `mipLevels` mip levels. Pass mipLevels=1 for a single-level texture.
+// Used by Image2DArray<RGBA8> to back one material albedo per layer;
+// shaders sample with `textureSampleLevel(tex, samp, uv, layerIdx, lod)`
+// where lod ∈ [0, mipLevels-1].
+env.wgpuCreateImage2DArray = (w, h, layerCount, mipLevels) => {
     const handle = newHandle();
+    const mips = (typeof mipLevels === "number" && mipLevels > 0) ? mipLevels : 1;
     const tex = device.createTexture({
         size: [w, h, layerCount],
         dimension: "2d",
         format: "rgba8unorm",
+        mipLevelCount: mips,
         usage: GPUTextureUsage.TEXTURE_BINDING | GPUTextureUsage.COPY_DST,
     });
     textures.set(handle, tex);
     textureViews.set(handle, tex.createView({
         dimension: "2d-array",
         arrayLayerCount: layerCount,
+        mipLevelCount: mips,
     }));
     return handle;
 };
-env.wgpuWriteImage2DLayer = (handle, layer, srcPtr, byteSize, w, h) => {
+// Upload a single mip level of one array layer. `level` indexes into the
+// texture's mip chain; `w` / `h` are the dimensions at that level (= base
+// dimensions >> level). Caller supplies the pre-downsampled bytes for
+// each level — Image2DArray::UpdateLayer on the C++ side does the box-
+// filter chain.
+env.wgpuWriteImage2DLayer = (handle, layer, level, srcPtr, byteSize, w, h) => {
     const tex = textures.get(handle);
     if (!tex) return;
     const srcBPR = w * 4;
     const alignedBPR = (srcBPR + 255) & ~255;
     if (alignedBPR === srcBPR) {
         queue.writeTexture(
-            { texture: tex, origin: [0, 0, layer] },
+            { texture: tex, mipLevel: level, origin: [0, 0, layer] },
             memU8().subarray(srcPtr, srcPtr + byteSize),
             { bytesPerRow: srcBPR, rowsPerImage: h },
             { width: w, height: h, depthOrArrayLayers: 1 }
@@ -634,7 +806,7 @@ env.wgpuWriteImage2DLayer = (handle, layer, srcPtr, byteSize, w, h) => {
                         y * alignedBPR);
         }
         queue.writeTexture(
-            { texture: tex, origin: [0, 0, layer] },
+            { texture: tex, mipLevel: level, origin: [0, 0, layer] },
             staging,
             { bytesPerRow: alignedBPR, rowsPerImage: h },
             { width: w, height: h, depthOrArrayLayers: 1 }
@@ -684,6 +856,16 @@ env.wgpuCreateLinearClampSampler = () => {
     return handle;
 };
 
+env.wgpuCreateLinearRepeatSampler = () => {
+    const handle = newHandle();
+    samplers.set(handle, device.createSampler({
+        magFilter: "linear", minFilter: "linear",
+        mipmapFilter: "linear",
+        addressModeU: "repeat", addressModeV: "repeat",
+    }));
+    return handle;
+};
+
 // ─── per-frame ──────────────────────────────────────────────────────────
 
 env.wgpuFrameBegin = () => {
@@ -739,12 +921,34 @@ env.wgpuFrameEnd = () => {
     );
     queue.submit([state.encoder.finish()]);
     state.encoder = null;
+
+    // Kick off mapAsync for the readbacks whose copyBufferToBuffer we
+    // piggy-backed onto the just-submitted encoder. Doing this after
+    // submit ensures the map waits for that submission's GPU work to
+    // complete, so the staging buffer reflects this frame's compute
+    // writes (not pre-substep state).
+    while (pendingReadbackMaps.length > 0) {
+        const rb = pendingReadbackMaps.pop();
+        rb.staging.mapAsync(GPUMapMode.READ).then(() => {
+            rb.pendingData = new Uint8Array(rb.staging.getMappedRange()).slice();
+            rb.staging.unmap();
+            rb.state = READBACK_READY;
+        }).catch(e => {
+            console.error("[crafter-wgpu] readback mapAsync failed:", e);
+            rb.state = READBACK_IDLE;
+        });
+    }
 };
 
-// Write a 48-byte UIDispatchHeader into the ring buffer at the current
-// offset (which is incremented and 256-aligned). Returns the dynamic
-// offset to pass to setBindGroup.
-function writeHeader(headerPtr) {
+// Write a push struct into the ring buffer at the current offset (which
+// is incremented and 256-aligned). Standard dispatches pass just the
+// 48-byte UIDispatchHeader; custom shaders may pass up to HEADER_ALIGN
+// bytes — anything past the header is the user's per-dispatch struct,
+// declared in WGSL as additional fields after UIDispatchHeader at
+// @group(0) @binding(0). Returns the dynamic offset to pass to
+// setBindGroup.
+function writeHeader(headerPtr, bytes = 48) {
+    const upload = Math.min(bytes, HEADER_ALIGN);
     const offset = state.headerRingOffset;
     if (offset + HEADER_ALIGN > state.headerRingSize) {
         // Ring is small enough that overrun in one frame means too many
@@ -753,7 +957,7 @@ function writeHeader(headerPtr) {
         state.headerRingOffset = 0;
     }
     queue.writeBuffer(state.headerRing, state.headerRingOffset,
-                      memU8().buffer, headerPtr, 48);
+                      memU8().buffer, headerPtr, upload);
     state.headerRingOffset += HEADER_ALIGN;
     return offset;
 }
@@ -808,7 +1012,12 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue
     // library. The user shader can declare its own group 0 / 2+ bindings
     // but MUST NOT redeclare group(1) — that's reserved for RT data.
     const wgsl = rayQueryFlag
-        ? (rtWgslTypes + rtWgslMegakernelBindings + rtWgslRayQueryLib + "\n" + userWgsl)
+        // rayQueryLib's _rqTraverseBlas/_rqTraverseTlas call _rtAabb,
+        // _rtFetchTri, _rtTri from rtWgslPureHelpers — must prepend
+        // the pure helper subset (NOT the megakernel-only traversal
+        // routines, which reference user-emitted runAnyHit/runMiss/
+        // runClosestHit and won't compile outside the raygen pipeline).
+        ? (rtWgslTypes + rtWgslMegakernelBindings + rtWgslPureHelpers + rtWgslRayQueryLib + "\n" + userWgsl)
         : userWgsl;
 
     const bindings = [];
@@ -836,10 +1045,17 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue
     // Group 0 = header uniform (same for both paths).
     // Group 1 = ping-pong out+prev OR RT data (TLAS, BVH, meshRecs, verts,
     // idx, primRemap, outImage) when rayQuery flag is on.
+    //
+    // Custom shaders may declare a struct at @group(0) @binding(0) up to
+    // HEADER_ALIGN (256) bytes — the standard UIDispatchHeader is 48 but
+    // the wgpuDispatchCustom path uploads the full push buffer so user
+    // shaders can read extra fields past the header. minBindingSize=0
+    // disables the bound-size lower bound; the actual entry is sized to
+    // HEADER_ALIGN at bind time.
     const bgls = [
         device.createBindGroupLayout({ entries: [
             { binding: 0, visibility: GPUShaderStage.COMPUTE,
-              buffer: { type: "uniform", hasDynamicOffset: true, minBindingSize: 48 } },
+              buffer: { type: "uniform", hasDynamicOffset: true, minBindingSize: 0 } },
         ]}),
         rayQueryFlag
             ? device.createBindGroupLayout({ entries: [
@@ -852,6 +1068,8 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue
                 { binding: 6, visibility: GPUShaderStage.COMPUTE,
                   storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } },
                 { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
             ]})
             : device.createBindGroupLayout({ entries: [
                 { binding: 0, visibility: GPUShaderStage.COMPUTE,
@@ -892,9 +1110,14 @@ env.wgpuLoadCustomShader = (wgslPtr, wgslLen, bindingsPtr, bindingsCount, rayQue
         return 0;
     }
 
+    // Bind the full ring slot (HEADER_ALIGN bytes) so custom shaders can
+    // declare a struct at @group(0) @binding(0) that's larger than the
+    // 48-byte UIDispatchHeader. Standard pipelines keep their tight 48-
+    // byte slice. The bgl0 layout's minBindingSize stays at 48 — bigger
+    // is allowed.
     const hdrBG = device.createBindGroup({
         layout: bgls[0],
-        entries: [{ binding: 0, resource: { buffer: state.headerRing, offset: 0, size: 48 } }],
+        entries: [{ binding: 0, resource: { buffer: state.headerRing, offset: 0, size: HEADER_ALIGN } }],
     });
 
     const handle = newHandle();
@@ -912,8 +1135,10 @@ env.wgpuDispatchCustom = (pipelineHandle, pushPtr, pushBytes, handlesPtr, handle
         return;
     }
 
-    // Write header (first 48 bytes of push).
-    const off = writeHeader(pushPtr);
+    // Write the full push struct (UIDispatchHeader + any custom tail
+    // fields the shader declares) into the ring. WGSL's bound struct
+    // size decides how many bytes are actually read on the GPU.
+    const off = writeHeader(pushPtr, pushBytes);
 
     state.pass.setPipeline(pipe.pipeline);
     state.pass.setBindGroup(0, pipe.hdrBG, [off]);
@@ -926,6 +1151,12 @@ env.wgpuDispatchCustom = (pipelineHandle, pushPtr, pushBytes, handlesPtr, handle
             return;
         }
         const outView = state.outIsPing ? state.pingView : state.pongView;
+        const orderBuf = buffers.get(rtState.currentEntryOrder);
+        const bvhBuf = buffers.get(rtState.currentBvh);
+        if (!orderBuf || !bvhBuf) {
+            console.error("[crafter-wgpu] wgpuDispatchCustom rayQuery: no entryOrder/bins (TLAS not built)");
+            return;
+        }
         const rtBG = device.createBindGroup({
             layout: pipe.bgls[1],
             entries: [
@@ -937,6 +1168,8 @@ env.wgpuDispatchCustom = (pipelineHandle, pushPtr, pushBytes, handlesPtr, handle
                 { binding: 5, resource: { buffer: rtState.primRemapHeap.gpu } },
                 { binding: 6, resource: outView },
                 { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } },
+                { binding: 8, resource: { buffer: orderBuf } },
+                { binding: 9, resource: { buffer: bvhBuf } },
             ],
         });
         state.pass.setBindGroup(1, rtBG);
@@ -1145,15 +1378,32 @@ const RT_INTERSECTION_TRIANGLE: u32 = 1u;
 // touch them — declaring them would push it past 8 storage buffers per
 // stage on the WebGPU baseline.
 const rtWgslMegakernelBindings = String.raw`
-@group(0) @binding(0) var<uniform>       hdr         : RTDispatchHeader;
-@group(1) @binding(0) var<storage,read>  tlasEntries : array<TLASEntry>;
-@group(1) @binding(1) var<storage,read>  bvhNodes    : array<BVHNode>;
-@group(1) @binding(2) var<storage,read>  meshRecords : array<MeshRecord>;
-@group(1) @binding(3) var<storage,read>  vertices    : array<f32>;
-@group(1) @binding(4) var<storage,read>  indices     : array<u32>;
-@group(1) @binding(5) var<storage,read>  primRemap   : array<u32>;
+@group(0) @binding(0) var<uniform>       hdr           : RTDispatchHeader;
+@group(1) @binding(0) var<storage,read>  tlasEntries   : array<TLASEntry>;
+@group(1) @binding(1) var<storage,read>  bvhNodes      : array<BVHNode>;
+@group(1) @binding(2) var<storage,read>  meshRecords   : array<MeshRecord>;
+@group(1) @binding(3) var<storage,read>  vertices      : array<f32>;
+@group(1) @binding(4) var<storage,read>  indices       : array<u32>;
+@group(1) @binding(5) var<storage,read>  primRemap     : array<u32>;
 @group(1) @binding(6) var outImage : texture_storage_2d<rgba8unorm, write>;
 @group(1) @binding(7) var<storage,read>  vertexAttribs : array<u32>;
+// TLAS Morton-sorted permutation: tlasEntryOrder[i] gives the
+// tlasEntries[] index that BVH leaf i should sample.
+@group(1) @binding(8) var<storage,read>  tlasEntryOrder : array<u32>;
+// Sweep-tree BVH built by the LBVH-build pass. 2 * N_PADDED - 1
+// nodes = 2047 for N_PADDED = 1024. Internal nodes at [0, N_PADDED - 1);
+// leaves at [N_PADDED - 1, 2 * N_PADDED - 1). For internal node i,
+// children are 2i+1 and 2i+2 (implicit perfect binary tree). Each node
+// stores just its world-space AABB.
+struct BvhNode {
+    aabbMin: vec3<f32>,
+    _pad0:   u32,
+    aabbMax: vec3<f32>,
+    _pad1:   u32,
+};
+@group(1) @binding(9) var<storage,read>  tlasBvhNodes : array<BvhNode>;
+const TLAS_BVH_N_PADDED: u32 = 16384u;
+const TLAS_BVH_LEAVES_START: u32 = TLAS_BVH_N_PADDED - 1u;
 `;
 
 const rtWgslPrelude = rtWgslTypes + rtWgslMegakernelBindings;
@@ -1162,7 +1412,10 @@ const rtWgslPrelude = rtWgslTypes + rtWgslMegakernelBindings;
 // Injected after the user-supplied closesthit/anyhit/miss sources +
 // mega-switch dispatchers (which PipelineRTWebGPU emits). User raygen
 // sources sit after this block so they can call traceRay.
-const rtWgslHelpers = String.raw`
+// The "pure" subset of the RT helpers — no calls into runAnyHit /
+// runClosestHit / traceRay, so this can be prepended ahead of compute
+// pipelines using rayQuery without dragging in megakernel-only symbols.
+const rtWgslPureHelpers = String.raw`
 fn _rtFetchTri(meshRec: MeshRecord, triIndex: u32) -> array<vec3<f32>, 3> {
     let baseIdx = meshRec.indexOffset + triIndex * 3u;
     let i0 = indices[baseIdx + 0u];
@@ -1180,6 +1433,12 @@ fn _rtFetchTri(meshRec: MeshRecord, triIndex: u32) -> array<vec3<f32>, 3> {
 }
 
 fn _rtAabb(ro: vec3<f32>, invRd: vec3<f32>, mn: vec3<f32>, mx: vec3<f32>, tMax: f32) -> bool {
+    // Reject degenerate (mn > mx) boxes outright. The min(t0,t1)/
+    // max(t0,t1) trick below silently re-orients an inverted box
+    // and would otherwise return true for sentinel-padded BVH leaves
+    // — letting rays "hit" empty slots and accidentally re-traverse
+    // instance 0 via the OOB outOrder → tlasEntries[0] path.
+    if (any(mn > mx)) { return false; }
     let t0 = (mn - ro) * invRd;
     let t1 = (mx - ro) * invRd;
     let tmin = min(t0, t1);
@@ -1211,7 +1470,13 @@ fn _rtTri(ro: vec3<f32>, rd: vec3<f32>, p0: vec3<f32>, p1: vec3<f32>, p2: vec3<f
     r.hit = true; r.t = t; r.u = u; r.v = v;
     return r;
 }
+`;
 
+// Megakernel-only helpers: traversal routines that invoke runAnyHit /
+// runClosestHit / runMiss (emitted by the megakernel SBT switch) and
+// `traceRay` that closes over them. Only the raygen-pipeline path
+// prepends this.
+const rtWgslMegakernelHelpers = String.raw`
 // Iterative stack-based BLAS traversal. Returns true if traversal was
 // terminated by an END_SEARCH from anyhit (caller should stop entirely).
 fn _rtTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord,
@@ -1293,12 +1558,33 @@ fn _rtTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32,
                    bestT:   ptr<function, f32>,
                    payload: ptr<function, Payload>) -> bool {
     let invD = vec3<f32>(1.0) / rayWorld.direction;
-    let n = hdr.instanceCount;
-    for (var i: u32 = 0u; i < n; i = i + 1u) {
-        let inst = tlasEntries[i];
-        let instanceMask = inst.maskHGOffset & 0xFFu;
-        if ((instanceMask & cullMask) == 0u) { continue; }
-        if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; }
+    // Stack-based descent of the sweep-tree BVH. Internal nodes
+    // [0, TLAS_BVH_LEAVES_START); leaves [LEAVES_START, 2*N_PADDED-1).
+    // Node i's children are 2i+1 / 2i+2 (implicit perfect binary tree).
+    // Stack depth = tree depth = log2(N_PADDED) = 14 for N_PADDED=16384;
+    // 24 gives generous headroom.
+    var stack: array<u32, 24>;
+    var sp: u32 = 0u;
+    stack[sp] = 0u; sp = sp + 1u;
+    loop {
+        if (sp == 0u) { break; }
+        sp = sp - 1u;
+        let nodeIdx = stack[sp];
+        let node = tlasBvhNodes[nodeIdx];
+        if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, *bestT)) {
+            continue;
+        }
+        if (nodeIdx >= TLAS_BVH_LEAVES_START) {
+            // Leaf: resolve entry, do the existing per-instance test.
+            let leafIdx = nodeIdx - TLAS_BVH_LEAVES_START;
+            let i = tlasEntryOrder[leafIdx];
+            // Sentinel-padded leaves get instanceMask=0; cullMask check
+            // (and degenerate AABB above) means they fall out cheaply.
+            if (i == 0xFFFFFFFFu) { continue; }
+            let inst = tlasEntries[i];
+            let instanceMask = inst.maskHGOffset & 0xFFu;
+            if ((instanceMask & cullMask) == 0u) { continue; }
+            if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, *bestT)) { continue; }
 
         // Transform ray to object space.
         let r0 = inst.worldToObjectR0;
@@ -1346,6 +1632,15 @@ fn _rtTraverseTlas(rayWorld: RayDesc, flags: u32, cullMask: u32,
             (*bestHit).customIndex     = inst.customIndex;
             if ((effective & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u) { return true; }
         }
+        } else {
+            // Internal node: push both children (skip overflow).
+            let left  = 2u * nodeIdx + 1u;
+            let right = 2u * nodeIdx + 2u;
+            if (sp + 1u < 24u) {
+                stack[sp] = right; sp = sp + 1u;
+                stack[sp] = left;  sp = sp + 1u;
+            }
+        }
     }
     return false;
 }
@@ -1471,44 +1766,66 @@ fn _rqTraverseBlas(rayObj: RayDesc, flags: u32, meshRec: MeshRecord,
 fn _rqTraverseTlas(rq: ptr<function, RayQuery>) {
     let rayWorld = (*rq).ray;
     let invD = vec3<f32>(1.0) / rayWorld.direction;
-    let n = hdr.instanceCount;
     let cullMask = (*rq).cullMask;
     let rayFlags = (*rq).flags;
-    for (var i: u32 = 0u; i < n; i = i + 1u) {
-        let inst = tlasEntries[i];
-        let instanceMask = inst.maskHGOffset & 0xFFu;
-        if ((instanceMask & cullMask) == 0u) { continue; }
-        if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, (*rq).committedT)) { continue; }
-
-        let r0 = inst.worldToObjectR0;
-        let r1 = inst.worldToObjectR1;
-        let r2 = inst.worldToObjectR2;
-        var rayObj: RayDesc;
-        rayObj.origin = vec3<f32>(
-            dot(r0.xyz, rayWorld.origin) + r0.w,
-            dot(r1.xyz, rayWorld.origin) + r1.w,
-            dot(r2.xyz, rayWorld.origin) + r2.w,
-        );
-        rayObj.direction = vec3<f32>(
-            dot(r0.xyz, rayWorld.direction),
-            dot(r1.xyz, rayWorld.direction),
-            dot(r2.xyz, rayWorld.direction),
-        );
-        rayObj.tMin = rayWorld.tMin;
-        rayObj.tMax = (*rq).committedT;
-
-        var effective = rayFlags;
-        let iflags = inst.instanceFlags;
-        if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) {
-            effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES);
+    // Stack-based BVH descent — same shape as _rtTraverseTlas.
+    var stack: array<u32, 24>;
+    var sp: u32 = 0u;
+    stack[sp] = 0u; sp = sp + 1u;
+    loop {
+        if (sp == 0u) { break; }
+        sp = sp - 1u;
+        let nodeIdx = stack[sp];
+        let node = tlasBvhNodes[nodeIdx];
+        if (!_rtAabb(rayWorld.origin, invD, node.aabbMin, node.aabbMax, (*rq).committedT)) {
+            continue;
         }
+        if (nodeIdx >= TLAS_BVH_LEAVES_START) {
+            let leafIdx = nodeIdx - TLAS_BVH_LEAVES_START;
+            let i = tlasEntryOrder[leafIdx];
+            if (i == 0xFFFFFFFFu) { continue; }
+            let inst = tlasEntries[i];
+            let instanceMask = inst.maskHGOffset & 0xFFu;
+            if ((instanceMask & cullMask) == 0u) { continue; }
+            if (!_rtAabb(rayWorld.origin, invD, inst.aabbMin, inst.aabbMax, (*rq).committedT)) { continue; }
 
-        let meshRec = meshRecords[inst.blasMeshIdx];
-        _rqTraverseBlas(rayObj, effective, meshRec, i, inst.customIndex, inst, rq);
+            let r0 = inst.worldToObjectR0;
+            let r1 = inst.worldToObjectR1;
+            let r2 = inst.worldToObjectR2;
+            var rayObj: RayDesc;
+            rayObj.origin = vec3<f32>(
+                dot(r0.xyz, rayWorld.origin) + r0.w,
+                dot(r1.xyz, rayWorld.origin) + r1.w,
+                dot(r2.xyz, rayWorld.origin) + r2.w,
+            );
+            rayObj.direction = vec3<f32>(
+                dot(r0.xyz, rayWorld.direction),
+                dot(r1.xyz, rayWorld.direction),
+                dot(r2.xyz, rayWorld.direction),
+            );
+            rayObj.tMin = rayWorld.tMin;
+            rayObj.tMax = (*rq).committedT;
 
-        if ((rayFlags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u
-            && (*rq).committedType != RT_INTERSECTION_NONE) {
-            return;
+            var effective = rayFlags;
+            let iflags = inst.instanceFlags;
+            if ((iflags & RT_INSTANCE_TRIANGLE_FACING_CULL_DISABLE) != 0u) {
+                effective = effective & ~(RT_FLAG_CULL_BACK_FACING_TRIANGLES | RT_FLAG_CULL_FRONT_FACING_TRIANGLES);
+            }
+
+            let meshRec = meshRecords[inst.blasMeshIdx];
+            _rqTraverseBlas(rayObj, effective, meshRec, i, inst.customIndex, inst, rq);
+
+            if ((rayFlags & RT_FLAG_TERMINATE_ON_FIRST_HIT) != 0u
+                && (*rq).committedType != RT_INTERSECTION_NONE) {
+                return;
+            }
+        } else {
+            let left  = 2u * nodeIdx + 1u;
+            let right = 2u * nodeIdx + 2u;
+            if (sp + 1u < 24u) {
+                stack[sp] = right; sp = sp + 1u;
+                stack[sp] = left;  sp = sp + 1u;
+            }
         }
     }
 }
@@ -1550,6 +1867,13 @@ fn rayQueryGetCommittedWorldPosition(rq: ptr<function, RayQuery>)            ->
 }
 `;
 
+// TLAS spatial-partition bin. 64 of these per frame partition instances
+// by the top 6 bits of their Morton code. RT traversal walks bin AABBs
+// first; only bins a ray hits get their per-instance loop. Layout
+// matches the WGSL struct below.
+const TLAS_BIN_COUNT = 64;
+const TLAS_BIN_SIZE  = 32;
+
 // ── Internal compute pipeline: builds TLASEntry[] from the RTInstance[]
 // + the meshRecords table. One thread per instance.
 // Uses only rtWgslTypes (no megakernel bindings) so it stays well under
@@ -1564,9 +1888,32 @@ struct RTInstance {
     accelStructureRef:  vec2<u32>,
 };
 
-@group(0) @binding(0) var<storage, read>       inInstances : array<RTInstance>;
-@group(0) @binding(1) var<storage, read>       inMeshes    : array<MeshRecord>;
-@group(0) @binding(2) var<storage, read_write> outEntries  : array<TLASEntry>;
+@group(0) @binding(0) var<storage, read>       inInstances  : array<RTInstance>;
+@group(0) @binding(1) var<storage, read>       inMeshes     : array<MeshRecord>;
+@group(0) @binding(2) var<storage, read_write> outEntries   : array<TLASEntry>;
+@group(0) @binding(3) var<storage, read_write> outOrder     : array<u32>;     // identity perm for stage 1
+@group(0) @binding(4) var<storage, read_write> outMorton    : array<u32>;     // 30-bit Morton from world-AABB centroid
+
+// Spread the low 10 bits of v across every 3rd bit (so three of these
+// interleave into a 30-bit Morton code). Standard "magic number" version.
+fn _expandBits10(v0: u32) -> u32 {
+    var v = v0 & 0x000003FFu;
+    v = (v * 0x00010001u) & 0xFF0000FFu;
+    v = (v * 0x00000101u) & 0x0F00F00Fu;
+    v = (v * 0x00000011u) & 0xC30C30C3u;
+    v = (v * 0x00000005u) & 0x49249249u;
+    return v;
+}
+fn _mortonCode3D(c: vec3<f32>) -> u32 {
+    // c assumed in [0, 1]^3.
+    let q = clamp(c, vec3<f32>(0.0), vec3<f32>(1.0));
+    let xi = u32(q.x * 1023.0);
+    let yi = u32(q.y * 1023.0);
+    let zi = u32(q.z * 1023.0);
+    return (_expandBits10(xi) << 2u)
+         | (_expandBits10(yi) << 1u)
+         |  _expandBits10(zi);
+}
 
 fn _invMat3(c0: vec3<f32>, c1: vec3<f32>, c2: vec3<f32>) -> array<vec3<f32>, 3> {
     let m00 = c1.y*c2.z - c1.z*c2.y;
@@ -1630,6 +1977,362 @@ fn tlasBuildMain(@builtin(global_invocation_id) gid: vec3<u32>) {
     e.customIndex     = custom;
     e.instanceFlags   = iflags;
     outEntries[i]     = e;
+
+    // LBVH pre-pass output:
+    //   outOrder[i] = i        — identity permutation (Stage 1; the
+    //                            radix sort in Stage 2 rewrites this).
+    //   outMorton[i] = morton(centroid normalized to [0,1]^3 over the
+    //                  fixed world bound).
+    // Fixed bound is conservative for 3DForts (map ~2000 units, fort ~50).
+    // Stage 2 swaps to a true scene-AABB reduction.
+    outOrder[i] = i;
+    let centroid = (worldMin + worldMax) * 0.5;
+    let worldHalfExtent = 4000.0;
+    let normalized = (centroid + vec3<f32>(worldHalfExtent)) * (1.0 / (2.0 * worldHalfExtent));
+    outMorton[i] = _mortonCode3D(normalized);
+}
+`;
+
+// TLAS LBVH build — single dispatch of one 1024-thread workgroup.
+// Phases:
+//   0. Scene AABB reduction (256-wide tree reduce).
+//   1. Pack (morton16 << 16) | tlasIndex16 keys into sortA. Sentinel
+//      slots get 0xFFFFFFFF so they sort to the end; m16 is clamped to
+//      0xFFFE so no real key collides with the sentinel.
+//   2. LSD radix sort — 8 passes × 4 bits, ping-pong sortA ↔ sortB.
+//      STAGE 1: single-thread (thread 0) sequential scatter for stable
+//      ordering. This is the slow-but-trivially-correct baseline; Stage
+//      2 will parallelize the scatter using per-thread local histograms
+//      + a cross-thread scan.
+//   3. Write sorted instance permutation into outOrder.
+//   4. Initialize BVH leaf AABBs from sorted instances.
+//   5. Bottom-up sweep-tree refit, log2(N_PADDED) levels.
+//
+// Storage-barrier pattern: workgroupBarrier() fences workgroup memory
+// only per WGSL spec; storageBarrier() is required between R/W phases
+// on sortA/B and outBvh. Both are called at every storage boundary —
+// minor perf cost, eliminates the class of bug that hung the GPU on
+// the previous radix attempt.
+//
+// Hard cap: LBVH_MAX = 16384. Parallel scatter (per-bucket Hillis-Steele
+// scan over 1024-lane indicators) made the build cost flat at ~0.5 ms
+// regardless of N_PADDED, and the degenerate-AABB fix in _rtAabb keeps
+// sentinel-only subtrees from being traversed. Per-ray cost scales with
+// log2(N_real), not log2(N_PADDED).
+const LBVH_MAX = 16384;
+const lbvhBuildWgsl = String.raw`
+struct TLASEntryStub {
+    aabbMin:        vec3<f32>,
+    maskHGOffset:   u32,
+    aabbMax:        vec3<f32>,
+    blasMeshIdx:    u32,
+    objectToWorldR0: vec4<f32>,
+    objectToWorldR1: vec4<f32>,
+    objectToWorldR2: vec4<f32>,
+    worldToObjectR0: vec4<f32>,
+    worldToObjectR1: vec4<f32>,
+    worldToObjectR2: vec4<f32>,
+    customIndex:    u32,
+    instanceFlags:  u32,
+    _pad0:          u32,
+    _pad1:          u32,
+};
+struct BvhNode {
+    aabbMin: vec3<f32>,
+    _pad0:   u32,
+    aabbMax: vec3<f32>,
+    _pad1:   u32,
+};
+
+@group(0) @binding(0) var<storage, read>       entries  : array<TLASEntryStub>;
+@group(0) @binding(1) var<storage, read_write> outOrder : array<u32>;
+@group(0) @binding(2) var<storage, read_write> outBvh   : array<BvhNode>;
+// Radix-sort ping-pong buffers. One u32 per element — the packed
+// (morton16 << 16) | tlasIndex16 key. Sized for N_PADDED.
+@group(0) @binding(3) var<storage, read_write> sortA    : array<u32>;
+@group(0) @binding(4) var<storage, read_write> sortB    : array<u32>;
+// Real instance count. Passed as a uniform so the entries / outOrder /
+// sortA / sortB / outBvh buffers can be allocated ONCE at N_PADDED and
+// never resized as the application's TLAS instance count changes —
+// runtime resize-on-grow caused subtle BVH corruption (driver-level
+// memory recycling, suspected) and was the root cause of mid-game
+// geometry flicker when projectiles entered the TLAS.
+struct LbvhPC { nReal: u32, _pad0: u32, _pad1: u32, _pad2: u32 };
+@group(0) @binding(5) var<uniform> lbvhPc : LbvhPC;
+
+const N_PADDED:      u32 = 16384u;
+const THREADS:       u32 = 1024u;
+const K_PER:         u32 = 16u;    // = N_PADDED / THREADS
+const REDUCE_LANES:  u32 = 256u;
+const REDUCE_K_PER:  u32 = 64u;    // = N_PADDED / REDUCE_LANES
+const BUCKETS:       u32 = 16u;
+const PASSES:        u32 = 8u;
+const LEVELS:        u32 = 14u;    // log2(N_PADDED)
+const SCAN_STEPS:    u32 = 10u;    // log2(THREADS)
+
+var<workgroup> shHist:    array<atomic<u32>, BUCKETS>;
+var<workgroup> shOffsets: array<u32, BUCKETS>;
+// Hillis-Steele scratch for per-bucket exclusive prefix sum over 1024
+// per-thread bucket counts. 4 KB. Reused across all 8 × 16 bucket scans
+// in the radix passes.
+var<workgroup> shScan:    array<u32, THREADS>;
+
+// Scene-AABB reduction scratch — 256-lane tree reduce. vec3 stride is
+// 16 by WGSL alignment → 4 KB each, 8 KB total. Well under the 16 KB
+// default workgroup-storage cap.
+var<workgroup> shRedMin: array<vec3<f32>, 256>;
+var<workgroup> shRedMax: array<vec3<f32>, 256>;
+var<workgroup> shSceneMin: vec3<f32>;
+var<workgroup> shSceneMax: vec3<f32>;
+
+fn _expandBits10(v0: u32) -> u32 {
+    var v = v0 & 0x000003FFu;
+    v = (v * 0x00010001u) & 0xFF0000FFu;
+    v = (v * 0x00000101u) & 0x0F00F00Fu;
+    v = (v * 0x00000011u) & 0xC30C30C3u;
+    v = (v * 0x00000005u) & 0x49249249u;
+    return v;
+}
+fn _mortonCode3D(c: vec3<f32>) -> u32 {
+    let q = clamp(c, vec3<f32>(0.0), vec3<f32>(1.0));
+    let xi = u32(q.x * 1023.0);
+    let yi = u32(q.y * 1023.0);
+    let zi = u32(q.z * 1023.0);
+    return (_expandBits10(xi) << 2u)
+         | (_expandBits10(yi) << 1u)
+         |  _expandBits10(zi);
+}
+
+@compute @workgroup_size(1024, 1, 1)
+fn lbvhBuildMain(@builtin(local_invocation_id) lid: vec3<u32>) {
+    let tid = lid.x;
+    let n   = lbvhPc.nReal;
+
+    // ── Phase 0: scene AABB reduction across centroids ───────────────────
+    // REDUCE_LANES=256 lanes each fold REDUCE_K_PER stripes (covers
+    // N_PADDED), then an 8-step tree reduce across those lanes gives
+    // the final AABB.
+    if (tid < REDUCE_LANES) {
+        var lMin = vec3<f32>( 1e30);
+        var lMax = vec3<f32>(-1e30);
+        for (var k: u32 = 0u; k < REDUCE_K_PER; k = k + 1u) {
+            let i = tid * REDUCE_K_PER + k;
+            if (i < n) {
+                let c = (entries[i].aabbMin + entries[i].aabbMax) * 0.5;
+                lMin = min(lMin, c);
+                lMax = max(lMax, c);
+            }
+        }
+        shRedMin[tid] = lMin;
+        shRedMax[tid] = lMax;
+    }
+    workgroupBarrier();
+    var stride: u32 = 128u;
+    for (var s: u32 = 0u; s < 8u; s = s + 1u) {
+        if (tid < stride) {
+            shRedMin[tid] = min(shRedMin[tid], shRedMin[tid + stride]);
+            shRedMax[tid] = max(shRedMax[tid], shRedMax[tid + stride]);
+        }
+        workgroupBarrier();
+        stride = stride / 2u;
+    }
+    if (tid == 0u) {
+        shSceneMin = shRedMin[0];
+        shSceneMax = shRedMax[0];
+    }
+    workgroupBarrier();
+
+    // ── Phase 1: emit packed sort keys into sortA ────────────────────────
+    let extent    = max(shSceneMax - shSceneMin, vec3<f32>(1e-3));
+    let invExtent = vec3<f32>(1.0) / extent;
+    for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+        let i = k * THREADS + tid;
+        var key: u32;
+        if (i < n) {
+            let c   = (entries[i].aabbMin + entries[i].aabbMax) * 0.5;
+            let nrm = (c - shSceneMin) * invExtent;
+            let m30 = _mortonCode3D(nrm);
+            let m16 = min(m30 >> 14u, 0xFFFEu);
+            key = (m16 << 16u) | (i & 0xFFFFu);
+        } else {
+            key = 0xFFFFFFFFu;
+        }
+        sortA[i] = key;
+    }
+    workgroupBarrier();
+    storageBarrier();
+
+    // ── Phase 2: stable LSD radix sort, fully parallel scatter ──────────
+    // Per pass:
+    //   1. Histogram (parallel atomicAdd to shHist).
+    //   2. Global bucket starts: exclusive prefix scan of shHist into
+    //      shOffsets[16] — small, done sequentially by thread 0.
+    //   3. For each of the 16 buckets:
+    //        a. Each thread counts its bucket-b source elements (K_PER
+    //           re-reads, no caching — storage reads are L1-cheap).
+    //        b. Hillis-Steele exclusive prefix scan of those counts
+    //           across all THREADS lanes (log2(THREADS)=10 levels).
+    //           Single-buffered with a read/barrier/write/barrier pattern
+    //           per step so reads and writes don't race.
+    //        c. Each thread re-walks its K_PER elements in source order
+    //           and writes bucket-b ones to dst = shOffsets[b]
+    //              + my_exclusive_prefix
+    //              + my_local_idx_so_far.
+    //      Stability holds because per-thread iteration is in source
+    //      order and the cross-thread offsets respect thread index order.
+    // WORKAROUND: the parallel radix sort below corrupts sortA in a
+    // way that's count-dependent — symptom was mid-game geometry
+    // flicker as soon as ANY extra TLAS instance was added beyond the
+    // initial scene (e.g. firing a projectile would make fort braces
+    // appear to disappear in patterns deterministic on the projectile's
+    // angle). Bisected by skipping each LBVH phase in turn: with the
+    // sort skipped, no flicker. With the sort enabled, flicker.
+    //
+    // The exact bug in the Hillis-Steele scan + parallel scatter
+    // hasn't been identified despite careful review — likely a subtle
+    // memory-ordering / barrier issue that triggers only with the
+    // specific Morton-code distribution that arises when a small object
+    // (projectile) sits next to a large cluster (fort).
+    //
+    // Skipping the sort means BVH leaves are in instance-index order
+    // (no spatial coherence). Ray traversal still descends the BVH
+    // tree, but parent AABBs are larger than they would be with sorted
+    // leaves, so more leaves get tested per ray. With the fort's
+    // ~1011-entry scale that's still fast enough; revisit if the
+    // entry count grows toward the LBVH_MAX cap.
+    if (false) {
+    for (var p: u32 = 0u; p < PASSES; p = p + 1u) {
+        let shift = p * 4u;
+        let srcIsA = (p & 1u) == 0u;
+
+        // Clear histogram.
+        if (tid < BUCKETS) {
+            atomicStore(&shHist[tid], 0u);
+        }
+        workgroupBarrier();
+
+        // Histogram pass — K_PER elements per thread.
+        for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+            let i = k * THREADS + tid;
+            var myKey: u32;
+            if (srcIsA) { myKey = sortA[i]; } else { myKey = sortB[i]; }
+            let bucket = (myKey >> shift) & 0xFu;
+            atomicAdd(&shHist[bucket], 1u);
+        }
+        workgroupBarrier();
+
+        // Global bucket starts (16-wide; thread 0 does it sequentially).
+        if (tid == 0u) {
+            var s2: u32 = 0u;
+            for (var b: u32 = 0u; b < BUCKETS; b = b + 1u) {
+                shOffsets[b] = s2;
+                s2 = s2 + atomicLoad(&shHist[b]);
+            }
+        }
+        workgroupBarrier();
+
+        // Per-bucket parallel scatter.
+        for (var b: u32 = 0u; b < BUCKETS; b = b + 1u) {
+            // (a) Count my bucket-b elements.
+            var localCount: u32 = 0u;
+            for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+                let i = k * THREADS + tid;
+                var srcKey: u32;
+                if (srcIsA) { srcKey = sortA[i]; } else { srcKey = sortB[i]; }
+                let bk = (srcKey >> shift) & 0xFu;
+                if (bk == b) { localCount = localCount + 1u; }
+            }
+            shScan[tid] = localCount;
+            workgroupBarrier();
+
+            // (b) Hillis-Steele inclusive prefix scan across 1024 lanes.
+            // Single-buffered: read snapshot → barrier → write → barrier.
+            for (var step: u32 = 0u; step < SCAN_STEPS; step = step + 1u) {
+                let offset = 1u << step;
+                let v = shScan[tid];
+                var prev: u32 = 0u;
+                if (tid >= offset) { prev = shScan[tid - offset]; }
+                workgroupBarrier();
+                shScan[tid] = v + prev;
+                workgroupBarrier();
+            }
+            // Convert inclusive→exclusive by subtracting own contribution.
+            let myExclusivePrefix = shScan[tid] - localCount;
+
+            // (c) Scatter my bucket-b elements at the computed positions.
+            var localIdx: u32 = 0u;
+            for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+                let i = k * THREADS + tid;
+                var srcKey: u32;
+                if (srcIsA) { srcKey = sortA[i]; } else { srcKey = sortB[i]; }
+                let bk = (srcKey >> shift) & 0xFu;
+                if (bk == b) {
+                    let dst = shOffsets[b] + myExclusivePrefix + localIdx;
+                    if (srcIsA) { sortB[dst] = srcKey; } else { sortA[dst] = srcKey; }
+                    localIdx = localIdx + 1u;
+                }
+            }
+            workgroupBarrier();
+        }
+        storageBarrier();
+    }
+    }
+    // After 8 ping-pongs (even count) the sorted keys live in sortA.
+
+    // ── Phase 3: write sorted instance permutation into outOrder ─────────
+    for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+        let i = k * THREADS + tid;
+        if (i < n) {
+            outOrder[i] = sortA[i] & 0xFFFFu;
+        }
+    }
+    workgroupBarrier();
+    storageBarrier();
+
+    // ── Phase 4: initialize BVH leaf AABBs ───────────────────────────────
+    for (var k: u32 = 0u; k < K_PER; k = k + 1u) {
+        let i = k * THREADS + tid;
+        let leafIdx = N_PADDED - 1u + i;
+        let leafKey = sortA[i];
+        if (leafKey == 0xFFFFFFFFu) {
+            outBvh[leafIdx].aabbMin = vec3<f32>( 1e30);
+            outBvh[leafIdx].aabbMax = vec3<f32>(-1e30);
+        } else {
+            let e = entries[leafKey & 0xFFFFu];
+            outBvh[leafIdx].aabbMin = e.aabbMin;
+            outBvh[leafIdx].aabbMax = e.aabbMax;
+        }
+    }
+    workgroupBarrier();
+    storageBarrier();
+
+    // ── Phase 5: bottom-up sweep-tree refit, LEVELS iterations ──────────
+    // Deepest internal level has N_PADDED/2 nodes; perThread = ceil of
+    // levelCount / THREADS is uniform per step, so workgroupBarrier
+    // stays in uniform control flow.
+    var levelCount: u32 = N_PADDED / 2u;
+    var levelStart: u32 = N_PADDED / 2u - 1u;
+    for (var step: u32 = 0u; step < LEVELS; step = step + 1u) {
+        let perThread = (levelCount + THREADS - 1u) / THREADS;
+        for (var k: u32 = 0u; k < perThread; k = k + 1u) {
+            let nodeOff = k * THREADS + tid;
+            if (nodeOff < levelCount) {
+                let nodeIdx = levelStart + nodeOff;
+                let leftIdx  = 2u * nodeIdx + 1u;
+                let rightIdx = 2u * nodeIdx + 2u;
+                let lMin = outBvh[leftIdx].aabbMin;
+                let lMax = outBvh[leftIdx].aabbMax;
+                let rMin = outBvh[rightIdx].aabbMin;
+                let rMax = outBvh[rightIdx].aabbMax;
+                outBvh[nodeIdx].aabbMin = min(lMin, rMin);
+                outBvh[nodeIdx].aabbMax = max(lMax, rMax);
+            }
+        }
+        workgroupBarrier();
+        storageBarrier();
+        levelCount = levelCount / 2u;
+        levelStart = (levelStart - 1u) / 2u;
+    }
 }
 `;
 
@@ -1706,18 +2409,54 @@ function rtInit() {
         usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
     });
 
-    // TLAS-build compute pipeline. Only group(0) is used (3 SSBOs).
+    // TLAS-build compute pipeline. group(0) carries:
+    //   0  inInstances  (read-only-storage)
+    //   1  inMeshes     (read-only-storage)
+    //   2  outEntries   (read-write storage)
+    //   3  outOrder     (read-write — identity perm, sort overwrites later)
+    //   4  outMorton    (read-write — sort key, 30-bit Morton from centroid)
     const mod = device.createShaderModule({ code: tlasBuildWgsl, label: "rt-tlas-build" });
     const tlasBuildBgl = device.createBindGroupLayout({ entries: [
         { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
         { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
         { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
+        { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
+        { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
     ]});
     rtState.tlasBuildBgl = tlasBuildBgl;
     rtState.tlasBuildPipeline = device.createComputePipeline({
         layout: device.createPipelineLayout({ bindGroupLayouts: [tlasBuildBgl] }),
         compute: { module: mod, entryPoint: "tlasBuildMain" },
     });
+
+    // LBVH-build follow-up pipeline. Single workgroup of 1024 threads
+    // sorts instances by packed (morton16:idx16) keys (LSD radix) and
+    // refits a sweep-tree BVH. N_PADDED = 1024 ceiling for Stage 1.
+    //   0  entries  (read-only)
+    //   1  outOrder (read-write, sorted permutation)
+    //   2  outBvh   (read-write, 2*N_PADDED - 1 nodes × 32 bytes)
+    //   3  sortA    (read-write, N_PADDED u32 ping-pong)
+    //   4  sortB    (read-write, N_PADDED u32 ping-pong)
+    const lbvhMod = device.createShaderModule({ code: lbvhBuildWgsl, label: "rt-tlas-lbvh-build" });
+    const lbvhBuildBgl = device.createBindGroupLayout({ entries: [
+        { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+        { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
+        { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
+        { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
+        { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "storage" } },
+        { binding: 5, visibility: GPUShaderStage.COMPUTE, buffer: { type: "uniform"  } },
+    ]});
+    rtState.lbvhBuildBgl = lbvhBuildBgl;
+    rtState.lbvhBuildPipeline = device.createComputePipeline({
+        layout: device.createPipelineLayout({ bindGroupLayouts: [lbvhBuildBgl] }),
+        compute: { module: lbvhMod, entryPoint: "lbvhBuildMain" },
+    });
+    // Tiny uniform buffer for the LBVH's `nReal` field. Written each
+    // wgpuBuildTLAS call before dispatch.
+    rtState.lbvhCountBuf = device.createBuffer({
+        size: 16,
+        usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
+    });
 }
 
 function rtMeshRecordsEnsure(meshCount) {
@@ -1808,14 +2547,25 @@ env.wgpuRegisterMeshBLAS = (minX, minY, minZ, maxX, maxY, maxZ,
     return handle;
 };
 
-env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => {
-    if (!rtState.tlasBuildPipeline) return;
-    const inst = buffers.get(instanceBufHandle);
-    const out  = buffers.get(tlasOutBufHandle);
-    if (!inst || !out) {
+env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle,
+                     entryOrderHandle, mortonHandle, binsHandle,
+                     bvhNodesHandle, sortABufHandle, sortBBufHandle) => {
+    if (!rtState.tlasBuildPipeline || !rtState.lbvhBuildPipeline) return;
+    const inst   = buffers.get(instanceBufHandle);
+    const out    = buffers.get(tlasOutBufHandle);
+    const order  = buffers.get(entryOrderHandle);
+    const morton = buffers.get(mortonHandle);
+    const bvh    = buffers.get(bvhNodesHandle);
+    const sortA  = buffers.get(sortABufHandle);
+    const sortB  = buffers.get(sortBBufHandle);
+    if (!inst || !out || !order || !morton || !bvh || !sortA || !sortB) {
         console.error("[crafter-wgpu] wgpuBuildTLAS: unknown buffer handle");
         return;
     }
+    if (instanceCount > LBVH_MAX) {
+        console.error(`[crafter-wgpu] wgpuBuildTLAS: instance count ${instanceCount} > LBVH cap ${LBVH_MAX}`);
+        return;
+    }
 
     const bg = device.createBindGroup({
         layout: rtState.tlasBuildBgl,
@@ -1823,6 +2573,26 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => {
             { binding: 0, resource: { buffer: inst } },
             { binding: 1, resource: { buffer: rtState.meshRecordsBuffer } },
             { binding: 2, resource: { buffer: out } },
+            { binding: 3, resource: { buffer: order } },
+            { binding: 4, resource: { buffer: morton } },
+        ],
+    });
+    // Write the real instance count to the LBVH count uniform so the
+    // shader can iterate exactly the right number of entries even
+    // though the storage buffers stay sized for N_PADDED.
+    const countBuf = new Uint32Array(4);
+    countBuf[0] = instanceCount;
+    queue.writeBuffer(rtState.lbvhCountBuf, 0, countBuf);
+
+    const lbvhBg = device.createBindGroup({
+        layout: rtState.lbvhBuildBgl,
+        entries: [
+            { binding: 0, resource: { buffer: out } },
+            { binding: 1, resource: { buffer: order } },
+            { binding: 2, resource: { buffer: bvh } },
+            { binding: 3, resource: { buffer: sortA } },
+            { binding: 4, resource: { buffer: sortB } },
+            { binding: 5, resource: { buffer: rtState.lbvhCountBuf } },
         ],
     });
 
@@ -1834,12 +2604,25 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => {
     }
     const enc = state.encoder || device.createCommandEncoder();
     const ownEncoder = !state.encoder;
-    const pass = enc.beginComputePass({ label: "tlas-build" });
-    pass.setPipeline(rtState.tlasBuildPipeline);
-    pass.setBindGroup(0, bg);
-    const groups = Math.ceil(instanceCount / 64);
-    pass.dispatchWorkgroups(groups, 1, 1);
-    pass.end();
+    // Pass 1: TLAS entry build (existing).
+    {
+        const pass = enc.beginComputePass({ label: "tlas-build" });
+        pass.setPipeline(rtState.tlasBuildPipeline);
+        pass.setBindGroup(0, bg);
+        const groups = Math.ceil(instanceCount / 64);
+        pass.dispatchWorkgroups(groups, 1, 1);
+        pass.end();
+    }
+    // Pass 2: LBVH-build. Single workgroup of 1024 threads runs the
+    // entire BVH build (Morton, sort, sweep-tree refit) in shared
+    // memory. Same encoder → pipeline barrier sequences pass1 → pass2.
+    if (instanceCount > 0) {
+        const pass = enc.beginComputePass({ label: "tlas-lbvh-build" });
+        pass.setPipeline(rtState.lbvhBuildPipeline);
+        pass.setBindGroup(0, lbvhBg);
+        pass.dispatchWorkgroups(1, 1, 1);
+        pass.end();
+    }
     if (ownEncoder) {
         queue.submit([enc.finish()]);
     } else {
@@ -1848,8 +2631,10 @@ env.wgpuBuildTLAS = (instanceBufHandle, instanceCount, tlasOutBufHandle) => {
 
     // Publish so rayQuery-capable compute pipelines pick up the latest TLAS
     // without each dispatch having to thread the handle explicitly.
-    rtState.currentTlas = tlasOutBufHandle;
+    rtState.currentTlas              = tlasOutBufHandle;
     rtState.currentTlasInstanceCount = instanceCount;
+    rtState.currentEntryOrder        = entryOrderHandle;
+    rtState.currentBvh               = bvhNodesHandle;
 };
 
 // RT pipeline loader — wraps user-supplied WGSL (sources + generated mega
@@ -1873,7 +2658,7 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
         beforeHelpers = userPart.substring(0, mi);
         afterHelpers  = userPart.substring(mi + marker.length);
     }
-    const fullWgsl = rtWgslPrelude + "\n" + beforeHelpers + "\n" + rtWgslHelpers + "\n" + afterHelpers;
+    const fullWgsl = rtWgslPrelude + "\n" + beforeHelpers + "\n" + rtWgslPureHelpers + "\n" + rtWgslMegakernelHelpers + "\n" + afterHelpers;
 
     // Parse user bindings (same wire format as wgpuLoadCustomShader).
     const userBindings = [];
@@ -1918,6 +2703,8 @@ env.wgpuLoadRTPipeline = (wgslPtr, wgslLen, bindingsPtr, bindingsCount) => {
             { binding: 6, visibility: GPUShaderStage.COMPUTE,
               storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } },
             { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
         ]});
         // User binding-group layouts. WebGPU pipeline layouts need a
         // contiguous array up to the highest group used, so pad any gaps
@@ -1976,6 +2763,12 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
         entries: [{ binding: 0, resource: { buffer: rtState.rtHeader, offset: 0, size: 16 } }],
     });
     const outView = state.outIsPing ? state.pingView : state.pongView;
+    const entryOrderBuf = buffers.get(rtState.currentEntryOrder);
+    const bvhBuf        = buffers.get(rtState.currentBvh);
+    if (!entryOrderBuf || !bvhBuf) {
+        console.error("[crafter-wgpu] wgpuDispatchRT: missing entryOrder/bins (no TLAS built yet?)");
+        return;
+    }
     const dataBg = device.createBindGroup({
         layout: pipe.dataBgl,
         entries: [
@@ -1987,6 +2780,8 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
             { binding: 5, resource: { buffer: rtState.primRemapHeap.gpu } },
             { binding: 6, resource: outView },
             { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } },
+            { binding: 8, resource: { buffer: entryOrderBuf } },
+            { binding: 9, resource: { buffer: bvhBuf } },
         ],
     });
 
@@ -2025,7 +2820,279 @@ env.wgpuDispatchRT = (pipelineHandle, pushPtr, pushBytes,
     state.outIsPing = !state.outIsPing;
 };
 
+// ── Standalone compute pipelines ────────────────────────────────────────
+//
+// Mirrors the native Vulkan `ComputeShader` API: user-authored compute
+// pipelines that dispatch outside any UI render pass. Unlike
+// wgpuLoadCustomShader (which lives inside the UI flow with ping/pong
+// textures at @group(1)), these have NO library-supplied bindings —
+// users declare all groups themselves.
+//
+// Layout contract:
+//   @group(0) @binding(0) uniform PushData   // optional, only if pushUniformSize > 0
+//   @group(1+) @binding(N)                   // user bindings via UICustomBinding
+//
+// rayQuery: same flag as wgpuLoadCustomShader. When set, prepends the
+// RT prelude + rayQuery library; user shader must NOT redeclare the
+// resulting @group(1) layout (tlas/bvh/mesh heaps). The `bindings`
+// list must start at @group(2) in that case.
+
+const computePipelines = new Map();  // handle → { pipeline, bgls, byGroup, sortedGroups, pushUniformSize, rayQueryCapable }
+
+env.wgpuLoadComputePipeline = (wgslPtr, wgslLen, pushUniformSize,
+                               bindingsPtr, bindingsCount, rayQueryFlag) => {
+    if (!rtState.vertHeap && rayQueryFlag) rtInit();
+    const userWgsl = new TextDecoder().decode(memU8().subarray(wgslPtr, wgslPtr + wgslLen));
+    const wgsl = rayQueryFlag
+        // rayQueryLib's _rqTraverseBlas/_rqTraverseTlas call _rtAabb,
+        // _rtFetchTri, _rtTri from rtWgslPureHelpers — must prepend
+        // the pure helper subset (NOT the megakernel-only traversal
+        // routines, which reference user-emitted runAnyHit/runMiss/
+        // runClosestHit and won't compile outside the raygen pipeline).
+        ? (rtWgslTypes + rtWgslMegakernelBindings + rtWgslPureHelpers + rtWgslRayQueryLib + "\n" + userWgsl)
+        : userWgsl;
+
+    const bindings = [];
+    if (bindingsCount > 0) {
+        const dv = new DataView(memU8().buffer, bindingsPtr, bindingsCount * 8);
+        for (let i = 0; i < bindingsCount; i++) {
+            bindings.push({
+                group:   dv.getUint8(i*8 + 0),
+                binding: dv.getUint8(i*8 + 1),
+                kind:    dv.getUint8(i*8 + 2),
+                pushOffset: dv.getUint32(i*8 + 4, true),
+            });
+        }
+    }
+
+    // Bind-group layouts. The push uniform sits at @group(0) if present;
+    // user bindings start at @group(1) (or @group(2) when rayQuery is on
+    // — see rtWgslMegakernelBindings).
+    const bgls = [];
+    if (pushUniformSize > 0) {
+        bgls.push(device.createBindGroupLayout({ entries: [
+            { binding: 0, visibility: GPUShaderStage.COMPUTE,
+              buffer: { type: "uniform", minBindingSize: pushUniformSize } },
+        ]}));
+    } else {
+        bgls.push(device.createBindGroupLayout({ entries: [] }));
+    }
+
+    if (rayQueryFlag) {
+        // group(1) is the megakernel RT bindings — same layout as the RT
+        // pipeline's group(1). The injected rayQuery library reads it.
+        bgls.push(device.createBindGroupLayout({ entries: [
+            { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 3, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 4, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 5, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 6, visibility: GPUShaderStage.COMPUTE,
+              storageTexture: { format: "rgba8unorm", access: "write-only", viewDimension: "2d" } },
+            { binding: 7, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 8, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+            { binding: 9, visibility: GPUShaderStage.COMPUTE, buffer: { type: "read-only-storage" } },
+        ]}));
+    }
+
+    const byGroup = new Map();
+    const userGroupBase = rayQueryFlag ? 2 : 1;
+    for (const b of bindings) {
+        if (b.group < userGroupBase) {
+            console.error(`[crafter-wgpu] compute pipeline: @group(${b.group}) reserved; user groups must be >= ${userGroupBase}`);
+            return 0;
+        }
+        if (!byGroup.has(b.group)) byGroup.set(b.group, []);
+        byGroup.get(b.group).push(b);
+    }
+    const sortedGroups = [...byGroup.keys()].sort((a, b) => a - b);
+    const highest = sortedGroups.length ? sortedGroups[sortedGroups.length - 1] : userGroupBase - 1;
+    for (let g = userGroupBase; g <= highest; g++) {
+        if (byGroup.has(g)) {
+            const entries = byGroup.get(g).map(b => {
+                const e = { binding: b.binding, visibility: GPUShaderStage.COMPUTE };
+                if      (b.kind === 0) e.buffer  = { type: "read-only-storage" };
+                else if (b.kind === 1) e.texture = { sampleType: "float", viewDimension: "2d" };
+                else if (b.kind === 2) e.sampler = { type: "filtering" };
+                else if (b.kind === 3) e.texture = { sampleType: "float", viewDimension: "2d-array" };
+                else if (b.kind === 4) e.buffer  = { type: "storage" };  // read-write storage
+                return e;
+            });
+            bgls.push(device.createBindGroupLayout({ entries }));
+        } else {
+            bgls.push(device.createBindGroupLayout({ entries: [] }));
+        }
+    }
+
+    let pipeline;
+    try {
+        const mod = device.createShaderModule({ code: wgsl, label: "compute-pipeline" });
+        // Async compile-info to surface parse errors with line numbers.
+        mod.getCompilationInfo().then(info => {
+            const issues = info.messages.filter(m => m.type === "error" || m.type === "warning");
+            if (issues.length === 0) return;
+            const lines = wgsl.split("\n");
+            for (const m of issues) {
+                const ln = m.lineNum || 0, col = m.linePos || 0;
+                console[m.type === "error" ? "error" : "warn"](
+                    `[crafter-wgpu] compute-pipeline ${m.type} at ${ln}:${col}: ${m.message}`);
+                for (let i = Math.max(1, ln - 3); i <= Math.min(lines.length, ln + 3); i++) {
+                    const marker = i === ln ? ">> " : "   ";
+                    console.log(`${marker}${String(i).padStart(4)}: ${lines[i - 1]}`);
+                }
+            }
+        });
+        const layout = device.createPipelineLayout({ bindGroupLayouts: bgls });
+        pipeline = device.createComputePipeline({ layout, compute: { module: mod, entryPoint: "main" } });
+    } catch (e) {
+        console.error("[crafter-wgpu] compute pipeline compile failed:", e);
+        return 0;
+    }
+
+    const handle = newHandle();
+    computePipelines.set(handle, {
+        pipeline, bgls, byGroup, sortedGroups,
+        pushUniformSize, userGroupBase,
+        rayQueryCapable: !!rayQueryFlag,
+    });
+    return handle;
+};
+
+// Dispatch a standalone compute pipeline. Works both inside the per-frame
+// UI compute pass (mid-frame physics tick) and outside it (game-tick
+// dispatch from update lambda) — matching the wgpuBuildTLAS pattern.
+//
+// Push data: if the pipeline declared a push uniform (pushUniformSize > 0
+// at load), `pushPtr` points at exactly that many bytes. We allocate a
+// transient uniform buffer per dispatch — fine for the physics rate
+// (handful of dispatches per substep, single-digit substeps per frame).
+//
+// Handles: parallel to the UICustomBinding[] declaration order at load
+// time. Each handle resolves through the engine's buffer/texture/
+// sampler tables to the live WebGPU resource.
+env.wgpuDispatchCompute = (pipelineHandle, pushPtr, pushBytes,
+                           handlesPtr, handlesCount, gx, gy, gz) => {
+    const pipe = computePipelines.get(pipelineHandle);
+    if (!pipe) {
+        console.error("[crafter-wgpu] wgpuDispatchCompute: unknown pipeline", pipelineHandle);
+        return;
+    }
+
+    // ── Push uniform: transient buffer per dispatch ────────────────────
+    let pushBG = null;
+    if (pipe.pushUniformSize > 0) {
+        const buf = device.createBuffer({
+            size: Math.max(16, (pipe.pushUniformSize + 15) & ~15),
+            usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
+        });
+        if (pushBytes > 0) {
+            queue.writeBuffer(buf, 0, memU8().subarray(pushPtr, pushPtr + pushBytes));
+        }
+        pushBG = device.createBindGroup({
+            layout: pipe.bgls[0],
+            entries: [{ binding: 0, resource: { buffer: buf, offset: 0, size: pipe.pushUniformSize } }],
+        });
+    } else {
+        pushBG = device.createBindGroup({ layout: pipe.bgls[0], entries: [] });
+    }
+
+    // ── RT bindings (rayQuery only) ────────────────────────────────────
+    let rtBG = null;
+    if (pipe.rayQueryCapable) {
+        const tlasBuf = buffers.get(rtState.currentTlas);
+        const orderBuf = buffers.get(rtState.currentEntryOrder);
+        const bvhBuf   = buffers.get(rtState.currentBvh);
+        if (!tlasBuf || !orderBuf || !bvhBuf) {
+            console.error("[crafter-wgpu] wgpuDispatchCompute rayQuery: no TLAS built yet");
+            return;
+        }
+        const outView = state.outIsPing ? state.pingView : state.pongView;
+        rtBG = device.createBindGroup({
+            layout: pipe.bgls[1],
+            entries: [
+                { binding: 0, resource: { buffer: tlasBuf } },
+                { binding: 1, resource: { buffer: rtState.bvhHeap.gpu } },
+                { binding: 2, resource: { buffer: rtState.meshRecordsBuffer } },
+                { binding: 3, resource: { buffer: rtState.vertHeap.gpu } },
+                { binding: 4, resource: { buffer: rtState.indexHeap.gpu } },
+                { binding: 5, resource: { buffer: rtState.primRemapHeap.gpu } },
+                { binding: 6, resource: outView },
+                { binding: 7, resource: { buffer: rtState.attribsHeap.gpu } },
+                { binding: 8, resource: { buffer: orderBuf } },
+                { binding: 9, resource: { buffer: bvhBuf } },
+            ],
+        });
+    }
+
+    // ── User bind groups from handles ──────────────────────────────────
+    const handleArr = new Uint32Array(memU8().buffer, handlesPtr, handlesCount);
+    const userBGs = [];
+    let handleCursor = 0;
+    const userGroupBase = pipe.userGroupBase;
+    const userBglStart = pipe.bgls.length - (pipe.sortedGroups.length
+        ? (pipe.sortedGroups[pipe.sortedGroups.length - 1] - userGroupBase + 1)
+        : 0);
+    let bglIdx = userBglStart;
+    if (pipe.sortedGroups.length > 0) {
+        for (let g = userGroupBase; g <= pipe.sortedGroups[pipe.sortedGroups.length - 1]; g++) {
+            if (pipe.byGroup.has(g)) {
+                const entries = pipe.byGroup.get(g).map(b => {
+                    const h = handleArr[handleCursor++];
+                    let resource;
+                    if      (b.kind === 0 || b.kind === 4) resource = { buffer: buffers.get(h) };
+                    else if (b.kind === 1)                 resource = textureViews.get(h);
+                    else if (b.kind === 2)                 resource = samplers.get(h);
+                    else if (b.kind === 3)                 resource = textureViews.get(h);
+                    return { binding: b.binding, resource };
+                });
+                userBGs.push(device.createBindGroup({ layout: pipe.bgls[bglIdx++], entries }));
+            } else {
+                userBGs.push(device.createBindGroup({ layout: pipe.bgls[bglIdx++], entries: [] }));
+            }
+        }
+    }
+
+    // ── Execute: attach to active pass, or create a standalone one. ────
+    const runDispatch = (pass) => {
+        pass.setPipeline(pipe.pipeline);
+        pass.setBindGroup(0, pushBG);
+        if (rtBG) pass.setBindGroup(1, rtBG);
+        for (let i = 0; i < userBGs.length; i++) {
+            pass.setBindGroup(userGroupBase + i, userBGs[i]);
+        }
+        pass.dispatchWorkgroups(gx, gy, gz);
+    };
+
+    if (state.pass) {
+        // Mid-frame: dispatch into the current UI compute pass.
+        runDispatch(state.pass);
+    } else {
+        // Standalone: build encoder + pass + submit. Mirrors wgpuBuildTLAS.
+        const enc = state.encoder || device.createCommandEncoder();
+        const ownEncoder = !state.encoder;
+        const pass = enc.beginComputePass({ label: "plain-compute" });
+        runDispatch(pass);
+        pass.end();
+        if (ownEncoder) queue.submit([enc.finish()]);
+    }
+};
+
 console.log("[crafter-wgpu] init complete; env handlers wired");
+
+// Memory diagnostic. Logs handle-table sizes every 5s so a slow leak
+// shows up as monotonic growth. Comment out for production builds.
+setInterval(() => {
+    const m = (performance.memory && performance.memory.usedJSHeapSize) || 0;
+    console.log(`[crafter-wgpu] mem: js=${(m / 1024 / 1024).toFixed(1)}MB`
+                + `  buffers=${buffers.size}`
+                + `  textures=${textures.size}`
+                + `  samplers=${samplers.size}`
+                + `  customPipelines=${customPipelines.size}`
+                + `  rtPipelines=${rtPipelines.size}`
+                + `  computePipelines=${computePipelines.size}`
+                + `  bindGroupCache=${state.bindGroupCache ? state.bindGroupCache.size : 0}`);
+}, 5000);
 } catch (e) {
     // Capture any throw so the stub error messages name the real cause
     // instead of "(no error captured)". Re-throw so runtime.js's catch
diff --git a/implementations/Crafter.Graphics-Mesh-WebGPU.cpp b/implementations/Crafter.Graphics-Mesh-WebGPU.cpp
index 5a61871..2ebd128 100644
--- a/implementations/Crafter.Graphics-Mesh-WebGPU.cpp
+++ b/implementations/Crafter.Graphics-Mesh-WebGPU.cpp
@@ -225,6 +225,7 @@ namespace {
                              std::span<const std::uint32_t>       indices,
                              std::span<const std::byte>           attribsBytes) {
         mesh.triangleCount = static_cast<std::uint32_t>(indices.size()) / 3;
+        mesh.vertexCount   = static_cast<std::uint32_t>(vertices.size());
 
         Builder builder;
         builder.Build(vertices, indices);
diff --git a/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp b/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp
index 2195f85..ffa1b47 100644
--- a/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp
+++ b/implementations/Crafter.Graphics-RenderingElement3D-WebGPU.cpp
@@ -4,12 +4,21 @@ Copyright (C) 2026 Catcrafts®
 catcrafts.net
 */
 
-// DOM-mode TLAS upkeep. BuildTLAS copies the per-element RTInstance into
-// the host-visible instance buffer (skipping the transform for elements
-// whose transform is GPU-owned), uploads it, then dispatches the JS-side
-// TLAS-build compute pass — which consults the per-BLAS records published
-// at Mesh::Build() time to produce world-space AABBs and inverse
-// transforms in the format `traceRay` / `rayQuery` consume.
+// DOM-mode TLAS upkeep. BuildTLAS is split in two phases so a physics
+// compute pass can run between them:
+//   - BuildTLASUpload mirrors the CPU-side RTInstance array into the
+//     host-visible instance buffer (with partial-write semantics that
+//     preserve the transform bytes for elements flagged
+//     transformOwnedByGpu, see notes in the body) and uploads the
+//     metadata buffer.
+//   - BuildTLASBuild dispatches the JS-side TLAS-build compute pass —
+//     which consults the per-BLAS records published at Mesh::Build()
+//     time to produce world-space AABBs and inverse transforms in the
+//     format `traceRay` / `rayQuery` consume.
+// The combined BuildTLAS calls both back-to-back; callers that want to
+// interleave a physics tlas-transform compute pass (which writes the
+// transform bytes BuildTLASUpload leaves intact) call Upload + their
+// compute pass + Build manually.
 
 module;
 module Crafter.Graphics:RenderingElement3D_implWebGPU;
@@ -41,7 +50,7 @@ void RenderingElement3D::Remove(RenderingElement3D* e) {
     e->indexInElements = std::numeric_limits<std::uint32_t>::max();
 }
 
-void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
+void RenderingElement3D::BuildTLASUpload(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
     auto& tlas = tlases[index];
     const std::uint32_t primitiveCount = static_cast<std::uint32_t>(elements.size());
     if (primitiveCount == 0) {
@@ -49,19 +58,52 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_
         return;
     }
 
-    // (Re)allocate instance + metadata + output TLAS buffers if the count
-    // changed. WebGPUBuffer::Resize destroys and recreates the GPU buffer;
-    // bind-group caches keyed on the buffer handle are invalidated in the
-    // JS bridge automatically.
-    if (primitiveCount != tlas.builtInstanceCount) {
-        tlas.instanceBuffer.Resize(primitiveCount);
-        tlas.metadataBuffer.Resize(primitiveCount);
-        // TLASEntry layout in WGSL is 144 bytes due to vec3 align/pad
-        // rules. Must match the struct declared in the rtWgslTypes
-        // block in additional/dom-webgpu.js.
-        tlas.buffer.Resize(primitiveCount * 144);
+    constexpr std::uint32_t kNPadded   = 65536u;     // size for instance / metadata mirrors
+    constexpr std::uint32_t kLbvhMax   = 16384u;     // matches N_PADDED in lbvhBuildWgsl
+    constexpr std::uint32_t kNodeCount = 2u * kNPadded - 1u;
+
+    // ALL TLAS-side GPU buffers get allocated ONCE and never resized.
+    // The LBVH-build shader takes the real instance count via a uniform
+    // (lbvhPc.nReal) instead of arrayLength(&entries), so the
+    // tlas.buffer / entryOrder / mortonCodes don't need to grow when
+    // the application's element count changes.
+    //
+    // Why this matters: an earlier version resized these per-frame on
+    // primitiveCount change. The destroy+recreate cycle on the GPU
+    // buffer caused subtle mid-game flicker as soon as any element was
+    // added (e.g. firing a projectile) — fort braces would appear to
+    // briefly vanish in patterns deterministic on the projectile's
+    // angle. Suspected driver-level memory recycling without proper
+    // zero-init; the fixed-size allocation sidesteps it entirely.
+    if (tlas.instanceBuffer.handle == 0) {
+        tlas.instanceBuffer.Resize(kNPadded);
+        tlas.metadataBuffer.Resize(kNPadded);
+        tlas.bvhNodes.Resize(kNodeCount * 32u);
+        tlas.sortTempA.Resize(kNPadded * 4u);
+        tlas.sortTempB.Resize(kNPadded * 4u);
+        tlas.tlasBins.Resize(64 * 32);
+        // TLAS-entry / order / morton-code buffers: sized for the LBVH
+        // cap (16384). lbvhBuildMain iterates `lbvhPc.nReal` real
+        // entries; the remainder stays zero / sentinel. Keep these
+        // stable across element-count changes so the renderer's bind
+        // group references the same buffer handle every frame.
+        tlas.buffer.Resize(kLbvhMax * 144u);
+        tlas.entryOrder.Resize(kLbvhMax * 4u);
+        tlas.mortonCodes.Resize(kLbvhMax * 4u);
     }
 
+    // NB: tlas.buffer / entryOrder / mortonCodes get resized in
+    // BuildTLASBuild, NOT here. Resize destroys + recreates the GPU
+    // resource (and the JS-side handle); the rayQuery dispatches that
+    // run between BuildTLASUpload and BuildTLASBuild (projectile-collide,
+    // splash, builder-pick) still hold the previous frame's TLAS in
+    // rtState.current{Tlas,EntryOrder,Bvh}. If we resized here, those
+    // handles would point at destroyed buffers and the dispatches would
+    // log "no TLAS built yet" every frame the element count changed
+    // (e.g. every projectile fire). Resizing inside BuildTLASBuild,
+    // immediately before wgpuBuildTLAS publishes the new handles, keeps
+    // the JS-side current* refs in sync with the GPU resources.
+
     for (std::uint32_t i = 0; i < primitiveCount; ++i) {
         auto& dst = tlas.instanceBuffer.value[i];
         const auto& src = elements[i]->instance;
@@ -80,12 +122,73 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_
         tlas.metadataBuffer.value[i] = elements[i]->userMetadata;
     }
 
-    tlas.instanceBuffer.FlushDevice();
+    // Upload the instance buffer with partial-write semantics: for runs
+    // of CPU-driven elements (transformOwnedByGpu=false) we push the
+    // whole 64-byte struct in one writeBuffer call; for GPU-driven runs
+    // we push only the trailing 16 metadata bytes per element, leaving
+    // the transform field intact for the physics-tlas-transform compute
+    // shader to update. The two arms below produce identical GPU state
+    // when every element is CPU-driven — this is a no-op refactor until
+    // 3DForts flips its physics elements to transformOwnedByGpu=true.
+    constexpr std::uint32_t kInstSize      = sizeof(RTInstance);          // 64
+    constexpr std::uint32_t kTransformSize = sizeof(RTTransformMatrix);   // 48
+    constexpr std::uint32_t kMetaSize      = kInstSize - kTransformSize;  // 16
+
+    std::uint32_t runStart = 0;
+    bool runOwned = elements[0]->transformOwnedByGpu;
+    for (std::uint32_t i = 1; i <= primitiveCount; ++i) {
+        const bool atEnd     = (i == primitiveCount);
+        const bool currOwned = atEnd ? !runOwned : elements[i]->transformOwnedByGpu;
+        if (currOwned == runOwned && !atEnd) continue;
+
+        if (runOwned) {
+            // GPU-driven run — metadata only, per element. Cannot batch
+            // because the metadata bytes are non-contiguous in the
+            // instance buffer (one 16-byte chunk per 64-byte slot).
+            for (std::uint32_t j = runStart; j < i; ++j) {
+                const std::uint32_t off = j * kInstSize + kTransformSize;
+                tlas.instanceBuffer.FlushDeviceRange(off, off, kMetaSize);
+            }
+        } else {
+            // CPU-driven run — one contiguous writeBuffer.
+            const std::uint32_t startOff = runStart * kInstSize;
+            const std::uint32_t bytes    = (i - runStart) * kInstSize;
+            tlas.instanceBuffer.FlushDeviceRange(startOff, startOff, bytes);
+        }
+        runStart = i;
+        runOwned = currOwned;
+    }
+
     tlas.metadataBuffer.FlushDevice();
+}
+
+void RenderingElement3D::BuildTLASBuild(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
+    auto& tlas = tlases[index];
+    const std::uint32_t primitiveCount = static_cast<std::uint32_t>(elements.size());
+    if (primitiveCount == 0) {
+        // Upload already cleared builtInstanceCount; nothing to dispatch.
+        return;
+    }
+
+    // No per-count Resize. tlas.buffer / entryOrder / mortonCodes were
+    // allocated at kLbvhMax in BuildTLASUpload's first call and stay
+    // that size. The LBVH shader reads the real count from a uniform
+    // (lbvhPc.nReal) wgpuBuildTLAS writes each call.
 
     WebGPU::wgpuBuildTLAS(tlas.instanceBuffer.handle,
                           static_cast<std::int32_t>(primitiveCount),
-                          tlas.buffer.handle);
+                          tlas.buffer.handle,
+                          tlas.entryOrder.handle,
+                          tlas.mortonCodes.handle,
+                          tlas.tlasBins.handle,
+                          tlas.bvhNodes.handle,
+                          tlas.sortTempA.handle,
+                          tlas.sortTempB.handle);
 
     tlas.builtInstanceCount = primitiveCount;
 }
+
+void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index) {
+    BuildTLASUpload(cmd, index);
+    BuildTLASBuild(cmd, index);
+}
diff --git a/implementations/Crafter.Graphics-UI-WebGPU.cpp b/implementations/Crafter.Graphics-UI-WebGPU.cpp
index 6e619c5..2c46c7c 100644
--- a/implementations/Crafter.Graphics-UI-WebGPU.cpp
+++ b/implementations/Crafter.Graphics-UI-WebGPU.cpp
@@ -98,13 +98,9 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf
     if (itemCount == 0) return;
     UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx);
     auto handle = heap_->bufferTable[bufferSlot];
-    // For DispatchImages, the WGSL expects a texture + sampler in group 3.
-    // The library v1 doesn't expose user-image registration on DOM (out of
-    // scope per plan). If the user calls DispatchImages without a registered
-    // image, fall back to using the font atlas binding — the user's items
-    // should reference texSlot/sampSlot but on DOM those are ignored. For
-    // now, route through the font atlas texture if available; otherwise
-    // skip the dispatch.
+    // Backward-compatible fallback: callers that don't pass a texture
+    // get the font atlas. Useful for tests, useless for real content.
+    // New code should use the 6-arg overload below.
     if (fontAtlasImageSlot_) {
         auto texHandle  = heap_->imageTable[fontAtlasImageSlot_];
         auto sampHandle = heap_->samplerTable[fontAtlasSamplerSlot_];
@@ -115,6 +111,21 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf
     }
 }
 
+void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot,
+                                std::uint32_t itemCount,
+                                std::uint16_t imageSlot, std::uint16_t samplerSlot,
+                                std::array<float,4> clipRectPx) {
+    if (itemCount == 0) return;
+    UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx);
+    auto handle     = heap_->bufferTable[bufferSlot];
+    auto texHandle  = heap_->imageTable[imageSlot];
+    auto sampHandle = heap_->samplerTable[samplerSlot];
+    WebGPU::wgpuDispatchImages(handle, &hdr,
+        static_cast<std::int32_t>(TilesFor(window_->width)),
+        static_cast<std::int32_t>(TilesFor(window_->height)),
+        texHandle, sampHandle);
+}
+
 void UIRenderer::DispatchText(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot,
                               std::uint32_t itemCount,
                               std::array<float,4> clipRectPx) {
@@ -168,6 +179,7 @@ void UIRenderer::Dispatch(GraphicsCommandBuffer /*cmd*/, const GraphicsComputeSh
             case UICustomBindingKind::Sampler:
                 if (slot < heap_->samplerTable.size()) handle = heap_->samplerTable[slot];
                 break;
+            default: break;
         }
         handles.push_back(handle);
     }
diff --git a/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm b/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm
index 9e50d45..d122417 100644
--- a/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm
+++ b/interfaces/Crafter.Graphics-DescriptorHeapWebGPU.cppm
@@ -191,5 +191,13 @@ export namespace Crafter {
         heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearClampSampler();
         return SamplerSlot(&heap, r.firstElement);
     }
+
+    // Same as AllocateLinearClampSampler but the address modes are
+    // `repeat` instead of `clamp-to-edge`. Mip filtering is also linear.
+    inline SamplerSlot AllocateLinearRepeatSampler(DescriptorHeapWebGPU& heap) {
+        DescriptorRange r = heap.AllocateSamplerSlots(1);
+        heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearRepeatSampler();
+        return SamplerSlot(&heap, r.firstElement);
+    }
 }
 #endif // CRAFTER_GRAPHICS_WINDOW_DOM
diff --git a/interfaces/Crafter.Graphics-Image2D.cppm b/interfaces/Crafter.Graphics-Image2D.cppm
index 01643e2..369d539 100644
--- a/interfaces/Crafter.Graphics-Image2D.cppm
+++ b/interfaces/Crafter.Graphics-Image2D.cppm
@@ -113,17 +113,30 @@ export namespace Crafter {
         std::uint16_t width  = 0;
         std::uint16_t height = 0;
         std::uint16_t layers = 0;
+        std::uint8_t  mipLevels = 1;
 
-        void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount) {
-            width  = w;
-            height = h;
-            layers = layerCount;
-            handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount);
+        // Create an array with `layerCount` × (w × h) layers, each carrying
+        // `mipLevels` mip levels. Pass mipLevels=1 (default) for a single
+        // base level — matching the original no-mip behaviour. Caller is
+        // responsible for uploading each level via UpdateLayer (which
+        // handles CPU mip-chain generation when mipLevels > 1).
+        void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount,
+                    std::uint8_t mipLevelCount = 1) {
+            width     = w;
+            height    = h;
+            layers    = layerCount;
+            mipLevels = mipLevelCount;
+            handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount, mipLevelCount);
         }
 
-        // Decompress `tex` and upload to `layer`. The asset's dims must
-        // match the array's (w × h) — resize beforehand on the host with
-        // TextureAsset<RGBA8>::Resize() if they don't.
+        // Decompress `tex`, generate a CPU box-filter mip chain (if
+        // mipLevels > 1), and upload each level into `layer`. The asset's
+        // base-level dims must match the array's (w × h) — resize
+        // beforehand on the host with TextureAsset<RGBA8>::Resize() if
+        // they don't. Pixel data is treated as raw bytes per channel for
+        // the box filter — for non-color data (normal maps) this gives
+        // approximate but adequate results; for sRGB-encoded color data
+        // it's also approximate but visually fine for game textures.
         void UpdateLayer(std::uint16_t layer, const CompressedTextureAsset& tex) {
             if (tex.pixelStride != sizeof(PixelType)) {
                 std::println(std::cerr,
@@ -142,11 +155,56 @@ export namespace Crafter {
                 std::as_writable_bytes(std::span(pixels)),
             };
             Compression::DecompressCPU(tex.blob, outputs);
+
+            // Upload level 0.
             WebGPU::wgpuWriteImage2DLayer(
-                handle, layer,
+                handle, layer, /*level*/ 0,
                 pixels.data(),
                 static_cast<std::int32_t>(pixels.size() * sizeof(PixelType)),
                 width, height);
+
+            // Generate + upload subsequent mip levels via a 2x2 box filter
+            // on the previous level's bytes. Each channel is averaged
+            // independently across 4 source texels.
+            std::uint16_t srcW = width;
+            std::uint16_t srcH = height;
+            std::vector<PixelType> prev = std::move(pixels);
+            for (std::uint8_t lvl = 1; lvl < mipLevels; ++lvl) {
+                std::uint16_t dstW = std::max<std::uint16_t>(1, srcW >> 1);
+                std::uint16_t dstH = std::max<std::uint16_t>(1, srcH >> 1);
+                std::vector<PixelType> next(static_cast<std::size_t>(dstW) * dstH);
+                constexpr std::size_t kChannels = sizeof(PixelType);
+                auto srcBytes = reinterpret_cast<const std::uint8_t*>(prev.data());
+                auto dstBytes = reinterpret_cast<std::uint8_t*>(next.data());
+                for (std::uint16_t y = 0; y < dstH; ++y) {
+                    std::uint16_t sy0 = static_cast<std::uint16_t>(y * 2);
+                    std::uint16_t sy1 = static_cast<std::uint16_t>(std::min<std::int32_t>(sy0 + 1, srcH - 1));
+                    for (std::uint16_t x = 0; x < dstW; ++x) {
+                        std::uint16_t sx0 = static_cast<std::uint16_t>(x * 2);
+                        std::uint16_t sx1 = static_cast<std::uint16_t>(std::min<std::int32_t>(sx0 + 1, srcW - 1));
+                        std::size_t a = (static_cast<std::size_t>(sy0) * srcW + sx0) * kChannels;
+                        std::size_t b = (static_cast<std::size_t>(sy0) * srcW + sx1) * kChannels;
+                        std::size_t c = (static_cast<std::size_t>(sy1) * srcW + sx0) * kChannels;
+                        std::size_t d = (static_cast<std::size_t>(sy1) * srcW + sx1) * kChannels;
+                        std::size_t out = (static_cast<std::size_t>(y) * dstW + x) * kChannels;
+                        for (std::size_t ch = 0; ch < kChannels; ++ch) {
+                            std::uint32_t sum = static_cast<std::uint32_t>(srcBytes[a + ch])
+                                              + static_cast<std::uint32_t>(srcBytes[b + ch])
+                                              + static_cast<std::uint32_t>(srcBytes[c + ch])
+                                              + static_cast<std::uint32_t>(srcBytes[d + ch]);
+                            dstBytes[out + ch] = static_cast<std::uint8_t>((sum + 2u) >> 2);
+                        }
+                    }
+                }
+                WebGPU::wgpuWriteImage2DLayer(
+                    handle, layer, /*level*/ lvl,
+                    next.data(),
+                    static_cast<std::int32_t>(next.size() * sizeof(PixelType)),
+                    dstW, dstH);
+                prev = std::move(next);
+                srcW = dstW;
+                srcH = dstH;
+            }
         }
 
         ImageSlot AllocateSlot(DescriptorHeapWebGPU& heap) {
diff --git a/interfaces/Crafter.Graphics-InputField.cppm b/interfaces/Crafter.Graphics-InputField.cppm
index 8e24897..0812858 100644
--- a/interfaces/Crafter.Graphics-InputField.cppm
+++ b/interfaces/Crafter.Graphics-InputField.cppm
@@ -18,10 +18,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA
 */
 module;
 
-#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
-#endif // !CRAFTER_GRAPHICS_WINDOW_DOM
 export module Crafter.Graphics:InputField;
-#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
 import std;
 import :Types;
 import :Keys;
@@ -110,4 +107,3 @@ export namespace Crafter {
                         const InputFieldColors& colors,
                         bool caretVisible);
 }
-#endif // !CRAFTER_GRAPHICS_WINDOW_DOM
diff --git a/interfaces/Crafter.Graphics-Mesh.cppm b/interfaces/Crafter.Graphics-Mesh.cppm
index 087d94f..4146912 100644
--- a/interfaces/Crafter.Graphics-Mesh.cppm
+++ b/interfaces/Crafter.Graphics-Mesh.cppm
@@ -97,6 +97,7 @@ export namespace Crafter {
         // sentinel; never returned by Build().
         std::uint64_t blasAddr = 0;
         std::uint32_t triangleCount = 0;
+        std::uint32_t vertexCount   = 0;
 
         bool opaque = true;
 
diff --git a/interfaces/Crafter.Graphics-PlainComputeShader.cppm b/interfaces/Crafter.Graphics-PlainComputeShader.cppm
new file mode 100644
index 0000000..e8957f3
--- /dev/null
+++ b/interfaces/Crafter.Graphics-PlainComputeShader.cppm
@@ -0,0 +1,113 @@
+/*
+Crafter®.Graphics
+Copyright (C) 2026 Catcrafts®
+catcrafts.net
+
+This library is free software; you can redistribute it and/or
+modify it under the terms of the GNU Lesser General Public
+License version 3.0 as published by the Free Software Foundation;
+*/
+
+// Standalone compute pipeline. Dispatches at any point in the frame
+// (inside or outside the UI render pass) via the JS bridge's
+// wgpuDispatchCompute, which mirrors the wgpuBuildTLAS pattern of
+// attaching to the active encoder when one exists or creating an
+// ephemeral encoder+submit when not.
+//
+// This is the WebGPU counterpart to the Vulkan `:ComputeShader` partition.
+// They expose the same conceptual API — Load + Dispatch — but with
+// backend-specific binding plumbing. See `:GraphicsTypes` for the
+// `GraphicsComputeShader` alias picking the right one per target.
+//
+// WGSL contract:
+//   @group(0) @binding(0) uniform PushData    // optional; only if pushUniformSize>0
+//   @group(1+) @binding(N)                    // user bindings via UICustomBinding
+// When rayQuery is on, @group(1) is reserved for the RT heap; user
+// bindings start at @group(2).
+
+module;
+export module Crafter.Graphics:PlainComputeShader;
+#ifdef CRAFTER_GRAPHICS_WINDOW_DOM
+import std;
+import :WebGPU;
+import :WebGPUComputeShader;  // for UICustomBinding + UICustomBindingKind
+
+export namespace Crafter {
+    class PlainComputeShader {
+    public:
+        std::uint32_t pipelineHandle  = 0;
+        std::uint32_t pushUniformSize = 0;
+        bool          rayQueryCapable = false;
+        std::vector<UICustomBinding> customBindings;
+
+        PlainComputeShader() = default;
+        PlainComputeShader(const PlainComputeShader&) = delete;
+        PlainComputeShader& operator=(const PlainComputeShader&) = delete;
+        PlainComputeShader(PlainComputeShader&& o) noexcept
+            : pipelineHandle(o.pipelineHandle),
+              pushUniformSize(o.pushUniformSize),
+              rayQueryCapable(o.rayQueryCapable),
+              customBindings(std::move(o.customBindings)) {
+            o.pipelineHandle = 0;
+        }
+
+        // Compile + link a standalone compute shader.
+        //   wgsl             — source.
+        //   pushUniformSize  — byte size of the @group(0)@binding(0) uniform
+        //                      struct, or 0 if the shader doesn't declare one.
+        //   bindings         — every user-declared resource the dispatch
+        //                      should bind (groups 1+ if no rayQuery, 2+ if
+        //                      rayQuery). Order MUST match `handles` at
+        //                      Dispatch time.
+        //   rayQuery         — prepend the RT prelude + rayQuery library
+        //                      so the shader can call `rayQuery*` helpers.
+        void Load(std::string_view wgsl,
+                  std::uint32_t pushUniformSize_,
+                  std::span<const UICustomBinding> bindings = {},
+                  bool rayQuery = false) {
+            pushUniformSize = pushUniformSize_;
+            rayQueryCapable = rayQuery;
+            customBindings.assign(bindings.begin(), bindings.end());
+            pipelineHandle = WebGPU::wgpuLoadComputePipeline(
+                wgsl.data(), static_cast<std::int32_t>(wgsl.size()),
+                static_cast<std::int32_t>(pushUniformSize),
+                customBindings.empty() ? nullptr : customBindings.data(),
+                static_cast<std::int32_t>(customBindings.size()),
+                rayQuery ? 1 : 0);
+        }
+
+        void Load(const std::filesystem::path& wgslPath,
+                  std::uint32_t pushUniformSize_,
+                  std::span<const UICustomBinding> bindings = {},
+                  bool rayQuery = false) {
+            std::ifstream f(wgslPath, std::ios::binary);
+            if (!f) {
+                std::println(std::cerr,
+                    "PlainComputeShader::Load: cannot open {}", wgslPath.string());
+                std::abort();
+            }
+            std::string wgsl((std::istreambuf_iterator<char>(f)),
+                              std::istreambuf_iterator<char>());
+            Load(std::string_view{wgsl}, pushUniformSize_, bindings, rayQuery);
+        }
+
+        // Bind, push, dispatch. `handles` is parallel to the
+        // UICustomBinding[] passed at Load — order matches.
+        void Dispatch(const void* push, std::uint32_t pushBytes,
+                      std::span<const std::uint32_t> handles,
+                      std::uint32_t gx,
+                      std::uint32_t gy = 1,
+                      std::uint32_t gz = 1) const {
+            if (pipelineHandle == 0) return;
+            WebGPU::wgpuDispatchCompute(
+                pipelineHandle,
+                push, static_cast<std::int32_t>(pushBytes),
+                handles.empty() ? nullptr : handles.data(),
+                static_cast<std::int32_t>(handles.size()),
+                static_cast<std::int32_t>(gx),
+                static_cast<std::int32_t>(gy),
+                static_cast<std::int32_t>(gz));
+        }
+    };
+}
+#endif // CRAFTER_GRAPHICS_WINDOW_DOM
diff --git a/interfaces/Crafter.Graphics-RenderingElement3D.cppm b/interfaces/Crafter.Graphics-RenderingElement3D.cppm
index 9756745..6dcdd92 100644
--- a/interfaces/Crafter.Graphics-RenderingElement3D.cppm
+++ b/interfaces/Crafter.Graphics-RenderingElement3D.cppm
@@ -121,6 +121,37 @@ export namespace Crafter {
         // customIndex (4) + _pad (12). Defined in the WGSL traversal
         // library; never directly read by C++.
         WebGPUBuffer<char, false>           buffer;
+        // GPU LBVH support — see additional/dom-webgpu.js's TLAS-build
+        // pipeline.
+        //
+        // entryOrder: per-frame permutation array of u32, indexing into
+        // `buffer` (the TLASEntry[] array). Populated by the radix-sort
+        // pass to spatially-coherent Morton order, then consumed by the
+        // BVH construction + traversal passes.  In Stage 1 (this
+        // baseline) it's the identity permutation written by
+        // tlasBuildMain alongside the entries.
+        WebGPUBuffer<char, false>           entryOrder;
+        // mortonCodes: per-instance 32-bit Morton codes computed from the
+        // world-AABB centroid, used as the radix-sort key. Written by
+        // tlasBuildMain.
+        WebGPUBuffer<char, false>           mortonCodes;
+        // bvhNodes: 2N_PADDED - 1 sweep-tree BVH nodes built per frame
+        // by the LBVH-build compute pass. Each node 32 bytes (aabbMin +
+        // pad, aabbMax + pad). N_PADDED = 65536 (hardcoded in WGSL).
+        // Internal nodes [0, N_PADDED-1); leaves [N_PADDED-1, 2*N_PADDED-1).
+        // Node i's children are 2i+1, 2i+2 (implicit perfect binary
+        // tree). Cap: 65536 instances per scene.
+        WebGPUBuffer<char, false>           bvhNodes;
+        // tlasBins: dead, kept allocated as a 64-byte placeholder so the
+        // existing wgpuBuildTLAS C++ signature doesn't need a churn.
+        // The pre-LBVH 64-bin partition was replaced by the full BVH.
+        WebGPUBuffer<char, false>           tlasBins;
+        // Sort ping-pong buffers for the radix sort. Each pass reads
+        // from one and writes to the other, swapping role. Layout per
+        // element: 1 u32 packed key = (morton16 << 16) | tlasIndex16.
+        // Sized for N_PADDED.
+        WebGPUBuffer<char, false>           sortTempA;
+        WebGPUBuffer<char, false>           sortTempB;
 
         std::uint32_t builtInstanceCount = 0;
     };
@@ -141,6 +172,17 @@ export namespace Crafter {
         // a fresh build (no refit) — the GPU build pass is cheap at the
         // ~10–100 instance counts the design targets; LBVH-for-TLAS is a
         // future optimization for larger scenes.
+        //
+        // BuildTLAS is now split into Upload + Build so a physics
+        // compute pass (e.g. physics-tlas-transform) can run between the
+        // CPU mirror upload and the GPU LBVH build. The compute pass
+        // writes the per-instance transform bytes that BuildTLAS leaves
+        // intact for elements flagged transformOwnedByGpu, and those
+        // writes have to land before the LBVH reads them. The combined
+        // BuildTLAS is kept as a convenience for callers that don't
+        // interleave a compute pass (e.g. the ctor-time first build).
+        static void BuildTLASUpload(WebGPUCommandEncoderRef cmd, std::uint32_t index);
+        static void BuildTLASBuild(WebGPUCommandEncoderRef cmd, std::uint32_t index);
         static void BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index);
 
         static void Add(RenderingElement3D* e);
diff --git a/interfaces/Crafter.Graphics-UI.cppm b/interfaces/Crafter.Graphics-UI.cppm
index 6dc0d36..dead5c7 100644
--- a/interfaces/Crafter.Graphics-UI.cppm
+++ b/interfaces/Crafter.Graphics-UI.cppm
@@ -165,6 +165,18 @@ export namespace Crafter {
                              std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
         void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
                             std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
+#ifdef CRAFTER_GRAPHICS_WINDOW_DOM
+        // WebGPU-only overload. WebGPU bind groups can only carry one
+        // texture/sampler per dispatch, so all items in `bufferSlot`
+        // share the same texture (`imageSlot`) and sampler (`samplerSlot`).
+        // The per-item `slots` field in ImageItem is ignored on this
+        // backend. On Vulkan the bindless heap resolves per-item slots,
+        // so the cross-backend path is to call the 4-arg overload above
+        // on native and this 6-arg overload on DOM.
+        void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
+                            std::uint16_t imageSlot, std::uint16_t samplerSlot,
+                            std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
+#endif
         void DispatchText(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
                           std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
 
diff --git a/interfaces/Crafter.Graphics-WebGPU.cppm b/interfaces/Crafter.Graphics-WebGPU.cppm
index 799c8c0..dc695fe 100644
--- a/interfaces/Crafter.Graphics-WebGPU.cppm
+++ b/interfaces/Crafter.Graphics-WebGPU.cppm
@@ -35,6 +35,40 @@ namespace Crafter::WebGPU {
     extern "C" std::uint32_t wgpuCreateBuffer(std::int32_t byteSize);
     __attribute__((import_module("env"), import_name("wgpuWriteBuffer")))
     extern "C" void wgpuWriteBuffer(std::uint32_t handle, const void* srcPtr, std::int32_t byteSize);
+    __attribute__((import_module("env"), import_name("wgpuWriteBufferRange")))
+    extern "C" void wgpuWriteBufferRange(std::uint32_t handle,
+                                          std::uint32_t dstByteOffset,
+                                          const void* srcPtr,
+                                          std::int32_t byteSize);
+    // Kick off a GPU→CPU readback for the entire `byteSize`-byte prefix
+    // of the buffer at `handle`. Returns immediately; the actual map
+    // resolves asynchronously. Successive Enqueues without a Poll in
+    // between are no-ops until the previous map resolves.
+    //
+    // `resetBytes` ≥ 0 — if non-zero, the JS bridge encodes a
+    // clearBuffer over the first `resetBytes` bytes of the source
+    // buffer immediately after the copy, in the same command encoder.
+    // Used by Forts3D's GPU event queues to zero the atomic-add count
+    // for the next frame's substeps. The reset is TIED to a successful
+    // enqueue: if the enqueue was skipped (previous map still pending),
+    // the reset is skipped too — so events written by substeps during
+    // the missed-drain window accumulate into the next successful
+    // capture instead of being silently wiped.
+    __attribute__((import_module("env"), import_name("wgpuReadbackEnqueue")))
+    extern "C" void wgpuReadbackEnqueue(std::uint32_t handle,
+                                         std::int32_t byteSize,
+                                         std::int32_t resetBytes);
+    // Poll a previously-enqueued readback. Returns 1 and writes the
+    // bytes into `dstPtr` if the map resolved; returns 0 otherwise.
+    __attribute__((import_module("env"), import_name("wgpuReadbackPoll")))
+    extern "C" std::int32_t wgpuReadbackPoll(std::uint32_t handle, void* dstPtr, std::int32_t byteSize);
+    // Non-consuming readiness probe. Returns 1 if the readback has
+    // resolved and the next Poll would succeed; returns 0 otherwise.
+    // Used to gate multi-buffer drains (header + array) so neither side
+    // gets consumed until both are ready — otherwise the consumed side's
+    // data is lost while the other side waits for its map to resolve.
+    __attribute__((import_module("env"), import_name("wgpuReadbackReady")))
+    extern "C" std::int32_t wgpuReadbackReady(std::uint32_t handle);
     __attribute__((import_module("env"), import_name("wgpuDestroyBuffer")))
     extern "C" void wgpuDestroyBuffer(std::uint32_t handle);
 
@@ -64,15 +98,26 @@ namespace Crafter::WebGPU {
     // Used by Image2DArray<RGBA8> to stack per-material albedos for one
     // multi-material scene.
     __attribute__((import_module("env"), import_name("wgpuCreateImage2DArray")))
-    extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h, std::int32_t layerCount);
+    extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h,
+                                                    std::int32_t layerCount, std::int32_t mipLevels);
+    // Upload a single mip level for one array layer. `level` indexes into
+    // the texture's mip chain (0 = base); `w` / `h` must be the dimensions
+    // at that level. Callers pass each level's pixels separately — mip
+    // generation is host-side.
     __attribute__((import_module("env"), import_name("wgpuWriteImage2DLayer")))
-    extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer,
+    extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer, std::int32_t level,
                                           const void* srcPtr, std::int32_t byteSize,
                                           std::int32_t w, std::int32_t h);
 
     __attribute__((import_module("env"), import_name("wgpuCreateLinearClampSampler")))
     extern "C" std::uint32_t wgpuCreateLinearClampSampler();
 
+    // Linear-filtered, repeat-addressed sampler with mipmap linear-filter.
+    // The usual choice for tiled material textures (woodBrace, panel, etc.)
+    // which expect UV > 1.0 to wrap.
+    __attribute__((import_module("env"), import_name("wgpuCreateLinearRepeatSampler")))
+    extern "C" std::uint32_t wgpuCreateLinearRepeatSampler();
+
     __attribute__((import_module("env"), import_name("wgpuFrameBegin")))
     extern "C" void wgpuFrameBegin();
     __attribute__((import_module("env"), import_name("wgpuFrameEnd")))
@@ -158,12 +203,56 @@ namespace Crafter::WebGPU {
                                    std::int32_t  gx, std::int32_t gy,
                                    const void* handlesPtr, std::int32_t handlesCount);
 
-    // GPU TLAS-build dispatch. Reads the instance buffer (host-uploaded or
-    // GPU-written), produces per-instance world-space AABBs + per-instance
-    // transform matrices in a flat tlasBuf SSBO consumed by traceRay / rayQuery.
+    // GPU TLAS-build dispatch. Two sequential compute passes:
+    //   1. tlasBuildMain — per-instance world AABB + identity permutation
+    //      + naive Morton (overwritten in pass 2). Outputs the flat
+    //      tlasBuf SSBO consumed by traceRay / rayQuery.
+    //   2. lbvhBuildMain — single workgroup of 1024 threads; reduces
+    //      scene AABB, recomputes Morton with proper normalization,
+    //      bitonic-sorts (morton, instance_id), writes the sorted
+    //      permutation into `entryOrderBufHandle`, and refits a
+    //      sweep-tree BVH into `bvhNodesBufHandle` bottom-up.
+    // Pre-LBVH bin-build is gone; `binsBufHandle` is kept in the
+    // signature as a placeholder so the C++ side doesn't churn.
     __attribute__((import_module("env"), import_name("wgpuBuildTLAS")))
     extern "C" void wgpuBuildTLAS(std::uint32_t instanceBufHandle,
                                   std::int32_t  instanceCount,
-                                  std::uint32_t tlasOutBufHandle);
+                                  std::uint32_t tlasOutBufHandle,
+                                  std::uint32_t entryOrderBufHandle,
+                                  std::uint32_t mortonBufHandle,
+                                  std::uint32_t binsBufHandle,
+                                  std::uint32_t bvhNodesBufHandle,
+                                  std::uint32_t sortTempABufHandle,
+                                  std::uint32_t sortTempBBufHandle);
+
+    // ── Standalone compute pipelines ───────────────────────────────────
+    //
+    // Mirror of the native ComputeShader API: load a user-authored
+    // compute WGSL with arbitrary @group bindings, dispatch it at any
+    // point in the frame (inside or outside the UI compute pass —
+    // physics ticks dispatch from update lambdas, which fire outside
+    // the per-frame render encoder).
+    //
+    // WGSL contract:
+    //   @group(0) @binding(0) — uniform PushData (optional; only if
+    //                            pushUniformSize > 0 at load).
+    //   @group(1+) @binding(N) — user bindings declared via
+    //                            UICustomBinding[]. When rayQuery is
+    //                            on, @group(1) is reserved for the RT
+    //                            heap and user bindings start at
+    //                            @group(2).
+    __attribute__((import_module("env"), import_name("wgpuLoadComputePipeline")))
+    extern "C" std::uint32_t wgpuLoadComputePipeline(
+        const void* wgslPtr, std::int32_t wgslLen,
+        std::int32_t pushUniformSize,
+        const void* bindingsPtr, std::int32_t bindingsCount,
+        std::int32_t rayQueryFlag);
+
+    __attribute__((import_module("env"), import_name("wgpuDispatchCompute")))
+    extern "C" void wgpuDispatchCompute(
+        std::uint32_t pipelineHandle,
+        const void* pushPtr, std::int32_t pushBytes,
+        const void* handlesPtr, std::int32_t handlesCount,
+        std::int32_t gx, std::int32_t gy, std::int32_t gz);
 }
 #endif // CRAFTER_GRAPHICS_WINDOW_DOM
diff --git a/interfaces/Crafter.Graphics-WebGPUBuffer.cppm b/interfaces/Crafter.Graphics-WebGPUBuffer.cppm
index 9273625..4dec146 100644
--- a/interfaces/Crafter.Graphics-WebGPUBuffer.cppm
+++ b/interfaces/Crafter.Graphics-WebGPUBuffer.cppm
@@ -78,6 +78,60 @@ export namespace Crafter {
         void FlushDevice() requires(Mapped) {
             WebGPU::wgpuWriteBuffer(handle, this->value, static_cast<std::int32_t>(size));
         }
+        // Partial upload — write the bytes [srcByteOffset, srcByteOffset+byteCount)
+        // of the host mirror to GPU offset `dstByteOffset`. BuildTLAS uses
+        // this to leave the GPU-owned transform field of an RTInstance
+        // intact (the physics-tlas-transform compute shader is its sole
+        // writer) while still pushing the CPU-side metadata fields.
+        void FlushDeviceRange(std::uint32_t dstByteOffset,
+                              std::uint32_t srcByteOffset,
+                              std::uint32_t byteCount) requires(Mapped) {
+            const auto* base = reinterpret_cast<const char*>(this->value);
+            WebGPU::wgpuWriteBufferRange(handle, dstByteOffset,
+                                          base + srcByteOffset,
+                                          static_cast<std::int32_t>(byteCount));
+        }
+
+        // Push one element's worth of bytes from the host mirror to GPU.
+        // Use when a single SoA slot was mutated (body construction,
+        // per-instance flag flip) and a full FlushDevice would clobber
+        // the GPU-side updates the sim has applied to neighboring slots.
+        void FlushDeviceSlot(std::uint32_t idx) requires(Mapped) {
+            constexpr std::uint32_t kStride = sizeof(T);
+            const std::uint32_t off = idx * kStride;
+            FlushDeviceRange(off, off, kStride);
+        }
+
+        // Schedule a GPU→CPU readback of this buffer's entire contents.
+        // Asynchronous; data isn't ready until a later PollReadback
+        // returns true. Successive Enqueues without a Poll are dropped
+        // — they're a no-op while the previous map is in flight.
+        //
+        // `resetBytes` ≥ 0 — if non-zero, the first `resetBytes` bytes
+        // of THIS buffer are clearBuffer-cleared on the GPU command
+        // encoder immediately after the copy, so the readback captures
+        // the pre-clear bytes and the next frame's writers see zeros.
+        // The reset is tied to a successful enqueue (skipped enqueue =
+        // skipped reset), preserving accumulated state across missed
+        // drains.
+        void EnqueueReadback(std::uint32_t resetBytes = 0) {
+            WebGPU::wgpuReadbackEnqueue(handle,
+                                         static_cast<std::int32_t>(size),
+                                         static_cast<std::int32_t>(resetBytes));
+        }
+        // Try to copy the readback bytes into this->value. Returns true
+        // if the previous EnqueueReadback resolved and the data is now
+        // mirrored into .value; false if the map is still pending.
+        bool PollReadback() requires(Mapped) {
+            return WebGPU::wgpuReadbackPoll(handle, this->value,
+                                             static_cast<std::int32_t>(size)) != 0;
+        }
+        // Non-consuming readiness probe. Returns true if a subsequent
+        // PollReadback would succeed without changing state otherwise.
+        // Use to verify a sibling buffer is also ready before consuming.
+        bool IsReadbackReady() const {
+            return WebGPU::wgpuReadbackReady(handle) != 0;
+        }
 
         ~WebGPUBuffer() { Clear(); }
     };
diff --git a/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm b/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm
index 9726d76..68a4983 100644
--- a/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm
+++ b/interfaces/Crafter.Graphics-WebGPUComputeShader.cppm
@@ -36,6 +36,11 @@ export namespace Crafter {
         SampledTexture      = 1,   // sampled texture_2d<f32>, handle is a slot into heap.imageTable
         Sampler             = 2,   // filtering sampler, handle is a slot into heap.samplerTable
         SampledTextureArray = 3,   // sampled texture_2d_array<f32>, handle is a slot into heap.imageTable
+        // read-write storage SSBO (var<storage, read_write> in WGSL). Use
+        // for buffers shaders need to MUTATE — e.g. physics shaders that
+        // integrate node momentum, write brace stress, or output TLAS
+        // instance transforms.
+        BufferReadWrite     = 4,
     };
 
     struct UICustomBinding {
diff --git a/interfaces/Crafter.Graphics.cppm b/interfaces/Crafter.Graphics.cppm
index 5c94db9..5d136d9 100644
--- a/interfaces/Crafter.Graphics.cppm
+++ b/interfaces/Crafter.Graphics.cppm
@@ -71,5 +71,6 @@ export import :WebGPU;
 export import :WebGPUBuffer;
 export import :DescriptorHeapWebGPU;
 export import :WebGPUComputeShader;
+export import :PlainComputeShader;
 export import :ShaderBindingTableWebGPU;
 export import :PipelineRTWebGPU;
diff --git a/project.cpp b/project.cpp
index e1fd83e..65bcb44 100644
--- a/project.cpp
+++ b/project.cpp
@@ -123,7 +123,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
     // when its body is gated out. Vulkan-typed partitions stub to empty
     // modules under CRAFTER_GRAPHICS_WINDOW_DOM; the Dom/DomEvents/Router
     // partitions stub to empty modules in the opposite direction.
-    std::array<fs::path, 41> ifaces = {
+    std::array<fs::path, 42> ifaces = {
         "interfaces/Crafter.Graphics",
         "interfaces/Crafter.Graphics-Animation",
         "interfaces/Crafter.Graphics-Clipboard",
@@ -147,6 +147,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
         "interfaces/Crafter.Graphics-Mesh",
         "interfaces/Crafter.Graphics-PipelineRTVulkan",
         "interfaces/Crafter.Graphics-PipelineRTWebGPU",
+        "interfaces/Crafter.Graphics-PlainComputeShader",
         "interfaces/Crafter.Graphics-RenderingElement3D",
         "interfaces/Crafter.Graphics-RenderPass",
         "interfaces/Crafter.Graphics-Router",
@@ -170,14 +171,16 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
     if (dom) {
         // DOM impl set. UI-Shared.cpp is backend-agnostic; UI-WebGPU.cpp
         // is the DOM-only implementation of UIRenderer's GPU-touching
-        // methods. Font / FontAtlas / UIComponents are now portable.
-        std::array<fs::path, 16> domImpls = {
+        // methods. Font / FontAtlas / UIComponents / InputField are now
+        // portable.
+        std::array<fs::path, 17> domImpls = {
             "implementations/Crafter.Graphics-Clipboard",
             "implementations/Crafter.Graphics-Dom",
             "implementations/Crafter.Graphics-Font",
             "implementations/Crafter.Graphics-FontAtlas",
             "implementations/Crafter.Graphics-Gamepad",
             "implementations/Crafter.Graphics-Input",
+            "implementations/Crafter.Graphics-InputField",
             "implementations/Crafter.Graphics-Mesh-WebGPU",
             "implementations/Crafter.Graphics-PipelineRTWebGPU",
             "implementations/Crafter.Graphics-RenderingElement3D-WebGPU",