webgpu improvements
parent 5a75571ffd
commit 8347467e1e
18 changed files with 1932 additions and 153 deletions
106
TODO-lbvh-sort.md
Normal file
@@ -0,0 +1,106 @@
# LBVH parallel radix sort: count-dependent corruption

## Summary

The parallel radix sort in `lbvhBuildMain` (additional/dom-webgpu.js) produces
incorrect output that depends on the input distribution. Symptom: geometry in
the BVH-built TLAS appears to flicker (instances missing or pointing at the
wrong entry) as soon as a small object enters the TLAS alongside a tight
cluster (e.g. a single projectile next to a 1000-brace fort in 3DForts).

Bisected by selectively skipping each LBVH phase. Skipping only the radix
sort eliminates the corruption; every other phase (scene-AABB reduce,
Morton-key write, leaf init, sweep-tree refit) is correctness-clean.

Current state: the sort is gated behind `if (false)` in `lbvhBuildMain`. BVH
leaves are in instance-index order with no spatial coherence. The BVH still
builds correctly and traversal still descends a real tree, just with looser
parent AABBs.

## What we know

- The sort is LSD radix, 8 passes × 4 bits = 32-bit key.
- Keys are `(morton16 << 16) | (tlasIndex16)`; sentinels (i >= n) get
  `0xFFFFFFFF`.
- Per-pass: histogram via atomicAdd, then per-bucket parallel scatter with a
  Hillis-Steele exclusive prefix scan to compute per-thread destination
  offsets.
- Workgroup size 1024, K_PER 16 per thread = 16384 entries total.
- The math of the Hillis-Steele scan was verified: after `log2(THREADS)=10`
  steps with the read/barrier/write/barrier pattern, `shScan[tid]` holds the
  inclusive prefix sum.
- Scatter destinations are provably unique:
  `shOffsets[b] + exclusivePrefix + localIdx`, where `exclusivePrefix` is
  per-thread and `localIdx` increments per-element within the thread.
- All required barriers are present:
  - `workgroupBarrier` between scan iterations.
  - `workgroupBarrier` at end of each bucket iteration.
  - `storageBarrier` at end of each radix pass.
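
The key layout and the scan described above can be modelled on the CPU. This is a minimal sketch, not the shader code: `packKey`, `hillisSteeleInclusive` and `toExclusive` are hypothetical names, and the two inner loops stand in for the read/barrier/write/barrier phases of one scan step.

```javascript
// Hypothetical CPU model; none of these names exist in dom-webgpu.js.

// Pack a 16-bit Morton code and a 16-bit TLAS index into one 32-bit key.
// `>>> 0` keeps the result an unsigned 32-bit integer.
function packKey(morton16, tlasIndex16) {
  return (((morton16 & 0xffff) << 16) | (tlasIndex16 & 0xffff)) >>> 0;
}

// Single-buffered Hillis-Steele scan over per-thread counts.
// counts.length is assumed to be a power of two (1024 in the shader).
function hillisSteeleInclusive(counts) {
  const n = counts.length;
  const sh = Uint32Array.from(counts);
  for (let offset = 1; offset < n; offset <<= 1) {
    // Phase 1 (before the first barrier): every thread reads its neighbour.
    const read = new Uint32Array(n);
    for (let tid = 0; tid < n; tid++) {
      read[tid] = tid >= offset ? sh[tid - offset] : 0;
    }
    // Phase 2 (after the barrier): every thread adds what it read.
    for (let tid = 0; tid < n; tid++) {
      sh[tid] += read[tid];
    }
  }
  return sh; // sh[tid] is the inclusive prefix sum
}

// Exclusive prefix = inclusive prefix minus the thread's own count.
function toExclusive(inclusive, counts) {
  return inclusive.map((v, i) => v - counts[i]);
}
```

Separating the reads from the writes in each step is what makes the single-buffered variant safe; if a GPU trace shows a thread observing a neighbour's already-updated value, the barrier placement is the first thing to check.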

## What we suspect

The bug is likely one of:

1. **WGSL implementation issue** in the specific browser/driver.
   `workgroupBarrier` semantics around `atomicLoad` on workgroup memory, or
   around single-buffered Hillis-Steele where one thread reads
   `shScan[tid - offset]` while a neighbor writes `shScan[tid]`. Standard
   pattern, but the spec is subtle.
2. **Memory model edge case** triggered only with very unbalanced histograms
   (e.g. bucket 15 holding ~94% of entries because almost everything is
   sentinel-padded). Most threads have localCount ≤ 1 for non-{0, 15}
   buckets and exactly 15-16 for bucket 15; that mix may surface a
   compiler-introduced reordering.
3. **A logical bug in the scan or scatter** that the human review keeps
   missing. Re-reading the code is the last thing that helps; what's
   needed is a GPU-side trace.
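
The ~94% figure in item 2 can be checked with a short histogram sketch. It assumes 1011 real keys (the current scale quoted in this file) whose top Morton nibble is 0, padded to 16384 with `0xFFFFFFFF` sentinels; `paddedKeys` and `nibbleHistogram` are hypothetical helper names.

```javascript
// Hypothetical helpers for inspecting per-pass bucket imbalance.
const N_PADDED = 16384;

// Pad the real keys with sentinels up to the fixed sort size.
function paddedKeys(realKeys) {
  const keys = new Uint32Array(N_PADDED).fill(0xffffffff);
  keys.set(realKeys);
  return keys;
}

// Count how many keys fall into each of the 16 buckets for one 4-bit pass.
function nibbleHistogram(keys, pass) {
  const hist = new Uint32Array(16);
  for (const k of keys) {
    hist[(k >>> (pass * 4)) & 0xf]++;
  }
  return hist;
}

// Fraction of all entries that land in one bucket.
function bucketFraction(hist, bucket) {
  let total = 0;
  for (const c of hist) total += c;
  return hist[bucket] / total;
}
```

With 1011 real keys, bucket 15 of the top-nibble pass holds 16384 - 1011 = 15373 entries, a fraction of roughly 0.938. The sentinels land in bucket 15 on every pass, so the imbalance is present throughout the sort.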

## Reproducing

1. Run 3DForts WebGPU build with normal projectile firing.
2. Aim near (not necessarily at) the fort.
3. Observe braces / panels flickering as the projectile flies past.

## Diagnostic strategies if revisiting

1. **GPU-side trace.** Add a debug buffer (`array<u32>` sized for all 16384
   entries × a few u32). Have each thread write its intermediate scan
   values and final scatter destinations there. Read back to CPU and diff
   against an expected oracle (CPU-computed reference sort of the same
   input keys).
2. **Halve the search.** Reduce `PASSES` to 1 and check: does a single-pass
   sort already corrupt, or does corruption only emerge after multiple
   ping-pongs?
3. **Replace the scan.** Swap Hillis-Steele for a Blelloch up/down-sweep
   scan or a `subgroupExclusiveAdd` variant where available. If the
   replacement fixes it, the bug is in the Hillis-Steele specifically.
4. **Serialize the scatter.** Have thread 0 do all scatters by itself
   (loop over all 16384 entries × 16 buckets sequentially). Slow but a
   provably-correct reference. If this fixes the flicker, the parallel
   scatter has the bug.
5. **Replace LSD with bitonic sort.** Different algorithm entirely. If
   bitonic works, radix has a structural problem.
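
The CPU oracle for strategies 1 and 2 could look roughly like the sketch below. All three function names are hypothetical; `referenceRadixPass` is a stable counting sort on one 4-bit digit, so it models what a single GPU pass is expected to produce, and eight passes in order reproduce the full sort.

```javascript
// Hypothetical CPU oracle; not part of dom-webgpu.js.

// Full reference: typed arrays sort numerically by default.
function referenceSort(keys) {
  return Uint32Array.from(keys).sort();
}

// One LSD radix pass: stable counting sort on the 4-bit digit `pass`.
function referenceRadixPass(keys, pass) {
  const shift = pass * 4;
  const offsets = new Uint32Array(16);
  for (const k of keys) offsets[(k >>> shift) & 0xf]++;
  // Turn bucket counts into exclusive start offsets.
  let running = 0;
  for (let b = 0; b < 16; b++) {
    const count = offsets[b];
    offsets[b] = running;
    running += count;
  }
  const out = new Uint32Array(keys.length);
  for (const k of keys) out[offsets[(k >>> shift) & 0xf]++] = k;
  return out;
}

// Compare GPU read-back against the oracle; null means they agree.
function firstMismatch(gpuKeys, inputKeys) {
  const expected = referenceSort(inputKeys);
  for (let i = 0; i < expected.length; i++) {
    if (gpuKeys[i] !== expected[i]) {
      return { index: i, expected: expected[i], actual: gpuKeys[i] };
    }
  }
  return null;
}
```

Diffing the GPU output of pass `p` against `referenceRadixPass` applied to the same pass input narrows the corruption to a single pass before looking at scan or scatter internals.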

## Why it's not blocking

At the current scale (~1011 entries), the BVH still functions:

- Sentinel half-subtrees are degenerate-AABB-rejected at the top of the
  tree very cheaply (~1 AABB test per skipped subtree).
- The real-leaf subtree has ~10 levels of descent (`log2(1024)`), all of
  which are real AABB tests. Without spatial coherence the AABBs are
  looser than a properly-sorted BVH, but they still bound the geometry.
- Ray-vs-triangle work dominates anyway; BVH traversal is a small fraction
  of the per-pixel cost.

Headroom: LBVH_MAX = 16384. If the application pushes much past ~4000 real
entries this stops being acceptable and the sort needs to actually work.

## Acceptance criteria for "fixed"

- The diagnostic repro (3DForts: fire a projectile near the fort) shows
  no flicker at all.
- The sort produces output ordered by `(morton16, tlasIndex)` ascending.
- A unit test (CPU oracle vs GPU output) passes for at least three
  histogram distributions: all-uniform, all-in-one-bucket, and the
  3DForts-style "one small object next to a tight cluster".

@@ -168,15 +168,25 @@ function setValue(cookie, valPtr, valLen) {
 // so removeEventListener can re-find it. C++-side handler id counters
 // are per-kind, so a per-kind suffix is what makes the keys unique.
 
+// devicePixelRatio scaling factor. dom-webgpu.js sets window.crafter_dpr
+// during its canvas sync so this side and the GPU side agree on a single
+// physical-pixel coordinate space. Fallback to the live DPR if no GPU
+// bridge ran (pure-CppDOM apps); ultimately fallback to 1 so non-HiDPI
+// browsers behave as before.
+function __dpr() {
+    return window.crafter_dpr || window.devicePixelRatio || 1;
+}
+
 function __makeMouseListenerPair(kind, eventName, exportName) {
     return {
         add(cookie, id) {
             const el = __jsmemory.get(cookie);
             if (!el) return;
             const handler = (event) => {
+                const s = __dpr();
                 __wasm()[exportName](id,
-                    event.clientX, event.clientY,
-                    event.screenX, event.screenY,
+                    event.clientX * s, event.clientY * s,
+                    event.screenX * s, event.screenY * s,
                     event.button, event.buttons,
                     event.altKey, event.ctrlKey, event.shiftKey, event.metaKey);
             };
@@ -317,7 +327,10 @@ const __resizePair = {
     // Resize is window-global in CppDOM. Mirror that: attach to `window`
     // regardless of which element the C++ caller passed.
     add(cookie, id) {
-        const handler = () => __wasm().ExecuteResizeHandler(id, window.innerWidth, window.innerHeight);
+        const handler = () => {
+            const s = __dpr();
+            __wasm().ExecuteResizeHandler(id, window.innerWidth * s, window.innerHeight * s);
+        };
         __listenerHandlers.set(`${cookie}-${id}-resize`, handler);
         window.addEventListener("resize", handler);
     },
@@ -345,9 +358,10 @@ const __wheelPair = {
     add(cookie, id) {
         const el = __jsmemory.get(cookie); if (!el) return;
         const handler = (event) => {
+            const s = __dpr();
             __wasm().ExecuteWheelHandler(id,
                 event.deltaX, event.deltaY, event.deltaZ, event.deltaMode,
-                event.clientX, event.clientY, event.screenX, event.screenY,
+                event.clientX * s, event.clientY * s, event.screenX * s, event.screenY * s,
                 event.button, event.buttons,
                 event.altKey, event.ctrlKey, event.shiftKey, event.metaKey);
         };
@@ -378,11 +392,97 @@ function domAttachWindow(windowHandle) {
         if (fn) fn(__windowAttachedHandle, ...args);
     };
 
-    __windowListeners.mousemove = (e) => fire("__crafterDom_mouseMove", [e.clientX, e.clientY]);
-    __windowListeners.mousedown = (e) => fire("__crafterDom_mouseDown", [e.button]);
-    __windowListeners.mouseup = (e) => fire("__crafterDom_mouseUp", [e.button]);
+    // Synthetic absolute position for pointer-lock mode. While the
+    // pointer is locked, browsers fire mousemove events with movementX/Y
+    // deltas instead of meaningful clientX/Y, and the cursor is hidden +
+    // captured by the canvas (no window-edge clamp). We accumulate the
+    // deltas into a synthetic position and feed *that* to the C++ side,
+    // so the existing `currentMousePos - lastMousePos` delta computation
+    // keeps working unchanged. Initialised to the cursor position the
+    // moment lock is acquired.
+    let __ptrLockSyntheticX = 0;
+    let __ptrLockSyntheticY = 0;
+    const __isPointerLocked = () =>
+        document.pointerLockElement !== null &&
+        document.pointerLockElement !== undefined;
+
+    // pointermove (not mousemove) so we can pull sub-frame events out of
+    // `getCoalescedEvents()`. Browsers normally collapse multiple raw
+    // mouse events between paint frames into a single event you'd see
+    // via `mousemove`; PointerEvent.getCoalescedEvents() returns the raw
+    // pre-coalesced list. Summing those gives a higher-resolution delta
+    // per frame than the single coalesced movementX/Y. PointerEvent also
+    // delivers fractional movementX from high-precision mice on Chromium.
+    __windowListeners.mousemove = (e) => {
+        const s = __dpr();
+        const locked = __isPointerLocked();
+        if (locked) {
+            // Accumulate over every sub-frame event the browser had
+            // queued up. `getCoalescedEvents` is the spec-correct way
+            // to access raw input between rAF ticks. Some browsers
+            // return an empty list; fall back to the top-level event.
+            let dx = 0, dy = 0;
+            const sub = (typeof e.getCoalescedEvents === "function")
+                ? e.getCoalescedEvents() : null;
+            if (sub && sub.length > 0) {
+                for (let i = 0; i < sub.length; i++) {
+                    dx += sub[i].movementX;
+                    dy += sub[i].movementY;
+                }
+            } else {
+                dx = e.movementX;
+                dy = e.movementY;
+            }
+            // No DPR scaling in pointer-lock: position is synthetic and
+            // there's no UI hit-test using it. DPR-scaling here only
+            // rounds finer movements up to multiples of `dpr`, which is
+            // pure quantization loss for aim controls.
+            __ptrLockSyntheticX += dx;
+            __ptrLockSyntheticY += dy;
+            fire("__crafterDom_mouseMove",
+                [__ptrLockSyntheticX, __ptrLockSyntheticY]);
+        } else {
+            fire("__crafterDom_mouseMove", [e.clientX * s, e.clientY * s]);
+        }
+    };
+    __windowListeners.mousedown = (e) => {
+        // Right-click holds engage pointer lock, the typical FPS-camera
+        // convention. Acquiring on any click (the previous policy) made
+        // menus annoying: clicking a button hid the cursor mid-flow. Now
+        // the cursor stays free for clicks/menus until the user holds
+        // RMB to actively look around. Browsers require lock requests
+        // from user gestures, which mousedown satisfies.
+        if (e.button === 2 && !__isPointerLocked()) {
+            const target = document.body;
+            if (target && target.requestPointerLock) {
+                target.requestPointerLock();
+                // Seed the synthetic position from the click point so
+                // there's no jump when the lock starts producing deltas.
+                __ptrLockSyntheticX = e.clientX;
+                __ptrLockSyntheticY = e.clientY;
+            }
+        }
+        fire("__crafterDom_mouseDown", [e.button]);
+    };
+    __windowListeners.mouseup = (e) => {
+        // Release lock on RMB up; the cursor reappears at the seed point
+        // for clicks/menus until the next RMB hold.
+        if (e.button === 2 && __isPointerLocked()) {
+            document.exitPointerLock();
+        }
+        fire("__crafterDom_mouseUp", [e.button]);
+    };
     __windowListeners.wheel = (e) => fire("__crafterDom_wheel", [e.deltaY]);
     __windowListeners.contextmenu = (e) => { e.preventDefault(); };
+    __windowListeners.pointerlockchange = () => {
+        // Reset the synthetic accumulator when lock is released so the
+        // next acquisition starts cleanly. The C++ side will see one
+        // small jump back to the real cursor position on release.
+        if (!__isPointerLocked()) {
+            __ptrLockSyntheticX = 0;
+            __ptrLockSyntheticY = 0;
+        }
+    };
 
     // Keyboard events go through the document so they fire even when no
     // input element is focused. event.code is the layout-independent
@@ -400,16 +500,24 @@ function domAttachWindow(windowHandle) {
         __wasm().WasmFree(codePtr);
     };
 
-    __windowListeners.resize = () => fire("__crafterDom_resize", [window.innerWidth, window.innerHeight]);
+    __windowListeners.resize = () => {
+        const s = __dpr();
+        fire("__crafterDom_resize", [window.innerWidth * s, window.innerHeight * s]);
+    };
     __windowListeners.beforeunload = () => fire("__crafterDom_close", []);
 
-    document.addEventListener("mousemove", __windowListeners.mousemove);
+    // pointermove (not mousemove) so the handler receives PointerEvents
+    // and can use getCoalescedEvents() to recover sub-frame motion. The
+    // handler's variable name stays "mousemove"; it's the same JS object,
+    // just bound to a different event type.
+    document.addEventListener("pointermove", __windowListeners.mousemove);
     document.addEventListener("mousedown", __windowListeners.mousedown);
     document.addEventListener("mouseup", __windowListeners.mouseup);
     document.addEventListener("wheel", __windowListeners.wheel);
     document.addEventListener("contextmenu", __windowListeners.contextmenu);
     document.addEventListener("keydown", __windowListeners.keydown);
     document.addEventListener("keyup", __windowListeners.keyup);
+    document.addEventListener("pointerlockchange", __windowListeners.pointerlockchange);
     window  .addEventListener("resize", __windowListeners.resize);
     window  .addEventListener("beforeunload",__windowListeners.beforeunload);
 }
@@ -418,8 +526,8 @@ function domSetTitle(titlePtr, titleLen) {
     document.title = __readUtf8(titlePtr, titleLen);
 }
 
-function domGetInnerWidth() { return window.innerWidth; }
-function domGetInnerHeight() { return window.innerHeight; }
+function domGetInnerWidth() { return Math.round(window.innerWidth * __dpr()); }
+function domGetInnerHeight() { return Math.round(window.innerHeight * __dpr()); }
 
 // ─── requestAnimationFrame loop ───────────────────────────────────────
 

File diff suppressed because it is too large

@@ -225,6 +225,7 @@ namespace {
                     std::span<const std::uint32_t> indices,
                     std::span<const std::byte> attribsBytes) {
         mesh.triangleCount = static_cast<std::uint32_t>(indices.size()) / 3;
+        mesh.vertexCount = static_cast<std::uint32_t>(vertices.size());
 
         Builder builder;
         builder.Build(vertices, indices);

@@ -4,12 +4,21 @@ Copyright (C) 2026 Catcrafts®
 catcrafts.net
 */
 
-// DOM-mode TLAS upkeep. BuildTLAS copies the per-element RTInstance into
-// the host-visible instance buffer (skipping the transform for elements
-// whose transform is GPU-owned), uploads it, then dispatches the JS-side
-// TLAS-build compute pass, which consults the per-BLAS records published
-// at Mesh::Build() time to produce world-space AABBs and inverse
-// transforms in the format `traceRay` / `rayQuery` consume.
+// DOM-mode TLAS upkeep. BuildTLAS is split in two phases so a physics
+// compute pass can run between them:
+// - BuildTLASUpload mirrors the CPU-side RTInstance array into the
+//   host-visible instance buffer (with partial-write semantics that
+//   preserve the transform bytes for elements flagged
+//   transformOwnedByGpu, see notes in the body) and uploads the
+//   metadata buffer.
+// - BuildTLASBuild dispatches the JS-side TLAS-build compute pass,
+//   which consults the per-BLAS records published at Mesh::Build()
+//   time to produce world-space AABBs and inverse transforms in the
+//   format `traceRay` / `rayQuery` consume.
+// The combined BuildTLAS calls both back-to-back; callers that want to
+// interleave a physics tlas-transform compute pass (which writes the
+// transform bytes BuildTLASUpload leaves intact) call Upload + their
+// compute pass + Build manually.
 
 module;
 module Crafter.Graphics:RenderingElement3D_implWebGPU;
@@ -41,7 +50,7 @@ void RenderingElement3D::Remove(RenderingElement3D* e) {
     e->indexInElements = std::numeric_limits<std::uint32_t>::max();
 }
 
-void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
+void RenderingElement3D::BuildTLASUpload(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
     auto& tlas = tlases[index];
     const std::uint32_t primitiveCount = static_cast<std::uint32_t>(elements.size());
     if (primitiveCount == 0) {
@@ -49,19 +58,52 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_
         return;
     }
 
-    // (Re)allocate instance + metadata + output TLAS buffers if the count
-    // changed. WebGPUBuffer::Resize destroys and recreates the GPU buffer;
-    // bind-group caches keyed on the buffer handle are invalidated in the
-    // JS bridge automatically.
-    if (primitiveCount != tlas.builtInstanceCount) {
-        tlas.instanceBuffer.Resize(primitiveCount);
-        tlas.metadataBuffer.Resize(primitiveCount);
-        // TLASEntry layout in WGSL is 144 bytes due to vec3 align/pad
-        // rules. Must match the struct declared in the rtWgslTypes
-        // block in additional/dom-webgpu.js.
-        tlas.buffer.Resize(primitiveCount * 144);
+    constexpr std::uint32_t kNPadded = 65536u; // size for instance / metadata mirrors
+    constexpr std::uint32_t kLbvhMax = 16384u; // matches N_PADDED in lbvhBuildWgsl
+    constexpr std::uint32_t kNodeCount = 2u * kNPadded - 1u;
+
+    // ALL TLAS-side GPU buffers get allocated ONCE and never resized.
+    // The LBVH-build shader takes the real instance count via a uniform
+    // (lbvhPc.nReal) instead of arrayLength(&entries), so the
+    // tlas.buffer / entryOrder / mortonCodes don't need to grow when
+    // the application's element count changes.
+    //
+    // Why this matters: an earlier version resized these per-frame on
+    // primitiveCount change. The destroy+recreate cycle on the GPU
+    // buffer caused subtle mid-game flicker as soon as any element was
+    // added (e.g. firing a projectile): fort braces would appear to
+    // briefly vanish in patterns deterministic on the projectile's
+    // angle. Suspected driver-level memory recycling without proper
+    // zero-init; the fixed-size allocation sidesteps it entirely.
+    if (tlas.instanceBuffer.handle == 0) {
+        tlas.instanceBuffer.Resize(kNPadded);
+        tlas.metadataBuffer.Resize(kNPadded);
+        tlas.bvhNodes.Resize(kNodeCount * 32u);
+        tlas.sortTempA.Resize(kNPadded * 4u);
+        tlas.sortTempB.Resize(kNPadded * 4u);
+        tlas.tlasBins.Resize(64 * 32);
+        // TLAS-entry / order / morton-code buffers: sized for the LBVH
+        // cap (16384). lbvhBuildMain iterates `lbvhPc.nReal` real
+        // entries; the remainder stays zero / sentinel. Keep these
+        // stable across element-count changes so the renderer's bind
+        // group references the same buffer handle every frame.
+        tlas.buffer.Resize(kLbvhMax * 144u);
+        tlas.entryOrder.Resize(kLbvhMax * 4u);
+        tlas.mortonCodes.Resize(kLbvhMax * 4u);
     }
 
+    // NB: tlas.buffer / entryOrder / mortonCodes get resized in
+    // BuildTLASBuild, NOT here. Resize destroys + recreates the GPU
+    // resource (and the JS-side handle); the rayQuery dispatches that
+    // run between BuildTLASUpload and BuildTLASBuild (projectile-collide,
+    // splash, builder-pick) still hold the previous frame's TLAS in
+    // rtState.current{Tlas,EntryOrder,Bvh}. If we resized here, those
+    // handles would point at destroyed buffers and the dispatches would
+    // log "no TLAS built yet" every frame the element count changed
+    // (e.g. every projectile fire). Resizing inside BuildTLASBuild,
+    // immediately before wgpuBuildTLAS publishes the new handles, keeps
+    // the JS-side current* refs in sync with the GPU resources.
+
     for (std::uint32_t i = 0; i < primitiveCount; ++i) {
         auto& dst = tlas.instanceBuffer.value[i];
         const auto& src = elements[i]->instance;
@@ -80,12 +122,73 @@ void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef /*cmd*/, std::uint32_
         tlas.metadataBuffer.value[i] = elements[i]->userMetadata;
     }
 
-    tlas.instanceBuffer.FlushDevice();
+    // Upload the instance buffer with partial-write semantics: for runs
+    // of CPU-driven elements (transformOwnedByGpu=false) we push the
+    // whole 64-byte struct in one writeBuffer call; for GPU-driven runs
+    // we push only the trailing 16 metadata bytes per element, leaving
+    // the transform field intact for the physics-tlas-transform compute
+    // shader to update. The two arms below produce identical GPU state
+    // when every element is CPU-driven; this is a no-op refactor until
+    // 3DForts flips its physics elements to transformOwnedByGpu=true.
+    constexpr std::uint32_t kInstSize = sizeof(RTInstance); // 64
+    constexpr std::uint32_t kTransformSize = sizeof(RTTransformMatrix); // 48
+    constexpr std::uint32_t kMetaSize = kInstSize - kTransformSize; // 16
+
+    std::uint32_t runStart = 0;
+    bool runOwned = elements[0]->transformOwnedByGpu;
+    for (std::uint32_t i = 1; i <= primitiveCount; ++i) {
+        const bool atEnd = (i == primitiveCount);
+        const bool currOwned = atEnd ? !runOwned : elements[i]->transformOwnedByGpu;
+        if (currOwned == runOwned && !atEnd) continue;
+
+        if (runOwned) {
+            // GPU-driven run: metadata only, per element. Cannot batch
+            // because the metadata bytes are non-contiguous in the
+            // instance buffer (one 16-byte chunk per 64-byte slot).
+            for (std::uint32_t j = runStart; j < i; ++j) {
+                const std::uint32_t off = j * kInstSize + kTransformSize;
+                tlas.instanceBuffer.FlushDeviceRange(off, off, kMetaSize);
+            }
+        } else {
+            // CPU-driven run: one contiguous writeBuffer.
+            const std::uint32_t startOff = runStart * kInstSize;
+            const std::uint32_t bytes = (i - runStart) * kInstSize;
+            tlas.instanceBuffer.FlushDeviceRange(startOff, startOff, bytes);
+        }
+        runStart = i;
+        runOwned = currOwned;
+    }
+
     tlas.metadataBuffer.FlushDevice();
+}
+
+void RenderingElement3D::BuildTLASBuild(WebGPUCommandEncoderRef /*cmd*/, std::uint32_t index) {
+    auto& tlas = tlases[index];
+    const std::uint32_t primitiveCount = static_cast<std::uint32_t>(elements.size());
+    if (primitiveCount == 0) {
+        // Upload already cleared builtInstanceCount; nothing to dispatch.
+        return;
+    }
+
+    // No per-count Resize. tlas.buffer / entryOrder / mortonCodes were
+    // allocated at kLbvhMax in BuildTLASUpload's first call and stay
+    // that size. The LBVH shader reads the real count from a uniform
+    // (lbvhPc.nReal) wgpuBuildTLAS writes each call.
 
     WebGPU::wgpuBuildTLAS(tlas.instanceBuffer.handle,
                           static_cast<std::int32_t>(primitiveCount),
-                          tlas.buffer.handle);
+                          tlas.buffer.handle,
+                          tlas.entryOrder.handle,
+                          tlas.mortonCodes.handle,
+                          tlas.tlasBins.handle,
+                          tlas.bvhNodes.handle,
+                          tlas.sortTempA.handle,
+                          tlas.sortTempB.handle);
 
     tlas.builtInstanceCount = primitiveCount;
 }
+
+void RenderingElement3D::BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index) {
+    BuildTLASUpload(cmd, index);
+    BuildTLASBuild(cmd, index);
+}

@@ -98,13 +98,9 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf
     if (itemCount == 0) return;
     UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx);
     auto handle = heap_->bufferTable[bufferSlot];
-    // For DispatchImages, the WGSL expects a texture + sampler in group 3.
-    // The library v1 doesn't expose user-image registration on DOM (out of
-    // scope per plan). If the user calls DispatchImages without a registered
-    // image, fall back to using the font atlas binding; the user's items
-    // should reference texSlot/sampSlot but on DOM those are ignored. For
-    // now, route through the font atlas texture if available; otherwise
-    // skip the dispatch.
+    // Backward-compatible fallback: callers that don't pass a texture
+    // get the font atlas. Useful for tests, useless for real content.
+    // New code should use the 6-arg overload below.
     if (fontAtlasImageSlot_) {
         auto texHandle = heap_->imageTable[fontAtlasImageSlot_];
         auto sampHandle = heap_->samplerTable[fontAtlasSamplerSlot_];
@@ -115,6 +111,21 @@ void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t buf
     }
 }
 
+void UIRenderer::DispatchImages(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot,
+                               std::uint32_t itemCount,
+                               std::uint16_t imageSlot, std::uint16_t samplerSlot,
+                               std::array<float,4> clipRectPx) {
+    if (itemCount == 0) return;
+    UIDispatchHeader hdr = FillHeader(bufferSlot, itemCount, clipRectPx);
+    auto handle = heap_->bufferTable[bufferSlot];
+    auto texHandle = heap_->imageTable[imageSlot];
+    auto sampHandle = heap_->samplerTable[samplerSlot];
+    WebGPU::wgpuDispatchImages(handle, &hdr,
+                               static_cast<std::int32_t>(TilesFor(window_->width)),
+                               static_cast<std::int32_t>(TilesFor(window_->height)),
+                               texHandle, sampHandle);
+}
+
 void UIRenderer::DispatchText(GraphicsCommandBuffer /*cmd*/, std::uint32_t bufferSlot,
                               std::uint32_t itemCount,
                               std::array<float,4> clipRectPx) {
@ -168,6 +179,7 @@ void UIRenderer::Dispatch(GraphicsCommandBuffer /*cmd*/, const GraphicsComputeSh
|
||||||
case UICustomBindingKind::Sampler:
|
case UICustomBindingKind::Sampler:
|
||||||
if (slot < heap_->samplerTable.size()) handle = heap_->samplerTable[slot];
|
if (slot < heap_->samplerTable.size()) handle = heap_->samplerTable[slot];
|
||||||
break;
|
break;
|
||||||
|
default: break;
|
||||||
}
|
}
|
||||||
handles.push_back(handle);
|
handles.push_back(handle);
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -191,5 +191,13 @@ export namespace Crafter {
|
||||||
heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearClampSampler();
|
heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearClampSampler();
|
||||||
return SamplerSlot(&heap, r.firstElement);
|
return SamplerSlot(&heap, r.firstElement);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Same as AllocateLinearClampSampler but the address modes are
|
||||||
|
// `repeat` instead of `clamp-to-edge`. Mip filtering is also linear.
|
||||||
|
inline SamplerSlot AllocateLinearRepeatSampler(DescriptorHeapWebGPU& heap) {
|
||||||
|
DescriptorRange r = heap.AllocateSamplerSlots(1);
|
||||||
|
heap.samplerTable[r.firstElement] = WebGPU::wgpuCreateLinearRepeatSampler();
|
||||||
|
return SamplerSlot(&heap, r.firstElement);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
#endif // CRAFTER_GRAPHICS_WINDOW_DOM
|
#endif // CRAFTER_GRAPHICS_WINDOW_DOM
|
||||||
|
|
|
||||||
|
|
@ -113,17 +113,30 @@ export namespace Crafter {
|
||||||
std::uint16_t width = 0;
|
std::uint16_t width = 0;
|
||||||
std::uint16_t height = 0;
|
std::uint16_t height = 0;
|
||||||
std::uint16_t layers = 0;
|
std::uint16_t layers = 0;
|
||||||
|
std::uint8_t mipLevels = 1;
|
||||||
|
|
||||||
void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount) {
|
// Create an array of `layerCount` layers, each (w × h) texels and carrying
|
||||||
|
// `mipLevels` mip levels. Pass mipLevels=1 (default) for a single
|
||||||
|
// base level — matching the original no-mip behaviour. Caller is
|
||||||
|
// responsible for uploading each level via UpdateLayer (which
|
||||||
|
// handles CPU mip-chain generation when mipLevels > 1).
|
||||||
|
void Create(std::uint16_t w, std::uint16_t h, std::uint16_t layerCount,
|
||||||
|
std::uint8_t mipLevelCount = 1) {
|
||||||
width = w;
|
width = w;
|
||||||
height = h;
|
height = h;
|
||||||
layers = layerCount;
|
layers = layerCount;
|
||||||
handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount);
|
mipLevels = mipLevelCount;
|
||||||
|
handle = WebGPU::wgpuCreateImage2DArray(w, h, layerCount, mipLevelCount);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Decompress `tex` and upload to `layer`. The asset's dims must
|
// Decompress `tex`, generate a CPU box-filter mip chain (if
|
||||||
// match the array's (w × h) — resize beforehand on the host with
|
// mipLevels > 1), and upload each level into `layer`. The asset's
|
||||||
// TextureAsset<RGBA8>::Resize() if they don't.
|
// base-level dims must match the array's (w × h) — resize
|
||||||
|
// beforehand on the host with TextureAsset<RGBA8>::Resize() if
|
||||||
|
// they don't. Pixel data is treated as raw bytes per channel for
|
||||||
|
// the box filter — for non-color data (normal maps) this gives
|
||||||
|
// approximate but adequate results; for sRGB-encoded color data
|
||||||
|
// it's also approximate but visually fine for game textures.
|
||||||
void UpdateLayer(std::uint16_t layer, const CompressedTextureAsset& tex) {
|
void UpdateLayer(std::uint16_t layer, const CompressedTextureAsset& tex) {
|
||||||
if (tex.pixelStride != sizeof(PixelType)) {
|
if (tex.pixelStride != sizeof(PixelType)) {
|
||||||
std::println(std::cerr,
|
std::println(std::cerr,
|
||||||
|
|
@ -142,11 +155,56 @@ export namespace Crafter {
|
||||||
std::as_writable_bytes(std::span(pixels)),
|
std::as_writable_bytes(std::span(pixels)),
|
||||||
};
|
};
|
||||||
Compression::DecompressCPU(tex.blob, outputs);
|
Compression::DecompressCPU(tex.blob, outputs);
|
||||||
|
|
||||||
|
// Upload level 0.
|
||||||
WebGPU::wgpuWriteImage2DLayer(
|
WebGPU::wgpuWriteImage2DLayer(
|
||||||
handle, layer,
|
handle, layer, /*level*/ 0,
|
||||||
pixels.data(),
|
pixels.data(),
|
||||||
static_cast<std::int32_t>(pixels.size() * sizeof(PixelType)),
|
static_cast<std::int32_t>(pixels.size() * sizeof(PixelType)),
|
||||||
width, height);
|
width, height);
|
||||||
|
|
||||||
|
// Generate + upload subsequent mip levels via a 2x2 box filter
|
||||||
|
// on the previous level's bytes. Each channel is averaged
|
||||||
|
// independently across 4 source texels.
|
||||||
|
std::uint16_t srcW = width;
|
||||||
|
std::uint16_t srcH = height;
|
||||||
|
std::vector<PixelType> prev = std::move(pixels);
|
||||||
|
for (std::uint8_t lvl = 1; lvl < mipLevels; ++lvl) {
|
||||||
|
std::uint16_t dstW = std::max<std::uint16_t>(1, srcW >> 1);
|
||||||
|
std::uint16_t dstH = std::max<std::uint16_t>(1, srcH >> 1);
|
||||||
|
std::vector<PixelType> next(static_cast<std::size_t>(dstW) * dstH);
|
||||||
|
constexpr std::size_t kChannels = sizeof(PixelType);
|
||||||
|
auto srcBytes = reinterpret_cast<const std::uint8_t*>(prev.data());
|
||||||
|
auto dstBytes = reinterpret_cast<std::uint8_t*>(next.data());
|
||||||
|
for (std::uint16_t y = 0; y < dstH; ++y) {
|
||||||
|
std::uint16_t sy0 = static_cast<std::uint16_t>(y * 2);
|
||||||
|
std::uint16_t sy1 = static_cast<std::uint16_t>(std::min<std::int32_t>(sy0 + 1, srcH - 1));
|
||||||
|
for (std::uint16_t x = 0; x < dstW; ++x) {
|
||||||
|
std::uint16_t sx0 = static_cast<std::uint16_t>(x * 2);
|
||||||
|
std::uint16_t sx1 = static_cast<std::uint16_t>(std::min<std::int32_t>(sx0 + 1, srcW - 1));
|
||||||
|
std::size_t a = (static_cast<std::size_t>(sy0) * srcW + sx0) * kChannels;
|
||||||
|
std::size_t b = (static_cast<std::size_t>(sy0) * srcW + sx1) * kChannels;
|
||||||
|
std::size_t c = (static_cast<std::size_t>(sy1) * srcW + sx0) * kChannels;
|
||||||
|
std::size_t d = (static_cast<std::size_t>(sy1) * srcW + sx1) * kChannels;
|
||||||
|
std::size_t out = (static_cast<std::size_t>(y) * dstW + x) * kChannels;
|
||||||
|
for (std::size_t ch = 0; ch < kChannels; ++ch) {
|
||||||
|
std::uint32_t sum = static_cast<std::uint32_t>(srcBytes[a + ch])
|
||||||
|
+ static_cast<std::uint32_t>(srcBytes[b + ch])
|
||||||
|
+ static_cast<std::uint32_t>(srcBytes[c + ch])
|
||||||
|
+ static_cast<std::uint32_t>(srcBytes[d + ch]);
|
||||||
|
dstBytes[out + ch] = static_cast<std::uint8_t>((sum + 2u) >> 2);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
WebGPU::wgpuWriteImage2DLayer(
|
||||||
|
handle, layer, /*level*/ lvl,
|
||||||
|
next.data(),
|
||||||
|
static_cast<std::int32_t>(next.size() * sizeof(PixelType)),
|
||||||
|
dstW, dstH);
|
||||||
|
prev = std::move(next);
|
||||||
|
srcW = dstW;
|
||||||
|
srcH = dstH;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
ImageSlot AllocateSlot(DescriptorHeapWebGPU& heap) {
|
ImageSlot AllocateSlot(DescriptorHeapWebGPU& heap) {
|
||||||
|
|
|
||||||
|
|
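The mip generation in UpdateLayer above can be exercised in isolation. The following is a minimal sketch, assuming tightly packed 8-bit channels as the comment describes; `FullMipCount` and `Downsample2x2` are hypothetical helper names for illustration and are not part of the library. It shows one 2x2 box-filter step with edge clamping and the same `(sum + 2) >> 2` rounding, plus a way to count the levels in a full chain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the library): number of levels in a full mip
// chain for a w x h image, halving each dimension until both reach 1.
inline std::uint32_t FullMipCount(std::uint32_t w, std::uint32_t h) {
    std::uint32_t levels = 1;
    while (w > 1 || h > 1) {
        w = std::max<std::uint32_t>(1, w >> 1);
        h = std::max<std::uint32_t>(1, h >> 1);
        ++levels;
    }
    return levels;
}

// Hypothetical helper: one 2x2 box-filter step over tightly packed 8-bit
// channels. Source coordinates are clamped, so odd-sized or 1-texel-wide
// levels re-read the edge texel instead of indexing past the image.
inline std::vector<std::uint8_t> Downsample2x2(const std::vector<std::uint8_t>& src,
                                               std::uint32_t srcW, std::uint32_t srcH,
                                               std::uint32_t channels) {
    assert(srcW > 0 && srcH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH * channels);
    const std::uint32_t dstW = std::max<std::uint32_t>(1, srcW >> 1);
    const std::uint32_t dstH = std::max<std::uint32_t>(1, srcH >> 1);
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH * channels);

    // Read one channel of one source texel, widened so four values can be summed.
    auto at = [&](std::uint32_t x, std::uint32_t y, std::uint32_t ch) -> std::uint32_t {
        return src[(static_cast<std::size_t>(y) * srcW + x) * channels + ch];
    };

    for (std::uint32_t y = 0; y < dstH; ++y) {
        const std::uint32_t y0 = y * 2;
        const std::uint32_t y1 = std::min(y0 + 1, srcH - 1);
        for (std::uint32_t x = 0; x < dstW; ++x) {
            const std::uint32_t x0 = x * 2;
            const std::uint32_t x1 = std::min(x0 + 1, srcW - 1);
            for (std::uint32_t ch = 0; ch < channels; ++ch) {
                const std::uint32_t sum = at(x0, y0, ch) + at(x1, y0, ch)
                                        + at(x0, y1, ch) + at(x1, y1, ch);
                // Round to nearest instead of truncating.
                dst[(static_cast<std::size_t>(y) * dstW + x) * channels + ch] =
                    static_cast<std::uint8_t>((sum + 2u) >> 2);
            }
        }
    }
    return dst;
}
```

A caller wanting a complete chain could pass `FullMipCount(w, h)` as the mip level count; whether the bridge accepts every count up to that value is not stated in this diff.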
@ -18,10 +18,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
|
||||||
*/
|
*/
|
||||||
module;
|
module;
|
||||||
|
|
||||||
#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
|
|
||||||
#endif // !CRAFTER_GRAPHICS_WINDOW_DOM
|
|
||||||
export module Crafter.Graphics:InputField;
|
export module Crafter.Graphics:InputField;
|
||||||
#ifndef CRAFTER_GRAPHICS_WINDOW_DOM
|
|
||||||
import std;
|
import std;
|
||||||
import :Types;
|
import :Types;
|
||||||
import :Keys;
|
import :Keys;
|
||||||
|
|
@ -110,4 +107,3 @@ export namespace Crafter {
|
||||||
const InputFieldColors& colors,
|
const InputFieldColors& colors,
|
||||||
bool caretVisible);
|
bool caretVisible);
|
||||||
}
|
}
|
||||||
#endif // !CRAFTER_GRAPHICS_WINDOW_DOM
|
|
||||||
|
|
|
||||||
|
|
@ -97,6 +97,7 @@ export namespace Crafter {
|
||||||
// sentinel; never returned by Build().
|
// sentinel; never returned by Build().
|
||||||
std::uint64_t blasAddr = 0;
|
std::uint64_t blasAddr = 0;
|
||||||
std::uint32_t triangleCount = 0;
|
std::uint32_t triangleCount = 0;
|
||||||
|
std::uint32_t vertexCount = 0;
|
||||||
|
|
||||||
bool opaque = true;
|
bool opaque = true;
|
||||||
|
|
||||||
|
|
|
||||||
113
interfaces/Crafter.Graphics-PlainComputeShader.cppm
Normal file
|
|
@ -0,0 +1,113 @@
|
||||||
|
/*
|
||||||
|
Crafter®.Graphics
|
||||||
|
Copyright (C) 2026 Catcrafts®
|
||||||
|
catcrafts.net
|
||||||
|
|
||||||
|
This library is free software; you can redistribute it and/or
|
||||||
|
modify it under the terms of the GNU Lesser General Public
|
||||||
|
License version 3.0 as published by the Free Software Foundation;
|
||||||
|
*/
|
||||||
|
|
||||||
|
// Standalone compute pipeline. Dispatches at any point in the frame
|
||||||
|
// (inside or outside the UI render pass) via the JS bridge's
|
||||||
|
// wgpuDispatchCompute, which mirrors the wgpuBuildTLAS pattern of
|
||||||
|
// attaching to the active encoder when one exists or creating an
|
||||||
|
// ephemeral encoder+submit when not.
|
||||||
|
//
|
||||||
|
// This is the WebGPU counterpart to the Vulkan `:ComputeShader` partition.
|
||||||
|
// They expose the same conceptual API — Load + Dispatch — but with
|
||||||
|
// backend-specific binding plumbing. See `:GraphicsTypes` for the
|
||||||
|
// `GraphicsComputeShader` alias picking the right one per target.
|
||||||
|
//
|
||||||
|
// WGSL contract:
|
||||||
|
// @group(0) @binding(0) uniform PushData // optional; only if pushUniformSize>0
|
||||||
|
// @group(1+) @binding(N) // user bindings via UICustomBinding
|
||||||
|
// When rayQuery is on, @group(1) is reserved for the RT heap; user
|
||||||
|
// bindings start at @group(2).
|
||||||
|
|
||||||
|
module;
|
||||||
|
export module Crafter.Graphics:PlainComputeShader;
|
||||||
|
#ifdef CRAFTER_GRAPHICS_WINDOW_DOM
|
||||||
|
import std;
|
||||||
|
import :WebGPU;
|
||||||
|
import :WebGPUComputeShader; // for UICustomBinding + UICustomBindingKind
|
||||||
|
|
||||||
|
export namespace Crafter {
|
||||||
|
class PlainComputeShader {
|
||||||
|
public:
|
||||||
|
std::uint32_t pipelineHandle = 0;
|
||||||
|
std::uint32_t pushUniformSize = 0;
|
||||||
|
bool rayQueryCapable = false;
|
||||||
|
std::vector<UICustomBinding> customBindings;
|
||||||
|
|
||||||
|
PlainComputeShader() = default;
|
||||||
|
PlainComputeShader(const PlainComputeShader&) = delete;
|
||||||
|
PlainComputeShader& operator=(const PlainComputeShader&) = delete;
|
||||||
|
PlainComputeShader(PlainComputeShader&& o) noexcept
|
||||||
|
: pipelineHandle(o.pipelineHandle),
|
||||||
|
pushUniformSize(o.pushUniformSize),
|
||||||
|
rayQueryCapable(o.rayQueryCapable),
|
||||||
|
customBindings(std::move(o.customBindings)) {
|
||||||
|
o.pipelineHandle = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Compile + link a standalone compute shader.
|
||||||
|
// wgsl — source.
|
||||||
|
// pushUniformSize — byte size of the @group(0)@binding(0) uniform
|
||||||
|
// struct, or 0 if the shader doesn't declare one.
|
||||||
|
// bindings — every user-declared resource the dispatch
|
||||||
|
// should bind (groups 1+ if no rayQuery, 2+ if
|
||||||
|
// rayQuery). Order MUST match `handles` at
|
||||||
|
// Dispatch time.
|
||||||
|
// rayQuery — prepend the RT prelude + rayQuery library
|
||||||
|
// so the shader can call `rayQuery*` helpers.
|
||||||
|
void Load(std::string_view wgsl,
|
||||||
|
std::uint32_t pushUniformSize_,
|
||||||
|
std::span<const UICustomBinding> bindings = {},
|
||||||
|
bool rayQuery = false) {
|
||||||
|
pushUniformSize = pushUniformSize_;
|
||||||
|
rayQueryCapable = rayQuery;
|
||||||
|
customBindings.assign(bindings.begin(), bindings.end());
|
||||||
|
pipelineHandle = WebGPU::wgpuLoadComputePipeline(
|
||||||
|
wgsl.data(), static_cast<std::int32_t>(wgsl.size()),
|
||||||
|
static_cast<std::int32_t>(pushUniformSize),
|
||||||
|
customBindings.empty() ? nullptr : customBindings.data(),
|
||||||
|
static_cast<std::int32_t>(customBindings.size()),
|
||||||
|
rayQuery ? 1 : 0);
|
||||||
|
}
|
||||||
|
|
||||||
|
void Load(const std::filesystem::path& wgslPath,
|
||||||
|
std::uint32_t pushUniformSize_,
|
||||||
|
std::span<const UICustomBinding> bindings = {},
|
||||||
|
bool rayQuery = false) {
|
||||||
|
std::ifstream f(wgslPath, std::ios::binary);
|
||||||
|
if (!f) {
|
||||||
|
std::println(std::cerr,
|
||||||
|
"PlainComputeShader::Load: cannot open {}", wgslPath.string());
|
||||||
|
std::abort();
|
||||||
|
}
|
||||||
|
std::string wgsl((std::istreambuf_iterator<char>(f)),
|
||||||
|
std::istreambuf_iterator<char>());
|
||||||
|
Load(std::string_view{wgsl}, pushUniformSize_, bindings, rayQuery);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Bind, push, dispatch. `handles` is parallel to the
|
||||||
|
// UICustomBinding[] passed at Load — order matches.
|
||||||
|
void Dispatch(const void* push, std::uint32_t pushBytes,
|
||||||
|
std::span<const std::uint32_t> handles,
|
||||||
|
std::uint32_t gx,
|
||||||
|
std::uint32_t gy = 1,
|
||||||
|
std::uint32_t gz = 1) const {
|
||||||
|
if (pipelineHandle == 0) return;
|
||||||
|
WebGPU::wgpuDispatchCompute(
|
||||||
|
pipelineHandle,
|
||||||
|
push, static_cast<std::int32_t>(pushBytes),
|
||||||
|
handles.empty() ? nullptr : handles.data(),
|
||||||
|
static_cast<std::int32_t>(handles.size()),
|
||||||
|
static_cast<std::int32_t>(gx),
|
||||||
|
static_cast<std::int32_t>(gy),
|
||||||
|
static_cast<std::int32_t>(gz));
|
||||||
|
}
|
||||||
|
};
|
||||||
|
}
|
||||||
|
#endif // CRAFTER_GRAPHICS_WINDOW_DOM
|
||||||
|
|
@ -121,6 +121,37 @@ export namespace Crafter {
|
||||||
// customIndex (4) + _pad (12). Defined in the WGSL traversal
|
// customIndex (4) + _pad (12). Defined in the WGSL traversal
|
||||||
// library; never directly read by C++.
|
// library; never directly read by C++.
|
||||||
WebGPUBuffer<char, false> buffer;
|
WebGPUBuffer<char, false> buffer;
|
||||||
|
// GPU LBVH support — see additional/dom-webgpu.js's TLAS-build
|
||||||
|
// pipeline.
|
||||||
|
//
|
||||||
|
// entryOrder: per-frame permutation array of u32, indexing into
|
||||||
|
// `buffer` (the TLASEntry[] array). Populated by the radix-sort
|
||||||
|
// pass to spatially-coherent Morton order, then consumed by the
|
||||||
|
// BVH construction + traversal passes. In Stage 1 (this
|
||||||
|
// baseline) it's the identity permutation written by
|
||||||
|
// tlasBuildMain alongside the entries.
|
||||||
|
WebGPUBuffer<char, false> entryOrder;
|
||||||
|
// mortonCodes: per-instance 32-bit Morton codes computed from the
|
||||||
|
// world-AABB centroid, used as the radix-sort key. Written by
|
||||||
|
// tlasBuildMain.
|
||||||
|
WebGPUBuffer<char, false> mortonCodes;
|
||||||
|
// bvhNodes: 2N_PADDED - 1 sweep-tree BVH nodes built per frame
|
||||||
|
// by the LBVH-build compute pass. Each node 32 bytes (aabbMin +
|
||||||
|
// pad, aabbMax + pad). N_PADDED = 65536 (hardcoded in WGSL).
|
||||||
|
// Internal nodes [0, N_PADDED-1); leaves [N_PADDED-1, 2*N_PADDED-1).
|
||||||
|
// Node i's children are 2i+1, 2i+2 (implicit perfect binary
|
||||||
|
// tree). Cap: 65536 instances per scene.
|
||||||
|
WebGPUBuffer<char, false> bvhNodes;
|
||||||
|
// tlasBins: dead, kept allocated as a 64-byte placeholder so the
|
||||||
|
// existing wgpuBuildTLAS C++ signature doesn't have to change.
|
||||||
|
// The pre-LBVH 64-bin partition was replaced by the full BVH.
|
||||||
|
WebGPUBuffer<char, false> tlasBins;
|
||||||
|
// Sort ping-pong buffers for the radix sort. Each pass reads
|
||||||
|
// from one and writes to the other, swapping role. Layout per
|
||||||
|
// element: 1 u32 packed key = (morton16 << 16) | tlasIndex16.
|
||||||
|
// Sized for N_PADDED.
|
||||||
|
WebGPUBuffer<char, false> sortTempA;
|
||||||
|
WebGPUBuffer<char, false> sortTempB;
|
||||||
|
|
||||||
std::uint32_t builtInstanceCount = 0;
|
std::uint32_t builtInstanceCount = 0;
|
||||||
};
|
};
|
||||||
|
|
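The index arithmetic implied by the bvhNodes and sort-buffer comments above can be written down as a few constant expressions. This is a sketch only: `kPadded` mirrors the N_PADDED = 65536 value quoted in the comment, and every function name here is hypothetical (the real layout lives in the WGSL, which this diff does not show).

```cpp
#include <cstdint>

// Assumption: mirrors N_PADDED = 65536 from the bvhNodes comment.
constexpr std::uint32_t kPadded    = 65536;
constexpr std::uint32_t kNodeCount = 2 * kPadded - 1;  // internal nodes + leaves
constexpr std::uint32_t kFirstLeaf = kPadded - 1;      // leaves occupy [kFirstLeaf, kNodeCount)

// Implicit perfect binary tree: children of node i are 2i+1 and 2i+2.
constexpr std::uint32_t LeftChild(std::uint32_t i)  { return 2 * i + 1; }
constexpr std::uint32_t RightChild(std::uint32_t i) { return 2 * i + 2; }
// Precondition: i > 0 (the root has no parent).
constexpr std::uint32_t Parent(std::uint32_t i)     { return (i - 1) / 2; }
constexpr bool IsLeaf(std::uint32_t i)              { return i >= kFirstLeaf; }
// Node index of the leaf holding the entry at position `sortedSlot`.
constexpr std::uint32_t LeafNode(std::uint32_t sortedSlot) { return kFirstLeaf + sortedSlot; }

// Packed sort key: Morton code in the high 16 bits, TLAS index in the low 16,
// so sorting the u32 orders by Morton code first and by index on ties.
constexpr std::uint32_t PackKey(std::uint32_t morton16, std::uint32_t tlasIndex16) {
    return ((morton16 & 0xFFFFu) << 16) | (tlasIndex16 & 0xFFFFu);
}
constexpr std::uint32_t KeyMorton(std::uint32_t key) { return key >> 16; }
constexpr std::uint32_t KeyIndex(std::uint32_t key)  { return key & 0xFFFFu; }

// The last internal node's right child is the last node in the array.
static_assert(RightChild(kFirstLeaf - 1) == kNodeCount - 1);
```

Because the tree is implicit, no child pointers need to be stored in the 32-byte node; a bottom-up refit can walk from `LeafNode(slot)` to the root with `Parent`.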
@ -141,6 +172,17 @@ export namespace Crafter {
|
||||||
// a fresh build (no refit) — the GPU build pass is cheap at the
|
// a fresh build (no refit) — the GPU build pass is cheap at the
|
||||||
// ~10–100 instance counts the design targets; LBVH-for-TLAS is a
|
// ~10–100 instance counts the design targets; LBVH-for-TLAS is a
|
||||||
// future optimization for larger scenes.
|
// future optimization for larger scenes.
|
||||||
|
//
|
||||||
|
// BuildTLAS is now split into Upload + Build so a physics
|
||||||
|
// compute pass (e.g. physics-tlas-transform) can run between the
|
||||||
|
// CPU mirror upload and the GPU LBVH build. The compute pass
|
||||||
|
// writes the per-instance transform bytes that BuildTLAS leaves
|
||||||
|
// intact for elements flagged transformOwnedByGpu, and those
|
||||||
|
// writes have to land before the LBVH reads them. The combined
|
||||||
|
// BuildTLAS is kept as a convenience for callers that don't
|
||||||
|
// interleave a compute pass (e.g. the ctor-time first build).
|
||||||
|
static void BuildTLASUpload(WebGPUCommandEncoderRef cmd, std::uint32_t index);
|
||||||
|
static void BuildTLASBuild(WebGPUCommandEncoderRef cmd, std::uint32_t index);
|
||||||
static void BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index);
|
static void BuildTLAS(WebGPUCommandEncoderRef cmd, std::uint32_t index);
|
||||||
|
|
||||||
static void Add(RenderingElement3D* e);
|
static void Add(RenderingElement3D* e);
|
||||||
|
|
|
||||||
|
|
@ -165,6 +165,18 @@ export namespace Crafter {
|
||||||
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
||||||
void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
|
void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
|
||||||
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
||||||
|
#ifdef CRAFTER_GRAPHICS_WINDOW_DOM
|
||||||
|
// WebGPU-only overload. WebGPU bind groups can only carry one
|
||||||
|
// texture/sampler per dispatch, so all items in `bufferSlot`
|
||||||
|
// share the same texture (`imageSlot`) and sampler (`samplerSlot`).
|
||||||
|
// The per-item `slots` field in ImageItem is ignored on this
|
||||||
|
// backend. On Vulkan the bindless heap resolves per-item slots,
|
||||||
|
// so the cross-backend path is to call the 4-arg overload above
|
||||||
|
// on native and this 6-arg overload on DOM.
|
||||||
|
void DispatchImages(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
|
||||||
|
std::uint16_t imageSlot, std::uint16_t samplerSlot,
|
||||||
|
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
||||||
|
#endif
|
||||||
void DispatchText(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
|
void DispatchText(GraphicsCommandBuffer cmd, std::uint32_t bufferSlot, std::uint32_t itemCount,
|
||||||
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
std::array<float,4> clipRectPx = {0.0f, 0.0f, 1e9f, 1e9f});
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -35,6 +35,40 @@ namespace Crafter::WebGPU {
|
||||||
extern "C" std::uint32_t wgpuCreateBuffer(std::int32_t byteSize);
|
extern "C" std::uint32_t wgpuCreateBuffer(std::int32_t byteSize);
|
||||||
__attribute__((import_module("env"), import_name("wgpuWriteBuffer")))
|
__attribute__((import_module("env"), import_name("wgpuWriteBuffer")))
|
||||||
extern "C" void wgpuWriteBuffer(std::uint32_t handle, const void* srcPtr, std::int32_t byteSize);
|
extern "C" void wgpuWriteBuffer(std::uint32_t handle, const void* srcPtr, std::int32_t byteSize);
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuWriteBufferRange")))
|
||||||
|
extern "C" void wgpuWriteBufferRange(std::uint32_t handle,
|
||||||
|
std::uint32_t dstByteOffset,
|
||||||
|
const void* srcPtr,
|
||||||
|
std::int32_t byteSize);
|
||||||
|
// Kick off a GPU→CPU readback for the entire `byteSize`-byte prefix
|
||||||
|
// of the buffer at `handle`. Returns immediately; the actual map
|
||||||
|
// resolves asynchronously. Successive Enqueues without a Poll in
|
||||||
|
// between are no-ops until the previous map resolves.
|
||||||
|
//
|
||||||
|
// `resetBytes` ≥ 0 — if non-zero, the JS bridge encodes a
|
||||||
|
// clearBuffer over the first `resetBytes` bytes of the source
|
||||||
|
// buffer immediately after the copy, in the same command encoder.
|
||||||
|
// Used by Forts3D's GPU event queues to zero the atomic-add count
|
||||||
|
// for the next frame's substeps. The reset is TIED to a successful
|
||||||
|
// enqueue: if the enqueue was skipped (previous map still pending),
|
||||||
|
// the reset is skipped too — so events written by substeps during
|
||||||
|
// the missed-drain window accumulate into the next successful
|
||||||
|
// capture instead of being silently wiped.
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuReadbackEnqueue")))
|
||||||
|
extern "C" void wgpuReadbackEnqueue(std::uint32_t handle,
|
||||||
|
std::int32_t byteSize,
|
||||||
|
std::int32_t resetBytes);
|
||||||
|
// Poll a previously-enqueued readback. Returns 1 and writes the
|
||||||
|
// bytes into `dstPtr` if the map resolved; returns 0 otherwise.
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuReadbackPoll")))
|
||||||
|
extern "C" std::int32_t wgpuReadbackPoll(std::uint32_t handle, void* dstPtr, std::int32_t byteSize);
|
||||||
|
// Non-consuming readiness probe. Returns 1 if the readback has
|
||||||
|
// resolved and the next Poll would succeed; returns 0 otherwise.
|
||||||
|
// Used to gate multi-buffer drains (header + array) so neither side
|
||||||
|
// gets consumed until both are ready — otherwise the consumed side's
|
||||||
|
// data is lost while the other side waits for its map to resolve.
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuReadbackReady")))
|
||||||
|
extern "C" std::int32_t wgpuReadbackReady(std::uint32_t handle);
|
||||||
__attribute__((import_module("env"), import_name("wgpuDestroyBuffer")))
|
__attribute__((import_module("env"), import_name("wgpuDestroyBuffer")))
|
||||||
extern "C" void wgpuDestroyBuffer(std::uint32_t handle);
|
extern "C" void wgpuDestroyBuffer(std::uint32_t handle);
|
||||||
|
|
||||||
|
|
@ -64,15 +98,26 @@ namespace Crafter::WebGPU {
|
||||||
// Used by Image2DArray<RGBA8> to stack per-material albedos for one
|
// Used by Image2DArray<RGBA8> to stack per-material albedos for one
|
||||||
// multi-material scene.
|
// multi-material scene.
|
||||||
__attribute__((import_module("env"), import_name("wgpuCreateImage2DArray")))
|
__attribute__((import_module("env"), import_name("wgpuCreateImage2DArray")))
|
||||||
extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h, std::int32_t layerCount);
|
extern "C" std::uint32_t wgpuCreateImage2DArray(std::int32_t w, std::int32_t h,
|
||||||
|
std::int32_t layerCount, std::int32_t mipLevels);
|
||||||
|
// Upload a single mip level for one array layer. `level` indexes into
|
||||||
|
// the texture's mip chain (0 = base); `w` / `h` must be the dimensions
|
||||||
|
// at that level. Callers pass each level's pixels separately — mip
|
||||||
|
// generation is host-side.
|
||||||
__attribute__((import_module("env"), import_name("wgpuWriteImage2DLayer")))
|
__attribute__((import_module("env"), import_name("wgpuWriteImage2DLayer")))
|
||||||
extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer,
|
extern "C" void wgpuWriteImage2DLayer(std::uint32_t handle, std::int32_t layer, std::int32_t level,
|
||||||
const void* srcPtr, std::int32_t byteSize,
|
const void* srcPtr, std::int32_t byteSize,
|
||||||
std::int32_t w, std::int32_t h);
|
std::int32_t w, std::int32_t h);
|
||||||
|
|
||||||
__attribute__((import_module("env"), import_name("wgpuCreateLinearClampSampler")))
|
__attribute__((import_module("env"), import_name("wgpuCreateLinearClampSampler")))
|
||||||
extern "C" std::uint32_t wgpuCreateLinearClampSampler();
|
extern "C" std::uint32_t wgpuCreateLinearClampSampler();
|
||||||
|
|
||||||
|
// Linear-filtered, repeat-addressed sampler with mipmap linear-filter.
|
||||||
|
// The usual choice for tiled material textures (woodBrace, panel, etc.)
|
||||||
|
// which expect UV > 1.0 to wrap.
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuCreateLinearRepeatSampler")))
|
||||||
|
extern "C" std::uint32_t wgpuCreateLinearRepeatSampler();
|
||||||
|
|
||||||
__attribute__((import_module("env"), import_name("wgpuFrameBegin")))
|
__attribute__((import_module("env"), import_name("wgpuFrameBegin")))
|
||||||
extern "C" void wgpuFrameBegin();
|
extern "C" void wgpuFrameBegin();
|
||||||
__attribute__((import_module("env"), import_name("wgpuFrameEnd")))
|
__attribute__((import_module("env"), import_name("wgpuFrameEnd")))
|
||||||
|
|
@ -158,12 +203,56 @@ namespace Crafter::WebGPU {
|
||||||
std::int32_t gx, std::int32_t gy,
|
std::int32_t gx, std::int32_t gy,
|
||||||
const void* handlesPtr, std::int32_t handlesCount);
|
const void* handlesPtr, std::int32_t handlesCount);
|
||||||
|
|
||||||
// GPU TLAS-build dispatch. Reads the instance buffer (host-uploaded or
|
// GPU TLAS-build dispatch. Two sequential compute passes:
|
||||||
// GPU-written), produces per-instance world-space AABBs + per-instance
|
// 1. tlasBuildMain — per-instance world AABB + identity permutation
|
||||||
// transform matrices in a flat tlasBuf SSBO consumed by traceRay / rayQuery.
|
// + naive Morton (overwritten in pass 2). Outputs the flat
|
||||||
|
// tlasBuf SSBO consumed by traceRay / rayQuery.
|
||||||
|
// 2. lbvhBuildMain — single workgroup of 1024 threads; reduces
|
||||||
|
// scene AABB, recomputes Morton with proper normalization,
|
||||||
|
// radix-sorts packed (morton, instance_id) keys, writes the sorted
|
||||||
|
// permutation into `entryOrderBufHandle`, and refits a
|
||||||
|
// sweep-tree BVH into `bvhNodesBufHandle` bottom-up.
|
||||||
|
// Pre-LBVH bin-build is gone; `binsBufHandle` is kept in the
|
||||||
|
// signature as a placeholder so the C++ side doesn't churn.
|
||||||
__attribute__((import_module("env"), import_name("wgpuBuildTLAS")))
|
__attribute__((import_module("env"), import_name("wgpuBuildTLAS")))
|
||||||
extern "C" void wgpuBuildTLAS(std::uint32_t instanceBufHandle,
|
extern "C" void wgpuBuildTLAS(std::uint32_t instanceBufHandle,
|
||||||
std::int32_t instanceCount,
|
std::int32_t instanceCount,
|
||||||
std::uint32_t tlasOutBufHandle);
|
std::uint32_t tlasOutBufHandle,
|
||||||
|
std::uint32_t entryOrderBufHandle,
|
||||||
|
std::uint32_t mortonBufHandle,
|
||||||
|
std::uint32_t binsBufHandle,
|
||||||
|
std::uint32_t bvhNodesBufHandle,
|
||||||
|
std::uint32_t sortTempABufHandle,
|
||||||
|
std::uint32_t sortTempBBufHandle);
|
||||||
|
|
||||||
|
// ── Standalone compute pipelines ───────────────────────────────────
|
||||||
|
//
|
||||||
|
// Mirror of the native ComputeShader API: load a user-authored
|
||||||
|
// compute WGSL with arbitrary @group bindings, dispatch it at any
|
||||||
|
// point in the frame (inside or outside the UI compute pass —
|
||||||
|
// physics ticks dispatch from update lambdas, which fire outside
|
||||||
|
// the per-frame render encoder).
|
||||||
|
//
|
||||||
|
// WGSL contract:
|
||||||
|
// @group(0) @binding(0) — uniform PushData (optional; only if
|
||||||
|
// pushUniformSize > 0 at load).
|
||||||
|
// @group(1+) @binding(N) — user bindings declared via
|
||||||
|
// UICustomBinding[]. When rayQuery is
|
||||||
|
// on, @group(1) is reserved for the RT
|
||||||
|
// heap and user bindings start at
|
||||||
|
// @group(2).
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuLoadComputePipeline")))
|
||||||
|
extern "C" std::uint32_t wgpuLoadComputePipeline(
|
||||||
|
const void* wgslPtr, std::int32_t wgslLen,
|
||||||
|
std::int32_t pushUniformSize,
|
||||||
|
const void* bindingsPtr, std::int32_t bindingsCount,
|
||||||
|
std::int32_t rayQueryFlag);
|
||||||
|
|
||||||
|
__attribute__((import_module("env"), import_name("wgpuDispatchCompute")))
|
||||||
|
extern "C" void wgpuDispatchCompute(
|
||||||
|
std::uint32_t pipelineHandle,
|
||||||
|
const void* pushPtr, std::int32_t pushBytes,
|
||||||
|
const void* handlesPtr, std::int32_t handlesCount,
|
||||||
|
std::int32_t gx, std::int32_t gy, std::int32_t gz);
|
||||||
}
|
}
|
||||||
#endif // CRAFTER_GRAPHICS_WINDOW_DOM
|
#endif // CRAFTER_GRAPHICS_WINDOW_DOM
|
||||||
|
|
|
||||||
|
|
@ -78,6 +78,60 @@ export namespace Crafter {
|
||||||
void FlushDevice() requires(Mapped) {
|
void FlushDevice() requires(Mapped) {
|
||||||
WebGPU::wgpuWriteBuffer(handle, this->value, static_cast<std::int32_t>(size));
|
WebGPU::wgpuWriteBuffer(handle, this->value, static_cast<std::int32_t>(size));
|
||||||
}
|
}
|
||||||
|
// Partial upload — write the bytes [srcByteOffset, srcByteOffset+byteCount)
|
||||||
|
// of the host mirror to GPU offset `dstByteOffset`. BuildTLAS uses
|
||||||
|
// this to leave the GPU-owned transform field of an RTInstance
|
||||||
|
// intact (the physics-tlas-transform compute shader is its sole
|
||||||
|
// writer) while still pushing the CPU-side metadata fields.
|
||||||
|
void FlushDeviceRange(std::uint32_t dstByteOffset,
|
||||||
|
std::uint32_t srcByteOffset,
|
||||||
|
std::uint32_t byteCount) requires(Mapped) {
|
||||||
|
const auto* base = reinterpret_cast<const char*>(this->value);
|
||||||
|
WebGPU::wgpuWriteBufferRange(handle, dstByteOffset,
|
||||||
|
base + srcByteOffset,
|
||||||
|
static_cast<std::int32_t>(byteCount));
|
||||||
|
}
|
||||||
|
|
||||||
|
// Push one element's worth of bytes from the host mirror to GPU.
|
||||||
|
// Use when a single SoA slot was mutated (body construction,
|
||||||
|
// per-instance flag flip) and a full FlushDevice would clobber
|
||||||
|
// the GPU-side updates the sim has applied to neighboring slots.
|
||||||
|
void FlushDeviceSlot(std::uint32_t idx) requires(Mapped) {
|
||||||
|
constexpr std::uint32_t kStride = sizeof(T);
|
||||||
|
const std::uint32_t off = idx * kStride;
|
||||||
|
FlushDeviceRange(off, off, kStride);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Schedule a GPU→CPU readback of this buffer's entire contents.
|
||||||
|
// Asynchronous; data isn't ready until a later PollReadback
|
||||||
|
// returns true. Successive Enqueues without a Poll are dropped
|
||||||
|
// — they're a no-op while the previous map is in flight.
|
||||||
|
//
|
||||||
|
// `resetBytes` ≥ 0 — if non-zero, the first `resetBytes` bytes
|
||||||
|
// of THIS buffer are clearBuffer-cleared on the GPU command
|
||||||
|
// encoder immediately after the copy, so the readback captures
|
||||||
|
// the pre-clear bytes and the next frame's writers see zeros.
|
||||||
|
// The reset is tied to a successful enqueue (skipped enqueue =
|
||||||
|
// skipped reset), preserving accumulated state across missed
|
||||||
|
// drains.
|
||||||
|
void EnqueueReadback(std::uint32_t resetBytes = 0) {
|
||||||
|
WebGPU::wgpuReadbackEnqueue(handle,
|
||||||
|
static_cast<std::int32_t>(size),
|
||||||
|
static_cast<std::int32_t>(resetBytes));
|
||||||
|
}
|
||||||
|
// Try to copy the readback bytes into this->value. Returns true
|
||||||
|
// if the previous EnqueueReadback resolved and the data is now
|
||||||
|
// mirrored into .value; false if the map is still pending.
|
||||||
|
bool PollReadback() requires(Mapped) {
|
||||||
|
return WebGPU::wgpuReadbackPoll(handle, this->value,
|
||||||
|
static_cast<std::int32_t>(size)) != 0;
|
||||||
|
}
|
||||||
|
// Non-consuming readiness probe. Returns true if a subsequent
|
||||||
|
// PollReadback would succeed; the probe itself changes no state.
|
||||||
|
// Use to verify a sibling buffer is also ready before consuming.
|
||||||
|
bool IsReadbackReady() const {
|
||||||
|
return WebGPU::wgpuReadbackReady(handle) != 0;
|
||||||
|
}
|
||||||
|
|
||||||
~WebGPUBuffer() { Clear(); }
|
~WebGPUBuffer() { Clear(); }
|
||||||
};
|
};
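The comment on `FlushDeviceSlot` says a one-slot upload avoids clobbering GPU-side updates to neighboring slots. A minimal sketch of that idea follows, assuming a hypothetical `MirroredBufferSketch` that simulates GPU memory with a byte vector. The `(srcOff, dstOff, byteCount)` argument order is only inferred from the `FlushDeviceRange(off, off, kStride)` call in the diff, and `GpuWrite` / `DeviceSlot` are illustration-only helpers, not part of the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a host-mirrored GPU buffer. `device` simulates
// GPU memory with plain bytes; the real class uploads through WebGPU.
template <typename T>
struct MirroredBufferSketch {
    static_assert(std::is_trivially_copyable_v<T>, "slots are copied bytewise");

    std::vector<T> host;                // CPU-side mirror
    std::vector<unsigned char> device;  // simulated GPU memory

    explicit MirroredBufferSketch(std::size_t count)
        : host(count), device(count * sizeof(T)) {}

    // Assumed argument order: host byte offset, device byte offset, byte count.
    void FlushRange(std::uint32_t srcOff, std::uint32_t dstOff, std::uint32_t byteCount) {
        const auto* src = reinterpret_cast<const unsigned char*>(host.data());
        std::memcpy(device.data() + dstOff, src + srcOff, byteCount);
    }

    // Full upload: overwrites every slot, including ones the GPU has updated.
    void FlushAll() { FlushRange(0, 0, static_cast<std::uint32_t>(device.size())); }

    // One-slot upload: touches only the sizeof(T) bytes at idx * sizeof(T).
    void FlushSlot(std::uint32_t idx) {
        constexpr std::uint32_t kStride = sizeof(T);
        const std::uint32_t off = idx * kStride;
        FlushRange(off, off, kStride);
    }

    // Illustration-only: pretend a shader wrote slot `idx` on the GPU.
    void GpuWrite(std::uint32_t idx, const T& v) {
        std::memcpy(device.data() + idx * sizeof(T), &v, sizeof(T));
    }

    // Illustration-only: read slot `idx` from simulated device memory.
    T DeviceSlot(std::uint32_t idx) const {
        T out;
        std::memcpy(&out, device.data() + idx * sizeof(T), sizeof(T));
        return out;
    }
};
```

In this sketch, a GPU-side write to slot 1 survives `FlushSlot(2)` but is overwritten by `FlushAll()`, because the host mirror still holds the stale value for slot 1.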
|
||||||
|
|
|
||||||
|
|
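The readback comments above describe a small protocol: an enqueue is dropped while a previous map is in flight, the optional reset happens only on a successful enqueue, and `IsReadbackReady` probes without consuming. The sketch below mocks those rules on the CPU under stated assumptions. `ReadbackSketch`, `ResolveMap`, and the `bool` return from `EnqueueReadback` are hypothetical additions for illustration; the real method returns `void` and the map resolves asynchronously in the browser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-only mock of the enqueue/poll readback protocol.
struct ReadbackSketch {
    std::vector<std::uint8_t> gpu;      // simulated GPU buffer contents
    std::vector<std::uint8_t> staging;  // snapshot captured at enqueue time
    std::vector<std::uint8_t> value;    // host mirror filled by PollReadback
    bool inFlight = false;              // a readback was enqueued and not yet consumed
    bool resolved = false;              // the simulated map has completed

    explicit ReadbackSketch(std::size_t size)
        : gpu(size), staging(size), value(size) {}

    // Returns true if enqueued, false if dropped (return value is for the sketch only).
    bool EnqueueReadback(std::uint32_t resetBytes = 0) {
        if (inFlight) return false;  // dropped: no copy, and no reset either
        staging = gpu;               // the copy captures the pre-clear bytes
        const std::size_t n = std::min<std::size_t>(resetBytes, gpu.size());
        std::fill_n(gpu.begin(), n, std::uint8_t{0});  // later writers see zeros
        inFlight = true;
        resolved = false;
        return true;
    }

    // Illustration-only: stands in for the asynchronous map completing.
    void ResolveMap() {
        if (inFlight) resolved = true;
    }

    // Non-consuming probe: reports readiness without changing any state.
    bool IsReadbackReady() const { return inFlight && resolved; }

    // Consuming poll: mirrors the snapshot into `value` once the map resolved.
    bool PollReadback() {
        if (!IsReadbackReady()) return false;
        value = staging;
        inFlight = false;
        resolved = false;
        return true;
    }
};
```

A caller that drains two sibling buffers together would probably check `IsReadbackReady()` on both before calling `PollReadback()` on either, so that neither is consumed while the other is still pending.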
@ -36,6 +36,11 @@ export namespace Crafter {
|
||||||
SampledTexture = 1, // sampled texture_2d<f32>, handle is a slot into heap.imageTable
|
SampledTexture = 1, // sampled texture_2d<f32>, handle is a slot into heap.imageTable
|
||||||
Sampler = 2, // filtering sampler, handle is a slot into heap.samplerTable
|
Sampler = 2, // filtering sampler, handle is a slot into heap.samplerTable
|
||||||
SampledTextureArray = 3, // sampled texture_2d_array<f32>, handle is a slot into heap.imageTable
|
SampledTextureArray = 3, // sampled texture_2d_array<f32>, handle is a slot into heap.imageTable
|
||||||
|
// read-write storage SSBO (var<storage, read_write> in WGSL). Use
|
||||||
|
// for buffers shaders need to MUTATE — e.g. physics shaders that
|
||||||
|
// integrate node momentum, write brace stress, or output TLAS
|
||||||
|
// instance transforms.
|
||||||
|
BufferReadWrite = 4,
|
||||||
};
|
};
|
||||||
|
|
||||||
struct UICustomBinding {
|
struct UICustomBinding {
|
||||||
|
|
|
||||||
|
|
@ -71,5 +71,6 @@ export import :WebGPU;
|
||||||
export import :WebGPUBuffer;
|
export import :WebGPUBuffer;
|
||||||
export import :DescriptorHeapWebGPU;
|
export import :DescriptorHeapWebGPU;
|
||||||
export import :WebGPUComputeShader;
|
export import :WebGPUComputeShader;
|
||||||
|
export import :PlainComputeShader;
|
||||||
export import :ShaderBindingTableWebGPU;
|
export import :ShaderBindingTableWebGPU;
|
||||||
export import :PipelineRTWebGPU;
|
export import :PipelineRTWebGPU;
|
||||||
|
|
|
||||||
|
|
@ -123,7 +123,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
|
||||||
// when its body is gated out. Vulkan-typed partitions stub to empty
|
// when its body is gated out. Vulkan-typed partitions stub to empty
|
||||||
// modules under CRAFTER_GRAPHICS_WINDOW_DOM; the Dom/DomEvents/Router
|
// modules under CRAFTER_GRAPHICS_WINDOW_DOM; the Dom/DomEvents/Router
|
||||||
// partitions stub to empty modules in the opposite direction.
|
// partitions stub to empty modules in the opposite direction.
|
||||||
std::array<fs::path, 41> ifaces = {
|
std::array<fs::path, 42> ifaces = {
|
||||||
"interfaces/Crafter.Graphics",
|
"interfaces/Crafter.Graphics",
|
||||||
"interfaces/Crafter.Graphics-Animation",
|
"interfaces/Crafter.Graphics-Animation",
|
||||||
"interfaces/Crafter.Graphics-Clipboard",
|
"interfaces/Crafter.Graphics-Clipboard",
|
||||||
|
|
@ -147,6 +147,7 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
|
||||||
"interfaces/Crafter.Graphics-Mesh",
|
"interfaces/Crafter.Graphics-Mesh",
|
||||||
"interfaces/Crafter.Graphics-PipelineRTVulkan",
|
"interfaces/Crafter.Graphics-PipelineRTVulkan",
|
||||||
"interfaces/Crafter.Graphics-PipelineRTWebGPU",
|
"interfaces/Crafter.Graphics-PipelineRTWebGPU",
|
||||||
|
"interfaces/Crafter.Graphics-PlainComputeShader",
|
||||||
"interfaces/Crafter.Graphics-RenderingElement3D",
|
"interfaces/Crafter.Graphics-RenderingElement3D",
|
||||||
"interfaces/Crafter.Graphics-RenderPass",
|
"interfaces/Crafter.Graphics-RenderPass",
|
||||||
"interfaces/Crafter.Graphics-Router",
|
"interfaces/Crafter.Graphics-Router",
|
||||||
|
|
@ -170,14 +171,16 @@ extern "C" Configuration CrafterBuildProject(std::span<const std::string_view> a
|
||||||
if (dom) {
|
if (dom) {
|
||||||
// DOM impl set. UI-Shared.cpp is backend-agnostic; UI-WebGPU.cpp
|
// DOM impl set. UI-Shared.cpp is backend-agnostic; UI-WebGPU.cpp
|
||||||
// is the DOM-only implementation of UIRenderer's GPU-touching
|
// is the DOM-only implementation of UIRenderer's GPU-touching
|
||||||
// methods. Font / FontAtlas / UIComponents are now portable.
|
// methods. Font / FontAtlas / UIComponents / InputField are now
|
||||||
std::array<fs::path, 16> domImpls = {
|
// portable.
|
||||||
|
std::array<fs::path, 17> domImpls = {
|
||||||
"implementations/Crafter.Graphics-Clipboard",
|
"implementations/Crafter.Graphics-Clipboard",
|
||||||
"implementations/Crafter.Graphics-Dom",
|
"implementations/Crafter.Graphics-Dom",
|
||||||
"implementations/Crafter.Graphics-Font",
|
"implementations/Crafter.Graphics-Font",
|
||||||
"implementations/Crafter.Graphics-FontAtlas",
|
"implementations/Crafter.Graphics-FontAtlas",
|
||||||
"implementations/Crafter.Graphics-Gamepad",
|
"implementations/Crafter.Graphics-Gamepad",
|
||||||
"implementations/Crafter.Graphics-Input",
|
"implementations/Crafter.Graphics-Input",
|
||||||
|
"implementations/Crafter.Graphics-InputField",
|
||||||
"implementations/Crafter.Graphics-Mesh-WebGPU",
|
"implementations/Crafter.Graphics-Mesh-WebGPU",
|
||||||
"implementations/Crafter.Graphics-PipelineRTWebGPU",
|
"implementations/Crafter.Graphics-PipelineRTWebGPU",
|
||||||
"implementations/Crafter.Graphics-RenderingElement3D-WebGPU",
|
"implementations/Crafter.Graphics-RenderingElement3D-WebGPU",
|
||||||
|
|
|
||||||