Catcrafts/Crafter.Graphics

Author	SHA1	Message	Date
Jorijn van der Graaf	42a479572d	SPDX license update	2026-07-22 18:09:06 +02:00
Jorijn van der Graaf	a879c834c7	webgpu embedding	2026-07-19 01:38:25 +02:00
catbot	47bd4da0e3	Merge pull request 'test(bench): measure net perf gain from #40 to master in Sponza (#155 )' (#156 ) from claude/issue-155 into master	2026-06-18 21:36:37 +02:00
catbot	619e39369d	test(bench): SponzaBench harness + #40→HEAD perf measurement (#155 ) Headless benchmark around the native Sponza RT scene: times setup and a measured Render() loop over the full multi-mesh atrium, prints BENCH metrics, and exits. Includes run-bench.sh and a README documenting the methodology and the measured net gain from #40 to current master. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:35:54 +00:00
catbot	3e116e6e43	feat(window): env-gated uncapped present mode for benchmarking CRAFTER_PRESENT_IMMEDIATE selects IMMEDIATE (then MAILBOX) when the surface offers it, instead of the default FIFO. Needed to measure steady-state frame throughput without the compositor's vblank cap; FIFO remains the default when the variable is unset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:35:54 +00:00
catbot	fc71eb36b9	Merge pull request 'fix(window): silence per-frame and setup-path Vulkan validation errors (#153 )' (#154 ) from claude/issue-153 into master	2026-06-18 20:05:06 +02:00
catbot	7316e51dca	fix(window): silence per-frame and setup-path Vulkan validation errors (#153 ) Two distinct validation errors the native frame loop emitted, both originating in Crafter.Graphics with no consumer-side influence. Problem 1 — per-frame acquire-barrier access/stage mismatch. The acquire->GENERAL barrier hardcoded dstAccessMask = SHADER_WRITE\|TRANSFER_WRITE but used the per-pass stage union as its dst stage mask. For an all-compute frame the union narrows to COMPUTE_SHADER, which does not support TRANSFER_WRITE, so VUID-02820 fired every frame. Derive the access mask from the same stage union via a new SwapchainWriterAccess() helper (mirroring SwapchainStageUnion), and apply it to both the acquire dst and present src masks for symmetry. Problem 2 — mid-session StartInit/FinishInit (and GetCmd/EndCmd) reuse the shared drawCmdBuffers[currentBuffer]. With no steady-state wait-idle the loop's last submission of that buffer may still be in flight when scene setup runs (building map meshes / acceleration structures), so the old code re-began (VUID-00049) and re-submitted (VUID-00071) a pending buffer, and resources freed in the StartInit..FinishInit bracket could still be referenced by it. Drain the queue at the start of StartInit/GetCmd before re-recording; setup is rare, so a wait-idle is fine (FinishInit/EndCmd already wait-idle at the end). Tests: extend SwapchainBarrierScope with SwapchainWriterAccess coverage (pure CPU), and add SetupCmdBufferReuse — a real-frame-loop regression test driving a compute pass plus interleaved mid-session StartInit rounds, asserting the validation layer stays silent. Verified both halves fail (reproducing the exact VUIDs) when their respective fix is reverted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 18:04:25 +00:00
catbot	7720f0a9bc	Merge pull request 'perf(window): block message loop while paused on Win32 (#134 )' (#152 ) from claude/issue-134 into master	2026-06-18 16:01:28 +02:00
catbot	23e2416ef0	Merge pull request 'perf(webgpu): byteCount-bounded readback for over-provisioned buffers (#133 )' (#151 ) from claude/issue-133 into master	2026-06-18 16:01:18 +02:00
catbot	7df82e4fee	perf(window): block message loop while paused on Win32 (#134 ) When `updating` is false (paused / minimized / stopped), the FIFO present no longer paces StartSync()'s message loop, so the bare PeekMessage(PM_REMOVE) spin pinned a core at 100% while uselessly re-running Gamepad::Tick() (mutex + per-pad COM reads) and onBeforeUpdate every iteration. Block on MsgWaitForMultipleObjectsEx with a short timeout while !updating: the loop now wakes on input or every ~16ms, eliminating the busy-spin while preserving gamepad polling / onBeforeUpdate at a sane rate (e.g. to detect a controller button asking to resume). The updating path is unchanged — present remains vsync-gated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 14:00:58 +00:00
catbot	931500ddc3	perf(webgpu): byteCount-bounded readback for over-provisioned buffers (#133 ) WebGPUBuffer::EnqueueReadback / PollReadback always copied the full buffer `size` GPU→staging→wasm. Over-provisioned event-queue buffers (e.g. Forts3D's GPU physics event drain) paid the full-capacity copy every drain even when only a small live prefix held data. Add an optional `byteCount` parameter (default 0 = whole buffer, so the existing full-buffer callers are unchanged) bounding the readback to the live prefix. Pass the same byteCount to the paired PollReadback so the matching number of bytes lands in `.value`. JS bridge: the staging buffer is now sized to the full device-buffer capacity (not the first call's byteSize) so a varying prefix length never overflows it, while the copyBufferToBuffer + mapAsync/getMappedRange are bounded to the requested prefix — that's where the copy saving lands. Verified end-to-end in the browser via RayQueryPick: the full readback still resolves correctly, and a follow-up 8-byte prefix read copies only hit+customIndex while primitiveIndex keeps its poison sentinel, proving the copy was bounded. Native `crafter-build test` suite: 24 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 14:00:36 +00:00
catbot	cd3a55a914	Merge pull request 'perf(rt): batch the WebGPU TLAS instance upload into one writeBuffer (#131 )' (#150 ) from claude/issue-131 into master	2026-06-18 16:00:10 +02:00
catbot	10e07575fb	perf(rt): batch the WebGPU TLAS instance upload into one writeBuffer (#131 ) BuildTLASUpload pushed GPU-driven (transformOwnedByGpu) runs one element at a time — a separate FlushDeviceRange of the 16 strided metadata bytes per element — each paying WebGPU validation / encode / JS-boundary cost, while the CPU-driven arm already batched contiguously. Upload the whole active instance range in a single writeBuffer instead. Pushing the (stale) transform bytes for GPU-driven slots is harmless: the only supported way to drive a transform from the GPU is the manual Upload -> physics compute pass -> Build sequence, and that compute pass runs after this upload and rewrites the transform on the GPU before the TLAS build reads it. For the in-repo (all-CPU) usage the bytes uploaded are identical to before — with one CPU-driven run the old loop already emitted exactly this single FlushDeviceRange(0, 0, primitiveCount*64). Verified: native suite (24 passed) and the RTStress wasm example (512 ray-traced instances) render correctly through the new path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:59:26 +00:00
catbot	77bc6f7aec	Merge pull request 'perf(webgpu): range-flush only live TLAS metadata slots (#130 )' (#149 ) from claude/issue-130 into master	2026-06-18 15:56:46 +02:00
catbot	db35e78eaf	perf(webgpu): range-flush only live TLAS metadata slots (#130 ) The metadata mirror is padded to kNPadded (65536) but only primitiveCount slots are live. metadataBuffer.FlushDevice() wgpuWriteBuffer'd all 256 KB through the WASM->JS staging path every frame (~100-250x waste for a few-hundred-instance scene). Switch to the existing FlushDeviceRange overload — the same one the instanceBuffer loop directly above uses — sized to primitiveCount * sizeof(uint32_t). The Vulkan parallel already sizes its flush to primitiveCount, so this was WebGPU-specific. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:56:21 +00:00
catbot	4c659319cb	Merge pull request 'perf(ui): reuse handles vector in WebGPU custom Dispatch (#132 )' (#148 ) from claude/issue-132 into master	2026-06-18 15:54:36 +02:00
catbot	cb650f965d	perf(ui): reuse handles vector in WebGPU custom Dispatch (#132 ) UIRenderer::Dispatch built a fresh std::vector<uint32_t> with reserve on every call, paying one malloc+free per dispatch. Custom compute dispatches run per-frame, so make the vector thread_local and clear() it each call: capacity grows to the high-water mark once and is reused thereafter, eliminating the per-frame allocation churn. Behavior is identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:54:09 +00:00
catbot	a4fdbf7d30	Merge pull request 'perf(ui): per-shelf dirty spans for tight font-atlas uploads (#129 )' (#147 ) from claude/issue-129 into master	2026-06-18 15:36:33 +02:00
catbot	16d291b0ad	perf(ui): per-shelf dirty spans for tight atlas uploads (#129 ) FontAtlas::Update used a single union bounding box for the dirty rect. Scattered new glyphs landing on different shelves produced a tall union box that re-uploaded mostly-unchanged texels between the shelves. Track a dirty span per shelf instead and issue one tight UpdateRegion / wgpuWriteAtlasRegion copy per armed span. A shelf packs glyphs left-to-right at a fixed top, so each span stays a contiguous X run capped by the shelf height. The whole-atlas zero-clear in Initialize keeps using the existing dirtyRect span. `dirty` is now the OR of every span and is cleared together with them in Update. Verified: full test suite green (23 passed); HelloUI text renders crisply on the WebGPU backend through the new multi-span upload path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:35:54 +00:00
catbot	9cf972016c	Merge pull request 'perf(ui): memoize the input-field caret prefix width (#128 )' (#146 ) from claude/issue-128 into master	2026-06-18 15:30:32 +02:00
catbot	80a69025eb	Merge pull request 'perf(ui): LRU-evict the shaped-run cache instead of clearing it (#123 )' (#143 ) from claude/issue-123 into master	2026-06-18 15:30:08 +02:00
catbot	a45e793da4	perf(ui): memoize the input-field caret prefix width (#128 ) DrawInputField re-measured the cursor prefix via Font::GetLineWidth on every frame of a focused field, even though only the blink (caretVisible) changes frame-to-frame — value, cursorPos and fontSize are stable across the vast majority of frames. Cache the measured prefix WIDTH on the InputField, keyed on (prefix bytes, fontSize). A cheap byte compare of the cached prefix guards the expensive per-glyph UTF-8 decode + advance accumulation. The cache is mutable so the const-ref draw fn can refresh it. The absolute caretX is deliberately NOT cached: it adds the layout- dependent textX (rect.x + paddingX) each frame, so caching it would give a relocated field a stale caret. Caching the width keeps that correct. Adds tests/InputFieldCaretCache pinning: the cached caret matches the uncached GetLineWidth oracle across states, a redraw (cache hit) is identical to the miss, the cache invalidates on value/cursorPos/fontSize changes, the blink never moves the caret, and a relocated field shifts the caret by exactly the rect delta. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:29:55 +00:00
catbot	88541bf44e	Merge remote-tracking branch 'origin/master' into claude/issue-123 # Conflicts: # implementations/Crafter.Graphics-UI-Shared.cpp # interfaces/Crafter.Graphics-UI.cppm	2026-06-18 13:29:49 +00:00
catbot	9046562a90	Merge pull request 'perf(ui): subgroup prefix-sum stream compaction in the UI compute shaders (#124 )' (#145 ) from claude/issue-124 into master	2026-06-18 15:27:30 +02:00
catbot	e1222c0ab7	Merge pull request 'perf(ui): skip the destination read-modify-write on empty tiles (#127 )' (#144 ) from claude/issue-127 into master	2026-06-18 15:27:08 +02:00
catbot	da9f2a89e8	perf(ui): subgroup prefix-sum stream compaction, not lane-0 serial scan (#124 ) uiCompactChunk ran the survivor scan entirely on gl_LocalInvocationIndex == 0: ~64 serial iterations with 63/64 lanes idle, between two barriers, 4x per chunk in the fused kernel. It must emit a stable in-order (buffer-order) exclusive prefix of the survivors so the per-pixel inner loop still sees items in draw order. Replace it with a per-subgroup subgroupExclusiveAdd of the keep bits plus a carry across subgroups: each subgroup publishes its survivor total to shared memory, then every lane sums the totals of all lower-id subgroups for its base. A survivor's slot in s_order[] therefore equals the number of survivors with a smaller local index — buffer (draw) order preserved exactly, unlike an atomicAdd which would scramble it. The carry makes the result correct for any subgroup width (2 subgroups on the 32-wide descriptor_heap target, up to 8/16 on narrower parts), relying only on the gl_LocalInvocationIndex <-> (gl_SubgroupID, gl_SubgroupInvocationID) linear mapping every Vulkan compute implementation provides. The feature is free at the device baseline (apiVersion 1.4 + VK_EXT_descriptor_heap implies Vulkan 1.1 subgroup arithmetic), so no correctness fallback is needed; WebGPU is unaffected (separate embedded WGSL). Applied to ui-fused.comp.glsl and the same pattern inlined in ui-quads/circles/images/text. Verified: all 23 tests pass (UIFusedShader recompiles + spirv-val + pins push-constant offsets); a 140k-trial CPU simulation of the algorithm matches the serial reference for subgroup sizes 1..64; HelloUI renders correctly on an RTX 4090 (DispatchFused quads+circles+text) with draw order intact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:26:37 +00:00
catbot	225556532a	perf(ui): skip the destination read-modify-write on empty tiles (#127 ) The standalone per-category UI shaders (quads/circles/images/text) unconditionally imageLoad the destination pixel up front and imageStore it at the end, paying a full read-modify-write even on tiles no item touches — the common case for a sparse UI. The fused uber-kernel already amortizes a single load/store across all four categories; the standalone Dispatch* path has no such umbrella. Defer the imageLoad to just before the first surviving blend (`loaded` flag) and skip the imageStore unless something was actually blended. Output is bit-identical: before the first blend `dst` is unused, and a skipped store leaves the pixel exactly as a no-op load+store would have rewritten it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:26:32 +00:00
catbot	1d7d9f7b8e	perf(ui): LRU-evict the shaped-run cache instead of clearing it (#123 ) At kMaxShapedRuns (8192) UIRenderer::ShapeText did a full shapedRuns_.clear(). A stream of unique strings (FPS counters, timers) alongside many stable labels periodically nuked every stable entry, forcing a full-UI reshape the next frame. Track recency with an auxiliary std::list<const ShapedRunKey*> (front = most recently used) holding pointers to the keys owned by the node-based map. A hit splices its entry to the front; an overflow pops the least-recently-used entry from the back and erases just that one — both O(1). The hot set of labels, reshaped every frame, stays at the front and survives, so the churn of unique strings only recycles the cold tail. Output is byte-identical either way. InvalidateFont now drops matching entries from both the map and the LRU list. Adds ShapedRunCacheSize()/IsShapedRunCached() introspection hooks (the policy isn't observable through output or atlas.dirty) and a ShapeTextCache test that asserts the cache plateaus at a fixed cap (evict-one, not clear-all), the hot label survives a churn past the cap, and an untouched cold string is evicted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:23:34 +00:00
catbot	2c07abee00	Merge pull request 'perf(ui): hoist loop-invariant per-item/per-glyph math out of the per-pixel loop (#125 )' (#142 ) from claude/issue-125 into master	2026-06-18 15:22:45 +02:00
catbot	ac9180076c	perf(ui): hoist loop-invariant per-item math out of the per-pixel loop (#125 ) The cooperative-load section already streams each item into shared memory once per workgroup, but the per-pixel inner loop then recomputed values that are constant for the whole item/glyph. The compiler can't hoist them itself — they read shared memory at a varying index. Precompute them once per item at load time into new shared slots: - ui-images / ui-text / ui-fused (images+text phases): invRectSize = 1.0/rect.zw, turning the per-pixel `(sp-rect.xy)/rect.zw` vec2 divide into a multiply. - ui-text / ui-fused (text phase): the SDF AA `band` (a vec2 divide + two maxes depending only on the glyph's uv span and rect size). In ui-fused the new slots (s_inv, s_band) are reused across phases like the existing s_v* scratch, so the LDS bump is per-category, not summed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:22:07 +00:00
catbot	1b57f84348	Merge pull request 'perf(ui): transparent shaped-run cache lookup, no string copy on hit (#122 )' (#141 ) from claude/issue-122 into master	2026-06-18 15:19:43 +02:00
catbot	c8495f548b	perf(ui): transparent shaped-run cache lookup, no string copy on hit (#122 ) ShapeText built ShapedRunKey{..., std::string(utf8)} before find() unconditionally — even on cache hits, on the per-frame onBuild path. N labels meant N string copies + hashes (+ frees for non-SSO strings) every frame, partially defeating the run cache. Add is_transparent hash + equality functors on shapedRuns_ and probe with a borrowing ShapedRunViewKey {const Font*, float, array<float,4>, string_view}. The owning std::string is now materialised only on a miss, for emplace. Cache hits copy and hash no string. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 13:19:11 +00:00
catbot	2783e47674	Merge pull request 'perf(tlas): dirty-track the per-frame TLAS instance+metadata upload (#118 )' (#140 ) from claude/issue-118 into master	2026-06-18 15:07:44 +02:00
catbot	b5b8c04237	perf(tlas): dirty-track the per-frame TLAS instance+metadata upload (#118 ) BuildTLAS rebuilt the host instance+metadata buffers with an O(n) copy of every 64 B VkAccelerationStructureInstanceKHR + metadata entry every frame, unconditionally, then flushed the whole high-water capacity (VK_WHOLE_SIZE) on both buffers. At the millions-of-instances target that copy dominates the CPU frame, and the whole-buffer flush costs on non-coherent BAR/VRAM. Add a generation counter so only changed host-authored fields are copied, and feed the same dirty span into the ranged FlushDevice(offset, bytes) overload: - RenderingElement3D::hostDataVersion + MarkHostDataDirty() (bumps a global monotonic counter). TlasWithBuffer::uploadedVersion records, per frame, the version last copied into each slot. A slot is copied only when its element advanced past the recorded version; version 0 ("untracked") reads dirty every frame, so callers that don't opt in keep the prior copy-every-frame behaviour. Globally-unique versions make this correct under relocation (Remove's swap-and-pop, and remove+add that nets the same count on the refit path) without tracking element identity. The reset on every topology change covers buffer reallocation and the reshuffled element->slot mapping. - The dirty [first, last] envelope drives both the copy and the flush: a new VulkanBuffer::FlushDevice(cmd, access, stage, offset, bytes) overload flushes + barriers just that span for instanceBuffer, and the ranged FlushDevice(offset, bytes) for metadataBuffer. When nothing is dirty both are skipped — the skipped HOST->build barrier only ever ordered host writes, never the application's compute-written GPU-owned transform (that compute->build ordering is the caller's, and is unchanged). Constraint honoured: transformOwnedByGpu transforms are still never host-copied. The API field/method are mirrored on the WebGPU class for source portability (the WebGPU build re-uploads its small mirror wholesale and ignores the version). New test TLASInstanceDirtyTracking drives the real RT device and reads back the host-mapped buffers to assert: tracked elements upload once then skip until re-marked, untracked elements always upload, and relocation on the refit path re-uploads exactly the moved slots — with zero validation-layer errors over the ranged flush. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:58:45 +00:00
catbot	5a9d909f5d	Merge pull request 'perf(buffer): reuse staging via a per-frame-in-flight ring in UploadDeviceLocal (#120 )' (#139 ) from claude/issue-120 into master	2026-06-17 21:52:03 +02:00
catbot	85ca08047d	perf(buffer): reuse staging via a per-frame-in-flight ring in UploadDeviceLocal (#120 ) The staged branch of VulkanBuffer::UploadDeviceLocal did a full Create (create+getreq+alloc+map) + DeferredClear (later free) per call. Mesh::Refit / RecordProceduralBuild hit this every frame for deforming meshes on no-/small-BAR hardware — a device-memory alloc/free cycle per mesh per frame, on exactly the hardware the staged path targets. Replace it with a persistent per-buffer staging ring sized to a high-water mark, reallocated (via Resize, which defers the outgrown allocation) only on growth — mirroring the TLAS instance/metadata reuse. The ring is a per-frame-in-flight ring, not a single shared buffer: the vkCmdCopyBuffer still reads staging after the call returns, and with frames pipelined framesInFlight deep, overwriting one shared buffer next frame would clobber data the previous frame's copy is still reading. Indexing by frameCounter % framesInFlight gives each in-flight frame its own slot, reused only after framesInFlight frames elapse — the same window the #101 deletion queue relies on for GPU completion. The ring is grown lazily, on first entry into the staged branch, so ReBAR/UMA hardware (which always takes the direct-write branch) never constructs a staging allocation it does not use — keeping this a runtime no-op there and a straight upgrade on non-resizable-BAR machines. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:51:33 +00:00
catbot	e3edb87c0f	Merge pull request 'perf(mesh): dirty-range vertex upload for deforming-mesh Refit (#119 )' (#138 ) from claude/issue-119 into master	2026-06-17 21:51:26 +02:00
catbot	1f4c77000a	perf(mesh): dirty-range vertex upload for deforming-mesh Refit (#119 ) Mesh::Refit re-uploaded the entire vertex array every frame on the in-place UPDATE path (full host write + flush + barrier on the direct path; full re-stage + copy on the staged path), even when only a handful of vertices moved. Add VulkanBuffer::UploadDeviceLocalRange — a dirty-range counterpart to UploadDeviceLocal that writes/flushes/stages/copies and barriers only the half-open element range [offset, offset+count) of an already-allocated device-local buffer. It picks direct-map vs staged-copy from the memory type the buffer was actually allocated with (not the sub-range size, which PreferDirectDeviceWrite would mis-route), and the direct path's ranged flush is rounded to nonCoherentAtomSize and clamped to the allocation size (mappedSize is now recorded for every buffer, not just mapped ones). Add a dirty-range Refit overload taking the full vertex/index arrays plus a (dirtyVertexOffset, dirtyVertexCount) window. The full-span Refit now delegates to it with the whole array as the window. On the in-place UPDATE path only the declared window is uploaded — the rest of the device buffer retains last refit's positions; when an UPDATE isn't possible it falls back to the full-span rebuild, which is why the full arrays are still passed. The WebGPU/DOM backend keeps API symmetry: it has no hardware AS, so it ignores the window and rebuilds the host BVH from the full geometry. BLASBuildOptions exercises the dirty-range refit on the direct path, the staged path, and the count-change rebuild fallback, asserting AS-handle / blasAddr stability and zero Vulkan validation errors. Resolves #119 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:50:48 +00:00
catbot	82fd6916d9	Merge pull request 'perf(sync): scope frame-loop barriers to swapchain image + real per-pass stages (#115 )' (#137 ) from claude/issue-115 into master	2026-06-17 21:45:13 +02:00
catbot	9414863cff	perf(sync): scope frame-loop barriers to swapchain image + real per-pass stages (#115 ) The inter-pass and acquire/present barriers in the frame loop set both stage masks to ALL_COMMANDS, and the inter-pass dependency used a queue-wide VkMemoryBarrier — fully serialising against every pipeline stage and flushing all caches every frame, when all the next pass needs is the swapchain image the previous one wrote. Replace the inter-pass global VkMemoryBarrier with an image memory barrier scoped to the swapchain image's single colour subresource (as the intra-pass UI barrier already does), and derive the barrier stage masks per pass: RenderPass::SwapchainStage() is overridden by UIRenderer (COMPUTE_SHADER) and RTPass (RAY_TRACING_SHADER), so a compute->compute edge only serialises COMPUTE while an RT pass pulls in RAY_TRACING — the acquire/present frame-edge masks use the real union of the frame's passes (SwapchainStageUnion). The base default and the empty-passes fallback are the conservative COMPUTE \| RAY_TRACING \| TRANSFER union, so a polymorphic or un-overridden pass can only be over- not under-synchronised. Adds SwapchainBarrierScope (pure CPU) pinning the per-pass derivation, the union narrowing, and the inter-pass barrier scope; FrameLoopSync already drives the real GPU frame loop with validation enabled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:44:33 +00:00
catbot	faf1e5f5e8	Merge pull request 'perf(image): release static texture staging after upload (#114 )' (#136 ) from claude/issue-114 into master	2026-06-17 21:41:07 +02:00
catbot	b4fa6e596e	perf(image): release static texture staging after upload (#114 ) ImageVulkan kept its host-visible staging `buffer` (sized w*h, persistently mapped) alive for the whole life of the image and Destroy() never freed it, so static textures (e.g. Sponza albedo) pinned HOST_VISIBLE / small-BAR memory forever. Update now releases the staging via VulkanBuffer::DeferredClear() right after recording the buffer→image copy — the same fence-keyed deletion queue (#101/#102) the compressed Mesh path already uses — so it is freed once the upload submit's frame has cleared. A new `streamed` flag (set by FontAtlas) keeps the persistent map for images re-uploaded from a CPU-side staging buffer every frame; its uploads go through UpdateRegion, which never releases. Destroy now also frees any staging the image still owns, gated on a live handle so a released static texture can't double-free. Adds tests/ImageStagingRelease driving the real upload + readback on a headless device: asserts the staging is enqueued (not pinned), the image reads back byte-equal (released staging outlived the submit), the entry retires only after framesInFlight frames, and a streamed image keeps then frees its staging. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:40:35 +00:00
catbot	5948bb91c1	Merge pull request 'perf(device): pop retired deletions off the front in O(ready) (#116 )' (#135 ) from claude/issue-116 into master	2026-06-17 21:38:28 +02:00
catbot	2460a66296	perf(device): pop retired deletions off the front in O(ready) (#116 ) ReclaimDeletions ran an O(n) walk + survivor compaction over the whole deletion queue every frame. frameCounter is monotonic, so retireAfter is non-decreasing down the queue and the entries ready to reclaim are always a contiguous prefix. Switch deletionQueue from std::vector to std::deque and pop the ready prefix off the front, stopping at the first entry still in flight — O(ready) instead of O(queue). Resolves #116 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:37:58 +00:00
catbot	9d9f9d9d2c	Merge pull request 'test(mesh): pin static-build deletion count for #67 staging-release (#110 )' (#112 ) from claude/issue-110 into master	2026-06-17 20:51:18 +02:00
catbot	3655cd636c	test(mesh): pin static-build deletion count in #67 staging-release test (#110 ) Issue #110 reported `crafter-build test MeshDecompressStagingRelease` failing deterministically on the GPU decompress path: the "exactly one allocation handed to the deletion queue" assertion saw more than one entry. Root cause (confirmed at Crafter.Graphics-Mesh.cpp:175-176): a static (allowUpdate=false) Build cannot refit, so it DeferredClear()s its now-dead per-mesh BLAS scratch in addition to the compressed staging (#67) — two deferred allocations, not one. The original test built statically yet asserted size==1, so the scratch was the unaccounted-for second entry. The behaviour was already corrected on master by PR #109 (the #73 merge, `ed9b3f6`), which switched the assertion's build to allowUpdate=true so the scratch is retained. This hardens that fix: it adds a sibling case that Builds with allowUpdate=false and asserts the queue holds exactly TWO entries (staging + dead scratch), builds with zero validation errors, and that both retire on schedule. The count the #67 assertion relies on is now a named, explicitly-tested quantity rather than an implicit comment, so a future change to scratch deferral can't silently shift it back. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:50:38 +00:00
catbot	8b1db01222	Merge pull request 'perf(mesh): place RT geometry device-local via #89 upload strategy (#73 )' (#109 ) from claude/issue-73 into master	2026-06-17 20:05:57 +02:00
catbot	ed9b3f67a7	test+usage: device-local geometry readback + isolate #67 staging count - Add TRANSFER_SRC to RT geometry usage so device-local geometry (now in VRAM, no longer host-mappable) stays copyable/inspectable. - MeshDecompressStagingRelease: read decompressed vertex/index back via a device->host copy instead of the removed host-mapped .value/FlushHost, and build the mesh with allowUpdate=true so the retained per-mesh scratch (#66) doesn't also land in the deletion queue — isolating the assertion to the released compressed staging (#67). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:05:13 +00:00
catbot	aafa458d41	Merge remote-tracking branch 'origin/master' into claude/issue-73 # Conflicts: # interfaces/Crafter.Graphics-Mesh.cppm	2026-06-17 17:52:06 +00:00
catbot	1582b6ceb5	Merge pull request 'perf(rt): allocate TLAS metadata buffer in BAR/VRAM, not system RAM (#75 )' (#111 ) from claude/issue-75 into master	2026-06-17 19:50:55 +02:00
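
Commit 2460a66296 above describes popping the ready prefix off a std::deque deletion queue instead of walking the whole queue. The following is a minimal sketch of that idea, not the repository's code: the Deletion struct, the DeletionQueue class, the destroy callback, and the exact retire condition (retireAfter <= frameCounter) are assumptions made for illustration. Only the names deletionQueue, retireAfter, frameCounter and framesInFlight come from the log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical entry type: the real queue's payload is not shown in the log.
struct Deletion {
    std::uint64_t retireAfter;      // frame at which the entry may be reclaimed
    std::function<void()> destroy;  // stand-in for freeing the real allocation
};

// Hypothetical wrapper around the deque described in the commit message.
class DeletionQueue {
public:
    void Enqueue(std::uint64_t frameCounter, std::uint64_t framesInFlight,
                 std::function<void()> destroy) {
        const std::uint64_t retireAfter = frameCounter + framesInFlight;
        // Assumes a monotonic frameCounter and a constant framesInFlight,
        // so retireAfter never decreases from front to back.
        assert(queue_.empty() || queue_.back().retireAfter <= retireAfter);
        queue_.push_back({retireAfter, std::move(destroy)});
    }

    // Pops only the ready prefix and stops at the first entry still in flight.
    // Cost is proportional to the number of reclaimed entries, not queue size.
    std::size_t Reclaim(std::uint64_t frameCounter) {
        std::size_t reclaimed = 0;
        while (!queue_.empty() && queue_.front().retireAfter <= frameCounter) {
            queue_.front().destroy();
            queue_.pop_front();
            ++reclaimed;
        }
        return reclaimed;
    }

    std::size_t Size() const { return queue_.size(); }

private:
    std::deque<Deletion> queue_;
};
```

Because the entries are ordered by retireAfter, the first entry that is not ready proves that no later entry is ready either, which is what makes the early exit safe.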


398 commits
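
Commit 1d7d9f7b8e in the log describes replacing a clear-all policy with LRU eviction, using a std::list of pointers to keys owned by a node-based map. The sketch below illustrates that structure under stated assumptions: the LruCache class, its std::string keys and int payloads, and the Find/Insert/Contains/Size methods are hypothetical stand-ins, not the UIRenderer API.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the shaped-run cache: string keys, int payloads.
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns the cached value or nullptr. A hit splices the entry to the
    // front of the recency list in O(1).
    const int* Find(const std::string& key) {
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second.pos);
        return &it->second.value;
    }

    void Insert(const std::string& key, int value) {
        auto it = map_.find(key);
        if (it != map_.end()) {  // overwriting counts as a use
            it->second.value = value;
            order_.splice(order_.begin(), order_, it->second.pos);
            return;
        }
        if (map_.size() >= capacity_) {
            // Evict only the least-recently-used entry, not the whole cache.
            auto victim = map_.find(*order_.back());
            order_.pop_back();
            map_.erase(victim);
        }
        auto inserted = map_.emplace(key, Entry{value, {}}).first;
        // unordered_map is node-based, so this key pointer survives rehashing.
        order_.push_front(&inserted->first);
        inserted->second.pos = order_.begin();
    }

    // Introspection that does not touch recency.
    bool Contains(const std::string& key) const { return map_.count(key) != 0; }
    std::size_t Size() const { return map_.size(); }

private:
    struct Entry {
        int value;
        std::list<const std::string*>::iterator pos;  // position in order_
    };
    std::size_t capacity_;
    std::list<const std::string*> order_;  // front = most recently used
    std::unordered_map<std::string, Entry> map_;
};
```

With a capacity of two, inserting "hot" and "cold", touching "hot", and then inserting a third key evicts "cold" only, which is the behaviour the commit's ShapeTextCache test is described as asserting.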