Concurrent crafter-build invocations corrupt the shared module cache (malformed or corrupted precompiled file) #14

Closed
opened 2026-05-30 18:11:49 +02:00 by jorijnvdgraaf · 2 comments

Summary

Running two crafter-build invocations concurrently corrupts the shared per-host module cache. The second process reads a half-written .pcm that the first is still emitting, and the build dies with:

Failed to precompile Crafter.Build-Progress (exit 1): .../Crafter.Build-Progress.cppm:23:1:
  fatal error: malformed or corrupted precompiled file: 'can't skip to bit 264461088 from 199833472'
   23 | import std;
      | ^
1 error generated.

The bit offsets vary run to run (it's a torn read), and the same class of error can surface against std.pcm or any of the Crafter.Build-*.pcm artifacts.

Reproduction

From a clean cache (rm -rf "$XDG_CACHE_HOME/crafter.build" / %LOCALAPPDATA%\crafter.build), launch two builds that don't share a project but do share the host cache — e.g. a wasm app build and a sibling native tool build:

crafter-build --target=wasm32-wasip1 --debug-api -r &
cd ../some-other-project && crafter-build &
wait

One of them intermittently fails with the malformed or corrupted precompiled file error above. It reproduces most reliably on a cold cache, where both processes decide the host PCMs are missing/stale and race to (re)precompile them. Serializing the two builds always succeeds.

Root cause

LoadProject bootstraps the host-side PCMs into a shared, per-host cache directory keyed only by <target>-<march>:

  • GetCacheDir() at implementations/Crafter.Build-Platform.cpp:221
  • cacheDir = GetCacheDir() / "{target}-{march}", e.g. at :310, :508 (one per platform variant)
  • BuildStdPcm(hostConfig, cacheDir / "std.pcm") at :313, :511
  • EnsureCrafterBuildPcms(sourceDir, cacheDir) at :263, :459

Both BuildStdPcm and EnsureCrafterBuildPcms have the same unsafe shape:

  1. TOCTOU staleness check: if (fs::exists(pcm) && last_write_time(cppm) < last_write_time(pcm)) continue; (:270, :469, :252). Two processes evaluate this independently and both decide to rebuild.
  2. Non-atomic write to the final path: clang++ ... --precompile {cppm} -o {pcmPath} writes directly to the shared destination (:279, :478, :253). While process A is partway through writing std.pcm / Crafter.Build-Progress.pcm, process B opens that same path (via -fprebuilt-module-path={cacheDir}) and reads a truncated/torn file.
  3. No cross-process lock. The in-process projectCacheMutex in Crafter.Build-External.cpp only serializes within a single crafter-build process; nothing guards the cache between separate processes.

Because the cache dir is keyed only by target-march (not by PID/project), independent invocations on the same host collide on exactly the same files.

Suggested fix

Either of these closes the race; ideally apply both:

  • Write to a unique temp path, then fs::rename into place. --precompile to cacheDir / (name + ".pcm." + <pid/uuid>), then atomic-rename onto name.pcm. A reader then always sees either the old complete file or the new complete file, never a torn one.
  • Take an OS advisory lock for the bootstrap critical section — e.g. flock a cacheDir/.lock (or a per-PCM lockfile) around the staleness-check + precompile so only one process rebuilds while others wait, then re-check freshness.

Impact / workaround

Anything that fans out crafter-build (CI matrices, build-the-app-plus-a-sibling-tool scripts, parallel agent harnesses) hits this nondeterministically. Workaround is to serialize all crafter-build invocations that share a host, or pre-warm the cache with a single throwaway build before fanning out.

Environment

  • Crafter.Build master @ a930a4a (latest-26-ga930a4a)
  • Linux x86_64 (x86_64-pc-linux-gnu-native); the same pattern exists in the Windows-MSVC and mingw LoadProject/EnsureCrafterBuildPcms variants
  • clang C++23 modules with import std;

Surfaced while driving the 3DForts wasm build: parallelizing the wasm app build and the host-companion native build tripped it immediately.

Member

claude:claim:cebdd08e-e6c7-4faa-b2b0-5853f6effb4d

Member

PR opened: #15 (https://forgejo.catcrafts.net/Catcrafts/Crafter.Build/pulls/15)