Race condition: concurrent builds clobber each other's external dep clones #2

Closed
opened 2026-05-27 03:44:55 +02:00 by jorijnvdgraaf · 6 comments

When two crafter-build invocations run concurrently and both need the same
external dep, they race on the shared .tmp clone directory and one corrupts
the other's in-flight git clone.

Repro

Two projects (A, B) that both declare the same ExternalDependency (directly
or transitively), sharing a common ~/.cache/crafter.build/external/:

cd projectA && crafter-build &
cd projectB && crafter-build &
wait

If the cache miss happens for both at roughly the same moment, one of the
builds fails.

Observed

[N/N] Building external dep <name>
git fetch for '<name>': `git clone --recursive '<url>' '~/.cache/crafter.build/external/<name>-<hash>.tmp'` failed (exit 128): Cloning into '~/.cache/crafter.build/external/<name>-<hash>.tmp'...
fatal: could not open '~/.cache/crafter.build/external/<name>-<hash>.tmp/.git/objects/pack/tmp_pack_XXXXXX' for reading: No such file or directory
fatal: fetch-pack: invalid index-pack output

tmp_pack_* is an in-flight git pack tempfile — it vanishes mid-read because
the other build either rm'd the .tmp directory or finished and renamed it
out from under us.

Likely cause

The cache path is derived from a hash of the dep spec, so concurrent builds
that resolve to the same external compute the same .tmp directory name
(both racing builds picked the identical -<hash>.tmp suffix in the failure
we hit). There's no lock around git clone → rename, so:

  • build A: git clone … .tmp (in progress)
  • build B: git clone … .tmp (sees existing dir, or wipes-and-retries)
  • build A: pack tempfile yanked → invalid index-pack output

Suggested fixes

  1. Per-invocation tmp dir — clone into <dep>.tmp.<pid> (or mktemp -d),
    then rename(2) to the canonical cache path on success. rename is atomic
    on the same filesystem; the loser of the race just discards its clone.
  2. File lock on the cache dirflock on
    ~/.cache/crafter.build/external/<dep>.lock around the clone+rename so
    only one builder actually fetches; others wait and reuse the result. Lower
    bandwidth usage when many builds start at once.

(1) alone fixes the corruption; (1) + (2) also avoids redundant clones.

Environment

  • Linux x86_64
  • Two concurrent crafter-build invocations against sibling projects that
    share the user's ~/.cache/crafter.build/external/
When two `crafter-build` invocations run concurrently and both need the same external dep, they race on the shared `.tmp` clone directory and one corrupts the other's in-flight git clone. ## Repro Two projects (A, B) that both declare the same `ExternalDependency` (directly or transitively), sharing a common `~/.cache/crafter.build/external/`: ```sh cd projectA && crafter-build & cd projectB && crafter-build & wait ``` If the cache miss happens for both at roughly the same moment, one of the builds fails. ## Observed ``` [N/N] Building external dep <name> git fetch for '<name>': `git clone --recursive '<url>' '~/.cache/crafter.build/external/<name>-<hash>.tmp'` failed (exit 128): Cloning into '~/.cache/crafter.build/external/<name>-<hash>.tmp'... fatal: could not open '~/.cache/crafter.build/external/<name>-<hash>.tmp/.git/objects/pack/tmp_pack_XXXXXX' for reading: No such file or directory fatal: fetch-pack: invalid index-pack output ``` `tmp_pack_*` is an in-flight git pack tempfile — it vanishes mid-read because the *other* build either rm'd the `.tmp` directory or finished and renamed it out from under us. ## Likely cause The cache path is derived from a hash of the dep spec, so concurrent builds that resolve to the same external compute the same `.tmp` directory name (both racing builds picked the identical `-<hash>.tmp` suffix in the failure we hit). There's no lock around `git clone → rename`, so: - build A: `git clone … .tmp` (in progress) - build B: `git clone … .tmp` (sees existing dir, or wipes-and-retries) - build A: pack tempfile yanked → `invalid index-pack output` ## Suggested fixes 1. **Per-invocation tmp dir** — clone into `<dep>.tmp.<pid>` (or `mktemp -d`), then `rename(2)` to the canonical cache path on success. `rename` is atomic on the same filesystem; the loser of the race just discards its clone. 2. **File lock on the cache dir** — `flock` on `~/.cache/crafter.build/external/<dep>.lock` around the clone+rename so only one builder actually fetches; others wait and reuse the result. Lower bandwidth usage when many builds start at once. (1) alone fixes the corruption; (1) + (2) also avoids redundant clones. ## Environment - Linux x86_64 - Two concurrent `crafter-build` invocations against sibling projects that share the user's `~/.cache/crafter.build/external/`
Member

claude:claim:b831bbc6-27c5-4963-87be-6b4dce043c86

claude:claim:b831bbc6-27c5-4963-87be-6b4dce043c86
Member

[progress]

I've found the issue. Let me look at how `Platform` is used to understand `RunCommandChecked` and how PID can be obtained.
**[progress]** ``` I've found the issue. Let me look at how `Platform` is used to understand `RunCommandChecked` and how PID can be obtained. ```
Member

[progress]

I see the race. `FetchGit` at line 106-107 derives `tmpDir` as `cloneDir + ".tmp"` — same suffix for every concurrent invocation that resolved to the same hash. The loser's pack tempfiles get yanked when the winner renames or the next process `remove_all`s the dir at line 108.

Before I patch, one question on scope:
I'll proceed with the minimal correctness fix (1) — per-invocation tmp dir, rename atomically, loser discards. That matches the guidance to not expand scope beyond what the bug requires.
**[progress]** ``` I see the race. `FetchGit` at line 106-107 derives `tmpDir` as `cloneDir + ".tmp"` — same suffix for every concurrent invocation that resolved to the same hash. The loser's pack tempfiles get yanked when the winner renames or the next process `remove_all`s the dir at line 108. Before I patch, one question on scope: I'll proceed with the minimal correctness fix (1) — per-invocation tmp dir, rename atomically, loser discards. That matches the guidance to not expand scope beyond what the bug requires. ```
Member

[progress]

Now I'll make the fix. I'll use a 64-bit random suffix from `std::random_device` so each invocation has a unique tmp dir, then `rename(2)` atomically. If the rename loses to a concurrent winner, we discard our tmp and reuse the winner's clone.
Now build to verify the change compiles.
Let me check if there's an existing build I can use to do a targeted compile of just the changed translation unit.
**[progress]** ``` Now I'll make the fix. I'll use a 64-bit random suffix from `std::random_device` so each invocation has a unique tmp dir, then `rename(2)` atomically. If the rename loses to a concurrent winner, we discard our tmp and reuse the winner's clone. Now build to verify the change compiles. Let me check if there's an existing build I can use to do a targeted compile of just the changed translation unit. ```
Member

[progress]

I'll run the full build to validate the change compiles. This will take a few minutes (it clones glslang first).
Build succeeded but `build.sh` has no `set -e`. Let me re-run the specific compile to confirm clean.
**[progress]** ``` I'll run the full build to validate the change compiles. This will take a few minutes (it clones glslang first). Build succeeded but `build.sh` has no `set -e`. Let me re-run the specific compile to confirm clean. ```
Member

PR opened: #3

PR opened: https://forgejo.catcrafts.net/Catcrafts/Crafter.Build/pulls/3
Sign in to join this conversation.
No description provided.