Skip to content

fix: decouple client download from merge; stop deleting parts under in-flight readers#244

Open
peter-svensson wants to merge 1 commit into
falcondev-oss:devfrom
peter-svensson:fix/decouple-merge-and-parts-race
Open

fix: decouple client download from merge; stop deleting parts under in-flight readers#244
peter-svensson wants to merge 1 commit into
falcondev-oss:devfrom
peter-svensson:fix/decouple-merge-and-parts-race

Conversation

@peter-svensson

Copy link
Copy Markdown

Problem

Two issues in Storage.download() for unmerged entries (large caches not yet merged):

1. Client download is throttled to the merge upload

The first download of an unmerged entry feeds parts into both the client response and the /merged re-upload through one backpressured pipe (pumpPartsToStreams). The merge upload runs at queueSize: 1, partSize: 5MB, so the client is capped to a serial 5MB PUT loop. On a ~705 MB S3 cache the restore streams ~124 MiB fast (~29 MB/s) then crawls in ~128 MiB/~30s steps — ~3m40s at ~3-4 MB/s.

2. Inline parts deletion races in-flight readers

On merge completion the code deletes {folder}/parts inline. Concurrent downloads still streaming those parts hit ObjectNotFound -> 500 (ERR_HTTP_HEADERS_SENT), treated as a miss -> re-upload churn, and entries never stay merged.

Fix

  • Serve the client directly from streamParts() at full speed; run the merge as an independent background stream. Client is never coupled to the merge.
  • Atomically claim the merge (UPDATE ... WHERE mergeStartedAt IS NULL AND mergedAt IS NULL) so concurrent/replicated downloads don't start duplicate merges.
  • Stop deleting parts inline — the existing hourly cleanup:parts task already removes parts of merged entries, so in-flight readers aren't cut off.
  • Removes pumpPartsToStreams / downloadFromCacheEntryLocation.

Restore of the same ~705 MB cache: ~3m40s -> ~20s in our environment.

Tests

Adds tests/concurrent-download.test.ts (fails on inline-delete, passes after). Existing stale-cache + e2e suites pass.

Relation to #235

#235 fixes defect 2 (parts-delete race) via a grace period + per-part check — complementary. This PR additionally fixes defect 1 (download throttled to merge). Happy to align with #235's grace approach.

…r readers

Two defects in Storage.download() for unmerged entries:

1. The first download of an unmerged entry tees the parts into both the
   client response and the /merged re-upload through one backpressured
   pipe (pumpPartsToStreams). Because the merge upload runs at
   queueSize:1, partSize:5MB, the client is throttled to a serial 5MB
   PUT loop -- a ~700MB restore stalls at ~124MiB and crawls at
   ~3-4 MB/s (observed ~3m40s).

2. On merge completion the parts folder is deleted inline. Concurrent
   downloads still streaming those parts then hit ObjectNotFound (500),
   which clients treat as a miss -> re-upload churn and entries that
   never stay merged.

Fix: serve the client directly from streamParts() at full speed and run
the merge as an independent background stream (its own part reads), so
the client is never coupled to the merge. Claim the merge atomically
(UPDATE ... WHERE mergeStartedAt IS NULL AND mergedAt IS NULL) so
concurrent/replicated downloads don't start duplicate merges. Stop
deleting parts inline -- the existing hourly cleanup:parts task already
removes parts of merged entries, so in-flight readers are never cut off.

Adds tests/concurrent-download.test.ts (fails on the inline-delete code,
passes after).

Related to falcondev-oss#235 (which addresses defect 2 via a parts-delete grace
period but leaves the defect-1 download throttling).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant