fix: decouple client download from merge; stop deleting parts under in-flight readers#244
Open
peter-svensson wants to merge 1 commit into
Open
Conversation
…r readers Two defects in Storage.download() for unmerged entries: 1. The first download of an unmerged entry tees the parts into both the client response and the /merged re-upload through one backpressured pipe (pumpPartsToStreams). Because the merge upload runs at queueSize:1, partSize:5MB, the client is throttled to a serial 5MB PUT loop -- a ~700MB restore stalls at ~124MiB and crawls at ~3-4 MB/s (observed ~3m40s). 2. On merge completion the parts folder is deleted inline. Concurrent downloads still streaming those parts then hit ObjectNotFound (500), which clients treat as a miss -> re-upload churn and entries that never stay merged. Fix: serve the client directly from streamParts() at full speed and run the merge as an independent background stream (its own part reads), so the client is never coupled to the merge. Claim the merge atomically (UPDATE ... WHERE mergeStartedAt IS NULL AND mergedAt IS NULL) so concurrent/replicated downloads don't start duplicate merges. Stop deleting parts inline -- the existing hourly cleanup:parts task already removes parts of merged entries, so in-flight readers are never cut off. Adds tests/concurrent-download.test.ts (fails on the inline-delete code, passes after). Related to falcondev-oss#235 (which addresses defect 2 via a parts-delete grace period but leaves the defect-1 download throttling).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Two issues in
Storage.download()for unmerged entries (large caches not yet merged):1. Client download is throttled to the merge upload
The first download of an unmerged entry feeds parts into both the client response and the
/mergedre-upload through one backpressured pipe (pumpPartsToStreams). The merge upload runs atqueueSize: 1, partSize: 5MB, so the client is capped to a serial 5MB PUT loop. On a ~705 MB S3 cache the restore streams ~124 MiB fast (~29 MB/s) then crawls in ~128 MiB/~30s steps — ~3m40s at ~3-4 MB/s.2. Inline parts deletion races in-flight readers
On merge completion the code deletes
{folder}/partsinline. Concurrent downloads still streaming those parts hitObjectNotFound-> 500 (ERR_HTTP_HEADERS_SENT), treated as a miss -> re-upload churn, and entries never stay merged.Fix
streamParts()at full speed; run the merge as an independent background stream. Client is never coupled to the merge.UPDATE ... WHERE mergeStartedAt IS NULL AND mergedAt IS NULL) so concurrent/replicated downloads don't start duplicate merges.cleanup:partstask already removes parts of merged entries, so in-flight readers aren't cut off.pumpPartsToStreams/downloadFromCacheEntryLocation.Restore of the same ~705 MB cache: ~3m40s -> ~20s in our environment.
Tests
Adds
tests/concurrent-download.test.ts(fails on inline-delete, passes after). Existing stale-cache + e2e suites pass.Relation to #235
#235 fixes defect 2 (parts-delete race) via a grace period + per-part check — complementary. This PR additionally fixes defect 1 (download throttled to merge). Happy to align with #235's grace approach.