[SDKv2] resumable downloads (cross-process lock + per-chunk state + linked cancellation) by bmehta001 · Pull Request #793 · microsoft/Foundry-Local

bmehta001 · 2026-06-09T17:48:55Z

Resumable downloads (C++ port)

Ports the resumable-download flow from neutron-server's AzureExtensions.cs / BlobDownloadState.cs / CrossProcessFileLock.cs to the C++ SDK in sdk_v2/cpp. Two commits, one per increment.

Increment 1 — cross-process lock + skip-existing (commit `45c99d5d`)

CrossProcessFileLock: RAII, <dir>/.download.lock exclusive lock. Win: CreateFileW FILE_SHARE_NONE FILE_FLAG_DELETE_ON_CLOSE; POSIX: flock LOCK_EX|LOCK_NB. PIMPL (struct State defined only in the .cc) keeps platform headers out of the public include.
WaitForLockForDirectory: polls every 1.25 s with a 100 ms cancellation slice and a 3 h ceiling, driven by a std::function<bool()> predicate so cancellation can also serve as a heartbeat (the progress callback returning non-zero).
DownloadManager::DownloadModel: takes the lock right after create_directories, re-checks the cache once acquired, then proceeds with the existing download flow. Stores ILogger& so the lock and resume layers can log uniformly.
DownloadBlobsToDirectory: blobs already present at the expected size are skipped; their bytes are credited to the initial progress callback so the percentage stays accurate on resume. Emits 100% immediately if everything is cached.
9 CrossProcessFileLock tests + 4 new skip-existing tests in download_test.cc.

Increment 2 — per-chunk resume + linked cancellation (commit `b6d98db5`)

BlobDownloadState (new): compact binary <file>.dlstate sidecar. ~45-byte LE header (magic FLDS + version + sizes + counters) followed by the truncated bitmap suffix. The prefix of fully-completed chunks is implicit — SaveState advances bitmap_byte_aligned_start past every all-1s word so the file stays proportional to the unfinished tail. LoadState rejects on magic, version, or layout mismatch and restarts the download in that case. Atomic save via tmp + rename, with remove-then-rename fallback.
AzureBlobDownloader::DownloadBlob rework: protected virtuals GetBlobSize and DownloadChunkToBuffer (against an opaque ChunkContext) form a test seam. Worker pool uses an atomic queue index over the pending-chunks list; workers claim, fetch, write to file (under a single file_mutex), mark complete, and periodically SaveState (every max(10, num_chunks/50) chunks). Pre-allocation is skipped if the file already matches blob size, so resume doesn't discard valid bytes.
Linked cancellation: a single shared Azure::Core::Context + a single std::atomic<bool> internal_cancel per call. The first chunk failure (or external cancel signal) flips the flag and calls ctx.Cancel(). Other in-flight chunks see either signal and exit fast — a chunk failing while 9 others are mid-flight drains in ~300 ms in the new ChunkFailureCancelsInFlightPeersFast test.
IsDownloadNeeded now also returns true if a .dlstate sidecar exists (the data file may be pre-allocated with holes).
Sidecar is persisted on any failure and deleted on full success.

Open decisions (resolved)

Question	Resolution
`BlobDownloadState` design	Port the C# truncated-bitmap shape directly.
Sidecar filename	`.dlstate` (cross-language compat with C# `.download.state` was explicitly out of scope).
`OnDownloadComplete` telemetry	Deferred per spec; C++ already signals via 100% progress callback.
Mockability	Both: `IBlobDownloader` mock for orchestrator tests + `AzureBlobDownloader` subclass via protected virtuals for chunk-level injection.

Tests

9 new CrossProcessFileLockTest cases (Inc 1)
4 new skip-existing cases in download_test.cc (Inc 1)
15 new BlobDownloadStateTest cases (Inc 2): create/mark/save/load/delete, gap enumeration, partial final chunk math, byte-aligned-start advancement, rejection of magic / size / total_chunks mismatches.
5 new AzureBlobDownloaderResumeTest cases (Inc 2): resume from sidecar, fresh download, sidecar persists on chunk failure, stale sidecar cleaned up for empty blobs, cancel-cascade timing.

Targeted BlobDownloader|DownloadManager|CrossProcessFileLock|AzureBlobDownloader suite: 60/60 in 13 s. Full gtest run is 789 passed / 54 skipped; the 3 pre-existing ModelLoadManagerUnloadTest failures are environmental (missing local test model file qwen2.5-0.5b-instruct-generic-cpu-4), not caused by this PR.

Out of scope

Region-based download (Increment 3 in the spec) — tracked separately.

Notes for reviewers

C ABI is unchanged; this PR is purely internal to sdk_v2/cpp.
Reference implementation in C# lives in neutron-server/src/Downloader/ (private repo); see Increment 2's bitmap-truncation logic in BlobDownloadState.cs:237-281 for the closest analog.

vercel · 2026-06-09T17:48:56Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
foundry-local	Ready	Preview, Comment	Jun 20, 2026 12:46am

Two unrelated -Werror diagnostics from Clang (and modern GCC) were tripping the Linux x64 and macOS ARM64 jobs on PR #793; Windows + MSVC silently accepted them. 1. blob_download_state.cc: 'kHeaderSize' was a namespace-scope constexpr that nothing referenced (the header layout is materialized by the WriteLE call sequence, not this constant). Triggers -Wunused-const-variable on Clang. Delete it; the layout comment above already documents the 45-byte size. 2. download_test.cc: ChunkFailureCancelsInFlightPeersFast captured 'kFailOffset' in a lambda, but it's a constexpr int64_t used only in a constant expression so the capture is redundant and -Wunused-lambda-capture flags it. Replace [kFailOffset] with [] to match the sister test's pattern. Also fix a latent issue surfaced during review: 3. MutexFstreamFileWriter::WriteAt now calls file_.clear() before seekp() so a prior failure doesn't permanently poison the stream and cause subsequent workers to surface a spurious 'write failed' instead of the original error. Positional writers are unaffected (pwrite/WriteFile are stateless). 71 tests still pass on Windows (35 in the affected suites verified explicitly). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Increment 1 of the resumable-downloads port (see docs/ResumableDownloadsPlan.md). No public C ABI changes. CrossProcessFileLock - New RAII helper backed by an OS-level exclusive lock on <dir>/.download.lock: Windows uses CreateFileW with FILE_SHARE_NONE + FILE_FLAG_DELETE_ON_CLOSE; POSIX uses open(O_CREAT|O_RDWR|O_CLOEXEC) + flock(LOCK_EX|LOCK_NB). - Writes a PID:<pid>,Time:<iso8601> diagnostic line for crash forensics. - WaitForLockForDirectory polls at 1.25 s with a 3 h timeout. The cancellation hook is a std::function<bool()> predicate (not a bare atomic) so callers can route it through their own cancellation channel — DownloadManager forwards it through the existing progress callback's non-zero return. DownloadManager::DownloadModel - Acquires the cross-process lock immediately after create_directories and before writing the in-progress signal file. - Re-checks the cache after acquiring the lock to short-circuit when another process just finished the same download. - Now stores ILogger& logger_ so the lock acquisition can log who is waiting. DownloadBlobsToDirectory (skip-existing) - New IsDownloadNeeded(blob, local_path) filter: blobs whose local file already exists at the expected content_length are skipped. - Skipped bytes are credited toward the total — the initial progress callback now emits skipped_bytes / total_size * 100 instead of always 0%, so resumed downloads start at an honest percentage rather than rewinding to zero. - If every blob is already on disk the function emits 100% and returns. Tests - 9 new CrossProcessFileLockTest cases (acquire, release, contention, recovery after release, directory creation, wait happy path, wait-then-acquire, cancellation, timeout). - 4 new BlobDownloadTest cases for skip-existing (same-size, wrong-size, progress accounting, everything-cached). - Full targeted suite passes 40/40 in 14 s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Increment 2 of the resumable-downloads C++ port. Adds a <file>.dlstate sidecar that tracks per-chunk completion via a truncated bitmap (matching the C# BlobDownloadState design from neutron-server), and replaces AzureBlobDownloader::DownloadBlob's batch loop with a worker pool that shares a single Azure::Core::Context. The first chunk failure calls Cancel() on the shared context and flips an internal cancel flag, so every other in-flight chunk drains within tens of milliseconds instead of waiting on its own retry+timeout budget. Highlights: - BlobDownloadState (new): compact binary on-disk format. ~45-byte LE header (magic FLDS + version + sizes + counters) followed by the truncated bitmap suffix. The prefix of fully-completed chunks is implied — SaveState advances �itmap_byte_aligned_start past every fully-set word to keep the sidecar proportional to the unfinished tail. LoadState rejects on magic, version, or layout (blob_size / chunk_size / total_chunks) mismatch and starts the download fresh in that case. Atomic save via tmp file + rename, with remove-then-rename fallback for filesystems that don't replace atomically. - AzureBlobDownloader rework: protected virtual GetBlobSize and DownloadChunkToBuffer (against an opaque ChunkContext) form a test seam so subclasses can simulate per-chunk behavior without touching Azure. Worker pool uses an atomic queue index over pending chunks; workers claim, fetch, write, mark complete, and periodically save (max(10, num_chunks/50) chunks). Pre-allocation is skipped if the file is already at full size, so resume doesn't discard valid bytes. Sidecar is persisted on any failure and deleted on full success. - IsDownloadNeeded now treats the presence of a .dlstate sidecar as "download still needed" — the data file may be pre-allocated with holes. - AzureBlobDownloader picks up an optional ILogger*; DownloadManager passes its own logger through. Tests: - 15 BlobDownloadStateTest cases (create/mark/save/load/delete, gap enumeration, partial final chunk math, byte-aligned-start advancement, rejection of magic/size mismatches). - 5 AzureBlobDownloaderResumeTest cases via a FakeChunkAzureDownloader subclass: resume skips already-completed chunks, sidecar persists on chunk failure, stale sidecar is cleaned up for empty blobs, and a failing chunk drains 9 sleeping peers within ~300 ms (well under the 2 s threshold) — exercising the linked-cancellation cascade end to end. 20 new tests; full BlobDownloader/DownloadManager/CrossProcessFileLock suite is 59/59 in ~15 s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ker memory at 64 KB) Eliminate the 2 MB-per-chunk std::vector<uint8_t> allocation in AzureBlobDownloader::DownloadBlob by streaming each chunk through a sink callback that forwards 64 KB pieces straight to a thread-safe file writer. Peak download memory drops from concurrency * chunk_size (128 MB at 64-way concurrency on 2 MB chunks) to roughly concurrency * 64 KB (~4 MB) regardless of chunk size, matching the .NET Stream.CopyTo semantics we ported from instead of doubling memory with a buffer copy. The file write strategy is now selected via a new `FileWriterKind` constructor argument on `AzureBlobDownloader` and backed by a small `IFileWriter` abstraction with two implementations: - `Positional` (default, recommended): lock-free positional writes using `pwrite` on POSIX and `WriteFile` + `OVERLAPPED.Offset` on Windows. No user-space mutex; the kernel orders disjoint-range writes. - `MutexFstream` (comparison / portable fallback): single shared `std::fstream` guarded by an internal mutex. The chunk-write fast path that used to open a fresh fstream and seek under a mutex per chunk is now subsumed by this writer; the file handle is opened once and reused for every WriteAt. Both writers handle `Open` correctly for the resume path: an existing file at exactly the expected size is preserved (so already-downloaded bytes survive across the writer swap), and any other state triggers pre-allocation to the expected size. The orchestrator's worker pool, atomic cancel cascade, sidecar save cadence, and progress reporting are unchanged. The renamed protected virtual `DownloadChunkStreaming` (replacing `DownloadChunkToBuffer`) is the new test seam; both production code and the existing `FakeChunkAzureDownloader` test double now use the sink callback to deliver chunk bytes. Tests: - New `WriterImpls/FileWriterTest` runs 5 correctness checks against both writer implementations (10 tests total): open semantics for fresh / existing-at-size / existing-at-different-size files, single thread WriteAt, and 8 threads writing 256 KB regions to disjoint offsets validated byte-for-byte after close. - New `FileWriterPerfComparison.PositionalVsMutexFstream` runs the realistic AzureBlobDownloader workload (32 workers, 8 chunks/worker, 2 MB chunks, 64 KB sink pieces, 512 MB total) against both writers and prints wall-clock + MB/s for comparison. Measured locally on NVMe NTFS: Positional averages ~590 MB/s, MutexFstream ~545 MB/s (Positional ~8 percent faster on average; both well above 500 MB/s). - All existing 60 BlobDownload* + AzureBlobDownloader* tests still pass without modification beyond the chunk_hook signature update. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two unrelated -Werror diagnostics from Clang (and modern GCC) were tripping the Linux x64 and macOS ARM64 jobs on PR #793; Windows + MSVC silently accepted them. 1. blob_download_state.cc: 'kHeaderSize' was a namespace-scope constexpr that nothing referenced (the header layout is materialized by the WriteLE call sequence, not this constant). Triggers -Wunused-const-variable on Clang. Delete it; the layout comment above already documents the 45-byte size. 2. download_test.cc: ChunkFailureCancelsInFlightPeersFast captured 'kFailOffset' in a lambda, but it's a constexpr int64_t used only in a constant expression so the capture is redundant and -Wunused-lambda-capture flags it. Replace [kFailOffset] with [] to match the sister test's pattern. Also fix a latent issue surfaced during review: 3. MutexFstreamFileWriter::WriteAt now calls file_.clear() before seekp() so a prior failure doesn't permanently poison the stream and cause subsequent workers to surface a spurious 'write failed' instead of the original error. Positional writers are unaffected (pwrite/WriteFile are stateless). 71 tests still pass on Windows (35 in the affected suites verified explicitly). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two correctness gaps surfaced during PR review of the new resumable-download machinery; both bias toward silently destroying intact on-disk progress when a transient filesystem error happens. EnsureFileExistsAtSize (file_writer.cc): the previous implementation treated any std::filesystem::file_size() error as 'file does not exist' and fell through to opening an std::ofstream — which has implicit ios::trunc — over the path. A permission glitch, NFS stat hiccup, or virus-scanner-induced EBUSY on a file that *did* exist at the right size would wipe the partial download and force a restart from chunk 0. Now: only the no_such_file_or_directory case proceeds to (re)create; any other stat error throws so the resume bitmap on disk is preserved and the caller can retry. SaveState (blob_download_state.cc): the rename-failed fallback used to do remove(state_path) + rename(tmp_path, state_path). If the second rename also failed (sharing violation, EXDEV, etc.) we had already deleted the old sidecar — leaving nothing on disk and forcing a from-scratch restart on the next run. std::filesystem::rename atomically replaces on every platform we target (POSIX rename(2); Windows MoveFileExW REPLACE_EXISTING), so that fallback was both unnecessary and destructive. Now: on rename failure, just remove the tmp file and log a warning; the previous state_path is left intact and the next SaveState call retries with the up-to-date in-memory bitmap. All 61 download-related tests still pass on Windows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…eases The unlink-on-release behavior (POSIX explicit unlink, Windows FILE_FLAG_DELETE_ON_CLOSE) mirrored what the C# reference does but inherited the same theoretical race: between unlink and close on POSIX, another acquirer can O_CREAT a fresh inode at the path and flock that, leaving two processes briefly believing they hold the lock on different inodes. In our download protocol the race is benign because every acquirer immediately re-checks 'is the model already downloaded' under the new lock and returns a no-op — but the cleaner answer is to never open the window in the first place. Persist the .download.lock file across acquisitions: - POSIX State: drop the 'path' field; destructor just close()s. - Windows: drop FILE_FLAG_DELETE_ON_CLOSE; OPEN_ALWAYS opens the existing inode on re-acquire, and dwShareMode=0 still enforces exclusivity. Re-acquirers reopen the same inode — there is no path-to-inode race window anywhere in the lifecycle. The file is a few bytes of debug payload and lives alongside the model artifacts; no user-visible impact. Test update: ReleaseOnDestructionRemovesLockFile is replaced by ReleaseLeavesLockFileForReuse, which asserts both that the file persists and that a fresh TryAcquireForDirectory against the same directory succeeds. All 61 download-related tests pass on Windows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ross releases" This reverts commit be3c391.

The IFileWriter interface + FileWriterKind strategy enum + MutexFstream second implementation + runtime selection were speculative generality: the only real variation (Windows vs POSIX) is already a compile-time #ifdef, and nothing in production ever selected MutexFstream (download_manager always used the default Positional; MutexFstream existed only for a unit-test parameterization and a thresholdless perf benchmark). Replace it with a single concrete FileWriter (Open/WriteAt/Close), backed by pwrite (POSIX) / WriteFile+OVERLAPPED (Windows) via #ifdef. The Windows HANDLE is stored as void* so <windows.h> stays out of the header. AzureBlobDownloader now stack-allocates the writer; the FileWriterKind enum, constructor parameter, and selection branch are gone. Tests: drop the MutexFstream parameterization (TEST_P -> TEST) and the PositionalVsMutexFstream benchmark; the same correctness assertions now run once against FileWriter. The Azure test seam (GetBlobSize/DownloadChunkStreaming) and real-temp-file writes are unchanged. Verified (RelWithDebInfo): build clean; FileWriterTest (5), DownloadManagerTest (17), CrossProcessFileLockTest (9), BlobDownloadStateTest (15) all pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Address the download-lock review items: - Per-model in-process lock. Replace the single global download_mutex_ with a per-model mutex keyed on the resolved cache path. Two downloads of the same model serialize; downloads of different models run concurrently in-process instead of queuing behind each other's (up to 3 h) cross-process waits. - Close the POSIX flock()+unlink() orphan-inode race. After flock() succeeds, verify (fstat vs stat) that the inode we locked is still the file at the lock path; if a racing releaser unlinked it and a third process recreated it, drop the stale lock and report contention so the caller retries. This makes the self-cleaning unlink-on-release provably safe and guarantees two processes can never both believe they hold the lock - so a model can never be downloaded to the same directory twice at once, across any number of processes or apps. - Fix the misleading Windows ACCESS_DENIED comment: it is the DELETE_ON_CLOSE delete-pending window (STATUS_DELETE_PENDING), not "narrower access rights". - Document why the POSIX unlink-before-close is safe (fresh-fd non-blocking waiters; no work between unlink and close; the inode check above). - Decouple the lock-wait cadence from the progress heartbeat: poll the cancellation/heartbeat callback once per poll_interval instead of every 100 ms, so a user callback is not invoked ~10x/s for the whole wait. - Tests: add ConcurrentDownloadsOfDifferentModelsRunConcurrently (proves different models do not serialize). Existing same-model serialize and CrossProcessFileLock acquire/release/wait/cancel/timeout tests still pass (66 download-suite tests green; POSIX branch syntax-checked under g++). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…actor Rebasing onto main surfaced API changes that auto-merge could not reconcile: - blob_downloader.cc: drop the local EndsWith helper; main supplies the shared EndsWithIgnoreCase (used for the inference_model.json filter), so the local copy was unreferenced (/WX dead-code break). - download_test.cc: main moved ModelRegistryClient's test HTTP injection from a SetHttpGet setter (returning a JSON string) to constructor injection of an HttpGetResponseFn returning http::HttpResponse; update the one remaining branch-added test to the new pattern (MakeRegistryResponse). - download_manager.cc: refresh a stale comment that referenced the removed global download_mutex_ (now the per-model lock). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…boundary bitmap_byte_aligned_start marks the chunk index below which every chunk is implicitly complete. SaveState advanced it by accumulating +64 per fully-set word onto the (possibly unaligned) previous start, then adding the trailing-zero offset measured from the word base. When the previous start was not a multiple of 64, the two bases disagreed and the new start overshot by (start % 64), marking never-downloaded chunks complete on reload — silent, permanent data corruption once the sidecar is deleted on completion. Derive the new start from the word index directly (word_idx * 64 + trailing zero), independent of the previous start. Add a regression test that saves from a non-word-aligned prefix, extends it across a word boundary, saves again, and asserts the never-downloaded chunks remain pending after reload. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…s resumable Found via end-to-end SDK testing: interrupting a download (process kill / crash) left the model "cached" but full of zeros. Root cause: AzureBlobDownloader pre-allocates each data file to its full size (FileWriter::Open) before downloading chunks, but the .dlstate sidecar was only written at the first periodic save (after save_interval ~= 10 chunks, ~20 MB). IsDownloadNeeded treats "data file at full content_length + no sidecar" as a completed download and skips it. So a crash in the window between pre-allocation and the first periodic save left a full-size, mostly-empty file with no sidecar; the next run skipped it and wrote inference_model.json, marking the model complete while the weights were all zeros — silent, permanent corruption. Fix: persist the sidecar immediately after CreateNew, before Open() pre-allocates the file, upholding the invariant IsDownloadNeeded relies on ("pre-allocated but unfinished <=> sidecar present"). A subsequent run then sees the sidecar and resumes the missing chunks instead of skipping the file. Regression test (AzureBlobDownloaderResumeTest.SidecarExistsBeforeFirstChunkCompletes) asserts the sidecar is on disk the moment the first chunk is requested. Verified E2E through the Python SDK: download qwen2.5-0.5b, kill at ~18%, resume to completion — the result is now byte-for-byte (SHA-256) identical to a fresh uninterrupted download, including the 862 MB model.onnx.data. Previously that file was 100% zeros. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The per-chunk progress path (per_chunk_progress -> options.progress) could be entered concurrently by up to max_concurrency (default 64) chunk worker threads, but the public download progress API does not require the caller's callback to be thread-safe. A typical callback that updates a counter, UI handle, or logger would data-race. Guard the user callback invocation with a mutex so it is never re-entered concurrently; the atomics that compute the percentage are unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The sidecar was saved only every save_interval = max(10, num_chunks/50) chunks. On a slow connection that interval can span minutes, so a hard crash could lose minutes of completed download (all re-fetched on resume). Add a time cap (AzureBlobDownloader::save_state_interval_, default 3s): flush when the chunk count OR the elapsed wall-clock since the last save is reached, whichever first. This does not slow downloads: the check runs only at chunk completion (so it never flushes more often than chunks arrive), the sidecar write is tiny and not fsync'd, and it happens off the download critical path (network I/O and the file write take no state lock). On fast links the chunk count is still hit first, so save cadence is unchanged; on slow links the tiny extra writes land in the network-idle gaps and bound crash loss to seconds. The interval is an injectable member so tests can force the time path; TimeBasedSaveFlushesBeforeChunkInterval drives a 5-chunk blob (below the 10-chunk count interval) with a zero cap and asserts the sidecar is flushed mid-download. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

+  // On resume with the same offset blocked, we should still hit the failure
+  // but skip already-completed chunks. Strip the failure and rerun: the
+  // downloader should only process the chunks that weren't completed.


The comment described stripping the failure and re-running the download, but the test only loads the persisted sidecar and asserts a partial completed_count -- there is no second DownloadBlob call. Reword it to describe what the test actually verifies: that the sidecar records partial progress for a future resume. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

+      // Persist the sidecar now, before Open() pre-allocates the data file.
+      // IsDownloadNeeded treats "data file at full size + no sidecar" as a
+      // completed download and skips it. The periodic save below does not run
+      // until save_interval chunks are done (~16 MB), so a crash between
+      // pre-allocation and that first save would otherwise leave a full-size,
+      // mostly-empty file with no sidecar that the next run silently accepts as
+      // complete — serving zeros. Writing the sidecar up front upholds the
+      // invariant "pre-allocated but unfinished <=> sidecar present".
+      state->SaveState(logger_);


…sted The "pre-allocated but unfinished <=> sidecar present" invariant that lets IsDownloadNeeded distinguish a complete file from an in-progress one depends on the initial SaveState landing before FileWriter::Open() pre-allocates the full-size data file. That first save was best-effort/void: if it silently failed but Open() still sparse-allocated the file, a crash before the next periodic save left a full-size, mostly-empty file with no sidecar that the next run accepts as complete -- serving zeros. SaveState now returns bool. Periodic and finalization saves stay best-effort (return value ignored); the initial pre-allocation save treats false as fatal and throws before Open() runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Exercises the cross-process file-lock branch of DownloadModel that the in-process concurrency tests never reach: a second process (simulated by holding the lock directly) is mid-download on the same model directory. DownloadModel must observe the held lock, block in WaitForDirectoryLock without holding the in-process download mutex, and once the lock releases and inference_model.json is present, return the cached result via the post-lock recheck without re-downloading anything. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…el convention Adds DISABLED_DownloadFixture.ResumesPartialDownloadAfterCancel: cancels a real download once aggregate progress passes ~30% via the progress callback, then re-runs and asserts the .dlstate sidecar drove a partial resume (first reported percentage well above 0) rather than a fresh re-download. DISABLED so it only runs under the coverage/live path that downloads real artifacts. Also fixes two pre-existing DISABLED tests that returned `true` (== 1) from their progress callbacks. The download progress-callback convention is 0 = continue, non-zero = cancel, so `return true` cancelled the download immediately -- latent until this PR's linked cancellation made it bite. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

bmehta001 · 2026-06-19T23:46:57Z

+  EXPECT_GT(resume_progress.front(), 0.0f)
+      << "resume should start from the partial progress already on disk, not 0%";


Good catch — fixed in 4721485. DownloadModel emits the 0% heartbeat first, so I now skip leading zeros and assert the first non-zero sample (the on-disk reflection from DownloadBlobsToDirectory) lands well above a fresh-download first-chunk fraction, with a lenient, documented threshold to absorb the ~16 MB sidecar-save granularity.

DownloadManager::DownloadModel emits a 0% heartbeat (progress_cb(0.0f)) before the transfer starts and Model::Download forwards it unchanged, so the first captured sample in the live resume test was always that heartbeat -- the EXPECT_GT(front, 0) assertion would always fail when the DISABLED test runs. Skip leading zeros and assert the first *real* (non-zero) sample lands well above a fresh-download first-chunk fraction, validating the sidecar-driven partial resume. Internal-API tests that drive DownloadBlobsToDirectory directly bypass this heartbeat, which is why they were unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

bmehta001 · 2026-06-20T00:21:34Z

+        int64_t new_total = bytes_completed.fetch_add(size, std::memory_order_relaxed) + size;
+        if (bytes_written_cb) {
+          bytes_written_cb(new_total);
+        }


Done in e4f9dfc — moved the byte-accounting + bytes_written_cb call inside the try so a user-cancel throw routes through the same catch that runs azure_ctx.Cancel(), matching the chunk-failure path. Added UserCancelDrainsInFlightPeersFast, which parks every peer inside its chunk before the cancel fires so only the Azure-context cascade can drain them — verified ~120 ms with the change vs ~7.3 s without (i.e. one full chunk per peer), so the test genuinely guards the fix.

The per-chunk byte accounting and bytes_written_cb call sat just outside the try/catch that wraps DownloadChunkStreaming. On user cancellation bytes_written_cb (per_chunk_progress) sets the shared cancel flag and throws, so that throw unwound straight out of the worker without reaching the catch's azure_ctx.Cancel(). Peers blocked mid-chunk -- the real DownloadChunkStreaming relies on the Azure context for interruption and does not poll the flag between reads -- only noticed the cancel at their next top-of-loop check, i.e. after finishing their current 2 MB chunk. The chunk-failure path already cancelled promptly from its catch block. Move the accounting and callback inside the try so a user-cancel throw routes through the same catch and triggers azure_ctx.Cancel() immediately, matching the failure path. Adds UserCancelDrainsInFlightPeersFast, which parks every peer in its chunk before the cancel fires so only the Azure-context cascade can drain them (verified: ~120 ms with the fix, ~7.3 s without). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

skottmckay · 2026-06-19T23:13:05Z

+  Azure::Core::Context* azure_ctx;
+};
+
+AzureBlobDownloader::AzureBlobDownloader(ILogger* logger) : logger_(logger) {}


nit: logger should be required so pass as a reference. can probably store as a reference (I assume this class doesn't need to be copyable which would require a pointer for the member) as well.

skottmckay · 2026-06-19T23:13:53Z

+/// in-flight chunk read.
+struct AzureBlobDownloader::ChunkContext {
+  Azure::Storage::Blobs::BlobClient* blob_client;
+  Azure::Core::Context* azure_ctx;


Use const T& for both these.

In general, use const and & wherever possible as they tighten the usage.

skottmckay · 2026-06-19T23:24:50Z

+    size_t to_read =
+        static_cast<size_t>(std::min<int64_t>(remaining, static_cast<int64_t>(scratch.size())));


nit: AI needs to frequently be reminded it can use 120 chars. this could be a single line.

skottmckay · 2026-06-19T23:28:07Z

    // Add 429 (TooManyRequests) to the default retryable status codes.
    // Defaults already include: 408, 500, 502, 503, 504.
    client_options.Retry.StatusCodes.insert(Azure::Core::Http::HttpStatusCode::TooManyRequests);


Unrelated to this review, but out of interest should we be doing this? Is it just going to make a bad situation worse?

skottmckay · 2026-06-19T23:36:50Z

+    }
+    if (!state) {


nit: whitespace between separate blocks is more readable. might need to tell the agent to read .github\instructions\cpp-formatting.instructions.md. if I see it doing this I would typically say "read this and apply to the changes on this branch".

skottmckay · 2026-06-20T00:04:01Z

+    logger->Log(LogLevel::Warning,
+                "Failed to delete download state file: " + state_path.string() + " (" +
+                    ec.message() + ")");


Should some of these be Error level? If the user is going to have to potentially manually address an issue a Warning may be too low.

skottmckay · 2026-06-20T00:11:51Z

+        std::lock_guard<std::mutex> lock(state->mutex());
+        state->SaveState(logger_);
+      }
+      if (cancelled && cancelled->load(std::memory_order_relaxed)) {


do we need to do a load here?

skottmckay · 2026-06-20T00:14:02Z

+                                ? static_cast<float>(skipped_bytes) /
+                                      static_cast<float>(total_size) * 100.0f


nit: could be single line

skottmckay · 2026-06-20T00:19:12Z

+/// returns true, or `timeout` elapses.
+/// Throws FOUNDRY_LOCAL_ERROR_OPERATION_CANCELLED on cancellation, or
+/// FOUNDRY_LOCAL_ERROR_INTERNAL on timeout.
+std::unique_ptr<CrossProcessFileLock> WaitForDirectoryLock(


nit: make static member of class

skottmckay · 2026-06-20T00:26:14Z

+
+}  // namespace
+
+#ifdef _WIN32


nit: would be nice to structure it so that platform specific code is in src/platform like we do with ORT as that makes it easy to find such differences. could be follow up PR.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

bmehta001 · 2026-06-20T00:46:12Z

+          int64_t new_total = bytes_completed.fetch_add(size, std::memory_order_relaxed) + size;
+          if (bytes_written_cb) {
+            bytes_written_cb(new_total);
+          }


Fixed in fdfa267 — you're right, this was a regression from this PR's per-chunk rework (8225ce7). Restored the pre-rework behavior of reporting bytes_completed.load() (the global running total, always non-decreasing) after the add, instead of each worker's pre-add snapshot. Kept inside the try so the user-cancel fast-drain fix is unaffected. Verified the pre-rework code did exactly bytes_completed += total_read; bytes_written_cb(bytes_completed.load());.

The per-chunk progress rework reported each worker's own pre-add snapshot (bytes_completed.fetch_add(size) + size). Concurrent workers compute distinct snapshots and then race to call bytes_written_cb, so a worker holding a smaller snapshot can reach the user callback after one holding a larger one -- reported progress goes backwards, breaking the monotonic-non-decreasing contract that DISABLED_DownloadFixture.RemoveAndRedownloadSmallestModel asserts. Restore the pre-rework behavior of reporting bytes_completed.load() (the global running total, always non-decreasing) after the add. Stays inside the try, so the user-cancel fast-drain fix is unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

vercel Bot deployed to Preview June 9, 2026 17:48 View deployment

vercel Bot deployed to Preview June 9, 2026 19:53 View deployment

vercel Bot deployed to Preview June 10, 2026 09:26 View deployment

bmehta001 changed the title ~~sdk_v2/cpp: resumable downloads (cross-process lock + per-chunk state + linked cancellation)~~ [SDKv2] resumable downloads (cross-process lock + per-chunk state + linked cancellation) Jun 10, 2026

vercel Bot deployed to Preview June 10, 2026 14:21 View deployment

vercel Bot deployed to Preview June 10, 2026 14:44 View deployment

vercel Bot deployed to Preview June 10, 2026 15:18 View deployment

vercel Bot deployed to Preview June 17, 2026 20:02 View deployment

vercel Bot deployed to Preview June 18, 2026 00:36 View deployment

bmehta001 and others added 13 commits June 18, 2026 01:19

Revert "CrossProcessFileLock: drop self-cleanup; persist lock file ac…

d7a3072

…ross releases" This reverts commit be3c391.

bmehta001 force-pushed the bhamehta/flcore/resumable-downloads branch from b66251b to 1ee720c Compare June 18, 2026 08:34

vercel Bot deployed to Preview June 18, 2026 08:35 View deployment

vercel Bot deployed to Preview June 18, 2026 15:11 View deployment

Fix comment

4a961b5

vercel Bot deployed to Preview June 18, 2026 18:17 View deployment

Copilot stopped reviewing on behalf of bmehta001 due to an error June 19, 2026 22:35
Copilot had to stop work due to a timeout.

Copilot AI reviewed Jun 19, 2026

View reviewed changes

bmehta001 requested a review from Copilot June 19, 2026 22:40

Copilot started reviewing on behalf of bmehta001 June 19, 2026 22:41 View session

vercel Bot deployed to Preview June 19, 2026 22:41 View deployment

Copilot AI reviewed Jun 19, 2026

View reviewed changes

bmehta001 and others added 3 commits June 19, 2026 18:21

bmehta001 requested a review from Copilot June 19, 2026 23:22

vercel Bot deployed to Preview June 19, 2026 23:22 View deployment

Copilot started reviewing on behalf of bmehta001 June 19, 2026 23:22 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

bmehta001 requested a review from Copilot June 19, 2026 23:46

Copilot started reviewing on behalf of bmehta001 June 19, 2026 23:47 View session

vercel Bot deployed to Preview June 19, 2026 23:47 View deployment

Copilot AI reviewed Jun 19, 2026

View reviewed changes

bmehta001 requested a review from Copilot June 20, 2026 00:21

Copilot started reviewing on behalf of bmehta001 June 20, 2026 00:22 View session

vercel Bot deployed to Preview June 20, 2026 00:22 View deployment

skottmckay reviewed Jun 20, 2026

View reviewed changes

Copilot AI reviewed Jun 20, 2026

View reviewed changes

bmehta001 requested a review from Copilot June 20, 2026 00:46

Copilot started reviewing on behalf of bmehta001 June 20, 2026 00:46 View session

vercel Bot deployed to Preview June 20, 2026 00:46 View deployment

Copilot AI reviewed Jun 20, 2026

View reviewed changes

		EXPECT_GT(resume_progress.front(), 0.0f)
		<< "resume should start from the partial progress already on disk, not 0%";

		size_t to_read =
		static_cast<size_t>(std::min<int64_t>(remaining, static_cast<int64_t>(scratch.size())));

		? static_cast<float>(skipped_bytes) /
		static_cast<float>(total_size) * 100.0f

Conversation

bmehta001 commented Jun 9, 2026

Resumable downloads (C++ port)

Increment 1 — cross-process lock + skip-existing (commit 45c99d5d)

Increment 2 — per-chunk resume + linked cancellation (commit b6d98db5)

Open decisions (resolved)

Tests

Out of scope

Notes for reviewers

Uh oh!

vercel Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Increment 1 — cross-process lock + skip-existing (commit `45c99d5d`)

Increment 2 — per-chunk resume + linked cancellation (commit `b6d98db5`)

vercel Bot commented Jun 9, 2026 •

edited

Loading