Skip to content

fix(workflow-core): paginate S3 deleteDirectory deletions#5569

Open
kunwp1 wants to merge 5 commits into
apache:mainfrom
kunwp1:fix/s3-delete-directory-pagination
Open

fix(workflow-core): paginate S3 deleteDirectory deletions#5569
kunwp1 wants to merge 5 commits into
apache:mainfrom
kunwp1:fix/s3-delete-directory-pagination

Conversation

@kunwp1

@kunwp1 kunwp1 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

S3StorageClient.deleteDirectory listed objects with a single listObjectsV2 call and issued one deleteObjects batch. Both S3 APIs cap at 1000 keys per call, so for any prefix holding more than 1000 objects only the first 1000 were deleted and the rest causes a storage leak. This affects dataset deletion (DatasetResource) and per-execution cleanup (LargeBinaryManager), either of which can exceed 1000 objects under one prefix.

This PR:

  • Lists via listObjectsV2Paginator, which follows the continuation token across all pages, and deletes in batches of at most 1000 keys. Keys are streamed so memory stays bounded to a single batch.
  • Inspects each DeleteObjects response and throws if any key failed.

Any related issues, documentation, discussions?

Closes #5281

How was this PR tested?

  1. Create more than 1000 files for i in {1..1100}; do printf 'x' > "file_$i.txt"; done
  2. Upload them in a dataset. (There is a frontend memory issue when you upload all 1100 files at the same time. Try to upload batch-by-batch)
  3. Delete the dataset.
  4. Check if all the files are removed in the minio console. (Before this fix, some files remain)

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

deleteDirectory issued a single listObjectsV2 call (capped at 1000 keys
per page) and one deleteObjects batch, so any objects beyond the first 1000
under a prefix were silently orphaned. AWS DeleteObjects also accepts at
most 1000 keys per request, and reports per-key failures in its response
rather than throwing.

List via listObjectsV2Paginator, which transparently follows the
continuation token across all pages, deleting keys in batches of at most
1000. Inspect each DeleteObjects response and raise if any key failed, so
partial deletions are no longer swallowed. Also remove the unused MD5
computation (digested over the request's toString and never used).

Closes apache#5281
@kunwp1 kunwp1 force-pushed the fix/s3-delete-directory-pagination branch from 1b4b0c3 to e3deaba Compare June 8, 2026 22:07
@codecov-commenter

codecov-commenter commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.10%. Comparing base (cd60535) to head (fc87879).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5569      +/-   ##
============================================
- Coverage     52.42%   52.10%   -0.32%     
  Complexity     2481     2481              
============================================
  Files          1070     1067       -3     
  Lines         41359    41262      -97     
  Branches       4441     4439       -2     
============================================
- Hits          21682    21500     -182     
- Misses        18406    18494      +88     
+ Partials       1271     1268       -3     
Flag Coverage Δ *Carryforward flag
access-control-service 64.61% <ø> (ø)
agent-service 33.76% <ø> (ø) Carriedforward from 5f54275
amber 53.29% <100.00%> (+0.01%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø)
config-service 56.06% <ø> (ø)
file-service 38.21% <ø> (ø)
frontend 46.36% <ø> (-0.67%) ⬇️ Carriedforward from 5f54275
pyamber 90.61% <ø> (-0.11%) ⬇️ Carriedforward from 5f54275
python 90.75% <ø> (ø) Carriedforward from 5f54275
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

kunwp1 added 2 commits June 8, 2026 15:22
…ry docs

- Cap throwOnDeleteErrors enumeration at MAX_LISTED_DELETE_ERRORS (10) and summarize the remainder, so a fully-failed batch can't produce an unbounded message.
- Update the LargeBinaryManager doc that still described the pre-pagination single-page (<=1000) behavior of deleteDirectory.
@kunwp1 kunwp1 requested a review from Xiao-zhen-Liu June 9, 2026 07:59
@kunwp1

kunwp1 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/request-review @Xiao-zhen-Liu

@Yicong-Huang

Copy link
Copy Markdown
Contributor

@xuang7 I think this may need to be back ported to v1.2

@xuang7 xuang7 added the release/v1.2 back porting to release/v1.2 label Jun 10, 2026
@Yicong-Huang

Copy link
Copy Markdown
Contributor

@kunwp1 if you agree this needs backport, it may need to be manually backported as there are commits missing between two branches.

@kunwp1

kunwp1 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Needs to be backported because it's a bug fix.

@kunwp1

kunwp1 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

I noticed that the conflict comes from this PR: #5280 . Might be better to backport this #5280 @xuang7

@xuang7

xuang7 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

I noticed that the conflict comes from this PR: #5280 . Might be better to backport this #5280 @xuang7

Since this fix needs to be manually backported and is related to #5280, could you open a backport PR for #5280 targeting the release/v1.2 branch? Thanks!

@kunwp1

kunwp1 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

I noticed that the conflict comes from this PR: #5280 . Might be better to backport this #5280 @xuang7

Since this fix needs to be manually backported and is related to #5280, could you open a backport PR for #5280 targeting the release/v1.2 branch? Thanks!

Sounds good! I'll create a PR that includes both #5249 and #5280

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common fix release/v1.2 back porting to release/v1.2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

S3StorageClient.deleteDirectory only deletes the first ≤1000 objects per prefix

4 participants