fix: scope large binary storage and cleanup by execution id (#5280) [release/v1.2 backport]#5602
Open
kunwp1 wants to merge 2 commits into
Open
Conversation
### What changes were proposed in this PR? **What Caused the Issue:** LargeBinaryOutputStream looked up the S3 client twice: once in the upload worker (correct), and once again in close() during failed-upload cleanup. When a test left a stream unclosed, Python's GC eventually called __del__ → close() → the second lookup, but by then a different test was active, so the cleanup hit the wrong test's mock_s3 and broke its assert_called_once_with. **Proposed Fix** - Move `s3.delete_object(...)` from `_cleanup_failed_upload()` into the upload worker, reusing the `s3` client already captured by the closure that did the upload. - Drop the `_cleanup_failed_upload()` method and the call to it from `close()`; the worker now handles cleanup before recording the exception. - `close()` and `__del__` no longer call back into `large_binary_manager`, so a finalizer firing under a later test's monkey-patches cannot reach the wrong S3 client. ### Any related issues, documentation, or discussions? Closes: apache#5245 Follow-up to apache#4707; surfaced on the 3.12 leg of https://github.com/apache/texera/actions/runs/26481776334/job/77980417021. ### How was this PR tested? - Ran `ruff format` and `ruff check` over `amber/src/main/python` and `amber/src/test/python` (clean). - Existing tests in `test_large_binary_output_stream.py` still cover the relevant paths: `test_close_handles_upload_error`, `test_delete_object_failure_is_swallowed`, and `test_write_after_upload_error_raises_error`. - Simplified `test_write_after_upload_error_raises_error` back to inline form and removed the `_drained` helper, both no longer needed once cleanup is structurally contained. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.7 in compliance with ASF --------- Signed-off-by: Matthew B. <mgball@uci.edu> (cherry picked from commit 09aac02)
) ### What changes were proposed in this PR? Large binaries were stored in the shared `texera-large-binaries` bucket under flat keys `objects/{timestamp}/{uuid}` with no execution id, and `clearExecutionResources(eid)` deleted all of them via `LargeBinaryManager.deleteAllObjects()`. Any cleanup for one execution therefore erased every other execution's (and user's) large binaries. This PR namespaces every large binary by its execution id and scopes deletion: - Object keys are now `objects/{eid}/{uuid}` on both the JVM and Python workers. - The execution-scoped location is named by the controller and handed to workers as data on `WorkerConfig` — no protobuf change. The controller computes the base URI `s3://{bucket}/objects/{eid}/`, and `create()` appends a unique suffix; the JVM seeds the base URI onto the data-processing thread at startup, and the Python worker receives it as a startup argument. The user-facing `largebinary()` / `new LargeBinary()` APIs are unchanged. - Cleanup uses the new `LargeBinaryManager.deleteByExecution(eid)` (prefix delete of `objects/{eid}/`). Both JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both. - The `deleteAllObjects()` is removed. Pre-existing objects under the old `objects/{timestamp}/...` scheme are left untouched. ### Any related issues, documentation, discussions? Closes apache#4123. ### How was this PR tested? Import the following json file to create two workflows (You can configure the source operator to use any kinds of files you have), run them, and check if each execution creates 6 objects and one execution doesn't remove the other execution's large binary objects. [Large.Binary.Python (1).json](https://github.com/user-attachments/files/28369502/Large.Binary.Python.1.json) ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Anthropic), models Claude Opus 4.8, Claude Opus 4.7, and Claude Sonnet 4.6 --------- Signed-off-by: Kunwoo (Chris) <143021053+kunwp1@users.noreply.github.com> Co-authored-by: Xiaozhen Liu <xiaozl3@uci.edu> (cherry picked from commit 48e800e)
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## release/v1.2 #5602 +/- ##
==================================================
+ Coverage 51.41% 52.96% +1.54%
+ Complexity 2449 1410 -1039
==================================================
Files 1065 797 -268
Lines 41152 33085 -8067
Branches 4412 3317 -1095
==================================================
- Hits 21159 17523 -3636
+ Misses 18757 14721 -4036
+ Partials 1236 841 -395
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this PR?
Backport of #5280 ("scope large binary storage and cleanup by execution id") to
release/v1.2.As noted in #5569, #5280 cannot be backported as a single cherry-pick: its changes to
large_binary_output_stream.pybuild on top of #5249 ("keep failed-upload cleanup inside the upload worker"), which is not present onrelease/v1.2. Cherry-picking #5280 alone leaves a dangling reference to thelarge_binary_managermodule in_cleanup_failed_upload()(a method #5249 removes), producing aNameErrorat runtime.This PR therefore backports the dependency chain, both as clean
git cherry-pick -xof the original squash commits:#4707(which #5249 follows up on) is already present onrelease/v1.2, so the chain stops there. After applying both commits, every file touched by #5280 is byte-identical to its state onmainat the merged #5280 commit (48e800e4), andrelease/v1.2had no independent changes to any of those files.Any related issues, documentation, discussions?
Backports #5280 (which closes #4123 on
main). Prerequisite: #5249. Unblocks the #5569 backport, which can then cherry-pick cleanly ontorelease/v1.2.How was this PR tested?
This is a backport with no changes beyond the two cherry-picked commits, so it relies on the existing tests carried over from #5249 and #5280 (
LargeBinaryManagerSpec,LargeBinaryManagerUnitSpec,test_large_binary_output_stream.py,test_large_binary_manager.py, etc.).Backport fidelity was verified locally by confirming that, after the chain is applied onto
release/v1.2:main@48e800e4;deleteAllObjects()remain;deleteByExecutionis wired inWorkflowResourceandWorkflowService.Full compile and unit-test runs are left to CI.
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)