Skip to content

Use HeadObject instead of GetObject for S3 metadata#565

Merged
pjbull merged 2 commits into
drivendataorg:1390-live-testsfrom
mattijsdp:head-object-for-s3-metadata
Jun 10, 2026
Merged

Use HeadObject instead of GetObject for S3 metadata#565
pjbull merged 2 commits into
drivendataorg:1390-live-testsfrom
mattijsdp:head-object-for-s3-metadata

Conversation

@mattijsdp

Copy link
Copy Markdown
Contributor

Closes #564

S3Client._get_metadata read object metadata via ObjectSummary.get() (a GetObject), which issues a full object read and leaves an unread response stream instead of returning the connection to the pool. Switch to head_object, which reads only the metadata and keeps Content-Length on S3-compatible gateways that drop it from GetObject responses, while bringing S3 in line with the other backends.

Since head_object returns no body, a missing key surfaces as a generic 404 ClientError rather than NoSuchKey, so S3Path.stat now translates a 404 to NoStatError and lets other errors (e.g. 403) propagate. Updates the S3 test mock and adds regression tests.


Contributor checklist:

  • I have read and understood CONTRIBUTING.md
  • Confirmed an issue exists for the PR, and the text Closes #issue appears in the PR summary (e.g., Closes #123).
  • Confirmed PR is rebased onto the latest base
  • Confirmed failure before change and success after change
  • Any generic new functionality is replicated across cloud providers if necessary
  • Tested manually against live server backend for at least one provider
  • Added tests for any new functionality
  • Linting passes locally
  • Tests pass locally
  • Updated HISTORY.md with the issue that is addressed and the PR you are submitting. If the top section is not `## UNRELEASED``, then you need to add a new section to the top of the document for your change.

@mattijsdp mattijsdp marked this pull request as draft June 9, 2026 17:54

@pjbull pjbull left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looking good, a few quick comments in there.

Did you test against your s3 compatible server and real s3?

Also, did you make sure linting passes?

data = self.client.head_object(
Bucket=cloud_path.bucket,
Key=cloud_path.key,
**self.boto3_dl_extra_args,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does head_object respect what we collect in these kwargs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread HISTORY.md Outdated

## UNRELEASED

- Changed `S3Client._get_metadata` to read object metadata with `HeadObject` instead of `GetObject`, so `stat`, `etag`, and `size` no longer open the object body. Also fixes a `KeyError` on `ContentLength` against S3-compatible gateways that drop `Content-Length` from `GetObject` responses. (Issue [#564](https://github.com/drivendataorg/cloudpathlib/issues/564), PR [#TODO])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add pr number

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mattijsdp

Copy link
Copy Markdown
Contributor Author

Yes, linting passed and checked on real s3 bucket and s3 compatible server (lakefs).

@mattijsdp mattijsdp marked this pull request as ready for review June 9, 2026 18:26
@mattijsdp

Copy link
Copy Markdown
Contributor Author

Code quality errors are unrelated to my changes I think.

@pjbull

pjbull commented Jun 9, 2026

Copy link
Copy Markdown
Member

Looks like linter package version updates. I'll get it fixed on main so you can rebase.

@pjbull

pjbull commented Jun 9, 2026

Copy link
Copy Markdown
Member

@mattijsdp fixed on the main branch if you want to rebase this. Thanks for bearing with.

mattijsdp and others added 2 commits June 10, 2026 09:53
S3Client._get_metadata read object metadata via ObjectSummary.get()
(a GetObject), which issues a full object read and leaves an unread
response stream instead of returning the connection to the pool. Switch
to head_object, which reads only the metadata and keeps Content-Length
on S3-compatible gateways that drop it from GetObject responses, while
bringing S3 in line with the other backends.

Since head_object returns no body, a missing key surfaces as a generic
404 ClientError rather than NoSuchKey, so S3Path.stat now translates a
404 to NoStatError and lets other errors (e.g. 403) propagate. Updates
the S3 test mock and adds regression tests.

Closes drivendataorg#564

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mattijsdp mattijsdp force-pushed the head-object-for-s3-metadata branch from edce3fa to 8f0a536 Compare June 10, 2026 08:54
@mattijsdp

Copy link
Copy Markdown
Contributor Author

Rebased onto latest main (now includes #566). Lint (black/flake8/mypy) and the test suite pass locally. Thanks for the quick fix @pjbull!

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.2%. Comparing base (27d749e) to head (8f0a536).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #565     +/-   ##
========================================
- Coverage    93.9%   93.2%   -0.8%     
========================================
  Files          28      28             
  Lines        2222    2226      +4     
========================================
- Hits         2088    2075     -13     
- Misses        134     151     +17     
Files with missing lines Coverage Δ
cloudpathlib/s3/s3client.py 92.9% <100.0%> (-1.5%) ⬇️
cloudpathlib/s3/s3path.py 100.0% <100.0%> (ø)

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pjbull

pjbull commented Jun 10, 2026

Copy link
Copy Markdown
Member

live tests expected to fail on fork, moving to repo-local branch to test real backends.

@pjbull pjbull changed the base branch from master to 1390-live-tests June 10, 2026 22:41
@pjbull pjbull merged commit 49ffa58 into drivendataorg:1390-live-tests Jun 10, 2026
26 of 27 checks passed
pjbull added a commit that referenced this pull request Jun 10, 2026
* Use HeadObject instead of GetObject for S3 metadata

S3Client._get_metadata read object metadata via ObjectSummary.get()
(a GetObject), which issues a full object read and leaves an unread
response stream instead of returning the connection to the pool. Switch
to head_object, which reads only the metadata and keeps Content-Length
on S3-compatible gateways that drop it from GetObject responses, while
bringing S3 in line with the other backends.

Since head_object returns no body, a missing key surfaces as a generic
404 ClientError rather than NoSuchKey, so S3Path.stat now translates a
404 to NoStatError and lets other errors (e.g. 403) propagate. Updates
the S3 test mock and adds regression tests.

Closes #564



* add PR number

---------

Co-authored-by: Mattijs De Paepe <33912270+mattijsdp@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

S3Client._get_metadata uses GetObject where HeadObject would do

2 participants