GH-3561 Harden variants against malformed metadata.#3562
Open
steveloughran wants to merge 1 commit into
Open
Conversation
636e75d to
9905728
Compare
- reject oversized metadata/value declarations - reject oversize dictSize in objects - range checking Only low cost checks are made. There's also a depth check consistent with the json parser; it's arguable as to whether that is needed. It will defend against StackOverflowExceptions by anything trying to treewalk, but should that code be the place to do the checks? The new test suite TestHardenedReader can be configured to actually emit the malformed files, to see how applications deal with them. Contains contributions from Claude AI.
9905728 to
db1cfe2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Malformed parquet files could be distruptive enough to not only affect the execution of a single worker thread (which will ultimately reject it), but other threads on the same process. This can be disruptive.
What changes are included in this PR?
Only low cost checks are made, equivalent to arrow variant
try_new_with_metadata_and_shallow_validation()There's no equivalent
with_full_validation()logic is omitted. The caching logic of #3481 may be able to do this when it builds a dictionary, as range checking the increasing dictionary offsets is the key work there.There's also a depth check consistent with the json parser; it's arguable as to whether that is needed. It will defend against StackOverflowExceptions by anything trying to treewalk, but shouldn't that code be the place to do the checks?
Are these changes tested?
The new test suite TestHardenedReader can be configured to actually emit the malformed files, to see how applications deal with them.
Are there any user-facing changes?
No
Closes #3561