
GH-3561 Harden variants against malformed metadata.#3562

Open
steveloughran wants to merge 1 commit into
apache:masterfrom
steveloughran:pr/variant-hardening


@steveloughran
Contributor

@steveloughran steveloughran commented May 14, 2026

Rationale for this change

Malformed Parquet files could be disruptive enough to affect not only the single worker thread reading the file (which will ultimately reject it), but also other threads in the same process.

What changes are included in this PR?

  • reject oversized metadata/value declarations
  • reject oversize dictSize in objects
  • range checking

Only low-cost checks are made, equivalent to the Arrow variant's try_new_with_metadata_and_shallow_validation().
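To illustrate what a low-cost check of this kind might look like, here is a minimal sketch, not the code in this PR. It assumes the metadata layout from the Variant encoding spec (a header byte, then the dictionary size, then `dictSize + 1` offsets, then the string bytes) and that the metadata starts at index 0 of the buffer. The class and method names (`ShallowMetadataCheck`, `readUnsigned`, `validate`) are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of shallow variant-metadata validation; not the PR's code. */
final class ShallowMetadataCheck {
  private ShallowMetadataCheck() {}

  /** Reads an unsigned little-endian integer of 1 to 4 bytes after range checking. */
  static long readUnsigned(ByteBuffer buf, long pos, int size) {
    if (pos < 0 || pos + size > buf.limit()) {
      throw new IllegalArgumentException(
          "read of " + size + " bytes at " + pos + " is out of range");
    }
    long v = 0;
    for (int i = 0; i < size; i++) {
      v |= (buf.get((int) pos + i) & 0xFFL) << (8 * i);
    }
    return v;
  }

  /**
   * Rejects metadata whose declared sizes cannot fit in the buffer.
   * Individual offsets are not walked; that would be the "full" validation.
   *
   * @return the declared dictionary size
   */
  static long validate(ByteBuffer metadata) {
    int limit = metadata.limit();
    if (limit < 1) {
      throw new IllegalArgumentException("empty metadata");
    }
    int header = metadata.get(0) & 0xFF;
    int version = header & 0x0F;
    if (version != 1) {
      throw new IllegalArgumentException("unsupported version " + version);
    }
    // Top two bits of the header hold (offset size - 1).
    int offsetSize = ((header >> 6) & 0x03) + 1;
    long dictSize = readUnsigned(metadata, 1, offsetSize);

    // Long arithmetic: dictSize is below 2^32, so this cannot overflow.
    long offsetsStart = 1L + offsetSize;
    long bytesStart = offsetsStart + (dictSize + 1) * offsetSize;
    if (bytesStart > limit) {
      throw new IllegalArgumentException(
          "declared dictionary size " + dictSize + " exceeds buffer of " + limit + " bytes");
    }
    // The last offset is the total length of the string bytes.
    long lastOffset = readUnsigned(metadata, bytesStart - offsetSize, offsetSize);
    if (bytesStart + lastOffset > limit) {
      throw new IllegalArgumentException("string data extends past end of buffer");
    }
    return dictSize;
  }
}
```

The point of the sketch is that a hostile `dictSize` is rejected by arithmetic on the declared sizes alone, before any allocation or loop is driven by it.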

The equivalent of with_full_validation() is omitted. The caching logic of #3481 may be able to do this when it builds a dictionary, as range checking the increasing dictionary offsets is the key work there.

There's also a depth check consistent with the JSON parser; it's arguable whether that is needed. It will defend against a StackOverflowError in any code that walks the tree recursively, but shouldn't that code be the place to do the checks?
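For context on the alternative raised above, this is roughly what a depth check looks like when it lives in the tree-walking code itself. It is a sketch under assumptions: nested `List` values stand in for variant objects and arrays, and `DepthLimitedWalk`, `MAX_DEPTH`, `countLeaves` and `nest` are hypothetical names, with the limit value chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a depth-limited recursive walk; not the PR's code. */
final class DepthLimitedWalk {
  /** Illustrative limit only; the real limit is whatever the reader is configured with. */
  static final int MAX_DEPTH = 1000;

  private DepthLimitedWalk() {}

  /** Counts leaf values, failing fast once nesting exceeds MAX_DEPTH. */
  static int countLeaves(Object node, int depth) {
    if (depth > MAX_DEPTH) {
      throw new IllegalStateException("nesting depth exceeds " + MAX_DEPTH);
    }
    if (node instanceof List) {
      int leaves = 0;
      for (Object child : (List<?>) node) {
        leaves += countLeaves(child, depth + 1);
      }
      return leaves;
    }
    return 1;
  }

  /** Builds a single leaf wrapped in the given number of nested lists. */
  static Object nest(int levels) {
    Object value = "leaf";
    for (int i = 0; i < levels; i++) {
      List<Object> wrapper = new ArrayList<>();
      wrapper.add(value);
      value = wrapper;
    }
    return value;
  }
}
```

The walker fails with a catchable exception instead of overflowing the stack, but every walker has to remember to do this, which is the argument for checking in the reader as well.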

Are these changes tested?

The new test suite TestHardenedReader can be configured to actually emit the malformed files, to see how applications deal with them.

Are there any user-facing changes?

No

Closes #3561

@steveloughran steveloughran changed the title Harden variants against malformed metadata. GH-3561 Harden variants against malformed metadata. May 14, 2026
@steveloughran steveloughran force-pushed the pr/variant-hardening branch from 636e75d to 9905728 Compare May 14, 2026 11:55
- reject oversized metadata/value declarations
- reject oversize dictSize in objects
- range checking

Only low-cost checks are made.

There's also a depth check consistent with the JSON parser; it's arguable whether that is needed. It will defend against a StackOverflowError in any code that walks the tree recursively, but should that code be the place to do the checks?

The new test suite TestHardenedReader can be configured to actually emit the malformed files, to see how applications deal with them.

Contains contributions from Claude AI.

Development

Successfully merging this pull request may close these issues.

Harden variant decoding
