Skip to content

fix(frequencies): write the full empty preamble long in serialize#143

Merged
tisonkun merged 1 commit into
apache:mainfrom
k3dom:frequencies-empty-serialization
Jul 3, 2026
Merged

fix(frequencies): write the full empty preamble long in serialize#143
tisonkun merged 1 commit into
apache:mainfrom
k3dom:frequencies-empty-serialization

Conversation

@k3dom

@k3dom k3dom commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

What

FrequentItemsSketch::serialize emits only 6 bytes for an empty sketch: the preamble declares one 8-byte preamble long, but serialization stops after the flags byte. deserialize consumes the full 8-byte preamble before checking the empty flag, so a serialized empty sketch always fails to deserialize with insufficient data: <unused>. The truncated form also diverges from the Java and C++ implementations, which both write the full preamble long for empty sketches.

Empty sketches are reachable from non-trivial streams: a map saturated with count-1 items (e.g. unique identifiers) purges every counter, leaving an empty sketch that then cannot be persisted and restored.

Changes

  • Write the two unused padding bytes in the empty path of serialize_inner, making the empty encoding one full preamble long and byte-compatible with the Java/C++ readers.
  • Add round-trip tests for a fresh empty sketch (pinning the 8-byte length) and for a sketch purged to empty.
  • Add a changelog entry under Unreleased.

An empty FrequentItemsSketch serialized to 6 bytes: the preamble
declares one 8-byte preamble long, but serialization stopped after the
flags byte. Deserialization (in Rust as well as the Java and C++
readers) consumes the full 8-byte preamble, so a serialized empty
sketch failed to deserialize with an insufficient-data error.

Emptiness is reachable from non-trivial streams: a map saturated with
count-1 items purges every counter, so pipelines that persist sketches
can hit the failure on real data.

Write the two unused padding bytes so the empty encoding is one full
preamble long, matching the Java and C++ implementations.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@tisonkun tisonkun merged commit cafebaa into apache:main Jul 3, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants