Skip to content

feat(frequencies): add borrowed-key update methods to the frequent items sketch#140

Open
k3dom wants to merge 3 commits into
apache:mainfrom
k3dom:frequencies-borrowed-update
Open

feat(frequencies): add borrowed-key update methods to the frequent items sketch#140
k3dom wants to merge 3 commits into
apache:mainfrom
k3dom:frequencies-borrowed-update

Conversation

@k3dom

@k3dom k3dom commented Jun 25, 2026

Copy link
Copy Markdown

Disclaimer: This contribution was written entirely by an AI coding
assistant (Claude Opus 4.8) and has only been manually reviewed.

Motivation

FrequentItemsSketch::update and update_with_count take the item by value
(T). When T is an owned, heap-allocated type such as String, a caller that
streams many repeated items must allocate an owned key for every observation —
including when the item is already tracked, in which case
adjust_or_put_value only compares the key by reference and then immediately
drops the freshly allocated value. For a stream dominated by a small set of
recurring items (a common case for a frequent-items sketch), that is one wasted
allocation per occurrence.

Change

Add borrowed-key update methods that allocate an owned key only when the item is
actually inserted:

  • FrequentItemsSketch::update_borrowed<Q>(&Q)
  • FrequentItemsSketch::update_with_count_borrowed<Q>(&Q, u64)
  • ReversePurgeItemHashMap::adjust_or_put_value_borrowed<Q>(&Q, u64) (backs the
    two above)

They mirror the existing update / update_with_count pair and are bounded by
T: Borrow<Q>, Q: Eq + Hash + ToOwned<Owned = T> + ?Sized, so for example a
FrequentItemsSketch<String> can be updated directly from a &str. The hit
path increments the counter with no allocation; the insert path calls
key.to_owned() exactly as the owned path does today. hash_item is relaxed to
T: Hash + ?Sized so unsized borrowed keys such as str can be hashed
directly.

Testing

  • Added doctests for both new sketch methods.
  • cargo test -p datasketches --features frequencies --lib and --doc pass
    (including the new doctests).

Add FrequentItemsSketch::update_borrowed and update_with_count_borrowed, backed by ReversePurgeItemHashMap::adjust_or_put_value_borrowed. They take the item by reference and only allocate an owned key on insertion, so a caller streaming many repeated items can increment an already-tracked item without allocating an owned key per occurrence. The owned update/update_with_count path allocates on every call, even when the item already exists and the owned value is immediately dropped.

hash_item is relaxed to T: Hash + ?Sized so unsized borrowed keys such as str can be hashed directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: kedom <kedom@vmcall.net>

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still in a 0.x series, so breaking changes can happen.

I suppose a refactor on item parameter to use K: Borrow<Q> always, like what HashMap does, would make the interface clear and concise to use.

What do you think?

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some local tests, I think we can have:

    pub fn estimate<Q>(&self, item: &Q) -> u64
    where
        T: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        let value = self.hash_map.get(item);
        if value > 0 { value + self.offset } else { 0 }
    }

    pub fn lower_bound<Q>(&self, item: &Q) -> u64
    where
        T: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        self.hash_map.get(item)
    }

    pub fn upper_bound<Q>(&self, item: &Q) -> u64
    where
        T: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        self.hash_map.get(item) + self.offset
    }

But for passing-in owned value for update/update_with_count, even HashMap::insert requires:

    pub fn insert(&mut self, k: K, v: V) -> Option<V> {
        self.base.insert(k, v)
    }

That is, the caller is responsible to create the owned value of type K. Trying to provide convenient interfaces for arbitrary types, said Q: Eq + Hash + ToOwned<Owned = T> + ?Sized would be random and introduce more confusion.

@tisonkun

tisonkun commented Jun 26, 2026

Copy link
Copy Markdown
Member

If you'd like to simplify between String and &str and don't care about the clone, you can have a caller-side helper function:

fn update_items(sketch: FrequentItemsSketch<String>, k: impl Into<String>) {
  let k = k.into();
  sketch.update(k);
}

@k3dom

k3dom commented Jun 26, 2026

Copy link
Copy Markdown
Author

Agree on making estimate/lower_bound/upper_bound generic over Borrow<Q> and ToOwned<Owned = T> being an odd bound, will change.

On update though, insert always stores the key so it has to own it, but adjust_or_put_value is get-or-increment: when the item's already tracked (the common case here) it just bumps a counter and never stores the key, so there's nothing to own. Closer to the entry API than insert. std's entry can't take a borrowed key either, which is why e.g. hashbrown has entry_ref. Otherwise a hot-key stream allocates a String per hit just to drop it.

@tisonkun

tisonkun commented Jun 26, 2026

Copy link
Copy Markdown
Member

entry_ref

https://docs.rs/hashbrown/latest/hashbrown/struct.HashMap.html#method.entry_ref

This looks like a good interface to follow. You may follow its naming and trait-bound conventions, and I'll give a review once you get all updates ready.

I'm fine with ToOwned now. But the API name needs some considerations, update_borrowed doesn't sound like a good name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants