feat(frequencies): add borrowed-key update methods to the frequent items sketch#140
feat(frequencies): add borrowed-key update methods to the frequent items sketch#140k3dom wants to merge 3 commits into
Conversation
Add FrequentItemsSketch::update_borrowed and update_with_count_borrowed, backed by ReversePurgeItemHashMap::adjust_or_put_value_borrowed. They take the item by reference and only allocate an owned key on insertion, so a caller streaming many repeated items can increment an already-tracked item without allocating an owned key per occurrence. The owned update/update_with_count path allocates on every call, even when the item already exists and the owned value is immediately dropped. hash_item is relaxed to T: Hash + ?Sized so unsized borrowed keys such as str can be hashed directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: kedom <kedom@vmcall.net>
tisonkun
left a comment
There was a problem hiding this comment.
We are still in a 0.x series, so breaking changes can happen.
I suppose a refactor on item parameter to use K: Borrow<Q> always, like what HashMap does, would make the interface clear and concise to use.
What do you think?
There was a problem hiding this comment.
After some local tests, I think we can have:
pub fn estimate<Q>(&self, item: &Q) -> u64
where
T: Borrow<Q>,
Q: Hash + Eq + ?Sized,
{
let value = self.hash_map.get(item);
if value > 0 { value + self.offset } else { 0 }
}
pub fn lower_bound<Q>(&self, item: &Q) -> u64
where
T: Borrow<Q>,
Q: Hash + Eq + ?Sized,
{
self.hash_map.get(item)
}
pub fn upper_bound<Q>(&self, item: &Q) -> u64
where
T: Borrow<Q>,
Q: Hash + Eq + ?Sized,
{
self.hash_map.get(item) + self.offset
}But for passing-in owned value for update/update_with_count, even HashMap::insert requires:
pub fn insert(&mut self, k: K, v: V) -> Option<V> {
self.base.insert(k, v)
}That is, the caller is responsible to create the owned value of type K. Trying to provide convenient interfaces for arbitrary types, said Q: Eq + Hash + ToOwned<Owned = T> + ?Sized would be random and introduce more confusion.
|
If you'd like to simplify between fn update_items(sketch: FrequentItemsSketch<String>, k: impl Into<String>) {
let k = k.into();
sketch.update(k);
} |
|
Agree on making estimate/lower_bound/upper_bound generic over On update though, insert always stores the key so it has to own it, but |
https://docs.rs/hashbrown/latest/hashbrown/struct.HashMap.html#method.entry_ref This looks like a good interface to follow. You may follow its naming and trait-bound conventions, and I'll give a review once you get all updates ready. I'm fine with ToOwned now. But the API name needs some considerations, |
Motivation
FrequentItemsSketch::updateandupdate_with_counttake the item by value(
T). WhenTis an owned, heap-allocated type such asString, a caller thatstreams many repeated items must allocate an owned key for every observation —
including when the item is already tracked, in which case
adjust_or_put_valueonly compares the key by reference and then immediatelydrops the freshly allocated value. For a stream dominated by a small set of
recurring items (a common case for a frequent-items sketch), that is one wasted
allocation per occurrence.
Change
Add borrowed-key update methods that allocate an owned key only when the item is
actually inserted:
FrequentItemsSketch::update_borrowed<Q>(&Q)FrequentItemsSketch::update_with_count_borrowed<Q>(&Q, u64)ReversePurgeItemHashMap::adjust_or_put_value_borrowed<Q>(&Q, u64)(backs thetwo above)
They mirror the existing
update/update_with_countpair and are bounded byT: Borrow<Q>, Q: Eq + Hash + ToOwned<Owned = T> + ?Sized, so for example aFrequentItemsSketch<String>can be updated directly from a&str. The hitpath increments the counter with no allocation; the insert path calls
key.to_owned()exactly as the owned path does today.hash_itemis relaxed toT: Hash + ?Sizedso unsized borrowed keys such asstrcan be hasheddirectly.
Testing
cargo test -p datasketches --features frequencies --liband--docpass(including the new doctests).