Skip to content

BitBuffer binary operations support in place updates to lhs#8162

Open
robert3005 wants to merge 6 commits into
developfrom
rk/optimise-bit-buffer
Open

BitBuffer binary operations support in place updates to lhs#8162
robert3005 wants to merge 6 commits into
developfrom
rk/optimise-bit-buffer

Conversation

@robert3005
Copy link
Copy Markdown
Contributor

Revive #7375

This speeds up And/Or/Xor - 50% to 4x time
Append - 2x (by shifting the work elsewhere in the code)
append_buffer - 10-50%

claude and others added 6 commits May 29, 2026 16:33
Key optimizations:
- `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut
  which always failed (clone shares Arc, so 2 refs = failure)
  → 27-58% faster bitwise NOT on references
- `iter_bits()`: process u64 chunks instead of byte-by-byte
  → 13% faster iteration
- `PartialEq`: fast path using memcmp for byte-aligned buffers
- `append_buffer()`: memcpy fast path when both sides are byte-aligned
  → 20-52% faster buffer appends
- `append_false()`: remove unnecessary branch (new bytes are zero-init)
  → 65% faster single-bit appends
- `from_indices()`: use set_bit_unchecked directly on the byte slice
- `FromIterator` tail: batch remaining items in u64 words
  → 13% faster from_iter
- `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path,
  fix byte range bug in aligned path
- `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy
- `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead
  of per-bit get+append
- Add #[inline] to hot methods: set, set_to, append, true_count, etc.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
BitBuffers almost always have shared backing storage (from slicing,
array construction, etc.), so try_into_mut nearly always fails. The
owned `Not for BitBuffer` now uses the same direct-copy path as the
reference version, and the dead `bitwise_unary_op` function is removed.

For true in-place mutation, use `BitBufferMut` directly.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
Keep the in-place mutation fast path for owned BitBuffer when the
backing storage has exclusive access (Arc refcount == 1). When
try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy
instead of duplicating the copy logic.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously
always allocated a new buffer because all binary ops took references.
Now owned-left operands try try_into_mut for zero-allocation in-place
mutation when the backing storage has exclusive access.

Changes:
- Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned
  left, falls back to allocating bitwise_binary_op
- Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path
- Add BitBuffer::into_bitand_not for owned variant
- Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer
  for in-place binary ops in the scan loop
- Update Mask::bitand_not to use owned BitBuffer path
- Fix flat/zoned readers to capture density before consuming mask

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
…no-op micro-opts

Rebased onto develop, then verified the optimizations with benchmarks and
hardened the ones that pay off.

Correctness fixes (each with a regression test added first):
- `bitwise_binary_op_lhs_owned`: the in-place path combined operands
  byte-for-byte, ignoring bit offsets, so `owned & &rhs` corrupted results when
  `left.offset() != right.offset()` (reachable via sliced masks in the scan
  loop). Now falls back to the offset-aware `bitwise_binary_op` unless the
  offsets match. Added an rstest comparing the owned path against the reference
  path across every offset/length/op combination.
- `append_false` dropped its read-modify-write on the assumption that bits past
  `len` are zero, but `truncate` left stale bits in the final partial byte —
  so `truncate(n)` then `append_false()` read back as true. `truncate` now
  clears the vacated bits, restoring the invariant `append_false`/`append_buffer`
  rely on (keeps the ~2.3x single-bit append win).
- vortex-cuda's flat reader had the same `mask.density()`-after-move pattern the
  flat/zoned readers were fixed for; capture density before consuming the mask.

Simplification / common patterns:
- Replaced three hand-rolled unsafe u64-chunk loops (`bitwise_unary_op_copy`,
  `bitwise_unary_op_mut`, owned binary in-place) with safe shared helpers
  `read_u64_le` / `map_u64_words_in_place` / `zip_u64_words_in_place` built on
  `chunks_exact` + `u64::{from,to}_le_bytes`. No `unsafe`, portable to
  big-endian (the raw `read_unaligned::<u64>` was a latent BE bug), and the
  owned in-place AND stays 3.7-4.7x faster than allocating at 16K-64K bits.

Dropped optimizations that benchmarks showed don't pay off:
- `FromIterator` 64-bit tail batching: never exercised by exact-size iterators
  (the common case), measured ~1.0x, ~60 lines of intricate unsafe. Reverted to
  a simple `append` loop now that `append` is fast.

Other notes (no code change here):
- The filter and `from_indices` optimizations from the original branch are
  superseded by develop's PEXT filter and u64-word `from_indices`; taken from
  develop during the rebase.
- `iter_bits`' u64-chunk rewrite is kept (cleaner/safer than the byte-by-byte
  original) but it has no non-test callers and does not feed `iter()`.

Added a `bitand_owned_lhs_vortex_buffer` bench so the owned in-place path has
ongoing coverage.

Signed-off-by: Robert Kruszewski <dzobert@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Robert Kruszewski <github@robertk.io>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 29, 2026

Merging this PR will improve performance by 30.89%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 20 improved benchmarks
❌ 3 regressed benchmarks
✅ 1252 untouched benchmarks
🆕 5 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(100, 100)] 98 µs 87.6 µs +11.89%
Simulation chunked_opt_bool_canonical_into[(100, 100)] 199.6 µs 180.8 µs +10.41%
Simulation chunked_opt_bool_canonical_into[(1000, 10)] 55.1 µs 35 µs +57.21%
Simulation chunked_opt_bool_into_canonical[(1000, 10)] 69 µs 49.9 µs +38.06%
Simulation chunked_varbinview_opt_canonical_into[(1000, 10)] 188.1 µs 215.2 µs -12.6%
Simulation chunked_bool_canonical_into[(1000, 10)] 45 µs 34.8 µs +29.26%
Simulation varbinview_zip_block_mask 2.9 ms 3.4 ms -14.06%
Simulation fsst_compress_string 13.8 ms 12.3 ms +12.67%
Simulation compress_fsst[(10000, 4, 4)] 1.5 ms 1.4 ms +11.41%
Simulation compress_fsst[(10000, 4, 8)] 1.6 ms 1.4 ms +11.14%
Simulation append_vortex_buffer[1024] 14.7 µs 7.5 µs +95.75%
🆕 Simulation bitand_owned_lhs_vortex_buffer[65536] N/A 11.3 µs N/A
Simulation append_vortex_buffer[128] 2.1 µs 1.1 µs +85.46%
🆕 Simulation bitand_owned_lhs_vortex_buffer[1024] N/A 3.5 µs N/A
🆕 Simulation bitand_owned_lhs_vortex_buffer[128] N/A 3.3 µs N/A
🆕 Simulation bitand_owned_lhs_vortex_buffer[16384] N/A 5.3 µs N/A
🆕 Simulation bitand_owned_lhs_vortex_buffer[2048] N/A 3.6 µs N/A
Simulation append_buffer_vortex_buffer[1024] 15.6 µs 12.4 µs +25.48%
Simulation append_buffer_vortex_buffer[128] 11.5 µs 10.1 µs +13.98%
Simulation append_vortex_buffer[16384] 231.1 µs 117.2 µs +97.19%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing rk/optimise-bit-buffer (c8bc973) with develop (73454db)

Open in CodSpeed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants