GH-3522: Add batch read APIs to ValuesReader hierarchy #3535

Open
iemejia wants to merge 10 commits into apache:master from iemejia:perf-batch-read-api

iemejia commented May 1, 2026

Summary

  • Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations
  • Override in specialized readers to amortize per-value overhead across batches
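
As a rough illustration of the default-plus-override pattern described above, here is a minimal sketch. The class names and the `(dest, offset, count)` signature are assumptions made for this example; they are not taken from the PR diff.

```java
// Hypothetical stand-in for a value reader; not the actual parquet-java class.
abstract class SketchValuesReader {
    /** Reads the next single value. */
    abstract int readInteger();

    /** Default batch read: one readInteger() call per value. */
    void readIntegers(int[] dest, int offset, int count) {
        for (int i = 0; i < count; i++) {
            dest[offset + i] = readInteger();
        }
    }
}

/** Reader over a pre-decoded int[] that overrides the batch path with one bulk copy. */
final class ArrayBackedReader extends SketchValuesReader {
    private final int[] values;
    private int position;

    ArrayBackedReader(int[] values) {
        this.values = values;
    }

    @Override
    int readInteger() {
        return values[position++];
    }

    @Override
    void readIntegers(int[] dest, int offset, int count) {
        // One arraycopy replaces `count` virtual calls and bounds checks.
        System.arraycopy(values, position, dest, offset, count);
        position += count;
    }
}
```

Callers that only know the base type still get the faster path, because the override is selected once per batch rather than once per value.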

Overrides

  • RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
  • DictionaryValuesReader: batch-decode dictionary IDs first, then batch-lookup values (eliminates per-value IOException try/catch)
  • DeltaBinaryPackingValuesReader: System.arraycopy from pre-decoded buffer
  • PlainValuesReader (all types): loop over LittleEndianDataInputStream
  • ByteStreamSplitValuesReader (all types): indexed ByteBuffer bulk read
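
The two-phase dictionary decode (IDs first, then lookups) could look roughly like the sketch below. `IdDecoder` is a hypothetical interface standing in for the RLE/bit-packed ID decoder, and `UncheckedIOException` is used here only for illustration; the PR may wrap the exception differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class DictionaryBatchSketch {
    /** Hypothetical ID source standing in for the dictionary-index decoder. */
    interface IdDecoder {
        void readInts(int[] dest, int offset, int count) throws IOException;
    }

    static void readLongs(IdDecoder ids, long[] dictionary, long[] dest, int offset, int count) {
        int[] idBuffer = new int[count];
        try {
            // Phase 1: decode all IDs; the checked exception is handled once per batch.
            ids.readInts(idBuffer, 0, count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Phase 2: plain array lookups with no try/catch inside the loop.
        for (int i = 0; i < count; i++) {
            dest[offset + i] = dictionary[idBuffer[i]];
        }
    }
}
```

A real reader would likely reuse the ID buffer across calls instead of allocating it per batch.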

Rationale

These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. Combined with other optimizations in this series (ByteBuffer-based RLE decoder, etc.), batch reads yielded +148% RLE throughput and +67% dictionary decode throughput over per-value loops on the perf branch.

All 576 parquet-column tests pass.

iemejia force-pushed the perf-batch-read-api branch 2 times, most recently from 775d723 to bcf585a on May 13, 2026 19:27
iemejia added 10 commits May 13, 2026 21:30
Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods
to ValuesReader with default loop-based implementations. Override in:

- RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and
  packed groups using Arrays.fill/System.arraycopy
- DictionaryValuesReader: batch-decode dictionary IDs first, then
  batch-lookup values (eliminates per-value IOException try/catch)
- DeltaBinaryPackingValuesReader: System.arraycopy from pre-decoded buffer
- PlainValuesReader (all types): loop over LittleEndianDataInputStream
- ByteStreamSplitValuesReader (all types): indexed ByteBuffer bulk read

These APIs enable callers to amortize per-value overhead (virtual dispatch,
bounds checks, mode switches) across batches. On the perf branch where
the RLE decoder uses ByteBuffer, this yielded +148% RLE throughput and
+67% dictionary decode throughput.

RunLengthBitPackingHybridValuesReader inherited the default loop from
ValuesReader.readIntegers() which called readInteger() per value.
Delegate to decoder.readInts() which uses Arrays.fill for RLE runs
and System.arraycopy for packed groups.
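
A simplified sketch of a batch read that spans RLE runs and packed groups. The `Run` model below is hypothetical: it assumes packed groups are already unpacked into an `int[]`, whereas the real decoder reads run headers from the encoded stream.

```java
import java.util.Arrays;

final class HybridRunSketch {
    /** One decoded run: either a repeated value (RLE) or a buffer of unpacked values. */
    static final class Run {
        final int[] packed; // null for an RLE run
        final int value;    // repeated value for an RLE run
        final int length;

        Run(int value, int length) {
            this.packed = null;
            this.value = value;
            this.length = length;
        }

        Run(int[] packed) {
            this.packed = packed;
            this.value = 0;
            this.length = packed.length;
        }
    }

    private final Run[] runs;
    private int runIndex;
    private int consumed; // values already taken from runs[runIndex]

    HybridRunSketch(Run... runs) {
        this.runs = runs;
    }

    void readInts(int[] dest, int offset, int count) {
        while (count > 0) {
            if (runIndex >= runs.length) {
                throw new IllegalStateException("not enough values");
            }
            Run run = runs[runIndex];
            int n = Math.min(count, run.length - consumed);
            if (run.packed == null) {
                Arrays.fill(dest, offset, offset + n, run.value);        // RLE run
            } else {
                System.arraycopy(run.packed, consumed, dest, offset, n); // packed group
            }
            offset += n;
            count -= n;
            consumed += n;
            if (consumed == run.length) { // move on to the next run
                runIndex++;
                consumed = 0;
            }
        }
    }
}
```

The mode check (RLE vs. packed) happens once per run segment instead of once per value, which is the overhead the batch path is meant to amortize.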

Benchmark (100K values, SEQUENTIAL/RANDOM/LOW_CARDINALITY):
  Before: ~556M ops/s (same as per-value path)
  After: ~1,270M ops/s (+128%)

This matters for def/rep level decoding on every data page, BOOLEAN
columns in V2 pages, and any direct RLE consumers using batch APIs.

Replace per-value getXxx(offset) loops with position()+asXxxBuffer().get()
bulk copy in readFloats/readDoubles/readIntegers/readLongs. The decoded
data buffer is a contiguous heap byte[] in LE order, making view buffer
bulk reads a single memcpy via Unsafe.copyMemory.
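
For reference, the position()+asXxxBuffer().get() pattern for floats might be sketched as follows. The class and field names are illustrative assumptions; only the `java.nio` calls are standard API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ViewBufferReadSketch {
    private final ByteBuffer data; // heap-backed, little-endian
    private int byteOffset;

    ViewBufferReadSketch(byte[] decoded) {
        this.data = ByteBuffer.wrap(decoded).order(ByteOrder.LITTLE_ENDIAN);
    }

    void readFloats(float[] dest, int offset, int count) {
        data.position(byteOffset);
        // The view starts at the current position and inherits the byte order,
        // so a single bulk get() replaces `count` getFloat(offset) calls.
        data.asFloatBuffer().get(dest, offset, count);
        byteOffset += count * Float.BYTES;
    }
}
```

Note that the view buffer has its own position, so the byte offset has to be tracked separately, as done here.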

Benchmark results (100K values, BSS FLOAT batch):
  Before: ~1,228M ops/s
  After:  ~1,442M ops/s (+17%)

INT32/INT64/DOUBLE show negligible change because BSS invocation cost is
dominated by page transposition in initFromPage, not the read loop.

…ns()

Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader
with direct bit extraction from the page byte[]. The scalar path uses a
single array access + shift + mask instead of the 8-element int[] buffer
and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per
byte with constant masks.
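
A minimal sketch of the direct bit extraction described above, assuming the LSB-first bit order that the Parquet format specifies for PLAIN booleans. The class name and method signatures are assumptions for illustration, not the PR's code.

```java
final class PlainBooleanSketch {
    private final byte[] page;
    private int bitIndex;

    PlainBooleanSketch(byte[] page) {
        this.page = page;
    }

    /** Scalar path: one array access, one shift, one mask. */
    boolean readBoolean() {
        boolean v = ((page[bitIndex >>> 3] >>> (bitIndex & 7)) & 1) != 0;
        bitIndex++;
        return v;
    }

    /** Batch path: unrolls 8 booleans per byte once the reader is byte-aligned. */
    void readBooleans(boolean[] dest, int offset, int count) {
        int end = offset + count;
        while (offset < end && (bitIndex & 7) != 0) {
            dest[offset++] = readBoolean(); // finish a partially consumed byte
        }
        while (end - offset >= 8) {
            int b = page[bitIndex >>> 3];
            dest[offset]     = (b & 0x01) != 0;
            dest[offset + 1] = (b & 0x02) != 0;
            dest[offset + 2] = (b & 0x04) != 0;
            dest[offset + 3] = (b & 0x08) != 0;
            dest[offset + 4] = (b & 0x10) != 0;
            dest[offset + 5] = (b & 0x20) != 0;
            dest[offset + 6] = (b & 0x40) != 0;
            dest[offset + 7] = (b & 0x80) != 0;
            offset += 8;
            bitIndex += 8;
        }
        while (offset < end) {
            dest[offset++] = readBoolean(); // trailing values
        }
    }
}
```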

For RLE (V2), add a native readBooleans() method that uses Arrays.fill
for RLE runs (constant-time for uniform data) and direct int-to-boolean
conversion for packed groups, avoiding the intermediate int[] allocation
of the readInts() path.

Benchmark results (1M values, JDK 25, Compiler Blackholes):
- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch:  ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch:    ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)

Replace the per-bit unrolled extraction loop with a static boolean[256][8]
lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded
booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit
load/store pair — the boolean equivalent of asIntBuffer().get() for ints.

For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and
read directly from the raw packed bytes via the same lookup table.
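
The lookup-table approach can be sketched like this. The table layout (LSB-first, one row per byte value) follows the description above; the class and method names are assumptions for illustration.

```java
final class BooleanLookupSketch {
    /** TABLE[b][i] is bit i (LSB-first) of byte value b, pre-decoded as a boolean. */
    private static final boolean[][] TABLE = new boolean[256][8];

    static {
        for (int b = 0; b < 256; b++) {
            for (int bit = 0; bit < 8; bit++) {
                TABLE[b][bit] = ((b >>> bit) & 1) != 0;
            }
        }
    }

    /** Expands byteCount packed bytes into 8 * byteCount booleans. */
    static void unpack(byte[] packed, int byteOffset, boolean[] dest, int destOffset, int byteCount) {
        for (int i = 0; i < byteCount; i++) {
            int unsigned = packed[byteOffset + i] & 0xFF; // avoid a negative table index
            System.arraycopy(TABLE[unsigned], 0, dest, destOffset + 8 * i, 8);
        }
    }
}
```

Because every byte costs the same table load and 8-element copy regardless of its bit pattern, the work per byte no longer depends on the data.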

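This makes PLAIN batch decode throughput nearly independent of data pattern
(RANDOM now matches the ~5,000M ops/s of the uniform patterns) and narrows
the gap for RLE: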
- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)

…king

Refactor BooleanPlainValuesWriter to pack bits directly into bytes
instead of delegating through ByteBitPackingValuesWriter and the generic
int[8]-based ByteBasedBitPackingEncoder. Add batch writeBooleans() API
to ValuesWriter with optimized overrides:

- PLAIN: processes 8 booleans at a time into single bytes with OR/shift,
  eliminating the per-value method call chain and int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, fills partial
  bit-packed groups from run boundaries to avoid spurious padding.
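
For the PLAIN case, packing 8 booleans per byte with OR/shift might look like the sketch below. It returns a fresh `byte[]` to stay self-contained; the actual writer appends to its output stream, and the names here are hypothetical.

```java
final class PlainBooleanPackSketch {
    /** Packs count booleans starting at offset into bytes, LSB-first. */
    static byte[] pack(boolean[] values, int offset, int count) {
        byte[] out = new byte[(count + 7) >>> 3];
        int i = 0;
        // Full groups: 8 booleans combined into one byte per iteration.
        for (; i + 8 <= count; i += 8) {
            int p = offset + i;
            out[i >>> 3] = (byte) ((values[p] ? 0x01 : 0)
                | (values[p + 1] ? 0x02 : 0)
                | (values[p + 2] ? 0x04 : 0)
                | (values[p + 3] ? 0x08 : 0)
                | (values[p + 4] ? 0x10 : 0)
                | (values[p + 5] ? 0x20 : 0)
                | (values[p + 6] ? 0x40 : 0)
                | (values[p + 7] ? 0x80 : 0));
        }
        // Tail: fewer than 8 values left; unused high bits stay zero.
        for (; i < count; i++) {
            if (values[offset + i]) {
                out[i >>> 3] |= (byte) (1 << (i & 7));
            }
        }
        return out;
    }
}
```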

PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring.
PLAIN batch: +184% over old scalar (2,528M for RANDOM).
RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.

…riteDoubles with bulk ByteBuffer view transfers

Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs,
writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer
view puts to transfer entire arrays in one operation, amortizing capacity checks
across the batch. Add corresponding batch APIs to ValuesWriter (with scalar
default) and optimized overrides in PlainValuesWriter.
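
A rough sketch of a bulk write through an IntBuffer view. It uses a single growable ByteBuffer as a simplifying assumption; CapacityByteArrayOutputStream manages its storage differently, and the names below are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class BulkWriteSketch {
    private ByteBuffer buf = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);

    void writeInts(int[] src, int offset, int count) {
        int bytes = Math.multiplyExact(count, Integer.BYTES);
        ensureCapacity(bytes);                     // one capacity check per batch
        buf.asIntBuffer().put(src, offset, count); // view starts at buf.position()
        buf.position(buf.position() + bytes);      // the view put does not move buf
    }

    private void ensureCapacity(int extra) {
        if (buf.remaining() >= extra) {
            return;
        }
        int newCapacity = Math.max(buf.capacity() * 2, buf.position() + extra);
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity).order(ByteOrder.LITTLE_ENDIAN);
        buf.flip();
        bigger.put(buf); // copy the bytes written so far
        buf = bigger;
    }

    byte[] toByteArray() {
        return Arrays.copyOf(buf.array(), buf.position());
    }
}
```

The same shape applies to longs, floats, and doubles via asLongBuffer(), asFloatBuffer(), and asDoubleBuffer().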

Performance improvement (100K values, JDK 25):
  INT32:  566M -> 2,809M ops/s (+396%)
  FLOAT:  540M -> 2,818M ops/s (+422%)
  INT64:  479M -> 1,306M ops/s (+173%)
  DOUBLE: 442M -> 1,275M ops/s (+189%)

- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops
  for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+

… writes

Replace per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter
with a BATCH_SIZE=64 buffered scatter pattern:
- Accumulate byte values into per-stream batch buffers
- Flush as bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds writeBinaries() batch override for FLBA BSS writer

Performance improvement: FLBA size=2 +85%, size=16 +160% (vs per-byte scatter).

- Add FileReadBenchmark / FileWriteBenchmark with SS warmup=5, measurement=10
- Add RowGroupFlushBenchmark with warmup=3, measurement=5
- Add RleDictionaryIndexDecodingBenchmark with encodeDictionaryIds() and
  ValuesReader-level decode benchmarks (decodeValuesReader, decodeValuesReaderBatch)
- Add BlackHoleOutputFile for write benchmarks without I/O overhead
- Adapt RLE decoder instantiation to use InputStream (par13 API)
iemejia force-pushed the perf-batch-read-api branch from bcf585a to ec6408d on May 13, 2026 19:30