GH-3522: Add batch read APIs to ValuesReader hierarchy (#3535)
iemejia wants to merge 10 commits into
Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations.

Override in:
- RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
- DictionaryValuesReader: batch-decode dictionary IDs first, then batch-lookup values (eliminates the per-value IOException try/catch)
- DeltaBinaryPackingValuesReader: System.arraycopy from the pre-decoded buffer
- PlainValuesReader (all types): loop over LittleEndianDataInputStream
- ByteStreamSplitValuesReader (all types): indexed ByteBuffer bulk reads

These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. On the perf branch where the RLE decoder uses ByteBuffer, this yielded +148% RLE throughput and +67% dictionary decode throughput.
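The shape of the new API can be sketched as below. The method names come from the commit message; the exact signature (`total`, `dest`, `offset`) and the class skeleton are assumptions for illustration, not the PR's actual code:

```java
// Sketch of the batch API added to ValuesReader (simplified skeleton).
abstract class ValuesReaderSketch {
    public abstract int readInteger();

    // Default implementation: a plain scalar loop. Decoders override this to
    // amortize per-value overhead (virtual dispatch, bounds checks, mode
    // switches) across the whole batch.
    public void readIntegers(int total, int[] dest, int offset) {
        for (int i = 0; i < total; i++) {
            dest[offset + i] = readInteger();
        }
    }
}

// Minimal concrete reader used only to exercise the default loop.
class SequenceReader extends ValuesReaderSketch {
    private int next;
    SequenceReader(int start) { this.next = start; }
    @Override public int readInteger() { return next++; }
}
```

A caller that previously issued one virtual call per value can now hand the decoder a whole destination slice in one call, which is what lets the overriding decoders use bulk primitives internally.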
RunLengthBitPackingHybridValuesReader inherited the default loop from ValuesReader.readIntegers(), which called readInteger() per value. Delegate to decoder.readInts(), which uses Arrays.fill for RLE runs and System.arraycopy for packed groups.

Benchmark (100K values, SEQUENTIAL/RANDOM/LOW_CARDINALITY):
- Before: ~556M ops/s (same as the per-value path)
- After: ~1,270M ops/s (+128%)

This matters for def/rep level decoding on every data page, BOOLEAN columns in V2 pages, and any direct RLE consumers using the batch APIs.
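The two bulk primitives can be illustrated in isolation. This is a hedged sketch of the per-run/per-group fill logic described above, not the decoder's actual state machine; the parameter names (`rleValue`, `rleLeft`, `packed`, `packedPos`) are invented for the example:

```java
import java.util.Arrays;

// Sketch: batching across RLE runs and bit-packed groups. A real decoder
// alternates between these two cases until `total` values are produced.
class RleBatchSketch {
    // RLE run: one Arrays.fill per run instead of one store per value.
    static int fillFromRun(int[] dest, int offset, int total, int rleValue, int rleLeft) {
        int n = Math.min(total, rleLeft);
        Arrays.fill(dest, offset, offset + n, rleValue);
        return n; // values consumed from the run
    }

    // Bit-packed group already unpacked into `packed`: one arraycopy per group.
    static int fillFromPacked(int[] dest, int offset, int total,
                              int[] packed, int packedPos, int packedLeft) {
        int n = Math.min(total, packedLeft);
        System.arraycopy(packed, packedPos, dest, offset, n);
        return n; // values consumed from the group
    }
}
```

For uniform data (e.g. all-null or all-defined levels), the fill path makes the cost per value effectively constant per run, which is where the large RLE speedups come from.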
Replace per-value getXxx(offset) loops with position() + asXxxBuffer().get() bulk copies in readFloats/readDoubles/readIntegers/readLongs. The decoded data buffer is a contiguous heap byte[] in little-endian order, so view-buffer bulk reads become a single memcpy via Unsafe.copyMemory.

Benchmark results (100K values, BSS FLOAT batch):
- Before: ~1,228M ops/s
- After: ~1,442M ops/s (+17%)

INT32/INT64/DOUBLE show negligible change because BSS invocation cost is dominated by page transposition in initFromPage, not the read loop.
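The view-buffer pattern looks roughly like this. The method and parameter names are illustrative, not the reader's actual fields; the key point is the single bulk `get` replacing a `getFloat(offset)` loop:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: bulk float read via a FloatBuffer view over the decoded byte[].
class ViewBufferReadSketch {
    static void readFloats(ByteBuffer decoded, int byteOffset,
                           float[] dest, int destOffset, int count) {
        // duplicate() so the caller's position is untouched; duplicates reset
        // byte order, so little-endian must be re-applied explicitly.
        ByteBuffer dup = decoded.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        dup.position(byteOffset);
        // One bulk transfer instead of `count` individual getFloat calls.
        dup.asFloatBuffer().get(dest, destOffset, count);
    }
}
```

On a heap buffer this bulk `get` is what the JIT can lower to a single memory copy, which is the effect the commit message describes.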
…ns()

Replace ByteBitPackingValuesReader delegation in BooleanPlainValuesReader with direct bit extraction from the page byte[]. The scalar path uses a single array access + shift + mask instead of the 8-element int[] buffer and packer dispatch. The batch path (readBooleans) unrolls 8 booleans per byte with constant masks.

For RLE (V2), add a native readBooleans() method that uses Arrays.fill for RLE runs (constant-time for uniform data) and direct int-to-boolean conversion for packed groups, avoiding the intermediate int[] allocation of the readInts() path.

Benchmark results (1M values, JDK 25, Compiler Blackholes):
- V1 PLAIN scalar: ~620M -> ~1,528-1,618M ops/s (+150%)
- V1 PLAIN batch: ALL_TRUE/FALSE ~5,000M (+680%), RANDOM 2,757M (+337%)
- V2 RLE batch: ALL_TRUE/FALSE ~190B (fill), RANDOM 1,335M (+93%)
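The scalar and unrolled batch paths can be sketched as below, assuming the LSB-first bit order used by Parquet's plain boolean encoding. The class and parameter names are invented for the example:

```java
// Sketch: direct bit extraction from a plain-encoded boolean page.
class BooleanPlainSketch {
    // Scalar path: one array access + shift + mask; no int[8] buffer,
    // no packer dispatch.
    static boolean readBoolean(byte[] page, int bitIndex) {
        return ((page[bitIndex >> 3] >> (bitIndex & 7)) & 1) != 0;
    }

    // Batch path: unroll 8 booleans per byte with constant masks
    // (bits are packed LSB-first).
    static void readBooleans(byte[] page, int byteIndex,
                             boolean[] dest, int offset, int byteCount) {
        for (int i = 0; i < byteCount; i++) {
            byte b = page[byteIndex + i];
            int o = offset + i * 8;
            dest[o]     = (b & 0x01) != 0;
            dest[o + 1] = (b & 0x02) != 0;
            dest[o + 2] = (b & 0x04) != 0;
            dest[o + 3] = (b & 0x08) != 0;
            dest[o + 4] = (b & 0x10) != 0;
            dest[o + 5] = (b & 0x20) != 0;
            dest[o + 6] = (b & 0x40) != 0;
            dest[o + 7] = (b & 0x80) != 0;
        }
    }
}
```

The constant masks let the JIT keep the whole byte in a register and emit branch-free test/set sequences, which is why the batch path scales so much better than the per-bit scalar loop.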
Replace the per-bit unrolled extraction loop with a static boolean[256][8] lookup table + System.arraycopy. Each byte maps to its 8 pre-decoded booleans, and the 8-byte copy is emitted by HotSpot as a single 64-bit load/store pair: the boolean equivalent of asIntBuffer().get() for ints. For RLE PACKED groups (bitWidth=1), bypass the int[] intermediate and read directly from the raw packed bytes via the same lookup table.

This makes batch decode throughput independent of the data pattern:
- V1 PLAIN batch RANDOM: 2,757M -> 5,047M ops/s (+83%)
- V2 RLE batch RANDOM: 1,335M -> 1,618M ops/s (+21%)
- V2 RLE batch MOSTLY_TRUE_99: 3,205M -> 3,745M ops/s (+17%)
- Uniform patterns (ALL_TRUE/FALSE): unchanged (still Arrays.fill)
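A minimal sketch of the lookup-table approach, again assuming LSB-first bit order; the table and method names are illustrative:

```java
// Sketch: boolean[256][8] lookup table for byte-at-a-time boolean decode.
class BooleanLookupSketch {
    // Pre-decode every possible byte value to its 8 booleans (LSB-first).
    // 256 * 8 booleans = 2 KiB, built once at class load.
    static final boolean[][] BYTE_TO_BOOLEANS = new boolean[256][8];
    static {
        for (int b = 0; b < 256; b++) {
            for (int bit = 0; bit < 8; bit++) {
                BYTE_TO_BOOLEANS[b][bit] = ((b >> bit) & 1) != 0;
            }
        }
    }

    static void readBooleans(byte[] page, int byteIndex,
                             boolean[] dest, int offset, int byteCount) {
        for (int i = 0; i < byteCount; i++) {
            // 8-element copy of a pre-decoded row; no per-bit branching,
            // so throughput does not depend on the bit pattern.
            System.arraycopy(BYTE_TO_BOOLEANS[page[byteIndex + i] & 0xFF], 0,
                             dest, offset + i * 8, 8);
        }
    }
}
```

Because the copy is the same regardless of which bits are set, random and skewed data decode at the same rate, matching the pattern-independence claim above.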
…king

Refactor BooleanPlainValuesWriter to pack bits directly into bytes instead of delegating through ByteBitPackingValuesWriter and the generic int[8]-based ByteBasedBitPackingEncoder. Add a batch writeBooleans() API to ValuesWriter with optimized overrides:

- PLAIN: processes 8 booleans at a time into single bytes with OR/shift, eliminating the per-value method call chain and the int[] intermediate.
- RLE: pre-scans for runs >= 8 to emit RLE directly, and fills partial bit-packed groups from run boundaries to avoid spurious padding.

PLAIN scalar improves +69% (890M -> 1,500M ops/s) from the refactoring alone. PLAIN batch: +184% over the old scalar path (2,528M ops/s for RANDOM). RLE batch: +278% for ALL_FALSE, +95% for MOSTLY_*, +36% for ALTERNATING.
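The PLAIN packing direction can be sketched as the mirror image of the batch read: 8 booleans are OR-shifted into one byte, with a partial final byte for the tail. The helper below is illustrative, not the writer's actual code:

```java
// Sketch: pack booleans directly into bytes, LSB-first, 8 at a time.
class BooleanPackSketch {
    static byte[] pack(boolean[] src, int offset, int count) {
        byte[] out = new byte[(count + 7) / 8];
        int i = 0;
        // Full bytes: 8 booleans OR-ed together, no per-value call chain.
        for (; i + 8 <= count; i += 8) {
            int b = 0;
            for (int bit = 0; bit < 8; bit++) {
                if (src[offset + i + bit]) b |= 1 << bit;
            }
            out[i >> 3] = (byte) b;
        }
        // Tail: fewer than 8 remaining booleans into a final partial byte.
        for (; i < count; i++) {
            if (src[offset + i]) out[i >> 3] |= (byte) (1 << (i & 7));
        }
        return out;
    }
}
```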
…riteDoubles with bulk ByteBuffer view transfers

Add bulk write methods to CapacityByteArrayOutputStream (writeInts, writeLongs, writeFloats, writeDoubles) that use IntBuffer/LongBuffer/FloatBuffer/DoubleBuffer view puts to transfer entire arrays in one operation, amortizing capacity checks across the batch. Add corresponding batch APIs to ValuesWriter (with a scalar default) and optimized overrides in PlainValuesWriter.

Performance improvement (100K values, JDK 25):
- INT32: 566M -> 2,809M ops/s (+396%)
- FLOAT: 540M -> 2,818M ops/s (+422%)
- INT64: 479M -> 1,306M ops/s (+173%)
- DOUBLE: 442M -> 1,275M ops/s (+189%)
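The write side mirrors the read side's view-buffer trick. The sketch below assumes the capacity check has already been done for the whole batch, which is the amortization the commit message describes; names are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: bulk int write via an IntBuffer view, one capacity check per batch.
class BulkWriteSketch {
    static void writeInts(ByteBuffer page, int[] src, int offset, int count) {
        // Assumes the caller has already ensured count * 4 bytes of capacity.
        ByteBuffer slice = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        // Single view put instead of `count` individual putInt calls.
        slice.asIntBuffer().put(src, offset, count);
        page.position(page.position() + count * 4);
    }
}
```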
- ValuesReader.readBinaries() / ValuesWriter.writeBinaries() default impls
- FixedLenByteArrayPlainValuesReader: bulk slice() with fixed-offset Binary views
- FixedLenByteArrayPlainValuesWriter: chunked bulk write() amortizing stream overhead
- ByteStreamSplitValuesReader: optimized array-based decode with unrolled loops for element sizes 2, 4, 8, 12, 16
- ByteStreamSplitValuesReaderForFLBA: batch readBinaries() with a single advanceByteOffset
- FixedLenByteArrayEncodingBenchmark: full FLBA benchmark suite with batch variants
- Add TestDataFactory and BenchmarkEncodingUtils helper classes
- Fix JMH annotation processor config in pom.xml for Maven Compiler 3.14+
… writes

Replace the per-value scatterBytes() in FixedLenByteArrayByteStreamSplitValuesWriter with a BATCH_SIZE=64 buffered scatter pattern:

- Accumulate byte values into per-stream batch buffers
- Flush as a bulk write(byte[], 0, count) to each stream
- Eliminates N*elementSize individual stream.write(byte) calls per batch
- Adds a writeBinaries() batch override for the FLBA BSS writer

Performance improvement: FLBA size=2 +85%, size=16 +160% (vs. per-byte scatter).
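The buffered scatter can be sketched as below. ByteArrayOutputStream stands in for the writer's real per-stream sinks, and the method shape is an assumption for illustration:

```java
import java.io.ByteArrayOutputStream;

// Sketch: byte-stream-split scatter with per-stream batch buffers instead of
// one stream.write(byte) call per byte.
class BufferedScatterSketch {
    static final int BATCH_SIZE = 64;

    // Scatter `count` elements of `elementSize` bytes each: byte s of every
    // element goes to streams[s].
    static void scatter(byte[] values, int elementSize, int count,
                        ByteArrayOutputStream[] streams) {
        byte[][] batch = new byte[elementSize][BATCH_SIZE];
        int filled = 0;
        for (int v = 0; v < count; v++) {
            for (int s = 0; s < elementSize; s++) {
                batch[s][filled] = values[v * elementSize + s];
            }
            if (++filled == BATCH_SIZE) {
                // One bulk write per stream per 64 elements.
                for (int s = 0; s < elementSize; s++) {
                    streams[s].write(batch[s], 0, filled);
                }
                filled = 0;
            }
        }
        for (int s = 0; s < elementSize; s++) { // flush the partial tail batch
            streams[s].write(batch[s], 0, filled);
        }
    }
}
```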
- Add FileReadBenchmark / FileWriteBenchmark with SS warmup=5, measurement=10
- Add RowGroupFlushBenchmark with warmup=3, measurement=5
- Add RleDictionaryIndexDecodingBenchmark with encodeDictionaryIds() and ValuesReader-level decode benchmarks (decodeValuesReader, decodeValuesReaderBatch)
- Add BlackHoleOutputFile for write benchmarks without I/O overhead
- Adapt RLE decoder instantiation to use InputStream (par13 API)
Summary

Add readIntegers(), readLongs(), readFloats(), readDoubles() batch methods to ValuesReader with default loop-based implementations. Overrides:

- RunLengthBitPackingHybridDecoder.readInts(): batch across RLE runs and packed groups using Arrays.fill/System.arraycopy
- DeltaBinaryPackingValuesReader: System.arraycopy from the pre-decoded buffer

Rationale

These APIs enable callers to amortize per-value overhead (virtual dispatch, bounds checks, mode switches) across batches. Combined with other optimizations in this series (ByteBuffer-based RLE decoder, etc.), batch reads yield significant throughput improvements over per-value loops.
All 576 parquet-column tests pass.