test(delta): add embedded e2e ingestion test for Delta Lake input source #19611
Draft
rinchinov wants to merge 3 commits into
Adds an end-to-end ingestion test that runs inside an embedded Druid cluster
(overlord, coordinator, indexer, broker, historical) and ingests a Delta table
via a native IndexTask using the Delta Lake input source.
The table has 2 Parquet files x 2000 rows = 4000 rows total. Because each file
exceeds the Delta kernel's default 1024-row batch size, this is the
integration-level counterpart of the unit regression test
DeltaInputSourceTest.BatchDrainRegressionTests for the per-file batch-drain bug
(apacheGH-18606): without the fix the reader returns 1024 * 2 = 2048 rows.
Modeled on the Iceberg embedded test (IcebergRestCatalogIngestionTest). Delta
needs no catalog/container since it reads directly from a local path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
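The row counts quoted above can be reproduced with a toy model. The sketch below is not Druid or Delta kernel code: the class and method names are hypothetical, and "read only the first batch of each file" is just one failure shape consistent with the 2048 figure, assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of the per-file batch-drain arithmetic; names are hypothetical. */
final class BatchDrainSketch {
  // Default batch size cited in the commit message.
  static final int BATCH_SIZE = 1024;

  /** Splits one file's rows into batch sizes, e.g. 2000 rows -> [1024, 976]. */
  static List<Integer> batchesFor(int rowsInFile, int batchSize) {
    List<Integer> batches = new ArrayList<>();
    for (int remaining = rowsInFile; remaining > 0; remaining -= batchSize) {
      batches.add(Math.min(batchSize, remaining));
    }
    return batches;
  }

  /** Assumed buggy shape: consume only the first batch of each file. */
  static int countFirstBatchOnly(int files, int rowsPerFile, int batchSize) {
    int total = 0;
    for (int f = 0; f < files; f++) {
      Iterator<Integer> batches = batchesFor(rowsPerFile, batchSize).iterator();
      if (batches.hasNext()) {
        total += batches.next();
      }
    }
    return total;
  }

  /** Fixed shape: drain every batch of a file before moving to the next file. */
  static int countDrainAll(int files, int rowsPerFile, int batchSize) {
    int total = 0;
    for (int f = 0; f < files; f++) {
      Iterator<Integer> batches = batchesFor(rowsPerFile, batchSize).iterator();
      while (batches.hasNext()) {
        total += batches.next();
      }
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(countFirstBatchOnly(2, 2000, BATCH_SIZE)); // 2048
    System.out.println(countDrainAll(2, 2000, BATCH_SIZE));       // 4000
  }
}
```

A file with 1024 rows or fewer yields a single batch, so both methods agree on it. This is why the fixture needs files larger than the batch size for the test to detect the bug.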
…pleteness Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… test
Switches the embedded e2e test from a native IndexTask to an MSQ
INSERT ... EXTERN(...) statement, matching the structure of
IcebergRestCatalogIngestionTest that this test was modeled on, and
exercising the SQL ingestion path users actually write.
EXTERN requires a non-null inputFormat argument, but the Delta input
source reads Parquet via the Delta kernel and ignores it
(DeltaInputSource.needsFormat() == false), so a throwaway '{"type":"json"}'
is supplied and never used. The test still validates apacheGH-18606 end to end:
COUNT(*) must be 4000 (the bug returned 1024 * 2 = 2048).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
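For readers unfamiliar with the statement shape, the following is a rough sketch of how such an INSERT ... EXTERN query could be assembled. It is not the code from this PR. The helper class, the datasource name, the table path, the `id BIGINT` signature, and the `MILLIS_TO_TIMESTAMP("id")` mapping to `__time` are all assumptions made for illustration. Only the `delta` input source type with its `tablePath` field and the throwaway `{"type":"json"}` format come from the Druid Delta Lake extension and the commit message above.

```java
/** Hypothetical helper that builds an MSQ INSERT ... EXTERN statement for a Delta table. */
final class DeltaExternSqlSketch {

  /**
   * Assumes dataSource and tablePath contain no quote characters;
   * a real test would escape or validate them.
   */
  static String insertSql(String dataSource, String tablePath) {
    // Delta input source spec: type "delta" plus the table's root path.
    String inputSource = "{\"type\":\"delta\",\"tablePath\":\"" + tablePath + "\"}";
    // Placeholder format: EXTERN needs a non-null value, the Delta source ignores it.
    String inputFormat = "{\"type\":\"json\"}";
    return String.join("\n",
        "INSERT INTO \"" + dataSource + "\"",
        // Assumed mapping: derive __time from the numeric id column.
        "SELECT MILLIS_TO_TIMESTAMP(\"id\") AS __time, \"id\"",
        "FROM TABLE(EXTERN('" + inputSource + "', '" + inputFormat + "'))",
        "  EXTEND (\"id\" BIGINT)",
        "PARTITIONED BY ALL");
  }

  public static void main(String[] args) {
    // Both arguments are illustrative values, not taken from the PR.
    System.out.println(insertSql("delta_e2e", "/tmp/delta/large-row-group-table"));
  }
}
```

After ingestion, the regression check described above reduces to a single `SELECT COUNT(*)` against the target datasource, with 4000 as the expected result.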
Description
Adds an embedded end-to-end integration test for the Delta Lake input source, as suggested by @abhishekrb19 in #19592 (comment).
It is modeled on the existing Iceberg embedded test (IcebergRestCatalogIngestionTest). Unlike Iceberg, Delta Lake reads directly from a local filesystem path, so no catalog or testcontainer is required: the test ingests a Delta table through a native IndexTask using DeltaInputSource and verifies the result over a real embedded Druid cluster (overlord, coordinator, indexer, broker, historical).
What it covers
The test reuses the regression table from #19592: 2 Parquet files × 2000 rows = 4000 rows total. Because each file exceeds the Delta kernel's default batch size of 1024 rows, this is the integration-level counterpart of the unit test DeltaInputSourceTest.BatchDrainRegressionTests for the per-file batch-drain bug (#18606):
- Without the fix: 1024 × 2 = 2048 rows ingested
- With the fix: 4000 rows ingested
Assertions:
- COUNT(*) = 4000 (exact; the core regression signal)
- MIN/MAX(__time) bounded by the id column's documented min/max (0 and 3999), confirming rows from both files were read
Depends on #19592
This test exercises the fix in #19592 and is green only with that fix present. On master (which still has the bug) it asserts 4000 but the input source returns 2048, so CI here will be red until #19592 is merged. Kept as a draft for that reason. The change is otherwise self-contained (test + copied Delta table resource + a test-scoped druid-deltalake-extensions dependency).
Key changed/added classes in this PR
- DeltaLakeInputSourceIngestionTest (new embedded e2e test)
- embedded-tests/pom.xml (test-scoped druid-deltalake-extensions dependency)
- embedded-tests/src/test/resources/delta/large-row-group-table (Delta table fixture)
This PR has: