test(delta): add embedded e2e ingestion test for Delta Lake input source #19611

Draft
rinchinov wants to merge 3 commits into apache:master from rinchinov:test/delta-lake-embedded-e2e-18606

@rinchinov

Contributor

Description

Adds an embedded end-to-end integration test for the Delta Lake input source, as suggested by @abhishekrb19 in #19592 (comment).

It is modeled on the existing Iceberg embedded test (IcebergRestCatalogIngestionTest). Unlike Iceberg, Delta Lake reads directly from a local filesystem path, so no catalog or testcontainer is required — the test ingests a Delta table through a native IndexTask using DeltaInputSource and verifies the result over a real embedded Druid cluster (overlord, coordinator, indexer, broker, historical).

What it covers

The test reuses the regression table from #19592: 2 Parquet files × 2000 rows = 4000 rows total. Because each file exceeds the Delta kernel's default batch size of 1024 rows, this is the integration-level counterpart of the unit test DeltaInputSourceTest.BatchDrainRegressionTests for the per-file batch-drain bug (#18606):

  • Without the fix: 1024 × 2 = 2048 rows ingested
  • With the fix: 4000 rows ingested
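
The row counts above can be reproduced with a small standalone model. This is only a sketch of the arithmetic, assuming the bug amounts to consuming the first batch of each file and then moving on; it is not the DeltaInputSource code, and the class and method names (BatchDrainSketch, countFirstBatchOnly, countAllBatches) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of per-file batch draining; not Druid or Delta kernel code.
final class BatchDrainSketch {
    // Default batch size quoted in the PR description.
    static final int BATCH_SIZE = 1024;

    // Splits one file's row count into the sizes of the batches a reader would emit.
    static List<Integer> batchSizes(int rowsInFile, int batchSize) {
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = rowsInFile; remaining > 0; remaining -= batchSize) {
            sizes.add(Math.min(batchSize, remaining));
        }
        return sizes;
    }

    // Assumed buggy behaviour: only the first batch of each file is consumed.
    static int countFirstBatchOnly(int[] rowsPerFile, int batchSize) {
        int total = 0;
        for (int rows : rowsPerFile) {
            Iterator<Integer> batches = batchSizes(rows, batchSize).iterator();
            if (batches.hasNext()) {
                total += batches.next();
            }
        }
        return total;
    }

    // Fixed behaviour: every batch of each file is drained before moving on.
    static int countAllBatches(int[] rowsPerFile, int batchSize) {
        int total = 0;
        for (int rows : rowsPerFile) {
            Iterator<Integer> batches = batchSizes(rows, batchSize).iterator();
            while (batches.hasNext()) {
                total += batches.next();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] files = {2000, 2000}; // 2 Parquet files x 2000 rows
        System.out.println(countFirstBatchOnly(files, BATCH_SIZE)); // 2048
        System.out.println(countAllBatches(files, BATCH_SIZE));     // 4000
    }
}
```

Each 2000-row file splits into batches of 1024 and 976 rows, so dropping the second batch of both files loses 2 × 976 = 1952 rows, which is the gap between 2048 and 4000.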

Assertions:

  • COUNT(*) = 4000 (exact; the core regression signal)
  • MIN/MAX(__time) bounded by the id column's documented min/max (0 and 3999), confirming rows from both files were read

Depends on #19592

This test exercises the fix in #19592 and is green only with that fix present. On master (which still has the bug) it asserts 4000 but the input source returns 2048, so CI here will be red until #19592 is merged. Kept as a draft for that reason. The change is otherwise self-contained (test + copied Delta table resource + a test-scoped druid-deltalake-extensions dependency).

Key changed/added classes in this PR

  • DeltaLakeInputSourceIngestionTest (new embedded e2e test)
  • embedded-tests/pom.xml (test-scoped druid-deltalake-extensions dependency)
  • embedded-tests/src/test/resources/delta/large-row-group-table (Delta table fixture)

This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths.

Adds an end-to-end ingestion test that runs inside an embedded Druid
cluster (overlord, coordinator, indexer, broker, historical) and ingests
a Delta table via a native IndexTask using the Delta Lake input source.

The table has 2 Parquet files x 2000 rows = 4000 rows total. Because each
file exceeds the Delta kernel's default 1024-row batch size, this is the
integration-level counterpart of the unit regression test
DeltaInputSourceTest.BatchDrainRegressionTests for the per-file batch-drain
bug (apacheGH-18606): without the fix the reader returns 1024 * 2 = 2048 rows.

Modeled on the Iceberg embedded test (IcebergRestCatalogIngestionTest).
Delta needs no catalog/container since it reads directly from a local path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rinchinov and others added 2 commits June 22, 2026 16:16
…pleteness

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… test

Switches the embedded e2e test from a native IndexTask to an MSQ
INSERT ... EXTERN(...) statement, matching the structure of
IcebergRestCatalogIngestionTest that this test was modeled on, and
exercising the SQL ingestion path users actually write.

EXTERN requires a non-null inputFormat argument, but the Delta input
source reads Parquet via the Delta kernel and ignores it
(DeltaInputSource.needsFormat() == false), so a throwaway '{"type":"json"}'
is supplied and never used. The test still validates apacheGH-18606 end to end:
COUNT(*) must be 4000 (the bug returned 1024 * 2 = 2048).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
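
For readers unfamiliar with the shape of such a statement, here is a rough sketch of how an INSERT ... EXTERN(...) string for a Delta table might be assembled. It is not the code in this PR: the class name, the datasource name, the table path, the BIGINT type for id, and the MILLIS_TO_TIMESTAMP mapping from id to __time are all assumptions made for illustration. Only the delta input source type with a tablePath field and the throwaway '{"type":"json"}' input format come from the description above and the Druid documentation.

```java
// Hypothetical helper; names and column mapping are illustrative assumptions.
final class DeltaExternSqlSketch {
    // Builds an MSQ INSERT statement that reads a Delta table through EXTERN.
    // tablePath is embedded as-is, so it must not contain quote characters.
    static String insertSql(String dataSource, String tablePath) {
        // Delta input source spec: type "delta" plus the table's filesystem path.
        String inputSource = "{\"type\":\"delta\",\"tablePath\":\"" + tablePath + "\"}";
        // EXTERN needs a non-null input format; the Delta input source ignores it.
        String inputFormat = "{\"type\":\"json\"}";
        return "INSERT INTO \"" + dataSource + "\"\n"
            // Assumption: __time is derived from the id column.
            + "SELECT MILLIS_TO_TIMESTAMP(\"id\") AS __time, \"id\"\n"
            + "FROM TABLE(EXTERN('" + inputSource + "', '" + inputFormat + "'))\n"
            // Assumption: id is declared as BIGINT in the row signature.
            + "  EXTEND (\"id\" BIGINT)\n"
            + "PARTITIONED BY ALL";
    }

    public static void main(String[] args) {
        // Both arguments are placeholders, not the values used by the test.
        System.out.println(insertSql("delta_e2e_sketch", "/tmp/delta/large-row-group-table"));
    }
}
```

After a statement of this shape completes, the regression check reduces to a SELECT COUNT(*) on the target datasource returning 4000 rather than 2048.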