[CALCITE-7618] Add filter pushdown support to the file adapter's CSV table implementation #5048

Open
Diveyam-Mishra wants to merge 2 commits into apache:main from Diveyam-Mishra:CALCITE-7618
@Diveyam-Mishra

Contributor

Jira Link

[CALCITE-7618]

Changes Proposed

This PR implements filter pushdown support for the file adapter's CSV table using a planner-rule-based approach instead of a FilterableTable interface. This allows Calcite to make more intelligent planning decisions, estimate cost reductions, and display pushed-down predicates in EXPLAIN plans.

Implementation Details:

  1. Rule-Based Pushdown:
    • Introduced CsvFilterTableScanRule which matches LogicalFilter on a CsvTableScan and pushes simple equality predicates (col = literal) into the scan.
    • Introduced CsvProjectFilterTableScanRule which matches LogicalProject → LogicalFilter → CsvTableScan and pushes down the filter first, preventing the planner from prematurely collapsing projects and filters into a generic EnumerableCalc and bypassing pushdown.
  2. Scan State & Costing:
    • Updated CsvTableScan to store and propagate @Nullable String[] filterValues.
    • Updated CsvTableScan#computeSelfCost to reduce planning cost proportionally to the number of pushed-down filters.
    • Extended CsvTableScan#explainTerms to format filters as filters=[[colIndex=value]] in EXPLAIN outputs.
  3. Execution Support:
    • Added CsvTranslatableTable#scan(DataContext, int[], String[]) which is dynamically invoked by the generated code when filters are present.
    • Made CsvEnumerator#converter package-private so it can be reused inside CsvTranslatableTable to resolve correct row converters (ensuring single-column projections return raw objects rather than Object[] arrays to prevent class cast errors).
  4. Testing:
    • Added targeted unit tests in FileAdapterTest.java verifying pushdown, projection combination, result correctness, and non-pushable residual filter persistence.
    • Updated existing plans in testPushDownProjectAggregateWithFilter to reflect the newly optimized scan plans.
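The scan behaviour described in items 2 and 3 can be sketched without any Calcite dependency. This is a simplified model, not the PR's code: the class name `CsvScanSketch`, its `scan` signature, and the sample rows are assumptions made for illustration. It treats `filterValues` as a per-column array in which a null entry means "no filter", and it returns a raw value rather than an `Object[]` for single-column projections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Dependency-free model of a filtered, projected CSV scan (hypothetical). */
final class CsvScanSketch {
  /** Returns the field at idx, or null when the row is shorter than expected. */
  private static String field(String[] row, int idx) {
    return idx < row.length ? row[idx] : null;
  }

  /** Applies per-column equality filters, then projects the requested fields. */
  static List<Object> scan(List<String[]> rows, int[] fields, String[] filterValues) {
    List<Object> out = new ArrayList<>();
    outer:
    for (String[] row : rows) {
      if (filterValues != null) {
        for (int i = 0; i < filterValues.length; i++) {
          // A null entry means there is no filter on column i.
          if (filterValues[i] != null && !filterValues[i].equals(field(row, i))) {
            continue outer;
          }
        }
      }
      if (fields.length == 1) {
        // Single-column projection: emit the raw value, not a one-element array.
        out.add(field(row, fields[0]));
      } else {
        Object[] projected = new Object[fields.length];
        for (int j = 0; j < fields.length; j++) {
          projected[j] = field(row, fields[j]);
        }
        out.add(projected);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Made-up rows in (empno, name, deptno) order.
    List<String[]> rows = Arrays.asList(
        new String[] {"100", "Fred", "20"},
        new String[] {"110", "Eric", "10"},
        new String[] {"120", "Wilma", "20"});
    // Mirrors fields=[[1, 0]], filters=[[2=20]] from the plan in this description.
    List<Object> result = scan(rows, new int[] {1, 0}, new String[] {null, null, "20"});
    System.out.println(result.size()); // prints 2
  }
}
```

Note that the filter indexes refer to the underlying CSV columns, while `fields` only controls what is emitted, which is why a filter on column 2 can coexist with a projection of columns 1 and 0.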

To verify the change, run:

.\sqlline.bat -u "jdbc:calcite:model=file/src/test/resources/smart.json" -n admin -p admin -e "!set maxwidth 10000" -e "explain plan for select name, empno from EMPS where deptno = 20"

Before this change, the plan was:

PLAN=EnumerableCalc(expr#0..2=[{inputs}], expr#3=[20], expr#4=[=($t2, $t3)], NAME=[$t1], EMPNO=[$t0], $condition=[$t4])
CsvTableScan(table=[[SALES, EMPS]], fields=[[0, 1, 2]])

After this change, the filter and projection are pushed down into CsvTableScan, resulting in:

CsvTableScan(table=[[SALES, EMPS]], fields=[[1, 0]], filters=[[2=20]])

This demonstrates that the scan now reads only the required columns (name, empno) and applies the deptno = 20 filter during the table scan itself.

@Diveyam-Mishra Diveyam-Mishra force-pushed the CALCITE-7618 branch 2 times, most recently from 66fb2ac to e0535c1 Compare June 24, 2026 21:37

protected CsvTableScan(RelOptCluster cluster, RelOptTable table,
CsvTranslatableTable csvTable, int[] fields,
@Nullable String @Nullable [] filterValues) {

Contributor

I think CsvEnumerator is actually broken, since it does string comparisons.
This means for example that 0.0 != 0 in a filter.
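The reviewer's point can be reproduced in isolation. The sketch below is not taken from CsvEnumerator; the class and method names are hypothetical. It only contrasts comparing two field values as strings with comparing them as numbers.

```java
import java.math.BigDecimal;

/** Contrasts string equality with numeric equality (hypothetical helper names). */
final class StringVsNumericEquality {
  /** Compares the raw text, as a string-based filter would. */
  static boolean equalAsStrings(String a, String b) {
    return a.equals(b);
  }

  /** Parses both sides and compares their numeric values, ignoring scale. */
  static boolean equalAsNumbers(String a, String b) {
    return new BigDecimal(a).compareTo(new BigDecimal(b)) == 0;
  }

  public static void main(String[] args) {
    System.out.println(equalAsStrings("0.0", "0")); // prints false
    System.out.println(equalAsNumbers("0.0", "0")); // prints true
  }
}
```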

@Diveyam-Mishra Diveyam-Mishra marked this pull request as draft June 24, 2026 22:30
@Diveyam-Mishra Diveyam-Mishra force-pushed the CALCITE-7618 branch 3 times, most recently from f60ce88 to d5d601a Compare June 27, 2026 18:29
@sonarqubecloud

@Diveyam-Mishra Diveyam-Mishra marked this pull request as ready for review June 27, 2026 19:24
@Diveyam-Mishra

Contributor Author

I might have complicated a few things because I was constantly getting style errors locally. I tried to fix them, but I may have been doing something wrong: I tried stopping the daemon thread and rebuilding, yet something went haywire. If needed, I can open a new PR with a single proper commit.

@mihaibudiu

Contributor

Please use fresh commits until we finish the review, to make it easier to see what changed in response to reviewers.

if (o1 == null || o2 == null) {
return false;
}
if (o1 instanceof BigDecimal && o2 instanceof BigDecimal) {

Contributor

Why is this case needed? Doesn't BigDecimal have equals?
If it does, can this become Objects.equals()?

@Diveyam-Mishra Diveyam-Mishra Jun 29, 2026

Contributor Author

The core problem is that BigDecimal violates the intuitive expectation that "same number = equal object":
new BigDecimal("2.0").equals(new BigDecimal("2.00")) // false
new BigDecimal("2.0").compareTo(new BigDecimal("2.00")) == 0 // true

Contributor

How about using compareTo for everything and using Comparable for o1 and o2?
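One way to read this suggestion is sketched below. It is an assumption about what such a helper might look like, not the code in the PR: it keeps the "null never matches" behaviour of the snippet under review, uses `compareTo` when both operands are `Comparable` instances of the same class, and otherwise falls back to `equals`. The same-class check is there to avoid a `ClassCastException` from comparing unrelated types.

```java
import java.math.BigDecimal;

/** Equality based on Comparable where possible (hypothetical sketch). */
final class ComparableEquality {
  @SuppressWarnings({"unchecked", "rawtypes"})
  static boolean objectsEqual(Object o1, Object o2) {
    if (o1 == null || o2 == null) {
      return false; // a null operand never matches
    }
    if (o1 instanceof Comparable && o1.getClass() == o2.getClass()) {
      // compareTo ignores BigDecimal scale, unlike BigDecimal.equals.
      return ((Comparable) o1).compareTo(o2) == 0;
    }
    return o1.equals(o2);
  }

  public static void main(String[] args) {
    System.out.println(
        objectsEqual(new BigDecimal("2.0"), new BigDecimal("2.00"))); // prints true
    System.out.println(objectsEqual("a", null)); // prints false
  }
}
```

With this shape, the dedicated `BigDecimal` branch is no longer needed, since `BigDecimal` is handled by the general `Comparable` case.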

* {@link CsvTableScan}.
*
* <p>Only equality conditions of the form {@code column = literal} can be
* pushed down, because {@link CsvEnumerator} only supports per-column

Contributor

Could this situation be improved? Is this a fundamental limitation of CsvEnumerator?
Maybe we need a more powerful enumerator.
In principle I think any predicate of the current row value should work.

Contributor Author

My current plan is to introduce a CsvFilter abstraction to represent the subset of filters that can be pushed down (initially AND, OR, = and <>, including null comparisons). Rather than encoding pushdown state as column-value arrays, the planner will build a CsvFilter tree, serialize it, and pass the serialized representation through CsvTableScan/CsvTranslatableTable to CsvEnumerator, where it will be deserialized and evaluated against each row.

The CsvFilter classes are intended to be a lightweight data model representing pushdownable predicates, while evaluation, serialization/deserialization, and pretty-printing remain separate concerns. This keeps the representation extensible for additional pushdown operators in the future without requiring further changes to the transport mechanism between planning and execution.
There is one more option, which is to do what Spark does and compile the filter all the way down to bytecode, but in my opinion that's overkill.
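A rough sketch of what such a filter tree could look like follows. Everything in it is hypothetical: the class names `Compare`, `And`, and `Or`, the string-based comparison, and the choice that a comparison involving a missing or null value never matches are assumptions, and serialization is left out entirely. The data classes carry no logic, and evaluation lives in a separate method, in line with the separation described above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical filter data model with a separate evaluator. */
final class CsvFilterSketch {
  /** Marker type for pushdownable predicates; holds no evaluation logic. */
  interface CsvFilter {}

  /** column = literal, or column <> literal when negated is true. */
  static final class Compare implements CsvFilter {
    final int column;
    final boolean negated;
    final String literal;

    Compare(int column, boolean negated, String literal) {
      this.column = column;
      this.negated = negated;
      this.literal = literal;
    }
  }

  static final class And implements CsvFilter {
    final List<CsvFilter> operands;

    And(CsvFilter... operands) {
      this.operands = Arrays.asList(operands);
    }
  }

  static final class Or implements CsvFilter {
    final List<CsvFilter> operands;

    Or(CsvFilter... operands) {
      this.operands = Arrays.asList(operands);
    }
  }

  /** Evaluates a filter tree against one raw CSV row. */
  static boolean eval(CsvFilter filter, String[] row) {
    if (filter instanceof Compare) {
      Compare c = (Compare) filter;
      String value = c.column < row.length ? row[c.column] : null;
      if (value == null || c.literal == null) {
        return false; // assumption: comparisons with null never match
      }
      return value.equals(c.literal) != c.negated;
    }
    if (filter instanceof And) {
      for (CsvFilter f : ((And) filter).operands) {
        if (!eval(f, row)) {
          return false;
        }
      }
      return true;
    }
    if (filter instanceof Or) {
      for (CsvFilter f : ((Or) filter).operands) {
        if (eval(f, row)) {
          return true;
        }
      }
      return false;
    }
    throw new IllegalArgumentException("unknown filter: " + filter);
  }
}
```

Because this sketch compares raw strings, it inherits the 0.0 versus 0 problem raised earlier in the review; a typed comparison would be needed to avoid it.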

Contributor

Calcite already includes a compiler which generates the enumerable code, why can't the same compiler generate the filter implementation as a compiled Java function? Then you can support arbitrary functions.
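The sketch below does not use Calcite's code generator. As a dependency-free stand-in, it only illustrates the general idea behind the suggestion: build an executable function once, at planning time, so that the per-row work is a plain function call rather than interpretation of a filter description. The class name, the factory methods, and the sample rows are all hypothetical.

```java
import java.util.function.Predicate;

/** Builds row predicates up front from simple building blocks (hypothetical). */
final class CompiledPredicateSketch {
  /** Predicate for "column = literal" on an already-converted row. */
  static Predicate<Object[]> equalsLiteral(int column, Object literal) {
    return row -> column < row.length && literal.equals(row[column]);
  }

  /** Predicate for "column > literal"; values of another type never match. */
  static <T extends Comparable<T>> Predicate<Object[]> greaterThan(
      int column, T literal, Class<T> type) {
    return row -> column < row.length
        && type.isInstance(row[column])
        && type.cast(row[column]).compareTo(literal) > 0;
  }

  public static void main(String[] args) {
    // Roughly "empno > 110 AND deptno = 20", composed once and reused per row.
    Predicate<Object[]> filter =
        greaterThan(0, Integer.valueOf(110), Integer.class)
            .and(equalsLiteral(2, Integer.valueOf(20)));
    System.out.println(filter.test(new Object[] {120, "Wilma", 20})); // prints true
    System.out.println(filter.test(new Object[] {100, "Fred", 20})); // prints false
  }
}
```

Unlike a fixed set of operators, a generated function can in principle cover any expression the planner can translate, which is the advantage the reviewer is pointing at.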

sql("model-with-custom-table", sql).ok();
}

/** Test case for

Contributor

It would be nice to have higher coverage in terms of SQL types for columns.

Copilot AI review requested due to automatic review settings July 2, 2026 22:11

Copilot AI left a comment

Pull request overview

This PR adds planner-rule-based filter pushdown for the file adapter’s CSV tables by carrying a pushed-down filter condition inside CsvTableScan and compiling it into a runtime predicate during enumerable implementation. It also strengthens CSV row conversion and expands tests to validate predicate behavior (including null-handling and short rows).

Changes:

  • Added CsvFilterTableScanRule and CsvProjectFilterTableScanRule, and registered them via FileRules / CsvTableScan#register.
  • Extended CsvTableScan to carry a @Nullable RexNode condition, emit it in EXPLAIN, adjust costing, and apply it via a compiled Predicate1 in implement.
  • Updated CsvEnumerator conversion and added tests around missing fields and equality/null semantics, plus additional plan/result coverage in adapter/example tests.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| file/src/test/java/org/apache/calcite/adapter/file/FileAdapterTest.java | Adds/updates tests for pushdown behavior, plans, and null/equality semantics. |
| file/src/test/java/org/apache/calcite/adapter/file/CsvEnumeratorTest.java | Adds a test covering conversion when CSV rows are shorter than the projected schema. |
| file/src/main/java/org/apache/calcite/adapter/file/FileRules.java | Registers new planner rules and documents their intent. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvTableScan.java | Stores pushed-down filter condition and applies it during enumerable implementation. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvProjectTableScanRule.java | Adjusts projection pushdown mapping through existing scan.fields and adds a condition guard. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvProjectFilterTableScanRule.java | New rule to push filter into scan and remap input refs for project/filter when combined. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvFilterTableScanRule.java | New rule to push LogicalFilter condition into CsvTableScan. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvEnumerator.java | Makes converter reusable, adds safer field access, adds objectsEqual, and modifies filter evaluation loop. |
| example/csv/src/test/java/org/apache/calcite/test/CsvTest.java | Adds example tests validating equality semantics with nulls under filterable model. |
Comments suppressed due to low confidence (1)

file/src/main/java/org/apache/calcite/adapter/file/CsvEnumerator.java:318

  • Filtering uses strings[i] while iterating up to filterValues.size(). If a CSV row has fewer columns than the schema (which this PR now explicitly supports via field(strings, idx)), this will throw ArrayIndexOutOfBoundsException during filtering. Use the safe field(...) accessor (and treat missing fields as non-matching when a filter value is required).
        if (filterValues != null) {
          for (int i = 0; i < filterValues.size(); i++) {
            String filterValue = filterValues.get(i);
            if (filterValue != null) {
              if (!filterValue.equals(strings[i])) {
                continue outer;
              }
            }
          }
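A possible shape for the fix is sketched below. The name `field(strings, idx)` comes from the comment above, but its body here is a guess, and the surrounding `matches` method and class name are hypothetical. A missing field is treated as non-matching whenever a filter value is required for that column.

```java
import java.util.Arrays;
import java.util.List;

/** Filter loop that tolerates rows shorter than the schema (hypothetical). */
final class SafeFieldAccessSketch {
  /** Returns the field at idx, or null if the row has no such column. */
  static String field(String[] strings, int idx) {
    return idx >= 0 && idx < strings.length ? strings[idx] : null;
  }

  /** Returns whether the row satisfies every non-null filter value. */
  static boolean matches(String[] strings, List<String> filterValues) {
    if (filterValues == null) {
      return true;
    }
    for (int i = 0; i < filterValues.size(); i++) {
      String filterValue = filterValues.get(i);
      // equals(null) is false, so a missing field fails a required filter.
      if (filterValue != null && !filterValue.equals(field(strings, i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> filters = Arrays.asList(null, null, "20");
    // The row below has only two columns; indexing strings[2] would throw.
    System.out.println(matches(new String[] {"100", "Fred"}, filters)); // prints false
  }
}
```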

Comment on lines +157 to +163
RexToLixTranslator.translateCondition(
program,
implementor.getTypeFactory(),
builder,
inputGetter,
null,
implementor.getConformance());
Comment on lines +28 to +40
/** Rule that matches a {@link org.apache.calcite.rel.core.Filter} on
* a {@link CsvTableScan} and pushes arbitrary predicates into the scan.
* Any {@link org.apache.calcite.rex.RexNode} condition is compiled at plan
* time via {@link org.apache.calcite.adapter.enumerable.RexToLixTranslator}
* into a {@link org.apache.calcite.linq4j.function.Predicate1}. */
public static final CsvFilterTableScanRule FILTER_SCAN =
CsvFilterTableScanRule.Config.DEFAULT.toRule();

/** Rule that matches a {@link org.apache.calcite.rel.core.Project} on
* a {@link org.apache.calcite.rel.core.Filter} on a {@link CsvTableScan}
* and pushes down simple equality predicates. */
public static final CsvProjectFilterTableScanRule PROJECT_FILTER_SCAN =
CsvProjectFilterTableScanRule.Config.DEFAULT.toRule();
Comment on lines +459 to +462
@Test void testNonPushableFilterRemains() {
// empno > 110 is a range filter; under the compiler-based filter pushdown
// it is pushed down into the scan, leaving only the projection on top.
final String sql = "select name from EMPS where empno > 110";
3 participants