[SPARK-57143][SQL][TESTS] Extend SQL test coverage for grouping analytics #56202

Closed
vladimirg-db wants to merge 1 commit into
apache:master from
vladimirg-db:import-grouping-analytics-goldens

@vladimirg-db
Contributor

What changes were proposed in this pull request?

This PR extends group-analytics.sql with additional query-level coverage for GROUPING SETS / CUBE / ROLLUP, exercising scenarios that were previously under-covered:

  • grouping_id() (no-arg and explicit-arg) across GROUPING SETS, CUBE, and ROLLUP.
  • Lateral column aliases that reference grouping() / grouping_id() results.
  • Aggregate functions in HAVING and ORDER BY over grouping analytics (including rolled-up groups and aggregate arguments that are also grouping keys).
  • Expression grouping keys, SELECT * with CUBE, and ordinal references inside ROLLUP / GROUPING SETS.
  • Struct field access inside aggregates over grouping analytics.
  • Scalar / EXISTS / NOT IN subqueries combined with grouping analytics.

The input data is defined as temporary views and each query is formatted multi-line for readability.
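
To make the `grouping_id()` bullet concrete, here is a minimal pure-Python sketch of what `GROUP BY ROLLUP(a, b)` with `grouping_id()` computes. It is an illustration only, not code from this PR or from Spark: the helper names (`rollup_sets`, `grouping_id`, `rollup_sum`) and the sample rows are hypothetical. It assumes the documented Spark convention that `grouping_id()` is a bit vector over the grouping columns, first column most significant, with a bit set to 1 when that column is rolled up in the current row.

```python
from collections import defaultdict


def rollup_sets(keys):
    """ROLLUP(k1, ..., kn) expands to the n + 1 prefixes of the key list."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]


def grouping_id(keys, grouping_set):
    """Bit vector over `keys`, first key most significant.

    A bit is 1 when the key is NOT in the current grouping set (rolled up).
    """
    gid = 0
    for key in keys:
        gid = (gid << 1) | (0 if key in grouping_set else 1)
    return gid


def rollup_sum(rows, keys, value):
    """Emulate SELECT keys..., sum(value), grouping_id() ... GROUP BY ROLLUP(keys)."""
    out = []
    for gset in rollup_sets(keys):
        totals = defaultdict(int)
        for row in rows:
            # Keys outside the grouping set are reported as NULL (None).
            group = tuple(row[k] if k in gset else None for k in keys)
            totals[group] += row[value]
        gid = grouping_id(keys, gset)
        out.extend((*group, total, gid) for group, total in totals.items())
    return out


# Hypothetical sample data, not taken from group-analytics.sql.
rows = [
    {"a": 1, "b": 1, "v": 10},
    {"a": 1, "b": 2, "v": 20},
    {"a": 2, "b": 1, "v": 30},
]
result = rollup_sum(rows, ["a", "b"], "v")
for r in result:
    print(r)  # (a, b, sum(v), grouping_id)
```

With this data the detail rows carry `grouping_id` 0, the per-`a` subtotals carry 1, and the grand-total row `(None, None, 60, 3)` carries 3. The sketch covers only ROLLUP; CUBE and explicit GROUPING SETS would differ only in which grouping sets are enumerated.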

Why are the changes needed?

These combinations (notably aggregate functions in HAVING/ORDER BY over rolled-up groups, lateral column aliases over grouping functions, and struct field access) were not covered by the existing golden tests. Locking down the current, correct behavior guards against regressions.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Golden files were regenerated with
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z group-analytics.sql", and the suite passes.

Was this patch authored or co-authored using generative AI tooling?

Yes.

Co-authored-by: Claude

vladimirg-db force-pushed the import-grouping-analytics-goldens branch 2 times, most recently from 0c9f5d3 to 15195c7, on May 29, 2026 12:55
@vladimirg-db
Contributor Author

vladimirg-db changed the title from "[SPARK-57143][SQL][TESTS] Add SQL test coverage for grouping analytics" to "[SPARK-57143][SQL][TESTS] Extend SQL test coverage for grouping analytics" on May 29, 2026
vladimirg-db force-pushed the import-grouping-analytics-goldens branch 2 times, most recently from 6fb5e26 to 94206df, on May 29, 2026 13:01
### What changes were proposed in this pull request?

This PR extends `group-analytics.sql` with additional query-level coverage for
GROUPING SETS / CUBE / ROLLUP, exercising combinations that were previously
under-covered:

- Aggregate functions in `HAVING` and `ORDER BY` over grouping analytics,
  including filtering/sorting rolled-up groups and aggregate arguments that are
  also grouping keys.
- The no-argument `grouping_id()` function and lateral column aliases that
  reference `grouping()` / `grouping_id()` results.
- `DISTINCT` aggregates and aggregate `FILTER (WHERE ...)` over grouping analytics,
  and a grouping function combined with an aggregate predicate in `HAVING`.
- Struct field access inside aggregates over grouping analytics.
- Uncorrelated subqueries (scalar / `IN` in the SELECT list, `IN` / `EXISTS` /
  `NOT IN` in `WHERE`) combined with grouping analytics, including the `NULL`
  grouping key of the grand-total row on the left side of `IN`.
- Multiple, nested, and complex subqueries with grouping analytics: subqueries in
  several clauses at once, subqueries nested inside subqueries, subqueries whose
  inner query itself uses grouping analytics, subquery values combined with
  aggregates, and pre-aggregation correlated subqueries in `WHERE`.
- Ordinal references inside ROLLUP / GROUPING SETS, a wide (34-column) grouping
  set, and the empty grouping set.
- Negative cases: `grouping()` / `grouping_id()` on a non-grouping column, and a
  window function in `GROUP BY`.

Input data is defined as temporary views, each query is formatted multi-line for
readability, and all temporary views are dropped at the end of the file.
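
As a hedged illustration of the "`NULL` grouping key on the left side of `IN`" case from the list above, the sketch below uses SQLite from the Python standard library as a stand-in engine. This is an assumption-laden demo, not a query from `group-analytics.sql`: the table `t`, its columns, and its rows are hypothetical, and because SQLite has no `ROLLUP`, the rollup is emulated with `UNION ALL`. It relies only on standard SQL three-valued logic, under which `NULL IN (non-empty list)` evaluates to `NULL` rather than true or false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t (k INTEGER, v INTEGER);
    INSERT INTO t VALUES (1, 10), (2, 20);
    """
)

# Emulation of GROUP BY ROLLUP(k): per-key groups plus a grand-total row
# whose grouping key is NULL.
rollup = """
    SELECT k, SUM(v) AS total FROM t GROUP BY k
    UNION ALL
    SELECT NULL AS k, SUM(v) AS total FROM t
"""

all_rows = conn.execute(
    f"SELECT k, total FROM ({rollup}) ORDER BY total"
).fetchall()

# Filtering on the grouping key with IN: the grand-total row has k = NULL,
# so the predicate is NULL (unknown) and the row is not kept.
kept = conn.execute(
    f"SELECT k, total FROM ({rollup}) WHERE k IN (SELECT k FROM t) ORDER BY total"
).fetchall()

null_in = conn.execute("SELECT NULL IN (1, 2)").fetchone()[0]

print(all_rows)  # includes the (None, 30) grand-total row
print(kept)      # grand-total row is filtered out
print(null_in)   # None, i.e. SQL NULL
```

The point of pinning such a case in a golden file is that the grand-total row silently disappears under an `IN` filter on the grouping key, which is easy to get wrong in an analyzer or optimizer rewrite.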

### Why are the changes needed?

These combinations were not covered by the existing golden tests. Locking down
the current behavior guards against regressions.

### Does this PR introduce any user-facing change?

No. Test-only change.

### How was this patch tested?

Golden files were regenerated with
`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z group-analytics.sql"`,
and the suite passes.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
vladimirg-db force-pushed the import-grouping-analytics-goldens branch from 94206df to 639ef42 on May 29, 2026 13:41
@dtenedor
Contributor

LGTM, merging to master and 4.x

dtenedor closed this in 54dbb38 on May 29, 2026
dtenedor pushed a commit that referenced this pull request on May 29, 2026
### What changes were proposed in this pull request?

This PR extends `group-analytics.sql` with additional query-level coverage for GROUPING SETS / CUBE / ROLLUP, exercising scenarios that were previously under-covered:

- `grouping_id()` (no-arg and explicit-arg) across GROUPING SETS, CUBE, and ROLLUP.
- Lateral column aliases that reference `grouping()` / `grouping_id()` results.
- Aggregate functions in `HAVING` and `ORDER BY` over grouping analytics (including rolled-up groups and aggregate arguments that are also grouping keys).
- Expression grouping keys, `SELECT *` with CUBE, and ordinal references inside ROLLUP / GROUPING SETS.
- Struct field access inside aggregates over grouping analytics.
- Scalar / EXISTS / NOT IN subqueries combined with grouping analytics.

The input data is defined as temporary views and each query is formatted multi-line for readability.

### Why are the changes needed?

These combinations (notably aggregate functions in HAVING/ORDER BY over rolled-up groups, lateral column aliases over grouping functions, and struct field access) were not covered by the existing golden tests. Locking down the current, correct behavior guards against regressions.

### Does this PR introduce any user-facing change?

No. Test-only change.

### How was this patch tested?

Golden files regenerated with
`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z group-analytics.sql"` and the suite passes.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Co-authored-by: Claude

Closes #56202 from vladimirg-db/import-grouping-analytics-goldens.

Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
(cherry picked from commit 54dbb38)
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>