[SPARK-57020][PYTHON][TESTS] Add ASV microbenchmark for SQL_TRANSFORM_WITH_STATE_PANDAS_UDF by Yicong-Huang · Pull Request #56192 · apache/spark

Yicong-Huang · 2026-05-28T20:52:54Z

What changes were proposed in this pull request?

Add an ASV microbenchmark for SQL_TRANSFORM_WITH_STATE_PANDAS_UDF to bench_eval_type.py.

A stub TCP listener (_StubStateServer) accepts the socket connection that StatefulProcessorApiClient opens. The benchmark UDFs never call any state API, so no protocol exchange beyond the connect is needed.
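For readers unfamiliar with the pattern, a stub listener of this kind can be sketched as below. This is a minimal illustration only, not the PR's _StubStateServer: the class name `StubListener`, its attributes, and the 0.1 s accept timeout are all assumptions made for the example. It binds to an OS-assigned port, accepts connections in a background thread, and never sends or reads any bytes.

```python
import socket
import threading


class StubListener:
    """Hypothetical stand-in for a state server: accepts TCP connections, never replies."""

    def __init__(self, host: str = "127.0.0.1") -> None:
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind((host, 0))  # port 0 lets the OS pick a free port
        self._sock.listen()
        self._sock.settimeout(0.1)  # lets the accept loop notice the stop flag
        self.port = self._sock.getsockname()[1]
        self._conns = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._accept_loop, daemon=True)
        self._thread.start()

    def _accept_loop(self) -> None:
        while not self._stop.is_set():
            try:
                conn, _ = self._sock.accept()
            except socket.timeout:
                continue  # no client yet; re-check the stop flag
            except OSError:
                return  # listening socket was closed
            # Hold the connection open so the client does not observe EOF.
            self._conns.append(conn)

    def close(self) -> None:
        self._stop.set()
        self._thread.join(timeout=1.0)
        for conn in self._conns:
            conn.close()
        self._sock.close()
```

A client would connect with `socket.create_connection(("127.0.0.1", server.port))` and call `server.close()` during teardown. Because the stub never speaks any protocol, this only works while the code under test does nothing beyond connecting.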

Scenarios cover few/many groups, small/large group sizes, and wide columns. UDFs: identity_udf, sort_udf, count_udf.
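The scenario-by-UDF grid maps onto ASV's parameterized benchmarks, where a class declares `params` and `param_names` and ASV calls `setup` plus each `time_*` or `peakmem_*` method once per combination. The sketch below shows only that shape. The scenario sizes, the pure-Python row format, and the three toy functions are assumptions for illustration and do not reproduce the PR's pandas UDFs, its Arrow data, or the worker code path. The PR also reports separate time and peak-memory classes, while this sketch combines both methods in one class for brevity.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]  # (group key, value); a hypothetical row layout

# Hypothetical scenario sizes: (number of groups, rows per group).
SCENARIOS: Dict[str, Tuple[int, int]] = {
    "few_groups_sm": (10, 100),
    "many_groups_sm": (1000, 10),
}


def identity_udf(rows: List[Row]) -> List[Row]:
    return rows  # pass the group through unchanged


def sort_udf(rows: List[Row]) -> List[Row]:
    return sorted(rows, key=lambda r: r[1])  # order the group by value


def count_udf(rows: List[Row]) -> List[Row]:
    # Collapse the group to a single (key, row count) row.
    return [(rows[0][0], float(len(rows)))] if rows else []


UDFS: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "identity_udf": identity_udf,
    "sort_udf": sort_udf,
    "count_udf": count_udf,
}


def build_groups(num_groups: int, rows_per_group: int) -> List[List[Row]]:
    # Deterministic pseudo-data so repeated runs see identical input.
    return [
        [(g, float((g * 31 + i * 17) % 101)) for i in range(rows_per_group)]
        for g in range(num_groups)
    ]


class ToyGroupedUDFBench:
    # ASV runs every combination of these two parameter lists.
    params = [list(SCENARIOS), list(UDFS)]
    param_names = ["scenario", "udf"]

    def setup(self, scenario: str, udf: str) -> None:
        self.groups = build_groups(*SCENARIOS[scenario])
        self.func = UDFS[udf]

    def _run(self) -> int:
        # Apply the function per group and count the output rows.
        return sum(len(self.func(group)) for group in self.groups)

    def time_grouped_udf(self, scenario: str, udf: str) -> None:
        self._run()

    def peakmem_grouped_udf(self, scenario: str, udf: str) -> None:
        self._run()
```

Keeping data construction in `setup` matters because ASV excludes `setup` from the timed region, so only the per-group function calls are measured.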

Why are the changes needed?

Part of SPARK-55724. Establishes a performance baseline before refactoring SQL_TRANSFORM_WITH_STATE_PANDAS_UDF.

Does this PR introduce any user-facing change?

No

How was this patch tested?

COLUMNS=120 ./python/asv run --python=same --bench "TransformWithStatePandas" -a "repeat=(3,5,5.0)" (one of two stable runs):

TransformWithStatePandasUDFTimeBench:

================ ============== ============== ==============
--                                   udf
---------------- --------------------------------------------
    scenario      identity_udf     sort_udf       count_udf
================ ============== ============== ==============
 few_groups_sm     393+/-1ms     404+/-0.7ms     380+/-2ms
 few_groups_lg    3.68+/-0.01s   3.80+/-0.01s   3.46+/-0.01s
 many_groups_sm   3.34+/-0.01s   3.62+/-0.02s   2.86+/-0.01s
 many_groups_lg   1.90+/-0.01s   1.98+/-0.01s    1.77+/-0s
   wide_cols      3.71+/-0.01s   3.79+/-0.02s   3.40+/-0.01s
================ ============== ============== ==============

TransformWithStatePandasUDFPeakmemBench:

================ ============== ========== ===========
--                                udf
---------------- -------------------------------------
    scenario      identity_udf   sort_udf   count_udf
================ ============== ========== ===========
 few_groups_sm        486M         486M        476M
 few_groups_lg        569M         579M        541M
 many_groups_sm       511M         512M        492M
 many_groups_lg       510M         510M        492M
   wide_cols          619M         610M        585M
================ ============== ========== ===========

Was this patch authored or co-authored using generative AI tooling?

No

funrollloops

I'm kind of confused about what this actually benchmarks. We seem to manipulate the data directly in the UDF functions here, so the logic to invoke the transform with state user function is never executed.

Are the code paths you are benchmarking different between the different UDF types?

funrollloops · 2026-05-28T21:51:14Z

+            np.repeat(np.arange(num_groups, dtype=np.int32), rows_per_group),
+            type=pa.int32(),
+        )
+        value_pool = MockDataFactory.NUMERIC_TYPES


Sure you don't want to add non-numeric types? Maybe some cases with nested types (arrays, structs, etc)?

funrollloops · 2026-05-28T22:31:59Z

+        batches, schema = self._build_scenario(scenario)
+        udf_func, ret_type = self._udfs[udf_name]
+        if ret_type is None:
+            ret_type = StructType(schema.fields[self._NUM_KEY_COLS :])


nit: we typically see the keys included in the output schema for transform with state.

test: add ASV microbenchmark for SQL_TRANSFORM_WITH_STATE_PANDAS_UDF

3af5015

funrollloops reviewed May 28, 2026

Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-57020/bench/tws-pandas
