[SPARK-57103][SQL] Add hashing for nanosecond timestamp types by stevomitric · Pull Request #56203 · apache/spark

stevomitric · 2026-05-29T13:11:17Z

What changes were proposed in this pull request?

Add hashing support for the nanosecond timestamp types TimestampNTZNanosType(p) and TimestampLTZNanosType(p), in both the interpreted and codegen paths of hash.scala:

Murmur3Hash / XxHash64: mix both fields, following the existing CalendarInterval pattern - hashInt(nanosWithinMicro, hashLong(epochMicros, seed)).
HiveHash: a dedicated hashTimestampNanos that extends the existing hashTimestamp with the sub-microsecond nanoseconds using the same * 37 + field idiom as hashCalendarInterval.

Why are the changes needed?

hash-based GROUP BY / DISTINCT / joins - failed on nanosecond timestamp columns.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

Add Murmur3/XxHash64/HiveHash support (interpreted + codegen) for TimestampNTZNanosType and TimestampLTZNanosType, whose physical value is TimestampNanosVal (epochMicros + nanosWithinMicro). Murmur3/XxHash64 mix both fields following the CalendarInterval precedent; HiveHash adds a dedicated hashTimestampNanos extending hashTimestamp. New HashExpressionsSuite tests cover interpreted/codegen/unsafe agreement and equality consistency. Co-authored-by: Isaac

stevomitric · 2026-05-29T15:52:10Z

cc @MaxGekk please take a look at this PR.

MaxGekk

Thanks for the PR! The implementation looks correct and the new suite passes locally (interpreted / codegen / optimized paths agree for Murmur3, XxHash64 and HiveHash across both nanos types at precisions 7 and 9). My main feedback is about test coverage of the real code path that motivates this change (hash-based GROUP BY / shuffle / joins), plus a couple of smaller notes. Inline comments below.

MaxGekk · 2026-05-29T19:22:28Z

+
+    Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+        TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+      (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit =>


All hash inputs here are Literals, so the generated code embeds the TimestampNanosVal as a reference object and the ordinal-read codegen path (CodeGenerator.getValue -> getTimestampNTZNanos/getTimestampLTZNanos) is never exercised, and the value never round-trips through an UnsafeRow as a hash input. checkEvaluationWithUnsafeProjection here only projects the resulting int/long hash, not the nanos input.

Since the motivation is hash-based GROUP BY / shuffle / joins (where the input is a BoundReference reading from a possibly-unsafe row), could we add a BoundReference-over-row case, e.g.:

val row = InternalRow(TimestampNanosVal.fromParts(1234567890L, 999.toShort)) val ref = BoundReference(0, dt, nullable = true) checkEvaluation(Murmur3Hash(Seq(ref), 42), Murmur3Hash(Seq(ref), 42).eval(row), row)

This drives the row read + unsafe round-trip that the literal-based tests skip.

MaxGekk · 2026-05-29T19:22:28Z

+    Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9),
+        TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+      (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit =>
+        // checkEvaluation asserts the interpreted, codegen, and unsafe paths all agree.


Minor: with a Literal child, the unsafe projection only stores the scalar hash result, not the TimestampNanosVal input, so the unsafe input path isn't actually covered here. Either reword this comment or add the BoundReference-over-row case suggested above so the comment becomes accurate.

MaxGekk · 2026-05-29T19:22:29Z

+        TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt =>
+      (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit =>
+        // checkEvaluation asserts the interpreted, codegen, and unsafe paths all agree.
+        checkEvaluation(Murmur3Hash(Seq(lit), 42), Murmur3Hash(Seq(lit), 42).eval())


The expected value is Murmur3Hash(Seq(lit), 42).eval(), i.e. computed by the same expression under test, so this only proves the eval paths agree with each other (and, via the second test, that both fields contribute). A bug shared across all paths (e.g. a wrong constant, or a symmetric field swap) wouldn't be caught. The existing tests in this suite pin literals (e.g. checkEvaluation(HiveHash(Seq(time)), -1567775210)). Could we pin at least one golden constant per algorithm for a fixed (epochMicros, nanosWithinMicro) pair?

stevomitric force-pushed the stevomitric/SPARK-57103-hash branch from 6fc3185 to ff23dd1 Compare May 29, 2026 13:14

MaxGekk reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57103][SQL] Add hashing for nanosecond timestamp types#56203

[SPARK-57103][SQL] Add hashing for nanosecond timestamp types#56203
stevomitric wants to merge 1 commit into
apache:masterfrom
stevomitric:stevomitric/SPARK-57103-hash

stevomitric commented May 29, 2026

Uh oh!

stevomitric commented May 29, 2026

Uh oh!

MaxGekk left a comment •

edited

Loading

Uh oh!

MaxGekk May 29, 2026

Uh oh!

MaxGekk May 29, 2026

Uh oh!

MaxGekk May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stevomitric commented May 29, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

stevomitric commented May 29, 2026

Uh oh!

MaxGekk left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MaxGekk May 29, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk May 29, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MaxGekk left a comment •

edited

Loading