[SPARK-57103][SQL] Add hashing for nanosecond timestamp types#56203
[SPARK-57103][SQL] Add hashing for nanosecond timestamp types#56203stevomitric wants to merge 1 commit into
Conversation
Add Murmur3/XxHash64/HiveHash support (interpreted + codegen) for TimestampNTZNanosType and TimestampLTZNanosType, whose physical value is TimestampNanosVal (epochMicros + nanosWithinMicro). Murmur3/XxHash64 mix both fields following the CalendarInterval precedent; HiveHash adds a dedicated hashTimestampNanos extending hashTimestamp. New HashExpressionsSuite tests cover interpreted/codegen/unsafe agreement and equality consistency. Co-authored-by: Isaac
6fc3185 to
ff23dd1
Compare
|
cc @MaxGekk please take a look at this PR. |
There was a problem hiding this comment.
Thanks for the PR! The implementation looks correct and the new suite passes locally (interpreted / codegen / optimized paths agree for Murmur3, XxHash64 and HiveHash across both nanos types at precisions 7 and 9). My main feedback is about test coverage of the real code path that motivates this change (hash-based GROUP BY / shuffle / joins), plus a couple of smaller notes. Inline comments below.
|
|
||
| Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9), | ||
| TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt => | ||
| (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit => |
There was a problem hiding this comment.
All hash inputs here are Literals, so the generated code embeds the TimestampNanosVal as a reference object and the ordinal-read codegen path (CodeGenerator.getValue -> getTimestampNTZNanos/getTimestampLTZNanos) is never exercised, and the value never round-trips through an UnsafeRow as a hash input. checkEvaluationWithUnsafeProjection here only projects the resulting int/long hash, not the nanos input.
Since the motivation is hash-based GROUP BY / shuffle / joins (where the input is a BoundReference reading from a possibly-unsafe row), could we add a BoundReference-over-row case, e.g.:
val row = InternalRow(TimestampNanosVal.fromParts(1234567890L, 999.toShort))
val ref = BoundReference(0, dt, nullable = true)
checkEvaluation(Murmur3Hash(Seq(ref), 42), Murmur3Hash(Seq(ref), 42).eval(row), row)This drives the row read + unsafe round-trip that the literal-based tests skip.
| Seq(TimestampNTZNanosType(9), TimestampLTZNanosType(9), | ||
| TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt => | ||
| (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit => | ||
| // checkEvaluation asserts the interpreted, codegen, and unsafe paths all agree. |
There was a problem hiding this comment.
Minor: with a Literal child, the unsafe projection only stores the scalar hash result, not the TimestampNanosVal input, so the unsafe input path isn't actually covered here. Either reword this comment or add the BoundReference-over-row case suggested above so the comment becomes accurate.
| TimestampNTZNanosType(7), TimestampLTZNanosType(7)).foreach { dt => | ||
| (values.map(Literal.create(_, dt)) :+ Literal.create(null, dt)).foreach { lit => | ||
| // checkEvaluation asserts the interpreted, codegen, and unsafe paths all agree. | ||
| checkEvaluation(Murmur3Hash(Seq(lit), 42), Murmur3Hash(Seq(lit), 42).eval()) |
There was a problem hiding this comment.
The expected value is Murmur3Hash(Seq(lit), 42).eval(), i.e. computed by the same expression under test, so this only proves the eval paths agree with each other (and, via the second test, that both fields contribute). A bug shared across all paths (e.g. a wrong constant, or a symmetric field swap) wouldn't be caught. The existing tests in this suite pin literals (e.g. checkEvaluation(HiveHash(Seq(time)), -1567775210)). Could we pin at least one golden constant per algorithm for a fixed (epochMicros, nanosWithinMicro) pair?
What changes were proposed in this pull request?
Add hashing support for the nanosecond timestamp types
TimestampNTZNanosType(p)andTimestampLTZNanosType(p), in both the interpreted and codegen paths of hash.scala:hashInt(nanosWithinMicro, hashLong(epochMicros, seed)).hashTimestampNanosthat extends the existinghashTimestampwith the sub-microsecond nanoseconds using the same * 37 + field idiom ashashCalendarInterval.Why are the changes needed?
hash-based GROUP BY / DISTINCT / joins - failed on nanosecond timestamp columns.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit tests.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7