[WIP][SPARK-57032][SQL] Extend timestamp string parsing for nanosecond fractional precision #56205
Open
MaxGekk wants to merge 4 commits into
…ctional precision

### What changes were proposed in this pull request?

Extend `SparkDateTimeUtils.parseTimestampString` to preserve fractional-second digits 7-9 in a new output-only slot `segments(9)` (sub-microsecond remainder in [0, 999]), while keeping `segments(6)` as microseconds so all existing callers are unaffected.

Add package-private parse entry points that return a normalized `TimestampNanosVal` for `TIMESTAMP_NTZ(p)`/`TIMESTAMP_LTZ(p)` with `p` in [7, 9]: `stringToTimestampNTZNanos`, `stringToTimestampLTZNanos`, and their ANSI variants.

Fractional digits beyond the target precision `p` are truncated toward zero, consistent with the existing microsecond parsing behavior.

### Why are the changes needed?

This is the first sub-task of the nanosecond datetime conversion utilities under SPARK-56822 (SPIP: Timestamps with nanosecond precision). Without it, timestamp strings with 7-9 fractional digits cannot be converted to the nanosecond-capable composite representation (epochMicros + nanosWithinMicro).

### Does this PR introduce any user-facing change?

No. Existing `TimestampType`/`TimestampNTZType` string parsing is unchanged; the new parse APIs are package-private and not yet wired to user-facing casts.

### How was this patch tested?

Added `TimestampNanosParseSuite` covering 7/8/9-digit fractions, per-precision truncation, NTZ/LTZ, zone suffixes, range edge cases, and ANSI errors. Verified existing `DateTimeUtilsSuite` and `TimestampFormatterSuite` still pass.
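The digit split and the per-precision truncation described in this commit message can be illustrated with a minimal, self-contained Scala sketch. This is not the code from the PR: `NanosFractionSketch` and `splitFraction` are hypothetical names, the body of `truncateNanosWithinMicro` is only a guess at the behavior the description states, and `IllegalArgumentException` stands in for whatever error the real helper raises.

```scala
// Illustrative sketch only; not the implementation in SparkDateTimeUtils.
object NanosFractionSketch {

  /** Splits a fractional-second digit string into (micros, nanosWithinMicro).
    * Digits 1-6 become microseconds, digits 7-9 become the sub-microsecond
    * remainder, and digits beyond the 9th are dropped. Missing digits count as zeros.
    */
  def splitFraction(digits: String): (Int, Int) = {
    require(digits.forall(c => c >= '0' && c <= '9'), s"not a digit string: $digits")
    // Right-pad with zeros, then keep exactly nine digits.
    val nine = (digits + "000000000").take(9)
    (nine.substring(0, 6).toInt, nine.substring(6, 9).toInt)
  }

  /** Truncates the sub-microsecond remainder toward zero for precision p in [7, 9]. */
  def truncateNanosWithinMicro(nanosWithinMicro: Int, precision: Int): Int = {
    require(nanosWithinMicro >= 0 && nanosWithinMicro <= 999,
      s"remainder out of range: $nanosWithinMicro")
    precision match {
      case 7 => (nanosWithinMicro / 100) * 100 // keep only the 7th fractional digit
      case 8 => (nanosWithinMicro / 10) * 10   // keep the 7th and 8th digits
      case 9 => nanosWithinMicro               // keep all three digits
      case other =>
        // Assumption: a plain exception stands in for an internal error here.
        throw new IllegalArgumentException(s"unsupported precision: $other")
    }
  }
}
```

With the fraction `.123456789`, `splitFraction("123456789")` yields `(123456, 789)`, and truncating the remainder gives `700` at p=7, `780` at p=8, and `789` at p=9. Because the microsecond part comes from digits 1-6 alone, the precision only ever touches the remainder.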
- Fix stale `isValidDigits` comment (digits 7-9 are now retained, not truncated)
- Clarify `segments(7-8)` comment: values are written by the loop as `i` advances but never read by any caller
- Extend format-string examples in `parseTimestampString` Scaladoc to show the optional [ns][ns][ns] digits
- Add precision guard (throws `SparkException.internalError`) before the try/catch in `stringToTimestampLTZNanos` and `stringToTimestampNTZNanos`, and explicit case 9 + error fallback in `truncateNanosWithinMicro`
- Add Scaladoc to `stringToTimestampNTZNanosAnsi` noting that `allowTimeZone` defaults to true (TZ suffix is discarded, not rejected)
- New tests: null input, time-only LTZ, pre-epoch negative timestamps, out-of-range precision (checkError / INTERNAL_ERROR), ANSI NTZ TZ-discard

Co-authored-by: Isaac
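The pre-epoch case mentioned in the new tests is worth a small worked example, because the remainder stays in [0, 999] even when `epochMicros` is negative. The sketch below uses only `java.time`; `MicrosAndNanos` is a hypothetical stand-in for the composite value (it is not `TimestampNanosVal`), and reading the pair as `epochMicros * 1000 + nanosWithinMicro` nanoseconds is an assumption made for illustration.

```scala
import java.time.{LocalDateTime, ZoneOffset}

// Illustrative sketch only; MicrosAndNanos is a hypothetical type.
object CompositeNanosSketch {

  final case class MicrosAndNanos(epochMicros: Long, nanosWithinMicro: Int)

  /** Converts a local date-time, read as UTC, into the composite pair. */
  def fromLocalDateTime(ldt: LocalDateTime): MicrosAndNanos = {
    val seconds = ldt.toEpochSecond(ZoneOffset.UTC) // floor-based, negative before 1970
    val nanoOfSecond = ldt.getNano                  // always in [0, 999999999]
    val epochMicros = Math.addExact(
      Math.multiplyExact(seconds, 1000000L),
      (nanoOfSecond / 1000).toLong)
    // The remainder is taken from a non-negative value, so no carry is needed.
    MicrosAndNanos(epochMicros, nanoOfSecond % 1000)
  }

  /** Assumed reading of the pair as a single nanosecond count since the epoch. */
  def toEpochNanos(v: MicrosAndNanos): Long =
    Math.addExact(Math.multiplyExact(v.epochMicros, 1000L), v.nanosWithinMicro.toLong)
}
```

For `1969-12-31T23:59:59.999999999` this produces `epochMicros = -1` and `nanosWithinMicro = 999`, which corresponds to one nanosecond before the epoch under the assumed reading.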
### What changes were proposed in this pull request?

This PR extends Spark's existing timestamp string parser to preserve fractional-second digits beyond microsecond precision, and adds package-private parse entry points that produce the nanosecond-capable composite representation for `TIMESTAMP_NTZ(p)`/`TIMESTAMP_LTZ(p)` with `p` in `[7, 9]`.

- `SparkDateTimeUtils.parseTimestampString` now retains fractional digits 7-9 in a new output-only slot `segments(9)` (the sub-microsecond remainder, a value in `[0, 999]`). `segments(6)` continues to hold microseconds (digits 1-6), so all existing callers are unaffected. Digits beyond the 9th are dropped. The parsing loop bound is pinned to `9` (the original number of parsed segments) so the new slot is never written by the loop, keeping acceptance behavior identical.
- New package-private parse entry points return a normalized `org.apache.spark.unsafe.types.TimestampNanosVal` (`epochMicros` + `nanosWithinMicro`):
  - `stringToTimestampLTZNanos(s, precision, timeZoneId)` and `stringToTimestampLTZNanosAnsi(...)`
  - `stringToTimestampNTZNanos(s, precision, allowTimeZone = true)` and `stringToTimestampNTZNanosAnsi(...)`
- A `truncateNanosWithinMicro` helper applies the target precision `p`: digits beyond `p` are truncated toward zero (consistent with the existing microsecond path, which already drops digits 7+). Since microseconds occupy fractional digits 1-6, `p` in `[7, 9]` only affects the sub-microsecond remainder.

The normalization invariant (`nanosWithinMicro` in `[0, 999]`) holds for free: the remainder is parsed as exactly the 3 sub-micro digits and `epochMicros` comes from the independent microsecond path, so no carry is needed; `TimestampNanosVal.fromParts` re-validates the range.

### Why are the changes needed?
The logical types `TimestampNTZNanosType`/`TimestampLTZNanosType`, the physical value `TimestampNanosVal`, and the `TIMESTAMP_NTZ(p)`/`TIMESTAMP_LTZ(p)` SQL syntax already exist, but string inputs with 7-9 fractional digits could not be converted to the SPIP composite representation because the parser truncated the fractional part to microseconds. This change provides the missing string-to-nanos parsing building block that downstream work (cast matrix, typed SQL literals, ingest tests) depends on.

### Does this PR introduce any user-facing change?
No. Existing `TimestampType`/`TimestampNTZType` string parsing is byte-for-byte unchanged, and the new parse APIs are package-private and not yet wired to user-facing casts or literals.

### How was this patch tested?
Added `TimestampNanosParseSuite` (in `sql/catalyst`) covering:

- … `nanosWithinMicro`;
- per-precision truncation (`.123456789` -> `700` at p=7, `780` at p=8, `789` at p=9), and digits beyond the 9th dropped;
- … `.0`, `.999999999`, trailing zeros, exactly 6 digits, `.000000001`;
- `allowTimeZone` / time-only rejection for NTZ;

Verified existing `DateTimeUtilsSuite` (including "nanoseconds truncation") and `TimestampFormatterSuite` still pass unchanged.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)