gh-150638: Improve performance of json.loads and json.load for numeric data#150639

Open: eendebakpt wants to merge 5 commits into python:main from eendebakpt:json-loads-opt


@eendebakpt eendebakpt commented May 30, 2026

_match_number_unicode() (the C accelerator behind json.loads) previously allocated a PyBytes object for every number, copied the digits into it, and then called the generic PyLong_FromString / PyFloat_FromString parsers.
This PR parses the common cases directly from the already-scanned text.

| Benchmark | main | this PR | speedup |
|---|---|---|---|
| json.loads, number-heavy document (script below) | 3.05 ms | 2.38 ms | 1.28× |
| json.load, same document via file object | 3.17 ms | 2.48 ms | 1.28× |
| pyperformance bm_json_loads | 25.2 µs | 23.9 µs | 1.05× |

The standard bm_json_loads document is string/dict-dominated, so it gains less.
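As a quick sanity check, the speedup column can be recomputed from the two timing columns. The sketch below only uses the numbers reported above; the `speedup` helper and the `timings` dict are illustrative names, not part of the PR.

```python
# Timings copied from the benchmark table: (main, this PR), same unit within a row.
timings = {
    "json.loads, number-heavy document": (3.05, 2.38),  # ms
    "json.load, same document via file": (3.17, 2.48),  # ms
    "pyperformance bm_json_loads": (25.2, 23.9),        # µs
}


def speedup(baseline, patched):
    """Ratio of baseline time to patched time (values above 1 mean faster)."""
    return baseline / patched


for name, (main_time, pr_time) in timings.items():
    print(f"{name}: {speedup(main_time, pr_time):.2f}x")
```

The ratios round to the reported 1.28×, 1.28× and 1.05×.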

Benchmark script

```python
"""Benchmark json.loads() and json.load() on a number-heavy document.

The document is generated deterministically at import time (no external
files) and resembles a typical telemetry/API payload: a list of records
mixing integers, 19-digit timestamps, negative integers, floats, short
strings, booleans and small integer arrays.

json.load(fp) is json.loads(fp.read()); here fp is an in-memory io.StringIO
(rewound each call) so the same document is parsed without disk noise.

Inline data size: ~304 KiB (2000 records).
"""
import io
import json
import pyperf


def build_document(n=2000):
    return [
        {
            "id": i,
            "timestamp": 1_700_000_000_000_000_000 + i * 1_000,  # 19-digit int
            "value": i * 1.5 - 1000.0,                           # float
            "delta": -i,                                         # negative int
            "label": "item-%d" % i,                              # short string
            "ok": i % 2 == 0,                                    # bool
            "samples": [i, -i, i * 2, i * 3, i * 5],             # int array
        }
        for i in range(n)
    ]


JSON_DATA = json.dumps(build_document())
STREAM = io.StringIO(JSON_DATA)


def load_from_stream():
    STREAM.seek(0)
    return json.load(STREAM)


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.metadata["description"] = "json.loads()/json.load() on a number-heavy document"
    runner.bench_func("json_loads", json.loads, JSON_DATA)
    runner.bench_func("json_load", load_from_stream)
```

Add a fast path to _match_number_unicode for integers of at most 19
decimal digits, whose magnitude always fits in an unsigned long long:
accumulate the value directly instead of allocating a PyBytes and
calling the generic PyLong_FromString.  Positive values use
PyLong_FromUnsignedLongLong; negatives within long long range use
PyLong_FromLongLong; all other integers fall back to the previous path.
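
The decision logic described above can be sketched in Python. This is an illustration only, not the C code from the PR: `parse_int_fast`, `MAX_FAST_DIGITS` and `LLONG_MIN_MAGNITUDE` are hypothetical names, a 64-bit long long is assumed, and the exact treatment of the boundary value -2**63 in the real implementation is an assumption.

```python
MAX_FAST_DIGITS = 19            # 10**19 - 1 < 2**64, so the accumulator cannot overflow
LLONG_MIN_MAGNITUDE = 2**63     # |LLONG_MIN|, assuming a 64-bit long long


def parse_int_fast(text):
    """Return (value, used_fast_path) for a JSON integer literal.

    Hypothetical helper that mirrors the described logic in pure Python.
    """
    negative = text.startswith("-")
    digits = text[1:] if negative else text
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not an integer literal: {text!r}")

    if len(digits) > MAX_FAST_DIGITS:
        return int(text), False          # too long: generic parser

    acc = 0
    for ch in digits:                    # accumulate digit by digit
        acc = acc * 10 + (ord(ch) - ord("0"))

    if not negative:
        return acc, True                 # C side: PyLong_FromUnsignedLongLong
    if acc <= LLONG_MIN_MAGNITUDE:
        return -acc, True                # C side: PyLong_FromLongLong
    return int(text), False              # negative, outside long long range


print(parse_int_fast("1700000000000000000"))
print(parse_int_fast("-9223372036854775809"))
```

The 19-digit cutoff is what makes the accumulation loop safe without a per-digit overflow check; only the sign handling needs a range test.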

For floats and big integers, avoid the PyBytes allocation in the common
short case by copying the (always-ASCII) number text into a stack
buffer, and call PyOS_string_to_double directly for floats.
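
Since these changes replace the generic parsers on several paths, one way to gain confidence is to compare json.loads against int() and float() on boundary inputs. The sketch below is not part of the PR's test suite; the `CASES` list and the `reference` helper are illustrative assumptions, and the check passes on any conforming build, patched or not.

```python
import json

# Boundary literals: fast-path integers, integers that must fall back, and floats.
CASES = [
    "0", "-0", "42", "-17",
    "9999999999999999999",       # largest 19-digit integer
    "-9223372036854775808",      # -2**63
    "-9999999999999999999",      # 19 digits, outside long long range
    "1" + "0" * 40,              # big integer, well past 19 digits
    "1.5", "-0.0", "1e5", "2.5E-3", "-1000.0",
]


def reference(text):
    """Parse with the generic Python constructors, picking float by syntax."""
    is_float = any(c in text for c in ".eE")
    return float(text) if is_float else int(text)


def check(text):
    got, want = json.loads(text), reference(text)
    # repr() also distinguishes -0.0 from 0.0, which == would not.
    return type(got) is type(want) and repr(got) == repr(want)


print(all(check(text) for text in CASES))
```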

Benchmarks (optimized free-threaded build):
* pyperformance json_loads: 1.06x faster overall
* microbench: small int arrays ~2x, 20-int doc 1.48x, mixed dict 1.16x
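
The documents behind the microbenchmark figures are not shown. A rough way to run comparable measurements is sketched below with timeit; the three inputs in `DOCS` and the `time_loads` helper are hypothetical stand-ins, so the absolute numbers will not match the ones quoted above.

```python
import json
import timeit

# Hypothetical inputs, chosen only to resemble the categories named above.
DOCS = {
    "small int array": json.dumps([1, 2, 3, 4, 5]),
    "20-int doc": json.dumps(list(range(1000, 1020))),
    "mixed dict": json.dumps({"id": 7, "name": "x", "ratio": 0.25, "ok": True}),
}


def time_loads(text, number=20_000, repeat=5):
    """Best-of-`repeat` time per json.loads call, in nanoseconds."""
    best = min(timeit.repeat(lambda: json.loads(text), number=number, repeat=repeat))
    return best / number * 1e9


if __name__ == "__main__":
    for name, text in DOCS.items():
        print(f"{name}: {time_loads(text):.0f} ns per call")
```

Running the same script under the main build and the patched build gives a per-document ratio; pyperf, as used in the benchmark script, is the more robust choice for numbers meant to be reported.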

All test_json tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>