Add the SQLStorm query suite to vortex-bench (nightly-only) by mprammer · Pull Request #8165 · vortex-data/vortex

mprammer · 2026-05-29T19:41:35Z

Summary

Adds the SQLStorm query suite to vortex-bench as a nightly-only benchmark. SQLStorm is an LLM-generated SQL stress corpus; this vendors a confirmed-working sample of 500 queries — 125 each across four schemas (stackoverflow, job, tpch, tpcds) — and runs them Parquet-vs-Vortex on both DataFusion and DuckDB, structured the same way as the existing TPC-DS benchmark so it needs no runner changes. The point is coverage: these queries exercise joins, subqueries, and aggregation shapes that TPC-H and TPC-DS don't, so they stress Vortex's scan and compute breadth cheaply.

It runs only in the overnight nightly-bench.yml matrix, never the per-PR path. The four datasets are sized to sit within one order of magnitude of each other (40M–192M rows): TPC-H and TPC-DS generate their own data at scale factor 10, StackOverflow downloads the ~12 GB "math" tier, and JOB downloads the IMDB snapshot; the two non-TPC schemas convert to Parquet once and cache the result behind a .success marker. There is no scale-factor knob: each schema runs at a single fixed size set in code. The vendored queries are curated to pass on both engines and stay under ~5 s each at that scale, which keeps the nightly query wall-clock time around 22 minutes.

Query sample and the fuzzer to come

The 500 vendored queries are a deliberately small, fixed, hand-verified slice of SQLStorm's ~62k-query corpus, pinned at a known SHA and curated to run deterministically and cheaply enough for nightly. That fixed shape is the near-term step on purpose. The longer-term goal is a SQLStorm fuzzer that samples or regenerates from the full corpus on each run to surface Vortex-vs-Parquet correctness and performance divergences across a far wider query space than a frozen sample can reach. That fuzzer is explicitly not this PR — this lands only the fixed sample plus the data-acquisition and harness plumbing it needs, and is the foundation the fuzzer would build on later.

Testing

The 500-query suite runs only in nightly, so per-PR CI doesn't exercise it. It was validated manually by running all four schemas in strict mode (the harness aborts on the first failing query) on both engines across Parquet and Vortex, with 125/125 queries passing for each schema. The added unit tests run in the normal workspace test job: data-directory resolution per schema, the table-name drift guards (registration vs data-gen, and DDL vs COPY), and the StackOverflow tier pin.
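The DDL-vs-COPY drift guard mentioned above can be pictured roughly as follows. This is a minimal sketch, not the PR's test: `ddl_table_names` is a hypothetical helper, the token-based parsing is an assumption, and it deliberately ignores forms such as `CREATE TABLE IF NOT EXISTS`.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: collect the lowercase table names that follow
/// `CREATE TABLE` in a DDL script. Token-based, so it only understands
/// the plain `CREATE TABLE name (` and `CREATE TABLE name(` forms.
pub fn ddl_table_names(ddl: &str) -> BTreeSet<String> {
    let tokens: Vec<&str> = ddl.split_whitespace().collect();
    let mut names = BTreeSet::new();
    for w in tokens.windows(3) {
        if w[0].eq_ignore_ascii_case("create") && w[1].eq_ignore_ascii_case("table") {
            // Cut a trailing "(col ..." fragment and any surrounding quotes.
            let name = w[2].split('(').next().unwrap_or("").trim_matches('"');
            if !name.is_empty() {
                names.insert(name.to_ascii_lowercase());
            }
        }
    }
    names
}
```

A guard test would then compare this set against the list of tables the COPY step iterates over, so a renamed or missing table fails in the workspace test job rather than at nightly data-gen time.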

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

Implements Tasks 2 & 3 of the SQLStorm benchmark.

- sqlstorm/sqlstorm_benchmark.rs: SqlstormBenchmark implementing Benchmark, parameterized by SqlstormOrigin. Mirrors TpcDsBenchmark; TPC-H/DS origins reuse canonical SF=1 paths; StackOverflow/JOB get sqlstorm-<origin> dirs.
- sqlstorm/data.rs: table_names() as single source of truth per origin; table_specs() delegates here; async data-gen stubs bail for so/job.
- sqlstorm/mod.rs: re-enable sqlstorm_benchmark module and re-export; add FromStr impl and from_name() helper to SqlstormOrigin.
- datasets/mod.rs: BenchmarkDataset::Sqlstorm { origin } variant with name(), Display, and tables() arms delegating to data::table_names.
- lib.rs: BenchmarkArg::Sqlstorm, imports, create_benchmark arm reading --opt origin=<name> (default TpcH).
- v3.rs: benchmark_dataset_dims arm for Sqlstorm (origin -> dataset_variant) and matching test case.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
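To illustrate how `--opt origin=<name>` could resolve through `FromStr` with a `TpcH` default, here is a minimal sketch. Only the `SqlstormOrigin` type name and the `TpcH` variant appear in this PR text; the other variant names, the `String` error type, the case-insensitive matching, and the `origin_from_opt` helper are assumptions made for illustration.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the PR's origin enum. Only `TpcH` is named in
/// the source; the remaining variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SqlstormOrigin {
    StackOverflow,
    Job,
    #[default]
    TpcH,
    TpcDs,
}

impl FromStr for SqlstormOrigin {
    type Err = String;

    /// Parse the schema names used in the PR summary, ignoring ASCII case.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "stackoverflow" => Ok(Self::StackOverflow),
            "job" => Ok(Self::Job),
            "tpch" => Ok(Self::TpcH),
            "tpcds" => Ok(Self::TpcDs),
            other => Err(format!("unknown sqlstorm origin: {other}")),
        }
    }
}

/// Hypothetical helper: resolve an optional `origin=<name>` value,
/// falling back to the default origin when the option is absent.
pub fn origin_from_opt(opt: Option<&str>) -> Result<SqlstormOrigin, String> {
    match opt {
        Some(name) => name.parse(),
        None => Ok(SqlstormOrigin::default()),
    }
}
```

Routing the option through `FromStr` keeps one parsing path for every caller.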

Replaces the stub in `sqlstorm/data.rs` with a full async `generate_stackoverflow` implementation that:

- Downloads the upstream schema DDL and ~1 GB gzip tarball from `db.in.tum.de/~schmidt/data/` using the shared `download_data` helper (idempotent, progress bar, retry).
- Shells out to `tar -xzf` to extract the 13 camelCase CSV files.
- Locates the CSV directory (flat or single-subdirectory archive layouts).
- Builds and runs a single DuckDB script that reads the schema DDL, COPYs each headerless CSV into a typed table, then COPYs each table to a Parquet shard with all column names lowercased, mirroring the Appian benchmark's identifier-normalization approach so DataFusion's `enable_ident_normalization=true` resolves queries correctly.
- Guards idempotency: skips all work if all 13 `parquet/*.parquet` shards are present.
- Only runs for `file://` data URLs; remote data directories are assumed to already contain the shards.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
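The "flat or single-subdirectory" lookup can be sketched as below. The PR does have a function called `locate_csv_dir`, but its actual signature and behavior are not shown here, so everything in this block (the `io::Result` return type, the `.csv` extension check, the error on ambiguous layouts) is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if `dir` directly contains at least one `*.csv` file.
fn has_csv(dir: &Path) -> io::Result<bool> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_csv = path
            .extension()
            .is_some_and(|ext| ext.eq_ignore_ascii_case("csv"));
        if path.is_file() && is_csv {
            return Ok(true);
        }
    }
    Ok(false)
}

/// Sketch: return `base` if the archive extracted flat, or its only
/// subdirectory if the archive wrapped the CSVs in one directory.
pub fn locate_csv_dir(base: &Path) -> io::Result<PathBuf> {
    if has_csv(base)? {
        return Ok(base.to_path_buf());
    }
    let mut subdirs = Vec::new();
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        if path.is_dir() {
            subdirs.push(path);
        }
    }
    if let [only] = subdirs.as_slice() {
        if has_csv(only)? {
            return Ok(only.clone());
        }
    }
    Err(io::Error::new(
        io::ErrorKind::NotFound,
        format!("no CSV files found under {}", base.display()),
    ))
}
```

Failing loudly on any other layout seems preferable to guessing, since a wrong directory would otherwise surface later as an empty table.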

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

Selected from SQLStorm v1.0 (pinned SHA b3bb0b9) by running candidate queries through both DuckDB and DataFusion 53 over the origin Parquet, keeping only those that execute on both (refill-on-failure). Provenance in queries.csv. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

Adds 4 per-origin sqlstorm entries to the nightly matrix and an additive --opt origin passthrough in the reusable sql-benchmarks workflow. Not added to the per-PR default matrix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

Selected from SQLStorm v1.0 (pinned SHA b3bb0b9). Each vendored query runs on BOTH DuckDB (CLI) and DataFusion (the real datafusion-bench harness) within SQLStorm's ~10s per-query budget. DataFusion is checked via the harness, not datafusion-cli, whose extra subquery decorrelation accepted EXISTS queries the harness cannot physically plan. queries.csv is a complete log of every tested (query, engine): status is works/error/timeout/crash with the failure reason, so future workers do not re-test or re-add known-bad queries as the corpus scales. 500 work on both engines; the rest are recorded failures (e.g. 26 DataFusion timeouts >10s, plus unsupported-SQL errors). DuckDB is tried first; DataFusion is only tried when DuckDB passes, to bound selection runtime. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
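The "DuckDB first, DataFusion only if DuckDB passes" ordering and the four recorded statuses can be sketched like this. The `Status`, `Outcome`, and `evaluate` names are hypothetical, and the engine runners are passed in as closures because the real invocation (CLI and bench harness) is not shown in this PR text.

```rust
/// The four statuses the commit message lists for each (query, engine) row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Works,
    Error(String),
    Timeout,
    Crash,
}

/// One hypothetical log row: which engine ran which query, and the result.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Outcome {
    pub query_id: u32,
    pub engine: &'static str,
    pub status: Status,
}

/// Run DuckDB first and skip DataFusion when DuckDB does not pass, which
/// bounds selection runtime. Returns whether the query should be kept,
/// plus the rows to log.
pub fn evaluate(
    query_id: u32,
    run_duckdb: impl FnOnce(u32) -> Status,
    run_datafusion: impl FnOnce(u32) -> Status,
) -> (bool, Vec<Outcome>) {
    let mut log = Vec::new();

    let duck = run_duckdb(query_id);
    let duck_ok = duck == Status::Works;
    log.push(Outcome { query_id, engine: "duckdb", status: duck });
    if !duck_ok {
        return (false, log);
    }

    let df = run_datafusion(query_id);
    let df_ok = df == Status::Works;
    log.push(Outcome { query_id, engine: "datafusion", status: df });
    (df_ok, log)
}
```

A query is kept only when both rows are `Works`; a DuckDB failure produces a single log row and never invokes the second runner.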

Drop dead row_counts.rs placeholder (returned None for every origin; SQLStorm query IDs are too sparse for a Vec<usize> indexed by query_idx, and the Benchmark trait default of None already covers this). Drop SqlstormOrigin::all() and the clap::ValueEnum derive on SqlstormOrigin - both unused. The CLI exposes the suite via BenchmarkArg::Sqlstorm; the origin itself is read via Opts::get_as::<SqlstormOrigin> which goes through FromStr. Drop the now-redundant expected_row_counts override on SqlstormBenchmark and update the stale doc comments that referenced a stackoverflow.dbschema.json (the upstream stackoverflow_schema.sql DDL is the actual source) and a "later task" for JOB tables (now inlined as JOB_DDL). Swap tpch/5862.sql for tpch/1261.sql: 5862 surfaces a DuckDB <-> Vortex bridge bug that the original DuckDB-CLI + DataFusion-harness selection could not see. Its queries.csv rows are removed (not annotated) so a future full-coverage selection pass will re-evaluate 5862 from scratch once the bridge bug is addressed -- queries.csv is an add-this / known-to-fail gate for future selection passes, not a permanent audit log, and 5862 is not "known-to-fail" in the SQL-compat sense. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

…_csv_dir StackOverflow DDL was downloaded from the upstream URL at data-gen time and then filtered to strip ALTER TABLE FOREIGN KEY statements that DuckDB rejects. Inline it as STACKOVERFLOW_DDL alongside the existing JOB_DDL: drops one network dep, the line-based FK filter, the SCHEMA_URL const, and the anyhow::Result return on build_duckdb_script (no file read remaining). Column types and NOT NULL constraints are preserved verbatim; inline references / primary key declarations are stripped since they are not enforced by COPY. JOB extraction was passing base_dir directly to build_job_duckdb_script, assuming the upstream imdb.tzst always lays CSVs flat. Route it through locate_csv_dir (which generate_stackoverflow already uses) so both origins handle a possible wrapping subdirectory consistently. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

Collapse the near-identical generate_stackoverflow/generate_job paths into a single generate_origin(&OriginData) driver parameterized by a per-origin recipe. Bake lowercase column names into each DDL so a plain `SELECT *` exports lowercase Parquet, dropping the TABLE_COLUMNS/build_projection machinery (~90 lines). Unify tar.gz/tzst extraction behind extract_archive + an Archive enum, and switch data-gen idempotency to a `.success` sentinel. Move StackOverflow/JOB data under a shared `sqlstorm/<origin>/` dir (mirroring the vendored-query layout) and reuse the crate `DEFAULT_SCALE_FACTOR` const instead of a literal "1.0". Add tests guarding the two table-name invariants that otherwise surface only at nightly data-gen time: tables<->table_names() (registration vs gen) and DDL CREATE TABLE names <-> COPY tables. Also add data-url layout / remote-override tests for the per-origin path resolution. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
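The `.success` sentinel pattern mentioned above can be sketched as follows. This is an illustrative outline under assumptions, not the PR's code: `generate_once` and its closure-based signature are hypothetical, and the real driver is async.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Marker file name taken from the commit message.
const SUCCESS_MARKER: &str = ".success";

/// Hypothetical helper: run `generate` only if the marker is absent, and
/// write the marker only after `generate` succeeds. Returns `true` when
/// work was done and `false` when it was skipped.
pub fn generate_once<F>(out_dir: &Path, generate: F) -> io::Result<bool>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let marker = out_dir.join(SUCCESS_MARKER);
    if marker.exists() {
        return Ok(false);
    }
    fs::create_dir_all(out_dir)?;
    generate(out_dir)?;
    // Written last, so an interrupted or failed run leaves no marker
    // and the next run regenerates from scratch.
    fs::write(&marker, b"")?;
    Ok(true)
}
```

Compared with checking for every expected Parquet shard, a single marker written at the end also covers the case where a shard exists but was only partially written.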

Document layout, the by-hand refresh procedure (verify candidates against the bench's own DataFusion SessionContext, not datafusion-cli), the pinned upstream SHA, and how to run each origin locally. Remove queries.csv: the current phase is performance over the vendored 500-query sample, not SQL-compatibility completeness, so the machine-readable pass/fail log is no longer carried in-tree; a future full-corpus selection pass re-evaluates candidates from scratch. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

The earlier long-runner cleanup picked the lowest-id passing candidate (refill `sort -n`), biasing the replacements toward low ids (job 5..156, stackoverflow/tpcds both id 3). Re-pick those slots with a seeded-random shuffle of the corpus plus a short-runtime cap (<=2.0s/engine on parquet+vortex, matching the kept set's median 0.40s / max 1.63s envelope): 30 job ids now span 111..34217, stackoverflow id 3 -> 33961, tpcds id 3 -> 7155. Each new query was confirmed to pass DuckDB and DataFusion; all four origins were re-validated end-to-end strict (125/125 per origin, both engines, parquet+vortex). Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
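A seeded-random re-pick of this kind can be sketched without external crates. The commit does not say which generator or seed was used, so the SplitMix64 generator, the Fisher-Yates shuffle, and the `pick_replacements` helper below are all assumptions; the `passes` closure stands in for "runs on both engines within the runtime cap".

```rust
/// Small deterministic PRNG (SplitMix64), used here only so the sketch
/// needs no external crate. The PR's actual generator is not specified.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle driven by the seeded generator. The modulo step
/// has a negligible bias for corpus-sized inputs.
pub fn seeded_shuffle(ids: &mut [u32], seed: u64) {
    let mut rng = SplitMix64(seed);
    for i in (1..ids.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        ids.swap(i, j);
    }
}

/// Hypothetical selection step: shuffle the candidate ids, then take the
/// first `slots` that satisfy `passes`.
pub fn pick_replacements(
    mut candidates: Vec<u32>,
    seed: u64,
    slots: usize,
    passes: impl Fn(u32) -> bool,
) -> Vec<u32> {
    seeded_shuffle(&mut candidates, seed);
    candidates
        .into_iter()
        .filter(|&id| passes(id))
        .take(slots)
        .collect()
}
```

Shuffling before filtering removes the low-id bias that `sort -n` introduced, while the fixed seed keeps the pick reproducible.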

Add a "Data size (fixed scale)" section to the sqlstorm README spelling out the fixed per-origin sizes (TPC-H/TPC-DS at SF1, StackOverflow `dba` ~1 GB, JOB the fixed IMDB snapshot), that `--opt scale-factor` is silently ignored, and that this mirrors upstream (OLAPBench selects size per origin; there is no uniform scale knob). Add a matching comment at the benchmark factory noting the opt is intentionally not read. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

Add a fixed SQLSTORM_TPC_SCALE_FACTOR const (10.0) driving the TPC-H/TPC-DS data paths and delegated generation, replacing the crate DEFAULT_SCALE_FACTOR (now reverted to private, since sqlstorm no longer imports it). SQLStorm still has no user-facing scale factor; this just moves the fixed point to SF10. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

Point the StackOverflow OriginData at stackoverflow_math.tar.gz (~12 GB) instead of the dba (~1 GB) tarball; identical 13-table schema, more rows. Add a guard test pinning the tier, and fix the now-stale factory comment. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

…ath CSVs The `math` tier's large free-text columns (Posts.body, PostHistory.text, …) contain rows whose embedded quotes don't strictly comply with RFC-4180, which makes DuckDB's CSV dialect sniffer fail outright (it could parse the smaller `dba` tier). Pin the dialect explicitly and parse leniently via extra_copy_opts: `AUTO_DETECT false, QUOTE '"', ESCAPE '"', strict_mode false, ignore_errors true`. Verified end-to-end: all 13 shards generate, 39.6M rows total. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
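To show where the pinned dialect options end up, here is a minimal sketch that assembles a DuckDB `COPY ... FROM` statement. The option string is copied from the commit message; the `copy_csv_statement` helper, the `FORMAT csv, HEADER false` prefix, and the path quoting are assumptions about how the script is built (the CSVs are described elsewhere in this PR as headerless).

```rust
/// Lenient CSV options quoted in the commit message.
pub const LENIENT_CSV_OPTS: &str =
    r#"AUTO_DETECT false, QUOTE '"', ESCAPE '"', strict_mode false, ignore_errors true"#;

/// Hypothetical helper: build one `COPY <table> FROM '<csv>' (...)` statement,
/// appending any per-origin extra options after the common ones.
pub fn copy_csv_statement(table: &str, csv_path: &str, extra_copy_opts: &str) -> String {
    // Double any single quotes so the path stays a valid SQL string literal.
    let path = csv_path.replace('\'', "''");
    let mut opts = String::from("FORMAT csv, HEADER false");
    if !extra_copy_opts.is_empty() {
        opts.push_str(", ");
        opts.push_str(extra_copy_opts);
    }
    format!("COPY {table} FROM '{path}' ({opts});")
}
```

Note that `ignore_errors true` skips rows that fail to parse, so the row count reported after generation is the number of rows that survived lenient parsing.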

…w scale At SF10 (tpch/tpcds) and the StackOverflow math tier, some queries curated to be short at SF1/dba now exceed budget. Re-curate each origin: drop queries that fail or exceed ~5s/engine (parquet+vortex, 1 iter) at the new scale, and refill to 125 with seeded-random short-at-scale candidates from the pinned corpus, verified on both DuckDB and DataFusion. Swaps: tpch 29, tpcds 4, stackoverflow 15. JOB unchanged. ~80% of each origin's prior set survives, so the query mix stays close. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

Update the README for the new fixed scales: TPC-H/TPC-DS now generate dedicated SF10 datasets (no longer reuse the SF1 data), StackOverflow uses the math tier. Refresh the Data size table with measured row counts (stackoverflow 40M / job 74M / tpch 87M / tpcds 192M, all within one order of magnitude), note the scale is set in code (SQLSTORM_TPC_SCALE_FACTOR / the STACKOVERFLOW recipe) not a runtime flag, and that the refresh procedure verifies candidates short (<=5s/engine) at scale. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
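The "within one order of magnitude" claim can be checked with a few lines of arithmetic. The row counts below are the rounded figures from this commit message; the `row_spread` helper is hypothetical.

```rust
/// Rounded row counts quoted in the commit message.
pub const README_ROWS: [(&str, u64); 4] = [
    ("stackoverflow", 40_000_000),
    ("job", 74_000_000),
    ("tpch", 87_000_000),
    ("tpcds", 192_000_000),
];

/// Ratio of the largest to the smallest dataset, or `None` if the input
/// is empty or the smallest count is zero.
pub fn row_spread(rows: &[(&str, u64)]) -> Option<f64> {
    let min = rows.iter().map(|&(_, n)| n).min()?;
    let max = rows.iter().map(|&(_, n)| n).max()?;
    if min == 0 {
        return None;
    }
    Some(max as f64 / min as f64)
}
```

With these figures the spread is 192M / 40M = 4.8, comfortably below the factor of 10 that one order of magnitude allows.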

Resolve conflicts in the benchmark registration touchpoints where develop added the Appian benchmark in the same slots this branch added SQLStorm: keep both in the BenchmarkDataset enum + name/Display/tables arms (datasets/mod.rs), the BenchmarkArg enum + create_benchmark match (lib.rs), the v3 dataset-dims mapping, and the orchestrator README benchmark list. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

The public module doc and `table_names` doc linked to `OriginData::tables`, a private field, which `cargo doc` rejects under `-D warnings` (rustdoc::private_intra_doc_links). De-link both to plain code spans; the field-level docs that link the same private items are private-context and stay. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>

mprammer and others added 20 commits May 27, 2026 14:33

bench: add sqlstorm module skeleton and query loader

2039b99

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

bench-orchestrator: register sqlstorm benchmark

979bbcd

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

bench: make sqlstorm pattern origin-aware for tpch shards

285d2d1

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

bench: self-contained sqlstorm data-gen (job + tpch/tpcds delegation)

f950866

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>

mprammer requested review from AdamGS and robert3005 May 29, 2026 19:44

mprammer added the changelog/feature (A new feature) and changelog/ci labels May 29, 2026

mprammer wants to merge 22 commits into develop from mp/benchmark-sqlstorm
