Add the SQLStorm query suite to vortex-bench (nightly-only) #8165

Draft: mprammer wants to merge 22 commits into develop from mp/benchmark-sqlstorm

Summary

Adds the SQLStorm query suite to vortex-bench as a nightly-only benchmark. SQLStorm is an LLM-generated SQL stress corpus; this vendors a confirmed-working sample of 500 queries — 125 each across four schemas (stackoverflow, job, tpch, tpcds) — and runs them Parquet-vs-Vortex on both DataFusion and DuckDB, structured the same way as the existing TPC-DS benchmark so it needs no runner changes. The point is coverage: these queries exercise joins, subqueries, and aggregation shapes that TPC-H and TPC-DS don't, so they stress Vortex's scan and compute breadth cheaply.

It runs only in the overnight nightly-bench.yml matrix, never the per-PR path. The four datasets are sized to sit within one order of magnitude of each other (40M–192M rows): TPC-H and TPC-DS generate their own data at scale factor 10, StackOverflow downloads the ~12 GB "math" tier, and JOB downloads the IMDB snapshot; the two non-TPC schemas convert to Parquet once and cache behind a .success marker. There is no scale-factor knob: each schema runs at a single fixed size set in code. The vendored queries are curated to pass on both engines and stay under ~5 s each at that scale, keeping the total nightly query wall-clock time around 22 minutes.

Query sample and the fuzzer to come

The 500 vendored queries are a deliberately small, fixed, hand-verified slice of SQLStorm's ~62k-query corpus, pinned at a known SHA and curated to run deterministically and cheaply enough for nightly. That fixed shape is the near-term step on purpose. The longer-term goal is a SQLStorm fuzzer that samples or regenerates from the full corpus on each run to surface Vortex-vs-Parquet correctness and performance divergences across a far wider query space than a frozen sample can reach. That fuzzer is explicitly not this PR — this lands only the fixed sample plus the data-acquisition and harness plumbing it needs, and is the foundation the fuzzer would build on later.

Testing

The 500-query suite runs only in nightly, so per-PR CI doesn't exercise it. It was validated manually by running all four schemas strict (the harness aborts on the first failing query) on both engines across Parquet and Vortex, with 125/125 queries passing for each schema. The added unit tests run in the normal workspace test job: data-directory resolution per schema, the table-name drift guards (registration vs data-gen, and DDL vs COPY), and the StackOverflow tier pin.

🤖 Generated with Claude Code

mprammer and others added 20 commits May 27, 2026 14:33
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Implements Tasks 2 & 3 of the SQLStorm benchmark.

- sqlstorm/sqlstorm_benchmark.rs: SqlstormBenchmark implementing Benchmark,
  parameterized by SqlstormOrigin. Mirrors TpcDsBenchmark; TPC-H/DS origins
  reuse canonical SF=1 paths; StackOverflow/JOB get sqlstorm-<origin> dirs.
- sqlstorm/data.rs: table_names() as single source of truth per origin;
  table_specs() delegates here; async data-gen stubs bail for so/job.
- sqlstorm/mod.rs: re-enable sqlstorm_benchmark module and re-export;
  add FromStr impl and from_name() helper to SqlstormOrigin.
- datasets/mod.rs: BenchmarkDataset::Sqlstorm { origin } variant with
  name(), Display, and tables() arms delegating to data::table_names.
- lib.rs: BenchmarkArg::Sqlstorm, imports, create_benchmark arm reading
  --opt origin=<name> (default TpcH).
- v3.rs: benchmark_dataset_dims arm for Sqlstorm (origin -> dataset_variant)
  and matching test case.
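
For reviewers who want the shape of the origin parsing without opening the diff, here is a minimal sketch of an origin enum with `from_name()`, `FromStr`, and a TpcH default. The variant names, the lowercase spellings, and the `origin_or_default` helper are assumptions for illustration; the real `SqlstormOrigin` may differ.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the origin enum; not the vortex-bench definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SqlstormOrigin {
    StackOverflow,
    Job,
    TpcH,
    TpcDs,
}

impl SqlstormOrigin {
    /// Lowercase name, assumed to match the vendored query directory names.
    pub fn name(self) -> &'static str {
        match self {
            Self::StackOverflow => "stackoverflow",
            Self::Job => "job",
            Self::TpcH => "tpch",
            Self::TpcDs => "tpcds",
        }
    }

    /// Case-insensitive lookup; returns `None` for an unknown name.
    pub fn from_name(name: &str) -> Option<Self> {
        match name.to_ascii_lowercase().as_str() {
            "stackoverflow" => Some(Self::StackOverflow),
            "job" => Some(Self::Job),
            "tpch" => Some(Self::TpcH),
            "tpcds" => Some(Self::TpcDs),
            _ => None,
        }
    }
}

impl FromStr for SqlstormOrigin {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Self::from_name(s).ok_or_else(|| format!("unknown SQLStorm origin: {s}"))
    }
}

/// Hypothetical helper: resolve an optional `origin=<name>` value,
/// falling back to TpcH when the option is absent.
pub fn origin_or_default(opt: Option<&str>) -> Result<SqlstormOrigin, String> {
    match opt {
        Some(name) => name.parse(),
        None => Ok(SqlstormOrigin::TpcH),
    }
}
```

Routing the option through `FromStr` keeps a single parsing path, so an unknown origin fails with one error message regardless of where the string came from.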

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Replaces the stub in `sqlstorm/data.rs` with a full async
`generate_stackoverflow` implementation that:

- Downloads the upstream schema DDL and ~1 GB gzip tarball from
  `db.in.tum.de/~schmidt/data/` using the shared `download_data` helper
  (idempotent, progress bar, retry).
- Shells out to `tar -xzf` to extract the 13 camelCase CSV files.
- Locates the CSV directory (flat or single-subdirectory archive layouts).
- Builds and runs a single DuckDB script that reads the schema DDL,
  COPYs each headerless CSV into a typed table, then COPYs each table to
  a Parquet shard with all column names lowercased — mirroring the Appian
  benchmark's identifier-normalization approach so DataFusion's
  `enable_ident_normalization=true` resolves queries correctly.
- Guards idempotency: skips all work if all 13 `parquet/*.parquet` shards
  are present.
- Only runs for `file://` data URLs; remote data directories are assumed
  to already contain the shards.
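
The script-building step above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `sqlstorm/data.rs`: the `Table` struct, the `sketch_duckdb_script` name, the `<table>.csv` file naming, and the lowercased shard file names are all hypothetical.

```rust
/// Hypothetical table description: upstream table name plus its
/// camelCase column names.
pub struct Table<'a> {
    pub name: &'a str,
    pub columns: &'a [&'a str],
}

/// Build one DuckDB script: schema DDL first, then for every table a COPY
/// from the headerless CSV and a COPY to Parquet with lowercased columns.
pub fn sketch_duckdb_script(
    ddl: &str,
    csv_dir: &str,
    parquet_dir: &str,
    tables: &[Table<'_>],
) -> String {
    let mut script = String::new();
    script.push_str(ddl.trim_end());
    script.push('\n');

    for table in tables {
        let name = table.name;
        let shard = name.to_lowercase();

        // Load the headerless CSV into the typed table created by the DDL.
        // The `<name>.csv` file naming is an assumption.
        script.push_str(&format!(
            "COPY \"{name}\" FROM '{csv_dir}/{name}.csv' (FORMAT CSV, HEADER false);\n"
        ));

        // Alias every column to its lowercase form so the Parquet schema
        // carries lowercase identifiers.
        let projection = table
            .columns
            .iter()
            .map(|c| format!("\"{c}\" AS \"{}\"", c.to_lowercase()))
            .collect::<Vec<_>>()
            .join(", ");

        script.push_str(&format!(
            "COPY (SELECT {projection} FROM \"{name}\") TO '{parquet_dir}/{shard}.parquet' (FORMAT PARQUET);\n"
        ));
    }
    script
}
```

Emitting a single script keeps the DuckDB invocation to one process per origin, which is presumably why the commit builds the script as one string rather than issuing statements one at a time.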

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Selected from SQLStorm v1.0 (pinned SHA b3bb0b9) by running candidate queries
through both DuckDB and DataFusion 53 over the origin Parquet, keeping only
those that execute on both (refill-on-failure). Provenance in queries.csv.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Adds 4 per-origin sqlstorm entries to the nightly matrix and an additive
--opt origin passthrough in the reusable sql-benchmarks workflow. Not added
to the per-PR default matrix.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Selected from SQLStorm v1.0 (pinned SHA b3bb0b9). Each vendored query runs on
BOTH DuckDB (CLI) and DataFusion (the real datafusion-bench harness) within
SQLStorm's ~10s per-query budget. DataFusion is checked via the harness, not
datafusion-cli, whose extra subquery decorrelation accepted EXISTS queries the
harness cannot physically plan.

queries.csv is a complete log of every tested (query, engine): status is
works/error/timeout/crash with the failure reason, so future workers do not
re-test or re-add known-bad queries as the corpus scales. 500 works on both
engines; the rest are recorded failures (e.g. 26 DataFusion timeouts >10s, plus
unsupported-SQL errors). DuckDB is tried first; DataFusion is only tried when
DuckDB passes, to bound selection runtime.
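
The selection order described above (DuckDB first, DataFusion only when DuckDB passes, every attempt logged) can be sketched like this. The `Status` and `Record` types and the closure-based engine runners are hypothetical stand-ins for the real tooling.

```rust
/// Outcome of one (query, engine) attempt, mirroring the statuses named above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Works,
    Error(String),
    Timeout,
    Crash,
}

/// One log row; the field layout is an assumption, not the queries.csv schema.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    pub query: String,
    pub engine: &'static str,
    pub status: Status,
}

/// Try DuckDB first and run DataFusion only if DuckDB succeeded, which
/// bounds selection runtime. Every attempt that ran is logged.
pub fn evaluate<F, G>(query: &str, run_duckdb: F, run_datafusion: G) -> Vec<Record>
where
    F: FnOnce(&str) -> Status,
    G: FnOnce(&str) -> Status,
{
    let mut log = Vec::new();

    let duck = run_duckdb(query);
    let duck_ok = duck == Status::Works;
    log.push(Record { query: query.to_string(), engine: "duckdb", status: duck });

    if duck_ok {
        let fusion = run_datafusion(query);
        log.push(Record { query: query.to_string(), engine: "datafusion", status: fusion });
    }
    log
}

/// A query is vendorable only when both engines ran and both report `Works`.
pub fn works_on_both(log: &[Record]) -> bool {
    log.len() == 2 && log.iter().all(|r| r.status == Status::Works)
}
```

Because a DuckDB failure short-circuits, the log for that query holds a single row, which is enough for a later pass to skip it without re-testing.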

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Drop dead row_counts.rs placeholder (returned None for every origin;
SQLStorm query IDs are too sparse for a Vec<usize> indexed by query_idx,
and the Benchmark trait default of None already covers this).

Drop SqlstormOrigin::all() and the clap::ValueEnum derive on SqlstormOrigin
- both unused. The CLI exposes the suite via BenchmarkArg::Sqlstorm; the
origin itself is read via Opts::get_as::<SqlstormOrigin> which goes through
FromStr.

Drop the now-redundant expected_row_counts override on SqlstormBenchmark
and update the stale doc comments that referenced a
stackoverflow.dbschema.json (the upstream stackoverflow_schema.sql DDL is
the actual source) and a "later task" for JOB tables (now inlined as
JOB_DDL).

Swap tpch/5862.sql for tpch/1261.sql: 5862 surfaces a DuckDB <-> Vortex
bridge bug that the original DuckDB-CLI + DataFusion-harness selection
could not see. Its queries.csv rows are removed (not annotated) so a
future full-coverage selection pass will re-evaluate 5862 from scratch
once the bridge bug is addressed -- queries.csv is an add-this /
known-to-fail gate for future selection passes, not a permanent audit
log, and 5862 is not "known-to-fail" in the SQL-compat sense.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
…_csv_dir

StackOverflow DDL was downloaded from the upstream URL at data-gen time
and then filtered to strip ALTER TABLE FOREIGN KEY statements that
DuckDB rejects. Inline it as STACKOVERFLOW_DDL alongside the existing
JOB_DDL: drops one network dep, the line-based FK filter, the SCHEMA_URL
const, and the anyhow::Result return on build_duckdb_script (no file
read remaining). Column types and NOT NULL constraints are preserved
verbatim; inline references / primary key declarations are stripped
since they are not enforced by COPY.

JOB extraction was passing base_dir directly to build_job_duckdb_script,
assuming the upstream imdb.tzst always lays CSVs flat. Route it through
locate_csv_dir (which generate_stackoverflow already uses) so both
origins handle a possible wrapping subdirectory consistently.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Collapse the near-identical generate_stackoverflow/generate_job paths into a
single generate_origin(&OriginData) driver parameterized by a per-origin recipe.
Bake lowercase column names into each DDL so a plain `SELECT *` exports lowercase
Parquet, dropping the TABLE_COLUMNS/build_projection machinery (~90 lines). Unify
tar.gz/tzst extraction behind extract_archive + an Archive enum, and switch
data-gen idempotency to a `.success` sentinel.
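
As a rough illustration of the `.success` sentinel pattern, the sketch below runs a generation step at most once per directory. The `generate_once` name and its closure signature are hypothetical; only the marker file name comes from the commit.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Marker file name taken from the commit message.
const SUCCESS_MARKER: &str = ".success";

/// Run `generate` unless the marker already exists in `dir`.
/// Returns `Ok(true)` if work was performed and `Ok(false)` if it was skipped.
pub fn generate_once<F>(dir: &Path, generate: F) -> io::Result<bool>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let marker = dir.join(SUCCESS_MARKER);
    if marker.exists() {
        return Ok(false);
    }

    fs::create_dir_all(dir)?;
    generate(dir)?;

    // The marker is written last, so a failure mid-generation leaves no
    // marker behind and the next run retries from scratch.
    fs::write(&marker, b"")?;
    Ok(true)
}
```

Compared with checking that every expected shard exists, a single marker written at the end also covers the case where a shard file exists but was only partially written.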

Move StackOverflow/JOB data under a shared `sqlstorm/<origin>/` dir (mirroring the
vendored-query layout) and reuse the crate `DEFAULT_SCALE_FACTOR` const instead of
a literal "1.0".

Add tests guarding the two table-name invariants that otherwise surface only at
nightly data-gen time: tables<->table_names() (registration vs gen) and DDL
CREATE TABLE names <-> COPY tables. Also add data-url layout / remote-override
tests for the per-origin path resolution.
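
A drift guard of the DDL-vs-COPY kind could look roughly like the sketch below. It is deliberately simplified: it assumes unquoted table names and no `IF NOT EXISTS` clause, and both function names are hypothetical rather than the helpers in this PR.

```rust
use std::collections::BTreeSet;

/// Collect lowercased table names from `CREATE TABLE <name>` statements.
/// Simplifying assumptions: names are unquoted and there is no
/// `IF NOT EXISTS` clause.
pub fn create_table_names(ddl: &str) -> BTreeSet<String> {
    const KEYWORD: &str = "create table";
    let lower = ddl.to_ascii_lowercase();
    let mut rest = lower.as_str();
    let mut names = BTreeSet::new();

    while let Some(pos) = rest.find(KEYWORD) {
        rest = &rest[pos + KEYWORD.len()..];
        let name: String = rest
            .trim_start()
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
            .collect();
        if !name.is_empty() {
            names.insert(name);
        }
    }
    names
}

/// Names present in exactly one of the two lists, in sorted order.
/// An empty result means the DDL and the COPY list agree.
pub fn table_drift(ddl: &str, copy_tables: &[&str]) -> Vec<String> {
    let declared = create_table_names(ddl);
    let copied: BTreeSet<String> = copy_tables
        .iter()
        .map(|t| t.to_ascii_lowercase())
        .collect();
    declared.symmetric_difference(&copied).cloned().collect()
}
```

A unit test can then assert `table_drift(DDL, TABLES).is_empty()` for each origin, so a renamed or missing table fails in the workspace test job instead of at nightly data-gen time.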

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Document layout, the by-hand refresh procedure (verify candidates against the
bench's own DataFusion SessionContext, not datafusion-cli), the pinned upstream
SHA, and how to run each origin locally. Remove queries.csv: the current phase is
performance over the vendored 500-query sample, not SQL-compatibility
completeness, so the machine-readable pass/fail log is no longer carried in-tree;
a future full-corpus selection pass re-evaluates candidates from scratch.

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
The earlier long-runner cleanup picked the lowest-id passing candidate
(refill `sort -n`), biasing the replacements toward low ids (job 5..156,
stackoverflow/tpcds both id 3). Re-pick those slots with a seeded-random shuffle
of the corpus plus a short-runtime cap (<=2.0s/engine on parquet+vortex, matching
the kept set's median 0.40s / max 1.63s envelope): 30 job ids now span 111..34217,
stackoverflow id 3 -> 33961, tpcds id 3 -> 7155.
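
The re-pick described above can be approximated by a deterministic shuffle followed by a runtime filter. The sketch below uses a small SplitMix64 generator so it needs no external crate; the generator choice, the function names, and the `runtime_secs` callback are assumptions, and the actual selection tooling may work differently.

```rust
/// Minimal SplitMix64 generator, used here only to get a reproducible
/// shuffle without external crates.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle driven by the seeded generator. The modulo step has
/// a slight bias, which is acceptable for picking benchmark candidates.
pub fn seeded_shuffle(ids: &mut [u32], seed: u64) {
    let mut rng = SplitMix64::new(seed);
    for i in (1..ids.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        ids.swap(i, j);
    }
}

/// Shuffle the candidates with a fixed seed, then keep the first `slots`
/// ids whose measured runtime is at or below `cap_secs`. A `None` runtime
/// means the query failed or was never measured, so it is skipped.
pub fn pick_replacements(
    mut candidates: Vec<u32>,
    seed: u64,
    slots: usize,
    cap_secs: f64,
    runtime_secs: impl Fn(u32) -> Option<f64>,
) -> Vec<u32> {
    seeded_shuffle(&mut candidates, seed);
    candidates
        .into_iter()
        .filter(|&id| matches!(runtime_secs(id), Some(t) if t <= cap_secs))
        .take(slots)
        .collect()
}
```

Shuffling before filtering is what removes the low-id bias: a sorted refill always reaches the smallest passing ids first, while a seeded shuffle spreads the picks across the corpus and stays reproducible.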

Each new query was confirmed to pass DuckDB and DataFusion; all four origins were
re-validated end-to-end strict (125/125 per origin, both engines, parquet+vortex).

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add a "Data size (fixed scale)" section to the sqlstorm README spelling out the
fixed per-origin sizes (TPC-H/TPC-DS at SF1, StackOverflow `dba` ~1 GB, JOB the
fixed IMDB snapshot), that `--opt scale-factor` is silently ignored, and that
this mirrors upstream (OLAPBench selects size per origin; there is no uniform
scale knob). Add a matching comment at the benchmark factory noting the opt is
intentionally not read.

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add a fixed SQLSTORM_TPC_SCALE_FACTOR const (10.0) driving the TPC-H/TPC-DS
data paths and delegated generation, replacing the crate DEFAULT_SCALE_FACTOR
(now reverted to private, since sqlstorm no longer imports it). SQLStorm still
has no user-facing scale factor; this just moves the fixed point to SF10.

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Point the StackOverflow OriginData at stackoverflow_math.tar.gz (~12 GB) instead
of the dba (~1 GB) tarball; identical 13-table schema, more rows. Add a guard
test pinning the tier, and fix the now-stale factory comment.

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…ath CSVs

The `math` tier's large free-text columns (Posts.body, PostHistory.text, …)
contain rows whose embedded quotes don't strictly comply with RFC-4180, which
makes DuckDB's CSV dialect sniffer fail outright (it could parse the smaller
`dba` tier). Pin the dialect explicitly and parse leniently via extra_copy_opts:
`AUTO_DETECT false, QUOTE '"', ESCAPE '"', strict_mode false, ignore_errors true`.
Verified end-to-end: all 13 shards generate, 39.6M rows total.
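
To show where the pinned dialect ends up, here is a small sketch that appends the options above to a COPY statement. The `copy_from_csv` helper and its base options (`FORMAT CSV, HEADER false`) are assumptions; only the lenient option string is taken from this commit.

```rust
/// Lenient CSV options quoted from the commit message.
pub const LENIENT_CSV_OPTS: &str =
    "AUTO_DETECT false, QUOTE '\"', ESCAPE '\"', strict_mode false, ignore_errors true";

/// Hypothetical helper: build a DuckDB `COPY ... FROM` statement for a
/// headerless CSV, appending any origin-specific extra options.
pub fn copy_from_csv(table: &str, csv_path: &str, extra_copy_opts: Option<&str>) -> String {
    let mut opts = String::from("FORMAT CSV, HEADER false");
    if let Some(extra) = extra_copy_opts {
        opts.push_str(", ");
        opts.push_str(extra);
    }
    format!("COPY {table} FROM '{csv_path}' ({opts});")
}
```

Keeping the extra options per origin means the JOB recipe can presumably keep its own defaults while only StackOverflow opts into lenient parsing. Note that `ignore_errors true` skips rows that fail to parse, so the reported row total reflects what was loaded rather than what the source files contain.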

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…w scale

At SF10 (tpch/tpcds) and the StackOverflow math tier, some queries curated to be
short at SF1/dba now exceed budget. Re-curate each origin: drop queries that fail
or exceed ~5s/engine (parquet+vortex, 1 iter) at the new scale, and refill to 125
with seeded-random short-at-scale candidates from the pinned corpus, verified on
both DuckDB and DataFusion. Swaps: tpch 29, tpcds 4, stackoverflow 15. JOB
unchanged. ~80% of each origin's prior set survives, so the query mix stays close.
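
As a quick arithmetic check on the survival figure, the sketch below computes the share of each origin's prior 125 queries that remains, assuming every swap replaces exactly one prior query. The helper name is hypothetical.

```rust
/// Queries vendored per origin, as stated in the PR summary.
pub const QUERIES_PER_ORIGIN: u32 = 125;

/// Percentage of an origin's prior set that survives after `swaps`
/// replacements, assuming each swap removes exactly one prior query.
pub fn survival_pct(swaps: u32) -> f64 {
    100.0 * f64::from(QUERIES_PER_ORIGIN - swaps) / f64::from(QUERIES_PER_ORIGIN)
}
```

Under that assumption the swap counts above give 76.8% for tpch (29 swaps), 96.8% for tpcds (4), 88.0% for stackoverflow (15), and 100% for JOB, so the ~80% figure reads as a floor set by tpch rather than a typical value.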

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Update the README for the new fixed scales: TPC-H/TPC-DS now generate dedicated
SF10 datasets (no longer reuse the SF1 data), StackOverflow uses the math tier.
Refresh the Data size table with measured row counts (so 40M / job 74M / tpch
87M / tpcds 192M — within one order of magnitude), note the scale is set in code
(SQLSTORM_TPC_SCALE_FACTOR / the STACKOVERFLOW recipe) not a runtime flag, and
that the refresh procedure verifies candidates short (<=5s/engine) at scale.
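
The "within one order of magnitude" claim can be checked mechanically from the measured row counts. The sketch below is illustrative only; the function names are hypothetical and the counts are the rounded figures quoted above.

```rust
/// Ratio of the largest to the smallest row count, or `None` when the
/// slice is empty or contains a zero.
pub fn spread_ratio(rows: &[u64]) -> Option<f64> {
    let min = *rows.iter().min()?;
    let max = *rows.iter().max()?;
    if min == 0 {
        return None;
    }
    Some(max as f64 / min as f64)
}

/// True when the largest dataset is less than 10x the smallest.
pub fn within_one_order_of_magnitude(rows: &[u64]) -> bool {
    matches!(spread_ratio(rows), Some(r) if r < 10.0)
}
```

With the rounded counts (40M, 74M, 87M, 192M) the ratio is 192 / 40 = 4.8, comfortably below 10.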

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
mprammer requested review from AdamGS and robert3005 on May 29, 2026 19:44
Resolve conflicts in the benchmark registration touchpoints where develop added
the Appian benchmark in the same slots this branch added SQLStorm: keep both in
the BenchmarkDataset enum + name/Display/tables arms (datasets/mod.rs), the
BenchmarkArg enum + create_benchmark match (lib.rs), the v3 dataset-dims mapping,
and the orchestrator README benchmark list.

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>
The public module doc and `table_names` doc linked to `OriginData::tables`, a
private field, which `cargo doc` rejects under `-D warnings`
(rustdoc::private_intra_doc_links). De-link both to plain code spans; the
field-level docs that link the same private items are private-context and stay.
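
The de-linking can be illustrated with a small stand-in. The struct shape, constructor, and method below are hypothetical; only the `OriginData::tables` name and the lint name come from the commit.

```rust
/// Hypothetical stand-in for the private recipe struct named in the commit.
pub struct OriginData {
    // Private field. A doc comment on a public item that intra-doc-links to
    // it triggers `rustdoc::private_intra_doc_links`, which becomes an error
    // when rustdoc warnings are denied.
    tables: &'static [&'static str],
}

impl OriginData {
    pub const fn new(tables: &'static [&'static str]) -> Self {
        Self { tables }
    }

    /// Table names for this origin, read from `OriginData::tables`.
    // The reference in the doc line above is a plain code span (backticks
    // only, no square brackets), so rustdoc does not try to resolve it.
    pub fn table_names(&self) -> &'static [&'static str] {
        self.tables
    }
}
```

A plain code span keeps the name visible to readers of the public docs without asking rustdoc to resolve a path that public documentation cannot link to.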

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com>