Skip to content

fix(cypher): bound query_graph execution with timeout + working-set cap (#601)#626

Open
yasku wants to merge 6 commits into
DeusData:mainfrom
yasku:fix/601-query-timeout
Open

fix(cypher): bound query_graph execution with timeout + working-set cap (#601)#626
yasku wants to merge 6 commits into
DeusData:mainfrom
yasku:fix/601-query-timeout

Conversation

@yasku

@yasku yasku commented Jun 25, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes #601query_graph hangs indefinitely on a whole-graph OPTIONAL MATCH, with no execution timeout or graceful degradation.

The reported pattern, on a 90k+ node graph:

MATCH (a)
OPTIONAL MATCH (a)-[:CALLS]->(b)
RETURN a.qualified_name, count(b)

never returns — no partial result, no error. This PR makes the Cypher executor bounded: a query that would otherwise run forever (or exhaust memory) now stops at a wall-clock deadline or a working-set cap and returns a clear, actionable error.


Root cause (traced in src/cypher/cypher.c)

The executor builds and expands query bindings entirely in C, in memory, before projecting results. Three things compound:

  1. Unbounded driving set. scan_pattern_nodes() for an unlabeled MATCH (a) scans up to max_rows * CYP_GROWTH_10 nodes (≈ 1,000,000 with the default ceiling). Every node becomes a binding.
  2. 10× growth allocation per relationship. expand_pattern_rels() does
    malloc((bind_cap * CYP_GROWTH_10 + 1) * sizeof(binding_t)) up front. With a ~93k driving set and a ~1.5 KB binding_t, that is a multi-GB single allocation per relationship, followed by a per-binding edge-expansion grind.
  3. The existing ceiling is post-hoc and mis-targeted. CYPHER_RESULT_CEILING (100k) is only checked after execution completes (cbm_cypher_execute), and count() aggregates the result under 100k — so the ceiling never fires while the process is already wedged. The reporter observed exactly this: "hangs rather than hit the documented 100k-row ceiling."

There was no wall-clock guard anywhere in cypher.c (verified by grep: no clock_gettime / deadline / timeout).


How the fix was researched (provenance)

The approach was validated against the repo's own history and the database it sits on, not assumed:


How it was fixed — three execution guards

All in src/cypher/cypher.c, single-threaded executor (the MCP event loop), so file-static guard state is safe:

  1. Wall-clock timeout. cyp_guard_begin() (called at cbm_cypher_execute entry) reads CBM_CYPHER_TIMEOUT_MS (default 10 000 ms, 0 disables) and arms a CLOCK_MONOTONIC deadline. cyp_deadline_exceeded() is sampled every 1024 iterations inside the hot expansion loop; on trip it drops the partial set and unwinds.
  2. Early working-set cap. In execute_single(), before the bindings array is allocated, an oversized driving set (bind_cap > CYPHER_MAX_BINDINGS / CYP_GROWTH_10) is rejected up front — so a whole-graph MATCH can't even request the multi-GB buffer.
  3. Expansion cap + NULL-check. A matching cap inside expand_pattern_rels() for working sets that only blow up mid-traversal, plus a previously-missing NULL-check on the growth malloc.

On any trip, execute_single stops cleanly and cbm_cypher_execute turns the latched reason into a clear error:

query expansion exceeded the working-set limit — add LIMIT, a directed MATCH, or a WHERE filter
query exceeded the execution time limit — … (raise CBM_CYPHER_TIMEOUT_MS to allow longer)

Ordinary queries — where bind_cap sits at the default row ceiling (100k) — are unaffected; the caps use the same threshold so they never reject a normal query.


Stack / where this lives

Layer Detail
Language C11 (-std=c11 -Wall -Wextra -Werror)
Component Cypher query engine (src/cypher/cypher.c) — translates Cypher → in-memory graph traversal over the SQLite-backed store
Storage SQLite (vendored, in-process); the executor reads edges via cbm_store_find_edges_by_*
Build Makefile.cbm (ASan + UBSan test build; -O2 production binary)
Tests first-party C test harness (tests/test_cypher.c, make -f Makefile.cbm test)
New surface one env var (CBM_CYPHER_TIMEOUT_MS), no new dependencies, no API/schema change

Tests added (tests/test_cypher.c)

  • cypher_exec_guard_caps_huge_driving_set — a large max_rows pushes bind_cap past the cap deterministically (the early guard fires before allocation, so no large graph is needed); asserts the clear "working-set" error.
  • cypher_exec_guard_timeout_trips — builds 3,000 nodes + CALLS edges and runs a whole-graph expansion under CBM_CYPHER_TIMEOUT_MS=1, kept below the cap via max_rows; under the ASan test build the expansion comfortably outlasts 1 ms, so the timeout trips deterministically; asserts the "time" error.
  • cypher_exec_guard_normal_query_unaffected — an ordinary directed MATCH (f:Function)-[:CALLS]->(g) RETURN f.name, count(g) still returns correct rows with no error.

Manual end-to-end reproduction

Indexed a real 13,450-file workspace → 277,759 nodes / 705,478 edges, then ran the issue's query via the CLI:

$ codebase-memory-mcp cli query_graph '{"project":"…","query":"MATCH (a) OPTIONAL MATCH (a)-[:CALLS]->(b) RETURN a.qualified_name, count(b)"}'
query expansion exceeded the working-set limit — add LIMIT, a directed MATCH, or a WHERE filter

Before this PR the same call never returned. Normal query_graph calls against the same index are unchanged.


Validation

Check Command Result
Full test suite make -f Makefile.cbm test 5690 passed, 0 failures
Build (prod) make -f Makefile.cbm cbm ✅ no warnings (-Werror)
Formatting clang-format (changed code) ✅ clean
Static analysis cppcheck w/ CI suppress flags ✅ no new findings (one pre-existing knownConditionTrueFalse at cypher.c:2442/2446, in untouched code)
No-skips policy scripts/check-no-test-skips.sh ✅ OK
DCO scripts/check-dco.sh ✅ signed

Where it was tested — full environment

OS macOS 26.5 (build 25F71)
Kernel Darwin 25.5.0
Architecture arm64 (Apple Silicon)
Chip Apple M1 — 8 cores (4 performance + 4 efficiency)
RAM 8 GB (the constrained-host profile from #580/#601)
Compiler Apple clang 21.0.0 (clang-2100.1.1.101), target arm64-apple-darwin25.5.0
Build system GNU Make 3.81, Makefile.cbm
Git 2.50.1 (Apple Git-155)
Shell zsh
Lint tooling clang-format 22.1.8, cppcheck 2.21.0
Test build ASan + UBSan (make -f Makefile.cbm test)
Repro graph 277,759 nodes / 705,478 edges, indexed from a real 13,450-file workspace

Note on toolchain skew: local clang-format is 22.1.8 while CI pins clang-format-20; the changed code is clean under 22 and the surrounding code is left in the repo's clang-format-20 format, so CI lint is expected to pass. Local cppcheck is 2.21.0 (newer/stricter than CI), which is why one pre-existing finding in untouched code surfaces locally but not in CI.


Checklist

  • Every commit is signed off (git commit -s) — DCO verified via scripts/check-dco.sh
  • Tests pass locally (make -f Makefile.cbm test) — 5690 passed
  • Lint passes — clang-format clean on changed code; cppcheck clean on changed code (only a pre-existing finding in untouched code remains)
  • New behavior is covered by a test (reproduce-first)

Notes

  • The default 10 s budget is a safety net for pathological queries, not a performance gate — it should never trip on a normal query; CBM_CYPHER_TIMEOUT_MS lets operators with very large graphs raise or disable it.
  • Scope is limited to the executor guards in cypher.c plus their tests — no unrelated reformatting. One pre-existing comment-alignment block that clang-format 22 wanted to rewrite was deliberately left in the repo's current (CI clang-format-20) format.
  • I cannot run the project's CodeQL / multi-platform CI locally; those will run on the PR. The reasoning above is grounded, but final confirmation comes from CI.

yasku added 6 commits June 24, 2026 22:30
The startup/query integrity check unlink()'d a project's .db (plus its
-wal/-shm sidecars) on any anomaly, causing total data loss with no
backup, no recovery path, and no visible warning. The trigger is often a
false positive: a WAL leftover after a SIGKILL, a binary variant swap, or
a stale-lock break can all make a recoverable DB look corrupt.

Replace the destructive unlink with a rename to <name>.db.corrupt.<ts>
and emit a prominent stderr message so the user can inspect, recover, or
attach the file to a bug report. Two call sites are fixed:

- mcp.c resolve_store(): the reported path, hit on first query.
- pipeline.c try_incremental_or_delete_db(): only preserves on real
  corruption; a healthy DB rebuilt due to a mode change is still deleted
  so .corrupt files don't accumulate on normal reindexes.

Add a regression test (resolve_store_preserves_corrupt_db_issue557) that
seeds a corrupt projects row, issues a query, and asserts the original
.db is gone while a .corrupt sidecar remains.

Signed-off-by: yasku <aguss.cba@gmail.com>
CodeQL flagged a time-of-check/time-of-use race in
try_incremental_or_delete_db(): the file at db_path was probed with
stat() and then operated on by name (rename/unlink), so the underlying
file could change between the two.

Drop the redundant stat() existence probe and derive existence from a
query-mode open (cbm_store_open_path_query, which never creates the
file): a NULL result means absent/unreadable, so we return without
touching the path. This removes the stat()->rename() dataflow CodeQL
reports and matches resolve_store() in mcp.c, which already uses this
pattern. Behavior is unchanged (absent -> skip, healthy -> reindex,
corrupt -> preserve as .corrupt).

Signed-off-by: yasku <aguss.cba@gmail.com>
Adds pipeline_reindex_preserves_corrupt_db_issue557: indexes a temp repo,
corrupts the resulting DB (numeric root_path trips the integrity check),
re-indexes, and asserts the DB is preserved as <db>.corrupt.<ts> rather
than deleted. Guards both the DeusData#557 data-loss fix and the TOCTOU refactor
of try_incremental_or_delete_db (existence via query-mode open, not stat).

Signed-off-by: yasku <aguss.cba@gmail.com>
manage_adr stores ADRs in the project_summaries table (SQLite store,
unified in DeusData#256). Both dump paths serialize a fresh DB with an empty
project_summaries — the custom B-tree writer writes an empty summaries
btree — so any re-index that rewrites the DB silently wipes the ADR.

The data loss occurs at TWO sites, not one (the issue report identified
only the full path):
  - dump_and_persist_hashes (pipeline.c) — full re-index.
  - dump_and_persist (pipeline_incremental.c) — incremental re-index when
    a file changed/was added (the no-change incremental early-returns
    without a dump, so the ADR survives there).

Mirror the existing file_hashes preservation: capture the ADR from the
old DB before it is rewritten (in try_incremental_or_delete_db, the single
choke point that opens the old DB for both paths), carry it on the
pipeline, and re-insert it after each dump via cbm_store_adr_store.

Adds regression tests for both the full and incremental paths. Also drops
an unused rc assignment in the pre-existing usages_creates_edges test that
the larger file surfaced under cppcheck (unreadVariable).

Signed-off-by: yasku <aguss.cba@gmail.com>
…ap (DeusData#601)

A whole-graph `OPTIONAL MATCH` (e.g. `MATCH (a) OPTIONAL MATCH
(a)-[:CALLS]->(b) RETURN count(b)`) drove off every node and then
allocated a 10x growth buffer per relationship, blowing up memory/CPU
before the post-hoc 100k-row ceiling could fire. With aggregation the
final row count stays under the ceiling, so it never tripped — the call
just hung. SQLite's own interrupt does not help: the blow-up is in the
C binding-expansion loop, not inside a SQL statement.

Add three guards to the executor:
  - Wall-clock timeout (default 10s, override via CBM_CYPHER_TIMEOUT_MS,
    0 disables) sampled in the hot expansion loop — matches the existing
    CBM_* env-knob precedent (CBM_WORKERS, CBM_SQLITE_MMAP_SIZE).
  - An early working-set cap that rejects an oversized driving set before
    the bindings array is allocated.
  - A matching cap inside expand_pattern_rels for sets that only grow
    mid-traversal, plus a NULL-check on the growth malloc.

On a trip the executor stops and cbm_cypher_execute returns a clear,
actionable error instead of hanging or OOMing. Ordinary queries (bind_cap
at the default row ceiling) are unaffected. Adds regression tests for the
cap, the timeout, and the no-regression path.

Signed-off-by: yasku <aguss.cba@gmail.com>
…sData#601)

The previous commit added #include "foundation/compat.h" to cypher.c
(for cbm_clock_gettime). On Windows that header transitively includes
<windows.h>, which defines the legacy macro 'far' (#define far). This
broke the pre-existing local variable cbm_node_pattern_t far in the
EXISTS pattern parser, failing the test-windows build with 'expected
expression' errors. Unix builds were unaffected since compat.h pulls
windows.h only under _WIN32.

Rename the local to 'far_node'. No behavior change; verified the EXISTS
pattern parser and the DeusData#601 guards still work, full suite 5690 passed.

Signed-off-by: yasku <aguss.cba@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] query_graph hangs on whole-graph OPTIONAL MATCH (no execution timeout / degradation) on large graphs

1 participant