fix(cypher): bound query_graph execution with timeout + working-set cap (#601) by yasku · Pull Request #626 · DeusData/codebase-memory-mcp

yasku · 2026-06-25T09:54:15Z

What does this PR do?

Fixes #601 — query_graph hangs indefinitely on a whole-graph OPTIONAL MATCH, with no execution timeout or graceful degradation.

The reported pattern, on a 90k+ node graph:

MATCH (a)
OPTIONAL MATCH (a)-[:CALLS]->(b)
RETURN a.qualified_name, count(b)

never returns — no partial result, no error. This PR makes the Cypher executor bounded: a query that would otherwise run forever (or exhaust memory) now stops at a wall-clock deadline or a working-set cap and returns a clear, actionable error.

Root cause (traced in `src/cypher/cypher.c`)

The executor builds and expands query bindings entirely in C, in memory, before projecting results. Three things compound:

Unbounded driving set. scan_pattern_nodes() for an unlabeled MATCH (a) scans up to max_rows * CYP_GROWTH_10 nodes (≈ 1,000,000 with the default ceiling). Every node becomes a binding.
10× growth allocation per relationship. expand_pattern_rels() does
malloc((bind_cap * CYP_GROWTH_10 + 1) * sizeof(binding_t)) up front. With a ~93k driving set and a ~1.5 KB binding_t, that is a multi-GB single allocation per relationship, followed by a per-binding edge-expansion grind.
The existing ceiling is post-hoc and mis-targeted. CYPHER_RESULT_CEILING (100k) is only checked after execution completes (cbm_cypher_execute), and count() aggregates the result under 100k — so the ceiling never fires while the process is already wedged. The reporter observed exactly this: "hangs rather than hit the documented 100k-row ceiling."

There was no wall-clock guard anywhere in cypher.c (verified by grep: no clock_gettime / deadline / timeout).

How the fix was researched (provenance)

The approach was validated against the repo's own history and the database it sits on, not assumed:

GitHub — prior art / recorded decisions (gh). The umbrella task: Cypher query language hardening (9 issues) #389 "Cypher query language hardening" (9 issues) is closed/complete, and none of its sub-issues (Cypher: DISTINCT applied after ORDER BY + LIMIT instead of before (execution order bug) #237–242, query_graph: raw property reads work, but toInteger(...) returns empty strings and literal RETURN expressions fail parsing #252, query_graph Cypher evaluator: labels(), count(*), AS alias, and LIMIT all broken #305, query_graph silently returns blank default projection for unsupported WITH expression #373) touch execution-time guards — confirming this is a genuine gap, not a deliberate "no timeouts" decision. No open PR addresses [bug] query_graph hangs on whole-graph OPTIONAL MATCH (no execution timeout / degradation) on large graphs #601 (checked before starting, to avoid duplicating work as with feat(mem): CBM_MAX_MEMORY_MB explicit memory-budget override (#580) #586/Feature request: --max-memory flag for RAM-constrained hosts #580).
The result ceiling is payload-only by design. CYPHER_RESULT_CEILING's comment ("prevents accidental multi-GB JSON payloads from unbounded MATCH (n) RETURN n") shows it was always a result-size guard, never an execution guard — so [bug] query_graph hangs on whole-graph OPTIONAL MATCH (no execution timeout / degradation) on large graphs #601 needs a new mechanism.
SQLite architecture (Context7, official docs). SQLite offers sqlite3_interrupt() / sqlite3_progress_handler() to abort work, but they only cover execution inside a single SQL statement. The blow-up here is in the C binding-expansion loop between many small per-node edge queries (cbm_store_find_edges_by_source/target), so a SQLite-native interrupt would not see it. The guard must live in the executor. This matches how the codebase already uses busy_timeout only for lock waits, not statement runtime.
Env-knob precedent. Exposing the timeout via CBM_CYPHER_TIMEOUT_MS follows the established CBM_* pattern (CBM_WORKERS, CBM_SQLITE_MMAP_SIZE) — env override with a sane default.

How it was fixed — three execution guards

All in src/cypher/cypher.c, single-threaded executor (the MCP event loop), so file-static guard state is safe:

Wall-clock timeout. cyp_guard_begin() (called at cbm_cypher_execute entry) reads CBM_CYPHER_TIMEOUT_MS (default 10 000 ms, 0 disables) and arms a CLOCK_MONOTONIC deadline. cyp_deadline_exceeded() is sampled every 1024 iterations inside the hot expansion loop; on trip it drops the partial set and unwinds.
Early working-set cap. In execute_single(), before the bindings array is allocated, an oversized driving set (bind_cap > CYPHER_MAX_BINDINGS / CYP_GROWTH_10) is rejected up front — so a whole-graph MATCH can't even request the multi-GB buffer.
Expansion cap + NULL-check. A matching cap inside expand_pattern_rels() for working sets that only blow up mid-traversal, plus a previously-missing NULL-check on the growth malloc.

On any trip, execute_single stops cleanly and cbm_cypher_execute turns the latched reason into a clear error:

query expansion exceeded the working-set limit — add LIMIT, a directed MATCH, or a WHERE filter
query exceeded the execution time limit — … (raise CBM_CYPHER_TIMEOUT_MS to allow longer)

Ordinary queries — where bind_cap sits at the default row ceiling (100k) — are unaffected; the caps use the same threshold so they never reject a normal query.

Stack / where this lives

Layer	Detail
Language	C11 (`-std=c11 -Wall -Wextra -Werror`)
Component	Cypher query engine (`src/cypher/cypher.c`) — translates Cypher → in-memory graph traversal over the SQLite-backed store
Storage	SQLite (vendored, in-process); the executor reads edges via `cbm_store_find_edges_by_*`
Build	`Makefile.cbm` (ASan + UBSan test build; `-O2` production binary)
Tests	first-party C test harness (`tests/test_cypher.c`, `make -f Makefile.cbm test`)
New surface	one env var (`CBM_CYPHER_TIMEOUT_MS`), no new dependencies, no API/schema change

Tests added (`tests/test_cypher.c`)

cypher_exec_guard_caps_huge_driving_set — a large max_rows pushes bind_cap past the cap deterministically (the early guard fires before allocation, so no large graph is needed); asserts the clear "working-set" error.
cypher_exec_guard_timeout_trips — builds 3,000 nodes + CALLS edges and runs a whole-graph expansion under CBM_CYPHER_TIMEOUT_MS=1, kept below the cap via max_rows; under the ASan test build the expansion comfortably outlasts 1 ms, so the timeout trips deterministically; asserts the "time" error.
cypher_exec_guard_normal_query_unaffected — an ordinary directed MATCH (f:Function)-[:CALLS]->(g) RETURN f.name, count(g) still returns correct rows with no error.

Manual end-to-end reproduction

Indexed a real 13,450-file workspace → 277,759 nodes / 705,478 edges, then ran the issue's query via the CLI:

$ codebase-memory-mcp cli query_graph '{"project":"…","query":"MATCH (a) OPTIONAL MATCH (a)-[:CALLS]->(b) RETURN a.qualified_name, count(b)"}'
query expansion exceeded the working-set limit — add LIMIT, a directed MATCH, or a WHERE filter

Before this PR the same call never returned. Normal query_graph calls against the same index are unchanged.

Validation

Check	Command	Result
Full test suite	`make -f Makefile.cbm test`	✅ 5690 passed, 0 failures
Build (prod)	`make -f Makefile.cbm cbm`	✅ no warnings (`-Werror`)
Formatting	`clang-format` (changed code)	✅ clean
Static analysis	`cppcheck` w/ CI suppress flags	✅ no new findings (one pre-existing `knownConditionTrueFalse` at `cypher.c:2442/2446`, in untouched code)
No-skips policy	`scripts/check-no-test-skips.sh`	✅ OK
DCO	`scripts/check-dco.sh`	✅ signed

Where it was tested — full environment


OS	macOS 26.5 (build 25F71)
Kernel	Darwin 25.5.0
Architecture	arm64 (Apple Silicon)
Chip	Apple M1 — 8 cores (4 performance + 4 efficiency)
RAM	8 GB (the constrained-host profile from #580/#601)
Compiler	Apple clang 21.0.0 (clang-2100.1.1.101), target `arm64-apple-darwin25.5.0`
Build system	GNU Make 3.81, `Makefile.cbm`
Git	2.50.1 (Apple Git-155)
Shell	zsh
Lint tooling	clang-format 22.1.8, cppcheck 2.21.0
Test build	ASan + UBSan (`make -f Makefile.cbm test`)
Repro graph	277,759 nodes / 705,478 edges, indexed from a real 13,450-file workspace

Note on toolchain skew: local clang-format is 22.1.8 while CI pins clang-format-20; the changed code is clean under 22 and the surrounding code is left in the repo's clang-format-20 format, so CI lint is expected to pass. Local cppcheck is 2.21.0 (newer/stricter than CI), which is why one pre-existing finding in untouched code surfaces locally but not in CI.

Checklist

Every commit is signed off (git commit -s) — DCO verified via scripts/check-dco.sh
Tests pass locally (make -f Makefile.cbm test) — 5690 passed
Lint passes — clang-format clean on changed code; cppcheck clean on changed code (only a pre-existing finding in untouched code remains)
New behavior is covered by a test (reproduce-first)

Notes

The default 10 s budget is a safety net for pathological queries, not a performance gate — it should never trip on a normal query; CBM_CYPHER_TIMEOUT_MS lets operators with very large graphs raise or disable it.
Scope is limited to the executor guards in cypher.c plus their tests — no unrelated reformatting. One pre-existing comment-alignment block that clang-format 22 wanted to rewrite was deliberately left in the repo's current (CI clang-format-20) format.
I cannot run the project's CodeQL / multi-platform CI locally; those will run on the PR. The reasoning above is grounded, but final confirmation comes from CI.

The startup/query integrity check unlink()'d a project's .db (plus its -wal/-shm sidecars) on any anomaly, causing total data loss with no backup, no recovery path, and no visible warning. The trigger is often a false positive: a WAL leftover after a SIGKILL, a binary variant swap, or a stale-lock break can all make a recoverable DB look corrupt. Replace the destructive unlink with a rename to <name>.db.corrupt.<ts> and emit a prominent stderr message so the user can inspect, recover, or attach the file to a bug report. Two call sites are fixed: - mcp.c resolve_store(): the reported path, hit on first query. - pipeline.c try_incremental_or_delete_db(): only preserves on real corruption; a healthy DB rebuilt due to a mode change is still deleted so .corrupt files don't accumulate on normal reindexes. Add a regression test (resolve_store_preserves_corrupt_db_issue557) that seeds a corrupt projects row, issues a query, and asserts the original .db is gone while a .corrupt sidecar remains. Signed-off-by: yasku <aguss.cba@gmail.com>

CodeQL flagged a time-of-check/time-of-use race in try_incremental_or_delete_db(): the file at db_path was probed with stat() and then operated on by name (rename/unlink), so the underlying file could change between the two. Drop the redundant stat() existence probe and derive existence from a query-mode open (cbm_store_open_path_query, which never creates the file): a NULL result means absent/unreadable, so we return without touching the path. This removes the stat()->rename() dataflow CodeQL reports and matches resolve_store() in mcp.c, which already uses this pattern. Behavior is unchanged (absent -> skip, healthy -> reindex, corrupt -> preserve as .corrupt). Signed-off-by: yasku <aguss.cba@gmail.com>

Adds pipeline_reindex_preserves_corrupt_db_issue557: indexes a temp repo, corrupts the resulting DB (numeric root_path trips the integrity check), re-indexes, and asserts the DB is preserved as <db>.corrupt.<ts> rather than deleted. Guards both the DeusData#557 data-loss fix and the TOCTOU refactor of try_incremental_or_delete_db (existence via query-mode open, not stat). Signed-off-by: yasku <aguss.cba@gmail.com>

manage_adr stores ADRs in the project_summaries table (SQLite store, unified in DeusData#256). Both dump paths serialize a fresh DB with an empty project_summaries — the custom B-tree writer writes an empty summaries btree — so any re-index that rewrites the DB silently wipes the ADR. The data loss occurs at TWO sites, not one (the issue report identified only the full path): - dump_and_persist_hashes (pipeline.c) — full re-index. - dump_and_persist (pipeline_incremental.c) — incremental re-index when a file changed/was added (the no-change incremental early-returns without a dump, so the ADR survives there). Mirror the existing file_hashes preservation: capture the ADR from the old DB before it is rewritten (in try_incremental_or_delete_db, the single choke point that opens the old DB for both paths), carry it on the pipeline, and re-insert it after each dump via cbm_store_adr_store. Adds regression tests for both the full and incremental paths. Also drops an unused rc assignment in the pre-existing usages_creates_edges test that the larger file surfaced under cppcheck (unreadVariable). Signed-off-by: yasku <aguss.cba@gmail.com>

…ap (DeusData#601) A whole-graph `OPTIONAL MATCH` (e.g. `MATCH (a) OPTIONAL MATCH (a)-[:CALLS]->(b) RETURN count(b)`) drove off every node and then allocated a 10x growth buffer per relationship, blowing up memory/CPU before the post-hoc 100k-row ceiling could fire. With aggregation the final row count stays under the ceiling, so it never tripped — the call just hung. SQLite's own interrupt does not help: the blow-up is in the C binding-expansion loop, not inside a SQL statement. Add three guards to the executor: - Wall-clock timeout (default 10s, override via CBM_CYPHER_TIMEOUT_MS, 0 disables) sampled in the hot expansion loop — matches the existing CBM_* env-knob precedent (CBM_WORKERS, CBM_SQLITE_MMAP_SIZE). - An early working-set cap that rejects an oversized driving set before the bindings array is allocated. - A matching cap inside expand_pattern_rels for sets that only grow mid-traversal, plus a NULL-check on the growth malloc. On a trip the executor stops and cbm_cypher_execute returns a clear, actionable error instead of hanging or OOMing. Ordinary queries (bind_cap at the default row ceiling) are unaffected. Adds regression tests for the cap, the timeout, and the no-regression path. Signed-off-by: yasku <aguss.cba@gmail.com>

…sData#601) The previous commit added #include "foundation/compat.h" to cypher.c (for cbm_clock_gettime). On Windows that header transitively includes <windows.h>, which defines the legacy macro 'far' (#define far). This broke the pre-existing local variable cbm_node_pattern_t far in the EXISTS pattern parser, failing the test-windows build with 'expected expression' errors. Unix builds were unaffected since compat.h pulls windows.h only under _WIN32. Rename the local to 'far_node'. No behavior change; verified the EXISTS pattern parser and the DeusData#601 guards still work, full suite 5690 passed. Signed-off-by: yasku <aguss.cba@gmail.com>

yasku added 6 commits June 24, 2026 22:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cypher): bound query_graph execution with timeout + working-set cap (#601)#626

fix(cypher): bound query_graph execution with timeout + working-set cap (#601)#626
yasku wants to merge 6 commits into
DeusData:mainfrom
yasku:fix/601-query-timeout

yasku commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yasku commented Jun 25, 2026

What does this PR do?

Root cause (traced in src/cypher/cypher.c)

How the fix was researched (provenance)

How it was fixed — three execution guards

Stack / where this lives

Tests added (tests/test_cypher.c)

Manual end-to-end reproduction

Validation

Where it was tested — full environment

Checklist

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Root cause (traced in `src/cypher/cypher.c`)

Tests added (`tests/test_cypher.c`)