fix(cypher): bound query_graph execution with timeout + working-set cap (#601)#626
Open
yasku wants to merge 6 commits into
Open
fix(cypher): bound query_graph execution with timeout + working-set cap (#601)#626yasku wants to merge 6 commits into
yasku wants to merge 6 commits into
Conversation
The startup/query integrity check unlink()'d a project's .db (plus its -wal/-shm sidecars) on any anomaly, causing total data loss with no backup, no recovery path, and no visible warning. The trigger is often a false positive: a WAL leftover after a SIGKILL, a binary variant swap, or a stale-lock break can all make a recoverable DB look corrupt. Replace the destructive unlink with a rename to <name>.db.corrupt.<ts> and emit a prominent stderr message so the user can inspect, recover, or attach the file to a bug report. Two call sites are fixed: - mcp.c resolve_store(): the reported path, hit on first query. - pipeline.c try_incremental_or_delete_db(): only preserves on real corruption; a healthy DB rebuilt due to a mode change is still deleted so .corrupt files don't accumulate on normal reindexes. Add a regression test (resolve_store_preserves_corrupt_db_issue557) that seeds a corrupt projects row, issues a query, and asserts the original .db is gone while a .corrupt sidecar remains. Signed-off-by: yasku <aguss.cba@gmail.com>
CodeQL flagged a time-of-check/time-of-use race in try_incremental_or_delete_db(): the file at db_path was probed with stat() and then operated on by name (rename/unlink), so the underlying file could change between the two. Drop the redundant stat() existence probe and derive existence from a query-mode open (cbm_store_open_path_query, which never creates the file): a NULL result means absent/unreadable, so we return without touching the path. This removes the stat()->rename() dataflow CodeQL reports and matches resolve_store() in mcp.c, which already uses this pattern. Behavior is unchanged (absent -> skip, healthy -> reindex, corrupt -> preserve as .corrupt). Signed-off-by: yasku <aguss.cba@gmail.com>
Adds pipeline_reindex_preserves_corrupt_db_issue557: indexes a temp repo, corrupts the resulting DB (numeric root_path trips the integrity check), re-indexes, and asserts the DB is preserved as <db>.corrupt.<ts> rather than deleted. Guards both the DeusData#557 data-loss fix and the TOCTOU refactor of try_incremental_or_delete_db (existence via query-mode open, not stat). Signed-off-by: yasku <aguss.cba@gmail.com>
manage_adr stores ADRs in the project_summaries table (SQLite store, unified in DeusData#256). Both dump paths serialize a fresh DB with an empty project_summaries — the custom B-tree writer writes an empty summaries btree — so any re-index that rewrites the DB silently wipes the ADR. The data loss occurs at TWO sites, not one (the issue report identified only the full path): - dump_and_persist_hashes (pipeline.c) — full re-index. - dump_and_persist (pipeline_incremental.c) — incremental re-index when a file changed/was added (the no-change incremental early-returns without a dump, so the ADR survives there). Mirror the existing file_hashes preservation: capture the ADR from the old DB before it is rewritten (in try_incremental_or_delete_db, the single choke point that opens the old DB for both paths), carry it on the pipeline, and re-insert it after each dump via cbm_store_adr_store. Adds regression tests for both the full and incremental paths. Also drops an unused rc assignment in the pre-existing usages_creates_edges test that the larger file surfaced under cppcheck (unreadVariable). Signed-off-by: yasku <aguss.cba@gmail.com>
…ap (DeusData#601) A whole-graph `OPTIONAL MATCH` (e.g. `MATCH (a) OPTIONAL MATCH (a)-[:CALLS]->(b) RETURN count(b)`) drove off every node and then allocated a 10x growth buffer per relationship, blowing up memory/CPU before the post-hoc 100k-row ceiling could fire. With aggregation the final row count stays under the ceiling, so it never tripped — the call just hung. SQLite's own interrupt does not help: the blow-up is in the C binding-expansion loop, not inside a SQL statement. Add three guards to the executor: - Wall-clock timeout (default 10s, override via CBM_CYPHER_TIMEOUT_MS, 0 disables) sampled in the hot expansion loop — matches the existing CBM_* env-knob precedent (CBM_WORKERS, CBM_SQLITE_MMAP_SIZE). - An early working-set cap that rejects an oversized driving set before the bindings array is allocated. - A matching cap inside expand_pattern_rels for sets that only grow mid-traversal, plus a NULL-check on the growth malloc. On a trip the executor stops and cbm_cypher_execute returns a clear, actionable error instead of hanging or OOMing. Ordinary queries (bind_cap at the default row ceiling) are unaffected. Adds regression tests for the cap, the timeout, and the no-regression path. Signed-off-by: yasku <aguss.cba@gmail.com>
…sData#601) The previous commit added #include "foundation/compat.h" to cypher.c (for cbm_clock_gettime). On Windows that header transitively includes <windows.h>, which defines the legacy macro 'far' (#define far). This broke the pre-existing local variable cbm_node_pattern_t far in the EXISTS pattern parser, failing the test-windows build with 'expected expression' errors. Unix builds were unaffected since compat.h pulls windows.h only under _WIN32. Rename the local to 'far_node'. No behavior change; verified the EXISTS pattern parser and the DeusData#601 guards still work, full suite 5690 passed. Signed-off-by: yasku <aguss.cba@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fixes #601 —
query_graphhangs indefinitely on a whole-graphOPTIONAL MATCH, with no execution timeout or graceful degradation.The reported pattern, on a 90k+ node graph:
never returns — no partial result, no error. This PR makes the Cypher executor bounded: a query that would otherwise run forever (or exhaust memory) now stops at a wall-clock deadline or a working-set cap and returns a clear, actionable error.
Root cause (traced in
src/cypher/cypher.c)The executor builds and expands query bindings entirely in C, in memory, before projecting results. Three things compound:
scan_pattern_nodes()for an unlabeledMATCH (a)scans up tomax_rows * CYP_GROWTH_10nodes (≈ 1,000,000 with the default ceiling). Every node becomes a binding.expand_pattern_rels()doesmalloc((bind_cap * CYP_GROWTH_10 + 1) * sizeof(binding_t))up front. With a ~93k driving set and a ~1.5 KBbinding_t, that is a multi-GB single allocation per relationship, followed by a per-binding edge-expansion grind.CYPHER_RESULT_CEILING(100k) is only checked after execution completes (cbm_cypher_execute), andcount()aggregates the result under 100k — so the ceiling never fires while the process is already wedged. The reporter observed exactly this: "hangs rather than hit the documented 100k-row ceiling."There was no wall-clock guard anywhere in
cypher.c(verified by grep: noclock_gettime/deadline/timeout).How the fix was researched (provenance)
The approach was validated against the repo's own history and the database it sits on, not assumed:
gh). The umbrella task: Cypher query language hardening (9 issues) #389 "Cypher query language hardening" (9 issues) is closed/complete, and none of its sub-issues (Cypher: DISTINCT applied after ORDER BY + LIMIT instead of before (execution order bug) #237–242,query_graph: raw property reads work, buttoInteger(...)returns empty strings and literalRETURNexpressions fail parsing #252, query_graph Cypher evaluator: labels(), count(*), AS alias, and LIMIT all broken #305, query_graph silently returns blank default projection for unsupported WITH expression #373) touch execution-time guards — confirming this is a genuine gap, not a deliberate "no timeouts" decision. No open PR addresses [bug]query_graphhangs on whole-graphOPTIONAL MATCH(no execution timeout / degradation) on large graphs #601 (checked before starting, to avoid duplicating work as with feat(mem): CBM_MAX_MEMORY_MB explicit memory-budget override (#580) #586/Feature request: --max-memory flag for RAM-constrained hosts #580).CYPHER_RESULT_CEILING's comment ("prevents accidental multi-GB JSON payloads from unboundedMATCH (n) RETURN n") shows it was always a result-size guard, never an execution guard — so [bug]query_graphhangs on whole-graphOPTIONAL MATCH(no execution timeout / degradation) on large graphs #601 needs a new mechanism.sqlite3_interrupt()/sqlite3_progress_handler()to abort work, but they only cover execution inside a single SQL statement. The blow-up here is in the C binding-expansion loop between many small per-node edge queries (cbm_store_find_edges_by_source/target), so a SQLite-native interrupt would not see it. The guard must live in the executor. This matches how the codebase already usesbusy_timeoutonly for lock waits, not statement runtime.CBM_CYPHER_TIMEOUT_MSfollows the establishedCBM_*pattern (CBM_WORKERS,CBM_SQLITE_MMAP_SIZE) — env override with a sane default.How it was fixed — three execution guards
All in
src/cypher/cypher.c, single-threaded executor (the MCP event loop), so file-static guard state is safe:cyp_guard_begin()(called atcbm_cypher_executeentry) readsCBM_CYPHER_TIMEOUT_MS(default 10 000 ms,0disables) and arms aCLOCK_MONOTONICdeadline.cyp_deadline_exceeded()is sampled every 1024 iterations inside the hot expansion loop; on trip it drops the partial set and unwinds.execute_single(), before the bindings array is allocated, an oversized driving set (bind_cap > CYPHER_MAX_BINDINGS / CYP_GROWTH_10) is rejected up front — so a whole-graphMATCHcan't even request the multi-GB buffer.expand_pattern_rels()for working sets that only blow up mid-traversal, plus a previously-missing NULL-check on the growthmalloc.On any trip,
execute_singlestops cleanly andcbm_cypher_executeturns the latched reason into a clear error:Ordinary queries — where
bind_capsits at the default row ceiling (100k) — are unaffected; the caps use the same threshold so they never reject a normal query.Stack / where this lives
-std=c11 -Wall -Wextra -Werror)src/cypher/cypher.c) — translates Cypher → in-memory graph traversal over the SQLite-backed storecbm_store_find_edges_by_*Makefile.cbm(ASan + UBSan test build;-O2production binary)tests/test_cypher.c,make -f Makefile.cbm test)CBM_CYPHER_TIMEOUT_MS), no new dependencies, no API/schema changeTests added (
tests/test_cypher.c)cypher_exec_guard_caps_huge_driving_set— a largemax_rowspushesbind_cappast the cap deterministically (the early guard fires before allocation, so no large graph is needed); asserts the clear "working-set" error.cypher_exec_guard_timeout_trips— builds 3,000 nodes + CALLS edges and runs a whole-graph expansion underCBM_CYPHER_TIMEOUT_MS=1, kept below the cap viamax_rows; under the ASan test build the expansion comfortably outlasts 1 ms, so the timeout trips deterministically; asserts the "time" error.cypher_exec_guard_normal_query_unaffected— an ordinary directedMATCH (f:Function)-[:CALLS]->(g) RETURN f.name, count(g)still returns correct rows with no error.Manual end-to-end reproduction
Indexed a real 13,450-file workspace → 277,759 nodes / 705,478 edges, then ran the issue's query via the CLI:
Before this PR the same call never returned. Normal
query_graphcalls against the same index are unchanged.Validation
make -f Makefile.cbm testmake -f Makefile.cbm cbm-Werror)clang-format(changed code)cppcheckw/ CI suppress flagsknownConditionTrueFalseatcypher.c:2442/2446, in untouched code)scripts/check-no-test-skips.shscripts/check-dco.shWhere it was tested — full environment
arm64-apple-darwin25.5.0Makefile.cbmmake -f Makefile.cbm test)Checklist
git commit -s) — DCO verified viascripts/check-dco.shmake -f Makefile.cbm test) — 5690 passedclang-formatclean on changed code;cppcheckclean on changed code (only a pre-existing finding in untouched code remains)Notes
CBM_CYPHER_TIMEOUT_MSlets operators with very large graphs raise or disable it.cypher.cplus their tests — no unrelated reformatting. One pre-existing comment-alignment block that clang-format 22 wanted to rewrite was deliberately left in the repo's current (CI clang-format-20) format.