Specialize SQLGraph.from_other for SQLite-to-SQLite (closes #285)#286
JoOkuma wants to merge 4 commits into main
Conversation
The generic `BaseGraph.from_other` materializes the source graph in Python and then calls `other.overlaps()`, which on a SQLGraph-backed GraphView issues a single `IN (...)` clause containing every selected node id. With non-trivial selections this hits SQLite's SQLITE_MAX_VARIABLE_NUMBER and raises "too many SQL variables". Per @JoOkuma's suggestion in royerlab#285, add a specialized fast path that bypasses the BaseGraph copy entirely when both sides are SQLite:

* Open the destination SQLGraph so it materializes its schema.
* Mirror any extra columns and metadata.
* Dispose the dst engine, ATTACH the dst database to the source's connection, and dump each table via `INSERT INTO ... SELECT FROM`.
* For GraphView sources, materialize the selected node ids in a temp table (chunked inserts) and JOIN against it instead of using a giant `IN (...)` clause.
* Re-open the dst engine and refresh derived caches.

For correctness in other code paths (summary, NN solver, direct callers), also chunk `SQLGraph.overlaps(node_ids)` so it never overflows the bound-parameter limit.

New tests cover:

* The ATTACH fast path is taken (`BaseGraph.from_other` not called).
* The exact issue royerlab#285 scenario: filter -> subgraph -> from_other with a selection > 999 nodes plus overlaps.
* Node ids are preserved on full copies.
* Fallback to the generic path for `:memory:` destinations and for non-SQL sources.
* `SQLGraph.overlaps` with > 999 node ids returns the right pairs.
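The temp-table JOIN trick from the bullet list above can be sketched standalone with plain `sqlite3` (this is an illustrative toy, not the library's actual code; table and column names are made up). Instead of binding every selected node id into one giant `IN (...)` clause, which can exceed SQLITE_MAX_VARIABLE_NUMBER (999 on older SQLite builds), the selection is materialized in a temp table and joined against:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO nodes VALUES (?)", [(i,) for i in range(2000)])

selected = list(range(1500))  # more ids than the historical 999-variable limit

# Materialize the selection in a temp table. Real code would batch multi-row
# INSERTs to stay under the bound-parameter limit; executemany here binds one
# parameter per row, so the loop is only illustrating the chunked shape.
conn.execute("CREATE TEMP TABLE sel (id INTEGER PRIMARY KEY)")
CHUNK = 500
for start in range(0, len(selected), CHUNK):
    chunk = selected[start : start + CHUNK]
    conn.executemany("INSERT INTO sel VALUES (?)", [(i,) for i in chunk])

# The final query carries zero per-id bound parameters, so it cannot
# overflow the variable limit no matter how large the selection is.
rows = conn.execute(
    "SELECT n.id FROM nodes AS n JOIN sel AS s ON n.id = s.id"
).fetchall()
print(len(rows))  # 1500
```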
Instead of instantiating the destination upfront and ALTER-ing it into shape, replay the source's own CREATE TABLE/INDEX DDL from sqlite_master against the attached destination, copy rows (including the Metadata table verbatim), then open the destination via cls(**kwargs). The constructor's normal reflection path rebuilds the in-memory state, so the manual engine dispose/reopen, cache invalidation, and _update_max_id_per_time call are no longer needed. Also tighten the eligibility gate in from_other: the destination file must start empty (DDL replay would clash with an existing schema), and overwrite=True is excluded (the dst constructor would drop the tables we just populated).
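The ATTACH + DDL-replay copy described above can be sketched as a minimal standalone function (hypothetical names and a deliberately naive schema-prefix rewrite; the real implementation would handle quoting, triggers, and views more carefully):

```python
import os
import sqlite3
import tempfile

def dump_sqlite(src_path: str, dst_path: str) -> None:
    """Copy an entire SQLite database into an empty destination file by
    replaying the source's own DDL from sqlite_master, then bulk-copying
    rows with INSERT INTO ... SELECT -- all on the source connection."""
    src = sqlite3.connect(src_path)
    try:
        src.execute("ATTACH DATABASE ? AS dst", (dst_path,))
        # Replay CREATE TABLE / CREATE INDEX statements against the attached
        # destination. Auto-indexes have NULL sql and are skipped.
        for (sql,) in src.execute(
            "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"
        ).fetchall():
            src.execute(
                sql.replace("CREATE TABLE ", "CREATE TABLE dst.", 1)
                   .replace("CREATE INDEX ", "CREATE INDEX dst.", 1)
            )
        # Copy rows table by table with a single SQL statement each.
        for (name,) in src.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall():
            src.execute(f'INSERT INTO dst."{name}" SELECT * FROM main."{name}"')
        src.commit()
    finally:
        src.close()

# Usage on throwaway files:
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "src.db")
dst_path = os.path.join(tmp, "dst.db")
with sqlite3.connect(src_path) as c:
    c.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, t INTEGER)")
    c.executemany("INSERT INTO nodes VALUES (?, ?)", [(i, i % 3) for i in range(10)])
dump_sqlite(src_path, dst_path)
with sqlite3.connect(dst_path) as c:
    copied = c.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
print(copied)  # 10
```

This also shows why the eligibility gate matters: replaying DDL into a destination that already has tables would raise "table already exists", which is exactly why the fast path requires the destination file to start empty.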
Replace the hand-rolled chunk loop with the existing _chunked_sa_read helper. Only the source side is constrained per chunk via IN(...); the target side is filtered in Polars afterwards, which avoids the quadratic bound-parameter blow-up of having two IN clauses per chunk and lets us use the full _sql_chunk_size() budget. Also drop the seen-set deduplication, which was a behaviour change vs. the pre-fix code (the original query.all() preserved duplicates if any existed in the Overlap table).
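The one-sided chunking strategy can be sketched as follows (a toy stand-in for the library's `_chunked_sa_read`-based code; the table and helper names here are invented). Only the source side is constrained with `IN (...)` per chunk, the target side is filtered in Python afterwards, and no deduplication is applied, so duplicates in the overlap table are preserved:

```python
import sqlite3

def chunked_overlaps(conn, node_ids, chunk_size=900):
    """Fetch (source_id, target_id) pairs restricted to node_ids on both
    sides, without ever binding more than chunk_size parameters at once."""
    wanted = set(node_ids)
    ids = list(node_ids)
    pairs = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start : start + chunk_size]
        qs = ",".join("?" * len(chunk))
        rows = conn.execute(
            f"SELECT source_id, target_id FROM overlaps "
            f"WHERE source_id IN ({qs})",
            chunk,
        ).fetchall()
        # Target side filtered in Python: avoids a second IN clause per chunk
        # and deliberately keeps duplicates, matching the pre-fix behaviour.
        pairs.extend((s, t) for s, t in rows if t in wanted)
    return pairs

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE overlaps (source_id INTEGER, target_id INTEGER)")
conn.executemany(
    "INSERT INTO overlaps VALUES (?, ?)", [(i, i + 1) for i in range(2000)]
)
result = chunked_overlaps(conn, list(range(1500)))
print(len(result))  # 1499: pairs (i, i+1) with both ends in 0..1499
```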
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #286 +/- ##
==========================================
+ Coverage 87.70% 87.72% +0.02%
==========================================
Files 57 57
Lines 4879 4936 +57
Branches 858 868 +10
==========================================
+ Hits 4279 4330 +51
- Misses 378 381 +3
- Partials 222 225 +3
yfukai left a comment:
I'm sorry for the late response. I only have minor comments, and this looks good to me!
dst._engine.dispose()

def test_sql_from_other_full_copy_preserves_node_ids(tmp_path: Path) -> None:
I may be missing something, but this looks like it has already been tested by _assert_graphs_equivalent in test_sql_from_other_uses_attach_dump_for_sqlite?
dst._engine.dispose()

def test_sql_from_other_subgraph_view_with_overlaps(tmp_path: Path) -> None:
It may make sense to explicitly test this scenario for all combinations of SQL(disk), SQL(memory), and RustworkX.
Closes #285.

`BaseGraph.from_other` ends with `other.overlaps()`, which on a SQL-rooted `GraphView` issues a single `IN (...)` containing every selected node id and overflows SQLite's variable limit.

This PR adds a SQL-level fast path on `SQLGraph.from_other`: when both ends are on-disk SQLite, the source's schema and rows are dumped straight into the destination via `ATTACH DATABASE` + replayed `CREATE TABLE`/`INDEX` DDL + `INSERT INTO ... SELECT`, then the destination is opened from the populated file. Filtered selections (`GraphView`) are joined against a temp table to sidestep the bound-parameter limit. Other configurations fall back to the generic path.

Separately, `SQLGraph.overlaps(node_ids)` is now chunked via the existing `_chunked_sa_read` helper so direct callers don't hit the same limit.

Tests cover the fast path, the issue reproducer (filter → subgraph → from_other with > 999 nodes + overlaps), full-copy id preservation, and fallbacks for `:memory:` and non-SQL sources.