You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
-`chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
11
11
-`docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
12
12
-`embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization. Implements `ModelBackend` trait
13
13
-`model.rs` — pluggable `ModelBackend` trait, model registry, and `parse_model_spec()`. Enables future model swapping without changing consumer code
-`hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:**`hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
19
-
-`indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW + FTS5
20
+
-`indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), serial writes to store + HNSW + FTS5, and vault graph edge building (wikilinks + people detection)
20
21
21
-
`main.rs` is a thin clap CLI that wires the modules together. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`.
22
+
`main.rs` is a thin clap CLI. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`, `graph` (show/stats).
22
23
23
24
## Key patterns
24
25
25
-
-**Hybrid search:** Queries run through two lanes — semantic (HNSW embeddings) and keyword (FTS5 BM25). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights
26
-
-**Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Chunks split at the highest-scored break point near the token target. Code fence protection prevents splitting inside code blocks
27
-
-**Incremental indexing:**`diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new. FTS5 entries are cleaned up alongside vector entries
28
-
-**HNSW rebuild on every run:** Vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
29
-
-**Docids:** Each file gets a deterministic 6-char hex ID (SHA-256 of relative path). Displayed in search results for quick reference
26
+
-**3-lane hybrid search:** Queries run through three lanes — semantic (HNSW embeddings), keyword (FTS5 BM25), and graph (wikilink expansion). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights (semantic 1.0, FTS 1.0, graph 0.8)
27
+
-**Vault graph:**`edges` table stores bidirectional wikilink edges and mention edges. Built during indexing after all files are written. People detection scans for person name/alias mentions using notes from the configured People folder
28
+
-**Graph agent:** Expands seed results by following wikilinks 1-2 hops. Decay: 0.8× for 1-hop, 0.5× for 2-hop. Relevance filter: must contain query term (FTS5) or share tags with seed. Multi-parent merge takes highest score
-**Incremental indexing:**`diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks and edges deleted, then are re-processed. FTS5 entries cleaned up alongside vector entries
31
+
-**HNSW rebuild on every run:** Vectors stored as BLOBs. Full HNSW index rebuilt from `store.get_all_vectors()` after SQLite update (hnsw_rs limitation)
32
+
-**Docids:** Each file gets a deterministic 6-char hex ID. Displayed in search results
30
33
-**Vault profiles:**`engraph init` auto-detects vault structure and writes `vault.toml`
31
-
-**Pluggable models:**`ModelBackend` trait enables future model swapping. Current implementation uses ONNX all-MiniLM-L6-v2
34
+
-**Pluggable models:**`ModelBackend` trait enables future model swapping
32
35
33
36
## Data directory
34
37
35
-
`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite with FTS5), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
38
+
`~/.engraph/` — hardcoded via `Config::data_dir()`. Contains `engraph.db` (SQLite with FTS5 + edges), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
36
39
37
40
Single vault only. Re-indexing a different vault path triggers a confirmation prompt.
38
41
39
42
## Dependencies to be aware of
40
43
41
-
-`ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. `Session::builder()?.commit_from_file()` pattern. Does not provide prebuilt binaries for all targets (no x86_64-apple-darwin)
42
-
-`hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak`used in `load()` to satisfy `'static` lifetime on the loaded index. Read-only after load
44
+
-`ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. Does not provide prebuilt binaries for all targets
45
+
-`hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` in `load()`. Read-only after load
0 commit comments