Commit 46f4764

devwhodevsclaude committed
chore: bump to v0.2.0 — hybrid search, smart chunking, vault profiles
Version bump and integration for engraph v2.0:

- Smart chunking with break-point scoring (replaces heading-only splitting)
- 6-char docid system for quick file reference
- FTS5 full-text search lane (BM25 keyword matching)
- RRF fusion engine merging semantic + FTS5 results
- Vault profile auto-detection (PARA/Folders/Flat, Obsidian/Logseq)
- Pluggable ModelBackend trait for future model swapping
- All code formatted, clippy clean, 91 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dc01f9e commit 46f4764

10 files changed

Lines changed: 363 additions & 173 deletions

CLAUDE.md

Lines changed: 22 additions & 14 deletions
```diff
@@ -4,28 +4,35 @@ Local semantic search CLI for Obsidian vaults. Rust, MIT licensed.
 
 ## Architecture
 
-Single binary with 7 modules behind a lib crate:
-
-- `config.rs` — loads `~/.engraph/config.toml`, merges CLI args, provides `data_dir()`
-- `chunker.rs` — splits markdown by `##` headings, strips YAML frontmatter, extracts tags. `split_oversized_chunks()` handles token-aware sub-splitting with overlap
-- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization
-- `store.rs` — SQLite persistence. Tables: `meta`, `files`, `chunks` (with vector BLOBs), `tombstones`. Handles incremental diffing via content hashes
+Single binary with 11 modules behind a lib crate:
+
+- `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`
+- `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
+- `docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
+- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization. Implements `ModelBackend` trait
+- `model.rs` — pluggable `ModelBackend` trait, model registry, and `parse_model_spec()`. Enables future model swapping without changing consumer code
+- `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
+- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 results. Supports lane weighting and `--explain` output
+- `profile.rs` — vault profile detection. Auto-detects PARA/Folders/Flat structure, vault type (Obsidian/Logseq/Plain), wikilinks, frontmatter, tags. Writes/loads `vault.toml`
+- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5 virtual table), `tombstones`. Handles incremental diffing via content hashes
 - `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
-- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW
-- `search.rs` — embeds query, searches HNSW with tombstone filtering, formats results (human + JSON). Also handles `status` formatting
+- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW + FTS5
 
-`main.rs` is a thin clap CLI that wires the modules together.
+`main.rs` is a thin clap CLI that wires the modules together. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`.
 
 ## Key patterns
 
-- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new
+- **Hybrid search:** Queries run through two lanes — semantic (HNSW embeddings) and keyword (FTS5 BM25). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights
+- **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Chunks split at the highest-scored break point near the token target. Code fence protection prevents splitting inside code blocks
+- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new. FTS5 entries are cleaned up alongside vector entries
 - **HNSW rebuild on every run:** Vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
-- **Vector IDs:** Assigned sequentially, stored in both SQLite and HNSW. `next_vector_id` is derived from `MAX(vector_id)` in SQLite
-- **Tombstones:** Exist in the schema but are largely unused now that we rebuild HNSW each run. Kept for future use if switching to a vector store that supports deletion
+- **Docids:** Each file gets a deterministic 6-char hex ID (SHA-256 of relative path). Displayed in search results for quick reference
+- **Vault profiles:** `engraph init` auto-detects vault structure and writes `vault.toml`
+- **Pluggable models:** `ModelBackend` trait enables future model swapping. Current implementation uses ONNX all-MiniLM-L6-v2
 
 ## Data directory
 
-`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite), `hnsw/` (index files), `models/` (ONNX model + tokenizer).
+`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite with FTS5), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
 
 Single vault only. Re-indexing a different vault path triggers a confirmation prompt.
 
@@ -35,10 +42,11 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr
 - `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` used in `load()` to satisfy `'static` lifetime on the loaded index. Read-only after load
 - `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature
 - `ignore` (0.4) — vault walking with automatic `.gitignore` support
+- `rusqlite` (0.32) — bundled SQLite with FTS5 support
 
 ## Testing
 
-- Unit tests in each module (`cargo test --lib`) — 44 tests, no network required
+- Unit tests in each module (`cargo test --lib`) — 91 tests, no network required
 - 1 ignored smoke test (`test_embed_smoke`) — downloads ONNX model, verifies embedding
 - Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download. Use `tempfile` for isolated data dirs
```
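The lane merging that the new `fusion.rs` entry describes is standard Reciprocal Rank Fusion: each lane contributes `weight / (k + rank)` per document, with 1-based ranks and a damping constant `k` (commonly 60). The sketch below is illustrative only; the function name, signature, and docids are assumptions, not engraph's actual API.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over ranked lanes.
/// Each lane is (weight, ranked ids, best first); every id earns
/// weight / (k + rank) per lane, and ids found in several lanes
/// accumulate score across them.
fn rrf_fuse(lanes: &[(f64, Vec<&str>)], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, ranked_ids) in lanes {
        for (i, id) in ranked_ids.iter().enumerate() {
            // rank is 1-based: first result gets weight / (k + 1)
            *scores.entry(id.to_string()).or_insert(0.0) += weight / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Lane 1: semantic (HNSW) results; lane 2: keyword (FTS5 BM25) results.
    let semantic = vec!["a1b2c3", "d4e5f6", "0a0b0c"];
    let keyword = vec!["d4e5f6", "a1b2c3", "ffeedd"];
    let fused = rrf_fuse(&[(1.0, semantic), (1.0, keyword)], 60.0);
    // Docids appearing in both lanes outrank single-lane docids.
    for (id, score) in &fused {
        println!("{id} {score:.5}");
    }
}
```

Raising a lane's weight biases the fusion toward that lane without discarding the other, which is presumably what the configurable lane weights control.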

Cargo.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 [package]
 name = "engraph"
-version = "0.1.0"
+version = "0.2.0"
 edition = "2024"
 description = "Local semantic search for Obsidian vaults"
 license = "MIT"
```
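The break-point scoring summarized in CLAUDE.md's Key patterns (headings 50-100, code fences 80, thematic breaks 60, blank lines 20) can be sketched in a stripped-down form. The `BreakPoint` and helpers below are simplified stand-ins, assumed for illustration; the real `chunker.rs` also tracks byte offsets and code-fence state so it never splits inside a fence.

```rust
#[derive(Debug, Clone, Copy)]
struct BreakPoint {
    line_number: usize,
    score: u32,
}

/// Score a line as a candidate split point; stronger structural
/// boundaries get higher scores.
fn score_line(line: &str) -> u32 {
    let t = line.trim();
    if t.starts_with("# ") && !t.starts_with("## ") {
        100 // H1 heading
    } else if t.starts_with("## ") {
        90 // H2 heading
    } else if t.starts_with("```") {
        80 // code fence boundary
    } else if t == "---" || t == "***" {
        60 // thematic break
    } else if t.is_empty() {
        20 // blank line
    } else {
        1 // ordinary text: weakest candidate
    }
}

/// Pick the highest-scored break point within `window` lines of the
/// target line (the line nearest the token budget).
fn best_break(bps: &[BreakPoint], target: usize, window: usize) -> Option<BreakPoint> {
    bps.iter()
        .filter(|bp| bp.line_number.abs_diff(target) <= window)
        .max_by_key(|bp| bp.score)
        .copied()
}

fn main() {
    let doc = "# Title\nintro text\n\n## Section\nbody";
    let bps: Vec<BreakPoint> = doc
        .lines()
        .enumerate()
        .map(|(i, l)| BreakPoint { line_number: i, score: score_line(l) })
        .collect();
    if let Some(bp) = best_break(&bps, 2, 2) {
        println!("split at line {} (score {})", bp.line_number, bp.score);
    }
}
```

The design point is that the split position is chosen by boundary quality near the budget, not by the budget alone, so chunks end at headings or breaks instead of mid-paragraph.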

src/chunker.rs

Lines changed: 116 additions & 42 deletions
```diff
@@ -56,7 +56,12 @@ pub fn find_break_points(content: &str) -> Vec<BreakPoint> {
                 score: 80,
                 inside_code_fence: bp_inside,
             });
-            byte_offset += line.len() + if byte_offset + line.len() < content.len() { 1 } else { 0 };
+            byte_offset += line.len()
+                + if byte_offset + line.len() < content.len() {
+                    1
+                } else {
+                    0
+                };
             continue;
         } else if inside_code_fence {
             // Lines inside code fences: push with inside_code_fence = true
@@ -67,7 +72,12 @@ pub fn find_break_points(content: &str) -> Vec<BreakPoint> {
                 score: 1,
                 inside_code_fence: true,
             });
-            byte_offset += line.len() + if byte_offset + line.len() < content.len() { 1 } else { 0 };
+            byte_offset += line.len()
+                + if byte_offset + line.len() < content.len() {
+                    1
+                } else {
+                    0
+                };
             continue;
         } else if trimmed.starts_with("# ") && !trimmed.starts_with("## ") {
@@ -100,7 +110,12 @@ pub fn find_break_points(content: &str) -> Vec<BreakPoint> {
             });
         }
 
-        byte_offset += line.len() + if byte_offset + line.len() < content.len() { 1 } else { 0 };
+        byte_offset += line.len()
+            + if byte_offset + line.len() < content.len() {
+                1
+            } else {
+                0
+            };
     }
 
     break_points
@@ -128,17 +143,18 @@ fn is_list_item(trimmed: &str) -> bool {
     // Check for ordered list: digit(s) followed by `. ` or `) `
     let mut chars = trimmed.chars();
     if let Some(first) = chars.next()
-        && first.is_ascii_digit() {
-        for c in chars {
-            if c.is_ascii_digit() {
-                continue;
-            }
-            if c == '.' || c == ')' {
-                return true;
-            }
-            break;
+        && first.is_ascii_digit()
+    {
+        for c in chars {
+            if c.is_ascii_digit() {
+                continue;
             }
+            if c == '.' || c == ')' {
+                return true;
+            }
+            break;
         }
+    }
     false
 }
 
@@ -242,11 +258,7 @@ pub fn smart_chunk(content: &str, target_tokens: usize, overlap_pct: usize) -> V
             .rfind('\n')
             .map(|p| start_offset + p + 1)
         {
-            if nl > start_offset {
-                nl
-            } else {
-                cut
-            }
+            if nl > start_offset { nl } else { cut }
         } else {
             cut
         };
@@ -511,25 +523,49 @@ mod tests {
         let pairs: Vec<(usize, u32)> = bps.iter().map(|bp| (bp.line_number, bp.score)).collect();
 
         // # Title -> 100
-        assert!(pairs.contains(&(0, 100)), "Expected # heading at line 0 with score 100, got: {:?}", pairs);
+        assert!(
+            pairs.contains(&(0, 100)),
+            "Expected # heading at line 0 with score 100, got: {:?}",
+            pairs
+        );
         // empty line -> 20
-        assert!(pairs.contains(&(1, 20)), "Expected empty line at line 1 with score 20");
+        assert!(
+            pairs.contains(&(1, 20)),
+            "Expected empty line at line 1 with score 20"
+        );
         // empty line -> 20
-        assert!(pairs.contains(&(3, 20)), "Expected empty line at line 3 with score 20");
+        assert!(
+            pairs.contains(&(3, 20)),
+            "Expected empty line at line 3 with score 20"
+        );
         // ## Section -> 90
-        assert!(pairs.contains(&(4, 90)), "Expected ## heading at line 4 with score 90");
+        assert!(
+            pairs.contains(&(4, 90)),
+            "Expected ## heading at line 4 with score 90"
+        );
         // ### Sub -> 80
-        assert!(pairs.contains(&(6, 80)), "Expected ### heading at line 6 with score 80");
+        assert!(
+            pairs.contains(&(6, 80)),
+            "Expected ### heading at line 6 with score 80"
+        );
         // empty line -> 20
-        assert!(pairs.contains(&(8, 20)), "Expected empty line at line 8 with score 20");
+        assert!(
+            pairs.contains(&(8, 20)),
+            "Expected empty line at line 8 with score 20"
+        );
         // --- -> 60
-        assert!(pairs.contains(&(9, 60)), "Expected thematic break at line 9 with score 60");
+        assert!(
+            pairs.contains(&(9, 60)),
+            "Expected thematic break at line 9 with score 60"
+        );
 
         // "Some text", "Content", "More" have score 1 and should NOT appear
         // (only lines inside code fences get score 1 in results)
         for bp in &bps {
-            assert!(bp.score > 1 || bp.inside_code_fence,
-                "Non-fence break points should not include lines with score <= 1");
+            assert!(
+                bp.score > 1 || bp.inside_code_fence,
+                "Non-fence break points should not include lines with score <= 1"
+            );
         }
     }
 
@@ -541,20 +577,41 @@ mod tests {
         // The opening ``` should be a break point with score 80, NOT inside fence
         let opening = bps.iter().find(|bp| bp.line_number == 2).unwrap();
         assert_eq!(opening.score, 80);
-        assert!(!opening.inside_code_fence, "Opening fence should not be marked as inside");
+        assert!(
+            !opening.inside_code_fence,
+            "Opening fence should not be marked as inside"
+        );
 
         // The closing ``` should be a break point with score 80, NOT inside fence
         // (it toggles the fence off)
         let closing = bps.iter().find(|bp| bp.line_number == 5).unwrap();
         assert_eq!(closing.score, 80);
-        assert!(!closing.inside_code_fence, "Closing fence should not be marked as inside");
+        assert!(
+            !closing.inside_code_fence,
+            "Closing fence should not be marked as inside"
+        );
 
         // Lines inside the fence (let x = 1; let y = 2;) SHOULD appear with inside_code_fence = true
-        let inside_bps: Vec<&BreakPoint> = bps.iter().filter(|bp| bp.line_number == 3 || bp.line_number == 4).collect();
-        assert_eq!(inside_bps.len(), 2, "Expected 2 break points inside code fence");
+        let inside_bps: Vec<&BreakPoint> = bps
+            .iter()
+            .filter(|bp| bp.line_number == 3 || bp.line_number == 4)
+            .collect();
+        assert_eq!(
+            inside_bps.len(),
+            2,
+            "Expected 2 break points inside code fence"
+        );
         for bp in &inside_bps {
-            assert!(bp.inside_code_fence, "Line {} inside fence should have inside_code_fence=true", bp.line_number);
-            assert_eq!(bp.score, 1, "Line {} inside fence should have score 1", bp.line_number);
+            assert!(
+                bp.inside_code_fence,
+                "Line {} inside fence should have inside_code_fence=true",
+                bp.line_number
+            );
+            assert_eq!(
+                bp.score, 1,
+                "Line {} inside fence should have score 1",
+                bp.line_number
+            );
         }
     }
 
@@ -563,11 +620,23 @@ mod tests {
         let content = "- item one\n* item two\n1. numbered\nplain text\n";
         let bps = find_break_points(content);
         let pairs: Vec<(usize, u32)> = bps.iter().map(|bp| (bp.line_number, bp.score)).collect();
-        assert!(pairs.contains(&(0, 5)), "Expected list item at line 0 with score 5");
-        assert!(pairs.contains(&(1, 5)), "Expected list item at line 1 with score 5");
-        assert!(pairs.contains(&(2, 5)), "Expected numbered list item at line 2 with score 5");
+        assert!(
+            pairs.contains(&(0, 5)),
+            "Expected list item at line 0 with score 5"
+        );
+        assert!(
+            pairs.contains(&(1, 5)),
+            "Expected list item at line 1 with score 5"
+        );
+        assert!(
+            pairs.contains(&(2, 5)),
+            "Expected numbered list item at line 2 with score 5"
+        );
         // "plain text" has score 1, should NOT appear
-        assert!(!bps.iter().any(|bp| bp.line_number == 3), "Plain text should not be a break point");
+        assert!(
+            !bps.iter().any(|bp| bp.line_number == 3),
+            "Plain text should not be a break point"
+        );
     }
 
     // ── Smart chunk tests ────────────────────────────────────────────────
@@ -661,7 +730,12 @@ mod tests {
         // since total tokens < 512
         assert!(parsed.chunks.len() >= 1);
         // The content should all be present
-        let all_text: String = parsed.chunks.iter().map(|c| c.text.clone()).collect::<Vec<_>>().join(" ");
+        let all_text: String = parsed
+            .chunks
+            .iter()
+            .map(|c| c.text.clone())
+            .collect::<Vec<_>>()
+            .join(" ");
         assert!(all_text.contains("Content A"));
         assert!(all_text.contains("Content B"));
     }
@@ -693,7 +767,10 @@ mod tests {
         assert!(!parsed.chunks.is_empty());
         // At least one chunk should have a truncated snippet
         let has_truncated = parsed.chunks.iter().any(|c| c.snippet.ends_with("..."));
-        assert!(has_truncated, "Expected at least one snippet to be truncated");
+        assert!(
+            has_truncated,
+            "Expected at least one snippet to be truncated"
+        );
         // Verify truncation length
         for c in &parsed.chunks {
             if c.snippet.ends_with("...") {
@@ -787,10 +864,7 @@ mod tests {
             extract_heading("# Title\nBody text"),
             Some("# Title".to_string())
         );
-        assert_eq!(
-            extract_heading("## Sub\nBody"),
-            Some("## Sub".to_string())
-        );
+        assert_eq!(extract_heading("## Sub\nBody"), Some("## Sub".to_string()));
         assert_eq!(extract_heading("No heading here"), None);
         assert_eq!(
            extract_heading("Some text\n### Deep heading\nMore"),
```
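Several hunks above reflow the same `byte_offset` accumulation, which encodes one subtlety worth spelling out: `str::lines()` strips each trailing `\n`, so the offset must advance by the line length plus one newline byte, except for a final line that runs to the end of the input. A self-contained illustration of that bookkeeping (`line_offsets` is a hypothetical helper, not part of engraph):

```rust
/// Walk lines while tracking each line's starting byte offset.
/// `str::lines()` strips the trailing '\n', so we add it back when
/// advancing, except for a last line with no trailing newline.
fn line_offsets(content: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut byte_offset = 0usize;
    for line in content.lines() {
        out.push((byte_offset, line));
        byte_offset += line.len()
            + if byte_offset + line.len() < content.len() {
                1 // the '\n' that lines() stripped
            } else {
                0 // final line ends the input; nothing to add back
            };
    }
    out
}

fn main() {
    for (off, line) in line_offsets("alpha\nbeta\ngamma") {
        println!("{off:>3} {line}");
    }
}
```

Getting this wrong by one byte would shift every subsequent break point's offset, which is why the condition checks against `content.len()` rather than assuming a trailing newline.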
