Commit 46f4764

devwhodevsclaude committed
chore: bump to v0.2.0 — hybrid search, smart chunking, vault profiles
Version bump and integration for engraph v2.0:

- Smart chunking with break-point scoring (replaces heading-only splitting)
- 6-char docid system for quick file reference
- FTS5 full-text search lane (BM25 keyword matching)
- RRF fusion engine merging semantic + FTS5 results
- Vault profile auto-detection (PARA/Folders/Flat, Obsidian/Logseq)
- Pluggable ModelBackend trait for future model swapping
- All code formatted, clippy clean, 91 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dc01f9e commit 46f4764

10 files changed

Lines changed: 363 additions & 173 deletions

CLAUDE.md

Lines changed: 22 additions & 14 deletions
```diff
@@ -4,28 +4,35 @@ Local semantic search CLI for Obsidian vaults. Rust, MIT licensed.
 
 ## Architecture
 
-Single binary with 7 modules behind a lib crate:
-
-- `config.rs` — loads `~/.engraph/config.toml`, merges CLI args, provides `data_dir()`
-- `chunker.rs` — splits markdown by `##` headings, strips YAML frontmatter, extracts tags. `split_oversized_chunks()` handles token-aware sub-splitting with overlap
-- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization
-- `store.rs` — SQLite persistence. Tables: `meta`, `files`, `chunks` (with vector BLOBs), `tombstones`. Handles incremental diffing via content hashes
+Single binary with 11 modules behind a lib crate:
+
+- `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`
+- `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
+- `docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
+- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization. Implements `ModelBackend` trait
+- `model.rs` — pluggable `ModelBackend` trait, model registry, and `parse_model_spec()`. Enables future model swapping without changing consumer code
+- `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
+- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 results. Supports lane weighting and `--explain` output
+- `profile.rs` — vault profile detection. Auto-detects PARA/Folders/Flat structure, vault type (Obsidian/Logseq/Plain), wikilinks, frontmatter, tags. Writes/loads `vault.toml`
+- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5 virtual table), `tombstones`. Handles incremental diffing via content hashes
 - `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
-- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW
-- `search.rs` — embeds query, searches HNSW with tombstone filtering, formats results (human + JSON). Also handles `status` formatting
+- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW + FTS5
 
-`main.rs` is a thin clap CLI that wires the modules together.
+`main.rs` is a thin clap CLI that wires the modules together. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`.
 
 ## Key patterns
 
-- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new
+- **Hybrid search:** Queries run through two lanes — semantic (HNSW embeddings) and keyword (FTS5 BM25). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights
+- **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Chunks split at the highest-scored break point near the token target. Code fence protection prevents splitting inside code blocks
+- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new. FTS5 entries are cleaned up alongside vector entries
 - **HNSW rebuild on every run:** Vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
-- **Vector IDs:** Assigned sequentially, stored in both SQLite and HNSW. `next_vector_id` is derived from `MAX(vector_id)` in SQLite
-- **Tombstones:** Exist in the schema but are largely unused now that we rebuild HNSW each run. Kept for future use if switching to a vector store that supports deletion
+- **Docids:** Each file gets a deterministic 6-char hex ID (SHA-256 of relative path). Displayed in search results for quick reference
+- **Vault profiles:** `engraph init` auto-detects vault structure and writes `vault.toml`
+- **Pluggable models:** `ModelBackend` trait enables future model swapping. Current implementation uses ONNX all-MiniLM-L6-v2
 
 ## Data directory
 
-`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite), `hnsw/` (index files), `models/` (ONNX model + tokenizer).
+`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite with FTS5), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
 
 Single vault only. Re-indexing a different vault path triggers a confirmation prompt.
 
@@ -35,10 +42,11 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr
 - `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` used in `load()` to satisfy `'static` lifetime on the loaded index. Read-only after load
 - `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature
 - `ignore` (0.4) — vault walking with automatic `.gitignore` support
+- `rusqlite` (0.32) — bundled SQLite with FTS5 support
 
 ## Testing
 
-- Unit tests in each module (`cargo test --lib`) — 44 tests, no network required
+- Unit tests in each module (`cargo test --lib`) — 91 tests, no network required
 - 1 ignored smoke test (`test_embed_smoke`) — downloads ONNX model, verifies embedding
 - Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download. Use `tempfile` for isolated data dirs
```
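The lane merging that the new `fusion.rs` entry describes is standard Reciprocal Rank Fusion: each lane contributes `weight / (k + rank)` per document, with 1-based ranks and a damping constant `k` (commonly 60). The sketch below is illustrative only; the function name, signature, and docids are assumptions, not engraph's actual API.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over ranked lanes.
/// Each lane is (weight, ranked ids, best first); every id earns
/// weight / (k + rank) per lane, and ids found in several lanes
/// accumulate score across them.
fn rrf_fuse(lanes: &[(f64, Vec<&str>)], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, ranked_ids) in lanes {
        for (i, id) in ranked_ids.iter().enumerate() {
            // rank is 1-based: first result gets weight / (k + 1)
            *scores.entry(id.to_string()).or_insert(0.0) += weight / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Lane 1: semantic (HNSW) results; lane 2: keyword (FTS5 BM25) results.
    let semantic = vec!["a1b2c3", "d4e5f6", "0a0b0c"];
    let keyword = vec!["d4e5f6", "a1b2c3", "ffeedd"];
    let fused = rrf_fuse(&[(1.0, semantic), (1.0, keyword)], 60.0);
    // Docids appearing in both lanes outrank single-lane docids.
    for (id, score) in &fused {
        println!("{id} {score:.5}");
    }
}
```

Raising a lane's weight biases the fusion toward that lane without discarding the other, which is presumably what the configurable lane weights control.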

Cargo.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 [package]
 name = "engraph"
-version = "0.1.0"
+version = "0.2.0"
 edition = "2024"
 description = "Local semantic search for Obsidian vaults"
 license = "MIT"
```
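The break-point scoring summarized in CLAUDE.md's Key patterns (headings 50-100, code fences 80, thematic breaks 60, blank lines 20) can be sketched in a stripped-down form. The `BreakPoint` and helpers below are simplified stand-ins, assumed for illustration; the real `chunker.rs` also tracks byte offsets and code-fence state so it never splits inside a fence.

```rust
#[derive(Debug, Clone, Copy)]
struct BreakPoint {
    line_number: usize,
    score: u32,
}

/// Score a line as a candidate split point; stronger structural
/// boundaries get higher scores.
fn score_line(line: &str) -> u32 {
    let t = line.trim();
    if t.starts_with("# ") && !t.starts_with("## ") {
        100 // H1 heading
    } else if t.starts_with("## ") {
        90 // H2 heading
    } else if t.starts_with("```") {
        80 // code fence boundary
    } else if t == "---" || t == "***" {
        60 // thematic break
    } else if t.is_empty() {
        20 // blank line
    } else {
        1 // ordinary text: weakest candidate
    }
}

/// Pick the highest-scored break point within `window` lines of the
/// target line (the line nearest the token budget).
fn best_break(bps: &[BreakPoint], target: usize, window: usize) -> Option<BreakPoint> {
    bps.iter()
        .filter(|bp| bp.line_number.abs_diff(target) <= window)
        .max_by_key(|bp| bp.score)
        .copied()
}

fn main() {
    let doc = "# Title\nintro text\n\n## Section\nbody";
    let bps: Vec<BreakPoint> = doc
        .lines()
        .enumerate()
        .map(|(i, l)| BreakPoint { line_number: i, score: score_line(l) })
        .collect();
    if let Some(bp) = best_break(&bps, 2, 2) {
        println!("split at line {} (score {})", bp.line_number, bp.score);
    }
}
```

The design point is that the split position is chosen by boundary quality near the budget, not by the budget alone, so chunks end at headings or breaks instead of mid-paragraph.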

src/chunker.rs

Lines changed: 116 additions & 42 deletions
```diff
@@ -56,7 +56,12 @@ pub fn find_break_points(content: &str) -> Vec<BreakPoint> {
                 score: 80,
                 inside_code_fence: bp_inside,
             });
-            byte_offset += line.len() + if byte_offset + line.len() < content.len() { 1 } else { 0 };
+            byte_offset += line.len()
+                + if byte_offset + line.len() < content.len() {
+                    1
+                } else {
+                    0
+                };
             continue;
         } else if inside_code_fence {
             // Lines inside code fences: push with inside_code_fence = true
@@ -67,7 +72,12 @@ pub fn find_break_points(content: &str) -> Vec<BreakPoint> {
                 score: 1,
                 inside_code_fence: true,
             });
-            byte_offset += line.len() + if byte_offset + line.len() < content.len() { 1 } else { 0 };
+            byte_offset += line.len()
+                + if byte_offset + line.len() < content.len() {
+                    1
+                } else {
+                    0
+                };
             continue;
         } else if trimmed.starts_with("# ") && !trimmed.starts_with("## ") {
@@ -100,7 +110,12 @@ pub fn find_break_points(content: &str) -> Vec<BreakPoint> {
             });
         }
 
-        byte_offset += line.len() + if byte_offset + line.len() < content.len() { 1 } else { 0 };
+        byte_offset += line.len()
+            + if byte_offset + line.len() < content.len() {
+                1
+            } else {
+                0
+            };
     }
 
     break_points
@@ -128,17 +143,18 @@ fn is_list_item(trimmed: &str) -> bool {
     // Check for ordered list: digit(s) followed by `. ` or `) `
     let mut chars = trimmed.chars();
     if let Some(first) = chars.next()
-        && first.is_ascii_digit() {
-        for c in chars {
-            if c.is_ascii_digit() {
-                continue;
-            }
-            if c == '.' || c == ')' {
-                return true;
-            }
-            break;
+        && first.is_ascii_digit()
+    {
+        for c in chars {
+            if c.is_ascii_digit() {
+                continue;
             }
+            if c == '.' || c == ')' {
+                return true;
+            }
+            break;
         }
+    }
     false
 }
 
@@ -242,11 +258,7 @@ pub fn smart_chunk(content: &str, target_tokens: usize, overlap_pct: usize) -> V
             .rfind('\n')
             .map(|p| start_offset + p + 1)
         {
-            if nl > start_offset {
-                nl
-            } else {
-                cut
-            }
+            if nl > start_offset { nl } else { cut }
         } else {
             cut
         };
@@ -511,25 +523,49 @@ mod tests {
         let pairs: Vec<(usize, u32)> = bps.iter().map(|bp| (bp.line_number, bp.score)).collect();
 
         // # Title -> 100
-        assert!(pairs.contains(&(0, 100)), "Expected # heading at line 0 with score 100, got: {:?}", pairs);
+        assert!(
+            pairs.contains(&(0, 100)),
+            "Expected # heading at line 0 with score 100, got: {:?}",
+            pairs
+        );
         // empty line -> 20
-        assert!(pairs.contains(&(1, 20)), "Expected empty line at line 1 with score 20");
+        assert!(
+            pairs.contains(&(1, 20)),
+            "Expected empty line at line 1 with score 20"
+        );
         // empty line -> 20
-        assert!(pairs.contains(&(3, 20)), "Expected empty line at line 3 with score 20");
+        assert!(
+            pairs.contains(&(3, 20)),
+            "Expected empty line at line 3 with score 20"
+        );
         // ## Section -> 90
-        assert!(pairs.contains(&(4, 90)), "Expected ## heading at line 4 with score 90");
+        assert!(
+            pairs.contains(&(4, 90)),
+            "Expected ## heading at line 4 with score 90"
+        );
         // ### Sub -> 80
-        assert!(pairs.contains(&(6, 80)), "Expected ### heading at line 6 with score 80");
+        assert!(
+            pairs.contains(&(6, 80)),
+            "Expected ### heading at line 6 with score 80"
+        );
         // empty line -> 20
-        assert!(pairs.contains(&(8, 20)), "Expected empty line at line 8 with score 20");
+        assert!(
+            pairs.contains(&(8, 20)),
+            "Expected empty line at line 8 with score 20"
+        );
         // --- -> 60
-        assert!(pairs.contains(&(9, 60)), "Expected thematic break at line 9 with score 60");
+        assert!(
+            pairs.contains(&(9, 60)),
+            "Expected thematic break at line 9 with score 60"
+        );
 
         // "Some text", "Content", "More" have score 1 and should NOT appear
         // (only lines inside code fences get score 1 in results)
         for bp in &bps {
-            assert!(bp.score > 1 || bp.inside_code_fence,
-                "Non-fence break points should not include lines with score <= 1");
+            assert!(
+                bp.score > 1 || bp.inside_code_fence,
+                "Non-fence break points should not include lines with score <= 1"
+            );
         }
     }
 
@@ -541,20 +577,41 @@ mod tests {
         // The opening ``` should be a break point with score 80, NOT inside fence
         let opening = bps.iter().find(|bp| bp.line_number == 2).unwrap();
         assert_eq!(opening.score, 80);
-        assert!(!opening.inside_code_fence, "Opening fence should not be marked as inside");
+        assert!(
+            !opening.inside_code_fence,
+            "Opening fence should not be marked as inside"
+        );
 
         // The closing ``` should be a break point with score 80, NOT inside fence
         // (it toggles the fence off)
         let closing = bps.iter().find(|bp| bp.line_number == 5).unwrap();
         assert_eq!(closing.score, 80);
-        assert!(!closing.inside_code_fence, "Closing fence should not be marked as inside");
+        assert!(
+            !closing.inside_code_fence,
+            "Closing fence should not be marked as inside"
+        );
 
         // Lines inside the fence (let x = 1; let y = 2;) SHOULD appear with inside_code_fence = true
-        let inside_bps: Vec<&BreakPoint> = bps.iter().filter(|bp| bp.line_number == 3 || bp.line_number == 4).collect();
-        assert_eq!(inside_bps.len(), 2, "Expected 2 break points inside code fence");
+        let inside_bps: Vec<&BreakPoint> = bps
+            .iter()
+            .filter(|bp| bp.line_number == 3 || bp.line_number == 4)
+            .collect();
+        assert_eq!(
+            inside_bps.len(),
+            2,
+            "Expected 2 break points inside code fence"
+        );
         for bp in &inside_bps {
-            assert!(bp.inside_code_fence, "Line {} inside fence should have inside_code_fence=true", bp.line_number);
-            assert_eq!(bp.score, 1, "Line {} inside fence should have score 1", bp.line_number);
+            assert!(
+                bp.inside_code_fence,
+                "Line {} inside fence should have inside_code_fence=true",
+                bp.line_number
+            );
+            assert_eq!(
+                bp.score, 1,
+                "Line {} inside fence should have score 1",
+                bp.line_number
+            );
         }
     }
 
@@ -563,11 +620,23 @@ mod tests {
         let content = "- item one\n* item two\n1. numbered\nplain text\n";
         let bps = find_break_points(content);
         let pairs: Vec<(usize, u32)> = bps.iter().map(|bp| (bp.line_number, bp.score)).collect();
-        assert!(pairs.contains(&(0, 5)), "Expected list item at line 0 with score 5");
-        assert!(pairs.contains(&(1, 5)), "Expected list item at line 1 with score 5");
-        assert!(pairs.contains(&(2, 5)), "Expected numbered list item at line 2 with score 5");
+        assert!(
+            pairs.contains(&(0, 5)),
+            "Expected list item at line 0 with score 5"
+        );
+        assert!(
+            pairs.contains(&(1, 5)),
+            "Expected list item at line 1 with score 5"
+        );
+        assert!(
+            pairs.contains(&(2, 5)),
+            "Expected numbered list item at line 2 with score 5"
+        );
         // "plain text" has score 1, should NOT appear
-        assert!(!bps.iter().any(|bp| bp.line_number == 3), "Plain text should not be a break point");
+        assert!(
+            !bps.iter().any(|bp| bp.line_number == 3),
+            "Plain text should not be a break point"
+        );
     }
 
     // ── Smart chunk tests ────────────────────────────────────────────────
@@ -661,7 +730,12 @@ mod tests {
         // since total tokens < 512
         assert!(parsed.chunks.len() >= 1);
         // The content should all be present
-        let all_text: String = parsed.chunks.iter().map(|c| c.text.clone()).collect::<Vec<_>>().join(" ");
+        let all_text: String = parsed
+            .chunks
+            .iter()
+            .map(|c| c.text.clone())
+            .collect::<Vec<_>>()
+            .join(" ");
         assert!(all_text.contains("Content A"));
         assert!(all_text.contains("Content B"));
     }
@@ -693,7 +767,10 @@ mod tests {
         assert!(!parsed.chunks.is_empty());
         // At least one chunk should have a truncated snippet
         let has_truncated = parsed.chunks.iter().any(|c| c.snippet.ends_with("..."));
-        assert!(has_truncated, "Expected at least one snippet to be truncated");
+        assert!(
+            has_truncated,
+            "Expected at least one snippet to be truncated"
+        );
         // Verify truncation length
         for c in &parsed.chunks {
             if c.snippet.ends_with("...") {
@@ -787,10 +864,7 @@ mod tests {
             extract_heading("# Title\nBody text"),
             Some("# Title".to_string())
         );
-        assert_eq!(
-            extract_heading("## Sub\nBody"),
-            Some("## Sub".to_string())
-        );
+        assert_eq!(extract_heading("## Sub\nBody"), Some("## Sub".to_string()));
         assert_eq!(extract_heading("No heading here"), None);
         assert_eq!(
            extract_heading("Some text\n### Deep heading\nMore"),
```
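Several hunks above reflow the same `byte_offset` accumulation, which encodes one subtlety worth spelling out: `str::lines()` strips each trailing `\n`, so the offset must advance by the line length plus one newline byte, except for a final line that runs to the end of the input. A self-contained illustration of that bookkeeping (`line_offsets` is a hypothetical helper, not part of engraph):

```rust
/// Walk lines while tracking each line's starting byte offset.
/// `str::lines()` strips the trailing '\n', so we add it back when
/// advancing, except for a last line with no trailing newline.
fn line_offsets(content: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut byte_offset = 0usize;
    for line in content.lines() {
        out.push((byte_offset, line));
        byte_offset += line.len()
            + if byte_offset + line.len() < content.len() {
                1 // the '\n' that lines() stripped
            } else {
                0 // final line ends the input; nothing to add back
            };
    }
    out
}

fn main() {
    for (off, line) in line_offsets("alpha\nbeta\ngamma") {
        println!("{off:>3} {line}");
    }
}
```

Getting this wrong by one byte would shift every subsequent break point's offset, which is why the condition checks against `content.len()` rather than assuming a trailing newline.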
