Commit 634616e

Merge pull request #2 from devwhodevs/feature/v2.1
feat: engraph v0.3.0 — vault graph and graph search agent
2 parents 71a68e1 + 88bda95 commit 634616e

9 files changed

Lines changed: 1419 additions & 36 deletions

CLAUDE.md

Lines changed: 22 additions & 20 deletions
@@ -1,59 +1,62 @@
 # engraph

-Local semantic search CLI for Obsidian vaults. Rust, MIT licensed.
+Local hybrid search CLI for Obsidian vaults. Rust, MIT licensed.

 ## Architecture

-Single binary with 11 modules behind a lib crate:
+Single binary with 12 modules behind a lib crate:

 - `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`
 - `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
 - `docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
 - `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization. Implements `ModelBackend` trait
 - `model.rs` — pluggable `ModelBackend` trait, model registry, and `parse_model_spec()`. Enables future model swapping without changing consumer code
 - `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
-- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 results. Supports lane weighting and `--explain` output
+- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 + graph results. Supports lane weighting, `--explain` output with per-lane detail
+- `graph.rs` — vault graph agent. Extracts wikilink targets, expands search results by following graph connections 1-2 hops. Relevance filtering via FTS5 term check and shared tags
 - `profile.rs` — vault profile detection. Auto-detects PARA/Folders/Flat structure, vault type (Obsidian/Logseq/Plain), wikilinks, frontmatter, tags. Writes/loads `vault.toml`
-- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5 virtual table), `tombstones`. Handles incremental diffing via content hashes
+- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5), `edges` (vault graph), `tombstones`. Handles incremental diffing via content hashes
 - `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
-- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW + FTS5
+- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), serial writes to store + HNSW + FTS5, and vault graph edge building (wikilinks + people detection)

-`main.rs` is a thin clap CLI that wires the modules together. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`.
+`main.rs` is a thin clap CLI. Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`, `graph` (show/stats).

 ## Key patterns

-- **Hybrid search:** Queries run through two lanes — semantic (HNSW embeddings) and keyword (FTS5 BM25). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights
-- **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Chunks split at the highest-scored break point near the token target. Code fence protection prevents splitting inside code blocks
-- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new. FTS5 entries are cleaned up alongside vector entries
-- **HNSW rebuild on every run:** Vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
-- **Docids:** Each file gets a deterministic 6-char hex ID (SHA-256 of relative path). Displayed in search results for quick reference
+- **3-lane hybrid search:** Queries run through three lanes — semantic (HNSW embeddings), keyword (FTS5 BM25), and graph (wikilink expansion). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights (semantic 1.0, FTS 1.0, graph 0.8)
+- **Vault graph:** `edges` table stores bidirectional wikilink edges and mention edges. Built during indexing after all files are written. People detection scans for person name/alias mentions using notes from the configured People folder
+- **Graph agent:** Expands seed results by following wikilinks 1-2 hops. Decay: 0.8× for 1-hop, 0.5× for 2-hop. Relevance filter: must contain query term (FTS5) or share tags with seed. Multi-parent merge takes highest score
+- **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Code fence protection prevents splitting inside code blocks
+- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks and edges deleted, then are re-processed. FTS5 entries cleaned up alongside vector entries
+- **HNSW rebuild on every run:** Vectors stored as BLOBs. Full HNSW index rebuilt from `store.get_all_vectors()` after SQLite update (hnsw_rs limitation)
+- **Docids:** Each file gets a deterministic 6-char hex ID. Displayed in search results
 - **Vault profiles:** `engraph init` auto-detects vault structure and writes `vault.toml`
-- **Pluggable models:** `ModelBackend` trait enables future model swapping. Current implementation uses ONNX all-MiniLM-L6-v2
+- **Pluggable models:** `ModelBackend` trait enables future model swapping

 ## Data directory

-`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite with FTS5), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
+`~/.engraph/` — hardcoded via `Config::data_dir()`. Contains `engraph.db` (SQLite with FTS5 + edges), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).

 Single vault only. Re-indexing a different vault path triggers a confirmation prompt.

 ## Dependencies to be aware of

-- `ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. `Session::builder()?.commit_from_file()` pattern. Does not provide prebuilt binaries for all targets (no x86_64-apple-darwin)
-- `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` used in `load()` to satisfy `'static` lifetime on the loaded index. Read-only after load
+- `ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. Does not provide prebuilt binaries for all targets
+- `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` in `load()`. Read-only after load
 - `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature
-- `ignore` (0.4) — vault walking with automatic `.gitignore` support
+- `ignore` (0.4) — vault walking with `.gitignore` support
 - `rusqlite` (0.32) — bundled SQLite with FTS5 support

 ## Testing

-- Unit tests in each module (`cargo test --lib`) — 91 tests, no network required
+- Unit tests in each module (`cargo test --lib`) — 119 tests, no network required
 - 1 ignored smoke test (`test_embed_smoke`) — downloads ONNX model, verifies embedding
-- Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download. Use `tempfile` for isolated data dirs
+- Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download

 ## CI/CD

 - CI: `cargo fmt --check` + `cargo clippy -- -D warnings` + `cargo test --lib` on macOS + Ubuntu
-- Release: native builds on macOS arm64 (macos-14) + Linux x86_64 (ubuntu-latest). Triggered by `v*` tags. No x86_64 macOS build (ort-sys limitation)
+- Release: native builds on macOS arm64 (macos-14) + Linux x86_64 (ubuntu-latest). Triggered by `v*` tags
 - Homebrew: `devwhodevs/homebrew-tap` — formula builds from source tarball

 ## Common tasks
@@ -70,5 +73,4 @@ cargo build --release

 # Release: tag and push
 git tag v0.x.y && git push origin v0.x.y
-# Then update homebrew-tap formula with new SHA256
 ```
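
The "Graph agent" bullet in the CLAUDE.md diff above describes hop decay and multi-parent merging. A minimal sketch of that idea follows; it is not the code in `graph.rs`, which this commit page does not show. The `Edges` map, `expand_seeds`, `offer`, and the `is_relevant` closure are hypothetical stand-ins for the `edges` table and the FTS5/shared-tag filter. The sketch also assumes that the 2-hop factor applies directly to the seed score (seed × 0.5), that seeds are not re-added as expansions, and that a 2-hop walk may pass through a filtered-out 1-hop note.

```rust
use std::collections::HashMap;

// Decay factors quoted in CLAUDE.md: 0.8x for 1-hop, 0.5x for 2-hop.
const DECAY_1_HOP: f64 = 0.8;
const DECAY_2_HOP: f64 = 0.5;

/// Hypothetical in-memory stand-in for the `edges` table: file id -> linked file ids.
type Edges = HashMap<i64, Vec<i64>>;

/// Neighbours of `id`, or an empty slice when the note has no outgoing edges.
fn neighbors(edges: &Edges, id: i64) -> &[i64] {
    edges.get(&id).map(|v| v.as_slice()).unwrap_or(&[])
}

/// Record `score` for `id` unless it is a seed or fails the relevance filter.
/// When several parents reach the same note, the highest score is kept.
fn offer(
    best: &mut HashMap<i64, f64>,
    seed_ids: &[i64],
    is_relevant: &impl Fn(i64) -> bool,
    id: i64,
    score: f64,
) {
    if seed_ids.contains(&id) || !is_relevant(id) {
        return;
    }
    let entry = best.entry(id).or_insert(score);
    if score > *entry {
        *entry = score;
    }
}

/// Expand `(file_id, score)` seeds by one and two hops.
/// Returns file id -> best decayed score over all parents.
fn expand_seeds(
    seeds: &[(i64, f64)],
    edges: &Edges,
    is_relevant: impl Fn(i64) -> bool,
) -> HashMap<i64, f64> {
    let seed_ids: Vec<i64> = seeds.iter().map(|(id, _)| *id).collect();
    let mut best = HashMap::new();
    for &(seed, score) in seeds {
        for &n1 in neighbors(edges, seed) {
            offer(&mut best, &seed_ids, &is_relevant, n1, score * DECAY_1_HOP);
            for &n2 in neighbors(edges, n1) {
                offer(&mut best, &seed_ids, &is_relevant, n2, score * DECAY_2_HOP);
            }
        }
    }
    best
}
```

Taking the maximum rather than the sum keeps a heavily linked hub note from outranking the seeds purely because many seeds point at it.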

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "engraph"
-version = "0.2.0"
+version = "0.3.0"
 edition = "2024"
 description = "Local semantic search for Obsidian vaults"
 license = "MIT"

src/fusion.rs

Lines changed: 14 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
/// rrf_score = sum( weight_i / (k + rank_i) )
77
///
88
/// A ranked result from a single search lane.
9+
#[derive(Clone)]
910
pub struct RankedResult {
1011
pub file_path: String,
1112
pub file_id: i64,
@@ -32,6 +33,7 @@ pub struct LaneContribution {
3233
pub rank: usize,
3334
pub raw_score: f64,
3435
pub weighted_contribution: f64,
36+
pub detail: Option<String>, // e.g., "1-hop from BRE-2579"
3537
}
3638

3739
use std::collections::HashMap;
@@ -93,6 +95,7 @@ pub fn rrf_fuse(lanes: &[(&str, &[RankedResult], f64)], k: usize) -> Vec<FusedRe
9395
rank,
9496
raw_score: r.score,
9597
weighted_contribution: contribution,
98+
detail: None,
9699
});
97100
}
98101
}
@@ -124,10 +127,15 @@ pub fn rrf_fuse(lanes: &[(&str, &[RankedResult], f64)], k: usize) -> Vec<FusedRe
124127
pub fn format_explain(result: &FusedResult) -> String {
125128
let mut out = format!(" RRF: {:.4}\n", result.rrf_score);
126129
for lc in &result.lane_contributions {
127-
out.push_str(&format!(
128-
" {}: rank #{}, raw {:.2}, +{:.4}\n",
129-
lc.lane_name, lc.rank, lc.raw_score, lc.weighted_contribution,
130-
));
130+
let detail_str = lc
131+
.detail
132+
.as_deref()
133+
.map(|d| format!(" ({})", d))
134+
.unwrap_or_default();
135+
out += &format!(
136+
" {}: rank #{}, raw {:.2}{}, +{:.4}\n",
137+
lc.lane_name, lc.rank, lc.raw_score, detail_str, lc.weighted_contribution
138+
);
131139
}
132140
out
133141
}
@@ -234,12 +242,14 @@ mod tests {
234242
rank: 1,
235243
raw_score: 0.87,
236244
weighted_contribution: 0.0164,
245+
detail: None,
237246
},
238247
LaneContribution {
239248
lane_name: "fts".to_string(),
240249
rank: 3,
241250
raw_score: 5.23,
242251
weighted_contribution: 0.0159,
252+
detail: None,
243253
},
244254
],
245255
};
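
The doc comment in `src/fusion.rs` gives the fusion formula, `rrf_score = sum( weight_i / (k + rank_i) )`, and CLAUDE.md gives the lane weights (semantic 1.0, FTS 1.0, graph 0.8). The sketch below illustrates the formula with three lanes. It is a simplified stand-in, not the repository's `rrf_fuse`: the function name `rrf_scores` is hypothetical, and lanes carry plain string ids instead of `RankedResult`. Ranks are 1-based and `k = 60` is used because those values reproduce the test fixture above (1/61 ≈ 0.0164 at rank 1, 1/63 ≈ 0.0159 at rank 3). The page does not show where `k` is actually configured.

```rust
use std::collections::HashMap;

/// Weighted Reciprocal Rank Fusion over several lanes.
/// Each lane is (name, ids in rank order with the best first, weight),
/// mirroring the tuple order of `rrf_fuse` in the diff.
/// Returns (id, score) pairs sorted by descending score, ties broken by id.
fn rrf_scores(lanes: &[(&str, &[&str], f64)], k: usize) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (_name, ranked_ids, weight) in lanes {
        for (idx, id) in ranked_ids.iter().enumerate() {
            let rank = idx + 1; // ranks are 1-based
            let contribution = *weight / (k + rank) as f64;
            *scores.entry(id.to_string()).or_insert(0.0) += contribution;
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With lanes `semantic = [a, b]`, `fts = [b, c]`, and `graph = [c]`, note `b` ranks first because two full-weight lanes agree on it, while `c` gets a smaller boost from the graph lane at weight 0.8.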
