- `chunker.rs` — splits markdown by `##` headings, strips YAML frontmatter, extracts tags (see the sketch below). `split_oversized_chunks()` handles token-aware sub-splitting with overlap
- `embedder.rs` — downloads and runs `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization
- `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW
- `search.rs` — embeds query, searches HNSW with tombstone filtering, formats results (human + JSON). Also handles `status` formatting

`main.rs` is a thin clap CLI that wires the modules together.
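
A minimal sketch of the heading-based split, with hypothetical names (`Chunk`, `split_by_headings`); the real `chunker.rs` also strips frontmatter, extracts tags, and sub-splits oversized chunks:

```rust
/// Hypothetical chunk type; the real one carries more metadata.
struct Chunk {
    heading: Option<String>,
    body: String,
}

/// Split a markdown document at `##` heading boundaries (sketch).
fn split_by_headings(markdown: &str) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut current = Chunk { heading: None, body: String::new() };
    for line in markdown.lines() {
        if line.starts_with("## ") {
            // Close the previous chunk and open a new one at each `##` heading.
            if current.heading.is_some() || !current.body.trim().is_empty() {
                chunks.push(current);
            }
            current = Chunk {
                heading: Some(line.trim_start_matches('#').trim().to_string()),
                body: String::new(),
            };
        } else {
            current.body.push_str(line);
            current.body.push('\n');
        }
    }
    if current.heading.is_some() || !current.body.trim().is_empty() {
        chunks.push(current);
    }
    chunks
}
```
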
## Key patterns
- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new
- **HNSW rebuild on every run:** Vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
- **Vector IDs:** Assigned sequentially, stored in both SQLite and HNSW. `next_vector_id` is derived from `MAX(vector_id)` in SQLite (see the sketch after this list)
- **Tombstones:** Exist in the schema but are largely unused now that we rebuild HNSW each run. Kept for future use if switching to a vector store that supports deletion
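
A sketch of the rebuild-on-every-run and vector-ID patterns, using `rusqlite` for the SQLite side; the `HnswIndex` wrapper, the `vector` column name, and the signatures here are assumptions, not the project's actual `store`/`hnsw.rs` API:

```rust
use rusqlite::Connection;

/// Hypothetical stand-in for the hnsw.rs wrapper around hnsw_rs.
struct HnswIndex;
impl HnswIndex {
    fn new() -> Self { HnswIndex }
    fn insert(&mut self, _vector_id: usize, _vector: &[f32]) { /* ... */ }
}

/// Derive the next vector ID from MAX(vector_id), as described above.
fn next_vector_id(conn: &Connection) -> rusqlite::Result<usize> {
    let max: Option<i64> =
        conn.query_row("SELECT MAX(vector_id) FROM chunks", [], |row| row.get(0))?;
    Ok(max.map_or(0, |m| m as usize + 1))
}

/// Full rebuild: stream every stored vector out of SQLite and insert it
/// into a fresh index, since hnsw_rs cannot append after load.
fn rebuild_index(conn: &Connection) -> rusqlite::Result<HnswIndex> {
    let mut index = HnswIndex::new();
    let mut stmt = conn.prepare("SELECT vector_id, vector FROM chunks")?;
    let rows = stmt.query_map([], |row| {
        let id: i64 = row.get(0)?;
        let blob: Vec<u8> = row.get(1)?;
        Ok((id, blob))
    })?;
    for row in rows {
        let (id, blob) = row?;
        // Vectors stored as BLOBs of little-endian f32s (assumed layout).
        let vector: Vec<f32> = blob
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect();
        index.insert(id as usize, &vector);
    }
    Ok(index)
}
```
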
## Data directory
`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite), `hnsw/` (index files), `models/` (ONNX model + tokenizer).
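
A minimal sketch of what that resolution amounts to (the real `Config::data_dir()` may differ in details such as error handling):

```rust
use std::path::PathBuf;

/// Resolve ~/.engraph via dirs::home_dir(), per the description above.
fn data_dir() -> Option<PathBuf> {
    dirs::home_dir().map(|home| home.join(".engraph"))
}
```
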
Single vault only. Re-indexing a different vault path triggers a confirmation prompt.
## Dependencies to be aware of
- `ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. `Session::builder()?.commit_from_file()` pattern (sketched below). Does not provide prebuilt binaries for all targets (no x86_64-apple-darwin)
- `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` used in `load()` to satisfy `'static` lifetime on the loaded index. Read-only after load
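
The `ort` loading pattern named above, sketched against the 2.0 pre-release API (path handling and error plumbing here are illustrative):

```rust
use std::path::Path;
use ort::session::Session;

/// Load the ONNX model using ort's builder + commit_from_file pattern.
fn load_session(model_path: &Path) -> ort::Result<Session> {
    Session::builder()?.commit_from_file(model_path)
}
```
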
---

Local semantic search for Obsidian vaults. Runs entirely offline — no API keys, no cloud services.

engraph indexes your markdown notes into a local vector database and lets you search by meaning, not just keywords. It uses a small ONNX model (`all-MiniLM-L6-v2`, ~23MB) that runs on your machine.
## Install
**Homebrew:**

```bash
brew install devwhodevs/tap/engraph
```
**Pre-built binaries:**
Download from [Releases](https://github.com/devwhodevs/engraph/releases) (macOS arm64, Linux x86_64).
## Quick start

```bash
# Index your vault (downloads the embedding model on first run, ~23MB)
engraph index ~/path/to/vault

# Search
engraph search "how does error handling work in Rust"

# Check what's indexed
engraph status

# Re-index after changes
engraph index ~/path/to/vault

# Full rebuild (discard incremental state)
engraph index ~/path/to/vault --rebuild

# JSON output (for scripts/tools)
engraph search "query" --json
engraph status --json

# Clear index data (keeps downloaded model)
engraph clear

# Clear everything including model
engraph clear --all
```
## How it works
1. **Walk** the vault collecting `.md` files (respects `.gitignore` and exclude patterns)
2. **Chunk** each file by `##` heading boundaries. Oversized chunks are sub-split at sentence boundaries with token overlap
3. **Embed** chunks locally using `all-MiniLM-L6-v2` via ONNX Runtime (384-dim vectors)
4. **Store** vectors and metadata in SQLite (`~/.engraph/engraph.db`)
5. **Build** an HNSW index for fast approximate nearest-neighbor search

Re-indexing is incremental — only new or modified files are re-embedded. The HNSW index is rebuilt from stored vectors each run (necessary because `hnsw_rs` doesn't support append-after-load).
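
The incremental check can be pictured as a content-hash comparison. A sketch using the `sha2` and `hex` crates; `stored_hashes` is a hypothetical stand-in for the hashes engraph keeps in SQLite:

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use sha2::{Digest, Sha256};

/// Decide which files need re-embedding by comparing on-disk content
/// hashes against the hashes recorded at the previous index run.
fn changed_files(
    on_disk: &[(PathBuf, Vec<u8>)],           // (path, file contents)
    stored_hashes: &HashMap<PathBuf, String>, // from the previous run
) -> Vec<PathBuf> {
    on_disk
        .iter()
        .filter(|(path, contents)| {
            let hash = hex::encode(Sha256::digest(contents));
            stored_hashes.get(path) != Some(&hash)
        })
        .map(|(path, _)| path.clone())
        .collect()
}
```
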
## Data directory
Everything is stored in `~/.engraph/`:

- `engraph.db` (SQLite database)
- `hnsw/` (HNSW index files)
- `models/` (ONNX model + tokenizer)