Commit f2731af

docs: rewrite README and add CLAUDE.md
1 parent 9a4ff2b commit f2731af

2 files changed: 171 additions & 39 deletions

CLAUDE.md

Lines changed: 66 additions & 0 deletions
# engraph

Local semantic search CLI for Obsidian vaults. Rust, MIT licensed.

## Architecture

Single binary with 7 modules behind a lib crate:

- `config.rs` — loads `~/.engraph/config.toml`, merges CLI args, provides `data_dir()`
- `chunker.rs` — splits markdown by `##` headings, strips YAML frontmatter, extracts tags. `split_oversized_chunks()` handles token-aware sub-splitting with overlap
- `embedder.rs` — downloads and runs the `all-MiniLM-L6-v2` ONNX model (384-dim). SHA256-verified on download. Uses `ort` for inference, `tokenizers` for tokenization
- `store.rs` — SQLite persistence. Tables: `meta`, `files`, `chunks` (with vector BLOBs), `tombstones`. Handles incremental diffing via content hashes
- `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
- `indexer.rs` — orchestrates vault walking (via the `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), and serial writes to store + HNSW
- `search.rs` — embeds the query, searches HNSW with tombstone filtering, formats results (human + JSON). Also handles `status` formatting

`main.rs` is a thin clap CLI that wires the modules together.

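The heading-based split in `chunker.rs` can be sketched with plain std. This is a simplified illustration, not the real module's code — the function name here is hypothetical, and the actual chunker also strips frontmatter, extracts tags, and sub-splits oversized chunks:

```rust
/// Split markdown into chunks at `##` heading boundaries.
/// Simplified sketch: the real chunker also strips YAML frontmatter,
/// extracts tags, and sub-splits oversized chunks with token overlap.
fn split_by_headings(markdown: &str) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in markdown.lines() {
        // A new `##` heading starts a new chunk (`###` and deeper do not,
        // because their third character is `#`, not a space).
        if line.starts_with("## ") && !current.trim().is_empty() {
            chunks.push(current.trim_end().to_string());
            current = String::new();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim_end().to_string());
    }
    chunks
}

fn main() {
    let doc = "# Title\nintro\n\n## First\nbody one\n\n## Second\nbody two\n";
    for chunk in split_by_headings(doc) {
        println!("--- chunk ---\n{chunk}");
    }
}
```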
## Key patterns

- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks deleted (cascade), then are re-embedded as new
- **HNSW rebuild on every run:** vectors are stored as BLOBs in the `chunks` table. After SQLite is updated, the full HNSW index is rebuilt from `store.get_all_vectors()`. This is necessary because `hnsw_rs` doesn't support append-after-load
- **Vector IDs:** assigned sequentially, stored in both SQLite and HNSW. `next_vector_id` is derived from `MAX(vector_id)` in SQLite
- **Tombstones:** exist in the schema but are largely unused now that HNSW is rebuilt each run. Kept for future use if switching to a vector store that supports deletion
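The `diff_vault()` pattern — comparing stored content hashes against what's on disk — can be sketched like this. A hypothetical simplification: std's `DefaultHasher` stands in for the real content hash, and in-memory maps stand in for SQLite:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn content_hash(text: &str) -> u64 {
    // Stand-in for the real content hash; only equality matters for diffing.
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

/// Returns (changed_or_new, deleted) paths, mirroring the diff step:
/// changed files get their old chunks deleted, then are re-embedded.
fn diff_vault(
    stored: &HashMap<String, u64>,     // path -> hash recorded at last index
    on_disk: &HashMap<String, String>, // path -> current file content
) -> (Vec<String>, Vec<String>) {
    let mut changed = Vec::new();
    for (path, text) in on_disk {
        if stored.get(path) != Some(&content_hash(text)) {
            changed.push(path.clone());
        }
    }
    let deleted = stored
        .keys()
        .filter(|p| !on_disk.contains_key(*p))
        .cloned()
        .collect();
    (changed, deleted)
}

fn main() {
    let mut stored = HashMap::new();
    stored.insert("a.md".to_string(), content_hash("old"));
    stored.insert("gone.md".to_string(), content_hash("x"));
    let mut disk = HashMap::new();
    disk.insert("a.md".to_string(), "new".to_string()); // modified
    disk.insert("b.md".to_string(), "fresh".to_string()); // new
    let (changed, deleted) = diff_vault(&stored, &disk);
    println!("changed: {changed:?}, deleted: {deleted:?}");
}
```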

## Data directory

`~/.engraph/` — hardcoded via `Config::data_dir()` (uses `dirs::home_dir()`). Contains `engraph.db` (SQLite), `hnsw/` (index files), `models/` (ONNX model + tokenizer).

Single vault only. Re-indexing a different vault path triggers a confirmation prompt.

## Dependencies to be aware of

- `ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. `Session::builder()?.commit_from_file()` pattern. Does not provide prebuilt binaries for all targets (no x86_64-apple-darwin)
- `hnsw_rs` (0.3) — pure-Rust HNSW. `Box::leak` is used in `load()` to satisfy the `'static` lifetime on the loaded index. Read-only after load
- `tokenizers` (0.22) — HuggingFace tokenizer. Needs the `fancy-regex` feature
- `ignore` (0.4) — vault walking with automatic `.gitignore` support

## Testing

- Unit tests in each module (`cargo test --lib`) — 44 tests, no network required
- 1 ignored smoke test (`test_embed_smoke`) — downloads the ONNX model, verifies embedding
- Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require the model download. Use `tempfile` for isolated data dirs

## CI/CD

- CI: `cargo fmt --check` + `cargo clippy -- -D warnings` + `cargo test --lib` on macOS + Ubuntu
- Release: native builds on macOS arm64 (macos-14) + Linux x86_64 (ubuntu-latest). Triggered by `v*` tags. No x86_64 macOS build (ort-sys limitation)
- Homebrew: `devwhodevs/homebrew-tap` — formula builds from the source tarball

## Common tasks

```bash
# Run tests
cargo test --lib

# Run integration tests (downloads model)
cargo test --test integration -- --ignored

# Build release
cargo build --release

# Release: tag and push
git tag v0.x.y && git push origin v0.x.y
# Then update the homebrew-tap formula with the new SHA256
```

README.md

Lines changed: 105 additions & 39 deletions
# engraph

Local semantic search for Obsidian vaults. Runs entirely offline — no API keys, no cloud services.

engraph indexes your markdown notes into a local vector database and lets you search by meaning, not just keywords. It uses a small ONNX model (`all-MiniLM-L6-v2`, ~23MB) that runs on your machine.

## Install

**Homebrew:**

```bash
brew install devwhodevs/tap/engraph
```

**Pre-built binaries:**

Download from [Releases](https://github.com/devwhodevs/engraph/releases) (macOS arm64, Linux x86_64).

**From source:**

```bash
cargo install --git https://github.com/devwhodevs/engraph
```

## Usage

```bash
# Index your vault (downloads the embedding model on first run, ~23MB)
engraph index ~/path/to/vault

# Search
engraph search "how does error handling work in Rust"

# Check what's indexed
engraph status

# Re-index after changes
engraph index ~/path/to/vault

# Full rebuild (discard incremental state)
engraph index ~/path/to/vault --rebuild

# JSON output (for scripts/tools)
engraph search "query" --json
engraph status --json

# Clear index data (keeps downloaded model)
engraph clear

# Clear everything including model
engraph clear --all
```

## How it works

1. **Walk** the vault collecting `.md` files (respects `.gitignore` and exclude patterns)
2. **Chunk** each file by `##` heading boundaries. Oversized chunks are sub-split at sentence boundaries with token overlap
3. **Embed** chunks locally using `all-MiniLM-L6-v2` via ONNX Runtime (384-dim vectors)
4. **Store** vectors and metadata in SQLite (`~/.engraph/engraph.db`)
5. **Build** an HNSW index for fast approximate nearest-neighbor search

Re-indexing is incremental — only new or modified files are re-embedded. The HNSW index is rebuilt from stored vectors each run (necessary because `hnsw_rs` doesn't support append-after-load).
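Conceptually, the search step scores the query vector against every stored chunk vector and keeps the best matches; HNSW returns approximately the same result in sub-linear time. A brute-force cosine-similarity sketch (everything here is a simplified stand-in, not engraph's actual code):

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Brute-force top-n: what the HNSW index approximates much faster.
fn top_n(query: &[f32], chunks: &[(String, Vec<f32>)], n: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = chunks
        .iter()
        .map(|(id, v)| (id.clone(), cosine(query, v)))
        .collect();
    // Sort by descending similarity, then keep the n best.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(n);
    scored
}

fn main() {
    let chunks = vec![
        ("rust-tips.md".to_string(), vec![0.9, 0.1, 0.0]),
        ("daily-note.md".to_string(), vec![0.1, 0.9, 0.0]),
    ];
    // A query vector close to the first chunk ranks it highest.
    let results = top_n(&[1.0, 0.0, 0.0], &chunks, 1);
    println!("{results:?}");
}
```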

## Search output

```
1. [0.87] 02-Areas/Development/Rust Tips.md > ## Error Handling
   Use thiserror for library errors and anyhow for application errors...

2. [0.82] 03-Resources/Code-Snippets/WASM Setup.md
   Setting up wasm-pack with Rust requires...

3. [0.74] 07-Daily/2026-03-15.md > ## Notes
   Looked into embedding models for local inference...
```

## Commands

| Command | Description | Options |
|---------|-------------|---------|
| `engraph index [PATH]` | Index a vault (default: current dir) | `--rebuild`: force a full rebuild |
| `engraph search <QUERY>` | Semantic search | `-n <N>`: number of results (default: 5) |
| `engraph status` | Show index stats | |
| `engraph clear` | Delete index (keeps model) | `--all`: delete everything |

Global flags: `--json` for machine-readable output, `--verbose` for debug logging.


## Configuration

Optional config file at `~/.engraph/config.toml`:

```toml
vault_path = "~/Documents/MyVault"
top_n = 5
exclude = [".obsidian/", "node_modules/"]
batch_size = 64
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `vault_path` | string | current dir | Default vault path |
| `top_n` | integer | `5` | Number of search results |
| `exclude` | string[] | `[".obsidian/"]` | Patterns to exclude from indexing |
| `batch_size` | integer | `64` | Embedding batch size |
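Each setting resolves as CLI flag over config file over built-in default. A hypothetical sketch of that fallback chain (field and function names are illustrative; the real `config.rs` may structure this differently):

```rust
// Illustrative config-resolution sketch: CLI beats file beats default.
// Names mirror the table above; not engraph's actual types.
struct FileConfig {
    top_n: Option<usize>,
    batch_size: Option<usize>,
}

struct CliArgs {
    top_n: Option<usize>,
}

fn resolve_top_n(cli: &CliArgs, file: &FileConfig) -> usize {
    // First Some() in the chain wins; 5 is the documented default.
    cli.top_n.or(file.top_n).unwrap_or(5)
}

fn resolve_batch_size(file: &FileConfig) -> usize {
    file.batch_size.unwrap_or(64) // documented default: 64
}

fn main() {
    let file = FileConfig { top_n: Some(10), batch_size: None };
    let cli = CliArgs { top_n: None };
    println!("top_n = {}", resolve_top_n(&cli, &file)); // file value wins
    println!("batch_size = {}", resolve_batch_size(&file)); // falls to default
}
```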

## Data directory

Everything is stored in `~/.engraph/`:

```
~/.engraph/
  engraph.db    # SQLite: file metadata, chunks, vectors
  hnsw/         # HNSW index files
  models/       # Downloaded ONNX model + tokenizer
  config.toml   # Optional configuration
```


## Development

```bash
# Run all unit tests
cargo test --lib

# Run integration tests (requires ~23MB model download)
cargo test --test integration -- --ignored

# Lint
cargo fmt --check
cargo clippy -- -D warnings
```

## Architecture

```
src/
  main.rs      # CLI entry point (clap)
  lib.rs       # Public module re-exports
  config.rs    # Config loading and merging
  chunker.rs   # Markdown parsing, heading-based chunking
  embedder.rs  # ONNX model download + inference
  store.rs     # SQLite persistence (files, chunks, vectors, metadata)
  hnsw.rs      # HNSW index wrapper
  indexer.rs   # Vault walking, incremental sync orchestration
  search.rs    # Query pipeline and output formatting
```
## License
