Commit 3631c25
v1.0.1: llama.cpp backend, Metal GPU, intelligence pipeline fixes (#11)
* fix(llm): add shimmytok fallback for GGUF-embedded tokenizers

  GGUF repos rarely ship tokenizer.json, and Google Gemma tokenizers are gated on HuggingFace. The FlexTokenizer enum wraps both the HuggingFace tokenizers crate and shimmytok (extracts from GGUF metadata). CandleEmbed uses FlexTokenizer; orchestrator/reranker use HF-only.

* fix(llm): apply prompt format in CandleEmbed embed_one and embed_batch

  embed_one now calls prompt_format.format_query() and embed_batch calls prompt_format.format_document() before passing text to embed_text(). This is required for asymmetric models like embeddinggemma that need specific prefixes for queries vs documents.

* fix(store): clear FTS on reindex and use stored dim for vec table init

  reset_for_reindex now also deletes from chunks_fts so stale keyword entries don't survive a dimension migration. Store::init() reads the stored embedding_dim from meta to create the vec table with the correct dimension, preventing a stale 384-dim table from persisting when the model outputs 256-dim vectors.

* fix(search): wire LLM cache into search_with_intelligence

  When an orchestrator is present, compute a SHA256 cache key from the query and check the llm_cache table first. On miss, call the orchestrator and store the result. Adds Serialize/Deserialize to QueryIntent and OrchestrationResult for JSON round-tripping. Removes #[allow(dead_code)] from orchestration_cache_key.

* fix(serve): wire orchestrator and reranker into MCP search handler

  The search tool handler now calls search_with_intelligence with the orchestrator and reranker from EngraphServer, enabling LLM-powered query expansion and result reranking in the MCP server. Removes #[allow(dead_code)] from the orchestrator and reranker fields.

* feat(llm): add BERT GGUF architecture support, switch default to all-MiniLM-L6-v2

  Add a BertLayer struct with LayerNorm+bias, absolute position embeddings, and GELU FFN activation alongside the existing Gemma EmbedLayer. The CandleEmbed struct now wraps an EmbedModelVariant enum (Gemma | Bert) and detects architecture from GGUF metadata (general.architecture). Switch the default embedding model from embeddinggemma-300M (256-dim) to all-MiniLM-L6-v2-GGUF Q8_0 (384-dim, 25MB). Users can still override to embeddinggemma via config.toml. Update store default dimension to 384.

* feat: add accelerate feature flag for optimized CPU on macOS

* fix: add indexing progress output, fix Qwen3 GGUF filename case

  - Print [N/M] file progress during indexing (was silent for minutes)
  - Fix expand model URI: Qwen3-0.6B-Q8_0.gguf (uppercase, was 404)
  - Add accelerate feature flag for Apple vecLib optimization

* fix: use float32 RmsNorm for Metal GPU compatibility in Gemma embedding

  Replace candle_transformers::quantized_nn::RmsNorm (which lacks a Metal kernel) with candle_nn::RmsNorm throughout the Gemma embedding code. QTensor weights are dequantized to f32 Tensor at load time so the standard RmsNorm forward pass runs on Metal without error. Also restores embeddinggemma as the default model (256-dim), replaces eprint indexing progress with an indicatif progress bar, and fixes store tests to match the new default dimension.

* refactor(llm): replace candle backend with llama-cpp-2 for Metal GPU support

  candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul). llama.cpp has mature Metal support and auto-detects GPU at build time.

  - Replace candle-core/candle-nn/candle-transformers with llama-cpp-2
  - CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator, CandleRerank -> LlamaRerank
  - Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer, EmbedModelVariant (llama.cpp handles all model loading internally)
  - Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU detection at CMake build time)
  - LlamaContext is !Send, so contexts are created per-call from the stored LlamaModel (which is Send+Sync)
  - Public API unchanged: traits, MockLlm, download infra, FlexTokenizer, PromptFormat, heuristic_orchestrate all preserved
  - 270 tests pass (net -1: removed select_device test)

* feat(llm): switch to llama.cpp backend, fix embedding params

  Replace candle with llama-cpp-2 for all ML inference. Gets Metal GPU acceleration (88 files in 70s vs 37+ min on CPU). Fixes: use encode() not decode() for embeddings, set n_ubatch >= n_tokens, use AddBos::Never (PromptFormat already adds <bos>), force CPU device for quantized ops (candle Metal unsupported). Keeps BERT GGUF support code for fallback. Default: embeddinggemma-300M.

* style: cargo fmt

* ci: install CMake on Ubuntu for llama.cpp build

* feat(search): wire intelligence models into CLI search path

  run_search now loads orchestrator + reranker when intelligence is enabled and calls search_with_intelligence instead of search_internal.

* fix: singleton LlamaBackend and built-in tokenizer for orchestrator/reranker

  Bug 1: LlamaBackend::init() fails with BackendAlreadyInitialized if called more than once. Add a module-level llama_backend() function using OnceLock + a Mutex-guarded double-checked init (get_or_try_init is still unstable on stable Rust). Remove the backend field from LlamaEmbed, LlamaOrchestrator, and LlamaRerank; all three now share the single static backend.

  Bug 2: LlamaOrchestrator and LlamaRerank were loading an external tokenizer.json via load_hf_tokenizer(), which does not exist in Qwen3 GGUF repos. Switch both to llama.cpp's built-in tokenizer: str_to_token() for encoding, token_to_piece() for decoding, and str_to_token("Yes"/"No") for Yes/No token ID lookup. Remove the tokenizer field from both structs and drop the load_hf_tokenizer() helper.

  Add encoding_rs as a direct dependency (required by token_to_piece's Decoder parameter; was already a transitive dep). All 270 unit tests pass, clippy clean, fmt clean.

* fix(llm): global backend singleton, built-in tokenizers, wire CLI intelligence

  - LlamaBackend shared via OnceLock (was re-initialized per model, crashed)
  - Orchestrator/reranker use llama.cpp built-in tokenizer (GGUF-embedded)
  - CLI search loads intelligence models when enabled
  - Debug log for orchestration results

* docs: update README, CHANGELOG, CLAUDE.md for llama.cpp backend

  - README: llama.cpp references, Metal GPU, 270 tests, CMake requirement
  - CHANGELOG: v1.0.1 entry with all fixes and backend switch
  - CLAUDE.md: llama-cpp-2 deps, LlamaEmbed/LlamaOrchestrator/LlamaRerank
  - Release workflow: CMake on Ubuntu, cmake dep in Homebrew formula
  - Vault spec: updated with hotfix PR reference
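The singleton fix under "fix: singleton LlamaBackend and built-in tokenizer" is the subtlest change in the list. A minimal sketch of the pattern it describes, assuming the `llama-cpp-2` import path shown (the `BACKEND`/`INIT_LOCK` names and error handling are illustrative; only `LlamaBackend::init()` and `llama_backend()` come from the commit message):

```rust
use std::sync::{Mutex, OnceLock};

use llama_cpp_2::llama_backend::LlamaBackend;

// Process-wide backend: LlamaBackend::init() fails with
// BackendAlreadyInitialized if called twice, so every model shares this.
static BACKEND: OnceLock<LlamaBackend> = OnceLock::new();
static INIT_LOCK: Mutex<()> = Mutex::new(());

/// Return the shared backend, initializing it exactly once.
/// The fast path is a lock-free OnceLock read; the slow path serializes
/// initializers behind a Mutex, because the fallible
/// OnceLock::get_or_try_init is still unstable on stable Rust.
fn llama_backend() -> &'static LlamaBackend {
    if let Some(backend) = BACKEND.get() {
        return backend;
    }
    let _guard = INIT_LOCK.lock().unwrap();
    // Double-check under the lock: another thread may have initialized
    // the backend between our first check and acquiring the Mutex.
    if BACKEND.get().is_none() {
        let backend = LlamaBackend::init().expect("llama.cpp backend init failed");
        let _ = BACKEND.set(backend); // set() only fails if already set
    }
    BACKEND.get().unwrap()
}
```

Each of `LlamaEmbed`, `LlamaOrchestrator`, and `LlamaRerank` would then call `llama_backend()` rather than owning a backend, which is why removing their backend fields is safe.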
1 parent 697df6e commit 3631c25

13 files changed

Lines changed: 751 additions & 1731 deletions


.github/workflows/ci.yml

Lines changed: 3 additions & 0 deletions
@@ -14,6 +14,9 @@ jobs:
     runs-on: ${{ matrix.os }}
     steps:
       - uses: actions/checkout@v4
+      - name: Install CMake (Ubuntu)
+        if: runner.os == 'Linux'
+        run: sudo apt-get update && sudo apt-get install -y cmake
       - uses: dtolnay/rust-toolchain@stable
         with:
           components: rustfmt, clippy

.github/workflows/release.yml

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,9 @@ jobs:
       contents: write
     steps:
       - uses: actions/checkout@v4
+      - name: Install CMake (Ubuntu)
+        if: runner.os == 'Linux'
+        run: sudo apt-get update && sudo apt-get install -y cmake
       - uses: dtolnay/rust-toolchain@stable
       - run: cargo build --release
       - name: Archive binary
@@ -60,6 +63,7 @@ jobs:
   sha256 "SHA256"
   license "MIT"

+  depends_on "cmake" => :build
   depends_on "rust" => :build

   def install

CHANGELOG.md

Lines changed: 24 additions & 0 deletions
@@ -1,5 +1,29 @@
 # Changelog

+## [1.0.1] - 2026-03-26
+
+### Changed
+- **Inference backend switched from candle to llama.cpp** — via `llama-cpp-2` Rust bindings. Gets full Metal GPU acceleration on macOS (88 files indexed in 70s vs 37+ minutes on CPU with candle). Same backend as [qmd](https://github.com/tobi/qmd).
+- Default embedding model produces 256-dim vectors via embeddinggemma-300M (Matryoshka truncation)
+- BERT GGUF architecture support added alongside Gemma (future model flexibility)
+- Progress bar during indexing via indicatif (was silent for minutes)
+- CI workflow installs CMake on Ubuntu (required for llama.cpp build)
+
+### Fixed
+- **Prompt format applied during embedding** — `embed_one` uses search_query prefix, `embed_batch` uses search_document prefix. Without this, embeddinggemma operated in wrong symmetric mode.
+- **GGUF tokenizer fallback** — added `shimmytok` crate to extract tokenizer from GGUF metadata when tokenizer.json is unavailable (Google Gemma repos are gated)
+- **LlamaBackend singleton** — global `OnceLock` prevents double-initialization crash when loading multiple models
+- **Orchestrator/reranker use built-in tokenizer** — llama.cpp reads tokenizer from GGUF metadata, no external tokenizer.json needed
+- **Dimension migration clears FTS** — `reset_for_reindex` now also clears `chunks_fts` to prevent duplicate entries
+- **LLM cache wired into search** — `search_with_intelligence` checks/populates `llm_cache` table
+- **MCP server wires intelligence** — search handler passes orchestrator + reranker via `SearchConfig`
+- **CLI search wires intelligence** — `run_search` loads models when intelligence enabled
+- **Qwen3 GGUF filename** — fixed case sensitivity (was 404)
+- **Embedding batch params** — `n_ubatch >= n_tokens` assertion, use `encode()` not `decode()`, `AddBos::Never` (PromptFormat adds `<bos>`)
+
+### Removed
+- `candle-core`, `candle-nn`, `candle-transformers` dependencies (replaced by `llama-cpp-2`)
+
 ## [1.0.0] - 2026-03-25

 ### Added
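The "LLM cache wired into search" entry compresses a whole control path: hash the query, consult `llm_cache`, and only invoke the orchestrator on a miss. A hedged sketch of that flow, assuming a two-column `llm_cache(key, result)` schema and an illustrative `OrchestrationResult` shape (neither the schema nor the struct fields are confirmed by the diff; `orchestrate_cached` is a hypothetical helper name):

```rust
use rusqlite::{Connection, OptionalExtension};
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};

// Illustrative stand-in: the real OrchestrationResult gained
// Serialize/Deserialize in this release precisely for this round-trip.
#[derive(Serialize, Deserialize)]
struct OrchestrationResult {
    expanded_queries: Vec<String>,
}

/// SHA-256 of the query text, hex-encoded, used as the cache key.
fn orchestration_cache_key(query: &str) -> String {
    Sha256::digest(query.as_bytes())
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

/// Consult llm_cache first; call the orchestrator and store on a miss.
fn orchestrate_cached(
    db: &Connection,
    query: &str,
    run_orchestrator: impl FnOnce(&str) -> OrchestrationResult,
) -> rusqlite::Result<OrchestrationResult> {
    let key = orchestration_cache_key(query);
    let cached: Option<String> = db
        .query_row("SELECT result FROM llm_cache WHERE key = ?1", [&key], |row| {
            row.get(0)
        })
        .optional()?;
    if let Some(json) = cached {
        if let Ok(result) = serde_json::from_str(&json) {
            return Ok(result); // cache hit: skip the LLM call entirely
        }
    }
    let result = run_orchestrator(query);
    let json = serde_json::to_string(&result).expect("struct is serializable");
    db.execute(
        "INSERT OR REPLACE INTO llm_cache (key, result) VALUES (?1, ?2)",
        rusqlite::params![key, json],
    )?;
    Ok(result)
}
```

Storing the result as JSON is why the commit message notes `Serialize`/`Deserialize` being derived on `QueryIntent` and `OrchestrationResult`.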

CLAUDE.md

Lines changed: 7 additions & 7 deletions
@@ -9,7 +9,7 @@ Single binary with 19 modules behind a lib crate:
 - `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`. Includes `intelligence: Option<bool>` and `[models]` section for model overrides. `Config::save()` writes back to disk.
 - `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
 - `docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
-- `llm.rs` — candle model management. Three traits: `EmbedModel` (embeddings), `RerankModel` (cross-encoder scoring), `OrchestratorModel` (query intent + expansion). Three candle implementations: `CandleEmbed` (custom bidirectional transformer from GGUF for embeddinggemma), `CandleOrchestrator` (quantized_qwen3 for query analysis), `CandleRerank` (quantized_qwen3 for relevance scoring). Also: `MockLlm` for testing, `HfModelUri` for model download, `PromptFormat` for model-family prompt templates, `heuristic_orchestrate()` fast path, `LaneWeights` per query intent
+- `llm.rs` — ML inference via llama.cpp (Rust bindings: `llama-cpp-2`). Three traits: `EmbedModel` (embeddings), `RerankModel` (cross-encoder scoring), `OrchestratorModel` (query intent + expansion). Three llama.cpp implementations: `LlamaEmbed` (embeddinggemma-300M GGUF on Metal GPU), `LlamaOrchestrator` (Qwen3-0.6B for query analysis + expansion), `LlamaRerank` (Qwen3-Reranker-0.6B for relevance scoring). Global `LlamaBackend` via `OnceLock`. Also: `MockLlm` for testing, `HfModelUri` for model download, `FlexTokenizer` (HuggingFace tokenizers + shimmytok GGUF fallback), `PromptFormat` for model-family prompt templates, `heuristic_orchestrate()` fast path, `LaneWeights` per query intent
 - `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
 - `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 + graph + reranker results. Supports per-lane weighting, `--explain` output with intent + per-lane detail
 - `context.rs` — context engine. Six functions: `read` (full note content + metadata), `list` (filtered note listing with `created_by` filter), `vault_map` (structure overview), `who` (person context bundle), `project` (project context bundle), `context_topic` (rich topic context with budget trimming). Pure functions taking `ContextParams` — no model loading except `context_topic` which reuses `search_internal`
@@ -52,14 +52,13 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr

 ## Dependencies to be aware of

-- `candle-core` (0.9) — HuggingFace pure Rust ML framework. GGUF model loading, tensor ops. `metal` feature for macOS GPU acceleration
-- `candle-nn` (0.9) — neural network building blocks (RmsNorm, rotary embeddings, etc.)
-- `candle-transformers` (0.9) — pre-built transformer model architectures. Used: `quantized_qwen3` for orchestrator + reranker
+- `llama-cpp-2` (0.1) — Rust bindings to llama.cpp. GGUF model loading + inference. Metal GPU on macOS, CUDA on Linux. Compiles llama.cpp C++ via build script (requires CMake)
+- `shimmytok` (0.7) — pure Rust tokenizer that reads from GGUF metadata. Fallback when tokenizer.json is unavailable (gated HuggingFace repos)
+- `tokenizers` (0.22) — HuggingFace tokenizer. Kept for FlexTokenizer HuggingFace backend
 - `sqlite-vec` (0.1.8-alpha.1) — SQLite extension for vector search. Provides vec0 virtual tables with KNN via `vec_distance_cosine()`
 - `zerocopy` (0.7) — zero-copy serialization for vector data passed to sqlite-vec
 - `strsim` (0.11) — string similarity for fuzzy tag matching and fuzzy link matching
 - `time` (0.3) — date/time handling for frontmatter timestamps
-- `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature. Used for all three GGUF models
 - `ignore` (0.4) — vault walking with `.gitignore` support
 - `rusqlite` (0.32) — bundled SQLite with FTS5 support
 - `rmcp` (1.2) — MCP server SDK for stdio transport
@@ -68,12 +67,13 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr

 ## Testing

-- Unit tests in each module (`cargo test --lib`) — 271 tests, no network required
+- Unit tests in each module (`cargo test --lib`) — 270 tests, no network required
 - Integration tests (`cargo test --test integration -- --ignored`) — require GGUF model download
+- Build requires CMake (for llama.cpp C++ compilation)

 ## CI/CD

-- CI: `cargo fmt --check` + `cargo clippy -- -D warnings` + `cargo test --lib` on macOS + Ubuntu
+- CI: `cargo fmt --check` + `cargo clippy -- -D warnings` + `cargo test --lib` on macOS + Ubuntu. Ubuntu step installs CMake.
 - Release: native builds on macOS arm64 (macos-14) + Linux x86_64 (ubuntu-latest). Triggered by `v*` tags
 - Homebrew: `devwhodevs/homebrew-tap` — formula builds from source tarball
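The `FlexTokenizer` note in the `llm.rs` entry above compresses a small design decision: one enum, two backends, with the HuggingFace `tokenizers` crate preferred and shimmytok as the GGUF-metadata fallback for gated repos. A sketch of that shape; the `tokenizers` calls are the crate's real API, while every `shimmytok` name below is an assumption:

```rust
use std::error::Error;
use std::path::Path;

use tokenizers::Tokenizer; // HuggingFace `tokenizers` crate

/// One interface over two tokenizer sources: tokenizer.json via the
/// HuggingFace crate when the repo ships one, else a tokenizer rebuilt
/// from the GGUF file's own metadata via shimmytok.
enum FlexTokenizer {
    HuggingFace(Tokenizer),
    // Stand-in for shimmytok's tokenizer type; the real name may differ.
    Gguf(shimmytok::Tokenizer),
}

impl FlexTokenizer {
    /// Prefer tokenizer.json; fall back to GGUF-embedded metadata.
    /// Gated repos (e.g. Google Gemma) often make only the GGUF reachable.
    fn load(tokenizer_json: &Path, gguf: &Path) -> Result<Self, Box<dyn Error>> {
        if tokenizer_json.exists() {
            return Ok(Self::HuggingFace(Tokenizer::from_file(tokenizer_json)?));
        }
        // Assumed shimmytok entry point, not a verified signature.
        Ok(Self::Gguf(shimmytok::Tokenizer::from_gguf_file(gguf)?))
    }

    fn encode(&self, text: &str) -> Result<Vec<u32>, Box<dyn Error>> {
        match self {
            Self::HuggingFace(t) => Ok(t.encode(text, true)?.get_ids().to_vec()),
            Self::Gguf(t) => Ok(t.encode(text)?), // assumed shimmytok API
        }
    }
}
```

This also matches the split the commit message describes: only the embedder needs the fallback, while the orchestrator and reranker moved to llama.cpp's built-in GGUF tokenizer instead.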
