Commit 3631c25
v1.0.1: llama.cpp backend, Metal GPU, intelligence pipeline fixes (#11)
* fix(llm): add shimmytok fallback for GGUF-embedded tokenizers
GGUF repos rarely ship tokenizer.json, and Google's Gemma tokenizers are
gated on HuggingFace. The FlexTokenizer enum wraps both the HuggingFace
tokenizers crate and shimmytok (which extracts the tokenizer from GGUF
metadata). CandleEmbed uses FlexTokenizer; the orchestrator and reranker stay HF-only.
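A rough sketch of the fallback shape: the HuggingFace arm uses the real `tokenizers` crate, while the GGUF arm and its method names are placeholders, not shimmytok's actual API.

```rust
use tokenizers::Tokenizer;

/// Placeholder for a tokenizer reconstructed from GGUF metadata (the shimmytok path).
struct GgufTokenizer;

impl GgufTokenizer {
    fn encode_ids(&self, _text: &str) -> Vec<u32> {
        Vec::new() // stub: the real type reads vocab/merges out of the GGUF file
    }
}

enum FlexTokenizer {
    HuggingFace(Tokenizer),      // external tokenizer.json via the tokenizers crate
    GgufEmbedded(GgufTokenizer), // fallback when the repo ships no tokenizer.json
}

impl FlexTokenizer {
    fn encode_ids(&self, text: &str) -> Result<Vec<u32>, String> {
        match self {
            // Normal path: the repo ships tokenizer.json and it is not gated.
            FlexTokenizer::HuggingFace(tok) => {
                let enc = tok.encode(text, true).map_err(|e| e.to_string())?;
                Ok(enc.get_ids().to_vec())
            }
            // Fallback path: pull the tokenizer out of the GGUF file itself.
            FlexTokenizer::GgufEmbedded(tok) => Ok(tok.encode_ids(text)),
        }
    }
}
```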
* fix(llm): apply prompt format in CandleEmbed embed_one and embed_batch
embed_one now calls prompt_format.format_query() and embed_batch calls
prompt_format.format_document() before passing text to embed_text().
This is required for asymmetric models like embeddinggemma that need
specific prefixes for queries vs documents.
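A minimal, runnable illustration of the asymmetric prompting; the prefix strings are assumptions about embeddinggemma-style formats, only the format_query / format_document split mirrors the change.

```rust
struct PromptFormat {
    query_prefix: String,
    document_prefix: String,
}

impl PromptFormat {
    fn format_query(&self, text: &str) -> String {
        format!("{}{}", self.query_prefix, text)
    }
    fn format_document(&self, text: &str) -> String {
        format!("{}{}", self.document_prefix, text)
    }
}

fn main() {
    let fmt = PromptFormat {
        // embeddinggemma-style prefixes; illustrative values, not taken from the repo.
        query_prefix: "task: search result | query: ".into(),
        document_prefix: "title: none | text: ".into(),
    };
    // embed_one applies the query prefix, embed_batch the document prefix,
    // before the text reaches embed_text().
    println!("{}", fmt.format_query("how does reindexing work?"));
    println!("{}", fmt.format_document("Reindexing clears the FTS table..."));
}
```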
* fix(store): clear FTS on reindex and use stored dim for vec table init
reset_for_reindex now also deletes from chunks_fts so stale keyword
entries don't survive a dimension migration. Store::init() reads the
stored embedding_dim from meta to create the vec table with the correct
dimension, preventing a stale 384-dim table from persisting when the
model outputs 256-dim vectors.
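An illustrative rusqlite sketch of the two fixes; the chunks/meta/chunks_vec table names and the vec0 virtual-table layout are assumptions about the schema.

```rust
use rusqlite::Connection;

fn reset_for_reindex(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute("DELETE FROM chunks", [])?;
    // Also clear the keyword index so stale FTS rows don't survive a
    // dimension migration.
    conn.execute("DELETE FROM chunks_fts", [])?;
    Ok(())
}

fn init_vec_table(conn: &Connection) -> rusqlite::Result<()> {
    // Use the dimension recorded at index time, not the compiled-in default,
    // so a 256-dim model never lands in a stale 384-dim table.
    let dim: i64 = conn.query_row(
        "SELECT value FROM meta WHERE key = 'embedding_dim'",
        [],
        |row| row.get(0),
    )?;
    conn.execute_batch(&format!(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec USING vec0(embedding float[{dim}])"
    ))?;
    Ok(())
}
```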
* fix(search): wire LLM cache into search_with_intelligence
When an orchestrator is present, compute a SHA256 cache key from the
query and check the llm_cache table first. On miss, call the
orchestrator and store the result. Adds Serialize/Deserialize to
QueryIntent and OrchestrationResult for JSON round-tripping.
Removes #[allow(dead_code)] from orchestration_cache_key.
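A self-contained sketch of the cache-key flow; the struct fields and example values are placeholders, and the llm_cache SQL itself is omitted.

```rust
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};

#[derive(Serialize, Deserialize)]
struct OrchestrationResult {
    expanded_queries: Vec<String>, // placeholder field
}

fn orchestration_cache_key(query: &str) -> String {
    // Hex-encoded SHA-256 of the raw query text keys the llm_cache row.
    Sha256::digest(query.as_bytes())
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

fn main() {
    let key = orchestration_cache_key("rust async runtime internals");
    // On a hit, the stored JSON is deserialized instead of calling the
    // orchestrator; on a miss, the fresh result is serialized and inserted.
    let result = OrchestrationResult { expanded_queries: vec!["tokio scheduler".into()] };
    println!("{key} -> {}", serde_json::to_string(&result).unwrap());
}
```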
* fix(serve): wire orchestrator and reranker into MCP search handler
The search tool handler now calls search_with_intelligence with the
orchestrator and reranker from EngraphServer, enabling LLM-powered
query expansion and result reranking in the MCP server. Removes
#[allow(dead_code)] from the orchestrator and reranker fields.
* feat(llm): add BERT GGUF architecture support, switch default to all-MiniLM-L6-v2
Add BertLayer struct with LayerNorm+bias, absolute position embeddings,
and GELU FFN activation alongside the existing Gemma EmbedLayer. The
CandleEmbed struct now wraps an EmbedModelVariant enum (Gemma | Bert)
and detects architecture from GGUF metadata (general.architecture).
Switch default embedding model from embeddinggemma-300M (256-dim) to
all-MiniLM-L6-v2-GGUF Q8_0 (384-dim, 25MB). Users can still override
to embeddinggemma via config.toml. Update store default dimension to 384.
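A sketch of the dispatch: reading general.architecture is per this change, but the exact string values matched below are assumptions.

```rust
enum EmbedModelVariant {
    Gemma, // RmsNorm-based EmbedLayer path
    Bert,  // LayerNorm + bias, absolute positions, GELU FFN
}

fn detect_variant(general_architecture: &str) -> Result<EmbedModelVariant, String> {
    match general_architecture {
        "gemma" | "gemma2" | "gemma3" => Ok(EmbedModelVariant::Gemma),
        "bert" => Ok(EmbedModelVariant::Bert),
        other => Err(format!("unsupported embedding architecture: {other}")),
    }
}

fn main() {
    // all-MiniLM-L6-v2 GGUF files report "bert" here.
    assert!(matches!(detect_variant("bert"), Ok(EmbedModelVariant::Bert)));
    assert!(detect_variant("llama").is_err());
}
```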
* feat: add accelerate feature flag for optimized CPU on macOS
* fix: add indexing progress output, fix Qwen3 GGUF filename case
- Print [N/M] file progress during indexing (was silent for minutes)
- Fix expand model URI: Qwen3-0.6B-Q8_0.gguf (uppercase, was 404)
- Add accelerate feature flag for Apple vecLib optimization
* fix: use float32 RmsNorm for Metal GPU compatibility in Gemma embedding
Replace candle_transformers::quantized_nn::RmsNorm (which lacks a Metal
kernel) with candle_nn::RmsNorm throughout the Gemma embedding code.
QTensor weights are dequantized to f32 Tensor at load time so the
standard RmsNorm forward pass runs on Metal without error.
Also restores embeddinggemma as the default model (256-dim), replaces
eprint indexing progress with an indicatif progress bar, and fixes
store tests to match the new default dimension.
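The load-time shape of that swap looks roughly like this, assuming candle_nn::RmsNorm::new and QTensor::dequantize; a sketch, not the exact project code.

```rust
use candle_core::{quantized::QTensor, Device, Result};
use candle_nn::RmsNorm;

fn load_rms_norm(q_weight: &QTensor, eps: f64, device: &Device) -> Result<RmsNorm> {
    // Dequantize the GGUF weight to an f32 Tensor at load time so the
    // standard (Metal-capable) RmsNorm forward pass is used at inference,
    // instead of quantized_nn::RmsNorm, which has no Metal kernel.
    let weight = q_weight.dequantize(device)?;
    Ok(RmsNorm::new(weight, eps))
}
```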
* refactor(llm): replace candle backend with llama-cpp-2 for Metal GPU support
candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul).
llama.cpp has mature Metal support and auto-detects GPU at build time.
- Replace candle-core/candle-nn/candle-transformers with llama-cpp-2
- CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator,
CandleRerank -> LlamaRerank
- Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer,
EmbedModelVariant (llama.cpp handles all model loading internally)
- Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU
detection at CMake build time)
- LlamaContext is !Send, so contexts are created per-call from the
  stored LlamaModel (which is Send+Sync); see the sketch after this list
- Public API unchanged: traits, MockLlm, download infra, FlexTokenizer,
PromptFormat, heuristic_orchestrate all preserved
- 270 tests pass (net -1: removed select_device test)
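A generic sketch of that per-call pattern; the types below are stand-ins, not the llama-cpp-2 API.

```rust
use std::marker::PhantomData;
use std::sync::Arc;

struct Model;        // stands in for the loaded LlamaModel: Send + Sync, stored once
struct Context<'m> { // stands in for LlamaContext: !Send, built per call
    model: &'m Model,
    _not_send: PhantomData<*const ()>, // raw pointer marker makes this !Send
}

impl<'m> Context<'m> {
    fn new(model: &'m Model) -> Self {
        Context { model, _not_send: PhantomData }
    }
    fn run(&self, text: &str) -> Vec<f32> {
        let _ = (self.model, text);
        vec![0.0; 4] // placeholder embedding
    }
}

struct LlamaEmbedSketch {
    model: Arc<Model>, // only the Send + Sync model lives in the struct
}

impl LlamaEmbedSketch {
    fn embed(&self, text: &str) -> Vec<f32> {
        // A fresh context per call keeps the struct itself Send + Sync,
        // so it can be shared across threads and async tasks.
        Context::new(&self.model).run(text)
    }
}

fn main() {
    let e = LlamaEmbedSketch { model: Arc::new(Model) };
    assert_eq!(e.embed("hello").len(), 4);
}
```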
* feat(llm): switch to llama.cpp backend, fix embedding params
Replace candle with llama-cpp-2 for all ML inference. Gets Metal GPU
acceleration (88 files in 70s vs 37+ min on CPU).
Fixes: use encode() not decode() for embeddings, set n_ubatch >= n_tokens,
use AddBos::Never (PromptFormat already adds <bos>), force CPU device
for quantized ops (candle Metal unsupported).
Keeps BERT GGUF support code for fallback. Default: embeddinggemma-300M.
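A tiny, runnable illustration of the n_ubatch sizing rule; the 512 floor is an assumption, not a value from the repo.

```rust
fn embed_batch_params(n_tokens: usize) -> (u32, u32) {
    // The micro-batch must cover the whole tokenized prompt, otherwise the
    // encode step rejects the batch for embedding models.
    let n_ubatch = n_tokens.max(512) as u32;
    let n_batch = n_ubatch; // keep the logical batch at least as large (usual llama.cpp setup)
    (n_batch, n_ubatch)
}

fn main() {
    let tokens_in_prompt = 1337; // e.g. a long document chunk
    let (n_batch, n_ubatch) = embed_batch_params(tokens_in_prompt);
    assert!(n_ubatch as usize >= tokens_in_prompt);
    println!("n_batch={n_batch}, n_ubatch={n_ubatch}");
}
```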
* style: cargo fmt
* ci: install CMake on Ubuntu for llama.cpp build
* feat(search): wire intelligence models into CLI search path
run_search now loads orchestrator + reranker when intelligence is
enabled and calls search_with_intelligence instead of search_internal.
* fix: singleton LlamaBackend and built-in tokenizer for orchestrator/reranker
Bug 1: LlamaBackend::init() fails with BackendAlreadyInitialized if called
more than once. Add a module-level llama_backend() function using OnceLock +
a Mutex-guarded double-checked init (OnceLock::get_or_try_init is not yet
stable). Remove the backend field from LlamaEmbed, LlamaOrchestrator,
and LlamaRerank; all three now share the single static backend.
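A minimal sketch of that singleton pattern on stable Rust; Backend stands in for LlamaBackend and its error type is simplified.

```rust
use std::sync::{Mutex, OnceLock};

struct Backend; // stands in for LlamaBackend

impl Backend {
    fn init() -> Result<Self, String> {
        // The real LlamaBackend::init() fails with BackendAlreadyInitialized
        // if called twice, hence the process-wide singleton below.
        Ok(Backend)
    }
}

static BACKEND: OnceLock<Backend> = OnceLock::new();
static INIT_LOCK: Mutex<()> = Mutex::new(());

fn llama_backend() -> Result<&'static Backend, String> {
    // Fast path: already initialized.
    if let Some(b) = BACKEND.get() {
        return Ok(b);
    }
    // Slow path: take the lock, then re-check so only one thread ever
    // calls Backend::init().
    let _guard = INIT_LOCK.lock().unwrap();
    if let Some(b) = BACKEND.get() {
        return Ok(b);
    }
    let backend = Backend::init()?;
    let _ = BACKEND.set(backend);
    Ok(BACKEND.get().expect("just set"))
}

fn main() {
    let _ = llama_backend().unwrap();
    let _ = llama_backend().unwrap(); // second call reuses the singleton
}
```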
Bug 2: LlamaOrchestrator and LlamaRerank were loading an external
tokenizer.json via load_hf_tokenizer(), which does not exist in Qwen3 GGUF
repos. Switch both to llama.cpp's built-in tokenizer: str_to_token() for
encoding, token_to_piece() for decoding, and str_to_token("Yes"/"No") for
Yes/No token ID lookup. Remove the tokenizer field from both structs and
drop the load_hf_tokenizer() helper. Add encoding_rs as a direct dependency
(required by token_to_piece's Decoder parameter; was already a transitive dep).
All 270 unit tests pass, clippy clean, fmt clean.
* fix(llm): global backend singleton, built-in tokenizers, wire CLI intelligence
- LlamaBackend shared via OnceLock (was re-initialized per model, crashed)
- Orchestrator/reranker use llama.cpp built-in tokenizer (GGUF-embedded)
- CLI search loads intelligence models when enabled
- Debug log for orchestration results
* docs: update README, CHANGELOG, CLAUDE.md for llama.cpp backend
- README: llama.cpp references, Metal GPU, 270 tests, CMake requirement
- CHANGELOG: v1.0.1 entry with all fixes and backend switch
- CLAUDE.md: llama-cpp-2 deps, LlamaEmbed/LlamaOrchestrator/LlamaRerank
- Release workflow: CMake on Ubuntu, cmake dep in Homebrew formula
- Vault spec: updated with hotfix PR reference

Parent: 697df6e
13 files changed: 751 additions, 1731 deletions