Major quality improvement: replace random index vectors with pre-trained
Nomic nomic-embed-code token embeddings. Tokens like 'error' and 'exception' now start
with similar vectors (learned from millions of code repos) instead of
arbitrary random projections. Co-occurrence enrichment adds project-specific
context on top.

Architecture:
- vendored/nomic/code_vectors.bin: 37.7MB raw int8 vectors
- vendored/nomic/code_vectors_blob.S: assembler .incbin (instant build)
- vendored/nomic/code_vectors.h: extern declarations + pretrained_vec_at()
- vendored/nomic/code_tokens.h: 40856 token strings (575KB)
- semantic.c: cbm_sem_random_index() now looks up pretrained vectors first,
falls back to sparse random for unknown tokens
- CBM_SEM_DIM raised from 256 to 768 to match Nomic nomic-embed-code
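
The lookup-then-fallback behavior described for cbm_sem_random_index() can be sketched roughly as follows. This is a minimal illustration, not the code in semantic.c: every demo_* name, the tiny 3-token table, the 8-dimensional vectors, the int8 scaling by 127, and the hash-based fallback are assumptions made for the example. The real build uses CBM_SEM_DIM = 768 and the vendored tables.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_DIM   8   /* assumption: the commit uses CBM_SEM_DIM = 768 */
#define DEMO_VOCAB 3   /* assumption: the real table has 40856 tokens */

/* Hypothetical stand-ins for code_tokens.h (kept sorted) and code_vectors.bin.
 * The numbers are invented for illustration only. */
static const char *const demo_tokens[DEMO_VOCAB] = { "error", "exception", "parse" };
static const int8_t demo_vectors[DEMO_VOCAB][DEMO_DIM] = {
    {  90, 40, -10,  5, 0,  0, 12, -3 },
    {  85, 45, -12,  2, 1,  0, 10, -5 },
    { -20,  5,  70, 60, 0, -8,  0,  4 },
};

/* Binary search over the sorted token table; returns NULL for unknown tokens. */
static const int8_t *demo_lookup(const char *tok) {
    size_t lo = 0, hi = DEMO_VOCAB;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = strcmp(tok, demo_tokens[mid]);
        if (c == 0) return demo_vectors[mid];
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return NULL;
}

/* FNV-1a string hash, used to make the fallback deterministic per token. */
static uint32_t demo_hash(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

/* Sparse random fallback: add +1 or -1 at two hash-derived positions. */
static void demo_sparse_random(const char *tok, float *out) {
    uint32_t h = demo_hash(tok);
    for (int i = 0; i < DEMO_DIM; i++) out[i] = 0.0f;
    for (int i = 0; i < 2; i++) {
        h = h * 1664525u + 1013904223u;            /* one LCG step */
        size_t pos = (size_t)((h >> 8) % DEMO_DIM);
        out[pos] += (h & 0x80000000u) ? 1.0f : -1.0f;
    }
}

/* Fills out[DEMO_DIM]; returns 1 if a pretrained vector was used, 0 on fallback. */
static int demo_token_vector(const char *tok, float *out) {
    const int8_t *v = demo_lookup(tok);
    if (v != NULL) {
        for (int i = 0; i < DEMO_DIM; i++) out[i] = (float)v[i] / 127.0f;
        return 1;
    }
    demo_sparse_random(tok, out);
    return 0;
}
```

In this sketch a known token such as "error" gets its dequantized pretrained vector, while an unknown token still gets a stable vector because the fallback is seeded from the token's hash.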

Also: RRI (Reflective Random Indexing), code pattern vocabulary injection,
120+ abbreviation expansions, callee/caller/body token enrichment,
label filter (Function/Method/Class only) in vector search SQL.
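
As an illustration of the abbreviation expansion step, a table-driven lookup might look like the sketch below. The three entries and the demo_* names are hypothetical; the commit message does not list the actual 120+ expansions or say how they are stored.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical abbreviation table; the real list is not shown in the commit. */
struct demo_abbrev { const char *abbr; const char *full; };

static const struct demo_abbrev demo_abbrevs[] = {
    { "cfg", "config"  },
    { "err", "error"   },
    { "msg", "message" },
};

/* Returns the expanded form, or the token itself when no expansion is known. */
static const char *demo_expand(const char *tok) {
    size_t n = sizeof demo_abbrevs / sizeof demo_abbrevs[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tok, demo_abbrevs[i].abbr) == 0) return demo_abbrevs[i].full;
    }
    return tok;
}
```

Expanding before the vector lookup would let a short token such as "err" resolve to the pretrained vector for "error" instead of falling back to a random one, assuming the expansion runs at that point in the pipeline.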

Binary size: 136MB → 169MB (+33MB from embedded vectors).
Search quality: keyword queries return relevant error-handling functions.
Domain-specific keyword queries return the expected functions.