
feat(search): index prose content for BM25 full-text search #617

Open
ShauryaaSharma wants to merge 1 commit into DeusData:main from ShauryaaSharma:feat/fts-prose-content

@ShauryaaSharma

Contributor

What & why

search_graph BM25 only matched node names and headings, so it was blind to the
prose that documentation- and config-heavy repos carry. Markdown Section nodes
exposed only their heading; YAML/JSON Module nodes only their file name — the
section body and the description value were never indexed, and Section/Module
were excluded from BM25 results entirely. This indexes that prose so content is
searchable.

Closes #518
Closes #519

Changes

Testing

7 extraction cases + 3 store FTS cases added. Verified end-to-end: bodies are
extracted → indexed into nodes_fts.body → returned by BM25; json_valid() tolerates
malformed rows; legacy FTS tables upgrade on rebuild.
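
The flow described above (extract, index into nodes_fts.body, return via BM25) can be sketched with SQLite's FTS5 from Python. This is a minimal illustration, not the PR's C implementation: the `nodes` table layout, the `name`/`label` FTS columns, and the helper names `rebuild_fts`, `search`, and `demo` are assumptions made for the example; only `nodes_fts`, `body`, the `docstring` property, and the `json_valid()` guard come from the description. It also assumes the local SQLite build includes FTS5 and the JSON functions.

```python
import sqlite3


def rebuild_fts(conn: sqlite3.Connection) -> None:
    """Drop and recreate nodes_fts, then backfill body from each node's docstring."""
    # Dropping first means a legacy table without `body` is upgraded in place.
    conn.execute("DROP TABLE IF EXISTS nodes_fts")
    conn.execute("CREATE VIRTUAL TABLE nodes_fts USING fts5(name, label, body)")
    # json_valid() guards json_extract(), which raises on malformed JSON.
    conn.execute(
        """
        INSERT INTO nodes_fts(rowid, name, label, body)
        SELECT id, name, label,
               CASE WHEN json_valid(properties)
                    THEN COALESCE(json_extract(properties, '$.docstring'), '')
                    ELSE '' END
        FROM nodes
        """
    )
    conn.commit()


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return node names matching the query, best BM25 score first."""
    rows = conn.execute(
        "SELECT name FROM nodes_fts WHERE nodes_fts MATCH ? ORDER BY bm25(nodes_fts)",
        (query,),
    )
    return [row[0] for row in rows]


def demo() -> sqlite3.Connection:
    """Build a tiny in-memory store with a legacy FTS table, then rebuild it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical node table; the real schema is not shown in this PR.
    conn.execute(
        "CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, label TEXT, properties TEXT)"
    )
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?, ?)",
        [
            (1, "Install", "Section", '{"docstring": "Run the bootstrap script before indexing"}'),
            (2, "settings.json", "Module", "{not valid json"),  # malformed row
            (3, "parse_file", "Function", "{}"),  # valid JSON, no docstring
        ],
    )
    # Legacy table without a body column, as an older database might have.
    conn.execute("CREATE VIRTUAL TABLE nodes_fts USING fts5(name, label)")
    rebuild_fts(conn)
    return conn
```

With this setup, `search(demo(), "bootstrap")` finds the Install section through its body text, and the malformed row is indexed with an empty body instead of aborting the rebuild.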

Notes

Backward compatible (additive column; legacy DBs upgrade on next index). No MCP
tool changes, no new deps, no new system()/popen()/network calls. #518 and #519
share the FTS body infra (#519 can't work without it), so they're together —
happy to split if preferred.

Section nodes (markdown) and Module nodes (YAML/JSON) previously exposed
only their heading/name to BM25, so search_graph could not match the prose
body or a config description. Index that text so content is searchable.

- store: add a `body` column to the nodes_fts FTS5 table; new
  cbm_store_fts_rebuild() drops+recreates the table (upgrading legacy
  4-column databases) and backfills `body` from each node's docstring,
  guarded by json_valid() against malformed-JSON rows
- pipeline: both FTS backfill sites now call cbm_store_fts_rebuild()
- mcp: stop excluding Section/Module from BM25 results (they rank below
  code symbols, so existing result ordering is preserved)
- internal/cbm: capture the markdown section body beneath each heading
  (DeusData#518) and promote top-level description/summary/purpose values onto
  the file's Module node (DeusData#519), reusing the existing docstring property
- tests: 7 extraction cases + 3 store FTS cases

Closes DeusData#518
Closes DeusData#519

Signed-off-by: ShauryaaSharma <shauryasofficial27@gmail.com>
ShauryaaSharma force-pushed the feat/fts-prose-content branch from f6b313a to 58cd6c4 on June 25, 2026 07:33

Development

Successfully merging this pull request may close these issues.

META.yaml/frontmatter description values not indexed for BM25 search
Section nodes don't index body text — BM25 can't search markdown content

1 participant