Commit 80680ea

Cross-service communication discovery + RAM-first incremental indexing
AST-based detection of HTTP calls, async dispatch (Pub/Sub, Cloud Tasks, Kafka, SQS, etc.), and config accesses via resolved qualified names. Route nodes as cross-service rendezvous points with infra→handler matching. Constant propagation for module-level string assignments. YAML infrastructure URL extraction from Cloud Scheduler configs.

RAM-first incremental pipeline: load DB into graph buffer, purge changed file nodes, extract directly into existing buffer (resolver sees all nodes), dump back to disk. Zero edge gap on kubernetes/django/meilisearch/neovim.

- service_patterns.c: ~170 library patterns (90 HTTP, 50 async, 30 config)
- pass_route_nodes.c: Route node creation + infra URL matching
- extract_unified.c: string constant collection + string ref classification
- extract_calls.c: first_string_arg + keyword argument extraction
- pipeline_incremental.c: RAM-first load→purge→extract→resolve→dump
- graph_buffer.c: load_from_db, delete_by_file, foreach visitors
- C++ LSP crash fix: NULL guard in cbm_type_substitute
1 parent a7d6403 commit 80680ea

22 files changed

Lines changed: 1906 additions & 176 deletions
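
The purge and extract steps of the RAM-first pipeline named in the commit message could look roughly like the sketch below. It is an illustration only: `GraphBuffer`, `Node`, `gb_push`, `gb_delete_by_file`, and `gb_reindex_file` are hypothetical names with a fixed-capacity array, not the structures in graph_buffer.c, and the database load and dump steps are left out.

```c
#include <assert.h>
#include <string.h>

#define GB_MAX_NODES 64 /* hypothetical fixed capacity, chosen for the sketch */

/* Hypothetical node: the file it was extracted from and its name. */
typedef struct {
    const char *file;
    const char *name;
} Node;

/* Hypothetical in-RAM graph buffer holding every node of the project. */
typedef struct {
    Node items[GB_MAX_NODES];
    int count;
} GraphBuffer;

/* Append one node. Returns 0 on success, -1 if the buffer is full. */
static int gb_push(GraphBuffer *gb, const char *file, const char *name) {
    assert(gb && file && name);
    if (gb->count >= GB_MAX_NODES) {
        return -1;
    }
    gb->items[gb->count].file = file;
    gb->items[gb->count].name = name;
    gb->count++;
    return 0;
}

/* Purge step: drop every node that came from `file`, compacting in place.
 * Returns the number of nodes removed. */
static int gb_delete_by_file(GraphBuffer *gb, const char *file) {
    assert(gb && file);
    int kept = 0;
    for (int i = 0; i < gb->count; i++) {
        if (strcmp(gb->items[i].file, file) != 0) {
            gb->items[kept++] = gb->items[i];
        }
    }
    int removed = gb->count - kept;
    gb->count = kept;
    return removed;
}

/* Incremental step for one changed file: purge its stale nodes, then append
 * the freshly extracted names to the same buffer, so nodes from unchanged
 * files stay visible to whatever resolves edges afterwards.
 * Returns the new node count, or -1 if the buffer overflowed. */
static int gb_reindex_file(GraphBuffer *gb, const char *file,
                           const char *const *names, int n) {
    gb_delete_by_file(gb, file);
    for (int i = 0; i < n; i++) {
        if (gb_push(gb, file, names[i]) != 0) {
            return -1;
        }
    }
    return gb->count;
}
```

Calling `gb_reindex_file` once per changed file leaves nodes from untouched files in place, which matches the commit message's point that the resolver sees all nodes.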

CONTRIBUTING.md

Lines changed: 26 additions & 2 deletions
@@ -118,8 +118,32 @@ Examples: `fix(store): set busy_timeout before WAL`, `feat(cli): add --progress
 
 ## Pull Request Guidelines
 
-- **One issue per PR.** Each PR must address exactly one bug, one feature, or one refactor. Do not bundle multiple fixes or feature additions into a single PR. If your change touches multiple areas, split it into separate PRs.
-- **Open an issue first.** Every PR should reference a tracking issue (`Fixes #N` or `Closes #N`). This ensures the change is discussed before code is written.
+### Before You Write Code
+
+- **Open an issue first — always.** Every PR must reference a tracking issue (`Fixes #N` or `Closes #N`). Describe what you want to change and why. Wait for maintainer feedback before implementing. PRs without a prior issue discussion will be closed.
+- **Bug fixes and test additions** are the exception — these are welcome without prior discussion, as long as they're focused.
+
+### What Requires Explicit Maintainer Approval
+
+The following changes will not be merged without prior design discussion in an issue:
+
+- **API surface changes** — adding, removing, renaming, or changing defaults of MCP tools
+- **New pipeline passes or indexing algorithms** — anything that changes what gets extracted or how
+- **Build system / Makefile changes** — beyond trivial fixes
+- **Project configuration** — CLAUDE.md, skill files, .mcp.json, CI workflows
+- **New dependencies** — vendored or otherwise
+- **Breaking changes** of any kind
+
+If in doubt, open an issue and ask.
+
+### PR Scope and Size
+
+- **One issue per PR.** Each PR must address exactly one bug, one feature, or one refactor. Do not bundle multiple fixes or feature additions into a single PR. Kitchen-sink PRs will be closed with a request to split.
+- **Keep PRs small.** A good PR is under 500 lines. If your change is larger, split it into reviewable increments that each stand on their own.
+- **Don't mix features with fixes.** If you find a bug while implementing a feature, submit the bug fix as a separate PR.
+
+### Code Requirements
+
 - **C code only** — this project was rewritten from Go to pure C in v0.5.0. Go PRs will be acknowledged and potentially ported, but cannot be merged directly.
 - Include tests for new functionality
 - Run `scripts/test.sh` and `scripts/lint.sh` before submitting

Makefile.cbm

Lines changed: 3 additions & 1 deletion
@@ -118,7 +118,8 @@ EXTRACTION_SRCS = \
 	$(CBM_DIR)/extract_env_accesses.c \
 	$(CBM_DIR)/extract_k8s.c \
 	$(CBM_DIR)/helpers.c \
-	$(CBM_DIR)/lang_specs.c
+	$(CBM_DIR)/lang_specs.c \
+	$(CBM_DIR)/service_patterns.c
 
 # LSP resolvers (compiled as one unit via lsp_all.c)
 LSP_SRCS = $(CBM_DIR)/lsp_all.c
@@ -175,6 +176,7 @@ PIPELINE_SRCS = \
 	src/pipeline/pass_gitdiff.c \
 	src/pipeline/pass_configures.c \
 	src/pipeline/pass_configlink.c \
+	src/pipeline/pass_route_nodes.c \
 	src/pipeline/pass_enrichment.c \
 	src/pipeline/pass_envscan.c \
 	src/pipeline/pass_compile_commands.c \

internal/cbm/cbm.c

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -124,6 +124,11 @@ void cbm_typeassign_push(CBMTypeAssignArray *arr, CBMArena *a, CBMTypeAssign ta)
     arr->items[arr->count++] = ta;
 }
 
+void cbm_stringref_push(CBMStringRefArray *arr, CBMArena *a, CBMStringRef sr) {
+    GROW_ARRAY(arr, a);
+    arr->items[arr->count++] = sr;
+}
+
 void cbm_impltrait_push(CBMImplTraitArray *arr, CBMArena *a, CBMImplTrait it) {
     GROW_ARRAY(arr, a);
     arr->items[arr->count++] = it;
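
The new `cbm_stringref_push` follows the same grow-then-append shape as the other push helpers. The diff does not show `GROW_ARRAY` or `CBMArena`, so the sketch below only illustrates the general pattern under stated assumptions: `Arena`, `arena_alloc`, `StringRef`, `StringRefArray`, and `stringref_push` are hypothetical stand-ins, and growth is modeled as allocate-and-copy because a bump arena cannot resize a block in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bump arena over a caller-supplied buffer (stand-in for CBMArena).
 * The caller is assumed to pass an 8-byte-aligned base pointer. */
typedef struct {
    unsigned char *base;
    size_t used;
    size_t size;
} Arena;

/* Hand out `n` bytes at the next 8-byte-aligned offset, or NULL if exhausted. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (a->used + 7u) & ~(size_t)7u;
    if (aligned > a->size || n > a->size - aligned) {
        return NULL;
    }
    a->used = aligned + n;
    return a->base + aligned;
}

/* Hypothetical element and array types mirroring the items/count/cap layout. */
typedef struct {
    const char *value;
    int kind;
} StringRef;

typedef struct {
    StringRef *items;
    int count;
    int cap;
} StringRefArray;

/* Append one element, doubling capacity when the array is full.
 * The old block is left behind in the arena; it is reclaimed only when the
 * whole arena is reset. Returns 0 on success, -1 if the arena is exhausted. */
static int stringref_push(StringRefArray *arr, Arena *a, StringRef sr) {
    assert(arr && a);
    if (arr->count == arr->cap) {
        int new_cap = arr->cap ? arr->cap * 2 : 4;
        StringRef *grown = arena_alloc(a, (size_t)new_cap * sizeof(StringRef));
        if (!grown) {
            return -1;
        }
        if (arr->count > 0) {
            memcpy(grown, arr->items, (size_t)arr->count * sizeof(StringRef));
        }
        arr->items = grown;
        arr->cap = new_cap;
    }
    arr->items[arr->count++] = sr;
    return 0;
}
```

Doubling keeps the number of copies logarithmic in the final size, at the cost of abandoned blocks that live until the arena is released.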

internal/cbm/cbm.h

Lines changed: 33 additions & 2 deletions
@@ -110,6 +110,7 @@ typedef struct {
 typedef struct {
     const char *callee_name;       // raw callee text ("pkg.Func", "foo")
     const char *enclosing_func_qn; // QN of enclosing function (or module QN)
+    const char *first_string_arg;  // first string literal argument (URL, topic, key) or NULL
 } CBMCall;
 
 typedef struct {
@@ -149,6 +150,19 @@ typedef struct {
     const char *enclosing_func_qn; // QN of enclosing function
 } CBMTypeAssign;
 
+// String reference: URL, config key, or async target found in source.
+// Extracted from string literals during AST walk.
+typedef enum {
+    CBM_STRREF_URL = 0,    // REST path or full URL
+    CBM_STRREF_CONFIG = 1, // config file path or env var key
+} CBMStringRefKind;
+
+typedef struct {
+    const char *value;             // the string literal content
+    const char *enclosing_func_qn; // QN of enclosing function
+    CBMStringRefKind kind;         // URL, CONFIG
+} CBMStringRef;
+
 // Rust: impl Trait for Struct
 typedef struct {
     const char *trait_name; // trait name (raw text)
@@ -225,6 +239,12 @@ typedef struct {
     int cap;
 } CBMTypeAssignArray;
 
+typedef struct {
+    CBMStringRef *items;
+    int count;
+    int cap;
+} CBMStringRefArray;
+
 typedef struct {
     CBMImplTrait *items;
     int count;
@@ -246,6 +266,7 @@ typedef struct {
     CBMTypeAssignArray type_assigns;
     CBMImplTraitArray impl_traits;       // Rust: impl Trait for Struct pairs
     CBMResolvedCallArray resolved_calls; // LSP-resolved calls (high confidence)
+    CBMStringRefArray string_refs;       // URL/config string literals from AST
 
     const char *module_qn; // module qualified name
     const char **exports;  // NULL-terminated (NULL if none)
@@ -279,6 +300,14 @@ typedef struct {
 
 // --- Extraction context passed to sub-extractors ---
 
+// Module-level string constant map (for constant propagation)
+#define CBM_MAX_STRING_CONSTANTS 256
+typedef struct {
+    const char *names[CBM_MAX_STRING_CONSTANTS];
+    const char *values[CBM_MAX_STRING_CONSTANTS];
+    int count;
+} CBMStringConstantMap;
+
 typedef struct {
     CBMArena *arena;
     CBMFileResult *result;
@@ -289,8 +318,9 @@ typedef struct {
     const char *rel_path;
     const char *module_qn;
     TSNode root;
-    EFCache ef_cache;               // enclosing function cache
-    const char *enclosing_class_qn; // for nested class QN computation
+    EFCache ef_cache;                      // enclosing function cache
+    const char *enclosing_class_qn;        // for nested class QN computation
+    CBMStringConstantMap string_constants; // module-level NAME = "value" pairs
 } CBMExtractCtx;
 
 // --- Public API ---
@@ -346,6 +376,7 @@ void cbm_rw_push(CBMRWArray *arr, CBMArena *a, CBMReadWrite rw);
 void cbm_typerefs_push(CBMTypeRefArray *arr, CBMArena *a, CBMTypeRef tr);
 void cbm_envaccess_push(CBMEnvAccessArray *arr, CBMArena *a, CBMEnvAccess ea);
 void cbm_typeassign_push(CBMTypeAssignArray *arr, CBMArena *a, CBMTypeAssign ta);
+void cbm_stringref_push(CBMStringRefArray *arr, CBMArena *a, CBMStringRef sr);
 void cbm_impltrait_push(CBMImplTraitArray *arr, CBMArena *a, CBMImplTrait it);
 void cbm_resolvedcall_push(CBMResolvedCallArray *arr, CBMArena *a, CBMResolvedCall rc);
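
`CBMStringConstantMap` is a fixed-capacity pair of parallel arrays, so whatever fills it has to stop at `CBM_MAX_STRING_CONSTANTS` entries. The code that populates the map lives in extract_unified.c and is not part of this excerpt. The sketch below shows one way such a map could be filled and read, under assumptions: `StringConstantMap`, `constant_map_add`, and `constant_map_lookup` are hypothetical names, and silently dropping entries past the limit is a choice of the sketch, not confirmed behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STRING_CONSTANTS 256 /* same value as CBM_MAX_STRING_CONSTANTS in the diff */

/* Hypothetical mirror of CBMStringConstantMap: parallel name/value arrays. */
typedef struct {
    const char *names[MAX_STRING_CONSTANTS];
    const char *values[MAX_STRING_CONSTANTS];
    int count;
} StringConstantMap;

/* Record one module-level NAME = "value" pair.
 * Returns 1 if stored, 0 if the name is empty, the value is NULL, or the map
 * is full. Pointers are stored as-is, so both strings must outlive the map. */
static int constant_map_add(StringConstantMap *map, const char *name, const char *value) {
    assert(map != NULL);
    if (!name || !name[0] || !value) {
        return 0;
    }
    if (map->count >= MAX_STRING_CONSTANTS) {
        return 0;
    }
    map->names[map->count] = name;
    map->values[map->count] = value;
    map->count++;
    return 1;
}

/* Forward linear scan, as in lookup_string_constant from the diff:
 * the first entry recorded under a name is the one returned. */
static const char *constant_map_lookup(const StringConstantMap *map, const char *name) {
    assert(map != NULL);
    if (!name || !name[0]) {
        return NULL;
    }
    for (int i = 0; i < map->count; i++) {
        if (strcmp(map->names[i], name) == 0) {
            return map->values[i];
        }
    }
    return NULL;
}
```

With at most 256 entries a linear scan is cheap. A forward scan also means that a name assigned twice at module level resolves to its first value.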

internal/cbm/extract_calls.c

Lines changed: 183 additions & 3 deletions
@@ -8,6 +8,20 @@
 #include <string.h>
 #include <ctype.h>
 
+/* Look up a module-level string constant by name. */
+static const char *lookup_string_constant(const CBMExtractCtx *ctx, const char *name) {
+    if (!name || !name[0]) {
+        return NULL;
+    }
+    const CBMStringConstantMap *map = &ctx->string_constants;
+    for (int i = 0; i < map->count; i++) {
+        if (strcmp(map->names[i], name) == 0) {
+            return map->values[i];
+        }
+    }
+    return NULL;
+}
+
 // Forward declarations
 static void walk_calls(CBMExtractCtx *ctx, TSNode node, const CBMLangSpec *spec);
 static char *extract_callee_name(CBMArena *a, TSNode node, const char *source, CBMLanguage lang);
@@ -256,6 +270,46 @@ static void walk_calls(CBMExtractCtx *ctx, TSNode node, const CBMLangSpec *spec)
                 CBMCall call;
                 call.callee_name = callee;
                 call.enclosing_func_qn = cbm_enclosing_func_qn_cached(ctx, node);
+                call.first_string_arg = NULL;
+
+                /* Extract first string literal argument (URL, topic, key) */
+                TSNode args = ts_node_child_by_field_name(node, "arguments", 9);
+                if (!ts_node_is_null(args)) {
+                    uint32_t nc = ts_node_named_child_count(args);
+                    for (uint32_t ai = 0; ai < nc && ai < 3; ai++) {
+                        TSNode arg = ts_node_named_child(args, ai);
+                        const char *ak = ts_node_type(arg);
+                        if (strcmp(ak, "string") == 0 || strcmp(ak, "string_literal") == 0 ||
+                            strcmp(ak, "interpreted_string_literal") == 0 ||
+                            strcmp(ak, "raw_string_literal") == 0 ||
+                            strcmp(ak, "string_content") == 0) {
+                            char *text = cbm_node_text(ctx->arena, arg, ctx->source);
+                            if (text && text[0]) {
+                                /* Strip quotes */
+                                int len = (int)strlen(text);
+                                if (len >= 2 && (text[0] == '"' || text[0] == '\'')) {
+                                    text =
+                                        cbm_arena_strndup(ctx->arena, text + 1, (size_t)(len - 2));
+                                    len -= 2;
+                                }
+                                /* Validate: must be printable ASCII, no control chars */
+                                // NOLINTNEXTLINE(readability-implicit-bool-conversion)
+                                bool valid = (text != NULL && len > 0 && len < 512);
+                                for (int vi = 0; vi < len && valid; vi++) {
+                                    unsigned char ch = (unsigned char)text[vi];
+                                    if (ch < 0x20 && ch != '\t') {
+                                        valid = false;
+                                    }
+                                }
+                                if (valid) {
+                                    call.first_string_arg = text;
+                                }
+                            }
+                            break;
+                        }
+                    }
+                }
+
                 cbm_calls_push(&ctx->result->calls, ctx->arena, call);
             }
         }
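
Review note on the hunk above: the strip-quotes and validate steps appear again, nearly verbatim, in `handle_calls` further down. A shared helper pair along the lines below might remove the duplication. This is a sketch only. `strip_quotes` and `is_valid_string_arg` are hypothetical names, the helper writes into a caller buffer instead of the project's arena, and unlike the diff it also requires the closing quote to match the opening one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy `raw` into `out`, dropping one pair of surrounding quotes if present.
 * Stricter than the diff: the last character must match the opening quote.
 * Returns the resulting length, or -1 if `out` is too small. */
static int strip_quotes(const char *raw, char *out, size_t out_size) {
    assert(raw && out);
    size_t len = strlen(raw);
    const char *start = raw;
    if (len >= 2 && (raw[0] == '"' || raw[0] == '\'') && raw[len - 1] == raw[0]) {
        start = raw + 1;
        len -= 2;
    }
    if (len + 1 > out_size) {
        return -1;
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return (int)len;
}

/* Same acceptance rule as the diff: non-empty, shorter than 512 bytes, and no
 * control bytes below 0x20 other than tab. Bytes at or above 0x7F pass, so
 * this is not a strict printable-ASCII check. */
static bool is_valid_string_arg(const char *text) {
    if (!text) {
        return false;
    }
    size_t len = strlen(text);
    if (len == 0 || len >= 512) {
        return false;
    }
    for (size_t i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)text[i];
        if (ch < 0x20 && ch != '\t') {
            return false;
        }
    }
    return true;
}
```

A caller would strip first and validate second, storing the result in `first_string_arg` only when both succeed.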
@@ -294,7 +348,7 @@ static void extract_jsx_refs(CBMExtractCtx *ctx, TSNode node) {
         return;
     }
 
-    CBMCall call;
+    CBMCall call = {0};
     call.callee_name = name;
     call.enclosing_func_qn = cbm_enclosing_func_qn_cached(ctx, node);
     cbm_calls_push(&ctx->result->calls, ctx->arena, call);
@@ -321,9 +375,135 @@ void handle_calls(CBMExtractCtx *ctx, TSNode node, const CBMLangSpec *spec, Walk
     if (cbm_kind_in_set(node, spec->call_node_types)) {
         char *callee = extract_callee_name(ctx->arena, node, ctx->source, ctx->language);
         if (callee && callee[0] && !cbm_is_keyword(callee, ctx->language)) {
-            CBMCall call;
+            CBMCall call = {0};
             call.callee_name = callee;
             call.enclosing_func_qn = state->enclosing_func_qn;
+
+            /* Extract URL/topic/key from call arguments.
+             * Strategy: check keyword args first (url=, topic_id=, queue=),
+             * then first positional string, then resolve constant references. */
+            TSNode args = ts_node_child_by_field_name(node, "arguments", 9);
+            if (!ts_node_is_null(args)) {
+                /* Keyword patterns that indicate URL/topic/key */
+                static const char *url_keywords[] = {"url", "endpoint", "path", "uri",
+                                                     "target_url", "base_url", NULL};
+                static const char *topic_keywords[] = {"topic", "topic_id", "topic_name",
+                                                       "queue", "queue_name", "queue_id",
+                                                       "subject", "channel", NULL};
+
+                uint32_t nc = ts_node_named_child_count(args);
+                for (uint32_t ai = 0; ai < nc && !call.first_string_arg; ai++) {
+                    TSNode arg = ts_node_named_child(args, ai);
+                    const char *ak = ts_node_type(arg);
+
+                    /* Check keyword_argument nodes: url="...", topic_id="..." */
+                    if (strcmp(ak, "keyword_argument") == 0 || strcmp(ak, "pair") == 0) {
+                        TSNode key_node = ts_node_child_by_field_name(arg, "name", 4);
+                        TSNode val_node = ts_node_child_by_field_name(arg, "value", 5);
+                        if (ts_node_is_null(key_node)) {
+                            key_node = ts_node_child_by_field_name(arg, "key", 3);
+                        }
+                        if (ts_node_is_null(key_node) || ts_node_is_null(val_node)) {
+                            continue;
+                        }
+
+                        char *key = cbm_node_text(ctx->arena, key_node, ctx->source);
+                        if (!key) {
+                            continue;
+                        }
+
+                        /* Check if key matches URL or topic patterns */
+                        bool is_url_kw = false;
+                        bool is_topic_kw = false;
+                        for (int ki = 0; url_keywords[ki]; ki++) {
+                            if (strcmp(key, url_keywords[ki]) == 0) {
+                                is_url_kw = true;
+                                break;
+                            }
+                        }
+                        if (!is_url_kw) {
+                            for (int ki = 0; topic_keywords[ki]; ki++) {
+                                if (strcmp(key, topic_keywords[ki]) == 0) {
+                                    is_topic_kw = true;
+                                    break;
+                                }
+                            }
+                        }
+                        if (!is_url_kw && !is_topic_kw) {
+                            continue;
+                        }
+
+                        /* Extract value — string literal or constant reference */
+                        const char *vk = ts_node_type(val_node);
+                        if (strcmp(vk, "string") == 0 || strcmp(vk, "string_literal") == 0 ||
+                            strcmp(vk, "interpreted_string_literal") == 0 ||
+                            strcmp(vk, "raw_string_literal") == 0) {
+                            char *text = cbm_node_text(ctx->arena, val_node, ctx->source);
+                            if (text && text[0]) {
+                                int len = (int)strlen(text);
+                                if (len >= 2 && (text[0] == '"' || text[0] == '\'')) {
+                                    text =
+                                        cbm_arena_strndup(ctx->arena, text + 1, (size_t)(len - 2));
+                                }
+                                if (text && text[0]) {
+                                    call.first_string_arg = text;
+                                }
+                            }
+                        } else if (strcmp(vk, "identifier") == 0) {
+                            /* Constant reference: url=MY_URL_CONSTANT */
+                            char *const_name = cbm_node_text(ctx->arena, val_node, ctx->source);
+                            if (const_name) {
+                                const char *resolved = lookup_string_constant(ctx, const_name);
+                                if (resolved) {
+                                    call.first_string_arg = resolved;
+                                }
+                            }
+                        }
+                        continue;
+                    }
+
+                    /* First positional string argument (fallback) */
+                    if (ai < 3) {
+                        if (strcmp(ak, "string") == 0 || strcmp(ak, "string_literal") == 0 ||
+                            strcmp(ak, "interpreted_string_literal") == 0 ||
+                            strcmp(ak, "raw_string_literal") == 0) {
+                            char *text = cbm_node_text(ctx->arena, arg, ctx->source);
+                            if (text && text[0]) {
+                                int len = (int)strlen(text);
+                                if (len >= 2 && (text[0] == '"' || text[0] == '\'')) {
+                                    text =
+                                        cbm_arena_strndup(ctx->arena, text + 1, (size_t)(len - 2));
+                                    len -= 2;
+                                }
+                                /* Validate printable */
+                                // NOLINTNEXTLINE(readability-implicit-bool-conversion)
+                                bool valid = (text != NULL && len > 0 && len < 512);
+                                for (int vi = 0; vi < len && valid; vi++) {
+                                    if ((unsigned char)text[vi] < 0x20 && text[vi] != '\t') {
+                                        valid = false;
+                                    }
+                                }
+                                if (valid) {
+                                    call.first_string_arg = text;
+                                }
+                            }
+                            break;
+                        }
+                        /* Positional constant reference: create_task(MY_URL) */
+                        if (strcmp(ak, "identifier") == 0) {
+                            char *const_name = cbm_node_text(ctx->arena, arg, ctx->source);
+                            if (const_name) {
+                                const char *resolved = lookup_string_constant(ctx, const_name);
+                                if (resolved) {
+                                    call.first_string_arg = resolved;
+                                    break;
+                                }
+                            }
+                        }
+                    }
+                }
+            }
+
             cbm_calls_push(&ctx->result->calls, ctx->arena, call);
         }
     }
@@ -336,7 +516,7 @@ void handle_calls(CBMExtractCtx *ctx, TSNode node, const CBMLangSpec *spec, Walk
         if (!ts_node_is_null(name_node)) {
             char *name = cbm_node_text(ctx->arena, name_node, ctx->source);
             if (name && name[0] >= 'A' && name[0] <= 'Z') {
-                CBMCall call;
+                CBMCall call = {0};
                 call.callee_name = name;
                 call.enclosing_func_qn = state->enclosing_func_qn;
                 cbm_calls_push(&ctx->result->calls, ctx->arena, call);
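
Review note on the keyword-argument matching in `handle_calls`: the two inline loops over `url_keywords` and `topic_keywords` could be expressed as one table lookup. The sketch below reuses the keyword lists from the diff, but `in_keyword_table`, `classify_keyword`, and the `ArgKind` enum are hypothetical. The diff itself treats URL and topic matches identically after the check, so returning a kind is an extension for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `key` equals one entry of a NULL-terminated string table.
 * The comparison is case-sensitive, as with strcmp in the diff. */
static bool in_keyword_table(const char *key, const char *const *table) {
    assert(table != NULL);
    if (!key) {
        return false;
    }
    for (int i = 0; table[i]; i++) {
        if (strcmp(key, table[i]) == 0) {
            return true;
        }
    }
    return false;
}

/* Hypothetical classification result for a keyword-argument name. */
typedef enum { ARG_OTHER = 0, ARG_URL, ARG_TOPIC } ArgKind;

/* Classify a keyword-argument name using the two lists from the diff.
 * URL keywords are checked first, matching the order of the original loops. */
static ArgKind classify_keyword(const char *key) {
    static const char *const url_keywords[] = {"url", "endpoint", "path", "uri",
                                               "target_url", "base_url", NULL};
    static const char *const topic_keywords[] = {"topic", "topic_id", "topic_name",
                                                 "queue", "queue_name", "queue_id",
                                                 "subject", "channel", NULL};
    if (in_keyword_table(key, url_keywords)) {
        return ARG_URL;
    }
    if (in_keyword_table(key, topic_keywords)) {
        return ARG_TOPIC;
    }
    return ARG_OTHER;
}
```

Arguments classified as `ARG_OTHER` would be skipped, which corresponds to the `continue` in the diff when neither list matches.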
