feat(diagnose): causal sweep, responsibility scoring, replay-validated repair by drewstone · Pull Request #240 · tangle-network/agent-eval

drewstone · 2026-06-10T10:43:45Z

What

New ./diagnose subpath (tsup entry + package export, mirroring ./rl) that turns the dormant counterfactual primitives into a complete remediation chain:

fuzz finds → sweep blames → repair prescribes (validated) → findings / corpus / invariant remediate → gates verify

WHY a run failed — `causalSweep`

Orchestrates reps × steps × mutations within a hard replay budget, composed entirely over the existing runCounterfactual seam (CounterfactualRunner.executeFrom stays the execution boundary).
Per-step responsibility = mean of per-rep score deltas + bootstrap CI (`confidenceInterval`, seeded), ranked by |meanEffect|. `reps` is REQUIRED and must be >= 2: a single intervention delta is one stochastic draw, not a measurement.
Kind-level aggregate reuses the existing attributeCounterfactuals (exposed as byMutationKind).
Budget exhaustion halts mid-cell rather than emitting weakened CIs; every unprobed step is named in `uncovered`, never silently dropped.
Default probes are the payload-free existing mutation kinds: swap-tool-result knockout (newResult: null) for tool spans, truncate-after re-roll for llm spans. swap-model / inject-system-message are opt-in via mutationsPerStep since they need consumer payloads.
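The scoring and budget rules above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `./diagnose`: `sweep`, `probe`, `bootstrapCI`, and the report shape are hypothetical names, a synchronous `probe(step)` stands in for one replay through `runCounterfactual`, and the percentile bootstrap is an assumed method (the PR only says "bootstrap CI"). A step whose cell cannot be completed within budget is discarded and reported as uncovered, which is one reading of "halts mid-cell".

```typescript
type Rng = () => number;

// mulberry32: small seeded PRNG so resampling is reproducible.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;

// Percentile bootstrap CI of the mean (assumed method, for illustration).
function bootstrapCI(xs: number[], rng: Rng, resamples = 1000, level = 0.95): [number, number] {
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < xs.length; i++) sum += xs[Math.floor(rng() * xs.length)];
    means.push(sum / xs.length);
  }
  means.sort((x, y) => x - y);
  const tail = (1 - level) / 2;
  return [means[Math.floor(tail * (resamples - 1))], means[Math.ceil((1 - tail) * (resamples - 1))]];
}

interface StepScore { step: number; meanEffect: number; ci: [number, number]; deltas: number[] }
interface SweepReport { ranked: StepScore[]; uncovered: number[]; replaysUsed: number }

// Hypothetical sweep: `probe(step)` returns the score delta of one replay.
function sweep(steps: number[], reps: number, budget: number,
               probe: (step: number) => number, seed = 1): SweepReport {
  if (!Number.isInteger(reps) || reps < 2) throw new Error("reps must be an integer >= 2");
  const rng = mulberry32(seed);
  const ranked: StepScore[] = [];
  const uncovered: number[] = [];
  let used = 0;
  for (const step of steps) {
    const deltas: number[] = [];
    while (deltas.length < reps && used < budget) {
      deltas.push(probe(step));
      used++;
    }
    // An incomplete cell is named as uncovered instead of yielding a weak CI.
    if (deltas.length < reps) { uncovered.push(step); continue; }
    ranked.push({ step, meanEffect: mean(deltas), ci: bootstrapCI(deltas, rng), deltas });
  }
  ranked.sort((x, y) => Math.abs(y.meanEffect) - Math.abs(x.meanEffect));
  return { ranked, uncovered, replaysUsed: used };
}

// Usage: step 1 is the fault (large negative delta), step 0 is pure noise,
// and the budget of 20 runs out partway through step 2.
const noise = mulberry32(42);
const probe = (step: number): number => (step === 1 ? -0.5 : 0) + (noise() - 0.5) * 0.1;
const report = sweep([0, 1, 2], 8, 20, probe);
console.log(report.ranked.map((s) => s.step), report.uncovered);
```

Ranking on the absolute mean keeps strongly harmful and strongly helpful steps at the top alike; the CI is what separates a real effect from noise.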

WHAT should have happened — `prescribeRepair`

Consumer-supplied proposeFix(step, context) (LLM-backed in live use) proposes candidate mutations for the blamed steps.
A candidate becomes a repair ONLY when EVERY validation replay crosses `flipThreshold`: machine-verified, never speculated. Non-flippers land in `rejected` with reason: 'did-not-flip' plus the observed delta; replay errors land there with reason: 'error' plus the message.
First validated repair per step is the prescription; remaining candidates stay untried (no fabricated verdicts).
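A minimal sketch of the every-rep validation rule, under stated assumptions: `validateCandidates`, `Candidate`, and the synchronous `replay` callback are hypothetical stand-ins (the real replay seam is presumably asynchronous), and "crosses" is taken to mean `delta >= flipThreshold`.

```typescript
// Hypothetical shapes for illustration; the real types in ./diagnose may differ.
type Candidate = { id: string };
type Rejected =
  | { candidate: Candidate; reason: "did-not-flip"; delta: number }
  | { candidate: Candidate; reason: "error"; message: string };
interface Prescription { repair?: Candidate; rejected: Rejected[]; untried: Candidate[] }

function validateCandidates(
  candidates: Candidate[],
  replay: (candidate: Candidate, rep: number) => number, // returns a score delta
  reps: number,
  flipThreshold: number,
): Prescription {
  if (reps < 1) throw new Error("need at least one validation replay");
  const rejected: Rejected[] = [];
  for (let i = 0; i < candidates.length; i++) {
    const candidate = candidates[i];
    try {
      let failedDelta: number | undefined;
      for (let rep = 0; rep < reps; rep++) {
        const delta = replay(candidate, rep);
        // One replay below the threshold is enough to reject: every rep must flip.
        if (delta < flipThreshold) { failedDelta = delta; break; }
      }
      if (failedDelta === undefined) {
        // First validated candidate wins; the rest stay untried, with no verdict.
        return { repair: candidate, rejected, untried: candidates.slice(i + 1) };
      }
      rejected.push({ candidate, reason: "did-not-flip", delta: failedDelta });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      rejected.push({ candidate, reason: "error", message });
    }
  }
  return { rejected, untried: [] };
}

// Usage: "a" errors, "b" flips on average but not on every rep, "c" always flips.
const fakeReplay = (c: Candidate, rep: number): number => {
  if (c.id === "a") throw new Error("replay crashed");
  if (c.id === "b") return rep === 1 ? 0.1 : 0.9;
  return 0.8;
};
const prescription = validateCandidates(
  [{ id: "a" }, { id: "b" }, { id: "c" }, { id: "d" }], fakeReplay, 3, 0.5);
console.log(prescription.repair?.id, prescription.rejected.map((r) => r.reason));
```

Candidate "b" averages about 0.63 across its three replays, well above the threshold, yet it is still rejected; that is the difference between every-rep and on-average enforcement.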

HOW to make it happen — remediation adapters into existing machinery

toAnalystFindings(report, repairs?) → AnalystFinding[] via the real makeFinding; severity scales with |meanEffect| but is CI-gated (an effect whose CI includes zero is info and must not steer priority); evidence carries stepRefs + raw deltas + CI + replay run ids; validated repairs set recommended_action + validation_plan.
toCorpusRecord(run, repair) → CorpusRecord pinning the failure as a permanent scenario (fresh runId so corpus dedup keeps both; validateRunRecord at the boundary).
suggestInvariant(repair) → plain-data { description, never?, without? } hint in the shape the trace-contracts track consumes.
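The CI gate on severity might look like the sketch below. The severity labels other than `info` and the numeric cut-offs are hypothetical, chosen only to illustrate the rule; the actual mapping inside `toAnalystFindings` is not shown in this PR description.

```typescript
// Hypothetical severity scale; only "info" is named in the description above.
type Severity = "info" | "low" | "medium" | "high";

function severityFor(meanEffect: number, ci: [number, number]): Severity {
  const [lo, hi] = ci;
  // Gate first: an interval that includes zero cannot steer priority,
  // no matter how large the point estimate looks.
  if (lo <= 0 && hi >= 0) return "info";
  const size = Math.abs(meanEffect);
  if (size >= 0.5) return "high"; // illustrative cut-off
  if (size >= 0.2) return "medium"; // illustrative cut-off
  return "low";
}

console.log(severityFor(0.9, [-0.1, 1.9])); // large effect, but CI spans zero
console.log(severityFor(-0.6, [-0.8, -0.4])); // large effect, CI excludes zero
```

Checking the interval before the effect size means a noisy step with a large point estimate can never outrank a smaller but well-measured one.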

Grounding (recon-first)

Read before building: src/counterfactual.ts, src/causal-attribution.ts, src/replay.ts, src/bisector.ts, src/statistics.ts, src/analyst/types.ts, src/rl/corpus.ts, plus the runCounterfactual tests in tests/tier2.test.ts. Nothing duplicated — the sweep is pure orchestration over runCounterfactual + attributeCounterfactuals + confidenceInterval.

One spec deviation: `mutationsPerStep` is `(step) => CounterfactualMutation[]` rather than a flat `CounterfactualMutation[]`, because each mutation carries a step-bound `at` field and applicability depends on the span kind; returned mutations are validated to target the step they were asked for (fail-loud).
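To illustrate why the per-step function shape helps, here is a rough sketch. The `Step` and `Mutation` types are simplified stand-ins (the real `at` field may not be a numeric index), and `mutationsFor` is a hypothetical helper; only the mutation kind names and the `newResult: null` knockout come from the description above.

```typescript
// Simplified, hypothetical shapes; not the real CounterfactualMutation type.
type Mutation = { kind: string; at: number; newResult?: unknown };
type Step = { index: number; spanKind: "tool" | "llm" };
type MutationsPerStep = (step: Step) => Mutation[];

// Fail loud if a returned mutation targets a different step than requested.
function mutationsFor(step: Step, perStep: MutationsPerStep): Mutation[] {
  const mutations = perStep(step);
  for (const m of mutations) {
    if (m.at !== step.index) {
      throw new Error(`mutation "${m.kind}" targets step ${m.at}, expected ${step.index}`);
    }
  }
  return mutations;
}

// Payload-free defaults, chosen by span kind as described above.
const defaultProbes: MutationsPerStep = (step) =>
  step.spanKind === "tool"
    ? [{ kind: "swap-tool-result", at: step.index, newResult: null }]
    : [{ kind: "truncate-after", at: step.index }];

console.log(mutationsFor({ index: 3, spanKind: "tool" }, defaultProbes));
```

A flat array could not express this, since each entry would already be pinned to one `at` and could not adapt to the span kind of the step being probed.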

Tests

16 deterministic tests in tests/diagnose.test.ts faking the execution seam the same way tier2.test.ts does (seeded mulberry32 noise, no LLM calls): fault step ranked #1 with CI excluding zero vs no-effect step CI including zero; uncovered named under tight budgets; repairs emit only flipping mutations with non-flippers/errors in rejected; every-rep (not on-average) flip enforcement; adapters produce schema-valid outputs.

pnpm typecheck ✓ · pnpm test 208 files / 1998 passed ✓ · pnpm build ✓ (dist/diagnose.js + .d.ts verified importable)

No version bump (release sequenced by the program lead). No root src/index.ts changes — subpath-only, rebase-friendly.

…d repair

New ./diagnose subpath orchestrating the dormant counterfactual primitives into a three-stage remediation chain:

- causalSweep: reps x steps x mutations within a hard replay budget, composed over runCounterfactual; per-step mean effect + bootstrap CI (confidenceInterval) ranked by |meanEffect|; kind-level aggregate via attributeCounterfactuals; budget exhaustion names uncovered steps.
- prescribeRepair: consumer-supplied proposeFix candidates are machine-verified by replaying WITH the mutation; a repair counts only when every validation rep crosses flipThreshold; non-flippers and replay errors land in rejected with typed reasons.
- Remediation adapters into existing machinery: toAnalystFindings (makeFinding, severity from effect size, CI-gated), toCorpusRecord (pins the failure as a permanent corpus scenario, validateRunRecord at the boundary), suggestInvariant (never/without hint shape for trace contracts).

Deterministic tests fake the CounterfactualRunner seam with seeded mulberry32 noise; no LLM calls.

tangletools

✅ Auto-approved PR — `a17dca38`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:43:52Z}

tangletools

✅ Auto-approved PR — `f98d73ae`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:55:18Z}

tangletools previously approved these changes Jun 10, 2026

merge main into branch (frontier merge sequencing)

f98d73a

drewstone dismissed tangletools’s stale review via f98d73a June 10, 2026 10:55

tangletools approved these changes Jun 10, 2026

drewstone merged commit ea03b8c into main Jun 10, 2026
1 check passed

drewstone merged 2 commits into main from feat/diagnose-causal-chain
