feat(diagnose): causal sweep, responsibility scoring, replay-validated repair #240
Merged
…d repair

New ./diagnose subpath orchestrating the dormant counterfactual primitives into a three-stage remediation chain:
- causalSweep: reps x steps x mutations within a hard replay budget, composed over runCounterfactual; per-step mean effect + bootstrap CI (confidenceInterval) ranked by |meanEffect|; kind-level aggregate via attributeCounterfactuals; budget exhaustion names uncovered steps.
- prescribeRepair: consumer-supplied proposeFix candidates are machine-verified by replaying WITH the mutation; a repair counts only when every validation rep crosses flipThreshold; non-flippers and replay errors land in rejected with typed reasons.
- Remediation adapters into existing machinery: toAnalystFindings (makeFinding, severity from effect size, CI-gated), toCorpusRecord (pins the failure as a permanent corpus scenario, validateRunRecord at the boundary), suggestInvariant (never/without hint shape for trace contracts).

Deterministic tests fake the CounterfactualRunner seam with seeded mulberry32 noise; no LLM calls.
tangletools
previously approved these changes
Jun 10, 2026
✅ Auto-approved PR — a17dca38
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:43:52Z
tangletools
approved these changes
Jun 10, 2026
✅ Auto-approved PR — f98d73ae
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:55:18Z
What
New `./diagnose` subpath (tsup entry + package export, mirroring `./rl`) that turns the dormant counterfactual primitives into a complete remediation chain:

fuzz finds → sweep blames → repair prescribes (validated) → findings / corpus / invariant remediate → gates verify

WHY a run failed:
`causalSweep`
- reps x steps x mutations within a hard replay budget, composed over the `runCounterfactual` seam (`CounterfactualRunner.executeFrom` stays the execution boundary).
- Per-step mean effect + bootstrap CI (`confidenceInterval`, seeded), ranked by |meanEffect|.
- `reps` is REQUIRED and >= 2: a single intervention delta is one stochastic draw, not a measurement.
- Kind-level aggregate via `attributeCounterfactuals` (exposed as `byMutationKind`).
- Budget exhaustion names the steps it could not cover in `uncovered`; never silent.
- `swap-tool-result` knockout (`newResult: null`) for tool spans, `truncate-after` re-roll for llm spans. `swap-model` / `inject-system-message` are opt-in via `mutationsPerStep` since they need consumer payloads.

WHAT should have happened:
`prescribeRepair`
- Consumer-supplied `proposeFix(step, context)` (LLM-backed in live use) proposes candidate mutations for the blamed steps.
- Each candidate is replayed WITH the mutation; a repair counts only when every validation rep crosses `flipThreshold`: machine-verified, never speculated. Non-flippers land in `rejected` with `reason: 'did-not-flip'` + the observed delta; replay errors land with `reason: 'error'` + the message.

HOW to make it happen: remediation adapters into existing machinery
- `toAnalystFindings(report, repairs?)` → `AnalystFinding[]` via the real `makeFinding`; severity scales with |meanEffect| but is CI-gated (an effect whose CI includes zero is `info` and must not steer priority); evidence carries stepRefs + raw deltas + CI + replay run ids; validated repairs set `recommended_action` + `validation_plan`.
- `toCorpusRecord(run, repair)` → `CorpusRecord` pinning the failure as a permanent scenario (fresh runId so corpus dedup keeps both; `validateRunRecord` at the boundary).
- `suggestInvariant(repair)` → plain-data `{ description, never?, without? }` hint in the shape the trace-contracts track consumes.

Grounding (recon-first)
Read before building: `src/counterfactual.ts`, `src/causal-attribution.ts`, `src/replay.ts`, `src/bisector.ts`, `src/statistics.ts`, `src/analyst/types.ts`, `src/rl/corpus.ts`, plus the `runCounterfactual` tests in `tests/tier2.test.ts`. Nothing duplicated: the sweep is pure orchestration over `runCounterfactual` + `attributeCounterfactuals` + `confidenceInterval`.

One spec deviation: `mutationsPerStep` is `(step) => CounterfactualMutation[]` rather than a flat `CounterfactualMutation[]`, because mutations carry a step-bound `at` field and applicability is span-kind-dependent; returned mutations are validated to target the step they were asked for (fail-loud).

Tests
16 deterministic tests in `tests/diagnose.test.ts`, faking the execution seam the same way `tier2.test.ts` does (seeded mulberry32 noise, no LLM calls):
- fault step ranked #1 with CI excluding zero vs no-effect step CI including zero;
- `uncovered` named under tight budgets;
- repairs emit only flipping mutations, with non-flippers/errors in `rejected`;
- every-rep (not on-average) flip enforcement;
- adapters produce schema-valid outputs.

`pnpm typecheck` ✓ · `pnpm test` 208 files / 1998 passed ✓ · `pnpm build` ✓ (dist/diagnose.js + .d.ts verified importable)

No version bump (release sequenced by the program lead). No root `src/index.ts` changes: subpath-only, rebase-friendly.
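
For readers unfamiliar with the statistics the tests lean on, here is a minimal standalone sketch of the three ideas: seeded mulberry32 noise, a per-step mean effect with a percentile bootstrap CI ranked by |meanEffect|, and every-rep (not on-average) flip enforcement. It does not use the package's `causalSweep`, `confidenceInterval`, or `prescribeRepair`; every name below (`bootstrapCI`, `scoreStep`, `flipsEveryRep`, the simulated deltas, and treating "crosses" as `>=`) is a hypothetical stand-in chosen for illustration.

```typescript
// Hypothetical sketch: none of these names are the real ./diagnose API.

// mulberry32: a small seeded PRNG, so every "random" draw is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;

// Percentile bootstrap: resample with replacement, take the 2.5% / 97.5% means.
function bootstrapCI(xs: number[], rand: () => number, resamples = 2000): [number, number] {
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < xs.length; i++) sum += xs[Math.floor(rand() * xs.length)];
    means.push(sum / xs.length);
  }
  means.sort((a, b) => a - b);
  return [means[Math.floor(0.025 * resamples)], means[Math.floor(0.975 * resamples) - 1]];
}

interface StepScore {
  step: string;
  meanEffect: number;
  ci: [number, number];
  ciExcludesZero: boolean;
}

function scoreStep(step: string, deltas: number[], rand: () => number): StepScore {
  // One delta is a single stochastic draw, not a measurement.
  if (deltas.length < 2) throw new Error("reps must be >= 2");
  const ci = bootstrapCI(deltas, rand);
  return { step, meanEffect: mean(deltas), ci, ciExcludesZero: ci[0] > 0 || ci[1] < 0 };
}

// Assumption: "crosses flipThreshold" means delta >= threshold on EVERY rep.
const flipsEveryRep = (deltas: number[], flipThreshold: number): boolean =>
  deltas.every((d) => d >= flipThreshold);

// Simulated intervention deltas: a real effect near 0.6 vs pure noise around 0.
const rand = mulberry32(42);
const noise = (): number => (rand() - 0.5) * 0.2; // uniform in [-0.1, 0.1)
const reps = 8;
const faultDeltas = Array.from({ length: reps }, () => 0.6 + noise());
const noEffectDeltas = Array.from({ length: reps }, () => noise());

// Rank by |meanEffect|, largest first.
const ranked = [
  scoreStep("no-effect", noEffectDeltas, rand),
  scoreStep("fault", faultDeltas, rand),
].sort((a, b) => Math.abs(b.meanEffect) - Math.abs(a.meanEffect));

for (const s of ranked) console.log(s.step, s.meanEffect.toFixed(3), s.ci, s.ciExcludesZero);
```

Every fault delta lies in [0.5, 0.7), so its CI cannot include zero and it always ranks first; the no-effect step's CI will usually, though not provably, straddle zero. The every-rep rule is deliberately stricter than an average: `[0.62, 0.58, 0.31]` has a mean above 0.5, yet `flipsEveryRep` rejects it at a threshold of 0.5 because one rep failed to flip.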