fix(fuzz): eval-error isolation — typed events, circuit breaker, capsule-so-far#246
…aker, capsule-so-far (0.90.1)

evaluate/gates/minimize cross an external boundary; a transport error thrown there (router 503 mid-gate-re-eval) previously killed the whole campaign and evaporated the capsule.

Now each failure becomes a typed 'eval-error' progress event plus a stats.evalErrors count (an infra axis, never folded into robustness), and consecutive failures trip a circuit breaker (maxConsecutiveEvalErrors, default 5) that stops the run with stats.stoppedEarly and a complete capsule-so-far. Successes reset the streak. Internal validation errors (fabricated costOf) stay loud. The HTML capsule renders the eval-errors KPI and an early-stop banner.
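The breaker behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/fuzz/explorer.ts: only `maxConsecutiveEvalErrors`, `evalErrors`, `stoppedEarly`, and the `eval-error` event name come from the description; `runWithBreaker`, the `run-complete` event, the sequential loop, and the string-valued `stoppedEarly` are assumptions made for the example.

```typescript
// Hypothetical event shapes; only the 'eval-error' name comes from the PR.
type ProgressEvent =
  | { type: 'eval-error'; scenarioId: string; message: string }
  | { type: 'run-complete'; scenarioId: string };

interface Stats {
  totalRuns: number;     // successful evaluations
  evalErrors: number;    // infra failures, tracked on their own axis
  stoppedEarly?: string; // assumed here to hold a human-readable reason
}

// Hypothetical driver: evaluates scenarios one by one and stops once
// `maxConsecutiveEvalErrors` failures occur in a row.
async function runWithBreaker(
  scenarioIds: string[],
  evaluate: (id: string) => Promise<void>,
  maxConsecutiveEvalErrors = 5,
  onEvent: (e: ProgressEvent) => void = () => {},
): Promise<Stats> {
  const stats: Stats = { totalRuns: 0, evalErrors: 0 };
  let streak = 0;

  for (const id of scenarioIds) {
    try {
      await evaluate(id);
      stats.totalRuns++;
      streak = 0; // any success resets the streak
      onEvent({ type: 'run-complete', scenarioId: id });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      stats.evalErrors++;
      streak++;
      onEvent({ type: 'eval-error', scenarioId: id, message });
      if (streak >= maxConsecutiveEvalErrors) {
        // Stop instead of burning the remaining budget; stats stay usable.
        stats.stoppedEarly = `${streak} consecutive eval errors`;
        break;
      }
    }
  }
  return stats;
}
```

Because the function returns `stats` on every path rather than rethrowing, the caller still has a complete result to build the capsule-so-far from when the breaker trips.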
tangletools
left a comment
✅ Auto-approved PR — 97a85d6e
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T21:21:39Z
✅ No Blockers

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 89 | 86 | 86 |
| Confidence | 75 | 75 | 75 |
| Correctness | 89 | 86 | 86 |
| Security | 89 | 86 | 86 |
| Testing | 89 | 86 | 86 |
| Architecture | 89 | 86 | 86 |

Full multi-shot audit completed 3/3 planned shots over 7 changed files. Global verifier still owns final merge decision.
🟡 LOW Gate-thrown errors counted as evalErrors but don't increment runsUsed — semantic mismatch — src/fuzz/explorer.ts
Lines 261-267: When the `isValid` or `isUncontaminated` gates throw, the catch block at line 285 increments `evalErrors` and `consecutiveEvalErrors`. However, the successful evaluation at line 245 has already incremented `runsUsed` (line 246) before reaching the gates. So a gate-thrown error double-counts: the run IS…
🟡 LOW consecutiveEvalErrors counter not safe under concurrency > 1 — src/fuzz/explorer.ts
Lines 248, 290: `this.consecutiveEvalErrors` is reset to 0 on success (line 248) and incremented on error (line 290) without synchronization. `pMap` spawns concurrent workers when `concurrency > 1` (line 307). Since JavaScript is single-threaded with cooperative scheduling, the `await` points between increment and chec…
🟡 LOW eval-error event lacks structured operation discriminator — src/fuzz/explorer.ts
The `ExploreEvent`'s `eval-error` type (line 216 in types.ts) carries `cell`, `scenarioId`, and `message` but no field indicating which operation threw: `evaluate`, `isValid`, `isUncontaminated`, or `minimize`. All four cross the same external boundary and all land in the same catch block at explorer.ts:285. For observability and debugging, a `source?: 'evaluate' | 'gate' | 'minimize'` discriminator would let monitoring dashboards distinguish dead-backend failures from broken-minimizer failures without parsing error messages.
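One way the suggested discriminator could be threaded through is sketched below. This is only an illustration under assumptions: the `cell`, `scenarioId`, `message`, and `source` field names come from the finding, but the `string` type for `cell`, the `Outcome` union, and the `attempt` and `toEvalErrorEvent` helpers are hypothetical and not part of the PR.

```typescript
type EvalErrorSource = 'evaluate' | 'gate' | 'minimize';

// Field names follow the finding; the concrete types are assumed.
interface EvalErrorEvent {
  type: 'eval-error';
  cell: string;
  scenarioId: string;
  message: string;
  source?: EvalErrorSource; // optional, so existing consumers keep working
}

// Hypothetical result type: a boundary call either succeeds or reports
// which operation failed, instead of throwing into a shared catch block.
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; source: EvalErrorSource; message: string };

// Hypothetical wrapper: runs one boundary call and tags any failure.
async function attempt<T>(
  source: EvalErrorSource,
  fn: () => Promise<T>,
): Promise<Outcome<T>> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, source, message };
  }
}

// Builds the progress event from a failed outcome.
function toEvalErrorEvent(
  failure: { source: EvalErrorSource; message: string },
  cell: string,
  scenarioId: string,
): EvalErrorEvent {
  return {
    type: 'eval-error',
    cell,
    scenarioId,
    message: failure.message,
    source: failure.source,
  };
}
```

A call site would wrap each operation separately, for example `attempt('gate', () => isValid(result))`, so a dashboard can group events by `source` without parsing `message`.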
🟡 LOW runsThisStep not incremented on eval-error path — step() reports runs=0 after only errors — src/fuzz/explorer.ts
Line 247: `runsThisStep++` only fires on the success path inside the try block. When all evaluations in a step throw, `runsThisStep` stays 0. In `run()` line 326: `if (runs === 0 && this.stoppedEarly === undefined) break`. This is guarded correctly because `stoppedEarly` will be set by the circuit breaker. But for a step where some cells succeed and some fail (without tripping the breaker), `runsThisStep` only counts successes, not the total attempted work. This is actually the correct semantic (runs = successful evaluations consumed from budget), b…
🟡 LOW Redundant test assertion duplicates the same invariant — src/fuzz/fuzz-agent.test.ts
Lines 264-266: `expect(capsule.stats.totalRuns + capsule.stats.evalErrors).toBeGreaterThan(capsule.stats.totalRuns)` is algebraically equivalent to `evalErrors > 0`, which is already asserted on line 258. The redundant assertion adds no coverage and could confuse readers about intent. Remove it or replace it with a more specific check.
tangletools · 2026-06-10T21:27:03Z · trace
A router 503 during a gate re-evaluation killed a live 24-run legal campaign and evaporated its capsule (2026-06-10 live runs). Root cause: evaluate/gates/minimize throws propagated out of the exploration loop.
- `eval-error` progress events + `stats.evalErrors`: an infra axis, never folded into robustness or findings.
- `maxConsecutiveEvalErrors` (default 5) circuit breaker: a dead backend stops the run with `stats.stoppedEarly` instead of burning the remaining budget; the capsule-so-far is complete and honest. Successes reset the streak.
- Internal validation errors (fabricated `costOf`) still throw: programming mistakes stay loud.

5 new deterministic tests; full suite 2260 passing; typecheck clean. 0.90.1.