fix(fuzz): eval-error isolation — typed events, circuit breaker, capsule-so-far#246

Merged
drewstone merged 1 commit into main from fix/explorer-eval-error-isolation
Jun 10, 2026

@drewstone

Contributor

A router 503 during a gate re-evaluation killed a live 24-run legal campaign and evaporated its capsule (2026-06-10 live runs). Root cause: exceptions thrown by evaluate, the gates, or minimize propagated out of the exploration loop.

  • Throws from the evaluate/gate/minimize boundary become typed eval-error progress events + stats.evalErrors — an infra axis, never folded into robustness or findings.
  • maxConsecutiveEvalErrors (default 5) circuit breaker: a dead backend stops the run with stats.stoppedEarly instead of burning the remaining budget; the capsule-so-far is complete and honest. Successes reset the streak.
  • Internal validation errors (fabricated costOf) still throw — programming mistakes stay loud.
  • HTML capsule: eval-errors KPI + early-stop banner.
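
The streak-and-reset behaviour described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `maxConsecutiveEvalErrors`, `evalErrors`, `stoppedEarly` and the `eval-error` event type are taken from the description, while `EvalErrorBreaker`, `recordSuccess`, `recordFailure`, `exploreAll` and the event's exact fields are hypothetical names chosen for the example.

```typescript
// Sketch only. maxConsecutiveEvalErrors, evalErrors, stoppedEarly and the
// 'eval-error' event type come from the PR description; every other name
// here is hypothetical.
interface EvalErrorEvent {
  type: "eval-error";
  scenarioId: string;
  message: string;
}

class EvalErrorBreaker {
  evalErrors = 0; // total failures, kept apart from findings
  stoppedEarly: string | undefined; // set once the breaker trips
  private consecutive = 0; // current failure streak
  private readonly max: number;

  constructor(maxConsecutiveEvalErrors = 5) {
    this.max = maxConsecutiveEvalErrors;
  }

  // A successful evaluation resets the streak but not the total.
  recordSuccess(): void {
    this.consecutive = 0;
  }

  // Converts a thrown value into a typed event and trips the breaker
  // when the streak reaches the configured maximum.
  recordFailure(scenarioId: string, err: unknown): EvalErrorEvent {
    this.evalErrors++;
    this.consecutive++;
    if (this.consecutive >= this.max && this.stoppedEarly === undefined) {
      this.stoppedEarly = `${this.consecutive} consecutive eval errors`;
    }
    const message = err instanceof Error ? err.message : String(err);
    return { type: "eval-error", scenarioId, message };
  }
}

// Hypothetical loop: failures become events instead of escaping the loop.
async function exploreAll(
  scenarioIds: string[],
  evaluate: (scenarioId: string) => Promise<number>,
  breaker: EvalErrorBreaker,
  onEvent: (event: EvalErrorEvent) => void,
): Promise<number> {
  let totalRuns = 0;
  for (const scenarioId of scenarioIds) {
    if (breaker.stoppedEarly !== undefined) break; // dead backend: stop here
    try {
      await evaluate(scenarioId);
    } catch (err) {
      onEvent(breaker.recordFailure(scenarioId, err));
      continue;
    }
    breaker.recordSuccess();
    totalRuns++;
  }
  return totalRuns;
}
```

Keeping the counters in a small object of their own means the loop never has to rethrow: whatever was collected before the breaker tripped is still available to the caller when `exploreAll` returns.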

5 new deterministic tests; full suite 2260 passing; typecheck clean. 0.90.1.

…aker, capsule-so-far (0.90.1)

evaluate/gates/minimize cross an external boundary; a thrown transport error
there (router 503 mid-gate-re-eval) previously killed the whole campaign and
evaporated the capsule. Now: each failure becomes a typed 'eval-error' progress
event + stats.evalErrors (an infra axis, never folded into robustness), and
consecutive failures trip a circuit breaker (maxConsecutiveEvalErrors, default
5) that stops the run with stats.stoppedEarly and a complete capsule-so-far.
Successes reset the streak. Internal validation errors (fabricated costOf)
stay loud. HTML capsule renders the eval-errors KPI + early-stop banner.

@tangletools left a comment

Contributor

✅ Auto-approved PR — 97a85d6e

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T21:21:39Z

@tangletools

Contributor

✅ No Blockers — 97a85d6e

Readiness 86/100 · Confidence 75/100 · 5 findings (5 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 89 | 86 | 86 |
| Confidence | 75 | 75 | 75 |
| Correctness | 89 | 86 | 86 |
| Security | 89 | 86 | 86 |
| Testing | 89 | 86 | 86 |
| Architecture | 89 | 86 | 86 |

Full multi-shot audit completed 3/3 planned shots over 7 changed files. Global verifier still owns final merge decision.

🟡 LOW Gate-thrown errors counted as evalErrors but don't increment runsUsed — semantic mismatch — src/fuzz/explorer.ts

Lines 261-267: When isValid or isUncontaminated gates throw, the catch block at line 285 increments evalErrors and consecutiveEvalErrors. However, the successful evaluation at line 245 has already incremented runsUsed (line 246) before reaching the gates. So a gate-thrown error double-counts: the run IS

🟡 LOW consecutiveEvalErrors counter not safe under concurrency > 1 — src/fuzz/explorer.ts

Lines 248, 290: this.consecutiveEvalErrors is reset to 0 on success (line 248) and incremented on error (line 290) without synchronization. pMap spawns concurrent workers when concurrency > 1 (line 307). Since JavaScript is single-threaded with cooperative scheduling, the await points between increment and chec

🟡 LOW eval-error event lacks structured operation discriminator — src/fuzz/explorer.ts

The ExploreEvent's eval-error type (line 216 in types.ts) carries cell, scenarioId, and message but no field indicating which operation threw — evaluate, isValid, isUncontaminated, or minimize. All four cross the same external boundary and all land in the same catch block at explorer.ts:285. For observability and debugging, a source?: 'evaluate' | 'gate' | 'minimize' discriminator would let monitoring dashboards distinguish dead-backend failures from broken-minimizer failures without parsing error messages.
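
A rough sketch of what the suggested discriminator could look like, and how a dashboard might consume it. The field names `cell`, `scenarioId`, `message` and the `source` union are taken from the finding above; the field types, `EvalErrorSource` and `countBySource` are assumptions made for illustration, since the real `ExploreEvent` definition is not shown here.

```typescript
// Assumed shape: only the field names come from the review finding.
type EvalErrorSource = "evaluate" | "gate" | "minimize";

interface EvalErrorEvent {
  type: "eval-error";
  cell: string; // real type of `cell` is not shown; string is a guess
  scenarioId: string;
  message: string;
  source?: EvalErrorSource; // optional, so existing emitters stay valid
}

// Hypothetical helper: tally failures per operation without parsing messages.
function countBySource(
  events: EvalErrorEvent[],
): Record<EvalErrorSource | "unknown", number> {
  const counts = { evaluate: 0, gate: 0, minimize: 0, unknown: 0 };
  for (const event of events) {
    counts[event.source ?? "unknown"]++;
  }
  return counts;
}
```

Making `source` optional keeps the change additive: consumers that ignore the field keep working, and events emitted without it land in the `unknown` bucket.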

🟡 LOW runsThisStep not incremented on eval-error path — step() reports runs=0 after only errors — src/fuzz/explorer.ts

Line 247: runsThisStep++ only fires on the success path inside the try block. When all evaluations in a step throw, runsThisStep stays 0. In run() line 326: if (runs === 0 && this.stoppedEarly === undefined) break — this is guarded correctly because stoppedEarly will be set by the circuit breaker. But for a step where some cells succeed and some fail (without tripping the breaker), runsThisStep only counts successes, not the total attempted work. This is actually the correct semantic (runs = successful evaluations consumed from budget), b

🟡 LOW Redundant test assertion duplicates the same invariant — src/fuzz/fuzz-agent.test.ts

Lines 264-266: expect(capsule.stats.totalRuns + capsule.stats.evalErrors).toBeGreaterThan(capsule.stats.totalRuns) is algebraically equivalent to evalErrors > 0, which is already asserted on line 258. The redundant assertion adds no coverage and could confuse readers about intent. Remove it or replace with a more specific check.
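
The equivalence claimed in this finding can be checked with a small sketch. `CapsuleStats` is a hypothetical shape that uses only the two field names quoted above, and the helper names are invented for the example; the check assumes both fields are ordinary non-negative counters well inside the safe-integer range.

```typescript
// Hypothetical stats shape, using the two field names quoted in the finding.
interface CapsuleStats {
  totalRuns: number;
  evalErrors: number;
}

// The assertion as quoted from the test.
function redundantCheck(s: CapsuleStats): boolean {
  return s.totalRuns + s.evalErrors > s.totalRuns;
}

// What it reduces to once totalRuns is subtracted from both sides.
function directCheck(s: CapsuleStats): boolean {
  return s.evalErrors > 0;
}

// Compares the two checks over every pair of small counter values.
function agreeOnGrid(limit: number): boolean {
  for (let totalRuns = 0; totalRuns <= limit; totalRuns++) {
    for (let evalErrors = 0; evalErrors <= limit; evalErrors++) {
      const s: CapsuleStats = { totalRuns, evalErrors };
      if (redundantCheck(s) !== directCheck(s)) return false;
    }
  }
  return true;
}
```

Since the two predicates agree on every counter pair in the grid, the second assertion cannot fail unless the first already has.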


tangletools · 2026-06-10T21:27:03Z · trace

@drewstone merged commit a289cbf into main Jun 10, 2026
1 check passed
@drewstone deleted the fix/explorer-eval-error-isolation branch June 10, 2026 22:09

2 participants