fix(fuzz): eval-error isolation — typed events, circuit breaker, capsule-so-far by drewstone · Pull Request #246 · tangle-network/agent-eval

drewstone · 2026-06-10T21:21:32Z

A router 503 during a gate re-evaluation killed a live 24-run legal campaign and evaporated its capsule (2026-06-10 live runs). Root cause: exceptions thrown by evaluate, the gates, or minimize propagated uncaught out of the exploration loop.

- Errors thrown at the evaluate/gate/minimize boundary become typed `eval-error` progress events and are counted in `stats.evalErrors`, an infra axis that is never folded into robustness or findings.
- `maxConsecutiveEvalErrors` (default 5) circuit breaker: a dead backend stops the run with `stats.stoppedEarly` instead of burning the remaining budget, and the capsule-so-far is complete and honest. Any success resets the streak.
- Internal validation errors (e.g. a fabricated `costOf`) still throw, so programming mistakes stay loud.
- HTML capsule: eval-errors KPI and early-stop banner.
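A minimal TypeScript sketch of the isolation pattern described above, assuming a sequential loop and an async `evaluate` callback. Only the `eval-error` event, `stats.evalErrors`, `stats.stoppedEarly` and `maxConsecutiveEvalErrors` come from this PR; `Explorer`, `runOne`, `ValidationError` and the stop message are hypothetical stand-ins, not the agent-eval code.

```typescript
// Hypothetical sketch, not the agent-eval implementation. Only the 'eval-error'
// event, stats.evalErrors, stats.stoppedEarly and maxConsecutiveEvalErrors
// come from the PR description; every other name is illustrative.

type ExploreEvent =
  | { type: 'run'; scenarioId: string }
  | { type: 'eval-error'; scenarioId: string; message: string };

// Hypothetical marker for internal programming mistakes (e.g. a fabricated costOf).
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'ValidationError';
  }
}

interface Stats {
  totalRuns: number;
  evalErrors: number;
  stoppedEarly?: string;
}

class Explorer {
  readonly stats: Stats = { totalRuns: 0, evalErrors: 0 };
  readonly events: ExploreEvent[] = [];
  private consecutiveEvalErrors = 0;
  private readonly evaluate: (scenarioId: string) => Promise<number>;
  private readonly maxConsecutiveEvalErrors: number;

  constructor(evaluate: (scenarioId: string) => Promise<number>, maxConsecutiveEvalErrors = 5) {
    this.evaluate = evaluate;
    this.maxConsecutiveEvalErrors = maxConsecutiveEvalErrors;
  }

  private async runOne(scenarioId: string): Promise<void> {
    try {
      await this.evaluate(scenarioId);
      this.stats.totalRuns++;
      this.consecutiveEvalErrors = 0; // any success resets the streak
      this.events.push({ type: 'run', scenarioId });
    } catch (err) {
      // Programming mistakes stay loud: rethrow instead of counting them.
      if (err instanceof Error && err.name === 'ValidationError') throw err;
      const message = err instanceof Error ? err.message : String(err);
      // Infra axis only: never folded into robustness or findings.
      this.stats.evalErrors++;
      this.consecutiveEvalErrors++;
      this.events.push({ type: 'eval-error', scenarioId, message });
      if (this.consecutiveEvalErrors >= this.maxConsecutiveEvalErrors) {
        this.stats.stoppedEarly = `${this.consecutiveEvalErrors} consecutive eval errors`;
      }
    }
  }

  // Returns the stats accumulated so far, even when the breaker trips.
  async run(scenarioIds: string[]): Promise<Stats> {
    for (const id of scenarioIds) {
      if (this.stats.stoppedEarly !== undefined) break;
      await this.runOne(id);
    }
    return this.stats;
  }
}
```

Because `run` checks `stoppedEarly` before each attempt and returns whatever it has accumulated, a dead backend costs at most `maxConsecutiveEvalErrors` attempts in this sketch, and the caller still receives a capsule-so-far rather than an exception.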

5 new deterministic tests; full suite 2260 passing; typecheck clean. 0.90.1.

…aker, capsule-so-far (0.90.1)

evaluate/gates/minimize cross an external boundary; a thrown transport error there (router 503 mid-gate-re-eval) previously killed the whole campaign and evaporated the capsule. Now: each failure becomes a typed 'eval-error' progress event + stats.evalErrors (an infra axis, never folded into robustness), and consecutive failures trip a circuit breaker (maxConsecutiveEvalErrors, default 5) that stops the run with stats.stoppedEarly and a complete capsule-so-far. Successes reset the streak. Internal validation errors (fabricated costOf) stay loud. HTML capsule renders the eval-errors KPI + early-stop banner.
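One way the two capsule elements could be rendered, sketched under assumptions: `renderEvalErrorSummary`, the markup and the class names are hypothetical, and only the idea of an eval-errors KPI plus an early-stop banner comes from the PR. The stop reason is HTML-escaped because it may echo an upstream error message.

```typescript
// Hypothetical rendering of the eval-errors KPI and early-stop banner.
// Function name, markup and class names are illustrative assumptions.
interface CapsuleStats {
  totalRuns: number;
  evalErrors: number;
  stoppedEarly?: string;
}

// Escape text that may originate from an upstream error message.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

function renderEvalErrorSummary(stats: CapsuleStats): string {
  const kpi =
    `<div class="kpi"><span class="label">eval errors</span>` +
    `<span class="value">${stats.evalErrors}</span></div>`;
  if (stats.stoppedEarly === undefined) return kpi;
  // The banner only appears when the circuit breaker stopped the run.
  const banner =
    `<div class="banner early-stop" role="alert">Run stopped early: ` +
    `${escapeHtml(stats.stoppedEarly)}. Results cover ${stats.totalRuns} completed runs.</div>`;
  return banner + kpi;
}
```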

tangletools

✅ Auto-approved PR — `97a85d6e`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T21:21:39Z}

tangletools · 2026-06-10T21:27:05Z

✅ No Blockers — `97a85d6e`

Readiness 86/100 · Confidence 75/100 · 5 findings (5 low)

| Metric | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 89 | 86 | 86 |
| Confidence | 75 | 75 | 75 |
| Correctness | 89 | 86 | 86 |
| Security | 89 | 86 | 86 |
| Testing | 89 | 86 | 86 |
| Architecture | 89 | 86 | 86 |

Full multi-shot audit completed 3/3 planned shots over 7 changed files. Global verifier still owns final merge decision.

🟡 LOW Gate-thrown errors counted as evalErrors but don't increment runsUsed — semantic mismatch — src/fuzz/explorer.ts

Lines 261-267: when the isValid or isUncontaminated gate throws, the catch block at line 285 increments evalErrors and consecutiveEvalErrors. However, the successful evaluation at line 245 has already incremented runsUsed (line 246) before the gates run, so a gate-thrown error double-counts: the run IS…
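A small model of the ordering this finding describes, assuming budget is charged as soon as `evaluate` succeeds and before the gates run. `attempt` and `Counters` are hypothetical and only mirror the sequence the reviewer quotes; they are not explorer.ts.

```typescript
// Hypothetical model of the accounting order described in the finding.
interface Counters {
  attempts: number;
  runsUsed: number;
  evalErrors: number;
}

function attempt(c: Counters, evaluateOk: boolean, gateOk: boolean): void {
  c.attempts++;
  try {
    if (!evaluateOk) throw new Error('evaluate: router 503');
    c.runsUsed++; // budget is charged as soon as evaluate() succeeds...
    if (!gateOk) throw new Error('isValid gate: router 503'); // ...before the gates run
  } catch {
    c.evalErrors++;
  }
}

const c: Counters = { attempts: 0, runsUsed: 0, evalErrors: 0 };
attempt(c, true, true); // clean run
attempt(c, true, false); // gate throws after the run was charged
attempt(c, false, true); // evaluate itself throws
// c.attempts === 3, but c.runsUsed + c.evalErrors === 4
```

Under this ordering, `runsUsed + evalErrors` overstates the number of attempts whenever a gate throws, so the two counters cannot simply be summed to get total work.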

🟡 LOW consecutiveEvalErrors counter not safe under concurrency > 1 — src/fuzz/explorer.ts

Lines 248, 290: this.consecutiveEvalErrors is reset to 0 on success (line 248) and incremented on error (line 290) without synchronization. pMap spawns concurrent workers when concurrency > 1 (line 307). Since JavaScript is single-threaded with cooperative scheduling, the await points between increment and chec…
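In JavaScript, code between two `await` points runs to completion without interruption, so a race on the streak counter can only arise if an `await` sits between the increment and the threshold check. The sketch below keeps both in one synchronous method. `Breaker` and `runPool` are hypothetical names for a pMap-style pool, not the agent-eval or p-map API.

```typescript
// Hypothetical sketch; Breaker and runPool are illustrative, not a real API.
class Breaker {
  tripped = false;
  private streak = 0;
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  // Synchronous: the streak update and the threshold check run with no await
  // in between, so no other worker can interleave with them.
  record(ok: boolean): void {
    this.streak = ok ? 0 : this.streak + 1;
    if (this.streak >= this.max) this.tripped = true;
  }
}

// pMap-style pool: `concurrency` workers pull tasks from a shared index and
// stop claiming new work once the breaker trips. Returns how many tasks started.
async function runPool(
  tasks: Array<() => Promise<void>>,
  concurrency: number,
  breaker: Breaker,
): Promise<number> {
  let next = 0;
  let started = 0;
  async function worker(): Promise<void> {
    while (!breaker.tripped && next < tasks.length) {
      const task = tasks[next++]; // claimed synchronously, never by two workers
      started++;
      try {
        await task();
        breaker.record(true);
      } catch {
        breaker.record(false);
      }
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return started;
}
```

Two caveats remain even with this structure: with concurrency > 1, "consecutive" means consecutive in completion order across all workers, and up to `concurrency - 1` attempts already in flight may still finish after the breaker trips.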

🟡 LOW eval-error event lacks structured operation discriminator — src/fuzz/explorer.ts

The ExploreEvent's eval-error type (line 216 in types.ts) carries cell, scenarioId, and message but no field indicating which operation threw — evaluate, isValid, isUncontaminated, or minimize. All four cross the same external boundary and all land in the same catch block at explorer.ts:285. For observability and debugging, a source?: 'evaluate' | 'gate' | 'minimize' discriminator would let monitoring dashboards distinguish dead-backend failures from broken-minimizer failures without parsing error messages.
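Because all four operations land in one catch block, one low-cost way to obtain the suggested discriminator is to track the current phase in a local variable before each boundary call. This is a sketch under assumptions: the `source` field is the reviewer's proposal rather than current API, and `Boundary`, `evaluateCell`, `countBySource` and the evaluate, gate, minimize control flow are hypothetical.

```typescript
// Hypothetical shapes: cell, scenarioId and message come from the existing
// eval-error event; `source` is the suggested addition, not current API.
type EvalErrorSource = 'evaluate' | 'gate' | 'minimize';

interface EvalErrorEvent {
  type: 'eval-error';
  cell: string;
  scenarioId: string;
  message: string;
  source?: EvalErrorSource;
}

interface Boundary {
  evaluate(scenarioId: string): Promise<number>;
  isValid(score: number): Promise<boolean>;
  minimize(scenarioId: string): Promise<string>;
}

async function evaluateCell(
  b: Boundary,
  cell: string,
  scenarioId: string,
  emit: (e: EvalErrorEvent) => void,
): Promise<void> {
  // Updated before each boundary call so the single catch knows which one threw.
  let phase: EvalErrorSource = 'evaluate';
  try {
    const score = await b.evaluate(scenarioId);
    phase = 'gate';
    const valid = await b.isValid(score);
    if (valid) {
      phase = 'minimize'; // hypothetical flow: minimize only after the gate passes
      await b.minimize(scenarioId);
    }
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    emit({ type: 'eval-error', cell, scenarioId, message, source: phase });
  }
}

// Dashboard-style aggregation without parsing error messages.
function countBySource(events: EvalErrorEvent[]): Record<EvalErrorSource | 'unknown', number> {
  const counts: Record<EvalErrorSource | 'unknown', number> = { evaluate: 0, gate: 0, minimize: 0, unknown: 0 };
  for (const e of events) counts[e.source ?? 'unknown']++;
  return counts;
}
```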

🟡 LOW runsThisStep not incremented on eval-error path — step() reports runs=0 after only errors — src/fuzz/explorer.ts

Line 247: runsThisStep++ only fires on the success path inside the try block. When all evaluations in a step throw, runsThisStep stays 0. In run(), line 326: if (runs === 0 && this.stoppedEarly === undefined) break. This is guarded correctly because stoppedEarly will be set by the circuit breaker. But for a step where some cells succeed and some fail (without tripping the breaker), runsThisStep only counts successes, not the total attempted work. This is actually the correct semantics (runs = successful evaluations consumed from budget), b…
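A sketch of how a step could report attempted work alongside budget-consuming runs, assuming cells are async callbacks evaluated in sequence. `StepResult`, `attempted` and `classifyStep` are hypothetical additions. `runs` keeps the semantics the reviewer endorses (successes only), while `attempted` separates a step that found nothing to do from one in which every cell errored.

```typescript
// Hypothetical: `runs` counts only successes (like runsThisStep); `attempted`
// and `evalErrors` are added so an all-errored step is distinguishable.
interface StepResult {
  runs: number;
  attempted: number;
  evalErrors: number;
}

async function step(cells: Array<() => Promise<void>>): Promise<StepResult> {
  const result: StepResult = { runs: 0, attempted: 0, evalErrors: 0 };
  for (const cell of cells) {
    result.attempted++;
    try {
      await cell();
      result.runs++; // success path only
    } catch {
      result.evalErrors++;
    }
  }
  return result;
}

// With `runs` alone, both of the first two cases look like runs === 0.
function classifyStep(r: StepResult): 'exhausted' | 'all-errored' | 'progress' {
  if (r.attempted === 0) return 'exhausted';
  if (r.runs === 0) return 'all-errored';
  return 'progress';
}
```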

🟡 LOW Redundant test assertion duplicates the same invariant — src/fuzz/fuzz-agent.test.ts

Lines 264-266: expect(capsule.stats.totalRuns + capsule.stats.evalErrors).toBeGreaterThan(capsule.stats.totalRuns) is algebraically equivalent to evalErrors > 0, which is already asserted on line 258. The redundant assertion adds no coverage and could confuse readers about intent. Remove it or replace with a more specific check.
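Both quantities are non-negative integer counts, so there is no floating-point rounding to consider; subtracting `totalRuns` from both sides shows the two assertions state the same thing:

$$
\mathit{totalRuns} + \mathit{evalErrors} > \mathit{totalRuns} \iff \mathit{evalErrors} > 0
$$

A more specific replacement could, for example, pin `evalErrors` to the exact number of failures the test itself injects, which would also catch over-counting.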

_{tangletools · 2026-06-10T21:27:03Z · trace}

tangletools approved these changes Jun 10, 2026


drewstone merged commit a289cbf into main Jun 10, 2026
1 check passed

drewstone deleted the fix/explorer-eval-error-isolation branch June 10, 2026 22:09

drewstone merged 1 commit into main from fix/explorer-eval-error-isolation