feat(fuzz): coverage distributions — score/dimension/latency spread + per-cell cost by drewstone · Pull Request #247 · tangle-network/agent-eval

drewstone · 2026-06-11T00:35:07Z

Answers "why is there only one dimension of scale?" — there isn't anymore, and the aggregates stopped lying by averaging.

- `Distribution { mean, median, p90, min, max, n }` everywhere a mean used to be: per-cell headline score, every judged dimension, evaluation latency.
- Latency: engine-measured wall-clock per evaluation (a consumer-supplied `latencyMs` overrides it, e.g. to exclude judge time); per-cell and capsule-wide distributions; median-latency KPI with a warning accent when p90 ≫ median.
- Per-cell cost: known dollars and tracked-but-unknown runs are counted separately; the fields are absent entirely when cost tracking is unwired, never a fabricated $0.
- Capsule stats: `robustness: Distribution | null` over per-cell means (cells weigh equally; variance steering sends more runs to weak cells, so run-weighting would bias low) and `latencyMs: Distribution | null` over all runs. Bare `meanRobustness` is gone.
- HTML: tiles color by mean; tooltips carry median/min/latency/cost; KPIs add cell spread (min–max).
- Determinism preserved: latency is excluded from the seed-determinism contract because it is wall-clock by nature.
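For readers who want a concrete picture of the `Distribution` shape, here is a minimal sketch, not the code from this PR. The function names `summarize` and `effectiveLatencyMs` are hypothetical, and the nearest-rank rule for p90 is an assumption; the engine may use a different percentile method.

```typescript
// Hypothetical shape mirroring the fields named in the description.
interface Distribution {
  mean: number;
  median: number;
  p90: number;
  min: number;
  max: number;
  n: number;
}

// Summarize a non-empty sample. An empty sample throws instead of
// returning made-up zeros. p90 uses the nearest-rank method (assumption).
function summarize(samples: readonly number[]): Distribution {
  if (samples.length === 0) {
    throw new Error("cannot summarize an empty sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median =
    n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // 1-based nearest rank; integer arithmetic avoids float rounding in 0.9 * n.
  const rank = Math.ceil((90 * n) / 100);
  const p90 = sorted[rank - 1];
  return { mean, median, p90, min: sorted[0], max: sorted[n - 1], n };
}

// A consumer-supplied latency wins over the engine's wall-clock measurement.
function effectiveLatencyMs(measuredMs: number, suppliedMs?: number): number {
  return suppliedMs ?? measuredMs;
}

// One slow run drags the mean far from the typical value:
// mean 1332.5, median 115, p90 5000.
const latency = summarize([100, 110, 120, 5000]);
```

The last line shows why a lone mean misleads: the median stays near the typical run while p90 exposes the outlier, which is the case the p90 ≫ median warning is meant to flag.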

Greenfield replacement, no compat aliases. In-repo consumers updated; product fuzz CLIs that read stats.meanRobustness / coverage[].robustness update at their next agent-eval bump.
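As a rough guide for downstream consumers, the change from `stats.meanRobustness` to the new nullable distribution could look like the sketch below. Only `robustness` and `latencyMs` come from this description; the `CapsuleStats` interface name and the helpers `headlineRobustness` and `formatSpread` are hypothetical.

```typescript
interface Distribution {
  mean: number;
  median: number;
  p90: number;
  min: number;
  max: number;
  n: number;
}

// Hypothetical minimal view of the capsule stats after 0.91.0.
interface CapsuleStats {
  robustness: Distribution | null;
  latencyMs: Distribution | null;
}

// Before 0.91.0 a consumer read `stats.meanRobustness` directly.
// Afterwards the mean sits inside a distribution that may be null.
function headlineRobustness(stats: CapsuleStats): number | null {
  return stats.robustness?.mean ?? null;
}

// Render mean plus spread; fall back to "n/a" when there is no data.
function formatSpread(d: Distribution | null): string {
  if (d === null) return "n/a";
  return `${d.mean.toFixed(2)} (min ${d.min.toFixed(2)}, max ${d.max.toFixed(2)}, n=${d.n})`;
}
```

Handling `null` explicitly keeps a consumer from printing a zero when no cell produced a score.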

9 new tests (distribution math incl. empty-sample throw, latency override, per-cell cost reconciliation against capsule totals); full suite 2264 passing; typecheck + build clean. 0.91.0.
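The cost reconciliation mentioned above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `RunCost` encoding (a number when known, `null` when tracked but unknown, `undefined` when tracking is unwired), the `CellCost` field names, and both helper functions.

```typescript
// Hypothetical per-run cost encoding (not taken from the PR).
type RunCost = number | null | undefined;

interface CellCost {
  knownUsd: number;    // sum of runs with a known dollar cost
  unknownRuns: number; // runs that were tracked but have no dollar figure
}

// Returns undefined when no run was tracked, so an unwired cell
// never shows up as a fabricated $0.
function cellCost(runCosts: readonly RunCost[]): CellCost | undefined {
  let knownUsd = 0;
  let unknownRuns = 0;
  let trackedRuns = 0;
  for (const c of runCosts) {
    if (c === undefined) continue; // cost tracking not wired for this run
    trackedRuns += 1;
    if (c === null) unknownRuns += 1;
    else knownUsd += c;
  }
  return trackedRuns === 0 ? undefined : { knownUsd, unknownRuns };
}

// Per-cell costs should add up to the capsule totals.
function reconciles(
  cells: readonly (CellCost | undefined)[],
  capsule: CellCost,
): boolean {
  let knownUsd = 0;
  let unknownRuns = 0;
  for (const cell of cells) {
    if (cell === undefined) continue;
    knownUsd += cell.knownUsd;
    unknownRuns += cell.unknownRuns;
  }
  return (
    Math.abs(knownUsd - capsule.knownUsd) < 1e-9 &&
    unknownRuns === capsule.unknownRuns
  );
}
```

Keeping unknown runs as a count rather than folding them into the dollar sum lets a reader see how much of the total is actually backed by a price.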

…spread + per-cell cost (0.91.0)

A bare mean hides outliers. `CoverageCell.score` and every dimension are now full `Distribution`s (mean/median/p90/min/max/n); evaluation latency is engine-measured per run (a consumer-supplied `latencyMs` overrides it) and aggregates per cell and capsule-wide; per-cell cost splits known dollars from tracked-but-unknown runs (absent when tracking is unwired, never a fabricated $0). Capsule stats replace the bare `meanRobustness` with a `robustness` Distribution over per-cell means (cells weigh equally, since variance steering would bias a run-weighted average low) plus a latency Distribution over all runs. HTML tiles color by mean and carry median/min/latency/cost in the tooltip; KPIs add cell spread and median latency.
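The equal-weighting argument is easiest to see with numbers. The sketch below uses made-up scores and hypothetical helper names. It assumes a steering policy that gave a weak cell six runs and a strong cell two.

```typescript
function mean(xs: readonly number[]): number {
  if (xs.length === 0) throw new Error("empty sample");
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Each cell counts once, however many runs it received.
function cellWeightedMean(cells: readonly (readonly number[])[]): number {
  return mean(cells.map((runs) => mean(runs)));
}

// Every run counts once, so heavily sampled cells dominate.
function runWeightedMean(cells: readonly (readonly number[])[]): number {
  const allRuns = ([] as number[]).concat(...cells);
  return mean(allRuns);
}

// Illustrative data: the weak cell was steered toward and got 6 runs,
// the strong cell got 2.
const weakCell = [0.25, 0.25, 0.25, 0.25, 0.25, 0.25];
const strongCell = [0.75, 0.75];

const byCell = cellWeightedMean([weakCell, strongCell]); // 0.5
const byRun = runWeightedMean([weakCell, strongCell]);   // 0.375
```

The run-weighted figure drops to 0.375 only because the weak cell was sampled more often; the cell-weighted figure stays at 0.5 regardless of how runs were allocated.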

tangletools

✅ Auto-approved PR — `323b8c19`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-11T00:35:14Z}

tangletools approved these changes Jun 11, 2026

drewstone merged commit 58dd279 into main Jun 11, 2026
1 check passed

drewstone deleted the feat/capsule-distributions branch June 11, 2026 00:37
