feat(fuzz): coverage distributions — score/dimension/latency spread + per-cell cost#247

Merged
drewstone merged 1 commit into main from feat/capsule-distributions
Jun 11, 2026

@drewstone

Answers "why is there only one dimension of scale?" — there isn't anymore, and the aggregates stopped lying by averaging.

  • Distribution { mean, median, p90, min, max, n } everywhere a mean used to be: per-cell headline score, every judged dimension, evaluation latency.
  • Latency: engine-measured wall-clock per evaluation (consumer-supplied latencyMs overrides, e.g. to exclude judge time); per-cell + capsule-wide distributions; median-latency KPI with a p90≫median warning accent.
  • Per-cell cost: known dollars + tracked-but-unknown runs counted apart; fields absent entirely when cost tracking is unwired — never a fabricated $0.
  • Capsule stats: robustness: Distribution | null over per-cell means (cells weigh equally — variance steering sends more runs to weak cells, so run-weighting would bias low), latencyMs: Distribution | null over all runs. Bare meanRobustness is gone.
  • HTML: tiles color by mean, tooltips carry median/min/latency/cost; KPIs add cell spread (min–max).
  • Determinism preserved (latency excluded from the seed-determinism contract — it is wall-clock by nature).
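
For readers who want the summary shape in concrete terms, here is a minimal sketch of how a `Distribution` could be computed from raw samples. The field names come from the description above; the function name `summarize` and the nearest-rank definition of `p90` are assumptions for illustration, and the engine may well use a different percentile method.

```typescript
// Field names follow the PR description; everything else here is illustrative.
interface Distribution {
  mean: number;
  median: number;
  p90: number;
  min: number;
  max: number;
  n: number;
}

// Hypothetical helper: summarizes a non-empty list of samples.
function summarize(samples: readonly number[]): Distribution {
  if (samples.length === 0) {
    // Mirrors the "empty-sample throw" covered by the new tests.
    throw new Error("cannot summarize an empty sample");
  }
  // Copy before sorting so the caller's array is untouched; sort numerically.
  const sorted = [...samples].sort((a, b) => a - b);
  const n = sorted.length;
  const sum = sorted.reduce((acc, x) => acc + x, 0);
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Assumed nearest-rank p90: the smallest sample with at least 90% of samples at or below it.
  const p90 = sorted[Math.ceil((9 * n) / 10) - 1];
  return { mean: sum / n, median, p90, min: sorted[0], max: sorted[n - 1], n };
}
```

For the samples `[1, 2, 3, 4, 100]` this sketch reports a mean of 22 next to a median of 3, which is the kind of outlier a bare mean hides.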

Greenfield replacement, no compat aliases. In-repo consumers updated; product fuzz CLIs that read stats.meanRobustness / coverage[].robustness update at their next agent-eval bump.
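
A consumer that used to print `stats.meanRobustness` might migrate along these lines. This is a sketch under assumptions: only the `robustness` and `latencyMs` fields are taken from the description above, the `CapsuleStats` name and the `formatRobustness` helper are hypothetical, and the condition under which `robustness` is `null` is not stated in this PR.

```typescript
interface Distribution {
  mean: number;
  median: number;
  p90: number;
  min: number;
  max: number;
  n: number;
}

// Hypothetical, partial view of the stats object: only the two fields named in the PR.
interface CapsuleStats {
  robustness: Distribution | null;
  latencyMs: Distribution | null;
}

// Hypothetical consumer helper. Before: `stats.meanRobustness.toFixed(2)`.
function formatRobustness(stats: CapsuleStats): string {
  const r = stats.robustness;
  if (r === null) {
    // No distribution available, so report that rather than inventing a number.
    return "robustness: n/a";
  }
  return (
    `robustness: mean ${r.mean.toFixed(2)}, median ${r.median.toFixed(2)}, ` +
    `range ${r.min.toFixed(2)}-${r.max.toFixed(2)} over ${r.n} cells`
  );
}
```

Handling the `null` case explicitly matters because there is no compat alias to fall back on.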

9 new tests (distribution math incl. empty-sample throw, latency override, per-cell cost reconciliation against capsule totals); full suite 2264 passing; typecheck + build clean. 0.91.0.
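
The cost reconciliation check can be pictured with a small sketch. All names here (`RunRecord`, `CellCost`, `costByCell`, `totalCost`, `costUsd`) are hypothetical and not the package's API; the sketch only illustrates the rule stated above: known dollars and tracked-but-unknown runs are counted apart, and a cell with no cost tracking gets no cost entry at all instead of a zero.

```typescript
// Hypothetical run record. costUsd semantics assumed for this sketch:
//   undefined -> cost tracking not wired for this run
//   null      -> tracked, but the cost could not be determined
//   number    -> known cost in dollars
interface RunRecord {
  cellId: string;
  costUsd?: number | null;
}

interface CellCost {
  knownUsd: number;
  unknownRuns: number;
}

// Groups cost per cell; cells with only untracked runs are left out entirely.
function costByCell(runs: readonly RunRecord[]): Map<string, CellCost> {
  const out = new Map<string, CellCost>();
  for (const run of runs) {
    if (run.costUsd === undefined) continue; // unwired: never fabricate $0
    const cell = out.get(run.cellId) ?? { knownUsd: 0, unknownRuns: 0 };
    if (run.costUsd === null) {
      cell.unknownRuns += 1;
    } else {
      cell.knownUsd += run.costUsd;
    }
    out.set(run.cellId, cell);
  }
  return out;
}

// Sums per-cell entries so they can be compared against capsule-wide totals.
function totalCost(cells: Map<string, CellCost>): CellCost {
  const total: CellCost = { knownUsd: 0, unknownRuns: 0 };
  cells.forEach((cell) => {
    total.knownUsd += cell.knownUsd;
    total.unknownRuns += cell.unknownRuns;
  });
  return total;
}
```

A reconciliation test in this style would assert that `totalCost(costByCell(runs))` matches the capsule-level figures computed directly from the same runs.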

…spread + per-cell cost (0.91.0)

A bare mean hides outliers. CoverageCell.score and every dimension are now full
Distributions (mean/median/p90/min/max/n); evaluation latency is engine-measured
per run (consumer latencyMs overrides) and aggregates per cell and capsule-wide;
per-cell cost splits known dollars from tracked-but-unknown runs (absent when
tracking is unwired — never a fabricated $0). Capsule stats replace the bare
meanRobustness with a robustness Distribution over per-cell means (cells weigh
equally — variance steering would bias a run-weighted average low) plus a
latency Distribution over all runs. HTML tiles color by mean and carry
median/min/latency/cost in the tooltip; KPIs add cell spread + median latency.
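
The equal-weighting argument is easy to check with toy numbers. In this sketch the scores and run counts are invented for illustration: a strong cell gets two runs and a weak cell gets six, as variance steering would tend to produce.

```typescript
// Arithmetic mean of a non-empty list.
const mean = (xs: readonly number[]): number =>
  xs.reduce((acc, x) => acc + x, 0) / xs.length;

// Invented per-run scores grouped by cell; the weak cell received more runs.
const cellRuns: number[][] = [
  [0.75, 0.75],                         // strong cell, 2 runs
  [0.25, 0.25, 0.25, 0.25, 0.25, 0.25], // weak cell, 6 runs
];

// Cell-weighted: average the per-cell means, so each cell counts once.
const cellWeighted = mean(cellRuns.map(mean));

// Run-weighted: pool every run, so the oversampled weak cell dominates.
const allRuns = cellRuns.reduce<number[]>((acc, runs) => acc.concat(runs), []);
const runWeighted = mean(allRuns);

console.log(cellWeighted); // 0.5
console.log(runWeighted);  // 0.375
```

Both cells matter equally for coverage, yet pooling the runs pulls the headline from 0.5 down to 0.375 purely because the weak cell was sampled more often.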

@tangletools tangletools left a comment

✅ Auto-approved PR — 323b8c19

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-11T00:35:14Z

@drewstone drewstone merged commit 58dd279 into main Jun 11, 2026
1 check passed
@drewstone drewstone deleted the feat/capsule-distributions branch June 11, 2026 00:37