feat(fuzz): coverage distributions — score/dimension/latency spread + per-cell cost#247
Merged
…spread + per-cell cost (0.91.0)

A bare mean hides outliers. CoverageCell.score and every dimension are now full Distributions (mean/median/p90/min/max/n). Evaluation latency is engine-measured per run (a consumer-supplied latencyMs overrides it) and aggregates per cell and capsule-wide. Per-cell cost splits known dollars from tracked-but-unknown runs, and is absent when tracking is unwired (never a fabricated $0). Capsule stats replace the bare meanRobustness with a robustness Distribution over per-cell means (cells weigh equally, because variance steering would bias a run-weighted average low) plus a latency Distribution over all runs. HTML tiles color by mean and carry median/min/latency/cost in the tooltip; KPIs add cell spread + median latency.
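As an illustration of why the six-field summary says more than a bare mean, here is a minimal sketch of how such a summary could be computed. The field names come from the commit message; the `summarize` and `nearestRank` helpers and the nearest-rank percentile rule are assumptions for illustration, not necessarily what the engine does.

```typescript
// Field names follow the commit message; everything else here is hypothetical.
interface Distribution {
  mean: number;
  median: number;
  p90: number;
  min: number;
  max: number;
  n: number;
}

// Nearest-rank percentile over an already-sorted, non-empty array.
// (Assumed method; the real implementation may interpolate instead.)
function nearestRank(sorted: number[], p: number): number {
  const rank = Math.ceil(p * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(samples: number[]): Distribution {
  // Mirrors the "empty-sample throw" named in the test list.
  if (samples.length === 0) {
    throw new Error("cannot summarize an empty sample");
  }
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy first
  const n = sorted.length;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const sum = sorted.reduce((acc, x) => acc + x, 0);
  return {
    mean: sum / n,
    median,
    p90: nearestRank(sorted, 0.9),
    min: sorted[0],
    max: sorted[n - 1],
    n,
  };
}

// Four fast runs and one slow outlier, in milliseconds.
const latencySummary = summarize([120, 130, 125, 140, 2000]);
console.log(latencySummary);
```

With these samples the mean is 503 ms while the median is 130 ms, so the mean alone would suggest a typical run is roughly four times slower than it is.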
tangletools
approved these changes
Jun 11, 2026
✅ Auto-approved PR — 323b8c19
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-11T00:35:14Z
Answers "why is there only one dimension of scale?": there isn't anymore, and the aggregates stopped lying by averaging.

- `Distribution { mean, median, p90, min, max, n }` everywhere a mean used to be: per-cell headline score, every judged dimension, evaluation latency.
- Latency is engine-measured per run (a consumer-supplied `latencyMs` overrides, e.g. to exclude judge time); per-cell + capsule-wide distributions; median-latency KPI with a p90≫median warning accent.
- Capsule stats: `robustness: Distribution | null` over per-cell means (cells weigh equally: variance steering sends more runs to weak cells, so run-weighting would bias low), `latencyMs: Distribution | null` over all runs. Bare `meanRobustness` is gone.

Greenfield replacement, no compat aliases. In-repo consumers updated; product fuzz CLIs that read `stats.meanRobustness` / `coverage[].robustness` update at their next agent-eval bump.

9 new tests (distribution math incl. empty-sample throw, latency override, per-cell cost reconciliation against capsule totals); full suite 2264 passing; typecheck + build clean. 0.91.0.
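The equal-weighting argument can be checked with a small numeric sketch. The `CellRuns` shape and both helper functions below are hypothetical and only compare the two averaging schemes; the PR itself publishes a full Distribution over the per-cell means, of which this shows only the mean. Returning `null` when no cell has runs is likewise an assumption about when the `| null` case applies.

```typescript
// Hypothetical shape: one coverage cell and the headline scores of its runs.
interface CellRuns {
  id: string;
  scores: number[];
}

function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("empty sample");
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// Each cell counts once, however many runs it received.
function cellWeightedRobustness(cells: CellRuns[]): number | null {
  const scored = cells.filter((c) => c.scores.length > 0);
  return scored.length === 0 ? null : mean(scored.map((c) => mean(c.scores)));
}

// Each run counts once, so heavily sampled cells dominate.
function runWeightedRobustness(cells: CellRuns[]): number | null {
  const all = cells.reduce<number[]>((acc, c) => acc.concat(c.scores), []);
  return all.length === 0 ? null : mean(all);
}

// Variance steering has sent three times as many runs to the weak cell.
const exampleCells: CellRuns[] = [
  { id: "strong", scores: [1, 1] },
  { id: "weak", scores: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5] },
];

console.log(cellWeightedRobustness(exampleCells)); // 0.75
console.log(runWeightedRobustness(exampleCells)); // 0.625
```

Both cells are equally part of the coverage grid, yet the run-weighted figure drops to 0.625 purely because the weak cell was sampled more often; averaging per-cell means keeps the sampling policy out of the headline number.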