feat(perf): infra perf-benchmark substrate — journeys, integrity contracts, percentile ratchet (0.90.0) by drewstone · Pull Request #245 · tangle-network/agent-eval

drewstone · 2026-06-10T13:34:03Z

Problem

Infra performance benchmarking (provision latency, TTFT, stream reliability) has no substrate primitive. The existing BenchmarkRunner (root) scores QUALITY via judge panels over scenarios; consumers measuring LATENCY / RELIABILITY over flat metric records keep hand-rolling three things: a scenario matrix (journeys × drivers × regions), "is this passing record actually carrying its measurements" checks, and a percentile regression gate against a committed baseline.

Solution

New domain-agnostic src/perf/ module, exported from the root barrel and a new ./perf subpath (tsup entry + package.json exports, matching the existing subpath convention):

journey.ts — JourneySpec (id, requiredFields, minimums, phaseFields, requiresLLM), ScenarioAxes, PerfScenario, expandMatrix (cartesian expansion with a combo filter), scenarioKey (journeyId|dim=value|…, dims sorted, so keys are stable across axes-object insertion order).
integrity.ts — checkRecordIntegrity / assertRecordIntegrity: a record claiming pass === true must carry its journey's required fields and clear its numeric minimums (null-required-field / below-minimum violations); failed records are exempt — an errored run legitimately has nulls. resolveJourney returning null skips a record.
ratchet.ts — summarizeRecords (per-scenario PerfStat p50/p90/n, nearest-rank on sorted values; null/non-numeric metric values excluded from n, zero-sample fields omitted — no fake zeros) and gatePerf (trips when p50 OR p90 exceed tolerancePct (default 10) over baseline; strict improvements reported with negative overBy; n < minSamples (default 3) scenarios surfaced in missingScenarios and never gated; key drift in missingScenarios / newScenarios).
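As a rough illustration of the ratchet arithmetic described above, the sketch below computes nearest-rank p50/p90 and applies an OR-ed tolerance check. It is not the module's code: `PerfStatSketch`, `nearestRank`, `summarize`, `gate` and their signatures are hypothetical, the strict `>` comparison at the tolerance boundary is an assumption, and so is applying `minSamples` to the current run's `n`.

```typescript
// Hypothetical shape for illustration; the real PerfStat may carry more fields.
interface PerfStatSketch { p50: number; p90: number; n: number }

// Nearest-rank percentile: sort ascending, take the value at 1-based rank ceil(p * n / 100).
function nearestRank(values: readonly number[], p: number): number {
  if (values.length === 0) throw new Error("nearestRank needs at least one sample");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// Drop null / non-numeric samples before counting n; return null instead of a fake zero stat.
function summarize(samples: readonly unknown[]): PerfStatSketch | null {
  const nums = samples.filter((v): v is number => typeof v === "number" && Number.isFinite(v));
  if (nums.length === 0) return null;
  return { p50: nearestRank(nums, 50), p90: nearestRank(nums, 90), n: nums.length };
}

type GateResult = "regressed" | "ok" | "insufficient-samples";

// Trips when p50 OR p90 is more than tolerancePct above baseline (assumed strict ">").
// Comparing against baseline * (1 + tolerance) avoids dividing by a zero baseline.
function gate(
  current: PerfStatSketch,
  baseline: PerfStatSketch,
  tolerancePct = 10,
  minSamples = 3,
): GateResult {
  if (current.n < minSamples) return "insufficient-samples"; // never gated
  const limit = (base: number): number => base * (1 + tolerancePct / 100);
  const tripped = current.p50 > limit(baseline.p50) || current.p90 > limit(baseline.p90);
  return tripped ? "regressed" : "ok";
}

const stat = summarize([100, 200, null, 300, 400]); // the null sample is excluded from n
console.log(stat, stat && gate(stat, { p50: 200, p90: 300, n: 4 }));
```

With four numeric samples, the p50 rank is ceil(50 · 4 / 100) = 2 and the p90 rank is ceil(90 · 4 / 100) = 4, so a regression in the tail alone is enough to trip the gate in this sketch.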

Version 0.90.0 across npm + PyPI (version-locked trio) + CHANGELOG entry.

Testing

tests/perf.test.ts: 25 cases, each naming the bug it catches (cartesian size, sorted-key stability, filter inversion, pass=true with nulls, failed-record exemption, minimums, phase fields, nearest-rank correctness on even/odd n, tolerance boundary, p90-only regressions, missing/new scenarios, minSamples boundary, null-metric exclusion, zero-baseline division).
Mutation-verified: 7 hand-applied mutants (unsorted keys, integrity checking failed records, floor-based rank, minSamples off-by-one, null coercion, loosened minimum, AND-ed gate condition) — 7/7 killed.
Full suite: 222 files / 2255 tests pass. tsc --noEmit clean, biome check src clean, pnpm build (tsup + dts + openapi) green; dist/perf/index.js and root exports verified by import.
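To make the cartesian-size and sorted-key cases concrete, here is a minimal sketch of a journeys × axes expansion and a stable key builder. The names `expandMatrixSketch` and `scenarioKeySketch`, the `ScenarioSketch` shape, and the convention that the filter returns `true` to keep a combo are all assumptions for illustration; only the key format (journeyId|dim=value|…, dims sorted) comes from the description above.

```typescript
type AxesSketch = Record<string, readonly string[]>;
interface ScenarioSketch { journeyId: string; dims: Record<string, string> }

// Cartesian expansion: every journey crossed with every combination of axis values.
// `keep` is a hypothetical combo filter; returning false drops that combination.
function expandMatrixSketch(
  journeyIds: readonly string[],
  axes: AxesSketch,
  keep: (s: ScenarioSketch) => boolean = () => true,
): ScenarioSketch[] {
  let combos: Record<string, string>[] = [{}];
  for (const dim of Object.keys(axes)) {
    const next: Record<string, string>[] = [];
    for (const combo of combos) {
      for (const value of axes[dim] ?? []) {
        next.push({ ...combo, [dim]: value });
      }
    }
    combos = next;
  }
  const out: ScenarioSketch[] = [];
  for (const journeyId of journeyIds) {
    for (const dims of combos) {
      const scenario = { journeyId, dims };
      if (keep(scenario)) out.push(scenario);
    }
  }
  return out;
}

// Key format journeyId|dim=value|…, with dims sorted by name so the key
// does not depend on the insertion order of the axes object.
function scenarioKeySketch(s: ScenarioSketch): string {
  const parts = Object.keys(s.dims).sort().map((d) => `${d}=${s.dims[d]}`);
  return [s.journeyId, ...parts].join("|");
}

const all = expandMatrixSketch(["provision", "stream"], {
  driver: ["docker", "firecracker"],
  region: ["us", "eu"],
});
console.log(all.length); // 2 journeys x 2 drivers x 2 regions = 8
console.log(scenarioKeySketch({ journeyId: "provision", dims: { region: "us", driver: "docker" } }));
// prints "provision|driver=docker|region=us"
```

The journey ids, driver names, and regions here are made-up sample values, not ones taken from the repository.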

…racts, percentile ratchet (0.90.0)

New domain-agnostic /perf subpath for infra performance benchmarking, complementing the judge-panel BenchmarkRunner (quality) with latency / reliability scoring over flat metric records:

- JourneySpec + expandMatrix + scenarioKey: journeys × free-form axes cartesian matrix with sorted-dim stable keys and a combo filter.
- checkRecordIntegrity + assertRecordIntegrity: a pass=true record must carry its journey's requiredFields / minimums / phaseFields; failed records are exempt.
- summarizeRecords + gatePerf: nearest-rank p50/p90 PerfStat baselines and a tolerance ratchet with improvements, missing/new scenario detection, and a minSamples floor; null metrics never become fake zeros.

Exported from the root barrel and the new ./perf subpath (tsup entry + package.json exports). Version 0.90.0 across npm + PyPI; CHANGELOG entry added. 25 vitest cases, each mutation-verified (7/7 mutants killed).
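The integrity contract in the commit message (a pass=true record must carry its measurements, failed records are exempt) could look roughly like the sketch below. This is an assumption-laden illustration rather than the shipped implementation: `checkIntegritySketch`, `JourneySpecSketch`, `ViolationSketch`, and the flat-record shape are hypothetical, phaseFields are left out for brevity, and only the two violation kinds named in the PR description are modeled.

```typescript
type FlatRecordSketch = Record<string, unknown>;

interface JourneySpecSketch {
  id: string;
  requiredFields: string[];
  minimums: Record<string, number>;
}

interface ViolationSketch {
  kind: "null-required-field" | "below-minimum";
  field: string;
}

// Returns the violations for one record. Failed records (pass !== true) are exempt,
// and a resolver that returns null means the record is skipped.
function checkIntegritySketch(
  record: FlatRecordSketch,
  resolveJourney: (r: FlatRecordSketch) => JourneySpecSketch | null,
): ViolationSketch[] {
  if (record["pass"] !== true) return []; // an errored run legitimately has nulls
  const journey = resolveJourney(record);
  if (journey === null) return [];

  const violations: ViolationSketch[] = [];
  for (const field of journey.requiredFields) {
    const value = record[field];
    if (value === null || value === undefined) {
      violations.push({ kind: "null-required-field", field });
    }
  }
  for (const field of Object.keys(journey.minimums)) {
    const value = record[field];
    const min = journey.minimums[field];
    // Assumption: only numeric values are compared; missing ones are caught above if required.
    if (typeof value === "number" && min !== undefined && value < min) {
      violations.push({ kind: "below-minimum", field });
    }
  }
  return violations;
}

// Made-up journey and record values for demonstration only.
const streamJourney: JourneySpecSketch = {
  id: "stream",
  requiredFields: ["ttftMs", "tokens"],
  minimums: { tokens: 1 },
};
const resolve = (): JourneySpecSketch | null => streamJourney;

console.log(checkIntegritySketch({ pass: true, ttftMs: null, tokens: 0 }, resolve));
console.log(checkIntegritySketch({ pass: false, ttftMs: null, tokens: null }, resolve));
```

In this sketch the first call reports both a null required field and a below-minimum value, while the second returns no violations because the record did not claim to pass.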

tangletools

✅ Auto-approved PR — `a310acd1`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T13:34:11Z}

tangletools approved these changes Jun 10, 2026


drewstone merged commit e47c172 into main Jun 10, 2026
1 check passed

drewstone merged 1 commit into main from feat/perf-benchmark-substrate
