feat(perf): infra perf-benchmark substrate — journeys, integrity contracts, percentile ratchet (0.90.0) #245
Merged
…racts, percentile ratchet (0.90.0)

New domain-agnostic /perf subpath for infra performance benchmarking, complementing the judge-panel BenchmarkRunner (quality) with latency / reliability scoring over flat metric records:

- JourneySpec + expandMatrix + scenarioKey: journeys × free-form axes cartesian matrix with sorted-dim stable keys and a combo filter.
- checkRecordIntegrity + assertRecordIntegrity: a pass=true record must carry its journey's requiredFields / minimums / phaseFields; failed records are exempt.
- summarizeRecords + gatePerf: nearest-rank p50/p90 PerfStat baselines and a tolerance ratchet with improvements, missing/new scenario detection, and a minSamples floor; null metrics never become fake zeros.

Exported from the root barrel and the new ./perf subpath (tsup entry + package.json exports). Version 0.90.0 across npm + PyPI; CHANGELOG entry added.

25 vitest cases, each mutation-verified (7/7 mutants killed).
tangletools
approved these changes
Jun 10, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — a310acd1
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T13:34:11Z
Problem
Infra performance benchmarking (provision latency, TTFT, stream reliability) has no substrate primitive. The existing `BenchmarkRunner` (root) scores QUALITY via judge panels over scenarios; consumers measuring LATENCY / RELIABILITY over flat metric records keep hand-rolling three things: a scenario matrix (journeys × drivers × regions), "is this passing record actually carrying its measurements" checks, and a percentile regression gate against a committed baseline.

Solution

New domain-agnostic `src/perf/` module, exported from the root barrel and a new `./perf` subpath (tsup entry + package.json exports, matching the existing subpath convention):

- `journey.ts`: `JourneySpec` (id, requiredFields, minimums, phaseFields, requiresLLM), `ScenarioAxes`, `PerfScenario`, `expandMatrix` (cartesian expansion with a combo `filter`), `scenarioKey` (`journeyId|dim=value|…`, dims sorted, so keys are stable across axes-object insertion order).
- `integrity.ts`: `checkRecordIntegrity` / `assertRecordIntegrity`. A record claiming `pass === true` must carry its journey's required fields and clear its numeric minimums (`null-required-field` / `below-minimum` violations); failed records are exempt, since an errored run legitimately has nulls. `resolveJourney` returning null skips a record.
- `ratchet.ts`: `summarizeRecords` (per-scenario `PerfStat` p50/p90/n, nearest-rank on sorted values; null/non-numeric metric values excluded from `n`, zero-sample fields omitted, so no fake zeros) and `gatePerf` (trips when p50 OR p90 exceeds `tolerancePct` (default 10) over baseline; strict improvements reported with negative `overBy`; `n < minSamples` (default 3) scenarios surfaced in `missingScenarios` and never gated; key drift in `missingScenarios` / `newScenarios`).

Version 0.90.0 across npm + PyPI (version-locked trio) + CHANGELOG entry.
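
For readers who have not opened the diff, here is a minimal sketch of the journey matrix, stable key, and integrity rule described above. It is not the code in `src/perf/`: the `*Sketch` names, the type shapes, and the record layout are assumptions made for illustration, and `phaseFields`, `requiresLLM`, and `resolveJourney` are left out.

```typescript
// Hypothetical shapes: field names follow the PR description, but the
// exact types exported by src/perf/ are assumptions.
type Axes = Record<string, readonly string[]>;
type Dims = Record<string, string>;

interface JourneySketch {
  id: string;
  requiredFields: string[];
  minimums: Record<string, number>;
}

interface ScenarioSketch {
  journeyId: string;
  dims: Dims;
}

// journeyId|dim=value|... with dims sorted by name, so the key does not
// depend on the insertion order of the dims object.
function scenarioKeySketch(journeyId: string, dims: Dims): string {
  const parts = Object.keys(dims)
    .sort()
    .map((name) => `${name}=${dims[name]}`);
  return [journeyId, ...parts].join("|");
}

// Cartesian product of journeys x every axis value. `filter` keeps a combo
// when it returns true.
function expandMatrixSketch(
  journeys: JourneySketch[],
  axes: Axes,
  filter: (scenario: ScenarioSketch) => boolean = () => true,
): ScenarioSketch[] {
  let combos: Dims[] = [{}];
  for (const name of Object.keys(axes)) {
    const next: Dims[] = [];
    for (const combo of combos) {
      for (const value of axes[name]) {
        next.push({ ...combo, [name]: value });
      }
    }
    combos = next;
  }
  const scenarios: ScenarioSketch[] = [];
  for (const journey of journeys) {
    for (const dims of combos) {
      const scenario: ScenarioSketch = { journeyId: journey.id, dims };
      if (filter(scenario)) scenarios.push(scenario);
    }
  }
  return scenarios;
}

interface ViolationSketch {
  kind: "null-required-field" | "below-minimum";
  field: string;
}

// Only records claiming pass === true are checked; failed records are exempt.
function checkIntegritySketch(
  record: Record<string, unknown> & { pass: boolean },
  journey: JourneySketch,
): ViolationSketch[] {
  if (record.pass !== true) return [];
  const violations: ViolationSketch[] = [];
  for (const field of journey.requiredFields) {
    if (record[field] === null || record[field] === undefined) {
      violations.push({ kind: "null-required-field", field });
    }
  }
  for (const field of Object.keys(journey.minimums)) {
    const value = record[field];
    if (typeof value === "number" && value < journey.minimums[field]) {
      violations.push({ kind: "below-minimum", field });
    }
  }
  return violations;
}
```

Under these assumptions, two journeys with a two-value `driver` axis and a three-value `region` axis expand to 12 scenarios, and `scenarioKeySketch("provision", { region: "us", driver: "docker" })` yields `provision|driver=docker|region=us` regardless of the order in which the dims were written.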
Testing
`tests/perf.test.ts`: 25 cases, each naming the bug it catches (cartesian size, sorted-key stability, filter inversion, pass=true with nulls, failed-record exemption, minimums, phase fields, nearest-rank correctness on even/odd n, tolerance boundary, p90-only regressions, missing/new scenarios, minSamples boundary, null-metric exclusion, zero-baseline division). `tsc --noEmit` clean, `biome check src` clean, `pnpm build` (tsup + dts + openapi) green; `dist/perf/index.js` and root exports verified by import.
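
To make the nearest-rank, tolerance-boundary, and minSamples cases concrete, here is a small sketch of the arithmetic they exercise. It is an illustration under stated assumptions, not the `ratchet.ts` implementation: the `*Sketch` names are hypothetical, `overBy` is assumed to be a percentage over baseline, and the zero-baseline handling shown is one possible choice that the PR text does not specify.

```typescript
interface StatSketch {
  p50: number;
  p90: number;
  n: number;
}

// Nearest-rank percentile: the value at 1-based rank ceil(p * n / 100)
// in the ascending-sorted sample.
function nearestRankSketch(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Null and non-numeric samples are dropped rather than counted as zero;
// a metric with no usable samples yields no stat at all.
function summarizeSketch(samples: Array<number | null>): StatSketch | undefined {
  const values = samples
    .filter((v): v is number => typeof v === "number" && isFinite(v))
    .sort((a, b) => a - b);
  if (values.length === 0) return undefined;
  return {
    p50: nearestRankSketch(values, 50),
    p90: nearestRankSketch(values, 90),
    n: values.length,
  };
}

// Percent by which `current` exceeds `baseline`; negative means improvement.
// Assumption: a zero baseline with a positive current value counts as an
// unbounded regression instead of dividing by zero.
function overByPctSketch(baseline: number, current: number): number {
  if (baseline === 0) return current > 0 ? Infinity : 0;
  return ((current - baseline) / baseline) * 100;
}

interface GateResultSketch {
  gated: boolean; // false when the sample count is below minSamples
  regressed: boolean;
}

// Trips when p50 OR p90 is strictly more than tolerancePct over baseline.
function gateStatSketch(
  baseline: StatSketch,
  current: StatSketch,
  tolerancePct = 10,
  minSamples = 3,
): GateResultSketch {
  if (current.n < minSamples) return { gated: false, regressed: false };
  const regressed =
    overByPctSketch(baseline.p50, current.p50) > tolerancePct ||
    overByPctSketch(baseline.p90, current.p90) > tolerancePct;
  return { gated: true, regressed };
}
```

With these definitions, the samples `[10, 20, 30, 40]` give p50 = 20 and p90 = 40, and `[10, 20, 30, 40, 50]` give p50 = 30 and p90 = 50. A current p50 of 110 against a baseline of 100 sits exactly on the default tolerance and does not trip the gate, because the comparison is strict.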