feat(perf): infra perf-benchmark substrate — journeys, integrity contracts, percentile ratchet (0.90.0) #245

Merged
drewstone merged 1 commit into main from feat/perf-benchmark-substrate on Jun 10, 2026

@drewstone

Contributor

Problem

Infra performance benchmarking (provision latency, TTFT, stream reliability) has no substrate primitive. The existing BenchmarkRunner (root) scores QUALITY via judge panels over scenarios; consumers measuring LATENCY / RELIABILITY over flat metric records keep hand-rolling three things: a scenario matrix (journeys × drivers × regions), "is this passing record actually carrying its measurements" checks, and a percentile regression gate against a committed baseline.

Solution

New domain-agnostic src/perf/ module, exported from the root barrel and a new ./perf subpath (tsup entry + package.json exports, matching the existing subpath convention):

  • journey.ts: JourneySpec (id, requiredFields, minimums, phaseFields, requiresLLM), ScenarioAxes, PerfScenario, expandMatrix (cartesian expansion with a combo filter), and scenarioKey (journeyId|dim=value|…, with dims sorted so keys are stable across axes-object insertion order).
  • integrity.ts: checkRecordIntegrity / assertRecordIntegrity. A record claiming pass === true must carry its journey's required fields and clear its numeric minimums (null-required-field / below-minimum violations); failed records are exempt, since an errored run legitimately has nulls. A record is skipped when resolveJourney returns null.
  • ratchet.ts: summarizeRecords (per-scenario PerfStat p50/p90/n, nearest-rank on sorted values; null/non-numeric metric values are excluded from n and zero-sample fields are omitted, so there are no fake zeros) and gatePerf (trips when p50 OR p90 exceeds baseline by more than tolerancePct (default 10); strict improvements are reported with negative overBy; scenarios with n < minSamples (default 3) are surfaced in missingScenarios and never gated; key drift is reported in missingScenarios / newScenarios).
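
To make the matrix and key behavior concrete, here is a minimal standalone sketch. It is not the code in src/perf/journey.ts: the function names (`sketchScenarioKey`, `sketchExpandMatrix`), the type shapes, and the driver/region values are all hypothetical, chosen only to illustrate cartesian expansion with a combo filter and sorted-dim stable keys.

```typescript
// Illustrative sketch only: names and shapes are assumptions, not the
// actual src/perf/journey.ts exports.
type Axes = Record<string, string[]>;
type Combo = Record<string, string>;

interface Scenario {
  journeyId: string;
  dims: Combo;
}

// Build "journeyId|dim=value|…" with dimension names sorted, so the key does
// not depend on the insertion order of the dims object.
function sketchScenarioKey(journeyId: string, dims: Combo): string {
  const parts = Object.keys(dims)
    .sort()
    .map((name) => `${name}=${dims[name]}`);
  return [journeyId, ...parts].join("|");
}

// Cartesian expansion of journeys × every axis, keeping only the combos the
// optional filter accepts.
function sketchExpandMatrix(
  journeyIds: string[],
  axes: Axes,
  keep: (journeyId: string, dims: Combo) => boolean = () => true,
): Scenario[] {
  let combos: Combo[] = [{}];
  for (const [name, values] of Object.entries(axes)) {
    combos = combos.flatMap((combo) =>
      values.map((value) => ({ ...combo, [name]: value })),
    );
  }
  return journeyIds.flatMap((journeyId) =>
    combos
      .filter((dims) => keep(journeyId, dims))
      .map((dims) => ({ journeyId, dims })),
  );
}

// Hypothetical usage: 2 journeys × 2 drivers × 2 regions, minus stream-in-eu.
const scenarios = sketchExpandMatrix(
  ["provision", "stream"],
  { driver: ["docker", "firecracker"], region: ["us", "eu"] },
  (journeyId, dims) => !(journeyId === "stream" && dims.region === "eu"),
);
const keys = scenarios.map((s) => sketchScenarioKey(s.journeyId, s.dims));
```

Sorting the dimension names before joining is what keeps a committed baseline comparable across runs: two axes objects that differ only in property order produce the same key.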

Version 0.90.0 across npm + PyPI (version-locked trio) + CHANGELOG entry.

Testing

  • tests/perf.test.ts: 25 cases, each naming the bug it catches (cartesian size, sorted-key stability, filter inversion, pass=true with nulls, failed-record exemption, minimums, phase fields, nearest-rank correctness on even/odd n, tolerance boundary, p90-only regressions, missing/new scenarios, minSamples boundary, null-metric exclusion, zero-baseline division).
  • Mutation-verified: 7 hand-applied mutants (unsorted keys, integrity checking failed records, floor-based rank, minSamples off-by-one, null coercion, loosened minimum, AND-ed gate condition) — 7/7 killed.
  • Full suite: 222 files / 2255 tests pass. tsc --noEmit clean, biome check src clean, pnpm build (tsup + dts + openapi) green; dist/perf/index.js and root exports verified by import.
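
Several of the cases above (nearest-rank on even/odd n, the tolerance boundary, p90-only regressions, null-metric exclusion, zero baselines) can be pictured with a small standalone sketch. The helper names below are hypothetical and this is not the ratchet.ts implementation; in particular, comparing multiplicatively to sidestep a zero baseline is an assumption made for this example, and the PR does not say how gatePerf handles that case.

```typescript
// Illustrative sketch; helper names are hypothetical, not the src/perf API.

// Nearest-rank percentile: the value at 1-based rank ceil(p * n / 100) in the
// ascending-sorted sample. Returns null for an empty sample (no fake zero).
function nearestRank(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Drop null / non-numeric (and non-finite) metric values before counting n.
function numericSamples(raw: unknown[]): number[] {
  return raw.filter(
    (v): v is number => typeof v === "number" && Number.isFinite(v),
  );
}

// A percentile regresses only when it is strictly more than tolerancePct over
// baseline. Written multiplicatively (an assumption) so a zero baseline never
// causes a division.
function exceedsTolerance(
  current: number,
  baseline: number,
  tolerancePct = 10,
): boolean {
  return current * 100 > baseline * (100 + tolerancePct);
}

// The gate trips if p50 OR p90 regresses; AND-ing the two would miss
// p90-only regressions.
function sketchGate(
  current: { p50: number; p90: number },
  baseline: { p50: number; p90: number },
  tolerancePct = 10,
): boolean {
  return (
    exceedsTolerance(current.p50, baseline.p50, tolerancePct) ||
    exceedsTolerance(current.p90, baseline.p90, tolerancePct)
  );
}
```

With these definitions a current p50 of 110 against a baseline of 100 sits exactly on the 10% boundary and does not trip the gate, while 111 does.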

…racts, percentile ratchet (0.90.0)

New domain-agnostic /perf subpath for infra performance benchmarking,
complementing the judge-panel BenchmarkRunner (quality) with latency /
reliability scoring over flat metric records:

- JourneySpec + expandMatrix + scenarioKey: journeys × free-form axes
  cartesian matrix with sorted-dim stable keys and a combo filter.
- checkRecordIntegrity + assertRecordIntegrity: a pass=true record must
  carry its journey's requiredFields / minimums / phaseFields; failed
  records are exempt.
- summarizeRecords + gatePerf: nearest-rank p50/p90 PerfStat baselines
  and a tolerance ratchet with improvements, missing/new scenario
  detection, and a minSamples floor; null metrics never become fake
  zeros.
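
As a rough illustration of the integrity contract described in the list above, the following standalone sketch checks a record against a journey's required fields and minimums. The type shapes, the `sketchCheckIntegrity` name, and the `ttftMs` / `chunks` field names are assumptions for this example, not the integrity.ts API; phaseFields are left out because their exact semantics are not spelled out here.

```typescript
// Illustrative sketch; shapes and names are assumptions, not src/perf/integrity.ts.
type PerfRecord = { pass: boolean; journeyId: string } & Record<string, unknown>;

interface JourneySketch {
  id: string;
  requiredFields: string[];
  minimums: Record<string, number>;
}

interface Violation {
  kind: "null-required-field" | "below-minimum";
  field: string;
}

function sketchCheckIntegrity(
  record: PerfRecord,
  resolveJourney: (record: PerfRecord) => JourneySketch | null,
): Violation[] {
  // Failed records are exempt: an errored run legitimately has nulls.
  if (record.pass !== true) return [];

  const journey = resolveJourney(record);
  // A journey that cannot be resolved skips the record instead of failing it.
  if (journey === null) return [];

  const violations: Violation[] = [];
  for (const field of journey.requiredFields) {
    const value = record[field];
    if (value === null || value === undefined) {
      violations.push({ kind: "null-required-field", field });
    }
  }
  for (const [field, minimum] of Object.entries(journey.minimums)) {
    const value = record[field];
    if (typeof value === "number" && value < minimum) {
      violations.push({ kind: "below-minimum", field });
    }
  }
  return violations;
}

// Hypothetical journey: a streaming run must report TTFT and at least 1 chunk.
const streamJourney: JourneySketch = {
  id: "stream",
  requiredFields: ["ttftMs"],
  minimums: { chunks: 1 },
};
const resolveSketch = (r: PerfRecord): JourneySketch | null =>
  r.journeyId === "stream" ? streamJourney : null;
```

An asserting variant would presumably throw when the returned list is non-empty; that behavior is a guess based on the assertRecordIntegrity name.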

Exported from the root barrel and the new ./perf subpath (tsup entry +
package.json exports). Version 0.90.0 across npm + PyPI; CHANGELOG entry
added. 25 vitest cases, each mutation-verified (7/7 mutants killed).

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — a310acd1

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T13:34:11Z

@drewstone drewstone merged commit e47c172 into main Jun 10, 2026
1 check passed
