feat(cost): model seating chart, dollar budgets in the fuzz loop, program cost report #243

Merged
drewstone merged 1 commit into main from feat/cost-governance
Jun 10, 2026

@drewstone drewstone commented Jun 10, 2026


What

T7 — the program's cost layer. Three pieces, all projections over the existing CostLedger (nothing duplicated):

ModelSeats (src/model-seats.ts)

  • ModelSeats { worker, judges, analyst, reflection, verifier } — the one object that re-tiers an entire eval program.
  • seatPresets is plain data. economy uses the fleet-policy ids: kimi-k2.6 as worker and [kimi-k2.6, deepseek-v4-pro, gpt-4.1-mini] as judges. The judge panel is cross-family by construction, and every id is family-priced, so the preset never produces a costUnknown axis. frontier is deliberately empty because entitled frontier ids vary per router account; callers spread their own.
  • resolveSeat(seats, seat, fallback?) throws typed SeatUnsetError (code config) when a seat is unset with no explicit fallback — a model id is a budget decision, never a silent default. Wiring points (ensembleJudge({ models: seats.judges }), selfImprove({ llm: { model: seats.reflection } }), makeEvalTools panels, campaign cells) are named in the JSDoc; none are implemented here — those files belong to other surfaces.
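
A minimal sketch of the seat contract described above. The type and function names follow this description, but the bodies are assumptions rather than the code in src/model-seats.ts. Only the economy ids named above are reproduced, and `my-reflection-model` is a hypothetical placeholder id.

```typescript
// Sketch only: names follow the PR description; bodies are assumptions.
type SeatName = "worker" | "judges" | "analyst" | "reflection" | "verifier";

interface ModelSeats {
  worker?: string;
  judges?: string[];
  analyst?: string;
  reflection?: string;
  verifier?: string;
}

class SeatUnsetError extends Error {
  readonly code = "config";
  constructor(seat: SeatName) {
    super(`model seat "${seat}" is unset and no fallback was supplied`);
    this.name = "SeatUnsetError";
  }
}

// Only the ids named in the description appear here; other economy seats are omitted.
const seatPresets: { economy: ModelSeats; frontier: ModelSeats } = {
  economy: {
    worker: "kimi-k2.6",
    judges: ["kimi-k2.6", "deepseek-v4-pro", "gpt-4.1-mini"],
  },
  frontier: {}, // deliberately empty: callers spread their own entitled ids
};

// Returns the configured model(s) for a seat, or the explicit fallback.
// Throws instead of inventing a default when neither is present.
function resolveSeat<K extends SeatName>(
  seats: ModelSeats,
  seat: K,
  fallback?: NonNullable<ModelSeats[K]>,
): NonNullable<ModelSeats[K]> {
  const value = seats[seat] ?? fallback;
  if (value === undefined) throw new SeatUnsetError(seat);
  return value as NonNullable<ModelSeats[K]>;
}

// Re-tier a program by spreading a preset and overriding individual seats.
// "my-reflection-model" is a hypothetical id used only for illustration.
const seats: ModelSeats = { ...seatPresets.economy, reflection: "my-reflection-model" };
const judgePanel = resolveSeat(seats, "judges");
```

In this sketch, `resolveSeat(seatPresets.frontier, "worker")` throws because the frontier preset sets no seats, while passing a third argument makes the fallback an explicit choice by the caller.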

Dollar budgets in the fuzz loop (src/fuzz)

  • ExploreOptions gains costOf (consumer-supplied — the explorer cannot know token usage; null = unknown), costBudgetUsd, ledger, onCost.
  • Budget semantics mirror control-runtime's maxCostUsd: nonnegative-finite validation throws RangeError; the session stops once accumulated KNOWN cost ≥ ceiling (step()/run() honor it exactly like the run budget). Unknown-cost runs never consume budget and are never folded in as $0 — they land in stats.costUnknownRuns.
  • Cost options without costOf throw at construction (a ceiling that can never trip is a silent lie).
  • Known costs are recorded into the supplied CostLedger (channel agent, actualCostUsd) so fuzz spend lands in the same ledger as judge/analyst spend.
  • CapsuleData.stats gains costUsd/costUnknownRuns — present only when tracking was wired (absent ≠ $0). renderCapsuleHtml shows the cost KPI, with N runs unpriced named in amber when the total is a lower bound.
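
The budget rules above can be sketched as a small accounting helper. The option names (`costOf`, `costBudgetUsd`, `onCost`) and stat names come from this description. The `CostGovernor` class, the plain `Error` thrown for a missing `costOf`, and the generic run type are hypothetical and do not reflect the actual BehaviorExplorer internals. Ledger recording is left out to keep the sketch short.

```typescript
// Sketch only: option and stat names follow the PR description; the class is hypothetical.
interface CostOptions<R> {
  costOf?: (run: R) => number | null; // null means the run's cost is unknown
  costBudgetUsd?: number;
  onCost?: (usd: number, run: R) => void;
}

interface CostStats {
  costUsd: number; // sum of KNOWN costs only
  costUnknownRuns: number; // runs that could not be priced
}

class CostGovernor<R> {
  readonly stats: CostStats = { costUsd: 0, costUnknownRuns: 0 };
  private readonly opts: CostOptions<R>;

  constructor(opts: CostOptions<R>) {
    const { costOf, costBudgetUsd, onCost } = opts;
    if (costOf === undefined && (costBudgetUsd !== undefined || onCost !== undefined)) {
      // Assumption: the description does not state the real error type.
      throw new Error("cost options require costOf");
    }
    if (costBudgetUsd !== undefined && (!isFinite(costBudgetUsd) || costBudgetUsd < 0)) {
      throw new RangeError("costBudgetUsd must be a nonnegative finite number");
    }
    this.opts = opts;
  }

  // Account for one finished run.
  record(run: R): void {
    const costOf = this.opts.costOf;
    if (costOf === undefined) return; // tracking was not wired
    const usd = costOf(run);
    if (usd === null) {
      this.stats.costUnknownRuns += 1; // counted apart, never added as $0
      return;
    }
    this.stats.costUsd += usd;
    if (this.opts.onCost) this.opts.onCost(usd, run);
  }

  // True once accumulated known cost has reached the ceiling (>=).
  budgetExhausted(): boolean {
    const budget = this.opts.costBudgetUsd;
    return budget !== undefined && this.stats.costUsd >= budget;
  }
}

// Usage: stop stepping as soon as the ceiling is reached.
const governor = new CostGovernor<{ usd: number | null }>({
  costOf: (run) => run.usd,
  costBudgetUsd: 0.75,
});
const pendingRuns = [{ usd: 0.25 }, { usd: null }, { usd: 0.5 }, { usd: 0.25 }];
let executedRuns = 0;
for (const run of pendingRuns) {
  if (governor.budgetExhausted()) break;
  governor.record(run);
  executedRuns += 1;
}
```

Because the unknown-cost run adds nothing to `costUsd`, the reported total is a lower bound whenever `costUnknownRuns` is greater than zero.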

Program cost report (src/cost-report.ts)

  • costReport(ledger) returns { perChannel, total: { usd, unknownEntries }, perModel: [{ model, usd, entries, unpriced }] }: a thin projection over CostLedgerSummary (byChannel reused verbatim; only the per-model rollup is new).
  • attachCostToReport(report, ledger) — the one generic stamp for capsules / campaign results / diagnose reports; refuses to overwrite an existing cost key.
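
As an illustration of the report shape, here is a sketch that projects a flat list of ledger entries. The `LedgerEntry` type, the per-channel value shape, and the choice to throw on an existing `cost` key are assumptions. The real functions take a CostLedger and reuse its summary, and the description only says the stamp "refuses" to overwrite.

```typescript
// Sketch only: the report field names follow the PR description; everything else is hypothetical.
interface LedgerEntry {
  channel: string;
  model: string;
  actualCostUsd: number | null; // null means the entry could not be priced
}

interface ModelRollup {
  model: string;
  usd: number;
  entries: number;
  unpriced: boolean; // true marks usd as a lower bound
}

interface CostReport {
  perChannel: Record<string, { usd: number; entries: number }>;
  total: { usd: number; unknownEntries: number };
  perModel: ModelRollup[];
}

function costReport(entries: LedgerEntry[]): CostReport {
  const perChannel: CostReport["perChannel"] = {};
  const byModel: Record<string, ModelRollup> = {};
  const total = { usd: 0, unknownEntries: 0 };

  for (const entry of entries) {
    const channel = (perChannel[entry.channel] ??= { usd: 0, entries: 0 });
    const model = (byModel[entry.model] ??= {
      model: entry.model,
      usd: 0,
      entries: 0,
      unpriced: false,
    });
    channel.entries += 1;
    model.entries += 1;
    if (entry.actualCostUsd === null) {
      // Unknown cost is flagged, never folded in as $0.
      total.unknownEntries += 1;
      model.unpriced = true;
    } else {
      channel.usd += entry.actualCostUsd;
      model.usd += entry.actualCostUsd;
      total.usd += entry.actualCostUsd;
    }
  }

  return { perChannel, total, perModel: Object.keys(byModel).map((k) => byModel[k]) };
}

// Stamps a report with its cost projection; an existing "cost" key is never overwritten.
function attachCostToReport<T extends object>(
  report: T,
  entries: LedgerEntry[],
): T & { cost: CostReport } {
  if ("cost" in report) throw new Error("report already has a cost key");
  return { ...report, cost: costReport(entries) };
}

const sampleEntries: LedgerEntry[] = [
  { channel: "agent", model: "kimi-k2.6", actualCostUsd: 1 },
  { channel: "judge", model: "kimi-k2.6", actualCostUsd: null },
  { channel: "judge", model: "gpt-4.1-mini", actualCostUsd: 2 },
];
const stamped = attachCostToReport({ name: "example-capsule" }, sampleEntries);
```

Here the `kimi-k2.6` rollup carries `unpriced: true` because one of its two entries has no known cost, so its dollar figure is a lower bound.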

Campaign wiring (documented, not done)

src/campaign/run-campaign.ts is owned by a sibling track this round. Wiring is one line each: thread seats.judges into the campaign's judge configs via ensembleJudge, and stamp campaign results with attachCostToReport(result, ledger).

Tests

  • 30 new deterministic tests (seats resolve + loud throw, preset shape/cross-family/fully-priced, budget stop at costBudgetUsd, unknown-runs-never-$0, ledger recording, validation rejects negative/NaN, HTML KPI, unpriced:true projection, no-overwrite stamp).
  • Full suite: 2074 passed / 2 skipped (was ≥2044 on main; no existing test weakened). pnpm typecheck + pnpm build green. No version bump (stays 0.88.0).

…gram cost report

- ModelSeats + seatPresets + resolveSeat (src/model-seats.ts): one object
  re-tiers an entire eval program; economy preset uses the fleet-policy ids
  (cross-family judges, fully priced), frontier is deliberately empty —
  resolveSeat fails loud on any unset seat, a model id is never a silent
  default.
- BehaviorExplorer cost governance (src/fuzz): costOf + costBudgetUsd +
  ledger + onCost. Known cost accrues toward a hard ceiling with
  control-runtime maxCostUsd semantics (nonnegative finite, stop at >=);
  unknown-cost runs are counted apart, never folded in as $0. Capsule stats
  gain costUsd/costUnknownRuns only when tracking was wired, and the HTML
  capsule shows the cost KPI with the unpriced-run count.
- costReport + attachCostToReport (src/cost-report.ts): thin projection over
  CostLedger.summary() adding the per-model rollup (unpriced:true marks a
  lower-bound $); attachCostToReport is the one stamp every artifact uses
  and refuses to overwrite an existing cost key.

@tangletools tangletools left a comment


✅ Auto-approved PR — 60a9fa3d

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T11:03:17Z

@drewstone drewstone merged commit d2888e9 into main Jun 10, 2026
1 check passed

2 participants