feat(cost): model seating chart, dollar budgets in the fuzz loop, program cost report #243
Merged
…gram cost report

- ModelSeats + seatPresets + resolveSeat (src/model-seats.ts): one object re-tiers an entire eval program; the economy preset uses the fleet-policy ids (cross-family judges, fully priced), and frontier is deliberately empty. resolveSeat fails loud on any unset seat; a model id is never a silent default.
- BehaviorExplorer cost governance (src/fuzz): costOf + costBudgetUsd + ledger + onCost. Known cost accrues toward a hard ceiling with control-runtime maxCostUsd semantics (nonnegative finite, stop at >=); unknown-cost runs are counted apart, never folded in as $0. Capsule stats gain costUsd/costUnknownRuns only when tracking was wired, and the HTML capsule shows the cost KPI with the unpriced-run count.
- costReport + attachCostToReport (src/cost-report.ts): thin projection over CostLedger.summary() adding the per-model rollup (unpriced:true marks a lower-bound $); attachCostToReport is the one stamp every artifact uses and refuses to overwrite an existing cost key.
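The fail-loud seat resolution named in the commit message above can be sketched roughly as follows. This is an illustration only: the field optionality, the treatment of empty values as unset, the error's `seat` field, and the id `my-entitled-model` are assumptions, not the code in `src/model-seats.ts`.

```typescript
// Hypothetical sketch: names and shapes are inferred from the PR text,
// not copied from src/model-seats.ts.
interface ModelSeats {
  worker?: string;
  judges?: string[];
  analyst?: string;
  reflection?: string;
  verifier?: string;
}

class SeatUnsetError extends Error {
  readonly code = "config";
  readonly seat: string;

  constructor(seat: string) {
    super(`model seat "${seat}" is unset and no fallback was given`);
    this.name = "SeatUnsetError";
    this.seat = seat;
    // Keeps instanceof working if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

function resolveSeat<K extends keyof ModelSeats>(
  seats: ModelSeats,
  seat: K,
  fallback?: NonNullable<ModelSeats[K]>,
): NonNullable<ModelSeats[K]> {
  const value = seats[seat];
  // Assumption of this sketch: an empty string or empty judge list counts as unset.
  const isSet = Array.isArray(value)
    ? value.length > 0
    : typeof value === "string" && value.length > 0;
  if (isSet) return value as NonNullable<ModelSeats[K]>;
  // Only an explicit fallback is honored; there is no built-in default model.
  if (fallback !== undefined) return fallback;
  throw new SeatUnsetError(seat);
}

const seatPresets: Record<"economy" | "frontier", ModelSeats> = {
  // Only the two economy seats spelled out in the PR text are filled in here.
  economy: {
    worker: "kimi-k2.6",
    judges: ["kimi-k2.6", "deepseek-v4-pro", "gpt-4.1-mini"],
  },
  // Deliberately empty: callers spread their own entitled ids.
  frontier: {},
};

const economyWorker = resolveSeat(seatPresets.economy, "worker");
const mine: ModelSeats = { ...seatPresets.frontier, worker: "my-entitled-model" };
const frontierWorker = resolveSeat(mine, "worker");
console.log(economyWorker, frontierWorker);
```

Resolving a seat on the bare `frontier` preset throws instead of picking a model, which is the point: the caller has to make the budget decision explicitly.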
tangletools (Contributor) approved these changes on Jun 10, 2026:
✅ Auto-approved PR — 60a9fa3d
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T11:03:17Z
What

T7: the program's cost layer. Three pieces, all projections over the existing `CostLedger` (nothing duplicated):

ModelSeats (`src/model-seats.ts`)

- `ModelSeats { worker, judges, analyst, reflection, verifier }` is the one object that re-tiers an entire eval program.
- `seatPresets` is plain data:
  - `economy` uses the fleet-policy ids (`kimi-k2.6` worker; `[kimi-k2.6, deepseek-v4-pro, gpt-4.1-mini]` judges, cross-family by construction, every id family-priced so the preset never produces a costUnknown axis).
  - `frontier` is deliberately empty because entitled frontier ids vary per router account; callers spread their own.
- `resolveSeat(seats, seat, fallback?)` throws a typed `SeatUnsetError` (code `config`) when a seat is unset with no explicit fallback: a model id is a budget decision, never a silent default.
- Wiring points (`ensembleJudge({ models: seats.judges })`, `selfImprove({ llm: { model: seats.reflection } })`, `makeEvalTools` panels, campaign cells) are named in the JSDoc; none are implemented here, because those files belong to other surfaces.

Dollar budgets in the fuzz loop (`src/fuzz`)

- `ExploreOptions` gains `costOf` (consumer-supplied, since the explorer cannot know token usage; `null` = unknown), `costBudgetUsd`, `ledger`, and `onCost`.
- Budget semantics follow `control-runtime`'s `maxCostUsd`: nonnegative-finite validation throws `RangeError`; the session stops once accumulated KNOWN cost ≥ ceiling (`step()`/`run()` honor it exactly like the run budget).
- Unknown-cost runs never consume budget and are never folded in as $0; they land in `stats.costUnknownRuns`.
- `costBudgetUsd` without `costOf` throws at construction (a ceiling that can never trip is a silent lie).
- Known costs are recorded to the `CostLedger` (channel `agent`, `actualCostUsd`) so fuzz spend lands in the same ledger as judge/analyst spend.
- `CapsuleData.stats` gains `costUsd`/`costUnknownRuns`, present only when tracking was wired (absent ≠ $0).
- `renderCapsuleHtml` shows the cost KPI, with `N runs unpriced` named in amber when the total is a lower bound.

Program cost report (`src/cost-report.ts`)

- `costReport(ledger)` → `{ perChannel, total: { usd, unknownEntries }, perModel: [{ model, usd, entries, unpriced }] }`: a thin projection over `CostLedgerSummary` (`byChannel` reused verbatim; only the per-model rollup is new).
- `attachCostToReport(report, ledger)` is the one generic stamp for capsules / campaign results / diagnose reports; it refuses to overwrite an existing `cost` key.

Campaign wiring (documented, not done)

`src/campaign/run-campaign.ts` is owned by a sibling track this round. Wiring is one line each: thread `seats.judges` into the campaign's judge configs via `ensembleJudge`, and stamp campaign results with `attachCostToReport(result, ledger)`.

Tests

Tests cover `costBudgetUsd`, unknown-runs-never-$0, ledger recording, validation rejecting negative/NaN, the HTML KPI, the `unpriced:true` projection, and the no-overwrite stamp. `pnpm typecheck` + `pnpm build` green. No version bump (stays 0.88.0).
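
For reviewers who want the budget rules in one place, here is a minimal sketch of the semantics those tests describe: nonnegative-finite validation, stopping once known cost reaches the ceiling, and counting unknown-cost runs separately. `CostBudget`, `explore`, and the pre-computed cost array are hypothetical stand-ins for illustration; the real explorer in `src/fuzz` derives each cost through the consumer's `costOf`.

```typescript
// Hypothetical sketch of the budget semantics; not the code in src/fuzz.
interface CostStats {
  costUsd: number;
  costUnknownRuns: number;
}

class CostBudget {
  private readonly ceilingUsd: number;
  private knownUsd = 0;
  private unknownRuns = 0;

  constructor(ceilingUsd: number) {
    // Rejects negative values, NaN, and Infinity.
    if (!Number.isFinite(ceilingUsd) || ceilingUsd < 0) {
      throw new RangeError(
        `costBudgetUsd must be a nonnegative finite number, got ${ceilingUsd}`,
      );
    }
    this.ceilingUsd = ceilingUsd;
  }

  /** Record one run's cost; null means the cost is unknown. */
  record(costUsd: number | null): void {
    if (costUsd === null) {
      // Unknown runs are counted apart and never treated as $0 spend.
      this.unknownRuns += 1;
      return;
    }
    this.knownUsd += costUsd;
  }

  /** True once accumulated known cost has reached the ceiling (>=). */
  get exhausted(): boolean {
    return this.knownUsd >= this.ceilingUsd;
  }

  stats(): CostStats {
    return { costUsd: this.knownUsd, costUnknownRuns: this.unknownRuns };
  }
}

// Stand-in for the explorer loop: each array entry plays the role of what
// costOf would return for one run.
function explore(
  runCosts: Array<number | null>,
  costBudgetUsd: number,
): CostStats & { runs: number } {
  const budget = new CostBudget(costBudgetUsd);
  let runs = 0;
  for (const cost of runCosts) {
    if (budget.exhausted) break; // checked before starting another run
    budget.record(cost);
    runs += 1;
  }
  return { ...budget.stats(), runs };
}

const result = explore([0.25, null, 0.5, 0.25, 0.25], 0.75);
console.log(result); // { costUsd: 0.75, costUnknownRuns: 1, runs: 3 }
```

In this sketch the unknown-cost run still counts as a run but leaves `costUsd` untouched, so a reported total alongside a nonzero `costUnknownRuns` reads as a lower bound rather than an exact figure.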