feat(cost): model seating chart, dollar budgets in the fuzz loop, program cost report by drewstone · Pull Request #243 · tangle-network/agent-eval

drewstone · 2026-06-10T11:03:10Z

What

T7 — the program's cost layer. Three pieces, all projections over the existing CostLedger (nothing duplicated):

ModelSeats (`src/model-seats.ts`)

ModelSeats { worker, judges, analyst, reflection, verifier } — the one object that re-tiers an entire eval program.
seatPresets is plain data: economy uses the fleet-policy ids (kimi-k2.6 worker, [kimi-k2.6, deepseek-v4-pro, gpt-4.1-mini] judges — cross-family by construction, every id family-priced so the preset never produces a costUnknown axis); frontier is deliberately empty because entitled frontier ids vary per router account — callers spread their own.
resolveSeat(seats, seat, fallback?) throws typed SeatUnsetError (code config) when a seat is unset with no explicit fallback — a model id is a budget decision, never a silent default. Wiring points (ensembleJudge({ models: seats.judges }), selfImprove({ llm: { model: seats.reflection } }), makeEvalTools panels, campaign cells) are named in the JSDoc; none are implemented here — those files belong to other surfaces.
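
A rough sketch of the fail-loud seat resolution described above, for readers wiring their own seats. Only the seat names, the economy worker/judge ids, and the `SeatUnsetError` name and `config` code come from this PR. The optional fields, the error's constructor, the generic signature, and the omission of the other economy seats are assumptions made for illustration; this is not the contents of `src/model-seats.ts`.

```typescript
// Illustrative sketch only; shapes below are assumed, not copied from the PR.
interface ModelSeats {
  worker?: string;
  judges?: string[];
  analyst?: string;
  reflection?: string;
  verifier?: string;
}

type SeatName = keyof ModelSeats;

class SeatUnsetError extends Error {
  readonly code = "config";
  readonly seat: SeatName;
  constructor(seat: SeatName) {
    super(`model seat "${seat}" is unset and no fallback was supplied`);
    this.name = "SeatUnsetError";
    this.seat = seat;
    // Keeps `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, SeatUnsetError.prototype);
  }
}

// Presets are plain data. Only the seats named in the PR text are filled in;
// frontier stays empty so callers spread their own entitled ids.
const seatPresets: { economy: ModelSeats; frontier: ModelSeats } = {
  economy: {
    worker: "kimi-k2.6",
    judges: ["kimi-k2.6", "deepseek-v4-pro", "gpt-4.1-mini"],
  },
  frontier: {},
};

// Returns the configured seat, else the explicit fallback, else throws.
// There is deliberately no built-in default model id.
function resolveSeat<K extends SeatName>(
  seats: ModelSeats,
  seat: K,
  fallback?: NonNullable<ModelSeats[K]>,
): NonNullable<ModelSeats[K]> {
  const value = seats[seat] ?? fallback;
  if (value === undefined) throw new SeatUnsetError(seat);
  return value as NonNullable<ModelSeats[K]>;
}
```

Under these assumptions, `resolveSeat(seatPresets.frontier, "worker")` throws, while `resolveSeat({ ...seatPresets.frontier, worker: "my-frontier-id" }, "worker")` returns the caller's id (`my-frontier-id` is a placeholder, not a real model id).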

Dollar budgets in the fuzz loop (`src/fuzz`)

ExploreOptions gains costOf (consumer-supplied — the explorer cannot know token usage; null = unknown), costBudgetUsd, ledger, onCost.
Budget semantics mirror control-runtime's maxCostUsd: nonnegative-finite validation throws RangeError; the session stops once accumulated KNOWN cost ≥ ceiling (step()/run() honor it exactly like the run budget). Unknown-cost runs never consume budget and are never folded in as $0 — they land in stats.costUnknownRuns.
Cost options without costOf throw at construction (a ceiling that can never trip is a silent lie).
Known costs are recorded into the supplied CostLedger (channel agent, actualCostUsd) so fuzz spend lands in the same ledger as judge/analyst spend.
CapsuleData.stats gains costUsd/costUnknownRuns — present only when tracking was wired (absent ≠ $0). renderCapsuleHtml shows the cost KPI, with N runs unpriced named in amber when the total is a lower bound.
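
The budget semantics above can be sketched as a small loop. This is a minimal illustration, not the `BehaviorExplorer` code: `runWithBudget`, `BudgetStats`, and the array-of-runs input are hypothetical names, `costOf` is made a required option so the "cost options without costOf" case cannot arise, and ledger recording and `onCost` are left out. Checking the ceiling before each run is also an assumption about where the stop is evaluated.

```typescript
// Hypothetical sketch of the dollar-budget rules; names are illustrative.
type CostOf<R> = (run: R) => number | null; // null means "cost unknown"

interface BudgetStats {
  runs: number;
  costUsd: number; // sum of KNOWN costs only, so possibly a lower bound
  costUnknownRuns: number; // runs that could not be priced
}

function validateBudget(costBudgetUsd: number): void {
  // Nonnegative-finite validation: NaN, Infinity and negatives are rejected.
  if (!Number.isFinite(costBudgetUsd) || costBudgetUsd < 0) {
    throw new RangeError(
      `costBudgetUsd must be a nonnegative finite number, got ${costBudgetUsd}`,
    );
  }
}

function runWithBudget<R>(
  runs: R[],
  opts: { costOf: CostOf<R>; costBudgetUsd?: number },
): BudgetStats {
  if (opts.costBudgetUsd !== undefined) validateBudget(opts.costBudgetUsd);

  const stats: BudgetStats = { runs: 0, costUsd: 0, costUnknownRuns: 0 };
  for (const run of runs) {
    // Stop once accumulated known cost has reached the ceiling (>=).
    if (opts.costBudgetUsd !== undefined && stats.costUsd >= opts.costBudgetUsd) {
      break;
    }
    stats.runs += 1;
    const cost = opts.costOf(run);
    if (cost === null) {
      // Unknown cost never consumes budget and is never counted as $0 spend.
      stats.costUnknownRuns += 1;
    } else {
      stats.costUsd += cost;
    }
  }
  return stats;
}
```

With per-run costs of `0.5`, `null`, `0.5`, `0.25` and a ceiling of `1.0`, this sketch executes three runs, reports `costUsd` of `1` with one unpriced run, and never starts the fourth.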

Program cost report (`src/cost-report.ts`)

costReport(ledger) → { perChannel, total: { usd, unknownEntries }, perModel: [{ model, usd, entries, unpriced }] } — a thin projection over CostLedgerSummary (byChannel reused verbatim; only the per-model rollup is new).
attachCostToReport(report, ledger) — the one generic stamp for capsules / campaign results / diagnose reports; refuses to overwrite an existing cost key.
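
To make the per-model rollup and the no-overwrite stamp concrete, here is a small sketch under stated assumptions. The `LedgerEntry` shape, `rollupPerModel`, and `stampCost` are hypothetical stand-ins: the real `costReport` and `attachCostToReport` take a `CostLedger`, whose entry shape is not shown in this PR. Whether the real stamp throws or returns a copy is likewise not stated; throwing and returning a new object are choices made here.

```typescript
// Hypothetical entry shape; the actual CostLedger entries may differ.
interface LedgerEntry {
  channel: string;
  model: string;
  actualCostUsd: number | null; // null means the entry could not be priced
}

interface PerModelRow {
  model: string;
  usd: number; // sum of priced entries only
  entries: number;
  unpriced: boolean; // true marks usd as a lower bound
}

// Groups entries by model, summing known cost and flagging unpriced models.
function rollupPerModel(entries: LedgerEntry[]): PerModelRow[] {
  const rows = new Map<string, PerModelRow>();
  for (const entry of entries) {
    let row = rows.get(entry.model);
    if (!row) {
      row = { model: entry.model, usd: 0, entries: 0, unpriced: false };
      rows.set(entry.model, row);
    }
    row.entries += 1;
    if (entry.actualCostUsd === null) {
      row.unpriced = true;
    } else {
      row.usd += entry.actualCostUsd;
    }
  }
  return Array.from(rows.values());
}

// Stamps a cost payload onto a report, refusing to clobber an existing key.
function stampCost<T extends object, C>(report: T, cost: C): T & { cost: C } {
  if ("cost" in report) {
    throw new Error("report already has a cost key; refusing to overwrite");
  }
  return { ...report, cost };
}
```

A model with one priced entry at `0.25` and one unpriced entry rolls up to `usd: 0.25, entries: 2, unpriced: true`, which is how a reader can tell the dollar figure is a lower bound.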

Campaign wiring (documented, not done)

src/campaign/run-campaign.ts is owned by a sibling track this round. Wiring is one line each: thread seats.judges into the campaign's judge configs via ensembleJudge, and stamp campaign results with attachCostToReport(result, ledger).

Tests

30 new deterministic tests (seats resolve + loud throw, preset shape/cross-family/fully-priced, budget stop at costBudgetUsd, unknown-runs-never-$0, ledger recording, validation rejects negative/NaN, HTML KPI, unpriced:true projection, no-overwrite stamp).
Full suite: 2074 passed / 2 skipped (was ≥2044 on main; no existing test weakened). pnpm typecheck + pnpm build green. No version bump (stays 0.88.0).

…gram cost report

- ModelSeats + seatPresets + resolveSeat (src/model-seats.ts): one object re-tiers an entire eval program; economy preset uses the fleet-policy ids (cross-family judges, fully priced), frontier is deliberately empty. resolveSeat fails loud on any unset seat; a model id is never a silent default.
- BehaviorExplorer cost governance (src/fuzz): costOf + costBudgetUsd + ledger + onCost. Known cost accrues toward a hard ceiling with control-runtime maxCostUsd semantics (nonnegative finite, stop at >=); unknown-cost runs are counted apart, never folded in as $0. Capsule stats gain costUsd/costUnknownRuns only when tracking was wired, and the HTML capsule shows the cost KPI with the unpriced-run count.
- costReport + attachCostToReport (src/cost-report.ts): thin projection over CostLedger.summary() adding the per-model rollup (unpriced:true marks a lower-bound $); attachCostToReport is the one stamp every artifact uses and refuses to overwrite an existing cost key.

tangletools

✅ Auto-approved PR — `60a9fa3d`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T11:03:17Z

tangletools approved these changes Jun 10, 2026


drewstone merged commit d2888e9 into main Jun 10, 2026
1 check passed

drewstone mentioned this pull request Jun 10, 2026

chore(release): 0.89.0 — frontier program #244

Merged

drewstone merged 1 commit into main from feat/cost-governance

2 participants
