feat(loops): the cost waterfall + three steering modes from new hypercube corners #250

Merged
drewstone merged 1 commit into main from feat/cost-waterfall-steering-modes on Jun 10, 2026

@drewstone

Contributor

100% trajectory observability — createWaterfallCollector

Folds the lifecycle stream (every spawn/settle: shots, analysts, nested agents) into timed, billed spans. The sum of spans IS the run's cost story: a text waterfall (bars scaled to the window, per-span s / $ / tokens / score, by-kind rollups, totals) or structured rows for any chart. Attaches to runAgentic/runBenchmark hooks. With #249's critic billing, every router call in a trajectory is now visible and priced — nothing rides free.
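
The fold from spawn/settle events into billed spans can be sketched roughly as follows. This is a minimal illustration, not the actual `createWaterfallCollector`: the event shape (`LifecycleEvent`), the span fields, and the name `createSketchCollector` are all assumptions made for the example, and per-span score is left out for brevity.

```typescript
// Hypothetical event and span shapes; the real lifecycle stream may differ.
type SpanKind = "shot" | "analyst" | "agent";

interface LifecycleEvent {
  type: "spawn" | "settle";
  id: string;
  kind: SpanKind;
  atMs: number;
  costUsd?: number; // assumed to arrive on settle
  tokens?: number; // assumed to arrive on settle
}

interface Span {
  id: string;
  kind: SpanKind;
  startMs: number;
  endMs: number;
  costUsd: number;
  tokens: number;
}

function createSketchCollector() {
  const open = new Map<string, LifecycleEvent>();
  let spans: Span[] = [];
  return {
    // Pair each settle with its spawn and close the span.
    onEvent(e: LifecycleEvent): void {
      if (e.type === "spawn") {
        open.set(e.id, e);
        return;
      }
      const start = open.get(e.id);
      if (start === undefined) return; // settle without spawn: ignored here
      open.delete(e.id);
      spans.push({
        id: e.id,
        kind: e.kind,
        startMs: start.atMs,
        endMs: e.atMs,
        costUsd: e.costUsd ?? 0,
        tokens: e.tokens ?? 0,
      });
    },
    // Structured rows for any chart.
    rows(): Span[] {
      return [...spans];
    },
    // Totals plus by-kind rollups; the total is the sum of the spans.
    totals() {
      const byKind: Record<string, { costUsd: number; tokens: number }> = {};
      let costUsd = 0;
      let tokens = 0;
      for (const s of spans) {
        const k = byKind[s.kind] ?? { costUsd: 0, tokens: 0 };
        k.costUsd += s.costUsd;
        k.tokens += s.tokens;
        byKind[s.kind] = k;
        costUsd += s.costUsd;
        tokens += s.tokens;
      }
      return { costUsd, tokens, byKind };
    },
    // Text waterfall: one bar per span, scaled to the run's time window.
    render(width = 40): string {
      if (spans.length === 0) return "(no spans)";
      const t0 = Math.min(...spans.map((s) => s.startMs));
      const t1 = Math.max(...spans.map((s) => s.endMs));
      const windowMs = Math.max(1, t1 - t0);
      return spans
        .map((s) => {
          const offset = Math.floor(((s.startMs - t0) / windowMs) * width);
          const len = Math.max(1, Math.round(((s.endMs - s.startMs) / windowMs) * width));
          const bar = (" ".repeat(offset) + "#".repeat(len)).padEnd(width + 1);
          const secs = ((s.endMs - s.startMs) / 1000).toFixed(1);
          return `${bar} ${s.kind} ${secs}s $${s.costUsd.toFixed(4)} ${s.tokens}tok`;
        })
        .join("\n");
    },
    reset(): void {
      open.clear();
      spans = [];
    },
  };
}
```

In this sketch a settle with no matching spawn is dropped; a real collector would more likely surface it as an error, since an unbilled span defeats the purpose.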

Three steering modes (bench arms; AIME; gated for non-inferiority vs the incumbent critic at equal budget)

| mode | hypercube corner | the bet |
| --- | --- | --- |
| structural | deterministic (the ddmin of steering) | a pure function detecting stuck loops / tool-error pileups / no-execution, templated correctives, zero LLM cost; if the critic can't beat this, it isn't earning its calls (the analyst-prompt GEPA null says nobody has checked) |
| contrastive | population information | hybrid strategies pay for multiple rollouts then discard the losers; the critic sees BOTH trajectories (never scores, so the firewall stays intact) and emits the difference that matters; the better attempt continues with the diff |
| belief | decision-theoretic (E8's cheap form) | the critic emits `VERDICT: CONTINUE\|RESTART\|STOP confidence=…` (value-of-continuation) and the strategy obeys: steering the BUDGET, not the content (the steerer-population winner's lineage, domain-generic) |
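
The structural mode's "pure function plus templated corrective" idea might look something like the sketch below. The step shape (`Step`), the three-step window, the detection rules, and the template wording are all assumptions chosen for illustration; the PR does not show the real detector.

```typescript
// Hypothetical trajectory step; the real record shape is not given in the PR.
interface Step {
  toolName?: string; // undefined means the step made no tool call
  toolArgs?: string;
  toolError?: boolean;
}

type StructuralIssue = "stuck-loop" | "tool-error-pileup" | "no-execution";

// Templated correctives (wording is illustrative only).
const CORRECTIVES: Record<StructuralIssue, string> = {
  "stuck-loop": "You repeated the same tool call. Change the arguments or try a different approach.",
  "tool-error-pileup": "Several tool calls in a row failed. Read the error output before retrying.",
  "no-execution": "You have not executed anything recently. Run code to check your reasoning.",
};

// Pure function: same steps in, same verdict out, and no LLM call.
function detectStructuralIssue(steps: readonly Step[], window = 3): StructuralIssue | null {
  const recent = steps.slice(-window);
  const head = recent[0];
  if (recent.length < window || head === undefined) return null;

  const signature = (s: Step): string => `${s.toolName ?? ""}(${s.toolArgs ?? ""})`;
  const allSameCall =
    head.toolName !== undefined && recent.every((s) => signature(s) === signature(head));
  if (allSameCall) return "stuck-loop";
  if (recent.every((s) => s.toolError === true)) return "tool-error-pileup";
  if (recent.every((s) => s.toolName === undefined)) return "no-execution";
  return null;
}

// Returns a steer message, or null when the trajectory looks healthy.
function structuralSteer(steps: readonly Step[]): string | null {
  const issue = detectStructuralIssue(steps);
  return issue === null ? null : CORRECTIVES[issue];
}
```

Because the function is deterministic and free, it works as a floor: any LLM-backed critic arm has to beat it at the same budget to justify its calls.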

Runner: steering-modes.mts — each mode its own arm on identical AIME tasks (the grid pattern), sample as the no-steering control, non-inferiority gates vs refine, per-arm waterfall + by-kind cost rollups (WATERFALL=1 prints the bars).
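
The PR does not spell out how the non-inferiority gate is computed. One simple reading, shown here purely as an assumption, is that an arm passes when its mean score is no worse than the baseline's mean minus a fixed margin. The function names, the margin parameter, and the use of a plain mean (rather than a confidence bound) are all hypothetical.

```typescript
// Assumption: higher scores are better, and each arm has one score per task.
function meanScore(xs: readonly number[]): number {
  if (xs.length === 0) throw new Error("mean of empty score list");
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

interface GateResult {
  pass: boolean;
  armMean: number;
  baselineMean: number;
  delta: number; // armMean - baselineMean
}

// Passes when the arm is at most `margin` worse than the baseline on average.
function nonInferiorityGate(
  arm: readonly number[],
  baseline: readonly number[],
  margin: number,
): GateResult {
  const armMean = meanScore(arm);
  const baselineMean = meanScore(baseline);
  const delta = armMean - baselineMean;
  return { pass: delta >= -margin, armMean, baselineMean, delta };
}
```

With few tasks per arm, a mean-difference rule like this is noisy; a gate based on a confidence interval for the difference would be stricter, at the price of needing more samples.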

+3 waterfall tests (span summation/rollups, downed-span billing, reset). Suite 785 ✓ · typecheck ✓ · lint ✓ (1 pre-existing warning). Bench script compiles against the next dist refresh (cost run holds bench/node_modules).

…cube corners

100% observability: createWaterfallCollector folds the lifecycle stream
(every spawn/settle — shots, analysts, nested agents) into timed, billed
spans; the sum of spans IS the run's cost story — text waterfall (bars,
per-span s/$/tokens/score, by-kind rollups, totals) or structured rows for
any chart. Attaches to runAgentic/runBenchmark hooks.

Three steering modes from unexplored corners (bench arms, AIME, gated
non-inferiority vs the incumbent critic at equal budget):
- structural — the DETERMINISTIC floor (the ddmin of steering): pure
  function detects stuck loops / tool-error pileups / no-execution, emits a
  templated corrective; zero LLM cost. If the critic can't beat this, it is
  not earning its calls.
- contrastive — POPULATION-information steering: the critic sees two
  trajectories (never scores; firewall intact) and emits the difference
  that matters; the better attempt continues with the diff as steer.
- belief — DECISION-theoretic steering (E8's cheap form): the critic emits
  VERDICT CONTINUE|RESTART|STOP + confidence; the strategy obeys — steering
  the budget allocation, not the message content.
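
A strategy that obeys the belief verdict first has to parse it out of the critic's text. The sketch below assumes the confidence is a decimal in [0, 1] written as `confidence=<number>` (the PR elides the exact format), and it falls back to CONTINUE when the verdict is missing or below a threshold. The fallback, the threshold value, and the function names are assumptions for illustration.

```typescript
type Verdict = "CONTINUE" | "RESTART" | "STOP";

interface Belief {
  verdict: Verdict;
  confidence: number | null; // null when no confidence could be parsed
}

// Accepts "VERDICT: X" or "VERDICT X", optionally followed by confidence=<number>.
const VERDICT_RE =
  /VERDICT:?\s*(CONTINUE|RESTART|STOP)\b(?:.*?confidence\s*=\s*([0-9]*\.?[0-9]+))?/i;

function parseBelief(text: string): Belief | null {
  const m = VERDICT_RE.exec(text);
  if (m === null || m[1] === undefined) return null;
  const verdict = m[1].toUpperCase() as Verdict;
  const confidence = m[2] === undefined ? null : Number.parseFloat(m[2]);
  return { verdict, confidence };
}

// Budget decision: only act on RESTART/STOP when the critic is confident enough.
function decide(text: string, minConfidence = 0.6): Verdict {
  const belief = parseBelief(text);
  if (belief === null) return "CONTINUE"; // unparsable output never halts a run
  if (belief.confidence === null || belief.confidence < minConfidence) return "CONTINUE";
  return belief.verdict;
}
```

Defaulting to CONTINUE keeps a malformed critic reply from silently ending or restarting a run, which matters when the verdict steers the budget rather than the message content.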

@tangletools tangletools left a comment

Contributor


✅ Auto-approved PR — b3224800

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T23:01:43Z

@drewstone drewstone merged commit 7324cee into main Jun 10, 2026
1 check passed
@drewstone drewstone deleted the feat/cost-waterfall-steering-modes branch June 10, 2026 23:03

2 participants