feat(loops): the cost waterfall + three steering modes from new hypercube corners #250
Merged
…cube corners

100% observability: createWaterfallCollector folds the lifecycle stream (every spawn/settle: shots, analysts, nested agents) into timed, billed spans; the sum of spans IS the run's cost story, rendered as a text waterfall (bars, per-span s / $ / tokens / score, by-kind rollups, totals) or as structured rows for any chart. Attaches to runAgentic/runBenchmark hooks.

Three steering modes from unexplored corners (bench arms, AIME, gated non-inferiority vs the incumbent critic at equal budget):
- structural: the DETERMINISTIC floor (the ddmin of steering). A pure function detects stuck loops / tool-error pileups / no-execution and emits a templated corrective; zero LLM cost. If the critic can't beat this, it is not earning its calls.
- contrastive: POPULATION-information steering. The critic sees two trajectories (never scores; firewall intact) and emits the difference that matters; the better attempt continues with the diff as steer.
- belief: DECISION-theoretic steering (E8's cheap form). The critic emits VERDICT CONTINUE|RESTART|STOP + confidence; the strategy obeys, steering the budget allocation, not the message content.
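The structural mode described above (a pure function that flags stuck loops, tool-error pileups, or no-execution and returns a templated corrective) could be sketched roughly as follows. This is an illustration only: the `Step` shape, the thresholds, the template wording, and the names `detect` and `structuralSteer` are all assumptions, not the code in this PR.

```typescript
// Hypothetical step shape; the real trajectory types are not shown in this PR text.
interface Step {
  tool: string | null; // tool invoked on this step, or null if no tool was called
  args: string;        // serialized tool arguments
  error: boolean;      // whether the tool call failed
}

type Finding = "stuck-loop" | "tool-error-pileup" | "no-execution";

// Assumed thresholds, chosen only for illustration.
interface Thresholds { repeat: number; errors: number; idle: number }
const DEFAULTS: Thresholds = { repeat: 3, errors: 3, idle: 4 };

// Pure function: the same steps always yield the same finding; no model call.
function detect(steps: Step[], t: Thresholds = DEFAULTS): Finding | null {
  // Stuck loop: the last `repeat` steps are the identical tool call.
  const recent = steps.slice(-t.repeat);
  const first = recent[0];
  if (
    recent.length === t.repeat && first !== undefined && first.tool !== null &&
    recent.every((s) => s.tool === first.tool && s.args === first.args)
  ) {
    return "stuck-loop";
  }

  // Tool-error pileup: the last `errors` tool calls all failed.
  const calls = steps.filter((s) => s.tool !== null).slice(-t.errors);
  if (calls.length === t.errors && calls.every((s) => s.error)) {
    return "tool-error-pileup";
  }

  // No execution: the last `idle` steps never invoked a tool.
  const idle = steps.slice(-t.idle);
  if (idle.length === t.idle && idle.every((s) => s.tool === null)) {
    return "no-execution";
  }
  return null;
}

// Templated correctives (wording is illustrative).
const TEMPLATES: Record<Finding, string> = {
  "stuck-loop": "You repeated the same call several times. Try a different approach.",
  "tool-error-pileup": "Recent tool calls all failed. Re-check the arguments before retrying.",
  "no-execution": "No tool has been run recently. Execute something to test your reasoning.",
};

// Returns a steer message, or null when nothing structural is wrong.
function structuralSteer(steps: Step[]): string | null {
  const finding = detect(steps);
  return finding === null ? null : TEMPLATES[finding];
}
```

Because the detector is deterministic and makes no LLM calls, it can serve as the baseline an LLM critic has to beat at equal budget.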
tangletools
approved these changes
Jun 10, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — b3224800
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T23:01:43Z
100% trajectory observability: createWaterfallCollector

Folds the lifecycle stream (every spawn/settle: shots, analysts, nested agents) into timed, billed spans. The sum of spans IS the run's cost story: a text waterfall (bars scaled to the window, per-span s / $ / tokens / score, by-kind rollups, totals) or structured rows for any chart. Attaches to runAgentic/runBenchmark hooks. With #249's critic billing, every router call in a trajectory is now visible and priced; nothing rides free.

Three steering modes (bench arms; AIME; gated vs the incumbent critic at equal budget)

- structural
- contrastive
- belief: VERDICT: CONTINUE|RESTART|STOP confidence=… (value-of-continuation), and the strategy obeys: steering the BUDGET, not the content (the steerer-population winner's lineage, domain-generic)

Runner: steering-modes.mts. Each mode is its own arm on identical AIME tasks (the grid pattern), with sample as the no-steering control, non-inferiority gates vs refine, and per-arm waterfall + by-kind cost rollups (WATERFALL=1 prints the bars).

+3 waterfall tests (span summation/rollups, downed-span billing, reset). Suite 785 ✓ · typecheck ✓ · lint ✓ (1 pre-existing warning). Bench script compiles against the next dist refresh (cost run holds bench/node_modules).
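To make the span-folding idea concrete, here is a minimal sketch of a collector that turns spawn/settle events into billed spans, sums them, rolls them up by kind, and renders bars scaled to the run window. The event shape, field names, and `createCollectorSketch` are hypothetical; the actual `createWaterfallCollector` signature and its handling of downed spans are not shown in this PR text and are not modeled here.

```typescript
// Hypothetical lifecycle events; the real hook payloads may differ.
type LifecycleEvent =
  | { type: "spawn"; id: string; kind: string; atMs: number }
  | { type: "settle"; id: string; atMs: number; usd: number; tokens: number };

interface Span {
  id: string;
  kind: string;
  startMs: number;
  endMs: number;
  usd: number;
  tokens: number;
}

function createCollectorSketch() {
  const open = new Map<string, { kind: string; startMs: number }>();
  let spans: Span[] = [];

  return {
    // Fold one event: spawn opens a span, settle closes and bills it.
    onEvent(e: LifecycleEvent): void {
      if (e.type === "spawn") {
        open.set(e.id, { kind: e.kind, startMs: e.atMs });
        return;
      }
      const started = open.get(e.id);
      if (!started) return; // settle without a spawn is ignored in this sketch
      open.delete(e.id);
      spans.push({
        id: e.id, kind: started.kind, startMs: started.startMs,
        endMs: e.atMs, usd: e.usd, tokens: e.tokens,
      });
    },
    // Structured rows, suitable for feeding any chart.
    rows(): Span[] {
      return [...spans];
    },
    // The run's cost is the sum over settled spans.
    totalUsd(): number {
      return spans.reduce((sum, s) => sum + s.usd, 0);
    },
    // Cost rollup per span kind (shot, analyst, ...).
    byKind(): Record<string, number> {
      const out: Record<string, number> = {};
      for (const s of spans) out[s.kind] = (out[s.kind] ?? 0) + s.usd;
      return out;
    },
    // Text waterfall: one bar per span, scaled to the overall time window.
    render(width = 40): string {
      if (spans.length === 0) return "";
      const t0 = Math.min(...spans.map((s) => s.startMs));
      const t1 = Math.max(...spans.map((s) => s.endMs));
      const totalMs = Math.max(1, t1 - t0);
      return spans.map((s) => {
        const offset = Math.round(((s.startMs - t0) / totalMs) * width);
        const len = Math.max(1, Math.round(((s.endMs - s.startMs) / totalMs) * width));
        const pad = " ".repeat(Math.max(0, width - offset - len));
        const secs = ((s.endMs - s.startMs) / 1000).toFixed(1);
        return `${" ".repeat(offset)}${"#".repeat(len)}${pad} ${s.kind} ${secs}s $${s.usd.toFixed(4)} ${s.tokens}tok`;
      }).join("\n");
    },
    reset(): void {
      open.clear();
      spans = [];
    },
  };
}
```

In this sketch only settled spans are billed, so a span that never settles contributes nothing to the total; a real collector would need an explicit policy for that case.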