Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 18 additions & 2 deletions docs/research/adaptive-computation-program.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,17 @@ the default unfinished-items analyst in the existing steerer harness on EOPS. Sc
criterion 1: decision regret, not prediction quality alone — a state that predicts
better but does not choose better fails.

Status: DESIGNED (the steerer-population fitness harness exists).
Status: DESIGNED (the steerer-population fitness harness exists). **First arm RAN and was
AUTOPSIED (2026-06-11): the −37.5pp result does NOT measure the corner — it measured a
broken channel.** Ground truth (per-task artifact + a parser probe): the controller
stopped after one shot on 14/16 unsolved tasks because its verdicts never existed —
`critique()` routes through `observe()`'s findings-extraction protocol, which overrides
the verdict-format instruction (probe: the analyst returned an ordinary recommendation,
no `VERDICT:` line), so the body received nulls/plain steers, never a decision.
Classification: infra/design-flaw, not real-result. **The corner is unmeasured.** Next
arm requires a verdict-capable channel (a raw firewalled analyst call on the strategy
ctx — trace-only input, raw text out — bypassing the findings schema) plus calibration;
only then does the corner get a verdict.

## E1-coarse — the leakage-bounded author channel

Expand Down Expand Up @@ -111,7 +121,13 @@ explicit selection rules, fresh-slice promotion gate, band-aware holdout
(`band.holdoutPoolN` — the estimand "paired lift on headroom tasks", pre-registered).
Each run also feeds E2 for free (gzip-bits per authored artifact vs holdout gap).

Status: RUNNING (n=24/budget-4 family; band + tool-selection arms land in the next run).
Status: RUNNING → three family members completed (2026-06-10/11): powered n=24 HOLD;
discriminating run (tool introspection) HOLD with the first train-side champion
displacements + a REPRODUCIBLE certificate on the authored champion; cost-objective run
#1 HOLD via the funnel misalignment (the search tie-band was stricter than the gate —
fixed, the law recorded in HARNESS.md), rerun in flight. The replicated positives so far
are cost-shaped (author at ~2.5× lower cost, ×3) and interaction-shaped (σ×κ promoted
×2 on AIME). Live ledger: `.evolve/current.json` + the findings gist.

## Rejected without prejudice (named triggers, not dead)

Expand Down
35 changes: 35 additions & 0 deletions docs/research/factorial-ablation-design.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,3 +82,38 @@ future claim a named cell instead of an ad-hoc run.
leakage-bounded channel (`lossesDetail: 'binary'`) is the dose-control arm (E1-coarse).
- **One-domain exposure**: EOPS-itsm is the workhorse; cross-domain cells (commit0, verifier-math)
are the same grid at a different `ENV` — the transfer claim needs at least two.


## The formal object (added 2026-06-11)

A configuration is a point in a finite product lattice
`c = (σ, α, γ, κ, m, d, b) ∈ Σ × A × Γ × K × M × D × B`; the response is vector-valued
`J(c) = (score, cost, latency)`, estimated per cell from paired task-level deltas. The
"hypercube" is the binary on/off projection of (σ, α, γ, κ) — our cell-naming scheme.
In design-of-experiments language this is a fractional factorial; the estimands are main
effects and two-way interactions as paired contrasts, each with its own promotion rule.

**The steering subspace is a true {0,1}⁴**: who steers (LLM/deterministic) × what
information (single trajectory/population) × what is steered (content/budget) × what form
(advice/verdict). Measured corners (AIME n=16, 2026-06-11): `refine` = (LLM, single,
content, advice) 50.0%; `structural` flips axis 1 → 43.8% (the deterministic floor
captures ~⅔ of the lift; the LLM critic's marginal value is unproven at this n);
`contrastive` flips axis 2 → 25.0% (significantly negative — context dilution);
`belief` flips axes 3+4 → 12.5% — **autopsied: the channel was broken, not the idea**
(the findings pipeline strips the verdict format; the controller never received a
decision; corner UNMEASURED — see adaptive-computation-program.md E8). Twelve corners remain
unexplored and enumerable (e.g. deterministic+budget = rule-based early stopping). Note
the exploration shape: a Hamming ball of radius 1–2 around the incumbent — and the
measured σ×κ interaction (neither factor promotes alone; the combination promoted, ×2)
is the standing exhibit for factorial over one-factor-at-a-time designs.

## Measured cells update (2026-06-11)

- σ×κ (`steer+compress`): **PROMOTED non-inferior-and-cheaper ×2** (incl. on honest
critic billing; savings CI[+0.0006,+0.0061]; the compression call broke even after 1
task; Δlatency +38.8s CI[19.4,56.7] = steering's real price).
- σ on stateless hard math: +18.7pp (50.0 vs 31.3) — refines the domain-boundary law
(carried messages behave like state on competition math).
- Funnel-alignment law (from the cost run): the search-side champion tie-band must be no
stricter than the gate's tolerance, or promotable candidates die in search.
- The model×cell matrix (8 workers × 4 cells, identical tasks) is the M-axis sweep.
Loading