diff --git a/docs/research/adaptive-computation-program.md b/docs/research/adaptive-computation-program.md index 1dded32..f8d1b94 100644 --- a/docs/research/adaptive-computation-program.md +++ b/docs/research/adaptive-computation-program.md @@ -78,7 +78,17 @@ the default unfinished-items analyst in the existing steerer harness on EOPS. Sc criterion 1: decision regret, not prediction quality alone — a state that predicts better but does not choose better fails. -Status: DESIGNED (the steerer-population fitness harness exists). +Status: DESIGNED (the steerer-population fitness harness exists). **First arm RAN and was +AUTOPSIED (2026-06-11): the −37.5pp result does NOT measure the corner — it measured a +broken channel.** Ground truth (per-task artifact + a parser probe): the controller +stopped after one shot on 14/16 unsolved tasks because its verdicts never existed — +`critique()` routes through `observe()`'s findings-extraction protocol, which overrides +the verdict-format instruction (probe: the analyst returned an ordinary recommendation, +no `VERDICT:` line), so the body received nulls/plain steers, never a decision. +Classification: infra/design-flaw, not real-result. **The corner is unmeasured.** Next +arm requires a verdict-capable channel (a raw firewalled analyst call on the strategy +ctx — trace-only input, raw text out — bypassing the findings schema) plus calibration; +only then does the corner get a verdict. ## E1-coarse — the leakage-bounded author channel @@ -111,7 +121,13 @@ explicit selection rules, fresh-slice promotion gate, band-aware holdout (`band.holdoutPoolN` — the estimand "paired lift on headroom tasks", pre-registered). Each run also feeds E2 for free (gzip-bits per authored artifact vs holdout gap). -Status: RUNNING (n=24/budget-4 family; band + tool-selection arms land in the next run). +Status: RUNNING → three family members completed (2026-06-10/11): powered n=24 HOLD; +discriminating run (tool introspection) HOLD with the first train-side champion +displacements + a REPRODUCIBLE certificate on the authored champion; cost-objective run +#1 HOLD via the funnel misalignment (the search tie-band was stricter than the gate — +fixed, the law recorded in HARNESS.md), rerun in flight. The replicated positives so far +are cost-shaped (author at ~2.5× lower cost, ×3) and interaction-shaped (σ×κ promoted +×2 on AIME). Live ledger: `.evolve/current.json` + the findings gist. ## Rejected without prejudice (named triggers, not dead) diff --git a/docs/research/factorial-ablation-design.md b/docs/research/factorial-ablation-design.md index 5df84d7..547245f 100644 --- a/docs/research/factorial-ablation-design.md +++ b/docs/research/factorial-ablation-design.md @@ -82,3 +82,38 @@ future claim a named cell instead of an ad-hoc run. leakage-bounded channel (`lossesDetail: 'binary'`) is the dose-control arm (E1-coarse). - **One-domain exposure**: EOPS-itsm is the workhorse; cross-domain cells (commit0, verifier-math) are the same grid at a different `ENV` — the transfer claim needs at least two. + + +## The formal object (added 2026-06-11) + +A configuration is a point in a finite product lattice +`c = (σ, α, γ, κ, m, d, b) ∈ Σ × A × Γ × K × M × D × B`; the response is vector-valued +`J(c) = (score, cost, latency)`, estimated per cell from paired task-level deltas. The +"hypercube" is the binary on/off projection of (σ, α, γ, κ) — our cell-naming scheme. +In design-of-experiments language this is a fractional factorial; the estimands are main +effects and two-way interactions as paired contrasts, each with its own promotion rule. + +**The steering subspace is a true {0,1}⁴**: who steers (LLM/deterministic) × what +information (single trajectory/population) × what is steered (content/budget) × what form +(advice/verdict). Measured corners (AIME n=16, 2026-06-11): `refine` = (LLM, single, +content, advice) 50.0%; `structural` flips axis 1 → 43.8% (the deterministic floor +captures ~⅔ of the lift; the LLM critic's marginal value is unproven at this n); +`contrastive` flips axis 2 → 25.0% (significantly negative — context dilution); +`belief` flips axes 3+4 → 12.5% — **autopsied: the channel was broken, not the idea** +(the findings pipeline strips the verdict format; the controller never received a +decision; corner UNMEASURED — see adaptive-computation-program.md E8). Twelve corners remain +unexplored and enumerable (e.g. deterministic+budget = rule-based early stopping). Note +the exploration shape: a Hamming ball of radius 1–2 around the incumbent — and the +measured σ×κ interaction (neither factor promotes alone; the combination promoted, ×2) +is the standing exhibit for factorial over one-factor-at-a-time designs. + +## Measured cells update (2026-06-11) + +- σ×κ (`steer+compress`): **PROMOTED non-inferior-and-cheaper ×2** (incl. on honest + critic billing; savings CI[+0.0006,+0.0061]; the compression call broke even after 1 + task; Δlatency +38.8s CI[19.4,56.7] = steering's real price). +- σ on stateless hard math: +18.7pp (50.0 vs 31.3) — refines the domain-boundary law + (carried messages behave like state on competition math). +- Funnel-alignment law (from the cost run): the search-side champion tie-band must be no + stricter than the gate's tolerance, or promotable candidates die in search. +- The model×cell matrix (8 workers × 4 cells, identical tasks) is the M-axis sweep.