
feat(gates): anytime-valid sequential e-process gates #242

Merged
drewstone merged 3 commits into main from feat/sequential-gates
Jun 10, 2026

@drewstone

What

Anytime-valid sequential testing for campaigns: e-process gates so a campaign stops the moment evidence decides, instead of burning a fixed-n budget. No mainstream agent-eval framework ships this (Inspect/Braintrust/LangSmith/promptfoo are all fixed-n); the math is mature (Ville's inequality, betting martingales).

Surface

eProcess({ alpha, maxBet, nullMean }) (src/statistics.ts, root barrel) — pure incremental betting test-martingale for bounded observations (Waudby-Smith & Ramdas, JRSS-B 2024). Wealth W_t = Π (1 + λ_i(x_i − m₀)) with predictable truncated bets computed from PRIOR observations only (λ_i never sees x_i — documented as the load-bearing invariant); decision at W_t ≥ 1/α is valid at ANY data-dependent stopping time.
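
The wealth recursion above can be sketched as follows. This is a minimal illustration, not the code in src/statistics.ts: the name `makeEProcessSketch`, the return shape, and the specific bet rule (a plug-in of the prior mean divided by 0.25, the largest variance possible on [0, 1]) are assumptions. Only the shape of the update is taken from the description: the bet is fixed before the observation is read, and the test rejects once W_t ≥ 1/α.

```typescript
// Hypothetical sketch, not the eProcess in src/statistics.ts.
// One-sided betting test-martingale for observations x in [0, 1].
// H0: E[x] <= nullMean. Reject once wealth >= 1 / alpha.
interface EProcessStep {
  wealth: number;
  rejected: boolean;
}

function makeEProcessSketch(alpha: number, maxBet: number, nullMean: number) {
  if (!(alpha > 0 && alpha < 1)) throw new RangeError("alpha must be in (0, 1)");
  if (!(maxBet > 0)) throw new RangeError("maxBet must be positive");
  if (!(nullMean > 0 && nullMean < 1)) throw new RangeError("nullMean must be in (0, 1)");

  // Truncation: lambda < 1 / nullMean keeps every factor 1 + lambda * (x - nullMean)
  // strictly positive, even in the worst case x = 0.
  const cap = Math.min(maxBet, 0.9 / nullMean);

  let wealth = 1;
  let priorSum = 0;
  let priorCount = 0;
  let rejected = false;

  function observe(x: number): EProcessStep {
    if (!(x >= 0 && x <= 1)) throw new RangeError("x must be in [0, 1]");

    // The bet is computed from PRIOR observations only; x is not consulted here.
    // Assumed rule: (priorMean - nullMean) / 0.25, truncated to [0, cap].
    // With no prior data the bet is zero.
    const priorMean = priorCount === 0 ? nullMean : priorSum / priorCount;
    const lambda = Math.min(cap, Math.max(0, (priorMean - nullMean) / 0.25));

    wealth *= 1 + lambda * (x - nullMean);
    priorSum += x;
    priorCount += 1;

    // Sticky decision: once the threshold is crossed, the rejection stands.
    if (wealth >= 1 / alpha) rejected = true;
    return { wealth, rejected };
  }

  return { observe };
}
```

Because the bet never goes negative, wealth only grows when observations land above `nullMean`, which is what makes the test one-sided. Reordering the two statements so that `lambda` depends on the current `x` would break the supermartingale property and with it the type-I guarantee.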

sequentialPairedGate({ alpha, minN, maxN, preRegistration }) (src/campaign/gates/sequential.ts, /campaign barrel) — conforms to the existing Gate contract (decide(ctx) composes pairHoldout: full-cellId pairing, same granularity as the fixed-n gates) plus a streaming observe(delta) entry. Paired deltas scale x = (d/scale + 1)/2 ∈ [0,1], H0: mean ≤ 1/2.

  • 'promote' → ship; stream-ends-undecided → need_more_work; 'undecided-at-maxN' → hold with an explicit "NOT evidence of no effect" reason, never a silent default.
  • Observing past the pre-registered maxN throws (extending a finished stream reopens optional stopping).
  • Missing ctx.baselineJudgeScores throws (never compares the candidate against itself).
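
A small sketch of the delta scaling and the outcome mapping listed above. The helper names (`scaleDelta`, `toGateAction`), the `"undecided"` state label, the reason strings, and the choice to throw on out-of-range deltas are illustrative assumptions; only the formula x = (d/scale + 1)/2 and the three mappings come from the description.

```typescript
// Hypothetical helpers; not the gate's actual API.
type StreamState = "promote" | "undecided" | "undecided-at-maxN";
type GateAction = "ship" | "need_more_work" | "hold";

// Map a paired delta d in [-scale, scale] onto x in [0, 1]; d = 0 lands on 1/2,
// so "no improvement" sits exactly on the H0 boundary mean <= 1/2.
function scaleDelta(d: number, scale: number): number {
  if (!(scale > 0)) throw new RangeError("scale must be positive");
  if (!(Math.abs(d) <= scale)) throw new RangeError("delta outside [-scale, scale]");
  return (d / scale + 1) / 2;
}

const ACTIONS: Record<StreamState, { action: GateAction; reason: string }> = {
  "promote": { action: "ship", reason: "wealth crossed 1/alpha" },
  "undecided": { action: "need_more_work", reason: "stream ended before a decision" },
  "undecided-at-maxN": {
    action: "hold",
    reason: "budget exhausted; NOT evidence of no effect",
  },
};

function toGateAction(state: StreamState): { action: GateAction; reason: string } {
  return ACTIONS[state];
}
```

Typing the table as `Record<StreamState, ...>` means the compiler flags any state left without a mapping, which is one way to enforce the "never a silent default" rule.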

Pre-registration binding — anytime validity holds only for the pre-registered statistic, so the gate binds the real SignedManifest mechanism: alpha + observation budget + direction + minEffect come FROM the manifest; conflicting explicit options throw; the content hash is verified at construction (sync twin of verifyManifest, same sha256-content scheme over canonicalize). minEffect shifts the null boundary to 1/2 + minEffect/(2·scale) rather than clamping observations (clamping is mean-distorting for skewed deltas).
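
The shifted boundary follows from pushing the delta-scale null through the same scaling used for observations. A short worked sketch, assuming an "increase" direction and a hypothetical helper name `shiftedNullMean` (not part of the PR):

```typescript
// Hypothetical helper; the formula is the one quoted in the description.
// On the delta scale the null is E[d] <= minEffect. Applying x = (d / scale + 1) / 2
// to both sides gives E[x] <= (minEffect / scale + 1) / 2 = 1/2 + minEffect / (2 * scale).
function shiftedNullMean(minEffect: number, scale: number): number {
  if (!(scale > 0)) throw new RangeError("scale must be positive");
  if (!(minEffect >= 0 && minEffect < scale)) {
    throw new RangeError("minEffect must be in [0, scale)");
  }
  return 0.5 + minEffect / (2 * scale);
}

// With no minimum effect the boundary is the plain H0: mean <= 1/2.
const plainBoundary = shiftedNullMean(0, 1);
// A minimum effect of 0.1 on a unit scale moves the boundary to about 0.55.
const shiftedBoundary = shiftedNullMean(0.1, 1);
```

The observations themselves are left untouched; only the mean they are tested against moves.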

sequentialDecide(options) → ImprovementDriver.decide adapter (the grounded early-stop hook in runOptimization): stops the loop once per-scenario evidence of the latest generation's top candidate vs the generation-0 incumbent decides. An undecided process never stops the loop. Consumes each generation exactly once across repeated calls (no double-counting).
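
The "exactly once" and latch behaviour can be sketched with a generation cursor. The real ImprovementDriver.decide signature is not shown in this PR description, so the `GenerationRecord` shape, the `makeOnceOnlyDecide` name, and the boolean "stop" return value are all assumptions made for illustration. The sketch also assumes the history is ordered by generation.

```typescript
// Hypothetical shapes; the real ImprovementDriver.decide contract may differ.
interface GenerationRecord {
  generation: number; // 1-based; generation 0 is the incumbent
  deltas: number[];   // per-scenario deltas vs the generation-0 incumbent
}

// feed(delta) pushes one observation into a sequential test and returns true
// once that test has decided.
function makeOnceOnlyDecide(feed: (delta: number) => boolean) {
  let nextGeneration = 1; // first generation not yet consumed
  let decided = false;    // latch: once true, stays true

  return function decide(history: readonly GenerationRecord[]): boolean {
    for (const record of history) {
      if (decided) break;
      // Skip generations already consumed by an earlier call (no double-counting).
      if (record.generation < nextGeneration) continue;
      for (const delta of record.deltas) {
        if (feed(delta)) {
          decided = true;
          break;
        }
      }
      nextGeneration = record.generation + 1;
    }
    // true means "stop the loop"; an undecided process always returns false.
    return decided;
  };
}
```

Calling `decide` repeatedly with the same history feeds no new observations, so repeated polling of the hook cannot inflate the evidence.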

Honest caveats (in the module docs)

  • Replaces, never layers on, a fixed-n gate — repeatedly peeking at heldoutSignificance/paretoSignificanceGate on a growing sample and stopping early is optional stopping no matter how it is dressed up.
  • Exchangeability: decide(ctx) shuffles paired deltas with a seeded (mulberry32), data-independent permutation by default; stratified betting is named future work.
  • sequentialDecide shares the incumbent's once-measured scores across generations; type-I control is exact insofar as those approximate the incumbent's true means.
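
A seeded, data-independent shuffle of the kind described in the exchangeability caveat can be sketched as below. The `mulberry32` body is the widely circulated public-domain generator; whether the version exported from statistics.ts has exactly this signature is an assumption, and `seededShuffle` is a hypothetical name.

```typescript
// Standard mulberry32 PRNG: 32-bit state, returns floats in [0, 1).
// Assumed to resemble the export in statistics.ts; not copied from it.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven only by the seed and the array length, so the
// permutation cannot depend on the values being shuffled.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = items.slice();
  const rng = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}
```

Data independence matters here because a permutation chosen after looking at the deltas (for example, sorting large deltas first) would itself be a form of peeking.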

Measured (deterministic, 200 seeded mulberry32 streams, in the test names)

| Condition | Result |
| --- | --- |
| H0 (symmetric deltas), maxN=400, α=0.05 | false-promote rate 0.025 (bound: 1.5×α = 0.075) |
| +0.2 mean effect, maxN=400 | median stopping n=68 (17% of the fixed-n budget); p90=88; 0/200 undecided |

Tests

27 new tests (all deterministic, no LLM, no unseeded RNG): H0 false-promote rate, measured stopping savings, wealth-positivity, sticky decisions, minN stopping rule, predictability (permuted-future invariance + zero first bet), gate-contract conformance, pre-registration binding (tamper/conflict/direction/minEffect), decide-adapter latch + no-double-count.

pnpm typecheck && pnpm test && pnpm build all green (2071 passed) after rebasing onto main (post verdict-spine + fuzz merges); biome check src clean.

Notes

  • No version bump (release sequenced by the program lead).
  • Root src/index.ts additions are one contiguous block at EOF; gate exports live on the /campaign barrel with the other gates.
  • mulberry32 is now exported from statistics.ts (seed required) so shuffles and bootstrap resampling share one PRNG; makeRng delegates to it — no behavior change.

Betting test-martingale core (eProcess, Waudby-Smith & Ramdas) in
statistics.ts: H0 E[x] <= nullMean on bounded observations, predictable
truncated bets from prior observations only, decision at wealth >= 1/alpha
(Ville's inequality) — valid at any data-dependent stopping time.

sequentialPairedGate conforms to the existing Gate contract (pairHoldout
pairing, full-cellId granularity) and adds a streaming observe(delta) entry
so campaigns stop the moment evidence decides instead of burning a fixed-n
budget. undecided-at-maxN maps to hold with an explicit not-evidence-of-
no-effect reason, never a silent default. Binds the pre-registration
manifest mechanism: alpha/budget/direction/minEffect come from the signed
manifest (content hash verified at construction); conflicting parameters
throw. minEffect shifts the null boundary rather than clamping observations
(clamping is mean-distorting for skewed deltas).

sequentialDecide adapts the same e-process as an ImprovementDriver.decide
impl: stops the optimization loop once per-scenario evidence vs the
generation-0 incumbent decides; an undecided process never stops the loop.

Measured on 200 seeded streams (mulberry32, maxN=400, alpha=0.05):
false-promote rate 0.025 under H0; median stopping n=68 (17% of the
fixed-n budget) under a +0.2 mean effect.
tangletools previously approved these changes Jun 10, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 6493e1e5

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:50:55Z

tangletools previously approved these changes Jun 10, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 4ddbb20e

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:55:29Z

@tangletools tangletools left a comment

✅ Auto-approved PR — 4631c71a

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:58:38Z

@drewstone drewstone merged commit 30e6987 into main Jun 10, 2026
1 check passed