feat(gates): anytime-valid sequential e-process gates #242
Betting test-martingale core (eProcess, Waudby-Smith & Ramdas) in statistics.ts: H0 E[x] <= nullMean on bounded observations, predictable truncated bets from prior observations only, decision at wealth >= 1/alpha (Ville's inequality), valid at any data-dependent stopping time.

sequentialPairedGate conforms to the existing Gate contract (pairHoldout pairing, full-cellId granularity) and adds a streaming observe(delta) entry so campaigns stop the moment evidence decides instead of burning a fixed-n budget. undecided-at-maxN maps to hold with an explicit not-evidence-of-no-effect reason, never a silent default.

Binds the pre-registration manifest mechanism: alpha/budget/direction/minEffect come from the signed manifest (content hash verified at construction); conflicting parameters throw. minEffect shifts the null boundary rather than clamping observations (clamping is mean-distorting for skewed deltas).

sequentialDecide adapts the same e-process as an ImprovementDriver.decide impl: it stops the optimization loop once per-scenario evidence vs the generation-0 incumbent decides; an undecided process never stops the loop.

Measured on 200 seeded streams (mulberry32, maxN=400, alpha=0.05): false-promote rate 0.025 under H0; median stopping n=68 (17% of the fixed-n budget) under a +0.2 mean effect.
tangletools
left a comment
✅ Auto-approved PR — 6493e1e5
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:50:55Z
tangletools
left a comment
✅ Auto-approved PR — 4ddbb20e
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:55:29Z
tangletools
left a comment
✅ Auto-approved PR — 4631c71a
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:58:38Z
What
Anytime-valid sequential testing for campaigns: e-process gates so a campaign stops the moment evidence decides, instead of burning a fixed-n budget. No mainstream agent-eval framework ships this (Inspect/Braintrust/LangSmith/promptfoo are all fixed-n); the math is mature (Ville's inequality, betting martingales).
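For reference, the guarantee that paragraph leans on can be written out in one line. This is a sketch in assumed notation, not quoted from the PR: x_i in [0,1] are the observations, m_0 is the null mean, and each bet λ_i ≥ 0 is chosen from x_1, ..., x_{i-1} only and kept small enough that every factor stays nonnegative. If the conditional mean of each x_i is at most m_0 under H0, the wealth W_t is a nonnegative supermartingale starting at 1, and Ville's inequality bounds the chance that it ever reaches 1/α:

```latex
\[
W_t \;=\; \prod_{i=1}^{t} \bigl(1 + \lambda_i\,(x_i - m_0)\bigr),
\qquad
\Pr_{H_0}\!\Bigl(\exists\, t \ge 1 :\; W_t \ge \tfrac{1}{\alpha}\Bigr) \;\le\; \alpha .
\]
```

Because the bound covers every t at once, stopping the first time W_t crosses 1/α (or at any other data-dependent time) keeps the false-promote probability at or below α.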
Surface
eProcess({ alpha, maxBet, nullMean }) (src/statistics.ts, root barrel): pure incremental betting test-martingale for bounded observations (Waudby-Smith & Ramdas, JRSS-B 2024). Wealth W_t = Π (1 + λ_i (x_i − m₀)) with predictable truncated bets computed from PRIOR observations only (λ_i never sees x_i; documented as the load-bearing invariant); the decision at W_t ≥ 1/α is valid at ANY data-dependent stopping time.

sequentialPairedGate({ alpha, minN, maxN, preRegistration }) (src/campaign/gates/sequential.ts, /campaign barrel): conforms to the existing Gate contract (decide(ctx) composes pairHoldout: full-cellId pairing, same granularity as the fixed-n gates) plus a streaming observe(delta) entry. Paired deltas scale to x = (d/scale + 1)/2 ∈ [0,1], H0: mean ≤ 1/2. 'promote' → ship; stream-ends-undecided → need_more_work; 'undecided-at-maxN' → hold with an explicit "NOT evidence of no effect" reason, never a silent default. Missing ctx.baselineJudgeScores throws (never compares the candidate against itself).

Pre-registration binding: anytime validity holds only for the pre-registered statistic, so the gate binds the real SignedManifest mechanism. alpha + observation budget + direction + minEffect come FROM the manifest; conflicting explicit options throw; the content hash is verified at construction (sync twin of verifyManifest, same sha256-content scheme over canonicalize). minEffect shifts the null boundary to 1/2 + minEffect/(2·scale) rather than clamping observations (clamping is mean-distorting for skewed deltas).

sequentialDecide(options): ImprovementDriver.decide adapter (the grounded early-stop hook in runOptimization). It stops the loop once per-scenario evidence of the latest generation's top candidate vs the generation-0 incumbent decides. An undecided process never stops the loop. Consumes each generation exactly once across repeated calls (no double-counting).

Honest caveats (in the module docs)
Running heldoutSignificance/paretoSignificanceGate on a growing sample and stopping early is optional stopping no matter how it is dressed up.
decide(ctx) shuffles paired deltas with a seeded (mulberry32), data-independent permutation by default; stratified betting is named future work.
sequentialDecide shares the incumbent's once-measured scores across generations; type-I control is exact insofar as those approximate the incumbent's true means.

Measured (deterministic, 200 seeded mulberry32 streams, in the test names)
With maxN=400 and alpha=0.05: false-promote rate 0.025 under H0; median stopping n=68 (17% of the fixed-n budget) under a +0.2 mean effect.
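To make the wealth update and the delta scaling from the Surface section concrete, here is a minimal sketch. It is not the code in src/statistics.ts: sketchEProcess, scaleDelta, and shiftedNull are hypothetical names. The Kelly-style plug-in bet and the reading of maxBet as the largest fraction of wealth risked per step are assumptions chosen for illustration. What it does take from the description is the shape of the process: bets use prior observations only, the first bet is zero, wealth stays positive, and the decision latches at W_t ≥ 1/α.

```typescript
// Hypothetical sketch of a one-sided betting e-process for H0: E[x] <= nullMean,
// with observations x in [0, 1]. Names and the bet rule are illustrative only.
type SketchDecision = "promote" | "undecided";

interface SketchOptions {
  alpha: number;    // significance level, 0 < alpha < 1
  nullMean: number; // m0, 0 < nullMean < 1
  maxBet?: number;  // assumed meaning: max fraction of wealth at risk per step
}

function sketchEProcess({ alpha, nullMean, maxBet = 0.5 }: SketchOptions) {
  if (!(alpha > 0 && alpha < 1)) throw new RangeError("alpha must be in (0, 1)");
  if (!(nullMean > 0 && nullMean < 1)) throw new RangeError("nullMean must be in (0, 1)");
  if (!(maxBet > 0 && maxBet < 1)) throw new RangeError("maxBet must be in (0, 1)");

  // Truncation: bet <= maxBet / m0 keeps each factor >= 1 - maxBet > 0.
  const cap = maxBet / nullMean;
  let wealth = 1;
  let n = 0;
  let sum = 0;
  let sumSq = 0;
  let decided = false;

  return {
    observe(x: number): { wealth: number; n: number; decision: SketchDecision } {
      if (!(x >= 0 && x <= 1)) throw new RangeError("observation must be in [0, 1]");

      // Predictable bet: computed BEFORE x is folded into the running stats.
      let bet = 0; // zero first bet, since there is no prior evidence yet
      if (n > 0) {
        const mean = sum / n;
        const variance = Math.max(sumSq / n - mean * mean, 0);
        const edge = mean - nullMean;
        const denom = variance + edge * edge;
        // One-sided: never bet against the alternative, never exceed the cap.
        bet = denom > 0 ? Math.min(Math.max(edge / denom, 0), cap) : 0;
      }

      wealth *= 1 + bet * (x - nullMean);
      n += 1;
      sum += x;
      sumSq += x * x;
      if (wealth >= 1 / alpha) decided = true; // sticky once crossed

      const decision: SketchDecision = decided ? "promote" : "undecided";
      return { wealth, n, decision };
    },
  };
}

// Map a paired delta d in [-scale, scale] onto [0, 1]; d = 0 lands on 1/2.
function scaleDelta(d: number, scale: number): number {
  return (d / scale + 1) / 2;
}

// Null boundary after shifting by a minimum effect, in the scaled units.
function shiftedNull(minEffect: number, scale: number): number {
  return 0.5 + minEffect / (2 * scale);
}
```

A caller would feed scaleDelta(delta, scale) into observe one pair at a time and stop as soon as decision becomes "promote". Passing shiftedNull(minEffect, scale) as nullMean moves the boundary instead of altering the observations.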
Tests
27 new tests (all deterministic, no LLM, no unseeded RNG): H0 false-promote rate, measured stopping savings, wealth-positivity, sticky decisions, minN stopping rule, predictability (permuted-future invariance + zero first bet), gate-contract conformance, pre-registration binding (tamper/conflict/direction/minEffect), decide-adapter latch + no-double-count.
pnpm typecheck && pnpm test && pnpm build all green (2071 passed) after rebasing onto main (post verdict-spine + fuzz merges); biome check src clean.
Notes
src/index.ts additions are one contiguous block at EOF; gate exports live on the /campaign barrel with the other gates.
mulberry32 is now exported from statistics.ts (seed required) so shuffles and bootstrap resampling share one PRNG; makeRng delegates to it, with no behavior change.
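For readers unfamiliar with mulberry32, the sketch below shows the widely circulated form of that generator and a seeded Fisher-Yates shuffle built on it. The export in statistics.ts may differ in signature or details, and seededShuffle is a hypothetical helper, not a name from the PR. The property it illustrates is the one the caveats rely on: the permutation depends only on the seed and the array length, never on the values being shuffled.

```typescript
// Commonly published mulberry32 PRNG: 32-bit state, returns floats in [0, 1).
// Assumed form; the project's export may differ.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: Fisher-Yates shuffle driven by a required seed.
// Returns a copy; the swap indices come from the PRNG alone, so the
// permutation is independent of the data.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = items.slice();
  const rand = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}
```

Requiring the seed at the call site, rather than defaulting it, keeps every shuffle reproducible from the test or manifest that invoked it.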