feat(gates): anytime-valid sequential e-process gates by drewstone · Pull Request #242 · tangle-network/agent-eval

drewstone · 2026-06-10T10:50:49Z

What

Anytime-valid sequential testing for campaigns: e-process gates so a campaign stops the moment evidence decides, instead of burning a fixed-n budget. No mainstream agent-eval framework ships this (Inspect/Braintrust/LangSmith/promptfoo are all fixed-n); the math is mature (Ville's inequality, betting martingales).

Surface

eProcess({ alpha, maxBet, nullMean }) (src/statistics.ts, root barrel) — pure incremental betting test-martingale for bounded observations (Waudby-Smith & Ramdas, JRSS-B 2024). Wealth W_t = Π (1 + λ_i(x_i − m₀)) with predictable truncated bets computed from PRIOR observations only (λ_i never sees x_i — documented as the load-bearing invariant); decision at W_t ≥ 1/α is valid at ANY data-dependent stopping time.
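A minimal sketch of the wealth recursion described above, for illustration only. `makeSketchEProcess`, its return shape, the plug-in Kelly-style bet, and the reading of `maxBet` as a cap on λ·m₀ are assumptions; the actual eProcess in src/statistics.ts may bet and truncate differently. The sketch keeps the stated invariants: λ_i is computed before x_i is folded into the running statistics, the first bet is zero, and the decision is sticky once W_t ≥ 1/α. Because every bet is non-negative and capped below 1/m₀, wealth stays positive and is a supermartingale under H0, which is what Ville's inequality needs.

```typescript
// Hypothetical sketch; NOT the eProcess exported from src/statistics.ts.
interface SketchOptions { alpha: number; maxBet: number; nullMean: number }
interface SketchState { wealth: number; n: number; rejected: boolean }

function makeSketchEProcess({ alpha, maxBet, nullMean }: SketchOptions) {
  if (!(alpha > 0 && alpha < 1)) throw new RangeError("alpha must be in (0, 1)");
  if (!(nullMean > 0 && nullMean < 1)) throw new RangeError("nullMean must be in (0, 1)");
  // Assumption: maxBet is the largest allowed value of lambda * nullMean.
  if (!(maxBet > 0 && maxBet < 1)) throw new RangeError("maxBet must be in (0, 1)");

  let wealth = 1;
  let n = 0;
  let sum = 0;
  let sumSq = 0;
  let rejected = false;

  // Bet for the NEXT observation, using prior observations only.
  function nextBet(): number {
    if (n === 0) return 0; // zero first bet: nothing has been observed yet
    const mean = sum / n;
    const diff = mean - nullMean;
    if (diff <= 0) return 0; // one-sided test: never bet against the alternative
    const variance = Math.max(0, sumSq / n - mean * mean);
    const kelly = diff / (variance + diff * diff); // plug-in approximate Kelly bet
    return Math.min(kelly, maxBet / nullMean); // truncation keeps 1 + lambda * (x - m0) > 0
  }

  return {
    observe(x: number): SketchState {
      if (!(x >= 0 && x <= 1)) throw new RangeError("x must be in [0, 1]");
      const lambda = nextBet(); // computed BEFORE x is folded in
      wealth *= 1 + lambda * (x - nullMean);
      n += 1;
      sum += x;
      sumSq += x * x;
      if (wealth >= 1 / alpha) rejected = true; // sticky once crossed
      return { wealth, n, rejected };
    },
  };
}
```

Under these assumptions a stream that sits exactly on the null mean never moves the wealth off 1, while a stream of all-ones observations grows it geometrically after the first (zero-bet) observation.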

sequentialPairedGate({ alpha, minN, maxN, preRegistration }) (src/campaign/gates/sequential.ts, /campaign barrel) conforms to the existing Gate contract (decide(ctx) composes pairHoldout: full-cellId pairing, same granularity as the fixed-n gates) and adds a streaming observe(delta) entry. Each paired delta d is rescaled to x = (d/scale + 1)/2 ∈ [0,1], and the null is H0: E[x] ≤ 1/2 (equivalently E[d] ≤ 0 before any minEffect shift).

'promote' → ship; stream-ends-undecided → need_more_work; 'undecided-at-maxN' → hold with an explicit NOT evidence of no effect reason — never a silent default.
Observing past the pre-registered maxN throws (extending a finished stream reopens optional stopping).
Missing ctx.baselineJudgeScores throws (never compares the candidate against itself).
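The rescaling and the three outcomes above can be sketched as follows. Only the formula x = (d/scale + 1)/2, the decision names (ship, need_more_work, hold), and the throw-past-maxN rule come from this description; `rescaleDelta`, `toVerdict`, `makeBudget`, the `"undecided"` status label, and the reason strings are hypothetical names chosen for illustration, not the gate's real API.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not the gate's real API.
type StreamStatus = "promote" | "undecided" | "undecided-at-maxN";

interface Verdict {
  decision: "ship" | "need_more_work" | "hold";
  reason: string;
}

/** Map a paired delta d in [-scale, scale] to x in [0, 1]; d = 0 lands on 1/2. */
function rescaleDelta(d: number, scale: number): number {
  if (!(scale > 0)) throw new RangeError("scale must be positive");
  if (!(Math.abs(d) <= scale)) throw new RangeError("delta must lie in [-scale, scale]");
  return (d / scale + 1) / 2;
}

/** Every status maps to an explicit decision with a stated reason. */
function toVerdict(status: StreamStatus): Verdict {
  switch (status) {
    case "promote":
      return { decision: "ship", reason: "wealth crossed 1/alpha" };
    case "undecided":
      return { decision: "need_more_work", reason: "stream ended before a decision" };
    case "undecided-at-maxN":
      return { decision: "hold", reason: "budget exhausted: NOT evidence of no effect" };
  }
}

/** Counts observations and refuses to go past the pre-registered budget. */
function makeBudget(maxN: number) {
  let n = 0;
  return {
    spend(): number {
      if (n >= maxN) {
        throw new Error(`observation ${n + 1} exceeds pre-registered maxN=${maxN}`);
      }
      n += 1;
      return n;
    },
  };
}
```

Throwing rather than silently ignoring an observation past maxN keeps the stopping rule tied to what was pre-registered.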

Pre-registration binding — anytime validity holds only for the pre-registered statistic, so the gate binds the real SignedManifest mechanism: alpha + observation budget + direction + minEffect come FROM the manifest; conflicting explicit options throw; the content hash is verified at construction (sync twin of verifyManifest, same sha256-content scheme over canonicalize). minEffect shifts the null boundary to 1/2 + minEffect/(2·scale) rather than clamping observations (clamping is mean-distorting for skewed deltas).
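The shifted null boundary follows from the rescaling: x is an increasing affine function of d, so H0: E[d] ≤ minEffect is the same statement as E[x] ≤ (minEffect/scale + 1)/2 = 1/2 + minEffect/(2·scale). A small sketch of that arithmetic, where `shiftedNullMean` is a hypothetical helper and the restriction 0 ≤ minEffect < scale is an assumption made here so the boundary stays inside [1/2, 1):

```typescript
// Hypothetical helper; illustrates the boundary formula only.
function shiftedNullMean(minEffect: number, scale: number): number {
  if (!(scale > 0)) throw new RangeError("scale must be positive");
  // Assumption: a non-negative minimum effect strictly below the scale.
  if (!(minEffect >= 0 && minEffect < scale)) {
    throw new RangeError("minEffect must lie in [0, scale)");
  }
  return 0.5 + minEffect / (2 * scale);
}

// Worked examples:
//   minEffect = 0,   scale = 1  ->  boundary 0.5   (plain "no improvement" null)
//   minEffect = 0.5, scale = 2  ->  boundary 0.625
const plainNull = shiftedNullMean(0, 1);
const shifted = shiftedNullMean(0.5, 2);
```

The observations themselves are untouched; only the value they are tested against moves, so the sample mean of x still estimates the rescaled mean delta.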

sequentialDecide(options) — ImprovementDriver.decide adapter (the grounded early-stop hook in runOptimization): stops the loop once per-scenario evidence of the latest generation's top candidate vs the generation-0 incumbent decides. An undecided process never stops the loop. Consumes each generation exactly once across repeated calls (no double-counting).
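One way to get the consume-once and latch behaviour described above is to remember the highest generation already fed to the process. This is a sketch under assumptions: `GenerationEvidence`, `makeSketchDecide`, and the boolean `observe` callback are hypothetical, and the real ImprovementDriver.decide signature is not shown in this PR description.

```typescript
// Hypothetical sketch; not the real sequentialDecide or ImprovementDriver.decide shape.
interface GenerationEvidence {
  generation: number; // 1-based; generation 0 is the incumbent baseline
  xs: readonly number[]; // per-scenario observations, already rescaled to [0, 1]
}

/**
 * `observe` feeds one observation to an e-process and returns true once the
 * process has decided. The returned `decide` may be called repeatedly with a
 * growing history; each generation is consumed at most once.
 */
function makeSketchDecide(observe: (x: number) => boolean) {
  let lastConsumed = 0; // highest generation already fed to the process
  let decided = false; // latch: once decided, always decided

  return function decide(history: readonly GenerationEvidence[]): boolean {
    if (decided) return true;
    const ordered = [...history].sort((a, b) => a.generation - b.generation);
    for (const g of ordered) {
      if (g.generation <= lastConsumed) continue; // already counted: skip
      lastConsumed = g.generation;
      for (const x of g.xs) {
        if (observe(x)) {
          decided = true;
          return true; // stop the loop
        }
      }
    }
    return false; // an undecided process never stops the loop
  };
}
```

Calling `decide` twice with the same history feeds nothing new to the process, which is the no-double-count property; a stream that never decides simply keeps returning false.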

Honest caveats (in the module docs)

Replaces, never layers on, a fixed-n gate — repeatedly peeking at heldoutSignificance/paretoSignificanceGate on a growing sample and stopping early is optional stopping no matter how it is dressed up.
Exchangeability — decide(ctx) shuffles paired deltas with a seeded (mulberry32), data-independent permutation by default; stratified betting is named future work.
sequentialDecide shares the incumbent's once-measured scores across generations; type-I control is exact insofar as those approximate the incumbent's true means.
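For the exchangeability caveat, a seeded data-independent shuffle can be sketched as a Fisher-Yates pass driven by mulberry32. The generator below is the commonly circulated mulberry32 formulation; whether it matches the version exported from statistics.ts bit for bit is an assumption, and `seededShuffle` is a hypothetical helper name.

```typescript
// Commonly circulated mulberry32; the repo's exported version may differ in detail.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

/**
 * Fisher-Yates shuffle on a copy of `items`. The permutation depends only on
 * the seed and the array length, never on the values being shuffled.
 */
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = items.slice();
  const rng = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1)); // 0 <= j <= i
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}
```

Because the swap indices come from the seed alone, two delta arrays of the same length are permuted identically, so the ordering cannot be tuned to the data after the fact.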

Measured (deterministic, 200 seeded mulberry32 streams, in the test names)

Condition	Result
H0 (symmetric deltas), maxN=400, α=0.05	false-promote rate 0.025 (bound: 1.5×α = 0.075)
+0.2 mean effect, maxN=400	median stopping n=68 = 17% of the fixed-n budget; p90=88; 0/200 undecided

Tests

27 new tests (all deterministic, no LLM, no unseeded RNG): H0 false-promote rate, measured stopping savings, wealth-positivity, sticky decisions, minN stopping rule, predictability (permuted-future invariance + zero first bet), gate-contract conformance, pre-registration binding (tamper/conflict/direction/minEffect), decide-adapter latch + no-double-count.

pnpm typecheck && pnpm test && pnpm build all green (2071 passed) after rebasing onto main (post verdict-spine + fuzz merges); biome check src clean.

Notes

No version bump (release sequenced by the program lead).
Root src/index.ts additions are one contiguous block at EOF; gate exports live on the /campaign barrel with the other gates.
mulberry32 is now exported from statistics.ts (seed required) so shuffles and bootstrap resampling share one PRNG; makeRng delegates to it — no behavior change.

Betting test-martingale core (eProcess, Waudby-Smith & Ramdas) in statistics.ts: H0: E[x] <= nullMean on bounded observations, predictable truncated bets from prior observations only, decision at wealth >= 1/alpha (Ville's inequality), valid at any data-dependent stopping time.

sequentialPairedGate conforms to the existing Gate contract (pairHoldout pairing, full-cellId granularity) and adds a streaming observe(delta) entry so campaigns stop the moment evidence decides instead of burning a fixed-n budget. undecided-at-maxN maps to hold with an explicit not-evidence-of-no-effect reason, never a silent default.

Binds the pre-registration manifest mechanism: alpha/budget/direction/minEffect come from the signed manifest (content hash verified at construction); conflicting parameters throw. minEffect shifts the null boundary rather than clamping observations (clamping is mean-distorting for skewed deltas).

sequentialDecide adapts the same e-process as an ImprovementDriver.decide impl: stops the optimization loop once per-scenario evidence vs the generation-0 incumbent decides; an undecided process never stops the loop.

Measured on 200 seeded streams (mulberry32, maxN=400, alpha=0.05): false-promote rate 0.025 under H0; median stopping n=68 (17% of the fixed-n budget) under a +0.2 mean effect.

tangletools

✅ Auto-approved PR — `6493e1e5`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:50:55Z}

tangletools

✅ Auto-approved PR — `4ddbb20e`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:55:29Z}

tangletools

✅ Auto-approved PR — `4631c71a`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:58:38Z}

tangletools previously approved these changes Jun 10, 2026

View reviewed changes

merge main into branch (frontier merge sequencing)

4ddbb20

drewstone dismissed tangletools’s stale review via 4ddbb20 June 10, 2026 10:55

tangletools previously approved these changes Jun 10, 2026

View reviewed changes

merge main into branch (frontier merge sequencing)

4631c71

drewstone dismissed tangletools’s stale review via 4631c71 June 10, 2026 10:58

tangletools approved these changes Jun 10, 2026

View reviewed changes

drewstone merged commit 30e6987 into main Jun 10, 2026
1 check passed

drewstone merged 3 commits into main from feat/sequential-gates
