feat(core): aggressive history clipping for long task runs by lmorchard · Pull Request #487 · mozilla/pilo

lmorchard · 2026-05-30T01:15:09Z

Closes #437.

Summary

- Bound the growth of WebAgent.messages across long task runs by extending trimOldHistory (renamed from truncateOldExternalContent) with four new passes on top of the existing <EXTERNAL-CONTENT> clipping: clip old assistant tool-call inputs, clip paired tool-result outputs, drop fully-clipped stubs, and aggregate older feedback messages.
- Add a SYSTEM_DEBUG_HISTORY_SIZE event, emitted once per iteration, carrying a crude ~chars/4 token estimate and the message count for telemetry.
- Add sentinel-prefix tagging ([VALIDATION-FEEDBACK], [STEP-ERROR-FEEDBACK], [REPEATED-ACTION-WARNING]) at the three feedback insertion sites so that pass 4's aggregation can identify these messages by an anchored prefix match. The system prompt is updated to explain the markers and the [N earlier feedback messages clipped: ...] placeholder.

Design notes

- Boundary: count backward through assistant-role messages and find the index of the 5th-most-recent assistant message (HISTORY_CLIP_KEEP_LAST = 5); anything older is eligible for clipping. messages[0] (system) and messages[1] (task+plan) are unconditionally protected.
- Pairing preservation: pass 2 clips assistant tool-call inputs and records the toolCallIds. Pass 3 clips matching tool-result outputs by toolCallId. Pass 3.5 then drops fully-clipped pairs together: the clipped assistant message and its clipped tool-result are always both before the boundary in normal flow, so they disappear as a unit. No orphans. Verified by audit and an explicit no-orphans assertion in tests.
- Deviation from original design: the spec initially forbade message deletion as a conservative belt-and-suspenders measure for AI-SDK pairing safety. The plateau test (50 iterations, ratio < 1.25 for both tokens and message count) revealed that clipping alone leaves ~150 stub messages over a long run, failing the "history size plateaus" acceptance criterion. Pass 3.5 (drop fully-clipped stubs) satisfies the underlying invariant (pairing preserved by structural guarantee) by a stronger mechanism than the original rule. The spec in docs/dev-sessions/ documents the reframing.
- Scope: parts A, B, C, and D of issue #437 (Aggressive history clipping for long task runs). Part E (hard cap on history age) is deferred because it overlaps with #441 (LLM-based history compaction for long tasks).
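The boundary rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the Msg shape is simplified, and findClipBoundary and eligibleIndices are hypothetical helper names. Only HISTORY_CLIP_KEEP_LAST = 5 and the protection of messages[0] and messages[1] come from the description.

```typescript
// Simplified message shape for illustration; not the real AI SDK types.
type Role = "system" | "user" | "assistant" | "tool";
interface Msg {
  role: Role;
  content: string;
}

const HISTORY_CLIP_KEEP_LAST = 5; // value stated in the PR description

// Hypothetical helper: index of the Nth-most-recent assistant message,
// or -1 when there are fewer than N assistant messages.
function findClipBoundary(messages: Msg[], keepLast = HISTORY_CLIP_KEEP_LAST): number {
  let seen = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i]?.role === "assistant" && ++seen === keepLast) return i;
  }
  return -1;
}

// Hypothetical helper: indices older than the boundary, skipping the two
// protected messages at index 0 (system) and index 1 (task+plan).
function eligibleIndices(messages: Msg[]): number[] {
  const boundary = findClipBoundary(messages);
  const out: number[] = [];
  for (let i = 2; i < boundary; i++) out.push(i);
  return out;
}
```

When fewer than five assistant messages exist, the sketch returns -1 and nothing is eligible, so short runs are left untouched.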

Test plan

- pnpm --filter pilo-core run test: 753 pass / 34 files / 0 fail
- pnpm --filter pilo-core run typecheck: clean
- All four packages typecheck via pnpm -r run typecheck
- Prettier check on packages/: clean
- 50-iteration plateau test (history size plateaus rather than growing linearly): lateTokens/earlyTokens ≈ 1.08, lateCount/earlyCount ≈ 1.15
- No-orphans assertion in the "clips and drops paired tool-result outputs" test
- Regression test: messages[0..1] are never touched
- Eval-judge run on a sample task to confirm no behavioral regression (recommended before merge)
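For readers unfamiliar with the plateau criterion, here is one way such a late/early ratio could be computed. The window size, the early-window start, and the plateauRatio helper are all assumptions made for illustration; the PR does not say how the test defines its early and late samples.

```typescript
// Arithmetic mean of a non-empty list of samples.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Hypothetical helper: ratio of the mean over the last `window` samples to
// the mean over `window` samples starting at `earlyStart`. A value near 1
// suggests a plateau; a value well above 1 suggests continued growth.
function plateauRatio(samples: number[], window = 10, earlyStart = 15): number {
  const early = samples.slice(earlyStart, earlyStart + window);
  const late = samples.slice(-window);
  return mean(late) / mean(early);
}

// Synthetic per-iteration history sizes over 50 iterations.
const iterations = Array.from({ length: 50 }, (_, i) => i);
const bounded = iterations.map((i) => 500 + 100 * Math.min(i, 10)); // saturates
const linear = iterations.map((i) => 500 + 100 * i); // keeps growing

const boundedRatio = plateauRatio(bounded);
const linearRatio = plateauRatio(linear);
```

With these synthetic series the bounded history passes a ratio < 1.25 check while the linearly growing one does not.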

🤖 Generated with Claude Code

Add STEP_ERROR_FEEDBACK_PREFIX, VALIDATION_FEEDBACK_PREFIX, and REPETITION_WARNING_PREFIX constants (in historyPrefixes.ts, re-exported from webAgent.ts) and prepend them to the three feedback insertion sites so pass 4 of trimOldHistory can identify and aggregate these messages. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
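A minimal sketch of what sentinel tagging with an anchored match might look like. The constant names and the three marker strings appear in this PR; the pairing of each name with each string, the FeedbackKind labels, and the tagFeedback and feedbackKindOf helpers are assumptions for illustration.

```typescript
// Constant names are from the commit message; which string each one holds
// is an assumption based on the marker names in the PR summary.
const VALIDATION_FEEDBACK_PREFIX = "[VALIDATION-FEEDBACK]";
const STEP_ERROR_FEEDBACK_PREFIX = "[STEP-ERROR-FEEDBACK]";
const REPETITION_WARNING_PREFIX = "[REPEATED-ACTION-WARNING]";

type FeedbackKind = "validation" | "step-error" | "repetition";

const PREFIX_BY_KIND: Record<FeedbackKind, string> = {
  validation: VALIDATION_FEEDBACK_PREFIX,
  "step-error": STEP_ERROR_FEEDBACK_PREFIX,
  repetition: REPETITION_WARNING_PREFIX,
};

// Hypothetical helper: prepend the sentinel at a feedback insertion site.
function tagFeedback(kind: FeedbackKind, text: string): string {
  return `${PREFIX_BY_KIND[kind]} ${text}`;
}

// Hypothetical helper: anchored match, so a marker that merely appears
// somewhere inside the text (for example in quoted page content) is ignored.
function feedbackKindOf(text: string): FeedbackKind | undefined {
  const kinds = Object.keys(PREFIX_BY_KIND) as FeedbackKind[];
  return kinds.find((kind) => text.startsWith(PREFIX_BY_KIND[kind]));
}
```

Anchoring at position 0 is what keeps untrusted page text from being mistaken for agent feedback in this sketch.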


Add pass 2 to trimOldHistory: computes a keep-last-5 boundary over assistant messages and replaces tool-call inputs older than that boundary with { clipped: true }, stripping stray reasoning text parts from the same old messages. Adds the HISTORY_CLIP_KEEP_LAST constant (deferred from Task 2 to avoid noUnusedLocals). TDD: test written first, watched fail, then implemented. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
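Roughly, pass 2 as described could look like the sketch below. The message and part shapes are simplified stand-ins for the AI SDK types, clipOldToolCallInputs is a hypothetical name, and leaving text-only assistant messages untouched is an assumption of this sketch. It returns the clipped toolCallIds, which the following commit describes pass 3 consuming.

```typescript
// Simplified shapes for illustration; the real AI SDK message types are richer.
type Part =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; input: unknown };
type Msg =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: Part[] };

// Hypothetical helper: clip tool-call inputs in assistant messages older than
// `boundary`, never touching index 0 (system) or index 1 (task+plan).
function clipOldToolCallInputs(messages: Msg[], boundary: number): Set<string> {
  const clippedIds = new Set<string>();
  for (let i = 2; i < boundary; i++) {
    const msg = messages[i];
    if (!msg || msg.role !== "assistant") continue;
    const kept: Part[] = [];
    for (const part of msg.content) {
      if (part.type !== "tool-call") continue; // stray reasoning text is dropped
      clippedIds.add(part.toolCallId);
      kept.push({ ...part, input: { clipped: true } });
    }
    // Assumption: messages without any tool call are left as they are.
    if (kept.length > 0) msg.content = kept;
  }
  return clippedIds;
}
```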

Pass 2 now collects a Set of clipped toolCallIds; pass 3 uses it to replace the `output` of any matching tool-result message with `{ clipped: true }`, keeping tool-call/tool-result pairs in sync. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
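The pairing step might be sketched like this, assuming a simplified tool-result shape and a hypothetical clipPairedToolResults helper. The Set of ids stands in for the one collected by pass 2.

```typescript
// Simplified shapes for illustration; not the real AI SDK types.
interface ToolResultPart {
  type: "tool-result";
  toolCallId: string;
  toolName: string;
  output: unknown;
}
type Msg =
  | { role: "system" | "user" | "assistant"; content: unknown }
  | { role: "tool"; content: ToolResultPart[] };

// Hypothetical helper: replace the output of every tool-result whose
// toolCallId was clipped on the assistant side. Returns how many were clipped.
function clipPairedToolResults(messages: Msg[], clippedIds: ReadonlySet<string>): number {
  let clipped = 0;
  for (const msg of messages) {
    if (msg.role !== "tool") continue;
    for (const part of msg.content) {
      if (!clippedIds.has(part.toolCallId)) continue;
      part.output = { clipped: true };
      clipped++;
    }
  }
  return clipped;
}
```

Matching by toolCallId rather than by position means the two sides stay in sync even if other messages sit between a call and its result.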


Pass 4 collapses all-but-the-most-recent feedback message of each kind (validation, step-error, repetition) into a compact placeholder, keeping at most one full message per kind. Also removes unused HistorySizeDebugEventData import from events.test.ts that was causing a TS6133 typecheck error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
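One possible shape for this aggregation is sketched below. Several details are assumptions: that feedback messages are user-role strings, that a single placeholder is inserted where the first collapsed message used to be, and the text after the colon in the placeholder (the PR only shows it as "..."). aggregateOldFeedback and kindOf are hypothetical names.

```typescript
// Simplified message shape for illustration.
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const PREFIXES = {
  validation: "[VALIDATION-FEEDBACK]",
  "step-error": "[STEP-ERROR-FEEDBACK]",
  repetition: "[REPEATED-ACTION-WARNING]",
} as const;
type Kind = keyof typeof PREFIXES;

// Anchored-prefix classification; assumes feedback arrives as user messages.
function kindOf(msg: Msg): Kind | undefined {
  if (msg.role !== "user") return undefined;
  return (Object.keys(PREFIXES) as Kind[]).find((k) => msg.content.startsWith(PREFIXES[k]));
}

// Hypothetical helper: keep only the newest message of each kind and replace
// the older ones with one placeholder at the first collapsed position.
function aggregateOldFeedback(messages: Msg[]): Msg[] {
  const newest = new Map<Kind, number>();
  messages.forEach((m, i) => {
    const k = kindOf(m);
    if (k) newest.set(k, i);
  });
  const counts = new Map<Kind, number>();
  const out: Msg[] = [];
  let placeholderAt = -1;
  messages.forEach((m, i) => {
    const k = kindOf(m);
    if (k && newest.get(k) !== i) {
      counts.set(k, (counts.get(k) ?? 0) + 1);
      if (placeholderAt < 0) placeholderAt = out.length;
      return; // collapsed into the placeholder
    }
    out.push(m);
  });
  if (placeholderAt >= 0) {
    const total = Array.from(counts.values()).reduce((a, b) => a + b, 0);
    // The per-kind breakdown is an invented format, not the PR's wording.
    const detail = Array.from(counts.entries()).map(([k, n]) => `${n} ${k}`).join(", ");
    const content = `[${total} earlier feedback messages clipped: ${detail}]`;
    out.splice(placeholderAt, 0, { role: "user", content });
  }
  return out;
}
```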

Add estimateHistoryTokens() helper and emit SYSTEM_DEBUG_HISTORY_SIZE at the end of addPageSnapshot() after every snapshot push. Token count is a crude chars/4 estimate (images = 1000 tokens each) intended for trend analysis, not billing accuracy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
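As an illustration of the estimate described here, a chars/4 heuristic with a flat per-image cost could be written as follows. The function name and the two constants' values come from the commit message; the message shape and the choice to count non-text parts by their JSON length are assumptions of this sketch.

```typescript
// Simplified shapes for illustration; not the real AI SDK types.
interface Part {
  type: string;
  text?: string;
  [key: string]: unknown;
}
interface Msg {
  role: string;
  content: string | Part[];
}

const CHARS_PER_TOKEN = 4; // crude ratio from the commit message
const TOKENS_PER_IMAGE = 1000; // flat image cost from the commit message

function estimateHistoryTokens(messages: Msg[]): number {
  let chars = 0;
  let images = 0;
  for (const msg of messages) {
    if (typeof msg.content === "string") {
      chars += msg.content.length;
      continue;
    }
    for (const part of msg.content) {
      if (part.type === "image") images++;
      else if (part.type === "text" && typeof part.text === "string") chars += part.text.length;
      else chars += JSON.stringify(part).length; // assumption: tool parts by JSON size
    }
  }
  return Math.ceil(chars / CHARS_PER_TOKEN) + images * TOKENS_PER_IMAGE;
}
```

Because the number is only meant for trend analysis, a fixed ratio is enough to show whether history plateaus or keeps growing across iterations.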

Adds a 50-iteration plateau acceptance test that verifies history token count and message count plateau (ratio < 1.25) rather than growing linearly across a long agent run. The test revealed that fully-clipped messages before the boundary were being retained as tiny stubs, causing linear growth at ~28 tokens/iter. Adds pass 3.5 to trimOldHistory: drop fully-clipped messages before the boundary (clipped snapshots, clipped tool-call + tool-result pairs) so the history size stays bounded at O(KEEP_LAST) iterations. Updates two earlier tests that checked for clipped stubs to reflect the new drop-not-retain semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
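The drop step and a no-orphans check might be sketched as below. The shapes are simplified, the helper names are hypothetical, and clipped snapshots (which the commit says are also dropped) are left out to keep the example short.

```typescript
// Simplified shapes for illustration; not the real AI SDK types.
type Part =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; input: unknown }
  | { type: "tool-result"; toolCallId: string; output: unknown };
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string | Part[];
}

function isClippedMarker(v: unknown): boolean {
  return typeof v === "object" && v !== null && (v as { clipped?: unknown }).clipped === true;
}

// A message is a stub when every part is a clipped tool-call or tool-result.
function isFullyClipped(msg: Msg): boolean {
  if (typeof msg.content === "string" || msg.content.length === 0) return false;
  return msg.content.every(
    (p) =>
      (p.type === "tool-call" && isClippedMarker(p.input)) ||
      (p.type === "tool-result" && isClippedMarker(p.output)),
  );
}

// Hypothetical helper: drop stubs before the boundary, keeping indices 0 and 1.
function dropClippedStubs(messages: Msg[], boundary: number): Msg[] {
  return messages.filter((m, i) => i < 2 || i >= boundary || !isFullyClipped(m));
}

// Hypothetical check: toolCallIds present on only one side of a pair.
function orphanIds(messages: Msg[]): string[] {
  const calls = new Set<string>();
  const results = new Set<string>();
  for (const m of messages) {
    if (typeof m.content === "string") continue;
    for (const p of m.content) {
      if (p.type === "tool-call") calls.add(p.toolCallId);
      if (p.type === "tool-result") results.add(p.toolCallId);
    }
  }
  const all = Array.from(calls).concat(Array.from(results));
  return all.filter((id) => !(calls.has(id) && results.has(id)));
}
```

Since a clipped call and its clipped result are both before the boundary, the filter removes them together and the orphan check stays empty.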

Add a short **Feedback markers** section to the action-loop system prompt that names the three sentinel prefixes ([VALIDATION-FEEDBACK], [STEP-ERROR-FEEDBACK], [REPEATED-ACTION-WARNING]), describes when each appears, and explains the placeholder-aggregation behaviour for older clipped messages. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


lmorchard and others added 15 commits May 29, 2026 14:30

feat(core): add SYSTEM_DEBUG_HISTORY_SIZE event scaffolding

f1a4d48

test(core): include SYSTEM_DEBUG_HISTORY_SIZE in exhaustive event-list test

b28c3e3

refactor(core): rename truncateOldExternalContent to trimOldHistory

b30d764

refactor(core): update stale comment at trimOldHistory call site

55f0533

docs(core): clarify text-part filter comment in trimOldHistory pass 2

679ef39

test(core): strengthen orphan-check assertion in trimOldHistory pass 3 test

d578c45

test(core): regression test for system+task message protection

381fb33

chore(schemas): regenerate webagent-event.json for SYSTEM_DEBUG_HISTORY_SIZE event

85d849a

lmorchard wants to merge 15 commits into main from feat/aggressive-history-clipping