feat(core): aggressive history clipping for long task runs #487

Draft
lmorchard wants to merge 15 commits into main from feat/aggressive-history-clipping

@lmorchard
Collaborator

Closes #437.

Summary

  • Bound the growth of WebAgent.messages across long task runs by extending trimOldHistory (renamed from truncateOldExternalContent) with four new passes on top of the existing <EXTERNAL-CONTENT> clipping: clip old assistant tool-call inputs, clip paired tool-result outputs, drop fully-clipped stubs, and aggregate older feedback messages.
  • Add a SYSTEM_DEBUG_HISTORY_SIZE event, emitted once per iteration, carrying a crude ~chars/4 token estimate and the message count for telemetry.
  • Add sentinel-prefix tagging ([VALIDATION-FEEDBACK], [STEP-ERROR-FEEDBACK], [REPEATED-ACTION-WARNING]) at the three feedback insertion sites so that pass 4's aggregation can identify these messages by an anchored prefix match. The system prompt is updated to explain the markers and the [N earlier feedback messages clipped: ...] placeholder.
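
The sentinel tagging and pass 4 aggregation summarized above could look roughly like the sketch below. The three prefix strings and their constant names come from this PR; the `Msg` shape, the `aggregateFeedbackSketch` helper, where the placeholder is inserted, and the text after the colon in the placeholder are assumptions made for illustration and are not taken from the branch.

```typescript
// Hypothetical, simplified message shape; the real WebAgent messages are richer.
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Prefix strings as named in the PR description.
const VALIDATION_FEEDBACK_PREFIX = "[VALIDATION-FEEDBACK]";
const STEP_ERROR_FEEDBACK_PREFIX = "[STEP-ERROR-FEEDBACK]";
const REPETITION_WARNING_PREFIX = "[REPEATED-ACTION-WARNING]";
const FEEDBACK_PREFIXES = [
  VALIDATION_FEEDBACK_PREFIX,
  STEP_ERROR_FEEDBACK_PREFIX,
  REPETITION_WARNING_PREFIX,
];

// Anchored match: the sentinel must be at the very start of the content.
function feedbackKind(msg: Msg): string | undefined {
  return FEEDBACK_PREFIXES.find((prefix) => msg.content.startsWith(prefix));
}

// Keep only the most recent feedback message of each kind and replace the
// older ones with a single placeholder (assumed here to sit where the first
// dropped message was).
function aggregateFeedbackSketch(messages: Msg[]): Msg[] {
  const lastIndexOfKind = new Map<string, number>();
  messages.forEach((m, i) => {
    const kind = feedbackKind(m);
    if (kind !== undefined) lastIndexOfKind.set(kind, i);
  });

  const droppedPerKind = new Map<string, number>();
  const kept: Msg[] = [];
  let placeholderPos = -1;
  messages.forEach((m, i) => {
    const kind = feedbackKind(m);
    if (kind !== undefined && lastIndexOfKind.get(kind) !== i) {
      droppedPerKind.set(kind, (droppedPerKind.get(kind) ?? 0) + 1);
      if (placeholderPos === -1) placeholderPos = kept.length;
      return; // older feedback of this kind: folded into the placeholder
    }
    kept.push(m);
  });
  if (placeholderPos === -1) return kept;

  const total = Array.from(droppedPerKind.values()).reduce((a, b) => a + b, 0);
  // The detail after the colon is an assumed format, not the PR's wording.
  const detail = Array.from(droppedPerKind.entries())
    .map(([kind, n]) => `${n} ${kind}`)
    .join(", ");
  kept.splice(placeholderPos, 0, {
    role: "user",
    content: `[${total} earlier feedback messages clipped: ${detail}]`,
  });
  return kept;
}
```

Because the placeholder itself does not start with a sentinel prefix, a prefix-anchored check like `feedbackKind` will not mistake it for a feedback message on a later pass.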

Design notes

  • Boundary: count backward through assistant-role messages, find the index of the 5th-most-recent assistant message (HISTORY_CLIP_KEEP_LAST = 5); anything older is eligible for clipping. messages[0] (system) and messages[1] (task+plan) are unconditionally protected.
  • Pairing preservation: pass 2 clips assistant tool-call inputs and records the toolCallIds. Pass 3 clips matching tool-result outputs by toolCallId. Pass 3.5 then drops fully-clipped pairs together: in normal flow the clipped assistant message and its clipped tool-result are both before the boundary, so they disappear as a unit. No orphans. Verified by audit plus an explicit no-orphans assertion in tests.
  • Deviation from original design: the spec initially forbade message deletion as a conservative belt-and-suspenders measure for AI-SDK pairing safety. The plateau test (50 iterations, ratio < 1.25 for both tokens and message count) revealed that clipping alone leaves ~150 stub messages over a long run, failing the "history size plateaus" acceptance criterion. Pass 3.5 (drop fully-clipped stubs) satisfies the underlying invariant (pairing preserved by structural guarantee) by a stronger mechanism than the original rule. The spec in docs/dev-sessions/ documents the reframing.
  • Scope: issue #437 (Aggressive history clipping for long task runs), parts A, B, C, D. Part E (hard cap on history age) is deferred because it overlaps with #441 (LLM-based history compaction for long tasks, i.e. LLM-based summarization).
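
As a rough illustration of the boundary rule and the clip-then-drop pairing described in these notes, here is a minimal sketch. `HISTORY_CLIP_KEEP_LAST`, the `{ clipped: true }` marker, and the protection of messages[0] and messages[1] come from this PR; the simplified `Message` union, `PROTECTED_PREFIX`, and every function name are hypothetical, and the real trimOldHistory works on AI-SDK message parts rather than this flattened shape.

```typescript
// Hypothetical flattened message shape, for illustration only.
type ToolCall = { toolCallId: string; input: unknown };
type Message =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; text?: string; toolCalls: ToolCall[] }
  | { role: "tool"; toolCallId: string; output: unknown };

const HISTORY_CLIP_KEEP_LAST = 5;
const PROTECTED_PREFIX = 2; // messages[0] (system) and messages[1] (task+plan)
const CLIPPED = { clipped: true };

function isClipped(value: unknown): boolean {
  return typeof value === "object" && value !== null &&
    (value as { clipped?: unknown }).clipped === true;
}

// Index of the keepLast-th most recent assistant message. Messages with an
// index in [PROTECTED_PREFIX, boundary) are eligible for clipping.
function findClipBoundary(messages: Message[], keepLast: number = HISTORY_CLIP_KEEP_LAST): number {
  let seen = 0;
  for (let i = messages.length - 1; i >= PROTECTED_PREFIX; i--) {
    if (messages[i]?.role === "assistant" && ++seen === keepLast) return i;
  }
  return PROTECTED_PREFIX; // fewer than keepLast assistant messages: empty range
}

// Passes 2 and 3: clip old tool-call inputs, then the matching outputs.
function clipPairs(messages: Message[], boundary: number): Message[] {
  const clippedIds = new Set<string>();
  const afterPass2 = messages.map((m, i): Message => {
    if (i < PROTECTED_PREFIX || i >= boundary) return m;
    if (m.role !== "assistant" || m.toolCalls.length === 0) return m;
    m.toolCalls.forEach((c) => clippedIds.add(c.toolCallId));
    // Reasoning text is dropped along with the inputs.
    return {
      role: "assistant",
      toolCalls: m.toolCalls.map((c) => ({ toolCallId: c.toolCallId, input: CLIPPED })),
    };
  });
  return afterPass2.map((m): Message =>
    m.role === "tool" && clippedIds.has(m.toolCallId) ? { ...m, output: CLIPPED } : m,
  );
}

// Pass 3.5: drop messages before the boundary that are now nothing but stubs.
function dropClippedStubs(messages: Message[], boundary: number): Message[] {
  return messages.filter((m, i) => {
    if (i < PROTECTED_PREFIX || i >= boundary) return true;
    if (m.role === "assistant") {
      return !(m.toolCalls.length > 0 && m.toolCalls.every((c) => isClipped(c.input)));
    }
    if (m.role === "tool") return !isClipped(m.output);
    return true;
  });
}

function trimOldHistorySketch(messages: Message[]): Message[] {
  const boundary = findClipBoundary(messages);
  return dropClippedStubs(clipPairs(messages, boundary), boundary);
}
```

In this sketch a tool-result is only ever clipped when its toolCallId was recorded from a clipped assistant message, so the drop step removes both halves of a pair or neither, provided both sit before the boundary.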

Test plan

  • pnpm --filter pilo-core run test — 753 pass / 34 files / 0 fail
  • pnpm --filter pilo-core run typecheck — clean
  • All four packages typecheck via pnpm -r run typecheck
  • Prettier check on packages/ — clean
  • 50-iteration plateau test (history size plateaus rather than growing linearly) — lateTokens/earlyTokens ≈ 1.08, lateCount/earlyCount ≈ 1.15
  • No-orphans assertion in the "clips and drops paired tool-result outputs" test
  • "messages[0..1] never touched" regression test
  • Eval-judge run on a sample task to confirm no behavioral regression (recommended before merge)
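
For readers unfamiliar with the plateau criterion, the sketch below shows one way the size estimate and the late/early ratio could be computed. The chars/4 rule, the flat 1000 tokens per image, and the 1.25 threshold come from this PR; the `HistoryMessage` shape, the rounding, the function names, and the choice of iterations 25 and 50 as the early and late sample points are assumptions, since the PR does not say how the test picks them.

```typescript
// Hypothetical content shape: plain text, or a list of text and image parts.
type Part = { type: "text"; text: string } | { type: "image" };
type HistoryMessage = { role: string; content: string | Part[] };

const CHARS_PER_TOKEN = 4;     // crude estimate, for trend analysis only
const TOKENS_PER_IMAGE = 1000; // flat per-image cost

function estimateHistoryTokensSketch(messages: HistoryMessage[]): number {
  let chars = 0;
  let images = 0;
  for (const m of messages) {
    if (typeof m.content === "string") {
      chars += m.content.length;
      continue;
    }
    for (const part of m.content) {
      if (part.type === "text") chars += part.text.length;
      else images += 1;
    }
  }
  // Rounding up is an assumption; the PR only says "chars/4".
  return Math.ceil(chars / CHARS_PER_TOKEN) + images * TOKENS_PER_IMAGE;
}

// Ratio of a late sample to an early one; iterations are 1-based.
function plateauRatio(samples: number[], earlyIteration: number, lateIteration: number): number {
  const early = samples[earlyIteration - 1];
  const late = samples[lateIteration - 1];
  if (early === undefined || late === undefined || early === 0) {
    throw new RangeError("sample points must exist and the early sample must be non-zero");
  }
  return late / early;
}

// Synthetic example: a history that stops growing once a 5-iteration window fills.
const bounded = Array.from({ length: 50 }, (_, i) => 200 + 100 * Math.min(i + 1, 5));
console.log(plateauRatio(bounded, 25, 50) < 1.25); // true
```

A history that grows linearly would roughly double between iterations 25 and 50, which is what the 1.25 threshold is meant to catch.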

🤖 Generated with Claude Code

lmorchard and others added 15 commits May 29, 2026 14:30
Add STEP_ERROR_FEEDBACK_PREFIX, VALIDATION_FEEDBACK_PREFIX, and
REPETITION_WARNING_PREFIX constants (in historyPrefixes.ts, re-exported
from webAgent.ts) and prepend them to the three feedback insertion sites
so pass 4 of trimOldHistory can identify and aggregate these messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add pass 2 to trimOldHistory: computes a keep-last-5 boundary over
assistant messages and replaces tool-call inputs older than that
boundary with { clipped: true }, stripping stray reasoning text parts
from the same old messages. Adds the HISTORY_CLIP_KEEP_LAST constant
(deferred from Task 2 to avoid noUnusedLocals). TDD: test written
first, watched fail, then implemented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pass 2 now collects a Set of clipped toolCallIds; pass 3 uses it to
replace the `output` of any matching tool-result message with
`{ clipped: true }`, keeping tool-call/tool-result pairs in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pass 4 collapses all-but-the-most-recent feedback message of each kind
(validation, step-error, repetition) into a compact placeholder, keeping
at most one full message per kind. Also removes unused HistorySizeDebugEventData
import from events.test.ts that was causing a TS6133 typecheck error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add estimateHistoryTokens() helper and emit SYSTEM_DEBUG_HISTORY_SIZE
at the end of addPageSnapshot() after every snapshot push. Token count
is a crude chars/4 estimate (images = 1000 tokens each) intended for
trend analysis, not billing accuracy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a 50-iteration plateau acceptance test that verifies history token
count and message count plateau (ratio < 1.25) rather than growing
linearly across a long agent run.

The test revealed that fully-clipped messages before the boundary were
being retained as tiny stubs, causing linear growth at ~28 tokens/iter.
Adds pass 3.5 to trimOldHistory: drop fully-clipped messages before the
boundary (clipped snapshots, clipped tool-call + tool-result pairs) so
the history size stays bounded at O(KEEP_LAST) iterations.

Updates two earlier tests that checked for clipped stubs to reflect the
new drop-not-retain semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a short **Feedback markers** section to the action-loop system
prompt that names the three sentinel prefixes ([VALIDATION-FEEDBACK],
[STEP-ERROR-FEEDBACK], [REPEATED-ACTION-WARNING]), describes when each
appears, and explains the placeholder-aggregation behaviour for older
clipped messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>