feat(core): aggressive history clipping for long task runs #487
Draft
lmorchard wants to merge 15 commits into
Add STEP_ERROR_FEEDBACK_PREFIX, VALIDATION_FEEDBACK_PREFIX, and REPETITION_WARNING_PREFIX constants (in historyPrefixes.ts, re-exported from webAgent.ts) and prepend them to the three feedback insertion sites so pass 4 of trimOldHistory can identify and aggregate these messages.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add pass 2 to trimOldHistory: computes a keep-last-5 boundary over
assistant messages and replaces tool-call inputs older than that
boundary with { clipped: true }, stripping stray reasoning text parts
from the same old messages. Adds the HISTORY_CLIP_KEEP_LAST constant
(deferred from Task 2 to avoid noUnusedLocals). TDD: test written
first, watched fail, then implemented.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pass 2 now collects a Set of clipped toolCallIds; pass 3 uses it to
replace the `output` of any matching tool-result message with
`{ clipped: true }`, keeping tool-call/tool-result pairs in sync.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
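For reviewers, the pass 2 and pass 3 pairing described in the two commits above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Part` and `Message` shapes, `findClipBoundary`, and `clipOldToolTraffic` are hypothetical names, and only `HISTORY_CLIP_KEEP_LAST` and the `{ clipped: true }` stub come from the commit messages.

```typescript
// Hypothetical, simplified message shapes; the real types in pilo-core may differ.
type Part =
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool-call"; toolCallId: string; input: unknown }
  | { type: "tool-result"; toolCallId: string; output: unknown };

interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: Part[];
}

const HISTORY_CLIP_KEEP_LAST = 5;

// Index of the oldest assistant message that stays intact.
// Everything before this index counts as "old".
function findClipBoundary(messages: Message[], keepLast = HISTORY_CLIP_KEEP_LAST): number {
  let seen = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i]?.role === "assistant" && ++seen === keepLast) return i;
  }
  return 0; // fewer than keepLast assistant messages: nothing is old
}

function clipOldToolTraffic(messages: Message[]): Message[] {
  const boundary = findClipBoundary(messages);
  const clippedIds = new Set<string>();

  // Pass 2: in old assistant messages, drop reasoning parts and
  // replace tool-call inputs with a stub, remembering each toolCallId.
  const afterPass2 = messages.map((msg, i): Message => {
    if (i >= boundary || msg.role !== "assistant") return msg;
    const content = msg.content
      .filter((p) => p.type !== "reasoning")
      .map((p): Part => {
        if (p.type !== "tool-call") return p;
        clippedIds.add(p.toolCallId);
        return { ...p, input: { clipped: true } };
      });
    return { ...msg, content };
  });

  // Pass 3: clip the output of every tool-result whose call was clipped in pass 2.
  return afterPass2.map((msg): Message => {
    if (msg.role !== "tool") return msg;
    const content = msg.content.map((p): Part =>
      p.type === "tool-result" && clippedIds.has(p.toolCallId)
        ? { ...p, output: { clipped: true } }
        : p,
    );
    return { ...msg, content };
  });
}
```

Because pass 3 keys off the ids collected in pass 2 rather than recomputing the boundary, a tool-result can only be clipped when its tool-call was clipped too, which is what keeps the pairs in sync.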
Pass 4 collapses all but the most recent feedback message of each kind (validation, step-error, repetition) into a compact placeholder, keeping at most one full message per kind.
Also removes the unused HistorySizeDebugEventData import from events.test.ts that was causing a TS6133 typecheck error.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
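A rough sketch of the aggregation idea, for illustration only. The flat `TextMessage` shape, `feedbackKind`, and `aggregateOldFeedback` are hypothetical; the constant values are assumed to equal the sentinel strings named elsewhere in this PR; and the text after the colon in the placeholder is invented here, since the PR only shows it as `...`.

```typescript
// Assumed values: the PR names these constants and these sentinels,
// but the exact strings in historyPrefixes.ts are not shown.
const VALIDATION_FEEDBACK_PREFIX = "[VALIDATION-FEEDBACK]";
const STEP_ERROR_FEEDBACK_PREFIX = "[STEP-ERROR-FEEDBACK]";
const REPETITION_WARNING_PREFIX = "[REPEATED-ACTION-WARNING]";

const FEEDBACK_PREFIXES = [
  VALIDATION_FEEDBACK_PREFIX,
  STEP_ERROR_FEEDBACK_PREFIX,
  REPETITION_WARNING_PREFIX,
] as const;

// Hypothetical flat message shape, just enough for the sketch.
interface TextMessage {
  role: "system" | "user" | "assistant";
  text: string;
}

// Anchored match: the sentinel must be at the very start of the text.
function feedbackKind(msg: TextMessage): string | undefined {
  return FEEDBACK_PREFIXES.find((prefix) => msg.text.startsWith(prefix));
}

function aggregateOldFeedback(messages: TextMessage[]): TextMessage[] {
  const seenKinds = new Set<string>();
  const droppedPerKind = new Map<string, number>();
  const dropIndexes = new Set<number>();

  // Walk newest to oldest so the first hit per kind is the one kept in full.
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (!msg) continue;
    const kind = feedbackKind(msg);
    if (kind === undefined) continue;
    if (!seenKinds.has(kind)) {
      seenKinds.add(kind);
      continue;
    }
    dropIndexes.add(i);
    droppedPerKind.set(kind, (droppedPerKind.get(kind) ?? 0) + 1);
  }
  if (dropIndexes.size === 0) return messages;

  // Placeholder wording after the colon is an assumption of this sketch.
  const summary = Array.from(droppedPerKind.entries())
    .map(([kind, n]) => `${kind} x${n}`)
    .join(", ");
  const placeholder: TextMessage = {
    role: "user",
    text: `[${dropIndexes.size} earlier feedback messages clipped: ${summary}]`,
  };

  // Put the single placeholder where the oldest dropped message used to be.
  const oldest = Math.min(...Array.from(dropIndexes));
  const result: TextMessage[] = [];
  messages.forEach((msg, i) => {
    if (i === oldest) result.push(placeholder);
    if (!dropIndexes.has(i)) result.push(msg);
  });
  return result;
}
```

The placeholder does not start with any sentinel, so running the sketch again over its own output leaves the history unchanged.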
Add estimateHistoryTokens() helper and emit SYSTEM_DEBUG_HISTORY_SIZE at the end of addPageSnapshot() after every snapshot push. Token count is a crude chars/4 estimate (images = 1000 tokens each) intended for trend analysis, not billing accuracy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
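The estimate described above amounts to something like the following. This is a stand-in written for this note, not the helper from the commit: the `HistoryPart` and `HistoryMessage` shapes are hypothetical, and rounding up with `Math.ceil` is an assumption.

```typescript
// Hypothetical content shapes; the real message types may carry more part kinds.
type HistoryPart =
  | { type: "text"; text: string }
  | { type: "image"; image: string };

interface HistoryMessage {
  role: string;
  content: string | HistoryPart[];
}

const CHARS_PER_TOKEN = 4; // crude chars/4 heuristic from the commit message
const TOKENS_PER_IMAGE = 1000; // flat per-image cost from the commit message

function estimateHistoryTokens(messages: HistoryMessage[]): number {
  let chars = 0;
  let images = 0;
  for (const msg of messages) {
    if (typeof msg.content === "string") {
      chars += msg.content.length;
      continue;
    }
    for (const part of msg.content) {
      if (part.type === "image") images += 1;
      else chars += part.text.length;
    }
  }
  // Rounding up is a choice made for this sketch.
  return Math.ceil(chars / CHARS_PER_TOKEN) + images * TOKENS_PER_IMAGE;
}
```

An estimate this coarse is only useful for comparing one iteration against another within the same run, which matches the stated trend-analysis purpose.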
Adds a 50-iteration plateau acceptance test that verifies history token count and message count plateau (ratio < 1.25) rather than growing linearly across a long agent run.
The test revealed that fully-clipped messages before the boundary were being retained as tiny stubs, causing linear growth at ~28 tokens/iter.
Adds pass 3.5 to trimOldHistory: drop fully-clipped messages before the boundary (clipped snapshots, clipped tool-call + tool-result pairs) so the history size stays bounded at O(KEEP_LAST) iterations.
Updates two earlier tests that checked for clipped stubs to reflect the new drop-not-retain semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
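The drop step could look roughly like this. It is a sketch under assumptions: `Msg`, `ToolPart`, `isFullyClipped`, and `dropClippedStubs` are hypothetical names, clipped snapshots are left out for brevity, and the boundary index is passed in rather than computed.

```typescript
// Hypothetical shapes; only the { clipped: true } stub comes from the PR.
type ToolPart =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; input: unknown }
  | { type: "tool-result"; toolCallId: string; output: unknown };

interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: ToolPart[];
}

function isClippedStub(value: unknown): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as { clipped?: unknown }).clipped === true
  );
}

// A message is fully clipped when every part is a clipped tool-call or tool-result.
function isFullyClipped(msg: Msg): boolean {
  return (
    msg.content.length > 0 &&
    msg.content.every(
      (p) =>
        (p.type === "tool-call" && isClippedStub(p.input)) ||
        (p.type === "tool-result" && isClippedStub(p.output)),
    )
  );
}

// Drop fully-clipped messages before `boundary`.
// Indexes 0 and 1 (system, task+plan) are always kept.
function dropClippedStubs(messages: Msg[], boundary: number): Msg[] {
  return messages.filter(
    (msg, i) => i < 2 || i >= boundary || !isFullyClipped(msg),
  );
}
```

Dropping rather than retaining the stubs is what removes the per-iteration residue: once a pair falls behind the boundary it contributes nothing, so the retained history depends on the keep-last window and not on the run length.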
Add a short **Feedback markers** section to the action-loop system prompt that names the three sentinel prefixes ([VALIDATION-FEEDBACK], [STEP-ERROR-FEEDBACK], [REPEATED-ACTION-WARNING]), describes when each appears, and explains the placeholder-aggregation behaviour for older clipped messages.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #437.
Summary
- Bounds `WebAgent.messages` across long task runs by extending `trimOldHistory` (renamed from `truncateOldExternalContent`) with four new passes on top of the existing `<EXTERNAL-CONTENT>` clipping: clip old assistant tool-call inputs, clip paired tool-result outputs, drop fully-clipped stubs, and aggregate older feedback messages.
- `SYSTEM_DEBUG_HISTORY_SIZE` event emitted per iteration with a crude `~chars/4` token estimate and message count for telemetry.
- Sentinel prefixes (`[VALIDATION-FEEDBACK]`, `[STEP-ERROR-FEEDBACK]`, `[REPEATED-ACTION-WARNING]`) at the three feedback insertion sites so pass 4's aggregation can identify them with an anchored prefix match. System prompt updated to explain the markers and the `[N earlier feedback messages clipped: ...]` placeholder.

Design notes
- The clip boundary keeps the last 5 assistant messages (`HISTORY_CLIP_KEEP_LAST = 5`); anything older is eligible for clipping.
- `messages[0]` (system) and `messages[1]` (task+plan) are unconditionally protected.
- Pass 2 clips old tool-call inputs and collects their `toolCallId`s. Pass 3 clips matching tool-result outputs by `toolCallId`. Pass 3.5 then drops fully-clipped pairs together: a clipped assistant message and its clipped tool-result are always before the boundary in normal flow, so they disappear as a unit. No orphans. Verified by audit plus an explicit no-orphans assertion in tests.
- `docs/dev-sessions/` documents the reframing.

Test plan
- `pnpm --filter pilo-core run test`: 753 pass / 34 files / 0 fail
- `pnpm --filter pilo-core run typecheck`: clean
- `pnpm -r run typecheck` (`packages/`): clean
- Plateau acceptance test (`history size plateaus rather than growing linearly`): `lateTokens/earlyTokens ≈ 1.08`, `lateCount/earlyCount ≈ 1.15`
- `clips and drops paired tool-result outputs` test
- `messages[0..1] never touched` regression test

🤖 Generated with Claude Code