feat: LLM command-approval classifier (auto mode) #33586
Draft
thomaslwang wants to merge 1 commit into
Opt-in classifier that gates auto-approved tool calls, after Claude Code "auto mode". Off by default.

- Pluggable ClassifierProvider; default uses the user's configured model via the AI SDK (single-pass <block>yes/no).
- Hooks Permission.ask on the would-auto-approve path only: block -> deny-and-continue (ClassifierDeniedError, surfaces as a tool error, no halt); classifier error/escalation -> fail closed (human ask). Never overrides an explicit user deny/ask.
- Reasoning-blind transcript (user text + assistant tool calls only): prompt-injection + anti-rationalization defense.
- Safe-tool allowlist short-circuit; per-session denial counters (3-consecutive / 20-total escalation, reset each user turn).
- New `classifier` config block (backend/model/endpoint/apiKey + allow/soft_deny/environment policy slots, copy-then-edit).

Tests cover reasoning-blindness, verdict parsing (fail-closed), allowlist, and policy slots.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
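For reviewers, the gating order and the escalation counters described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `gate`, `DenialCounters`, `Outcome`, and the synchronous `classify` callback are hypothetical names. It also assumes that both counters reset on a user turn, that an allow verdict clears the consecutive count, and that the denial which reaches a threshold is the one that escalates.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not taken from the PR.
type RuleAction = "allow" | "ask" | "deny";
type Verdict = "allow" | "block";
type Outcome = "proceed" | "deny-and-continue" | "ask-human" | "deny";

const MAX_CONSECUTIVE = 3; // "3-consecutive" from the description
const MAX_TOTAL = 20; // "20-total" from the description

class DenialCounters {
  consecutive = 0;
  total = 0;

  // Assumption: both counters are cleared on each user turn.
  resetForUserTurn(): void {
    this.consecutive = 0;
    this.total = 0;
  }

  // Assumption: an allowed call breaks the consecutive-denial streak.
  recordAllow(): void {
    this.consecutive = 0;
  }

  recordDeny(): void {
    this.consecutive += 1;
    this.total += 1;
  }

  shouldEscalate(): boolean {
    return this.consecutive >= MAX_CONSECUTIVE || this.total >= MAX_TOTAL;
  }
}

// A real classifier call would be asynchronous; it is synchronous here to keep the sketch short.
function gate(
  rule: RuleAction,
  tool: string,
  safeTools: ReadonlySet<string>,
  counters: DenialCounters,
  classify: () => Verdict,
): Outcome {
  // Explicit user deny/ask is never overridden by the classifier.
  if (rule === "deny") return "deny";
  if (rule === "ask") return "ask-human";

  // Safe-tool allowlist short-circuits before any model call.
  if (safeTools.has(tool)) return "proceed";

  // Once a threshold has been reached, stay escalated until the next user turn.
  if (counters.shouldEscalate()) return "ask-human";

  let verdict: Verdict;
  try {
    verdict = classify();
  } catch {
    // Classifier error: fail closed to a human prompt.
    return "ask-human";
  }

  if (verdict === "allow") {
    counters.recordAllow();
    return "proceed";
  }

  counters.recordDeny();
  return counters.shouldEscalate() ? "ask-human" : "deny-and-continue";
}
```

Under these assumptions, two consecutive block verdicts each return `"deny-and-continue"`, and the third returns `"ask-human"` until `resetForUserTurn()` is called.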
Thanks for updating your PR! It now meets our contributing guidelines. 👍
Issue for this PR
Closes #33585
Type of change
What does this PR do?
Adds an opt-in "auto mode" classifier that gates the would-auto-approve path in `Permission.ask`. When a rule resolves to `allow`, a model is consulted first; it can allow (proceed silently), block (deny-and-continue: returns a `ClassifierDeniedError` that the agent sees as a tool error, with no halt), or fail closed to a human prompt on error/escalation. It never overrides an explicit user `deny`/`ask`. Off by default.
Why it's built this way:
- The gate hooks the `!needsAsk` decision in `Permission.ask`, so it covers every permissioned tool (bash, edit, webfetch, MCP, task, external-dir), not just bash. Read-only tools short-circuit before any model call.
- `session/tools.ts` passes the gate as a thunk run through the existing `EffectBridge` (`run.run`), which supplies the captured request context, so `Permission.ask`'s requirement set stays `never`.
- `og-local`/`og-saas` backends are present but fail closed until implemented.

Config is a new `classifier` block in `core/v1/config`.
How did you verify your code works?
`bun run typecheck` is clean, and `bun test test/classifier.test.ts` passes (11 tests covering: no assistant-prose leak into the transcript, unparseable verdict → fail closed, the safe-tool allowlist, and the copy-then-edit policy slots). The pre-push checks pass.
Screenshots / recordings
N/A: no UI changes.
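
As a rough illustration of two properties the tests are said to check (no assistant prose in the classifier transcript, and an unparseable verdict failing closed), here is a minimal sketch. It is not the PR's code: the `Message` shape, `buildTranscript`, `parseVerdict`, and the exact `<block>yes</block>` / `<block>no</block>` wire format are assumptions inferred from the "single-pass <block>yes/no" wording.

```typescript
// Hypothetical message shape; the PR's real types are not shown in this description.
type ToolCall = { tool: string; input: string };
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; reasoning?: string; toolCalls: ToolCall[] };

// Reasoning-blind transcript: keep user text and assistant tool calls only.
// Assistant prose and reasoning are dropped so the agent cannot argue its own case.
function buildTranscript(messages: Message[]): string[] {
  const lines: string[] = [];
  for (const m of messages) {
    if (m.role === "user") {
      lines.push(`user: ${m.text}`);
    } else {
      for (const call of m.toolCalls) {
        lines.push(`tool_call: ${call.tool} ${call.input}`);
      }
    }
  }
  return lines;
}

// Assumed verdict format: <block>yes</block> or <block>no</block>.
// Anything else fails closed to a human prompt instead of defaulting to allow.
function parseVerdict(raw: string): "allow" | "block" | "ask-human" {
  const match = /<block>\s*(yes|no)\s*<\/block>/i.exec(raw);
  if (!match) return "ask-human";
  return match[1]?.toLowerCase() === "yes" ? "block" : "allow";
}
```

The design point is that the fallback for a missing or malformed verdict is the human prompt, never a silent allow.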
Checklist