feat: LLM command-approval classifier (auto mode) #33586

Draft
thomaslwang wants to merge 1 commit into anomalyco:dev from openguardrails:feat/auto-mode-classifier

thomaslwang commented Jun 24, 2026

Issue for this PR

Closes #33585

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Adds an opt-in "auto mode" classifier that gates the would-auto-approve path in Permission.ask. When a rule resolves to allow, a model is consulted first, with three possible outcomes: allow (proceed silently); block (deny-and-continue: the call returns a ClassifierDeniedError that the agent sees as a tool error, and the session does not halt); or, on classifier error or escalation, fail closed to a human prompt. The classifier never overrides an explicit user deny/ask. Off by default.
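
For reviewers who want the outcomes in one place, here is a minimal decision-table sketch of the behavior described above. It is not taken from the diff: `resolveOutcome`, `RuleAction`, `ClassifierResult`, and `Outcome` are hypothetical names, and the real gate lives inside Permission.ask and throws ClassifierDeniedError rather than returning a string.

```typescript
// Hypothetical names; a decision-table sketch of the PR description, not the diff.
type RuleAction = "allow" | "ask" | "deny";
// What came back from the classifier: a verdict, or a failure/escalation.
type ClassifierResult = "allow" | "block" | "error" | "escalate";
type Outcome = "proceed" | "tool-error" | "ask-human" | "deny";

function resolveOutcome(
  rule: RuleAction,
  classifierEnabled: boolean,
  result: ClassifierResult,
): Outcome {
  // Explicit user rules are never overridden by the classifier.
  if (rule === "deny") return "deny";
  if (rule === "ask") return "ask-human";
  // Off by default: an allow rule behaves exactly as it did before.
  if (!classifierEnabled) return "proceed";
  switch (result) {
    case "allow":
      return "proceed"; // proceed silently
    case "block":
      return "tool-error"; // deny-and-continue, surfaced to the agent
    case "error":
    case "escalate":
      return "ask-human"; // fail closed to a human prompt
  }
}
```

The point of the table is that the classifier can only make an allow rule stricter; no classifier result ever loosens an ask or deny rule.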

Why it's built this way:

  • The gate sits at the single !needsAsk decision in Permission.ask, so it covers every permissioned tool (bash, edit, webfetch, MCP, task, external-dir), not just bash. Read-only tools short-circuit before any model call.
  • The classifier is fed a reasoning-blind transcript — user text + the bare tool-call payload only, no assistant prose and no prior tool output — so tool-sourced/injected content can't grant permission and the model can't be talked into a call by the agent's own narration.
  • session/tools.ts passes the gate as a thunk run through the existing EffectBridge (run.run), which supplies the captured request context, so Permission.ask's requirement set stays never.
  • Denials are counted per session (3 consecutive / 20 total → escalate to the human), reset each user turn, so a false positive can't loop forever.
  • The backend is pluggable; the default calls the user's configured model via the AI SDK. og-local/og-saas backends are present but fail closed until implemented.
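
As an illustration of the transcript and counter bullets above, here is a minimal sketch under stated assumptions. The `Message` shape, `buildTranscript`, and `createDenialCounter` are hypothetical and not taken from the diff. The sketch also assumes that an allowed call breaks the consecutive streak and that a new user turn clears both counters; the description only says the counters reset each user turn.

```typescript
// Hypothetical message shape; the real session types are not shown in the PR text.
type ToolCall = { tool: string; input: unknown };
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls: ToolCall[] }
  | { role: "tool"; output: string };

type TranscriptEntry =
  | { kind: "user"; text: string }
  | { kind: "tool-call"; tool: string; input: unknown };

// Keep user text and bare tool-call payloads; drop assistant prose and tool output.
function buildTranscript(messages: Message[]): TranscriptEntry[] {
  const entries: TranscriptEntry[] = [];
  for (const message of messages) {
    if (message.role === "user") {
      entries.push({ kind: "user", text: message.text });
    } else if (message.role === "assistant") {
      for (const call of message.toolCalls) {
        entries.push({ kind: "tool-call", tool: call.tool, input: call.input });
      }
    }
    // role === "tool": output is untrusted and never reaches the classifier.
  }
  return entries;
}

// Denial counter using the thresholds from the description (3 consecutive, 20 total).
function createDenialCounter(maxConsecutive = 3, maxTotal = 20) {
  let consecutive = 0;
  let total = 0;
  return {
    // Returns true when the next step should be a human prompt.
    recordDenial(): boolean {
      consecutive += 1;
      total += 1;
      return consecutive >= maxConsecutive || total >= maxTotal;
    },
    // Assumption: an allowed call breaks the consecutive streak.
    recordAllow(): void {
      consecutive = 0;
    },
    // Assumption: a new user turn clears both counters.
    resetForUserTurn(): void {
      consecutive = 0;
      total = 0;
    },
  };
}
```

Because assistant prose and tool output are dropped before the classifier sees anything, text injected through a fetched page or a file read has no channel into the approval decision.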

Config is a new classifier block in core/v1/config.
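
For orientation, a rough sketch of what the block might look like. Only the field names (backend, model, endpoint, apiKey, allow, soft_deny, environment) come from the commit message. The field types, the optionality, the `ClassifierConfig` name, and every value below are assumptions made for illustration, so check the schema in core/v1/config before relying on any of it.

```typescript
// Field names are from the commit message; types and values are assumptions.
interface ClassifierConfig {
  backend?: string; // the default model-backed provider, or "og-local" / "og-saas"
  model?: string;
  endpoint?: string;
  apiKey?: string;
  // Policy slots. Assumed here to be lists of plain-language rules.
  allow?: string[];
  soft_deny?: string[];
  environment?: string[];
}

// Hypothetical values throughout.
const exampleConfig: { classifier: ClassifierConfig } = {
  classifier: {
    model: "example-provider/example-model",
    allow: ["Read-only git commands inside the workspace"],
    soft_deny: ["Deleting files outside the project directory"],
    environment: ["Local development machine, no production credentials"],
  },
};
```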

How did you verify your code works?

bun run typecheck is clean, and bun test test/classifier.test.ts passes (11 tests covering: no assistant-prose leak into the transcript, unparseable verdict → fail closed, the safe-tool allowlist, and the copy-then-edit policy slots). The pre-push checks pass.
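
To make the "unparseable verdict → fail closed" case concrete, here is a minimal sketch of a fail-closed parser. It assumes the model answers with a single `<block>yes</block>` or `<block>no</block>` tag; the commit message only says "single-pass <block>yes/no", so the exact wire format and the `parseVerdict` name are assumptions.

```typescript
type ParsedVerdict = "allow" | "block" | "escalate";

// Assumed format: exactly one <block>yes</block> or <block>no</block> tag.
function parseVerdict(raw: string): ParsedVerdict {
  // Zero tags or several conflicting tags are both treated as unparseable.
  const tags = raw.match(/<block>\s*(?:yes|no)\s*<\/block>/gi) ?? [];
  if (tags.length !== 1) return "escalate";

  const answer = /<block>\s*(yes|no)\s*<\/block>/i.exec(raw)?.[1]?.toLowerCase();
  if (answer === "yes") return "block";
  if (answer === "no") return "allow";
  // Anything unexpected falls through to a human prompt, never to allow.
  return "escalate";
}
```

The property worth testing is that "allow" is only reachable through an explicit, unambiguous "no"; every other input ends at the human prompt.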

Screenshots / recordings

N/A — no UI changes.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Opt-in classifier that gates auto-approved tool calls, modeled on Claude
Code's "auto mode". Off by default.

- Pluggable ClassifierProvider; default uses the user's configured model
  via the AI SDK (single-pass <block>yes/no).
- Hooks Permission.ask on the would-auto-approve path only: block ->
  deny-and-continue (ClassifierDeniedError, surfaces as a tool error, no
  halt); classifier error/escalation -> fail closed (human ask). Never
  overrides an explicit user deny/ask.
- Reasoning-blind transcript (user text + assistant tool calls only):
  prompt-injection + anti-rationalization defense.
- Safe-tool allowlist short-circuit; per-session denial counters
  (3-consecutive / 20-total escalation, reset each user turn).
- New `classifier` config block (backend/model/endpoint/apiKey + allow/
  soft_deny/environment policy slots, copy-then-edit).

Tests cover reasoning-blindness, verdict parsing (fail-closed), allowlist,
and policy slots.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
github-actions bot added the needs:compliance label Jun 24, 2026
github-actions bot removed the needs:compliance label Jun 24, 2026
@github-actions

Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

Development

Successfully merging this pull request may close these issues.

[FEATURE]: LLM command-approval classifier ("auto mode") for permission gating
