Skip to content

fix(audit): stamp seq on tool_exec + session_end events#173

Merged
initializ-mk merged 1 commit into
mainfrom
fix/audit-tool-exec-seq
Jun 16, 2026
Merged

fix(audit): stamp seq on tool_exec + session_end events#173
initializ-mk merged 1 commit into
mainfrom
fix/audit-tool-exec-seq

Conversation

@initializ-mk

Copy link
Copy Markdown
Contributor

Symptom

A user on v0.15.0 reported `tool_exec` rows in the audit stream were missing the `seq` field even though they fired inside an A2A invocation. Their Slack-triggered task showed:

```
session_start seq=1 ✓
guardrail_check seq=2 ✓
llm_call seq=3 ✓
tool_exec (start) (none) ← BUG
tool_exec (end) (none) ← BUG
llm_call seq=4 ✓
session_end seq=5 ✓
invocation_complete seq=6 ✓
```

This breaks the FWS-8 guarantee that every event from a given `correlation_id` exposes a monotonically increasing `seq` consumers use for gap detection. `tool_exec` events also lose `trace_id` / `span_id` cross-link (Phase 4 / #105).

Root cause

`registerAuditHooks` registered the tool-exec hooks with `func(_ context.Context, ...)` — discarding the per-invocation ctx — and called `auditLogger.Emit(...)` instead of `EmitFromContext(ctx, ...)`. The per-invocation `SequenceCounter` lives on ctx; plain `Emit` can't see it.

Same pattern existed at one more site: the JSON-RPC `tasks/send` handler's outbound-guardrail-failure `session_end` branch at `runner.go:~1546` — the parallel success-path SSE branch right below it correctly used `EmitFromContext`, this one had drifted.

Fix

Three sites changed from `Emit(...)` to `EmitFromContext(ctx, ...)`:

Site Was Now
`registerAuditHooks` BeforeToolExec `func(_ context.Context, ...)` + `Emit` `func(ctxStart context.Context, ...)` + `EmitFromContext`
`registerAuditHooks` AfterToolExec `func(_ context.Context, ...)` + `Emit` `func(ctxEnd context.Context, ...)` + `EmitFromContext`
JSON-RPC tasks/send outbound-guardrail-failure session_end `Emit` `EmitFromContext(ctx, ...)`

Sites left alone (intentional, documented inline)

Site Why `Emit` is correct
`makeAuthAuditCallback` (auth_verify / auth_fail) Auth middleware runs before the runner installs the seq counter on ctx.
Egress proxy `OnAttempt` with `source: proxy` Subprocess HTTP CONNECT through the proxy has no Go ctx tying it back to the active A2A request.
MCP tool conflict at startup Pre-invocation, no scope.
Schedule dispatcher (`schedule_fire` / `schedule_complete`) Scheduler tick runs outside any A2A request scope.

The in-process egress enforcer's `OnAttempt` at `runner.go:~496` already uses `EmitFromContext` and stamps seq correctly.

Test plan

  • New regression test `TestToolExecAudit_CarriesSequenceFromContext` fires the registered hooks against a context carrying a `SequenceCounter`, parses the emitted NDJSON, and asserts `BeforeToolExec.seq=1` + `AfterToolExec.seq=2`. Pre-fix this test fails with `seq=0` on both events.
  • `go test -count=1 ./...` clean in forge-cli.
  • `golangci-lint run ./forge-cli/...` → 0 issues.
  • `gofmt -w` applied.

Schema impact

None. The events were already declared to carry `seq` — the field's `omitempty` JSON tag just dropped it when the value was zero. After the fix, consumers see the field they were already expecting; consumers ignoring unknown keys see no change.

Suggested follow-up (not in this PR)

The auth_verify / auth_fail seq gap is structurally fixable by moving the seq-counter installation earlier in the request pipeline (before the auth middleware). Worth filing as its own issue — same for adding a "audit emit must be EmitFromContext when inside an invocation scope" lint or linter pragma so this kind of drift gets caught at PR time.

Three audit-emit sites inside the per-A2A-invocation scope were
using plain AuditLogger.Emit() instead of EmitFromContext(ctx, ...),
which meant the per-invocation sequence counter installed on ctx at
request entry never reached those events. Consumers relying on FWS-8's
"every event from a given correlation_id exposes a monotonically
increasing seq" guarantee saw gaps in their gap-detection pipeline
exactly at tool_exec and the outbound-guardrail-failure session_end
branch.

Reported on v0.15.0: a Slack-triggered code-review task's audit
stream had session_start (seq=1), guardrail_check (seq=2), llm_call
(seq=3), then tool_exec start/end with NO seq, then llm_call (seq=4),
session_end (seq=5), invocation_complete (seq=6) — confirming the
counter advances but tool_exec emits skip it.

Three sites fixed:

  - registerAuditHooks BeforeToolExec (runner.go:~1990) — was
    `func(_ context.Context, ...)` then plain Emit; now takes
    ctxStart and calls EmitFromContext.
  - registerAuditHooks AfterToolExec (runner.go:~2016) — same.
  - JSON-RPC tasks/send handler outbound-guardrail-failure
    session_end (runner.go:~1546) — was plain Emit; now
    EmitFromContext(ctx, ...) matching the parallel success-path
    SSE branch right below it.

Sites left as plain Emit (intentional, documented inline):

  - makeAuthAuditCallback (auth_verify / auth_fail) — auth
    middleware runs BEFORE the runner installs the seq counter on
    ctx.
  - egress proxy OnAttempt with source=proxy — subprocess HTTP
    CONNECT has no Go ctx tying it back to the active A2A request.
  - MCP tool conflict at startup — pre-invocation, no scope.
  - Schedule dispatcher (schedule_fire / schedule_complete) —
    scheduler tick runs outside any A2A request scope.

Regression test: TestToolExecAudit_CarriesSequenceFromContext fires
the registered hooks against a context carrying a SequenceCounter,
parses the emitted NDJSON, and asserts BeforeToolExec.seq=1 +
AfterToolExec.seq=2.
@initializ-mk initializ-mk force-pushed the fix/audit-tool-exec-seq branch from 2bf4007 to d857bdd Compare June 16, 2026 16:11
@initializ-mk initializ-mk merged commit 16cfcf8 into main Jun 16, 2026
9 checks passed
initializ-mk added a commit that referenced this pull request Jun 16, 2026
…seq=1 (closes #174)

Pre-fix: auth_verify / auth_fail audit events lacked the seq field
because the SequenceCounter was installed downstream (in the runner's
per-A2A-request setup) of the auth middleware. The auth chain's
OnAuth callback used plain AuditLogger.Emit and couldn't reach a
counter that didn't exist yet on req.Context().

Symptom (user-visible on v0.15.0):

  auth_verify           (no seq)   ← BUG
  session_start          seq=1
  guardrail_check        seq=2
  llm_call               seq=3
  ...

The FWS-8 contract promises every event under a given correlation_id
exposes a monotonically increasing seq for gap detection. auth_verify
sitting outside the sequence breaks that.

Fix: install the SequenceCounter on r.Context() BEFORE the auth chain
runs, via a thin middleware wrapper around auth.Middleware in the
runner. The auth callback now uses EmitFromContext(req.Context(), ...)
and stamps seq=1 on auth_verify. The runner's per-A2A-request setup
calls coreruntime.EnsureSequenceCounter (new helper) which detects
the existing counter and reuses it — so session_start lands seq=2
and the sequence is gap-free.

Three pieces:

  - forge-core/runtime/audit_schema.go gains EnsureSequenceCounter:
    returns ctx unchanged when a counter is present, installs a fresh
    one when absent. Used at every invocation-entry point that may
    run downstream of an upstream middleware.

  - forge-cli/runtime/sequence_counter_middleware.go (new) is the
    thin wrapper: composes (next) → authMW(next) → install counter
    → serve. Cost: ~24B counter allocation per request, even on
    skip paths — simpler than threading skip-path knowledge into the
    wrapper, and the alloc is in the same ballpark as the request
    struct.

  - forge-cli/runtime/runner.go:
    * AuthMiddleware: auth.Middleware(authCfg) →
      installSequenceCounterMiddleware(auth.Middleware(authCfg))
    * makeAuthAuditCallback: Emit → EmitFromContext(req.Context())
    * Three in-request WithSequenceCounter(ctx, new(...)) calls →
      EnsureSequenceCounter(ctx). Counter is reused from the auth
      wrapper (auth ON path) or installed fresh (--no-auth path).
    * Scheduler dispatcher's WithSequenceCounter call is left alone
      — it runs outside any HTTP request, no upstream counter.

Three other plain-Emit sites stay as documented in PR #173:

  - source=proxy egress: subprocess HTTP CONNECT has no Go ctx
  - mcp_tool_conflict: startup-time, pre-invocation
  - schedule_fire / schedule_complete: scheduler tick, no A2A scope

End-to-end smoke (--no-auth path):

  session_start          seq=1
  session_end            seq=2
  invocation_complete    seq=3

End-to-end smoke (auth on, internal token):

  auth_verify            seq=1   ← FIXED
  session_start          seq=2
  session_end            seq=3
  invocation_complete    seq=4

Tests:

  - TestAuthAudit_SeqStampedWhenCounterInstalled — the #174 pin: with
    a counter on req.Context(), auth_verify lands seq=1
  - TestAuthAudit_NoSeqWhenCounterAbsent — back-compat: no counter →
    omitempty drops the field (legacy embedders without the wrapper)
  - TestSequenceCounterMiddleware_InstallsCounterBeforeNext — the
    wrapper installs the counter before delegating
  - TestEnsureSequenceCounter_ReusesExisting — runner must not
    clobber the auth-installed counter
  - TestEnsureSequenceCounter_InstallsFresh — --no-auth path: install
    a fresh counter when none is present

Schema impact: none. The events were already declared to carry seq.
initializ-mk added a commit that referenced this pull request Jun 16, 2026
Reflects PRs #173 and #174 (both merged to main) in the canonical
audit doc and the comprehensive knowledge skill.

docs/security/audit-logging.md — under "Sequence numbers", added:
  - Counter installation order — the SequenceCounter is installed by
    installSequenceCounterMiddleware (wraps the auth chain) so
    auth_verify / auth_fail land seq=1; the runner's request entry
    uses EnsureSequenceCounter to reuse the wrapper-installed counter
    and only allocate fresh on the --no-auth path. Pinned by
    TestAuthAudit_SeqStampedWhenCounterInstalled +
    TestEnsureSequenceCounter_ReusesExisting.
  - Emit invariant — per-invocation events MUST emit via
    EmitFromContext (or a typed helper); plain Emit skips the counter
    and the trace cross-link. The regression behind #173 (tool_exec +
    session_end) and #174 (auth_verify). A 4-row table names the
    sites that intentionally stay on plain Emit (egress proxy
    subprocess CONNECT, MCP startup events, scheduler tick, startup
    banners) and why. Issue #175 is named as the lint follow-up.

.claude/skills/forge.md — three sections updated:
  - § 3 path trace: middleware order now shows
    installSequenceCounterMiddleware as outermost; runner request
    entry uses EnsureSequenceCounter.
  - § 12.4 FWS-8 section: added the Emit invariant paragraph naming
    the regression pins and exception list.
  - § 18 FWS-8 row: appended #173 / #174 follow-up summary +
    #175 lint tracker.

No code change. TOC anchors unchanged. Broken-link sweep clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant