
Custom memory persistence silently no-ops on Workflow (BaseNode) roots in 2.x — no working post-run hook #5282

@surfai

Description

Edit 2026-04-11 (after @surajksharma07's triage + source re-verification): scope is narrower than this body originally claimed. Only plugin_manager.run_after_run_callback is missing dispatch on the BaseNode path. run_on_event_callback does fire via _consume_event_queue at runners.py:619, and after_model_callback is dispatched by LlmAgent internals on the model-call boundary independently of the Runner. The primary ask stands — land the runners.py:427 TODO for run_after_run_callback — but it's a one-dispatch-call fix, not the broad "plugin lifecycle is incomplete" re-architecture this body originally framed. Overclaims in the sections below are struck through. Verified dispatch matrix + working workaround + drop-in regression test in this comment below.


Environment: google-adk==2.0.0a3 (released 2026-04-09), Python 3.14, Linux/macOS. Feature flags NEW_WORKFLOW, V1_LLM_AGENT, PLUGGABLE_AUTH default-on (the a3 defaults).


The problem we ran into

We have a multi-turn chat agent backed by a custom SSE server (FastAPI + google.adk.Runner). The agent is a graph workflow — App(root_agent=Workflow(edges=[...])) — classifying, guarding, and routing requests, and we persist each completed session to a custom MyMemoryService(BaseMemoryService) for long-term recall across sessions. The implementation details of the memory backend are not relevant here; the issue reproduces identically with InMemoryMemoryService.

We followed the canonical pattern from the memory docs:

async def save_to_memory(callback_context):
    await callback_context.add_session_to_memory()

root_agent = Workflow(
    name="FrontDoorWorkflow",
    after_agent_callback=save_to_memory,   # ← docs pattern
    edges=[...],
)

Sessions and events were persisting normally (via DatabaseSessionService), but the memory store's write count was stuck at its pre-upgrade value — no new rows after the a3 upgrade. No exception, no warning, no log line. It looked like the store had stopped accepting writes; it hadn't.

The next thing we tried was the BasePlugin.after_run_callback pattern, which reads as the more ADK-native approach:

class MemoryPersistPlugin(BasePlugin):
    async def after_run_callback(self, *, invocation_context):
        if invocation_context.memory_service is None:
            return
        await invocation_context.memory_service.add_session_to_memory(
            invocation_context.session
        )

app = App(root_agent=root_agent, plugins=[MemoryPersistPlugin()])

This also silently no-ops. We traced it through the installed source, and the real story is that ~~BaseNode root paths in 2.0.0a3 have no post-run lifecycle hook surface exposed to users — not via callbacks, not via plugins~~. [Edit: overclaim. The precise gap is run_after_run_callback. See top-of-issue note.]

This is a blocker for any cross-cutting post-run concern on Workflow roots: memory persistence, metrics emission, audit logging, final telemetry, cleanup. ~~Users migrating toward the BaseNode-rooted architecture that 2.0 is steering them onto silently lose every after-hook they had on LlmAgent.~~ [Edit: overclaim. The blocker is specifically on after_run_callback-based patterns, which still rules out most ADK-native cross-cutting wiring, but on_event_callback-based observability works today.]


Root cause (source trace)

1. Workflow is a BaseNode, not a BaseAgent

  • google/adk/workflow/_workflow_class.py:143 — class Workflow(BaseNode).
  • BaseNode has no after_agent_callback field. Workflow doesn't either.
  • Pydantic v2 default extra="ignore" → passing after_agent_callback=save_to_memory to Workflow(...) is silently dropped at construction. No error, no warning.
  • The dispatch mechanism for the field, BaseAgent._handle_after_agent_callback at google/adk/agents/base_agent.py:491, is agent-only and unreachable from a BaseNode root.
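
The silent drop in the third bullet can be reproduced in isolation. NodeStub below is a hypothetical stand-in, not the real BaseNode; the only assumption carried over from the trace is that BaseNode is a Pydantic v2 model left on the default extra="ignore" configuration:

```python
from pydantic import BaseModel, ConfigDict

class NodeStub(BaseModel):
    """Hypothetical stand-in for BaseNode; NOT the real ADK class."""
    model_config = ConfigDict(extra="ignore")  # Pydantic v2 default behavior
    name: str

# The unknown kwarg is accepted at construction and silently discarded:
node = NodeStub(name="FrontDoorWorkflow", after_agent_callback=lambda ctx: None)
print(hasattr(node, "after_agent_callback"))  # False; no error, no warning
```

Any typo'd or unsupported kwarg disappears the same way, which is why the construction site gives no hint that the callback was never registered.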

This alone produces the symptom. But the deeper issue is what happens if users try the ADK-native alternative:

2. run_after_run_callback is not dispatched on the BaseNode path

Runner.run_async at google/adk/runners.py:760 branches on root type:

  • Branch A (isinstance(self.agent, LlmAgent), line 807): wraps in _V1LlmAgentWrapper, calls _run_node_async, return at line 834.
  • Branch B (isinstance(self.agent, BaseNode) and not LlmAgent, line 838): calls _run_node_async, return at line 848. This is the path Workflow roots take.
  • Branch C (legacy fallthrough _run_with_trace, line 851+): uses _exec_with_plugin which dispatches run_after_run_callback at runners.py:1230. Neither root shape reaches this path.

So the only code path that dispatches after_run_callback is the one that nobody hits in 2.0. Both LlmAgent and Workflow roots funnel into _run_node_async.

3. _run_node_async has not yet wired run_after_run_callback

google/adk/runners.py:413 _run_node_async():

  • Line 427: # TODO: Add tracing and plugin lifecycle for the node runtime path. — explicit acknowledgement this is incomplete.
  • Line 467: run_on_user_message_callback
  • Line 482: run_before_run_callback
  • Line 506: async for event in self._consume_event_queue(ic, done_sentinel) — drains events via _consume_event_queue, which dispatches run_on_event_callback at runners.py:619. [Edit: this was wrong in the original text below; on_event_callback does fire for Workflow roots.]
  • No call to run_after_run_callback anywhere in the function. ~~Nor to run_after_agent_callback, run_after_model_callback, or run_on_event_callback.~~ [Edit: run_on_event_callback fires via _consume_event_queue:619; after_model_callback is dispatched by LlmAgent on the model-call boundary, not the Runner; after_agent_callback is agent-only and not a _run_node_async concern.]

Pre-run plugin hooks (on_user_message, before_run) and per-event hooks (on_event_callback) DO fire for BaseNode roots. run_after_run_callback does not. A BasePlugin overriding after_run_callback loads successfully, registers with the Runner, and then never executes.


Minimal reproducer

# pip install google-adk==2.0.0a3
import asyncio
from google.adk import Event, Workflow
from google.adk.apps import App
from google.adk.events import EventActions  # noqa: F401
from google.adk.plugins.base_plugin import BasePlugin
from google.adk.runners import Runner
from google.adk.sessions.in_memory_session_service import InMemorySessionService
from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
from google.genai import types


def terminal_node(ctx) -> Event:
    return Event(state={"done": True})


class TracerPlugin(BasePlugin):
    def __init__(self):
        super().__init__(name="tracer")
        self.fired = {
            "on_user_message_callback": 0,
            "before_run_callback": 0,
            "after_run_callback": 0,
            "after_agent_callback": 0,
            "on_event_callback": 0,
        }

    async def on_user_message_callback(self, *, invocation_context, user_message):
        self.fired["on_user_message_callback"] += 1

    async def before_run_callback(self, *, invocation_context):
        self.fired["before_run_callback"] += 1

    async def after_run_callback(self, *, invocation_context):
        self.fired["after_run_callback"] += 1

    async def after_agent_callback(self, *, agent, callback_context):
        self.fired["after_agent_callback"] += 1

    async def on_event_callback(self, *, invocation_context, event):
        self.fired["on_event_callback"] += 1


async def main():
    plugin = TracerPlugin()
    workflow = Workflow(name="Demo", edges=[("START", terminal_node)])
    app = App(name="demo", root_agent=workflow, plugins=[plugin])
    runner = Runner(
        app_name="demo",
        app=app,
        session_service=InMemorySessionService(),
        memory_service=InMemoryMemoryService(),
    )
    session = await runner.session_service.create_session(app_name="demo", user_id="u1")
    async for _ in runner.run_async(
        user_id="u1",
        session_id=session.id,
        new_message=types.Content(parts=[types.Part(text="hi")], role="user"),
    ):
        pass
    print(plugin.fired)


asyncio.run(main())

[Edit: the reproducer above has a bug — terminal_node yields Event(state=...) with no content, which skews on_event_callback counts, and the original "Actual" output below reflected that. A corrected drop-in test case with a content-bearing terminal node and a working WorkaroundRunner is in this comment.]

Expected: every hook fires at least once.
Actual (on 2.0.0a3):
~~{'on_user_message_callback': 1, 'before_run_callback': 1, 'after_run_callback': 0, 'after_agent_callback': 0, 'on_event_callback': 0}~~

Corrected actual (2.0.0a3, content-bearing terminal event):

{'on_user_message_callback': 1, 'before_run_callback': 1,
 'on_event_callback': 1, 'after_run_callback': 0}

Only after_run_callback stays at 0.


Impact

~~Every cross-cutting post-invocation concern silently no-ops on Workflow roots.~~ [Edit: narrower than the table originally claimed. Corrected:]

| Concern | Typical wiring | ADK 2.0.0a3 Workflow root status |
| --- | --- | --- |
| Long-term memory (`add_session_to_memory`) | `after_agent_callback` or `after_run_callback` | Silent no-op (this issue) |
| Metrics emission (latency, token counts, success rate) | `after_run_callback` | Silent no-op (this issue) |
| Audit logging / compliance trails | `after_run_callback` | Silent no-op via `after_run_callback` (this issue); `on_event_callback`-based audit works today |
| Final state cleanup, resource release | `after_run_callback` | Silent no-op (this issue) |
| Post-run token/cost accounting | `after_model_callback` | ~~Silent no-op~~ [Edit: `after_model_callback` is dispatched by LlmAgent on the model-call boundary, not by the Runner. Fires on Workflow roots that contain LlmAgent nodes. Not part of this issue.] |
| Per-event telemetry / event rewriting | `on_event_callback` | Works via `_consume_event_queue:619` |

And because there is no warning, teams discover the problem only when empty dashboards, missing memory rows, or absent audit trails are flagged downstream — often after shipping.

Our workaround was to embed memory persistence as a terminal FunctionNode inside the graph (async def persist_memory(ctx) wired as a final edge after every terminal specialist). It works, but it's specific to memory and doesn't generalize to token accounting, metrics, or audit hooks, which can't live as graph nodes cleanly. [Edit: a cleaner interim — subclass Runner and wrap run_async to dispatch run_after_run_callback after the generator drains — also works for teams instantiating Runner directly. Not viable when adk api_server owns the Runner instantiation (hardcoded at adk_web_server.py:737). Working example in the comment below.]
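
For teams that own the Runner instantiation, the subclass-and-wrap shape looks roughly like the stub below. Every class here (BaseRunnerStub, CountingPlugin, the event dicts) is a local stand-in for the real google.adk.runners.Runner and its plugin manager, not the actual API; the only load-bearing idea is dispatching the after-run hook once the inherited generator drains.

```python
import asyncio

class BaseRunnerStub:
    """Local stand-in for google.adk.runners.Runner (hypothetical shape)."""
    def __init__(self, plugins):
        self.plugins = plugins

    async def run_async(self, invocation_context):
        # Stands in for the real event stream the Runner yields.
        yield {"author": "workflow", "text": "done"}

class WorkaroundRunner(BaseRunnerStub):
    async def run_async(self, invocation_context):
        # Re-yield everything the parent produces...
        async for event in super().run_async(invocation_context):
            yield event
        # ...then dispatch the missing after-run hook after the stream drains.
        for plugin in self.plugins:
            await plugin.after_run_callback(invocation_context=invocation_context)

class CountingPlugin:
    def __init__(self):
        self.after_run_calls = 0

    async def after_run_callback(self, *, invocation_context):
        self.after_run_calls += 1

async def demo():
    plugin = CountingPlugin()
    runner = WorkaroundRunner(plugins=[plugin])
    events = [e async for e in runner.run_async(invocation_context=object())]
    return events, plugin.after_run_calls

events, calls = asyncio.run(demo())
print(len(events), calls)  # 1 1
```

The same wrapping works against the real Runner because run_async is an async generator either way; the callers downstream see an identical event stream.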


Open question — how should custom memory persistence work on Workflow roots today?

The memory docs page shows one canonical pattern: Agent(..., after_agent_callback=auto_save_session_to_memory_callback). With Workflow(BaseNode) being a flagship 2.0 feature and the intended future root type for non-LLM graph agents (per the runners.py:836-837 TODO to collapse LlmAgent into the BaseNode path), it would be very helpful if maintainers could confirm the supported answer to one of:

  1. Is there a hook we missed? If there's a canonical surface we haven't found for "run code at end of invocation on a BaseNode root," please point at it — we searched BasePlugin, BaseNode, Workflow, App, and Runner methods and could not find one that dispatches. Docs/example link welcome.
  2. Is the terminal-FunctionNode-in-graph pattern the intended interim answer? If so, it's worth documenting on the memory page next to the Agent example, so new users land on it instead of the silently-dropped after_agent_callback kwarg.
  3. If neither — what's the recommended path? We would like to build correctly against a path maintainers endorse, not against whichever private API happens to work.

Worth flagging: we believe every team building custom memory persistence in 2.x will hit this, not just us. The direction of travel in 2.0 is explicit — Workflow(BaseNode) is the flagship graph root type, and the TODO at runners.py:836-837 says LlmAgent itself will be refactored to inherit from BaseNode, at which point _run_node_async becomes the single dispatch path for all Runner invocations. Any team that:

  • subclasses BaseMemoryService (the documented extension point for custom memory backends), AND
  • uses a Workflow root (or, post-LlmAgent-migration, any 2.x root at all), AND
  • follows the memory docs page to wire persistence via after_agent_callback OR the ADK-native BasePlugin.after_run_callback

…will land in exactly the silent no-op we did. There is no "lucky" configuration that avoids it on 2.0.0a3 with a BaseNode root. And because the failure mode is silent (no exception, no warning, no log), downstream symptoms — empty memory store, missing cross-session recall, flat retrieval quality — are easy to attribute to embedding tuning, retrieval scoring, or chunk sizing instead of "the write never happened."

If the runners.py:427 TODO sits for multiple release cycles, which is entirely plausible given Runner-dispatch completeness does not currently appear to be a top-priority area, then custom memory persistence on Workflow roots is effectively unsupported in 2.x until either the TODO lands or the docs adopt the in-graph terminal-node pattern as an interim answer. The current situation — "docs pattern silently drops, Plugin alternative silently no-ops, no documented workaround, open TODO of unknown priority" — leaves every new adopter either guessing or reverse-engineering the Runner internals. A one-paragraph note on the memory docs page would save each of them the same multi-hour source trace.


Related issues

  • #4181 (open) — "before_model_callback and after_model_callback not invoked for live streaming sessions (run_live)". Same structural pattern: a specific Runner code path bypasses the callback dispatch that the standard path calls. This issue is the BaseNode-root analogue, with a different Runner branch (_run_node_async instead of run_live). A fix for one does not automatically cover the other; each alternate dispatch path has to be wired individually.
  • #4774 (open) — "Add Lifecycle Error Callbacks (on_agent_error, on_run_error) to ADK Framework". Motivates why after_run_callback/after_agent_callback are load-bearing for enterprise observability: "AGENT_COMPLETED and INVOCATION_COMPLETED events are never emitted to observability sinks… failed runs disappear from the denominator in standard reports." Same concern applies when after_run_callback doesn't fire at all on BaseNode roots — telemetry plugins depending on it miss every successful run, not just failed ones.

The ask

Primary — land the runners.py:427 TODO

Wire run_after_run_callback into _run_node_async, mirroring the dispatch that _exec_with_plugin already performs on the legacy path at runners.py:1230. Concretely: after the async for event in self._consume_event_queue(...) loop at line 506 drains, before the function returns, call await ic.plugin_manager.run_after_run_callback(invocation_context=ic).

- run_after_run_callback at the end of the function (after the event queue drains, before return) — mirrors runners.py:1230
- ~~run_after_agent_callback when a node finishes~~
- ~~run_after_model_callback at the model-call boundary~~
- ~~run_on_event_callback per yielded event — mirrors runners.py:1216~~

[Edit: the other three bullets are either already handled (run_on_event_callback via _consume_event_queue:619) or out of scope for _run_node_async (after_agent_callback is BaseAgent-only; after_model_callback is dispatched by LlmAgent on the model-call boundary). Scope is one dispatch call.]
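
In stub form, the proposed dispatch ordering is sketched below. All names here (PluginManagerStub, run_node_async) are local stand-ins mirroring the hook names from the trace, not the real runners.py internals; the point is only where the one new dispatch call sits relative to the existing ones.

```python
import asyncio

calls = []  # records dispatch order

class PluginManagerStub:
    """Stand-in for the real plugin manager; method names mirror the hooks."""
    async def run_before_run_callback(self, *, invocation_context):
        calls.append("before_run")

    async def run_on_event_callback(self, *, invocation_context, event):
        calls.append("on_event")

    async def run_after_run_callback(self, *, invocation_context):
        calls.append("after_run")

async def run_node_async(plugin_manager, ic, queued_events):
    await plugin_manager.run_before_run_callback(invocation_context=ic)
    for event in queued_events:  # stands in for _consume_event_queue
        await plugin_manager.run_on_event_callback(invocation_context=ic, event=event)
        yield event
    # The proposed fix: one dispatch call after the queue drains, before return.
    await plugin_manager.run_after_run_callback(invocation_context=ic)

async def demo():
    pm = PluginManagerStub()
    async for _ in run_node_async(pm, object(), [{"text": "done"}]):
        pass
    return calls

order = asyncio.run(demo())
print(order)  # ['before_run', 'on_event', 'after_run']
```

Placing the call after the drain loop (rather than inside a finally) matches the semantics of the legacy _exec_with_plugin path, where after_run fires only on normal completion; error-path hooks are #4774's territory.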

This is structurally the same ask as #4181: identify an alternate Runner dispatch path that the current plugin wiring skipped, and add the callback call. The fix pattern is well-understood.

This lets a BasePlugin work the same on BaseNode roots as it does on the legacy _exec_with_plugin path today, and unblocks the whole class of after_run_callback-based cross-cutting concerns (memory persistence, metrics, audit, cleanup).

Related: the TODO at runners.py:836-837 ("remove not isinstance(self.agent, LlmAgent) after LLM agent is refactored to inherit from BaseNode") shows _run_node_async is the intended single dispatch path going forward. Closing this gap removes the last pre-condition to collapsing Branch A and Branch B into one.

Secondary — defensive UX while the TODO is open

Log a one-time warning at Runner.__init__ (or first run_async call) when:

  1. the root is a BaseNode (not a BaseAgent subclass), AND
  2. app.plugins contains any plugin that overrides after_run_callback.

Something like:

WARNING: Plugin 'MemoryPersistPlugin' defines after_run_callback, but the
BaseNode root path in Runner does not currently dispatch it (see #5282
and runners.py:427). This callback will not fire. For memory writes on
Workflow roots, use a terminal FunctionNode until this is resolved.

This turns a silent data-loss bug into a startup-time warning. Users find out immediately, not after shipping and checking empty dashboards.
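
One cheap way to implement condition (2), detecting whether a plugin actually overrides after_run_callback, is an identity comparison against the base-class attribute. The classes below are stubs standing in for google.adk.plugins.base_plugin.BasePlugin and a user plugin; only the comparison itself is the suggestion.

```python
class BasePlugin:
    """Stand-in for google.adk.plugins.base_plugin.BasePlugin."""
    async def after_run_callback(self, *, invocation_context):
        return None

class MemoryPersistPlugin(BasePlugin):
    async def after_run_callback(self, *, invocation_context):
        ...  # would persist the session here

class PassivePlugin(BasePlugin):
    pass  # inherits the no-op; should not trigger the warning

def overrides_after_run(plugin: BasePlugin) -> bool:
    # An override replaces the function object on the subclass; identity
    # comparison against the base attribute detects that without calling it.
    return type(plugin).after_run_callback is not BasePlugin.after_run_callback

print(overrides_after_run(MemoryPersistPlugin()))  # True
print(overrides_after_run(PassivePlugin()))        # False
```

Gating the warning on this check keeps it quiet for plugins that only use the hooks that do fire (on_user_message, before_run, on_event) on BaseNode roots.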

Tertiary — documentation

Update adk.dev/sessions/memory/. The current memory page shows only the Agent(..., after_agent_callback=auto_save_session_to_memory_callback) pattern, with no mention that Workflow roots behave differently. Given Workflow(BaseNode) is a flagship 2.0 feature, the page should:

  • either add a Workflow section with the interim terminal-node workaround
  • or, ideally, after the primary fix lands, show one unified BasePlugin pattern that works for both root shapes

Non-ask (for the record)

We considered two alternative framings and explicitly do NOT recommend them:

  • "Make Workflow accept after_agent_callback" — patches deprecated API. BaseAgent is on the way out per runners.py:836-837.
  • "Make BaseNode use extra='forbid'" — would have caught our silent-drop trap at construction, but likely breaks kwargs forwarded from subclasses and is an API break. The secondary warning above gets the same DX benefit without the compatibility cost.

The primary ask — finishing _run_node_async's run_after_run_callback dispatch — is aligned with the direction 2.0 is already headed.

Metadata

Labels

core [Component]: This issue is related to the core interface and implementation
