Skip to content

fix(bazel): pre-subscribe AXL build_events() via local() sink#1077

Open
gregmagolan wants to merge 2 commits into
mainfrom
fix_axl_bes_late_subscribe
Open

fix(bazel): pre-subscribe AXL build_events() via local() sink#1077
gregmagolan wants to merge 2 commits into
mainfrom
fix_axl_bes_late_subscribe

Conversation

@gregmagolan
Copy link
Copy Markdown
Member

@gregmagolan gregmagolan commented May 12, 2026

Summary

Fixes an intermittent late-subscribe race in ctx.bazel.build(...)'s
event stream. The AXL-facing receiver returned by
build.build_events() was created lazily, only when the AXL task
called the method. On a warm bazel daemon with a fully cached build,
BES events stream within milliseconds — fast enough that any work
between ctx.bazel.build(...) returning and the subsequent
build.build_events() (a task_update emit, a buildkite-agent annotate spawn, the gRPC sink's connect attempt) could let the
early burst (build_started, target_completed,
named_set_of_files) flow before the AXL subscriber registered.

Observed symptoms:

  • runnable.determine_entrypoint() returned NoneFailed to determine the formatter entrypoint. on aspect format (sister
    errors on aspect gazelle / aspect lint). Lint's SARIF
    collector missed reports → "0 findings" on runs that should have
    flagged.
  • Live status surfaces stuck on "Spawning bazel build…" until the
    first non-burst event arrived.
  • Cold-daemon runs passed; warm-daemon runs flaked. Race that
    escapes local testing.

Two fixes together:

1. Explicit local-iteration intent

New BuildEventSink::Local variant + AXL factory
bazel.build_events.local(buffer_cap=10000). Callers declare
local-iteration intent up front instead of relying on the runtime
to guess from a subsequent build.build_events() call:

build = ctx.bazel.build(
    "//target",
    build_events = [
        bazel.build_events.local(),                # AXL iterates
        bazel.build_events.grpc(uri = "..."),      # remote forwards
    ],
    flags = flags,
)

# Existing `build_events = True` continues to work — it now expands
# to `[bazel.build_events.local()]`.

Build::spawn registers the local subscriber inside the spawn,
before bazel opens the BEP FIFO and before remote gRPC sinks touch
the network. The first call to build.build_events() returns the
pre-subscribed receiver; events broadcast in between were already
buffered.

If no Local sink is configured, build.build_events() errors at
call time — the runtime won't subscribe behind the caller's back,
because every undrained subscription is buffered memory.

2. Capped buffer (memory bound)

A misbehaving task that subscribes but never drains would otherwise
grow the channel without bound. New CappedSubscriber<T> wraps a
Receiver + shared Arc<AtomicUsize>. The broadcaster's subscriber
list becomes Vec<Subscription<T>> with Unbounded (existing —
tracing sink, gRPC sinks) and Capped { tx, queued, cap } variants.
On each send the broadcaster checks the capped entries; when
queued >= cap it drops the subscription and the AXL iterator sees
Disconnected on its next recv. Default cap is 10000 events.

Built-in tasks updated

Every built-in task that iterates BES events now prepends
bazel.build_events.local() to its sink list:

  • build / test (via lib/bazel_runner.axl) — gated on
    need_bes (any bazel_trait.build_event handler or
    lifecycle.task_update handler registered).
  • lint, format, gazelle, delivery — always (each calls
    build.build_events() unconditionally).

Docs

DEVELOPMENT.md updated: the broadcaster section now describes the
explicit-intent contract, the buffer-cap semantics, and the
late-subscribe escape hatch. The contributor checklist is updated to
remind authors of new tasks to include bazel.build_events.local()
when they iterate.

Relationship to HANG_ANALYSIS.md and PR #1071

HANG_ANALYSIS.md covers a separate GHA-cancel hang with three
root causes. PR #1071 (fix(http): bound read and connect timeouts on the runtime HTTP client, already landed) handled one of them. The
remaining two (signal-handler / process-registry plumbing for
subprocess cleanup on cancel) are in a separate change set and not in
this PR. The two surface areas overlap (BES event-stream
plumbing / cancellation cleanup) but the fixes are independent.


Changes are visible to end-users: yes

  • Searched for relevant documentation and updated as needed: yes
  • Breaking change (forces users to change their own code or config): no
  • Suggested release notes appear below: yes

Suggested release notes

  • New bazel.build_events.local(buffer_cap=10000) AXL factory.
    Declare it explicitly in the build_events=[...] list of
    ctx.bazel.build(...) / ctx.bazel.test(...) when your task
    iterates the BES stream via build.build_events(). The runtime
    pre-registers your receiver before bazel opens the BEP FIFO,
    eliminating the warm-daemon late-subscribe race that caused
    intermittent Failed to determine the formatter entrypoint. and
    similar errors. ctx.bazel.build(build_events = True) continues to
    work — it now expands to [bazel.build_events.local()].
  • build.build_events() now errors when called against a build that
    was configured with remote BES sinks but no local sink. Callers
    that need to iterate locally must opt in via
    bazel.build_events.local(...). Existing callers using the True
    shorthand are unaffected.

Test plan

  • New test cases added: 3 broadcaster unit tests covering
    normal-drain, overflow-drop, and capped-coexists-with-unbounded
    paths.
  • Covered by existing test cases: every built-in task's snapshot
    suite (PR comment, BK annotation, template, lint, format, gazelle,
    delivery) — 727 AXL tests + 7 snapshot suites pass.
  • Manual testing: ran each of aspect build, aspect test,
    aspect lint, aspect format, aspect gazelle, aspect delivery
    against a warm daemon and confirmed entrypoint resolution, SARIF
    collection, and live status-surface phase tags consistently see
    the early-burst events.

@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented May 12, 2026

✨ Aspect Workflows Tasks

📅 Wed May 13 04:05:49 UTC 2026

✅ 16 successful tasks

  • ✅ build (build-gha) · ⏱ 1m 16s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel build complete (161 built)
  • ✅ build (build-gha-debug) · ⏱ 1m 41s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel build complete (161 built)
  • ✅ format (buildifier-gha) · ⏱ 1m 27s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ format (buildifier-gha-debug) · ⏱ 1m 21s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ delivery (delivery-gha) · ⏱ 1m 45s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Delivery complete (1 delivered)
  • ✅ delivery (delivery-gha-debug) · ⏱ 5m 26s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Delivery complete (1 delivered)
  • ✅ format (format-gha) · ⏱ 1m 25s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ format (format-gha-debug) · ⏱ 1m 32s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Format complete (clean)
  • ✅ gazelle (gazelle-from-source-gha) · ⏱ 1m 24s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ gazelle (gazelle-from-source-gha-debug) · ⏱ 1m 18s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ gazelle (gazelle-gha) · ⏱ 1m 20s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ gazelle (gazelle-gha-debug) · ⏱ 1m 22s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Gazelle complete (clean)
  • ✅ lint (lint-gha) · ⏱ 1m 20s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Lint complete (clean)
  • ✅ lint (lint-gha-debug) · ⏱ 1m 31s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Lint complete (clean)
  • ✅ test (test-gha) · ⏱ 1m 13s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel test complete (25/25 passed · 25 cached)
  • ✅ test (test-gha-debug) · ⏱ 1m 35s · ✨ Aspect · 🐙 GitHub Actions · ☑️ Check
    💬 Bazel test complete (24/24 passed · 22 cached)

Powered by Aspect Build · GitHub · X · LinkedIn · YouTube

@gregmagolan gregmagolan force-pushed the fix_axl_bes_late_subscribe branch 8 times, most recently from b2e1ca0 to f1b6c61 Compare May 13, 2026 01:29
@gregmagolan gregmagolan requested a review from thesayyn May 13, 2026 01:44
@gregmagolan gregmagolan marked this pull request as ready for review May 13, 2026 01:44
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1b6c61af8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/build.rs Outdated
@gregmagolan gregmagolan force-pushed the fix_axl_bes_late_subscribe branch from f1b6c61 to 2057863 Compare May 13, 2026 01:49
@gregmagolan gregmagolan changed the title fix axl bes late subscribe fix(bazel): pre-subscribe AXL build_events() via local() sink May 13, 2026
@gregmagolan gregmagolan force-pushed the fix_axl_bes_late_subscribe branch from 2057863 to e9ee487 Compare May 13, 2026 01:50
…calls

Per Codex review: the prior implementation unconditionally `.take()`'d
the pre-subscribed `early_event_subscriber` on every call and errored
when it was None, regressing the historical "each call returns a new
iterator" contract. AXL code that hands a fresh iterator to each
subroutine (e.g. helper functions that each obtain their own
`build_events()` handle) aborted at runtime.

Fix preserves both ends of the contract:

- **First call**: returns the race-free pre-subscribed
  `CappedSubscriber` (registered inside `Build::spawn` before bazel
  opens the BEP FIFO). No change.

- **Subsequent calls**: fall back to a fresh
  `event_stream.subscribe()` wrapped in
  `CappedSubscriber::from_unbounded(...)`. Normal late-subscriber
  semantics — these iterators only see events broadcast after the
  fresh subscribe call. The primary consumer (first caller) is past
  the race window, so the extra iterators getting late-subscribe
  semantics is acceptable.

- **No Local sink configured at all**: still errors. The
  explicit-intent gate ("declare a `bazel.build_events.local(...)`
  to opt in to local iteration") is preserved via a new
  `had_local_sink: bool` field on `Build`. Without it the runtime
  won't subscribe behind the caller's back, because every undrained
  subscription is buffered memory.

Implementation:

- `CappedSubscriber.queued` is now `Option<Arc<AtomicUsize>>`. Plain
  unbounded subscribers (the late-subscribe path) hold `None` and
  skip the decrement on `recv` / `try_recv` — they're on the
  broadcaster's `Subscription::Unbounded` path, so there's no
  shared counter to track.
- New `CappedSubscriber::from_unbounded(rx)` constructor.
- `Build.had_local_sink` driven by `local_sink_cap.is_some()` at
  spawn time, so the gating signal survives the first `.take()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregmagolan gregmagolan force-pushed the fix_axl_bes_late_subscribe branch from e4514bd to e4e1c9d Compare May 13, 2026 04:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant