Re-arm the shell ZMQStream read after the out-of-band reply send (fix intermittent dropped execute_request on Windows) #1529
Open
BoykoNeov wants to merge 1 commit into
On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle and never replies, and the client times out waiting for execute_reply (~30% of headless notebook runs in our measurements; which cell hangs wanders run to run).
Root cause: the shell ROUTER socket is dual-use on the shell-channel thread. A ZMQStream reads execute_requests off it, while replies are sent back over the SAME socket out-of-band via a raw send_multipart in SubshellManager._send_on_shell_channel. That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary: after zmq_send the socket may become readable without a new edge). The send is not ZMQStream-mediated, so the stream is never re-armed and a request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: no later arrival re-edges it.
Fix: after each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop -- the same edge-trap reschedule ZMQStream._update_handler already runs internally (add_callback(lambda: stream._handle_events(stream.socket, 0))) -- so the concurrently-arrived request cannot strand. The shell_stream (built in kernelapp.init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it.
Validated on Windows (Python 3.13/3.14, pyzmq 27.1.0 / libzmq 4.3.5): the wedge went from 6/20 (control) to 0/20 with this patch applied, same machine/session, P(0/20 | p=0.30) ~ 8e-4, with the threaded reference live on every send (551 re-arms, 0 None/mismatch). A sham arm with the same scheduling overhead but no re-arm stayed at the control rate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
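A minimal sketch of the re-arm described in the commit message above, not the PR's actual diff. The function name `send_reply_and_rearm` and its parameters are hypothetical, and reaching the loop through `shell_stream.io_loop` is an assumption; only the guard and the `_handle_events(stream.socket, 0)` callback are taken from the description.

```python
from typing import Any, Optional, Sequence


def send_reply_and_rearm(
    shell_socket: Any,
    shell_stream: Optional[Any],
    msg_parts: Sequence[bytes],
) -> bool:
    """Send a reply out-of-band, then reschedule the stream's read handler.

    Returns True if a re-arm callback was scheduled, False otherwise.
    """
    # Raw send that bypasses the ZMQStream; per the description, this is the
    # operation that can consume the socket's pending read edge.
    shell_socket.send_multipart(list(msg_parts))

    # Guard quoted in the PR description: only re-arm a live stream that
    # wraps this very socket.
    if shell_stream is None or shell_stream.socket is not shell_socket:
        return False

    # Assumption: the stream exposes its event loop as `io_loop`.
    # The callback runs the read handler on that loop, so a request that
    # arrived during the send is picked up without waiting for a new edge.
    shell_stream.io_loop.add_callback(
        lambda: shell_stream._handle_events(shell_stream.socket, 0)
    )
    return True
```

Returning a flag is only a convenience for this sketch; it makes it easy to count how often the re-arm was actually scheduled, in the spirit of the "551 re-arms, 0 None/mismatch" check.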
BoykoNeov added a commit to BoykoNeov/steel-sim that referenced this pull request on Jun 14, 2026
The ZMQ_FD edge-drain fix is now an upstream PR (ipython/ipykernel#1529, filed under BoykoNeov, 3 files +21/-0, validated 0/20 vs 6/20). The doc now points to the PR and the applicable patch; steel-sim's retry mitigation is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle (0% CPU) and never replies, and the client times out waiting for execute_reply. In a headless notebook smoke test we measured this at ~30% of runs; which cell hangs wanders run to run (it is content-innocent), and only a kernel restart recovers. This looks like the same user-visible hang tracked downstream in microsoft/vscode-jupyter#17228.
Forcing WindowsSelectorEventLoopPolicy does not help: the kernel already runs a selector loop.
Root cause
The shell ROUTER socket is dual-use on the shell-channel thread: a ZMQStream reads execute_requests off it (init_kernel builds ZMQStream(self.shell_socket, …), delivered via on_recv), while replies go back over the same socket through a raw send_multipart that bypasses the stream, in SubshellManager._send_on_shell_channel.
That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge: the documented libzmq corollary that after zmq_send a socket may become readable without producing a new read event. Because the send is not ZMQStream-mediated, the read side is never re-armed, so an execute_request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: while a backlog is already pending there is no 0 → nonzero EVENTS transition, so no later arrival re-edges it, and the kernel sits idle.
Minimal raw-pyzmq reproduction of just the send-drains-read-edge step (no Jupyter), on a ROUTER (the shell socket type):
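A hedged sketch of what such a reproduction could look like, assuming an inproc ROUTER/DEALER pair and a zero-timeout `select` on the ROUTER's `ZMQ_FD`. The helper names (`fd_is_readable`, `probe_send_drains_read_edge`), the endpoint name, and the 50 ms pause are choices made for this illustration and are not taken from the PR.

```python
import select
import time

import zmq


def fd_is_readable(sock: zmq.Socket) -> bool:
    """Zero-timeout select on the socket's ZMQ_FD (what an event loop watches)."""
    fd = sock.getsockopt(zmq.FD)
    readable, _, _ = select.select([fd], [], [], 0)
    return bool(readable)


def probe_send_drains_read_edge() -> dict:
    ctx = zmq.Context()
    router = ctx.socket(zmq.ROUTER)
    dealer = ctx.socket(zmq.DEALER)
    try:
        router.bind("inproc://edge-probe")
        dealer.connect("inproc://edge-probe")

        # First request: receive it so we hold a peer identity to reply to.
        dealer.send(b"request-1")
        if not router.poll(2000):
            raise RuntimeError("request-1 never arrived")
        identity, _payload = router.recv_multipart()

        # Drain like an edge-triggered reader: read until EVENTS has no POLLIN.
        while router.getsockopt(zmq.EVENTS) & zmq.POLLIN:
            router.recv_multipart()

        # A second request arrives while the reply to the first is pending.
        dealer.send(b"request-2")
        time.sleep(0.05)  # pause before inspecting the fd and sending
        before = fd_is_readable(router)

        # Out-of-band reply on the same socket, with no recv in between.
        router.send_multipart([identity, b"reply-1"])
        after = fd_is_readable(router)
        pollin = bool(router.getsockopt(zmq.EVENTS) & zmq.POLLIN)

        return {
            "fd_readable_before_send": before,
            "fd_readable_after_send": after,
            "events_pollin_after_send": pollin,
        }
    finally:
        dealer.close(linger=0)
        router.close(linger=0)
        ctx.term()


if __name__ == "__main__":
    print(probe_send_drains_read_edge())
```

If the behaviour described above holds on a given platform, one would expect `fd_readable_after_send` to be false while `events_pollin_after_send` is still true: the request is queued, but a loop that only watches the fd has nothing left to wake it.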
Fix
After each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop, the same edge-trap reschedule ZMQStream._update_handler already runs internally: add_callback(lambda: stream._handle_events(stream.socket, 0)).
The shell_stream (built in init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it; the re-arm is guarded by stream is not None and stream.socket is self._shell_socket. No polling and no new dependency: it moves the existing edge-trap reschedule to the one un-mediated operation that drains the edge.
Validation
Windows, Python 3.13 and 3.14, pyzmq 27.1.0 / libzmq 4.3.5; three arms × 20 real notebook runs, same machine/session:
control: bare send_multipart (6/20 runs hung)
sham: add_callback(lambda: None) after the send (same scheduling overhead, no re-arm)
patch: add_callback(lambda: stream._handle_events(stream.socket, 0)) after the send (0/20 runs hung)
P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4. The sham arm isolates the re-arm itself from the added wake-ups (it stays at the control rate). With the diff applied the threaded reference was live on every send (551 re-arms, 0 None/mismatch). Validated against 7.2.0 / 7.3.0, where _send_on_shell_channel is byte-identical to main.
A public-API alternative
stream.flush(zmq.POLLIN), a public ZMQStream method ipykernel already calls in kernelbase.py, may be preferable to reaching into _handle_events. flush is a synchronous drain loop whereas _handle_events(socket, 0) is the edge-trap reschedule (related but not identical). I only measured _handle_events; happy to switch to flush if you'd rather stay on the public API.
Notes
main: _send_on_shell_channel is a bare send_multipart.
ZMQ_FD edge-trigger bug class), getsockopt(zmq.EVENTS) drains signaler, can cause zmq.asyncio recv to miss wakeups zeromq/pyzmq#2173 (same signaler-drain family, different layer).
CellTimeoutError signature, works today and is orthogonal to this fix.
🤖 Generated with Claude Code