Re-arm the shell ZMQStream read after the out-of-band reply send (fix intermittent dropped execute_request on Windows) #1529

Open
BoykoNeov wants to merge 1 commit into ipython:main from BoykoNeov:fix/shell-zmqstream-rearm-after-reply-send

@BoykoNeov

Summary

On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle (0% CPU) and never replies, and the client times out waiting for execute_reply. In a headless notebook smoke test we measured this in ~30% of runs. The cell that hangs varies from run to run, so the hang does not depend on cell content, and only a kernel restart recovers. This looks like the same user-visible hang tracked downstream in microsoft/vscode-jupyter#17228.

Forcing WindowsSelectorEventLoopPolicy does not help — the kernel already runs a selector loop.

Root cause

The shell ROUTER socket is dual-use on the shell-channel thread:

  • a ZMQStream reads execute_requests off it (init_kernel builds ZMQStream(self.shell_socket, …), delivered via on_recv), while
  • replies go back over the same socket out-of-band through a raw send_multipart that bypasses the stream, in SubshellManager._send_on_shell_channel:
def _send_on_shell_channel(self, msg) -> None:
    assert current_thread().name == SHELL_CHANNEL_THREAD_NAME
    self._shell_socket.send_multipart(msg)   # raw send on the ZMQStream's own socket

That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge. This follows from documented libzmq behavior: after zmq_send, a socket may become readable without producing a new read event on the fd. Because the send does not go through the ZMQStream, the read side is never re-armed, so an execute_request that arrived concurrently is stranded unread on an fd that is registered but never reports readable. The strand is terminal: while a backlog is already pending there is no 0 → nonzero transition in ZMQ_EVENTS, so no later arrival produces a new edge, and the kernel sits idle.

Minimal raw-pyzmq reproduction of just the send-drains-read-edge step (no Jupyter), on a ROUTER (the shell socket type):

import select, time, zmq
def readable(fd): return bool(select.select([fd], [], [], 0)[0])

ctx = zmq.Context()
a = ctx.socket(zmq.DEALER); a.setsockopt(zmq.IDENTITY, b"A")
b = ctx.socket(zmq.ROUTER)
b.bind("tcp://127.0.0.1:5704"); a.connect("tcp://127.0.0.1:5704"); time.sleep(0.3)
bfd = b.getsockopt(zmq.FD)

a.send(b"hello"); time.sleep(0.2)                       # warmup: learn route 'A', clear setup edges
assert b.recv_multipart() == [b"A", b"hello"]
b.getsockopt(zmq.EVENTS)                                # read ZMQ_EVENTS once so any leftover fd edge is cleared
assert not readable(bfd) and not (b.getsockopt(zmq.EVENTS) & zmq.POLLIN)

a.send(b"req1"); time.sleep(0.2)                        # a new request arrives, UNREAD
assert readable(bfd)                                    # read edge is set

b.send_multipart([b"A", b"reply"]); time.sleep(0.05)    # OUT-OF-BAND send on the same ROUTER
assert not readable(bfd)                                # <-- the send DRAINED the read edge
assert b.getsockopt(zmq.EVENTS) & zmq.POLLIN            # ...while POLLIN stays set...
assert b.recv_multipart(zmq.NOBLOCK) == [b"A", b"req1"] # ...and the request was still queued
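
The reproduction above covers only the send-drains-read-edge step. The claim that the strand is terminal (no later arrival re-edges it) can be illustrated with a toy model. This is a rough sketch in plain Python, not libzmq: the class `EdgeTriggeredSocketModel` and the function `run` are hypothetical names, and the rule that a send clears the fd edge is taken from the description in this PR, not derived from libzmq internals.

```python
from collections import deque


class EdgeTriggeredSocketModel:
    """Toy model (NOT libzmq): the fd is signalled only on empty -> non-empty."""

    def __init__(self):
        self.queue = deque()
        self.fd_readable = False  # stands in for select() on ZMQ_FD

    def deliver(self, msg):
        # A peer message arrives; signal the fd only if nothing was pending.
        if not self.queue:
            self.fd_readable = True
        self.queue.append(msg)

    def send(self, msg):
        # Assumption from the PR text: a send consumes the pending read edge.
        self.fd_readable = False

    def pollin(self):
        # Level-triggered state, like getsockopt(zmq.EVENTS) & zmq.POLLIN.
        return bool(self.queue)

    def recv(self):
        self.fd_readable = False
        return self.queue.popleft()


def run(rearm_after_send):
    """Return the requests the read handler got to see."""
    sock = EdgeTriggeredSocketModel()
    handled = []

    def handle_events():
        # Drain while the level-triggered state says there is input.
        while sock.pollin():
            handled.append(sock.recv())

    sock.deliver("req1")      # request arrives just before a reply is sent
    sock.send("reply")        # out-of-band send clears the read edge
    if rearm_after_send:
        handle_events()       # the fix: run the read handler regardless of the fd
    sock.deliver("req2")      # later arrival; no new edge if a backlog exists
    if sock.fd_readable:      # the event loop wakes the handler only via the fd
        handle_events()
    return handled


print(run(rearm_after_send=False))  # nothing is ever handled
print(run(rearm_after_send=True))   # both requests are handled
```

In the model, skipping the re-arm leaves both requests queued forever, because the second arrival finds a non-empty backlog and produces no edge.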

Fix

After each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop — the same edge-trap reschedule ZMQStream._update_handler already runs internally:

self._shell_channel_io_loop.add_callback(
    lambda: stream._handle_events(stream.socket, 0)
)

The shell_stream (built in init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it; the re-arm is guarded by stream is not None and stream.socket is self._shell_socket. No polling and no new dependency — it moves the existing edge-trap reschedule to the one un-mediated operation that drains the edge.

Validation

Windows, Python 3.13 and 3.14, pyzmq 27.1.0 / libzmq 4.3.5; three arms × 20 real notebook runs, same machine/session:

| Arm | After the reply send_multipart | Wedged |
|---|---|---|
| control | none | 6/20 (30%) |
| sham | identical scheduling overhead, add_callback(lambda: None) (no re-arm) | 5/20 |
| fix | add_callback(lambda: stream._handle_events(stream.socket, 0)) | 0/20 |

P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4. The sham arm isolates the re-arm itself from the added wake-ups: it stays near the control rate (5/20 versus 6/20). With the diff applied, the stream reference threaded into SubshellManager was live on every send (551 re-arms, 0 None/mismatch). Validated against 7.2.0 / 7.3.0, where _send_on_shell_channel is byte-identical to main.
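
The probability quoted above can be checked with a few lines of Python. This sketch assumes the runs are independent and that each wedges at the observed control rate; the function name `prob_zero_wedges` is made up for illustration.

```python
def prob_zero_wedges(p_wedge: float, runs: int) -> float:
    """Chance of seeing no wedge in `runs` independent runs at rate p_wedge."""
    return (1.0 - p_wedge) ** runs


control_rate = 6 / 20                       # observed wedge rate in the control arm
p_zero = prob_zero_wedges(control_rate, 20)
print(f"{p_zero:.1e}")                      # prints 8.0e-04
```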

A public-API alternative

stream.flush(zmq.POLLIN) — a public ZMQStream method ipykernel already calls in kernelbase.py — may be preferable to reaching into _handle_events. flush is a synchronous drain loop whereas _handle_events(socket, 0) is the edge-trap reschedule (related but not identical). I only measured _handle_events; happy to switch to flush if you'd rather stay on the public API.

Notes

🤖 Generated with Claude Code

On Windows, ipykernel 7 intermittently drops an execute_request on the
shell channel: the kernel goes idle and never replies, and the client
times out waiting for execute_reply (~30% of headless notebook runs in
our measurements; which cell hangs wanders run to run).

Root cause: the shell ROUTER socket is dual-use on the shell-channel
thread. A ZMQStream reads execute_requests off it, while replies are
sent back over the SAME socket out-of-band via a raw send_multipart in
SubshellManager._send_on_shell_channel. That out-of-band send drains the
socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary:
after zmq_send the socket may become readable without a new edge). The
send is not ZMQStream-mediated, so the stream is never re-armed and a
request that arrived concurrently strands unread on a registered-but-
non-readable fd. The strand is terminal: no later arrival re-edges it.

Fix: after each out-of-band reply send, schedule the shell ZMQStream's
read handler on the shell-channel loop -- the same edge-trap reschedule
ZMQStream._update_handler already runs internally
(add_callback(lambda: stream._handle_events(stream.socket, 0))) -- so
the concurrently-arrived request cannot strand. The shell_stream (built
in kernelapp.init_kernel) is threaded through ShellChannelThread into
SubshellManager so the reply path can reach it.

Validated on Windows (Python 3.13/3.14, pyzmq 27.1.0 / libzmq 4.3.5):
the wedge went from 6/20 (control) to 0/20 with this patch applied, same
machine/session, P(0/20 | p=0.30) ~ 8e-4, with the threaded reference
live on every send (551 re-arms, 0 None/mismatch). A sham arm with the
same scheduling overhead but no re-arm stayed at the control rate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BoykoNeov added a commit to BoykoNeov/steel-sim that referenced this pull request Jun 14, 2026
The ZMQ_FD edge-drain fix is now an upstream PR (ipython/ipykernel#1529,
filed under BoykoNeov, 3 files +21/-0, validated 0/20 vs 6/20). The doc now
points to the PR and the applicable patch; steel-sim's retry mitigation is
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>