Re-arm the shell ZMQStream read after the out-of-band reply send (fix intermittent dropped execute_request on Windows) by BoykoNeov · Pull Request #1529 · ipython/ipykernel

BoykoNeov · 2026-06-14T14:52:03Z

Summary

On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle (0% CPU) and never replies, and the client times out waiting for execute_reply. In a headless notebook smoke test we measured this in ~30% of runs; the cell that hangs varies from run to run (the hang is independent of cell content), and only a kernel restart recovers. This looks like the same user-visible hang tracked downstream in microsoft/vscode-jupyter#17228.

Forcing WindowsSelectorEventLoopPolicy does not help — the kernel already runs a selector loop.

Root cause

The shell ROUTER socket is dual-use on the shell-channel thread:

a ZMQStream reads execute_requests off it (init_kernel builds ZMQStream(self.shell_socket, …), delivered via on_recv), while
replies go back over the same socket out-of-band through a raw send_multipart that bypasses the stream, in SubshellManager._send_on_shell_channel:

def _send_on_shell_channel(self, msg) -> None:
    assert current_thread().name == SHELL_CHANNEL_THREAD_NAME
    self._shell_socket.send_multipart(msg)   # raw send on the ZMQStream's own socket

That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge — the documented libzmq corollary that after zmq_send a socket may become readable without producing a new read event. Because the send is not ZMQStream-mediated, the read side is never re-armed, so an execute_request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: while a backlog is already pending there is no 0 → nonzero EVENTS transition, so no later arrival re-edges it, and the kernel sits idle.

Minimal raw-pyzmq reproduction of just the send-drains-read-edge step (no Jupyter), on a ROUTER (the shell socket type):

import select, time, zmq
def readable(fd): return bool(select.select([fd], [], [], 0)[0])

ctx = zmq.Context()
a = ctx.socket(zmq.DEALER); a.setsockopt(zmq.IDENTITY, b"A")
b = ctx.socket(zmq.ROUTER)
b.bind("tcp://127.0.0.1:5704"); a.connect("tcp://127.0.0.1:5704"); time.sleep(0.3)
bfd = b.getsockopt(zmq.FD)

a.send(b"hello"); time.sleep(0.2)                       # warmup: learn route 'A', clear setup edges
assert b.recv_multipart() == [b"A", b"hello"]
b.getsockopt(zmq.EVENTS)                                # query EVENTS once so no stale fd signal is left over from setup
assert not readable(bfd) and not (b.getsockopt(zmq.EVENTS) & zmq.POLLIN)

a.send(b"req1"); time.sleep(0.2)                        # a new request arrives, UNREAD
assert readable(bfd)                                    # read edge is set

b.send_multipart([b"A", b"reply"]); time.sleep(0.05)    # OUT-OF-BAND send on the same ROUTER
assert not readable(bfd)                                # <-- the send DRAINED the read edge
assert b.getsockopt(zmq.EVENTS) & zmq.POLLIN            # ...while POLLIN stays set...
assert b.recv_multipart(zmq.NOBLOCK) == [b"A", b"req1"] # ...and the request was still queued
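
The reproduction above covers only the drain step. Why the strand is terminal, and why an explicit event check after the send avoids it, can be illustrated with a toy model that needs no pyzmq. This is a sketch under stated assumptions, not libzmq's implementation: ToyEdgeSocket, ToyReader and run are hypothetical names, and the model encodes only the two behaviours described above (an arrival signals the fd only on the empty-to-non-empty transition, and a send clears the fd signal while leaving the queue intact). The real fix schedules the check with add_callback; the toy calls it directly for brevity.

```python
class ToyEdgeSocket:
    """Toy stand-in for a socket with an edge-triggered notification fd.

    NOT libzmq: it models only the behaviour shown in the reproduction.
    """

    def __init__(self):
        self.queue = []           # unread messages (what POLLIN reflects)
        self.fd_readable = False  # what select() on the fd would report

    def deliver(self, msg):
        # Assumption: the fd is signalled only when the queue goes
        # from empty to non-empty (the 0 -> nonzero EVENTS transition).
        was_empty = not self.queue
        self.queue.append(msg)
        if was_empty:
            self.fd_readable = True

    def send(self, msg):
        # Assumption: an out-of-band send consumes the fd signal
        # but leaves queued messages in place.
        self.fd_readable = False

    def pollin(self):
        return bool(self.queue)

    def recv_all(self):
        msgs, self.queue = self.queue, []
        self.fd_readable = False
        return msgs


class ToyReader:
    """Event-loop reader that wakes only when the fd is readable."""

    def __init__(self, sock):
        self.sock = sock
        self.handled = []

    def loop_iteration(self):
        if self.sock.fd_readable:
            self.handle_events()

    def handle_events(self):
        # Looks at the real event state rather than the fd.
        if self.sock.pollin():
            self.handled.extend(self.sock.recv_all())


def run(rearm):
    sock = ToyEdgeSocket()
    reader = ToyReader(sock)
    sock.deliver("req1")        # request arrives, still unread
    sock.send("reply")          # out-of-band send drains the fd signal
    if rearm:
        reader.handle_events()  # explicit event check after the send
    sock.deliver("req2")        # later arrival
    reader.loop_iteration()
    return reader.handled


print(run(rearm=False))
print(run(rearm=True))
```

Without the re-arm, the second arrival finds a non-empty queue, produces no new edge, and the reader never wakes. With it, the first request is consumed right after the send, so the next arrival signals the fd normally.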

Fix

After each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop — the same edge-trap reschedule ZMQStream._update_handler already runs internally:

self._shell_channel_io_loop.add_callback(
    lambda: stream._handle_events(stream.socket, 0)
)

The shell_stream (built in init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it; the re-arm is guarded by stream is not None and stream.socket is self._shell_socket. No polling and no new dependency — it moves the existing edge-trap reschedule to the one un-mediated operation that drains the edge.
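
For orientation, the guarded re-arm described above might look roughly like the following when placed next to the send. This is a minimal sketch, not the PR's diff: ReplySender, FakeLoop, FakeSocket and FakeStream are hypothetical stand-ins so the snippet runs without ipykernel or pyzmq, the attribute name _shell_stream is an assumption, and the thread-name assertion from the real method is omitted.

```python
class FakeLoop:
    """Stand-in for the shell-channel IOLoop: queues callbacks."""

    def __init__(self):
        self.callbacks = []

    def add_callback(self, fn):
        self.callbacks.append(fn)

    def run_pending(self):
        pending, self.callbacks = self.callbacks, []
        for fn in pending:
            fn()


class FakeSocket:
    """Stand-in for the shell ROUTER socket: records sends."""

    def __init__(self):
        self.sent = []

    def send_multipart(self, msg):
        self.sent.append(msg)


class FakeStream:
    """Stand-in for the shell ZMQStream: counts handler invocations."""

    def __init__(self, socket):
        self.socket = socket
        self.handle_calls = 0

    def _handle_events(self, fd, events):
        self.handle_calls += 1


class ReplySender:
    """Hypothetical holder; in the PR this logic lives in SubshellManager."""

    def __init__(self, shell_socket, io_loop, shell_stream=None):
        self._shell_socket = shell_socket
        self._shell_channel_io_loop = io_loop
        self._shell_stream = shell_stream  # assumed attribute name

    def _send_on_shell_channel(self, msg):
        # Raw send on the stream's own socket, as before.
        self._shell_socket.send_multipart(msg)
        stream = self._shell_stream
        # Guard: re-arm only if a stream was passed in and it wraps
        # the very socket that was just written to.
        if stream is not None and stream.socket is self._shell_socket:
            self._shell_channel_io_loop.add_callback(
                lambda: stream._handle_events(stream.socket, 0)
            )


sock, loop = FakeSocket(), FakeLoop()
stream = FakeStream(sock)
sender = ReplySender(sock, loop, stream)
sender._send_on_shell_channel([b"A", b"reply"])
loop.run_pending()
print(stream.handle_calls)
```

Scheduling through add_callback rather than calling the handler inline keeps the read on the loop's normal dispatch path, and the guard makes the change a no-op when no stream was supplied.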

Validation

Windows, Python 3.13 and 3.14, pyzmq 27.1.0 / libzmq 4.3.5; three arms × 20 real notebook runs, same machine/session:

Arm	After the reply `send_multipart`	Wedged runs
control	nothing	6/20 (30%)
sham	`add_callback(lambda: None)`: identical scheduling overhead, no re-arm	5/20 (25%)
fix	`add_callback(lambda: stream._handle_events(stream.socket, 0))`	0/20 (0%)

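P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4, so a clean fix arm is very unlikely to be chance if the underlying wedge rate were still 30%. The sham arm isolates the re-arm itself from the added wake-ups: at 5/20 it stays close to the control rate of 6/20. With the diff applied, the shell_stream reference threaded into SubshellManager was valid on every send (551 re-arms; the guard never saw None or a mismatched socket). Validated against 7.2.0 / 7.3.0, where _send_on_shell_channel is byte-identical to main.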
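
The probability quoted above can be reproduced in a few lines. This is only a sketch of the arithmetic: prob_zero_wedges is a hypothetical helper name, and it assumes runs are independent with a fixed per-run wedge probability, which 20 runs cannot confirm.

```python
def prob_zero_wedges(p, n):
    """Chance of seeing zero wedges in n independent runs,
    each wedging with probability p."""
    return (1.0 - p) ** n


control_rate = 6 / 20  # observed in the control arm
sham_rate = 5 / 20     # observed in the sham arm

# Under the control-arm rate, the figure quoted in the text.
print(f"control rate: {prob_zero_wedges(control_rate, 20):.2e}")

# Same check using the slightly lower sham-arm rate as the baseline.
print(f"sham rate:    {prob_zero_wedges(sham_rate, 20):.2e}")
```

Using the sham-arm rate as the baseline gives a somewhat larger but still small probability, so the conclusion does not hinge on which of the two unfixed arms is taken as the reference.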

A public-API alternative

stream.flush(zmq.POLLIN) — a public ZMQStream method ipykernel already calls in kernelbase.py — may be preferable to reaching into _handle_events. flush is a synchronous drain loop whereas _handle_events(socket, 0) is the edge-trap reschedule (related but not identical). I only measured _handle_events; happy to switch to flush if you'd rather stay on the public API.

Notes

Still unfixed on main: _send_on_shell_channel is a bare send_multipart.
Related: microsoft/vscode-jupyter#17228 "ipykernel 7 causes notebook execution to hang" (downstream symptom); #1438 "ipykernel 7.0.0 release with subshells" (7.0.0 problem hub); #307 "ensure qt zmq streams starts in a clean state" (the 4.8.2 fix for the same ZMQ_FD edge-trigger bug class); zeromq/pyzmq#2173 "getsockopt(zmq.EVENTS) drains signaler, can cause zmq.asyncio recv to miss wakeups" (same signaler-drain family, different layer).
A user-side mitigation — retry the kernel subprocess on the CellTimeoutError signature — works today and is orthogonal to this fix.
Happy to add a regression test or open a companion tracking issue if you'd prefer.

🤖 Generated with Claude Code

On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle and never replies, and the client times out waiting for execute_reply (~30% of headless notebook runs in our measurements; which cell hangs wanders run to run).

Root cause: the shell ROUTER socket is dual-use on the shell-channel thread. A ZMQStream reads execute_requests off it, while replies are sent back over the SAME socket out-of-band via a raw send_multipart in SubshellManager._send_on_shell_channel. That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary: after zmq_send the socket may become readable without a new edge). The send is not ZMQStream-mediated, so the stream is never re-armed and a request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: no later arrival re-edges it.

Fix: after each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop -- the same edge-trap reschedule ZMQStream._update_handler already runs internally (add_callback(lambda: stream._handle_events(stream.socket, 0))) -- so the concurrently-arrived request cannot strand. The shell_stream (built in kernelapp.init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it.

Validated on Windows (Python 3.13/3.14, pyzmq 27.1.0 / libzmq 4.3.5): the wedge went from 6/20 (control) to 0/20 with this patch applied, same machine/session, P(0/20 | p=0.30) ~ 8e-4, with the threaded reference live on every send (551 re-arms, 0 None/mismatch). A sham arm with the same scheduling overhead but no re-arm stayed at the control rate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The ZMQ_FD edge-drain fix is now an upstream PR (ipython/ipykernel#1529, filed under BoykoNeov, 3 files +21/-0, validated 0/20 vs 6/20). The doc now points to the PR and the applicable patch; steel-sim's retry mitigation is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

BoykoNeov wants to merge 1 commit into ipython:main from BoykoNeov:fix/shell-zmqstream-rearm-after-reply-send
