22Title: Frame Pointers Everywhere: Enabling System-Level Observability for Python
33Author: Pablo Galindo Salgado <pablogsal@python.org>,
44 Ken Jin <kenjin@python.org>,
5- Savannah Ostrowski <savannahostrowski@gmail.com>,
5+ Savannah Ostrowski <savannah@python.org>,
66 Diego Russo <diego.russo@arm.com>
77Discussions-To:
88Status: Draft
@@ -15,7 +15,7 @@ Post-History:
1515Abstract
1616========
1717
18- This PEP does two things:
18+ This PEP proposes two things:
1919
20201. **Build CPython with frame pointers by default on platforms that support
2121 them.** The default build configuration is changed to compile the
@@ -51,9 +51,9 @@ Motivation
5151
5252Python's observability story (profiling, debugging, and system-level tracing)
5353is fundamentally limited by the absence of frame pointers. The core motivation
54- of this PEP is to make Python observable by default: profilers faster and more
55- accurate, debuggers more reliable, and eBPF-based tools functional without
56- workarounds.
54+ of this PEP is to make Python observable by default, so that profilers are faster
55+ and more accurate, debuggers are more reliable, and eBPF-based tools are functional
56+ without workarounds.
5757
5858Today, users who want to profile CPython with system tools must rebuild the
5959interpreter with special compiler flags, a step that most users cannot or will
@@ -201,10 +201,10 @@ processes. The Linux kernel has no DWARF unwinder and, per Linus Torvalds,
201201will not gain one [#torvalds_fp]_; the kernel developed its own ORC format for
202202internal use instead.
203203
204- The impact extends beyond CPU profiling. Off-CPU flame graphs (used to
204+ The impact extends beyond CPU profiling. Off-CPU flamegraphs (used to
205205diagnose latency caused by I/O waits, lock contention, and scheduling delays)
206206rely on the same ``bpf_get_stackid()`` helper to capture the stack at the point
207- where a thread blocks. As Brendan Gregg notes, off-CPU flame graphs "can be
207+ where a thread blocks. As Brendan Gregg notes, off-CPU flamegraphs "can be
208208dominated by libc read/write and mutex functions, so without frame pointers end
209209up mostly broken" [#gregg2024]_. For Python services where latency matters
210210more than raw CPU throughput, off-CPU profiling is often the most valuable
@@ -405,30 +405,24 @@ The JIT Compiler Needs Frame Pointers to Be Debuggable
405405------------------------------------------------------
406406
407407CPython's copy-and-patch JIT (:pep:`744`) generates native machine code at
408- runtime. Without frame pointers in that generated code, stack unwinding
409- through JIT frames is broken for virtually every tool in the ecosystem: GDB,
410- LLDB, libunwind, libdw (elfutils), py-spy, Austin, pystack, memray, ``perf``,
411- and all eBPF-based profilers.
412-
413- The investigation in issue `#126910`_ found that compiling the JIT stencils
414- with ``-fno-omit-frame-pointer`` and ``-mno-omit-leaf-frame-pointer`` is a
415- two-line change that would make most existing debuggers and profilers work with
416- JIT-compiled code immediately. The measured overhead is approximately 2% on
417- x86-64 and even lower on AArch64 (which has a dedicated link register). This
418- is a remarkably good outcome: other JIT compilers (V8, LuaJIT, .NET CoreCLR,
419- Julia, LLVM's ORC JIT) typically require hundreds to thousands of lines of code
420- to implement custom DWARF ``.eh_frame`` generation, GDB JIT interface support
408+ runtime. Without frame pointers in the interpreter, stack unwinding through
409+ JIT frames is broken for virtually every tool in the ecosystem: GDB, LLDB,
410+ libunwind, libdw (elfutils), py-spy, Austin, pystack, memray, ``perf``, and
411+ all eBPF-based profilers. Ensuring full-stack observability for JIT-compiled
412+ code is a prerequisite for the JIT to be considered production-ready.
413+
414+ Individual JIT stencils do not need frame-pointer prologues; the entire JIT
415+ region can be treated as a single frameless region for unwinding purposes.
416+ What matters is that the interpreter itself is built with frame pointers, so
417+ that the frame-pointer register (``%rbp`` on x86-64, ``x29`` on AArch64) is
418+ reserved and not clobbered by stencil code. With frame pointers in the
419+ interpreter, unwinders can walk through JIT regions without needing to inspect
420+ individual stencils. This is a remarkably good outcome compared to other
421+ JIT compilers (V8, LuaJIT, .NET CoreCLR, Julia, LLVM's ORC JIT), which
422+ typically require hundreds to thousands of lines of code to implement custom
423+ DWARF ``.eh_frame`` generation, GDB JIT interface support
421424(``__jit_debug_register_code``), and per-unwinder registration APIs
422- (``_U_dyn_register``, ``__register_frame``). CPython's JIT may get most of the
423- benefit from frame pointers alone if that follow-up change is adopted.
424-
425- Critically, for JIT frame pointers to produce useful results, the interpreter
426- itself must also have frame pointers. A JIT-compiled function calls back into
427- the interpreter for many operations; if the interpreter frames lack frame
428- pointers, the unwinder hits a gap and the stack trace is truncated. This PEP
429- addresses that interpreter-side gap. JIT stencil flags (issue `#126910`_) are
430- a complementary follow-up needed for complete stack unwinding in the presence
431- of the JIT.
425+ (``_U_dyn_register``, ``__register_frame``).
432426
433427The Ecosystem Has Already Adopted Frame Pointers
434428------------------------------------------------
@@ -836,8 +830,21 @@ incorrectly.
836830Performance
837831-----------
838832
839- .. TODO: Insert full pyperformance results here once data collection
840- is complete.
833+ Full pyperformance results comparing the frame-pointer build against an
834+ identical build without frame pointers (geometric mean and per-benchmark
835+ range, 108 benchmarks):
836+
837+ ===================================== =======================
838+ Machine                               Geometric mean overhead
839+ ===================================== =======================
840+ Apple M2 Mac Mini (arm64, macOS)      1.01x slower
841+ Intel Xeon Platinum 8480 (x86-64)     1.01x slower
842+ AMD EPYC 9654 (x86-64)                1.01x slower
843+ AWS Graviton c7g.16xlarge (aarch64)   1.02x slower
844+ Ampere Altra Max (aarch64)            1.01x slower
845+ Raspberry Pi (aarch64)                +X.X%
846+ macOS M3 (arm64)                      +X.X%
847+ ===================================== =======================
841848
842849This overhead applies to both the interpreter and to C extensions that inherit
843850the flags via ``sysconfig``. Detailed microarchitectural analysis shows the
@@ -892,10 +899,15 @@ information not already available through CPython's existing interfaces.
892899How to Teach This
893900=================
894901
895- No teaching is required. This change is invisible to Python users: no APIs
896- change, no behaviour changes, and no user action is needed. The only observable
897- effect is that profilers, debuggers, and system-level tracing tools produce
898- more complete and more reliable results out of the box.
902+ For Python users and application developers, this change is invisible: no APIs
903+ change, no behaviour changes, and no user action is needed. The only
904+ observable effect is that profilers, debuggers, and system-level tracing tools
905+ produce more complete and more reliable results out of the box.
906+
907+ Though extensions should see negligible overhead, extension authors who observe a
908+ measurable regression in a specific module can opt out as described in
909+ `Extension Build Impact`_. The ``--without-frame-pointers`` configure flag is
910+ documented in `Opt-Out Configure Flag`_.
899911
900912
901913Reference Implementation