First pass at edits

savannahostrowski · pablogsal · commit 984672f2efe8 · 2026-04-09T23:03:29.000+01:00
diff --git a/peps/pep-0830.rst b/peps/pep-0830.rst
@@ -2,7 +2,7 @@ PEP: 830
 Title: Frame Pointers Everywhere: Enabling System-Level Observability for Python
 Author: Pablo Galindo Salgado <pablogsal@python.org>,
         Ken Jin <kenjin@python.org>,
-        Savannah Ostrowski <savannahostrowski@gmail.com>,
+        Savannah Ostrowski <savannah@python.org>,
         Diego Russo <diego.russo@arm.com>
 Discussions-To:
 Status: Draft
@@ -15,7 +15,7 @@ Post-History:
 Abstract
 ========
 
-This PEP does two things:
+This PEP proposes two things:
 
 1. **Build CPython with frame pointers by default on platforms that support
    them.**  The default build configuration is changed to compile the
@@ -51,9 +51,9 @@ Motivation
 
 Python's observability story (profiling, debugging, and system-level tracing)
 is fundamentally limited by the absence of frame pointers. The core motivation
-of this PEP is to make Python observable by default: profilers faster and more
-accurate, debuggers more reliable, and eBPF-based tools functional without
-workarounds.
+of this PEP is to make Python observable by default, so that profilers are
+faster and more accurate, debuggers are more reliable, and eBPF-based tools
+are functional without workarounds.
 
 Today, users who want to profile CPython with system tools must rebuild the
 interpreter with special compiler flags, a step that most users cannot or will
@@ -201,10 +201,10 @@ processes.  The Linux kernel has no DWARF unwinder and, per Linus Torvalds,
 will not gain one [#torvalds_fp]_; the kernel developed its own ORC format for
 internal use instead.
 
-The impact extends beyond CPU profiling.  Off-CPU flame graphs (used to
+The impact extends beyond CPU profiling.  Off-CPU flamegraphs (used to
 diagnose latency caused by I/O waits, lock contention, and scheduling delays)
 rely on the same ``bpf_get_stackid()`` helper to capture the stack at the point
-where a thread blocks.  As Brendan Gregg notes, off-CPU flame graphs "can be
+where a thread blocks.  As Brendan Gregg notes, off-CPU flamegraphs "can be
 dominated by libc read/write and mutex functions, so without frame pointers end
 up mostly broken" [#gregg2024]_.  For Python services where latency matters
 more than raw CPU throughput, off-CPU profiling is often the most valuable
@@ -405,30 +405,24 @@ The JIT Compiler Needs Frame Pointers to Be Debuggable
 ------------------------------------------------------
 
 CPython's copy-and-patch JIT (:pep:`744`) generates native machine code at
-runtime.  Without frame pointers in that generated code, stack unwinding
-through JIT frames is broken for virtually every tool in the ecosystem: GDB,
-LLDB, libunwind, libdw (elfutils), py-spy, Austin, pystack, memray, ``perf``,
-and all eBPF-based profilers.
-
-The investigation in issue `#126910`_ found that compiling the JIT stencils
-with ``-fno-omit-frame-pointer`` and ``-mno-omit-leaf-frame-pointer`` is a
-two-line change that would make most existing debuggers and profilers work with
-JIT-compiled code immediately.  The measured overhead is approximately 2% on
-x86-64 and even lower on AArch64 (which has a dedicated link register).  This
-is a remarkably good outcome: other JIT compilers (V8, LuaJIT, .NET CoreCLR,
-Julia, LLVM's ORC JIT) typically require hundreds to thousands of lines of code
-to implement custom DWARF ``.eh_frame`` generation, GDB JIT interface support
+runtime.  Without frame pointers in the interpreter, stack unwinding through
+JIT frames is broken for virtually every tool in the ecosystem: GDB, LLDB,
+libunwind, libdw (elfutils), py-spy, Austin, pystack, memray, ``perf``, and
+all eBPF-based profilers.  Ensuring full-stack observability for JIT-compiled
+code is a prerequisite for the JIT to be considered production-ready.
+
+Individual JIT stencils do not need frame-pointer prologues; the entire JIT
+region can be treated as a single frameless region for unwinding purposes.
+What matters is that the interpreter itself is built with frame pointers, so
+that the frame-pointer register (``%rbp`` on x86-64, ``x29`` on AArch64) is
+reserved and not clobbered by stencil code.  With frame pointers in the
+interpreter, unwinders can walk through JIT regions without needing to inspect
+individual stencils.  This is a remarkably good outcome compared to other
+JIT compilers (V8, LuaJIT, .NET CoreCLR, Julia, LLVM's ORC JIT), which
+typically require hundreds to thousands of lines of code to implement custom
+DWARF ``.eh_frame`` generation, GDB JIT interface support
 (``__jit_debug_register_code``), and per-unwinder registration APIs
-(``_U_dyn_register``, ``__register_frame``).  CPython's JIT may get most of the
-benefit from frame pointers alone if that follow-up change is adopted.
-
-Critically, for JIT frame pointers to produce useful results, the interpreter
-itself must also have frame pointers.  A JIT-compiled function calls back into
-the interpreter for many operations; if the interpreter frames lack frame
-pointers, the unwinder hits a gap and the stack trace is truncated.  This PEP
-addresses that interpreter-side gap.  JIT stencil flags (issue `#126910`_) are
-a complementary follow-up needed for complete stack unwinding in the presence
-of the JIT.
+(``_U_dyn_register``, ``__register_frame``).
 
 The Ecosystem Has Already Adopted Frame Pointers
 ------------------------------------------------
@@ -836,8 +830,21 @@ incorrectly.
 Performance
 -----------
 
-.. TODO: Insert full pyperformance results here once data collection
-   is complete.
+Full pyperformance results comparing the frame-pointer build against an
+otherwise identical build without frame pointers (geometric mean and
+per-benchmark range across 108 benchmarks):
+
+=====================================  =======================
+Machine                                Geometric mean overhead
+=====================================  =======================
+Apple M2 Mac Mini (arm64, macOS)       1.01x slower
+Intel Xeon Platinum 8480 (x86-64)      1.01x slower
+AMD EPYC 9654 (x86-64)                 1.01x slower
+AWS Graviton c7g.16xlarge (aarch64)    1.02x slower
+Ampere Altra Max (aarch64)             1.01x slower
+Raspberry Pi (aarch64)                 +X.X%
+macOS M3 (arm64)                       +X.X%
+=====================================  =======================
 
 This overhead applies to both the interpreter and to C extensions that inherit
 the flags via ``sysconfig``.  Detailed microarchitectural analysis shows the
@@ -892,10 +899,15 @@ information not already available through CPython's existing interfaces.
 How to Teach This
 =================
 
-No teaching is required.  This change is invisible to Python users: no APIs
-change, no behaviour changes, and no user action is needed. The only observable
-effect is that profilers, debuggers, and system-level tracing tools produce
-more complete and more reliable results out of the box.
+For Python users and application developers, this change is invisible: no APIs
+change, no behaviour changes, and no user action is needed.  The only
+observable effect is that profilers, debuggers, and system-level tracing tools
+produce more complete and more reliable results out of the box.
+
+Extensions should see negligible overhead, but extension authors who observe
+a measurable regression in a specific module can opt out as described in
+`Extension Build Impact`_.  The ``--without-frame-pointers`` configure flag
+is documented in `Opt-Out Configure Flag`_.
 
 
 Reference Implementation