Commit 01cc0dc

l1k authored and bjorn-helgaas committed
Documentation: PCI: Sync AER doc with code
The PCIe Advanced Error Reporting driver has evolved over the years but its documentation hasn't. Catch up with past code changes:

* The documentation claims that Correctable Errors are logged with KERN_INFO severity, but the code uses KERN_WARN. It had used KERN_WARN from the beginning with commit 6c2b374 ("PCI-Express AER implemetation: AER core and aerdriver"). In 2013, commit 2cced2d ("aerdrv: Cleanup log output for AER") switched to KERN_ERR, until 2020 when it was reverted back to KERN_WARN by commit e83e2ca ("PCI/AER: Log correctable errors as warning, not error").

* An example log message in the documentation uses the term "Uncorrected", but the code uses "Uncorrectable" since commit 02a06f5 ("PCI/AER: Use 'Correctable' and 'Uncorrectable' spec terms for errors").

* The example contains the Requester ID "id=0500", which is omitted since commit 010caed ("PCI/AER: Decode Error Source Requester ID").

* The example contains the error name "Unsupported Request", which is instead reported as "UnsupReq" since commit bd23780 ("PCI/AER: Adopt lspci names for AER error decoding").

* The example doesn't prepend "0x" to hex values from the TLP Header Log, as introduced by commit f68ea77 ("PCI: Add pcie_print_tlp_log() to print TLP Header and Prefix Log").

* The documentation refers to a reset_link callback which was removed by commit b6cf1a4 ("PCI/ERR: Remove service dependency in pcie_do_recovery()").

* Commit 5790862 ("PCI/ERR: Recover from RCiEP AER errors") added support to recover Root Complex Integrated Endpoints by applying a Function Level Reset, alternatively to the Secondary Bus Reset which is applied otherwise.

* On non-fatal errors, a reset was previously never performed. But the AER driver has just been amended to allow drivers to opt in to a reset.

* The documentation claims that a warning message is logged if a driver lacks pci_error_handlers. But the message has been informational (logged with KERN_INFO severity) since its introduction with commit 01daacf ("PCI/AER: Log which device prevents error recovery"). The documentation claims that the message is only logged for fatal errors, which is incorrect. Moreover it refers to "section 3", even though the documentation no longer contains section numbers since commit 4e37f05 ("Documentation: PCI: convert pcieaer-howto.txt to reST"). Section 3 is titled "Developer Guide". That's the same section where the reference is located, so it is self-referential and can be dropped.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/7501bfc5b9920193a25998a3cbcf72c47674ec63.1757942121.git.lukas@wunner.de
1 parent 0a27bdb commit 01cc0dc

1 file changed

Lines changed: 39 additions & 44 deletions

Documentation/PCI/pcieaer-howto.rst
@@ -70,16 +70,16 @@ AER error output
 ----------------

 When a PCIe AER error is captured, an error message will be output to
-console. If it's a correctable error, it is output as an info message.
+console. If it's a correctable error, it is output as a warning message.
 Otherwise, it is printed as an error. So users could choose different
 log level to filter out correctable error messages.

 Below shows an example::

-0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
+0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Requester ID)
 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
-0000:50:00.0: [20] Unsupported Request (First)
-0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
+0000:50:00.0: [20] UnsupReq (First)
+0000:50:00.0: TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100

 In the example, 'Requester ID' means the ID of the device that sent
 the error message to the Root Port. Please refer to PCIe specs for other
@@ -152,18 +152,6 @@ the device driver.
 Provide callbacks
 -----------------

-callback reset_link to reset PCIe link
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This callback is used to reset the PCIe physical link when a
-fatal error happens. The Root Port AER service driver provides a
-default reset_link function, but different Upstream Ports might
-have different specifications to reset the PCIe link, so
-Upstream Port drivers may provide their own reset_link functions.
-
-Section 3.2.2.2 provides more detailed info on when to call
-reset_link.
-
 PCI error-recovery callbacks
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -174,8 +162,8 @@ when performing error recovery actions.
 Data struct pci_driver has a pointer, err_handler, to point to
 pci_error_handlers who consists of a couple of callback function
 pointers. The AER driver follows the rules defined in
-pci-error-recovery.rst except PCIe-specific parts (e.g.
-reset_link). Please refer to pci-error-recovery.rst for detailed
+pci-error-recovery.rst except PCIe-specific parts (see
+below). Please refer to pci-error-recovery.rst for detailed
 definitions of the callbacks.

 The sections below specify when to call the error callback functions.
@@ -189,10 +177,21 @@ software intervention or any loss of data. These errors do not
 require any recovery actions. The AER driver clears the device's
 correctable error status register accordingly and logs these errors.

-Non-correctable (non-fatal and fatal) errors
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Uncorrectable (non-fatal and fatal) errors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-If an error message indicates a non-fatal error, performing link reset
+The AER driver performs a Secondary Bus Reset to recover from
+uncorrectable errors. The reset is applied at the port above
+the originating device: If the originating device is an Endpoint,
+only the Endpoint is reset. If on the other hand the originating
+device has subordinate devices, those are all affected by the
+reset as well.
+
+If the originating device is a Root Complex Integrated Endpoint,
+there's no port above where a Secondary Bus Reset could be applied.
+In this case, the AER driver instead applies a Function Level Reset.
+
+If an error message indicates a non-fatal error, performing a reset
 at upstream is not required. The AER driver calls error_detected(dev,
 pci_channel_io_normal) to all drivers associated within a hierarchy in
 question. For example::
@@ -204,38 +203,34 @@ Downstream Port B and Endpoint.

 A driver may return PCI_ERS_RESULT_CAN_RECOVER,
 PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
-whether it can recover or the AER driver calls mmio_enabled as next.
+whether it can recover without a reset, considers the device unrecoverable
+or needs a reset for recovery. If all affected drivers agree that they can
+recover without a reset, it is skipped. Should one driver request a reset,
+it overrides all other drivers.

 If an error message indicates a fatal error, kernel will broadcast
 error_detected(dev, pci_channel_io_frozen) to all drivers within
-a hierarchy in question. Then, performing link reset at upstream is
-necessary. As different kinds of devices might use different approaches
-to reset link, AER port service driver is required to provide the
-function to reset link via callback parameter of pcie_do_recovery()
-function. If reset_link is not NULL, recovery function will use it
-to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
-and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
-to mmio_enabled.
-
-Frequent Asked Questions
-------------------------
+a hierarchy in question. Then, performing a reset at upstream is
+necessary. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
+to indicate that recovery without a reset is possible, the error
+handling goes to mmio_enabled, but afterwards a reset is still
+performed.

-Q:
-What happens if a PCIe device driver does not provide an
-error recovery handler (pci_driver->err_handler is equal to NULL)?
+In other words, for non-fatal errors, drivers may opt in to a reset.
+But for fatal errors, they cannot opt out of a reset, based on the
+assumption that the link is unreliable.

-A:
-The devices attached with the driver won't be recovered. If the
-error is fatal, kernel will print out warning messages. Please refer
-to section 3 for more information.
+Frequently Asked Questions
+--------------------------

 Q:
-What happens if an upstream port service driver does not provide
-callback reset_link?
+What happens if a PCIe device driver does not provide an
+error recovery handler (pci_driver->err_handler is equal to NULL)?

 A:
-Fatal error recovery will fail if the errors are reported by the
-upstream ports who are attached by the service driver.
+The devices attached with the driver won't be recovered.
+The kernel will print out informational messages to identify
+unrecoverable devices.


 Software error injection
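
The reset rules stated in the updated documentation can be modeled roughly as follows: a reset is always performed for fatal errors, for non-fatal errors it is performed only if at least one driver requests it, and a Root Complex Integrated Endpoint gets a Function Level Reset instead of a Secondary Bus Reset. This is a stand-alone sketch for illustration, not the kernel's code. All enum names and the functions `reset_needed()` and `reset_kind()` are hypothetical stand-ins; the kernel itself uses its own types, such as the PCI_ERS_RESULT_* values that error_detected returns.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the results a driver's error_detected may return. */
enum ers_result {
    ERS_CAN_RECOVER,  /* driver can recover without a reset */
    ERS_NEED_RESET,   /* driver needs a reset for recovery */
    ERS_DISCONNECT    /* driver considers the device unrecoverable */
};

/* Hypothetical error severities (uncorrectable errors only). */
enum err_severity { ERR_NONFATAL, ERR_FATAL };

/* Hypothetical classification of the device that originated the error. */
enum dev_kind { DEV_ENDPOINT, DEV_PORT, DEV_RCIEP };

/* Hypothetical reset types. */
enum reset_type { RESET_SECONDARY_BUS, RESET_FUNCTION_LEVEL };

/*
 * Return 1 if a reset is performed, 0 if it is skipped.
 * Fatal: drivers cannot opt out, so the result is always 1.
 * Non-fatal: a single driver asking for a reset overrides all others.
 */
int reset_needed(enum err_severity sev, const enum ers_result *votes, size_t n)
{
    size_t i;

    if (sev == ERR_FATAL)
        return 1;

    for (i = 0; i < n; i++) {
        if (votes[i] == ERS_NEED_RESET)
            return 1;
    }
    return 0;
}

/*
 * Pick the reset type: an RCiEP has no port above it where a
 * Secondary Bus Reset could be applied, so it gets a Function Level Reset.
 */
enum reset_type reset_kind(enum dev_kind kind)
{
    return kind == DEV_RCIEP ? RESET_FUNCTION_LEVEL : RESET_SECONDARY_BUS;
}
```

In this model, two drivers that both answer `ERS_CAN_RECOVER` to a non-fatal error cause the reset to be skipped, whereas the same answers to a fatal error still lead to a reset. The sketch deliberately leaves out everything else recovery involves, such as the mmio_enabled step and the handling of unrecoverable devices.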
