
Commit aa35a48

Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Add initial support for RAS hardware found on AMD server GPUs (MI200).
   Those GPUs and CPUs are connected together through the coherent fabric,
   and the GPU memory controllers report errors through x86's MCA, so EDAC
   needs to support them. The amd64_edac driver now supports HBM (High
   Bandwidth Memory) and thus such heterogeneous memory controller systems.

 - Other small cleanups and improvements.

* tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/amd64: Cache and use GPU node map
  EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
  EDAC/amd64: Document heterogeneous system enumeration
  x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
  x86/amd_nb: Re-sort and re-indent PCI defines
  x86/amd_nb: Add MI200 PCI IDs
  ras/debugfs: Fix error checking for debugfs_create_dir()
  x86/MCE: Check a hw error's address to determine proper recovery action

2 parents e5ce2f1 + 4251566

9 files changed: 513 additions & 58 deletions

File tree

Documentation/driver-api/edac.rst

Lines changed: 120 additions & 0 deletions
@@ -106,6 +106,16 @@ will occupy those chip-select rows.
   This term is avoided because it is unclear when needing to distinguish
   between chip-select rows and socket sets.

+* High Bandwidth Memory (HBM)
+
+  HBM is a new memory type with low power consumption and ultra-wide
+  communication lanes. It uses vertically stacked memory chips (DRAM dies)
+  interconnected by microscopic wires called "through-silicon vias", or
+  TSVs.
+
+  Several stacks of HBM chips connect to the CPU or GPU through an
+  ultra-fast interconnect called the "interposer". Therefore, HBM's
+  characteristics are nearly indistinguishable from on-chip integrated RAM.

 Memory Controllers
 ------------------
@@ -176,3 +186,113 @@ nodes::
 	the L1 and L2 directories would be "edac_device_block's"

 .. kernel-doc:: drivers/edac/edac_device.h
+
+
+Heterogeneous system support
+----------------------------
+
+An AMD heterogeneous system is built by connecting the data fabrics of
+both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
+GPU nodes can be accessed the same way as the data fabric on CPU nodes.
+
+The MI200 accelerators are data center GPUs. They have two data fabrics,
+and each GPU data fabric contains four Unified Memory Controllers (UMC).
+Each UMC contains eight channels, and each UMC channel controls one
+128-bit HBM2e (2GB) channel (equivalent to 8 x 2GB ranks). This creates
+a total of 4096 bits of DRAM data bus per fabric.
+
+Each UMC interfaces with a 16GB HBM stack (8-high x 2GB DRAM dies), so
+each UMC channel interfaces with 2GB of DRAM (represented as a rank).
+
+Memory controllers on AMD GPU nodes can be represented in EDAC as follows:
+
+	GPU DF / GPU Node -> EDAC MC
+	GPU UMC           -> EDAC CSROW
+	GPU UMC channel   -> EDAC CHANNEL
+
+For example: a heterogeneous system with one AMD CPU connected to four
+MI200 (Aldebaran) GPUs using xGMI.
+
+Some more heterogeneous hardware details:
+
+- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU
+  UMC. Both have chip selects (csrows) and channels. However, the layouts
+  differ for performance, physical layout, or other reasons.
+- CPU UMCs use one channel; in this case, UMC = EDAC channel. This follows
+  the marketing speak: the CPU has X memory channels, etc.
+- CPU UMCs use up to four chip selects, so UMC chip select = EDAC CSROW.
+- GPU UMCs use one chip select, so UMC = EDAC CSROW.
+- GPU UMCs use eight channels, so UMC channel = EDAC channel.
+
+The EDAC subsystem provides a mechanism to handle AMD heterogeneous
+systems by calling system-specific ops for both CPUs and GPUs.
+
+AMD GPU nodes are enumerated in sequential order based on the PCI
+hierarchy, and the first GPU node is assumed to have a Node ID value
+following those of the CPU nodes after the latter are fully populated::
+
+	$ ls /sys/devices/system/edac/mc/
+	mc0 - CPU MC node 0
+	mc1 |
+	mc2 |- GPU card[0] => node 0(mc1), node 1(mc2)
+	mc3 |
+	mc4 |- GPU card[1] => node 0(mc3), node 1(mc4)
+	mc5 |
+	mc6 |- GPU card[2] => node 0(mc5), node 1(mc6)
+	mc7 |
+	mc8 |- GPU card[3] => node 0(mc7), node 1(mc8)
+
+For example, a heterogeneous system with one AMD CPU connected to four
+MI200 (Aldebaran) GPUs using xGMI can be represented via the following
+sysfs entries::
+
+	/sys/devices/system/edac/mc/..
+
+	CPU			# CPU node
+	├── mc 0
+
+	GPU Nodes are enumerated sequentially after CPU nodes have been populated
+	GPU card 1		# Each MI200 GPU has 2 nodes/mcs
+	├── mc 1		# GPU node 0 == mc1; each MC node has 4 UMCs/CSROWs
+	│   ├── csrow 0		# UMC 0
+	│   │   ├── channel 0	# Each UMC has 8 channels
+	│   │   ├── channel 1	# Each channel is 2 GB, so each UMC has 16 GB
+	│   │   ├── channel 2
+	│   │   ├── channel 3
+	│   │   ├── channel 4
+	│   │   ├── channel 5
+	│   │   ├── channel 6
+	│   │   ├── channel 7
+	│   ├── csrow 1		# UMC 1
+	│   │   ├── channel 0
+	│   │   ├── ..
+	│   │   ├── channel 7
+	│   ├── .. ..
+	│   ├── csrow 3		# UMC 3
+	│   │   ├── channel 0
+	│   │   ├── ..
+	│   │   ├── channel 7
+	│   ├── rank 0
+	│   ├── .. ..
+	│   ├── rank 31		# 32 ranks/dimms total from 4 UMCs
+
+	├── mc 2		# GPU node 1 == mc2
+	│   ├── ..		# each GPU node has 64 GB total
+
+	GPU card 2
+	├── mc 3
+	│   ├── ..
+	├── mc 4
+	│   ├── ..
+
+	GPU card 3
+	├── mc 5
+	│   ├── ..
+	├── mc 6
+	│   ├── ..
+
+	GPU card 4
+	├── mc 7
+	│   ├── ..
+	├── mc 8
+	│   ├── ..

arch/x86/kernel/amd_nb.c

Lines changed: 28 additions & 22 deletions
@@ -15,28 +15,31 @@
 #include <linux/pci_ids.h>
 #include <asm/amd_nb.h>

-#define PCI_DEVICE_ID_AMD_17H_ROOT	0x1450
-#define PCI_DEVICE_ID_AMD_17H_M10H_ROOT	0x15d0
-#define PCI_DEVICE_ID_AMD_17H_M30H_ROOT	0x1480
-#define PCI_DEVICE_ID_AMD_17H_M60H_ROOT	0x1630
-#define PCI_DEVICE_ID_AMD_17H_MA0H_ROOT	0x14b5
-#define PCI_DEVICE_ID_AMD_19H_M10H_ROOT	0x14a4
-#define PCI_DEVICE_ID_AMD_19H_M60H_ROOT	0x14d8
-#define PCI_DEVICE_ID_AMD_19H_M70H_ROOT	0x14e8
-#define PCI_DEVICE_ID_AMD_17H_DF_F4	0x1464
-#define PCI_DEVICE_ID_AMD_17H_M10H_DF_F4 0x15ec
-#define PCI_DEVICE_ID_AMD_17H_M30H_DF_F4 0x1494
-#define PCI_DEVICE_ID_AMD_17H_M60H_DF_F4 0x144c
-#define PCI_DEVICE_ID_AMD_17H_M70H_DF_F4 0x1444
-#define PCI_DEVICE_ID_AMD_17H_MA0H_DF_F4 0x1728
-#define PCI_DEVICE_ID_AMD_19H_DF_F4	0x1654
-#define PCI_DEVICE_ID_AMD_19H_M10H_DF_F4 0x14b1
-#define PCI_DEVICE_ID_AMD_19H_M40H_ROOT	0x14b5
-#define PCI_DEVICE_ID_AMD_19H_M40H_DF_F4 0x167d
-#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F4 0x166e
-#define PCI_DEVICE_ID_AMD_19H_M60H_DF_F4 0x14e4
-#define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4 0x14f4
-#define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4 0x12fc
+#define PCI_DEVICE_ID_AMD_17H_ROOT		0x1450
+#define PCI_DEVICE_ID_AMD_17H_M10H_ROOT		0x15d0
+#define PCI_DEVICE_ID_AMD_17H_M30H_ROOT		0x1480
+#define PCI_DEVICE_ID_AMD_17H_M60H_ROOT		0x1630
+#define PCI_DEVICE_ID_AMD_17H_MA0H_ROOT		0x14b5
+#define PCI_DEVICE_ID_AMD_19H_M10H_ROOT		0x14a4
+#define PCI_DEVICE_ID_AMD_19H_M40H_ROOT		0x14b5
+#define PCI_DEVICE_ID_AMD_19H_M60H_ROOT		0x14d8
+#define PCI_DEVICE_ID_AMD_19H_M70H_ROOT		0x14e8
+#define PCI_DEVICE_ID_AMD_MI200_ROOT		0x14bb
+
+#define PCI_DEVICE_ID_AMD_17H_DF_F4		0x1464
+#define PCI_DEVICE_ID_AMD_17H_M10H_DF_F4	0x15ec
+#define PCI_DEVICE_ID_AMD_17H_M30H_DF_F4	0x1494
+#define PCI_DEVICE_ID_AMD_17H_M60H_DF_F4	0x144c
+#define PCI_DEVICE_ID_AMD_17H_M70H_DF_F4	0x1444
+#define PCI_DEVICE_ID_AMD_17H_MA0H_DF_F4	0x1728
+#define PCI_DEVICE_ID_AMD_19H_DF_F4		0x1654
+#define PCI_DEVICE_ID_AMD_19H_M10H_DF_F4	0x14b1
+#define PCI_DEVICE_ID_AMD_19H_M40H_DF_F4	0x167d
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F4	0x166e
+#define PCI_DEVICE_ID_AMD_19H_M60H_DF_F4	0x14e4
+#define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4	0x14f4
+#define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4	0x12fc
+#define PCI_DEVICE_ID_AMD_MI200_DF_F4		0x14d4

 /* Protect the PCI config register pairs used for SMN. */
 static DEFINE_MUTEX(smn_mutex);
@@ -53,6 +56,7 @@ static const struct pci_device_id amd_root_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M40H_ROOT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_ROOT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_ROOT) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_ROOT) },
 	{}
 };
@@ -81,6 +85,7 @@ static const struct pci_device_id amd_nb_misc_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_DF_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_DF_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F3) },
 	{}
 };
@@ -101,6 +106,7 @@ static const struct pci_device_id amd_nb_link_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M40H_DF_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M50H_DF_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F4) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F4) },
 	{}
 };
arch/x86/kernel/cpu/mce/amd.c

Lines changed: 4 additions & 2 deletions
@@ -715,11 +715,13 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)

 bool amd_mce_is_memory_error(struct mce *m)
 {
+	enum smca_bank_types bank_type;
 	/* ErrCodeExt[20:16] */
 	u8 xec = (m->status >> 16) & 0x1f;

+	bank_type = smca_get_bank_type(m->extcpu, m->bank);
 	if (mce_flags.smca)
-		return smca_get_bank_type(m->extcpu, m->bank) == SMCA_UMC && xec == 0x0;
+		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;

 	return m->bank == 4 && xec == 0x8;
 }
@@ -1050,7 +1052,7 @@ static const char *get_name(unsigned int cpu, unsigned int bank, struct threshol
 	if (bank_type >= N_SMCA_BANK_TYPES)
 		return NULL;

-	if (b && bank_type == SMCA_UMC) {
+	if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
 		if (b->block < ARRAY_SIZE(smca_umc_block_names))
 			return smca_umc_block_names[b->block];
 		return NULL;

arch/x86/kernel/cpu/mce/core.c

Lines changed: 1 addition & 1 deletion
@@ -1533,7 +1533,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		/* If this triggers there is no way to recover. Die hard. */
 		BUG_ON(!on_thread_stack() || !user_mode(regs));

-		if (kill_current_task)
+		if (!mce_usable_address(&m))
 			queue_task_work(&m, msg, kill_me_now);
 		else
 			queue_task_work(&m, msg, kill_me_maybe);
