Commit d104e3d

Merge tag 'cxl-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
 "The changes include adding poison injection support, fixing CXL access
  coordinates when onlining CXL memory, and delaying the enumeration of
  downstream switch ports for CXL hierarchy to ensure that the CXL link
  is established at the time of enumeration to address a few issues
  observed on AMD and Intel platforms.

  Misc changes:
   - Use str_plural() instead of open code for emitting strings.
   - Use str_enabled_disabled() instead of ternary operator
   - Fix emit of type resource_size_t argument for
     validate_region_offset()
   - Typo fixup in CXL driver-api documentation
   - Rename CFMWS coherency restriction defines
   - Add convention doc describing how to deal with the x86 low memory
     hole and CXL

  Poison Inject support:
   - Move hpa_to_spa callback to new root decoder ops structure
   - Define a SPA to HPA callback for interleave calculation with XOR
     math
   - Add support for SPA to DPA address translation with XOR
   - Add locked variants of poison inject and clear functions
   - Add inject and clear poison support by region offset

  CXL access coordinates update fix:
   - A comment update for hotplug memory callback priority defines
   - Add node_update_perf_attrs() for updating perf attrs on a node
   - Update cxl_access_coordinates() to use the new node update function
   - Remove hmat_update_target_coordinates() and related code

  CXL delayed downstream port enumeration and initialization:
   - Add helper to detect top of CXL device topology and remove open
     coding
   - Add helper to delete single dport
   - Add a cached copy of target_map to cxl_decoder
   - Refactor decoder setup to reduce cxl_test burden
   - Defer dport allocation for switch ports
   - Add mock version of devm_cxl_add_dport_by_dev() for cxl_test
   - Adjust the mock version of devm_cxl_switch_port_decoders_setup()
     due to cxl core usage
   - Setup target_map for cxl_test decoder initialization
   - Change SSLBIS handler to handle single dport
   - Move port register setup to when first dport appears"

* tag 'cxl-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (25 commits)
  cxl: Move port register setup to when first dport appear
  cxl: Change sslbis handler to only handle single dport
  cxl/test: Setup target_map for cxl_test decoder initialization
  cxl/test: Adjust the mock version of devm_cxl_switch_port_decoders_setup()
  cxl/test: Add mock version of devm_cxl_add_dport_by_dev()
  cxl: Defer dport allocation for switch ports
  cxl/test: Refactor decoder setup to reduce cxl_test burden
  cxl: Add a cached copy of target_map to cxl_decoder
  cxl: Add helper to delete dport
  cxl: Add helper to detect top of CXL device topology
  cxl: Documentation/driver-api/cxl: Describe the x86 Low Memory Hole solution
  cxl/acpi: Rename CFMW coherency restrictions
  Documentation/driver-api: Fix typo error in cxl
  acpi/hmat: Remove now unused hmat_update_target_coordinates()
  cxl, acpi/hmat: Update CXL access coordinates directly instead of through HMAT
  drivers/base/node: Add a helper function node_update_perf_attrs()
  mm/memory_hotplug: Update comment for hotplug memory callback priorities
  cxl: Fix emit of type resource_size_t argument for validate_region_offset()
  cxl/region: Add inject and clear poison by region offset
  cxl/core: Add locked variants of the poison inject and clear funcs
  ...
2 parents 67da125 + 4603745 commit d104e3d

28 files changed

Lines changed: 1261 additions & 390 deletions

Documentation/ABI/testing/debugfs-cxl

Lines changed: 87 additions & 0 deletions
@@ -19,6 +19,20 @@ Description:
 is returned to the user. The inject_poison attribute is only
 visible for devices supporting the capability.
 
+TEST-ONLY INTERFACE: This interface is intended for testing
+and validation purposes only. It is not a data repair mechanism
+and should never be used on production systems or live data.
+
+DATA LOSS RISK: For CXL persistent memory (PMEM) devices,
+poison injection can result in permanent data loss. Injected
+poison may render data permanently inaccessible even after
+clearing, as the clear operation writes zeros and does not
+recover original data.
+
+SYSTEM STABILITY RISK: For volatile memory, poison injection
+can cause kernel crashes, system instability, or unpredictable
+behavior if the poisoned addresses are accessed by running code
+or critical kernel structures.
 
 What: /sys/kernel/debug/cxl/memX/clear_poison
 Date: April, 2023
@@ -35,6 +49,79 @@ Description:
 The clear_poison attribute is only visible for devices
 supporting the capability.
 
+TEST-ONLY INTERFACE: This interface is intended for testing
+and validation purposes only. It is not a data repair mechanism
+and should never be used on production systems or live data.
+
+CLEAR IS NOT DATA RECOVERY: This operation writes zeros to the
+specified address range and removes the address from the poison
+list. It does NOT recover or restore original data that may have
+been present before poison injection. Any original data at the
+cleared address is permanently lost and replaced with zeros.
+
+CLEAR IS NOT A REPAIR MECHANISM: This interface is for testing
+purposes only and should not be used as a data repair tool.
+Clearing poison is fundamentally different from data recovery
+or error correction.
+
+What: /sys/kernel/debug/cxl/regionX/inject_poison
+Date: August, 2025
+Contact: linux-cxl@vger.kernel.org
+Description:
+(WO) When a Host Physical Address (HPA) is written to this
+attribute, the region driver translates it to a Device
+Physical Address (DPA) and identifies the corresponding
+memdev. It then sends an inject poison command to that memdev
+at the translated DPA. Refer to the memdev ABI entry at:
+/sys/kernel/debug/cxl/memX/inject_poison for the detailed
+behavior. This attribute is only visible if all memdevs
+participating in the region support both inject and clear
+poison commands.
+
+TEST-ONLY INTERFACE: This interface is intended for testing
+and validation purposes only. It is not a data repair mechanism
+and should never be used on production systems or live data.
+
+DATA LOSS RISK: For CXL persistent memory (PMEM) devices,
+poison injection can result in permanent data loss. Injected
+poison may render data permanently inaccessible even after
+clearing, as the clear operation writes zeros and does not
+recover original data.
+
+SYSTEM STABILITY RISK: For volatile memory, poison injection
+can cause kernel crashes, system instability, or unpredictable
+behavior if the poisoned addresses are accessed by running code
+or critical kernel structures.
+
+What: /sys/kernel/debug/cxl/regionX/clear_poison
+Date: August, 2025
+Contact: linux-cxl@vger.kernel.org
+Description:
+(WO) When a Host Physical Address (HPA) is written to this
+attribute, the region driver translates it to a Device
+Physical Address (DPA) and identifies the corresponding
+memdev. It then sends a clear poison command to that memdev
+at the translated DPA. Refer to the memdev ABI entry at:
+/sys/kernel/debug/cxl/memX/clear_poison for the detailed
+behavior. This attribute is only visible if all memdevs
+participating in the region support both inject and clear
+poison commands.
+
+TEST-ONLY INTERFACE: This interface is intended for testing
+and validation purposes only. It is not a data repair mechanism
+and should never be used on production systems or live data.
+
+CLEAR IS NOT DATA RECOVERY: This operation writes zeros to the
+specified address range and removes the address from the poison
+list. It does NOT recover or restore original data that may have
+been present before poison injection. Any original data at the
+cleared address is permanently lost and replaced with zeros.
+
+CLEAR IS NOT A REPAIR MECHANISM: This interface is for testing
+purposes only and should not be used as a data repair tool.
+Clearing poison is fundamentally different from data recovery
+or error correction.
+
 What: /sys/kernel/debug/cxl/einj_types
 Date: January, 2024
 KernelVersion: v6.9
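
The regionX/inject_poison and regionX/clear_poison entries above are write-only attributes that take a single address value. The following is a minimal userspace sketch of how a test harness might write such a value; it is not part of this commit. The helper name `write_hex_attr()` is hypothetical, and the assumption that the attribute accepts a "0x"-prefixed hexadecimal string is mine rather than something the ABI text states.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write 'value' as a "0x"-prefixed hexadecimal string
 * to the attribute file at 'path'. The accepted input format is an
 * assumption, not taken from the ABI document.
 * Returns 0 on success, -1 on failure.
 */
static int write_hex_attr(const char *path, uint64_t value)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* fprintf() returns the number of bytes written, negative on error. */
	ok = fprintf(f, "0x%" PRIx64 "\n", value) > 0;

	/* Buffered write errors can surface at close time. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

On a dedicated test machine the path would follow the pattern documented above, for example `/sys/kernel/debug/cxl/region0/inject_poison`, where `region0` is a hypothetical instance name. The ABI warnings apply unchanged: this is a test-only interface and must not be pointed at production systems or live data.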

Documentation/driver-api/cxl/conventions.rst

Lines changed: 135 additions & 0 deletions
@@ -45,3 +45,138 @@ Detailed Description of the Change
 ----------------------------------
 
 <Propose spec language that corrects the conflict.>
+
+
+Resolve conflict between CFMWS, Platform Memory Holes, and Endpoint Decoders
+============================================================================
+
+Document
+--------
+
+CXL Revision 3.2, Version 1.0
+
+License
+-------
+
+SPDX-License Identifier: CC-BY-4.0
+
+Creator/Contributors
+--------------------
+
+- Fabio M. De Francesco, Intel
+- Dan J. Williams, Intel
+- Mahesh Natu, Intel
+
+Summary of the Change
+---------------------
+
+According to the current Compute Express Link (CXL) Specifications (Revision
+3.2, Version 1.0), the CXL Fixed Memory Window Structure (CFMWS) describes zero
+or more Host Physical Address (HPA) windows associated with each CXL Host
+Bridge. Each window represents a contiguous HPA range that may be interleaved
+across one or more targets, including CXL Host Bridges. Each window has a set
+of restrictions that govern its usage. It is the Operating System-directed
+configuration and Power Management (OSPM) responsibility to utilize each window
+for the specified use.
+
+Table 9-22 of the current CXL Specifications states that the Window Size field
+contains the total number of consecutive bytes of HPA this window describes.
+This value must be a multiple of the Number of Interleave Ways (NIW) * 256 MB.
+
+Platform Firmware (BIOS) might reserve physical addresses below 4 GB where a
+memory gap such as the Low Memory Hole for PCIe MMIO may exist. In such cases,
+the CFMWS Range Size may not adhere to the NIW * 256 MB rule.
+
+The HPA represents the actual physical memory address space that the CXL devices
+can decode and respond to, while the System Physical Address (SPA), a related
+but distinct concept, represents the system-visible address space that users can
+direct transaction to and so it excludes reserved regions.
+
+BIOS publishes CFMWS to communicate the active SPA ranges that, on platforms
+with LMH's, map to a strict subset of the HPA. The SPA range trims out the hole,
+resulting in lost capacity in the Endpoints with no SPA to map to that part of
+the HPA range that intersects the hole.
+
+E.g, an x86 platform with two CFMWS and an LMH starting at 2 GB:
+
++--------+------------+-------------------+------------------+-------------------+------+
+| Window | CFMWS Base | CFMWS Size        | HDM Decoder Base | HDM Decoder Size  | Ways |
++========+============+===================+==================+===================+======+
+| 0      | 0 GB       | 2 GB              | 0 GB             | 3 GB              | 12   |
++--------+------------+-------------------+------------------+-------------------+------+
+| 1      | 4 GB       | NIW*256MB Aligned | 4 GB             | NIW*256MB Aligned | 12   |
++--------+------------+-------------------+------------------+-------------------+------+
+
+HDM decoder base and HDM decoder size represent all the 12 Endpoint Decoders of
+a 12 ways region and all the intermediate Switch Decoders. They are configured
+by the BIOS according to the NIW * 256MB rule, resulting in a HPA range size of
+3GB. Instead, the CFMWS Base and CFMWS Size are used to configure the Root
+Decoder HPA range that results smaller (2GB) than that of the Switch and
+Endpoint Decoders in the hierarchy (3GB).
+
+This creates 2 issues which lead to a failure to construct a region:
+
+1) A mismatch in region size between root and any HDM decoder. The root decoders
+will always be smaller due to the trim.
+
+2) The trim causes the root decoder to violate the (NIW * 256MB) rule.
+
+This change allows a region with a base address of 0GB to bypass these checks to
+allow for region creation with the trimmed root decoder address range.
+
+This change does not allow for any other arbitrary region to violate these
+checks - it is intended exclusively to enable x86 platforms which map CXL memory
+under 4GB.
+
+Despite the HDM decoders covering the PCIE hole HPA region, it is expected that
+the platform will never route address accesses to the CXL complex because the
+root decoder only covers the trimmed region (which excludes this). This is
+outside the ability of Linux to enforce.
+
+On the example platform, only the first 2GB will be potentially usable, but
+Linux, aiming to adhere to the current specifications, fails to construct
+Regions and attach Endpoint and intermediate Switch Decoders to them.
+
+There are several points of failure that due to the expectation that the Root
+Decoder HPA size, that is equal to the CFMWS from which it is configured, has
+to be greater or equal to the matching Switch and Endpoint HDM Decoders.
+
+In order to succeed with construction and attachment, Linux must construct a
+Region with Root Decoder HPA range size, and then attach to that all the
+intermediate Switch Decoders and Endpoint Decoders that belong to the hierarchy
+regardless of their range sizes.
+
+Benefits of the Change
+----------------------
+
+Without the change, the OSPM wouldn't match intermediate Switch and Endpoint
+Decoders with Root Decoders configured with CFMWS HPA sizes that don't align
+with the NIW * 256MB constraint, and so it leads to lost memdev capacity.
+
+This change allows the OSPM to construct Regions and attach intermediate Switch
+and Endpoint Decoders to them, so that the addressable part of the memory
+devices total capacity is made available to the users.
+
+References
+----------
+
+Compute Express Link Specification Revision 3.2, Version 1.0
+<https://www.computeexpresslink.org/>
+
+Detailed Description of the Change
+----------------------------------
+
+The description of the Window Size field in table 9-22 needs to account for
+platforms with Low Memory Holes, where SPA ranges might be subsets of the
+endpoints HPA. Therefore, it has to be changed to the following:
+
+"The total number of consecutive bytes of HPA this window represents. This value
+shall be a multiple of NIW * 256 MB.
+
+On platforms that reserve physical addresses below 4 GB, such as the Low Memory
+Hole for PCIe MMIO on x86, an instance of CFMWS whose Base HPA range is 0 might
+have a size that doesn't align with the NIW * 256 MB constraint.
+
+Note that the matching intermediate Switch Decoders and the Endpoint Decoders
+HPA range sizes must still align to the above-mentioned rule, but the memory
+capacity that exceeds the CFMWS window size won't be accessible.".
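
The NIW * 256 MB rule and the example table in the convention document above reduce to simple arithmetic. The following is a minimal userspace sketch, not kernel code; the helper names `niw_granule()` and `window_size_aligned()` are hypothetical. Under those assumptions it shows that a 12-way window must be a multiple of 3 GB, so the 2 GB CFMWS trimmed by the Low Memory Hole fails the rule while the 3 GB HDM decoder range passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_256M (256ULL << 20)
#define SZ_1G   (1ULL << 30)

/* Smallest legal window size for a given Number of Interleave Ways (NIW). */
static uint64_t niw_granule(unsigned int ways)
{
	assert(ways > 0);
	return (uint64_t)ways * SZ_256M;
}

/* True when 'size' is a non-zero multiple of NIW * 256 MB. */
static bool window_size_aligned(uint64_t size, unsigned int ways)
{
	return size != 0 && size % niw_granule(ways) == 0;
}
```

For window 0 of the example, `niw_granule(12)` is 12 * 256 MB = 3 GB. `window_size_aligned(2 * SZ_1G, 12)` is false for the CFMWS size, and `window_size_aligned(3 * SZ_1G, 12)` is true for the HDM decoder size, which is the mismatch the convention describes.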

Documentation/driver-api/cxl/maturity-map.rst

Lines changed: 1 addition & 1 deletion
@@ -173,7 +173,7 @@ Accelerator
 User Flow Support
 -----------------
 
-* [0] Inject & clear poison by HPA
+* [2] Inject & clear poison by region offset
 
 Details
 =======

Documentation/driver-api/cxl/platform/bios-and-efi.rst

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ future and such a configuration should be avoided.
 
 Memory Holes
 ------------
-If your platform includes memory holes intersparsed between your CXL memory, it
+If your platform includes memory holes interspersed between your CXL memory, it
 is recommended to utilize multiple decoders to cover these regions of memory,
 rather than try to program the decoders to accept the entire range and expect
 Linux to manage the overlap.

drivers/acpi/numa/hmat.c

Lines changed: 0 additions & 34 deletions
@@ -74,7 +74,6 @@ struct memory_target {
 	struct node_cache_attrs cache_attrs;
 	u8 gen_port_device_handle[ACPI_SRAT_DEVICE_HANDLE_SIZE];
 	bool registered;
-	bool ext_updated; /* externally updated */
 };
 
 struct memory_initiator {
@@ -368,35 +367,6 @@ static void hmat_update_target_access(struct memory_target *target,
 	}
 }
 
-int hmat_update_target_coordinates(int nid, struct access_coordinate *coord,
-				   enum access_coordinate_class access)
-{
-	struct memory_target *target;
-	int pxm;
-
-	if (nid == NUMA_NO_NODE)
-		return -EINVAL;
-
-	pxm = node_to_pxm(nid);
-	guard(mutex)(&target_lock);
-	target = find_mem_target(pxm);
-	if (!target)
-		return -ENODEV;
-
-	hmat_update_target_access(target, ACPI_HMAT_READ_LATENCY,
-				  coord->read_latency, access);
-	hmat_update_target_access(target, ACPI_HMAT_WRITE_LATENCY,
-				  coord->write_latency, access);
-	hmat_update_target_access(target, ACPI_HMAT_READ_BANDWIDTH,
-				  coord->read_bandwidth, access);
-	hmat_update_target_access(target, ACPI_HMAT_WRITE_BANDWIDTH,
-				  coord->write_bandwidth, access);
-	target->ext_updated = true;
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(hmat_update_target_coordinates);
-
 static __init void hmat_add_locality(struct acpi_hmat_locality *hmat_loc)
 {
 	struct memory_locality *loc;
@@ -773,10 +743,6 @@ static void hmat_update_target_attrs(struct memory_target *target,
 	u32 best = 0;
 	int i;
 
-	/* Don't update if an external agent has changed the data. */
-	if (target->ext_updated)
-		return;
-
 	/* Don't update for generic port if there's no device handle */
 	if ((access == NODE_ACCESS_CLASS_GENPORT_SINK_LOCAL ||
 	     access == NODE_ACCESS_CLASS_GENPORT_SINK_CPU) &&

drivers/base/node.c

Lines changed: 38 additions & 0 deletions
@@ -248,6 +248,44 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
+/**
+ * node_update_perf_attrs - Update the performance values for given access class
+ * @nid: Node identifier to be updated
+ * @coord: Heterogeneous memory performance coordinates
+ * @access: The access class for the given attributes
+ */
+void node_update_perf_attrs(unsigned int nid, struct access_coordinate *coord,
+			    enum access_coordinate_class access)
+{
+	struct node_access_nodes *access_node;
+	struct node *node;
+	int i;
+
+	if (WARN_ON_ONCE(!node_online(nid)))
+		return;
+
+	node = node_devices[nid];
+	list_for_each_entry(access_node, &node->access_list, list_node) {
+		if (access_node->access != access)
+			continue;
+
+		access_node->coord = *coord;
+		for (i = 0; access_attrs[i]; i++) {
+			sysfs_notify(&access_node->dev.kobj,
+				     NULL, access_attrs[i]->name);
+		}
+		break;
+	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access != ACCESS_COORDINATE_CPU)
+		return;
+
+	if (mempolicy_set_node_perf(nid, coord))
+		pr_info("failed to set mempolicy attrs for node %d\n", nid);
+}
+EXPORT_SYMBOL_GPL(node_update_perf_attrs);
+
 /**
  * struct node_cache_info - Internal tracking for memory node caches
  * @dev: Device represeting the cache level
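
The core of node_update_perf_attrs() in the hunk above is a list walk that finds the entry for one access class, overwrites its coordinates, and stops. The following is a simplified userspace model of that pattern only, not the kernel implementation. The types `struct coord` and `struct access_entry` and the function `update_coord()` are hypothetical stand-ins, and the sysfs notification and mempolicy update done by the real function are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct access_coordinate. */
struct coord {
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
	unsigned int read_latency;
	unsigned int write_latency;
};

/* Hypothetical stand-in for one per-class entry on a node's access list. */
struct access_entry {
	int access;                 /* access class identifier */
	struct coord coord;         /* currently published values */
	struct access_entry *next;  /* next entry, NULL at the end */
};

/*
 * Overwrite the coordinates of the first entry whose class matches 'access'.
 * Returns true if an entry was updated, false if no entry matched.
 * (The kernel function returns void; the return value here is only for
 * illustration and testing.)
 */
static bool update_coord(struct access_entry *head, int access,
			 const struct coord *new_coord)
{
	assert(new_coord != NULL);

	for (struct access_entry *e = head; e; e = e->next) {
		if (e->access != access)
			continue;

		e->coord = *new_coord;  /* whole-struct copy, as in the hunk */
		return true;            /* only one entry per class is updated */
	}

	return false;
}
```

In this model, updating in place means later readers of the entry see the new values without the entry being removed and re-added, which matches the summary's description of updating perf attrs on a node directly.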
