@@ -45,3 +45,138 @@ Detailed Description of the Change
4545----------------------------------
4646
4747<Propose spec language that corrects the conflict.>
48+
49+
50+ Resolve conflict between CFMWS, Platform Memory Holes, and Endpoint Decoders
51+ ============================================================================
52+
53+ Document
54+ --------
55+
56+ CXL Revision 3.2, Version 1.0
57+
58+ License
59+ -------
60+
61+ SPDX-License-Identifier: CC-BY-4.0
62+
63+ Creator/Contributors
64+ --------------------
65+
66+ - Fabio M. De Francesco, Intel
67+ - Dan J. Williams, Intel
68+ - Mahesh Natu, Intel
69+
70+ Summary of the Change
71+ ---------------------
72+
73+ According to the current Compute Express Link (CXL) Specifications (Revision
74+ 3.2, Version 1.0), the CXL Fixed Memory Window Structure (CFMWS) describes zero
75+ or more Host Physical Address (HPA) windows associated with each CXL Host
76+ Bridge. Each window represents a contiguous HPA range that may be interleaved
77+ across one or more targets, including CXL Host Bridges. Each window has a set
78+ of restrictions that govern its usage. It is the responsibility of the
79+ Operating System-directed configuration and Power Management (OSPM) to utilize
80+ each window for the specified use.
81+
82+ Table 9-22 of the current CXL Specifications states that the Window Size field
83+ contains the total number of consecutive bytes of HPA this window describes.
84+ This value must be a multiple of the Number of Interleave Ways (NIW) * 256 MB.
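As a minimal sketch of the Window Size rule, the check below tests whether a size is a multiple of NIW * 256 MB. The helper name `window_size_is_aligned` is hypothetical, and taking the decoded number of ways as input (rather than the encoded field value stored in the CFMWS) is an assumption made for illustration; neither comes from the specification or from any OSPM implementation.

```python
MIB = 1 << 20
GIB = 1 << 30
GRANULE = 256 * MIB  # the 256 MB unit named by the Window Size rule


def window_size_is_aligned(window_size, niw):
    """Return True if window_size is a non-zero multiple of niw * 256 MB.

    niw is assumed to be the decoded number of interleave ways.
    """
    if niw <= 0 or window_size <= 0:
        return False
    return window_size % (niw * GRANULE) == 0


if __name__ == "__main__":
    # A 12-way window must be a multiple of 12 * 256 MB = 3 GB.
    print(window_size_is_aligned(3 * GIB, 12))
    print(window_size_is_aligned(2 * GIB, 12))
```

With 12 ways, a 3 GB window satisfies the rule while a 2 GB window does not, which is the situation the rest of this proposal addresses.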
85+
86+ Platform Firmware (BIOS) might reserve physical addresses below 4 GB where a
87+ memory gap such as the Low Memory Hole (LMH) for PCIe MMIO may exist. In such
88+ cases, the CFMWS Window Size may not adhere to the NIW * 256 MB rule.
89+
90+ The HPA represents the actual physical memory address space that the CXL devices
91+ can decode and respond to, while the System Physical Address (SPA), a related
92+ but distinct concept, represents the system-visible address space that users can
93+ direct transactions to, and so it excludes reserved regions.
94+
95+ BIOS publishes CFMWS to communicate the active SPA ranges that, on platforms
96+ with LMHs, map to a strict subset of the HPA. The SPA range trims out the hole,
97+ so the Endpoints lose the capacity in the part of the HPA range that intersects
98+ the hole, because no SPA maps to it.
99+
100+ For example, an x86 platform with two CFMWS and an LMH starting at 2 GB:
101+
102+ +--------+------------+-------------------+------------------+-------------------+------+
103+ | Window | CFMWS Base | CFMWS Size | HDM Decoder Base | HDM Decoder Size | Ways |
104+ +========+============+===================+==================+===================+======+
105+ | 0 | 0 GB | 2 GB | 0 GB | 3 GB | 12 |
106+ +--------+------------+-------------------+------------------+-------------------+------+
107+ | 1 | 4 GB | NIW*256MB Aligned | 4 GB | NIW*256MB Aligned | 12 |
108+ +--------+------------+-------------------+------------------+-------------------+------+
109+
110+ HDM Decoder Base and HDM Decoder Size represent all 12 Endpoint Decoders of a
111+ 12-way region and all the intermediate Switch Decoders. The BIOS configures
112+ them according to the NIW * 256 MB rule, resulting in an HPA range size of
113+ 3 GB. The CFMWS Base and CFMWS Size, by contrast, are used to configure the
114+ Root Decoder HPA range, which is smaller (2 GB) than that of the Switch and
115+ Endpoint Decoders in the hierarchy (3 GB).
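The sizes for Window 0 can be reproduced with a few lines of arithmetic. This is only an illustration of the numbers in the table; the function names `min_aligned_size` and `unmapped_capacity` are hypothetical.

```python
MIB = 1 << 20
GIB = 1 << 30


def min_aligned_size(niw):
    """Smallest non-zero size that satisfies the NIW * 256 MB rule."""
    return niw * 256 * MIB


def unmapped_capacity(cfmws_size, hdm_size):
    """Bytes of the HDM decoder range that the trimmed CFMWS does not cover."""
    return max(0, hdm_size - cfmws_size)


if __name__ == "__main__":
    niw = 12
    hdm_size = min_aligned_size(niw)  # HDM Decoder Size for Window 0
    cfmws_size = 2 * GIB              # CFMWS Size, trimmed at the hole
    print(hdm_size // GIB, "GB decoded by the HDM decoders")
    print(unmapped_capacity(cfmws_size, hdm_size) // GIB, "GB without an SPA mapping")
```

For the example platform, the HDM decoders cover 3 GB while the CFMWS exposes 2 GB, so 1 GB of the decoded range has no SPA mapping.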
116+
117+ This creates two issues that lead to a failure to construct a region:
118+
119+ 1) A mismatch in region size between the root decoder and any HDM decoder. The
120+ root decoder will always be smaller due to the trim.
121+
122+ 2) The trim causes the root decoder to violate the NIW * 256 MB rule.
123+
124+ This change allows a region with a base address of 0 GB to bypass these checks,
125+ so that the region can be created with the trimmed root decoder address range.
126+
127+ This change does not allow any other arbitrary region to violate these
128+ checks. It is intended exclusively to enable x86 platforms that map CXL memory
129+ under 4 GB.
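One way to picture the two checks and the narrow bypass is the sketch below. It is not the Linux implementation: all function names are hypothetical, check 1 is modelled as range containment, "under 4 GB" is interpreted as the root decoder range ending at or below 4 GB, and the HDM decoder range is assumed to remain NIW * 256 MB aligned. Each of these is an assumption made for illustration.

```python
MIB = 1 << 20
GIB = 1 << 30
GRANULE = 256 * MIB


def strict_checks_pass(root_base, root_size, hdm_base, hdm_size, niw):
    """Checks 1) and 2): the root range contains the HDM range and is aligned."""
    contains = (root_base <= hdm_base
                and hdm_base + hdm_size <= root_base + root_size)
    aligned = root_size % (niw * GRANULE) == 0
    return contains and aligned


def is_lmh_trimmed(root_base, root_size, hdm_base, hdm_size, niw):
    """Assumed shape of the exempted case: both ranges start at 0, the root
    range is a trimmed prefix of an aligned HDM range, and it ends at or
    below 4 GB."""
    return (root_base == 0 and hdm_base == 0
            and 0 < root_size < hdm_size
            and root_size <= 4 * GIB
            and hdm_size % (niw * GRANULE) == 0)


def region_checks_pass(root_base, root_size, hdm_base, hdm_size, niw):
    """Accept a region if the strict checks pass or the bypass applies."""
    return (strict_checks_pass(root_base, root_size, hdm_base, hdm_size, niw)
            or is_lmh_trimmed(root_base, root_size, hdm_base, hdm_size, niw))


if __name__ == "__main__":
    # Window 0 of the example: fails the strict checks, accepted by the bypass.
    print(strict_checks_pass(0, 2 * GIB, 0, 3 * GIB, 12))
    print(region_checks_pass(0, 2 * GIB, 0, 3 * GIB, 12))
    # The same mismatch at a 4 GB base is still rejected.
    print(region_checks_pass(4 * GIB, 2 * GIB, 4 * GIB, 3 * GIB, 12))
```

Keeping the bypass as a separate predicate makes it clear that regions with a non-zero base remain subject to both checks.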
130+
131+ Although the HDM decoders cover the PCIe hole HPA region, the platform is
132+ expected never to route accesses in that region to the CXL complex, because the
133+ root decoder only covers the trimmed region (which excludes the hole). This is
134+ outside the ability of Linux to enforce.
135+
136+ On the example platform, only the first 2GB will be potentially usable, but
137+ Linux, aiming to adhere to the current specifications, fails to construct
138+ Regions and attach Endpoint and intermediate Switch Decoders to them.
139+
140+ There are several points of failure, all due to the expectation that the Root
141+ Decoder HPA size, equal to that of the CFMWS from which it is configured, is
142+ greater than or equal to that of the matching Switch and Endpoint HDM Decoders.
143+
144+ In order to succeed with construction and attachment, Linux must construct a
145+ Region with the Root Decoder HPA range size, and then attach to it all the
146+ intermediate Switch Decoders and Endpoint Decoders that belong to the hierarchy,
147+ regardless of their range sizes.
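The construction described above can be sketched with a toy model. The `Decoder` and `Region` classes and `construct_region` are hypothetical and far simpler than any real driver; requiring each attached decoder to share the region base is an assumption of this sketch, not something the text above states.

```python
from dataclasses import dataclass, field

GIB = 1 << 30


@dataclass
class Decoder:
    name: str
    base: int
    size: int


@dataclass
class Region:
    base: int
    size: int
    targets: list = field(default_factory=list)

    def attach(self, dec):
        """Attach dec regardless of its size; return the bytes of its range
        that lie beyond the region and therefore stay inaccessible."""
        if dec.base != self.base:  # assumption: decoders share the region base
            raise ValueError(f"{dec.name}: base does not match the region")
        self.targets.append(dec)
        return max(0, dec.size - self.size)


def construct_region(root, decoders):
    """Size the region from the root decoder, then attach every decoder."""
    region = Region(root.base, root.size)
    excess = [region.attach(d) for d in decoders]
    return region, max(excess, default=0)


if __name__ == "__main__":
    root = Decoder("root0", 0, 2 * GIB)
    endpoints = [Decoder(f"endpoint{i}", 0, 3 * GIB) for i in range(12)]
    region, hidden = construct_region(root, endpoints)
    print(region.size // GIB, "GB region,", len(region.targets), "decoders,",
          hidden // GIB, "GB inaccessible")
```

On the example platform, this yields a 2 GB region with all 12 Endpoint Decoders attached, and 1 GB of the decoded HPA range left inaccessible.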
148+
149+ Benefits of the Change
150+ ----------------------
151+
152+ Without the change, the OSPM cannot match intermediate Switch and Endpoint
153+ Decoders with Root Decoders configured with CFMWS HPA sizes that don't align
154+ with the NIW * 256 MB constraint, which leads to lost memdev capacity.
155+
156+ This change allows the OSPM to construct Regions and attach intermediate Switch
157+ and Endpoint Decoders to them, so that the addressable part of the memory
158+ devices' total capacity is made available to the users.
159+
160+ References
161+ ----------
162+
163+ Compute Express Link Specification Revision 3.2, Version 1.0
164+ <https://www.computeexpresslink.org/>
165+
166+ Detailed Description of the Change
167+ ----------------------------------
168+
169+ The description of the Window Size field in Table 9-22 needs to account for
170+ platforms with Low Memory Holes, where SPA ranges might be subsets of the
171+ endpoints' HPA. Therefore, it has to be changed to the following:
172+
173+ "The total number of consecutive bytes of HPA this window represents. This value
174+ shall be a multiple of NIW * 256 MB.
175+
176+ On platforms that reserve physical addresses below 4 GB, such as the Low Memory
177+ Hole for PCIe MMIO on x86, an instance of CFMWS whose Base HPA range is 0 might
178+ have a size that doesn't align with the NIW * 256 MB constraint.
179+
180+ Note that the HPA range sizes of the matching intermediate Switch Decoders and
181+ Endpoint Decoders must still align to the above-mentioned rule, but the memory
182+ capacity that exceeds the CFMWS window size won't be accessible."