@@ -10,6 +10,191 @@ encrypting the guest memory. In TDX, a special module running in a special
1010mode sits between the host and the guest and manages the guest/host
1111separation.
1212
13+ TDX Host Kernel Support
14+ =======================
15+
16+ TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
17+ a new isolated memory range pointed to by the SEAM Range Register (SEAMRR).
18+ A CPU-attested software module called 'the TDX module' runs inside the new
19+ isolated range to provide the functionality to manage and run protected
20+ VMs.
21+
22+ TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
23+ cryptographically protect the VMs. TDX reserves part of the MKTME KeyIDs
24+ as TDX private KeyIDs, which are only accessible within SEAM mode. The
25+ BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
26+
27+ Before the TDX module can be used to create and run protected VMs, it
28+ must be loaded into the isolated range and properly initialized. The TDX
29+ architecture doesn't require the BIOS to load the TDX module, but the
30+ kernel assumes it is loaded by the BIOS.
31+
32+ TDX boot-time detection
33+ -----------------------
34+
35+ The kernel detects TDX by detecting TDX private KeyIDs during kernel
36+ boot. The dmesg line below shows that TDX is enabled by the BIOS::
37+
38+ [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
39+
40+ TDX module initialization
41+ -------------------------
42+
43+ The kernel talks to the TDX module via the new SEAMCALL instruction. The
44+ TDX module implements SEAMCALL leaf functions to allow the kernel to
45+ initialize it.
46+
47+ If the TDX module isn't loaded, the SEAMCALL instruction fails with a
48+ special error. In this case the kernel fails the module initialization
49+ and reports the module isn't loaded::
50+
51+ [..] virt/tdx: module not loaded
52+
53+ Initializing the TDX module consumes roughly 1/256th of system RAM,
54+ which is used as 'metadata' for the TDX memory. It also takes additional
55+ CPU time to initialize that metadata along with the TDX module itself.
56+ Neither cost is trivial, so the kernel initializes the TDX module at
57+ runtime, on demand.
58+
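As a rough worked example of the 1/256th figure, the sketch below estimates the metadata size for a given amount of RAM. The function name ``pamt_estimate_kb`` and the 64 GiB host are assumptions made for illustration; the real allocation depends on the memory layout and on the TDX module. For 64 GiB (67108864 KB) the estimate is 262144 KB, which is in the same range as the "262668 KBs allocated for PAMT" dmesg sample in this document.

```c
#include <stdint.h>

/*
 * Rough estimate only, based on the "roughly 1/256th of system RAM"
 * figure in the text. This is not how the kernel computes the size.
 */
static uint64_t pamt_estimate_kb(uint64_t ram_kb)
{
	return ram_kb / 256;
}

/* Hypothetical 64 GiB host, expressed in KB. */
static const uint64_t example_ram_kb = 64ULL * 1024 * 1024;
```
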
59+ Besides initializing the TDX module, a per-cpu initialization SEAMCALL
60+ must be done on each cpu before any other SEAMCALLs can be made on that
61+ cpu.
62+
63+ The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
64+ allow the user of TDX to enable the TDX module and to enable TDX on the
65+ local cpu, respectively.
66+
67+ Making a SEAMCALL requires that VMXON has been done on that CPU. Currently
68+ only KVM implements VMXON. For now, neither tdx_enable() nor
69+ tdx_cpu_enable() does VMXON internally (not trivial); both depend on the
70+ caller to guarantee that.
71+
72+ To enable TDX, the caller of TDX should: 1) temporarily disable CPU
73+ hotplug; 2) do VMXON and tdx_cpu_enable() on all online cpus; 3) call
74+ tdx_enable(). For example::
75+
76+ cpus_read_lock();
77+ on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
78+ ret = tdx_enable();
79+ cpus_read_unlock();
80+ if (ret)
81+ goto no_tdx;
82+ // TDX is ready to use
83+
84+ The caller of TDX must also guarantee that tdx_cpu_enable() has
85+ succeeded on a cpu before running any other SEAMCALL on that cpu. A
86+ typical usage is to do both VMXON and tdx_cpu_enable() in the CPU hotplug
87+ online callback, and to refuse to online the cpu if tdx_cpu_enable() fails.
88+
89+ The user can consult dmesg to see whether the TDX module has been initialized.
90+
91+ If the TDX module is initialized successfully, dmesg shows something
92+ like below::
93+
94+ [..] virt/tdx: 262668 KBs allocated for PAMT
95+ [..] virt/tdx: module initialized
96+
97+ If the TDX module fails to initialize, dmesg shows that the
98+ initialization failed::
99+
100+ [..] virt/tdx: module initialization failed ...
101+
102+ TDX Interaction with Other Kernel Components
103+ --------------------------------------------
104+
105+ TDX Memory Policy
106+ ~~~~~~~~~~~~~~~~~
107+
108+ TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
109+ kernel which memory is TDX compatible. The kernel needs to build a list
110+ of memory regions (out of the CMRs) as "TDX-usable" memory and pass those
111+ regions to the TDX module. Once this is done, those "TDX-usable" memory
112+ regions are fixed for the module's lifetime.
113+
114+ To keep things simple, the kernel currently guarantees that all pages
115+ in the page allocator are TDX memory. Specifically, the kernel uses all
116+ system memory in the core-mm "at the time of TDX module initialization"
117+ as TDX memory, and also refuses to online any non-TDX memory through
118+ memory hotplug.
119+
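The hotplug policy above boils down to a containment check: a memory range may go online only if it lies inside memory that was handed to the TDX module. The following is a minimal userspace sketch of that idea, not the kernel's implementation. The ``struct tdx_region`` type and the ``range_is_tdx_memory()`` helper are hypothetical names, and the check is simplified so that a range straddling two adjacent regions is rejected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one "TDX-usable" region as a PFN range [start, end). */
struct tdx_region {
	uint64_t start_pfn;
	uint64_t end_pfn;
};

/*
 * Return true if [start_pfn, end_pfn) lies entirely inside a single
 * region. Simplification: a range that spans two adjacent regions is
 * rejected, because the regions are not merged first.
 */
static bool range_is_tdx_memory(const struct tdx_region *regions, size_t nr,
				uint64_t start_pfn, uint64_t end_pfn)
{
	for (size_t i = 0; i < nr; i++) {
		if (start_pfn >= regions[i].start_pfn &&
		    end_pfn <= regions[i].end_pfn)
			return true;
	}
	return false;
}
```
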
120+ Physical Memory Hotplug
121+ ~~~~~~~~~~~~~~~~~~~~~~~
122+
123+ Note that TDX assumes convertible memory is always physically present
124+ during the machine's runtime. A non-buggy BIOS should never support
125+ hot-removal of any convertible memory. This implementation doesn't handle
126+ ACPI memory removal but depends on the BIOS to behave correctly.
127+
128+ CPU Hotplug
129+ ~~~~~~~~~~~
130+
131+ The TDX module requires that the per-cpu initialization SEAMCALL be done
132+ on a cpu before any other SEAMCALLs can be made on that cpu. The kernel
133+ provides tdx_cpu_enable() to let the user of TDX do this when the user
134+ wants to use a new cpu for a TDX task.
135+
136+ TDX doesn't support physical (ACPI) CPU hotplug. During machine boot,
137+ TDX verifies all boot-time present logical CPUs are TDX compatible before
138+ enabling TDX. A non-buggy BIOS should never support hot-add/removal of
139+ a physical CPU. Currently the kernel doesn't handle physical CPU hotplug,
140+ but depends on the BIOS to behave correctly.
141+
142+ Note that TDX works with logical CPU online/offline, so the kernel still
143+ allows a logical CPU to be offlined and onlined again.
144+
145+ Kexec()
146+ ~~~~~~~
147+
148+ TDX host support currently lacks the ability to handle kexec. For
149+ simplicity, only one of TDX host support and kexec can be enabled in the
150+ Kconfig. This will be fixed in the future.
151+
152+ Erratum
153+ ~~~~~~~
154+
155+ The first few generations of TDX hardware have an erratum. A partial
156+ write to a TDX private memory cacheline will silently "poison" the
157+ line. Subsequent reads will consume the poison and generate a machine
158+ check.
159+
160+ A partial write is a memory write where a write transaction of less than
161+ a cacheline lands at the memory controller. The CPU does these via
162+ non-temporal write instructions (like MOVNTI), or through UC/WC memory
163+ mappings. Devices can also do partial writes via DMA.
164+
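To make the definition concrete, the sketch below classifies a single write transaction by its geometry alone. It assumes a 64-byte cacheline and only applies to writes that actually reach the memory controller as issued (non-temporal stores, UC/WC mappings, DMA); ordinary cached stores are written back as whole lines and are out of scope. The helper name ``is_partial_cacheline_write()`` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u	/* assumed cacheline size in bytes */

/*
 * Return true if one write transaction of len bytes at physical address
 * addr leaves at least one cacheline only partly written.
 */
static bool is_partial_cacheline_write(uint64_t addr, size_t len)
{
	if (len == 0)
		return false;	/* nothing is written at all */

	return (addr % CACHELINE_SIZE) != 0 || (len % CACHELINE_SIZE) != 0;
}
```
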
165+ Theoretically, a kernel bug could do a partial write to TDX private
166+ memory and trigger an unexpected machine check. What's more, the machine
167+ check code will present these as "Hardware error" when they were, in fact,
168+ a software-triggered issue. But in the end, this issue is hard to trigger.
169+
170+ If the platform has this erratum, the kernel prints an additional message
171+ in the machine check handler to tell the user that the machine check may
172+ be caused by a kernel bug on TDX private memory.
173+
174+ Interaction vs S3 and deeper states
175+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
176+
177+ TDX cannot survive S3 and deeper states. The hardware resets and
178+ disables TDX completely when the platform goes to S3 and deeper. Both TDX
179+ guests and the TDX module get destroyed permanently.
180+
181+ The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
182+ hibernation. Currently, for simplicity, the kernel chooses to make TDX
183+ mutually exclusive with S3 and hibernation.
184+
185+ The kernel disables TDX during early boot when hibernation support is
186+ available::
187+
188+ [..] virt/tdx: initialization failed: Hibernation support is enabled
189+
190+ Add 'nohibernate' to the kernel command line to disable hibernation in
191+ order to use TDX.
192+
193+ ACPI S3 is disabled during kernel early boot if TDX is enabled. The user
194+ needs to turn off TDX in the BIOS in order to use S3.
195+
196+ TDX Guest Support
197+ =================
13198Since the host cannot directly access guest registers or memory, much
14199normal functionality of a hypervisor must be moved into the guest. This is
15200implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +205,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
20205guest to the hypervisor or the TDX module.
21206
22207New TDX Exceptions
23- ==================
208+ ------------------
24209
25210TDX guests behave differently from bare-metal and traditional VMX guests.
26211In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +215,7 @@ Instructions marked with an '*' conditionally cause exceptions. The
30215details for these instructions are discussed below.
31216
32217Instruction-based #VE
33- ---------------------
218+ ~~~~~~~~~~~~~~~~~~~~~
34219
35220- Port I/O (INS, OUTS, IN, OUT)
36221- HLT
@@ -41,7 +226,7 @@ Instruction-based #VE
41226- CPUID*
42227
43228Instruction-based #GP
44- ---------------------
229+ ~~~~~~~~~~~~~~~~~~~~~
45230
46231- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
47232 VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +237,7 @@ Instruction-based #GP
52237- RDMSR*,WRMSR*
53238
54239RDMSR/WRMSR Behavior
55- --------------------
240+ ~~~~~~~~~~~~~~~~~~~~
56241
57242MSR access behavior falls into three categories:
58243
@@ -73,7 +258,7 @@ trapping and handling in the TDX module. Other than possibly being slow,
73258these MSRs appear to function just as they would on bare metal.
74259
75260CPUID Behavior
76- --------------
261+ ~~~~~~~~~~~~~~
77262
78263For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
79264return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +278,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
93278value with a hypercall.
94279
95280#VE on Memory Accesses
96- ======================
281+ ----------------------
97282
98283There are essentially two classes of TDX memory: private and shared.
99284Private memory receives full TDX protections. Its content is protected
@@ -107,7 +292,7 @@ entries. This helps ensure that a guest does not place sensitive
107292information in shared memory, exposing it to the untrusted hypervisor.
108293
109294#VE on Shared Memory
110- --------------------
295+ ~~~~~~~~~~~~~~~~~~~~
111296
112297Access to shared mappings can cause a #VE. The hypervisor ultimately
113298controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +312,7 @@ be careful not to access device MMIO regions unless it is also prepared to
127312handle a #VE.
128313
129314#VE on Private Pages
130- --------------------
315+ ~~~~~~~~~~~~~~~~~~~~
131316
132317An access to private mappings can also cause a #VE. Since all kernel
133318memory is also private memory, the kernel might theoretically need to
@@ -145,7 +330,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
145330to handle the exception.
146331
147332Linux #VE handler
148- =================
333+ -----------------
149334
150335Just like page faults or #GP's, #VE exceptions can be either handled or be
151336fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +352,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
167352which is not recoverable.
168353
169354MMIO handling
170- =============
355+ -------------
171356
172357In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
173358mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +374,7 @@ MMIO access via other means (like structure overlays) may result in an
189374oops.
190375
191376Shared Memory Conversions
192- =========================
377+ -------------------------
193378
194379All TDX guest memory starts out as private at boot. This memory can not
195380be accessed by the hypervisor. However, some kernel users like device