
Commit db5e824

amit3s authored and sean-jc committed
KVM: SVM: Virtualize and advertise support for ERAPS
AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS) feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after VMEXITs. ERAPS adds guest/host tags to entries in the RSB (a.k.a. RAP). This helps with speculation protection across the VM boundary, and it also preserves host and guest entries in the RSB, which can improve software performance (those entries would otherwise be flushed by the FILL_RETURN_BUFFER sequences).

Importantly, ERAPS also improves cross-domain security by clearing the RAP in certain situations. Specifically, the RAP is cleared in response to actions that are typically tied to software context switching between tasks. Per the APM:

  The ERAPS feature eliminates the need to execute CALL instructions to
  clear the return address predictor in most cases. On processors that
  support ERAPS, return addresses from CALL instructions executed in
  host mode are not used in guest mode, and vice versa. Additionally,
  the return address predictor is cleared in all cases when the TLB is
  implicitly invalidated and in the following cases:

  * MOV CR3 instruction
  * INVPCID other than single address invalidation (operation type 0)

ERAPS also allows CPUs to extend the size of the RSB/RAP from the older standard (of 32 entries) to a new size, enumerated in CPUID leaf 0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs).

In hardware, ERAPS is always-on: when running in host context, the CPU uses the full RSB/RAP size without any software changes necessary. However, when running in guest context, the CPU utilizes the full size of the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in the VMCB; if the flag is not set, the CPU limits itself to the historical size of 32 entries.

Requiring software to opt in to guest usage of RAPs larger than 32 entries allows hypervisors, i.e. KVM, to emulate the aforementioned conditions in which the RAP is cleared, as well as the guest/host split. E.g. if the CPU unconditionally used the full RAP for guests, failure to clear the RAP on transitions between L1 or L2, or on emulated guest TLB flushes, would expose the guest to RAP-based attacks, as a guest without support for ERAPS wouldn't know that its FILL_RETURN_BUFFER sequence is insufficient.

Address the two broad categories of ERAPS emulation, and advertise ERAPS support to userspace, along with the RAP size enumerated in CPUID.

1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP entries on several conditions, including CR3 updates. To handle scenarios where a relevant operation is handled in common code (emulation of INVPCID and, to a lesser extent, MOV CR3), piggyback VCPU_EXREG_CR3 and create an alias, VCPU_EXREG_ERAPS. SVM doesn't utilize CR3 dirty tracking, and so for all intents and purposes VCPU_EXREG_CR3 is unused. Aliasing VCPU_EXREG_ERAPS ensures that any flow that writes CR3 will also clear the guest's RAP, and allows common x86 code to mark ERAPS vCPUs as needing a RAP clear without having to add a new request (or other mechanism).

2. Nested guests: the ERAPS feature adds host/guest tagging to entries in the RSB, but does not distinguish between guest ASIDs. To prevent an L2 guest from poisoning the RSB to attack the L1 guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next VMRUN with a VMCB that has this bit set causes the CPU to flush the RSB before entering the guest context. Set the bit in VMCB01 after a nested #VMEXIT to ensure that the next time the L1 guest runs, its RSB contents aren't polluted by L2's contents. Similarly, before entry into a nested guest, set the bit for VMCB02, so that the L1 guest's RSB contents are not leaked/used in the L2 context.

Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is exposed to the guest. Enabling ALLOW_LARGER_RAP unconditionally wouldn't cause any functional issues, but ignoring userspace's (and L1's) desires would put KVM into a grey area, which is especially undesirable due to the potential security implications. E.g. if a use case wants to have L1 do manual RAP clearing even when ERAPS is present in hardware, enabling ALLOW_LARGER_RAP could result in L1 leaving stale entries in the RAP.

ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and later.

Signed-off-by: Amit Shah <amit.shah@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Link: https://patch.msgid.link/aR913X8EqO6meCqa@google.com
1 parent 1d1722e commit db5e824

8 files changed

Lines changed: 77 additions & 3 deletions

File tree

arch/x86/include/asm/cpufeatures.h

Lines changed: 1 addition & 0 deletions
@@ -472,6 +472,7 @@
 #define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */

arch/x86/include/asm/kvm_host.h

Lines changed: 8 additions & 0 deletions
@@ -195,7 +195,15 @@ enum kvm_reg {
 	VCPU_EXREG_PDPTR = NR_VCPU_REGS,
 	VCPU_EXREG_CR0,
+	/*
+	 * Alias AMD's ERAPS (not a real register) to CR3 so that common code
+	 * can trigger emulation of the RAP (Return Address Predictor) with
+	 * minimal support required in common code.  Piggyback CR3 as the RAP
+	 * is cleared on writes to CR3, i.e. marking CR3 dirty will naturally
+	 * mark ERAPS dirty as well.
+	 */
 	VCPU_EXREG_CR3,
+	VCPU_EXREG_ERAPS = VCPU_EXREG_CR3,
 	VCPU_EXREG_CR4,
 	VCPU_EXREG_RFLAGS,
 	VCPU_EXREG_SEGMENTS,

arch/x86/include/asm/svm.h

Lines changed: 5 additions & 1 deletion
@@ -131,7 +131,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -182,6 +183,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID		3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL	7

+#define ERAP_CONTROL_ALLOW_LARGER_RAP	BIT(0)
+#define ERAP_CONTROL_CLEAR_RAP		BIT(1)
+
 #define V_TPR_MASK 0x0f

 #define V_IRQ_SHIFT 8

arch/x86/kvm/cpuid.c

Lines changed: 8 additions & 1 deletion
@@ -1223,6 +1223,7 @@ void kvm_set_cpu_caps(void)
 		/* PrefetchCtlMsr */
 		/* GpOnUserCpuid */
 		/* EPSF */
+		F(ERAPS),
 		SYNTHESIZED_F(SBPB),
 		SYNTHESIZED_F(IBPB_BRTYPE),
 		SYNTHESIZED_F(SRSO_NO),
@@ -1803,8 +1804,14 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
 	case 0x80000021:
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
+
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			entry->ebx &= GENMASK(23, 16);
+		else
+			entry->ebx = 0;
+
 		cpuid_entry_override(entry, CPUID_8000_0021_ECX);
 		break;
 	/* AMD Extended Performance Monitoring and Debug */

arch/x86/kvm/svm/nested.c

Lines changed: 18 additions & 0 deletions
@@ -436,6 +436,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 	to->msrpm_base_pa = from->msrpm_base_pa;
 	to->tsc_offset = from->tsc_offset;
 	to->tlb_ctl = from->tlb_ctl;
+	to->erap_ctl = from->erap_ctl;
 	to->int_ctl = from->int_ctl;
 	to->int_vector = from->int_vector;
 	to->int_state = from->int_state;
@@ -885,6 +886,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 		}
 	}

+	/*
+	 * Take ALLOW_LARGER_RAP from vmcb12 even though it should be safe to
+	 * let L2 use a larger RAP since KVM will emulate the necessary clears,
+	 * as it's possible L1 deliberately wants to restrict L2 to the legacy
+	 * RAP size.  Unconditionally clear the RAP on nested VMRUN, as KVM is
+	 * responsible for emulating the host vs. guest tags (L1 is the "host",
+	 * L2 is the "guest").
+	 */
+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+		vmcb02->control.erap_ctl = (svm->nested.ctl.erap_ctl &
+					    ERAP_CONTROL_ALLOW_LARGER_RAP) |
+					   ERAP_CONTROL_CLEAR_RAP;
+
 	/*
 	 * Merge guest and host intercepts - must be called with vcpu in
 	 * guest-mode to take effect.
@@ -1180,6 +1194,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	kvm_nested_vmexit_handle_ibrs(vcpu);

+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+		vmcb01->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
+
 	svm_switch_vmcb(svm, &svm->vmcb01);

 	/*
@@ -1686,6 +1703,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
 	dst->tsc_offset = from->tsc_offset;
 	dst->asid = from->asid;
 	dst->tlb_ctl = from->tlb_ctl;
+	dst->erap_ctl = from->erap_ctl;
 	dst->int_ctl = from->int_ctl;
 	dst->int_vector = from->int_vector;
 	dst->int_state = from->int_state;

arch/x86/kvm/svm/svm.c

Lines changed: 24 additions & 1 deletion
@@ -1141,6 +1141,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event)
 		svm_clr_intercept(svm, INTERCEPT_PAUSE);
 	}

+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);

@@ -3293,6 +3296,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -4004,6 +4008,13 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 	invlpga(gva, svm->vmcb->control.asid);
 }

+static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	kvm_register_mark_dirty(vcpu, VCPU_EXREG_ERAPS);
+
+	svm_flush_tlb_asid(vcpu);
+}
+
 static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4262,6 +4273,10 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}
 	svm->vmcb->save.cr2 = vcpu->arch.cr2;

+	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS) &&
+	    kvm_register_is_dirty(vcpu, VCPU_EXREG_ERAPS))
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_CLEAR_RAP;
+
 	svm_hv_update_vp_id(svm->vmcb, vcpu);

 	/*
@@ -4339,6 +4354,14 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}

 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+
+	/*
+	 * Unconditionally mask off the CLEAR_RAP bit, the AND is just as cheap
+	 * as the TEST+Jcc to avoid it.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl &= ~ERAP_CONTROL_CLEAR_RAP;
+
 	vmcb_mark_all_clean(svm->vmcb);

 	/* if exit due to PF check for async PF */
@@ -5094,7 +5117,7 @@ struct kvm_x86_ops svm_x86_ops __initdata = {
 	.flush_tlb_all = svm_flush_tlb_all,
 	.flush_tlb_current = svm_flush_tlb_current,
 	.flush_tlb_gva = svm_flush_tlb_gva,
-	.flush_tlb_guest = svm_flush_tlb_asid,
+	.flush_tlb_guest = svm_flush_tlb_guest,

 	.vcpu_pre_run = svm_vcpu_pre_run,
 	.vcpu_run = svm_vcpu_run,

arch/x86/kvm/svm/svm.h

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@ struct vmcb_ctrl_area_cached {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
+	u8 erap_ctl;
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;

arch/x86/kvm/x86.c

Lines changed: 12 additions & 0 deletions
@@ -14130,6 +14130,13 @@ int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva)
 			return 1;
 		}

+		/*
+		 * When ERAPS is supported, invalidating a specific PCID clears
+		 * the RAP (Return Address Predictor).
+		 */
+		if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
+			kvm_register_mark_dirty(vcpu, VCPU_EXREG_ERAPS);
+
 		kvm_invalidate_pcid(vcpu, operand.pcid);
 		return kvm_skip_emulated_instruction(vcpu);

@@ -14143,6 +14150,11 @@ int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva)

 		fallthrough;
 	case INVPCID_TYPE_ALL_INCL_GLOBAL:
+		/*
+		 * Don't bother marking VCPU_EXREG_ERAPS dirty, SVM will take
+		 * care of doing so when emulating the full guest TLB flush
+		 * (the RAP is cleared on all implicit TLB flushes).
+		 */
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 		return kvm_skip_emulated_instruction(vcpu);
