Commit 54275f7

sean-jc authored and pbonzini committed
KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
Check for A/D bits being disabled instead of the access tracking mask being non-zero when deciding whether or not to attempt to fix a page fault via the fast path. Originally, the access tracking mask was non-zero if and only if A/D bits were disabled by _KVM_ (including not being supported by hardware), but that hasn't been true since nVMX was fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause KVM to not use A/D bits while running L2 despite KVM using them while running L1.

In other words, don't attempt the fast path just because EPT is enabled.

Note, attempting the fast path for all !PRESENT faults can "fix" a very, _VERY_ tiny percentage of faults out of mmu_lock by detecting that the fault is spurious, i.e. has been fixed by a different vCPU, but again the odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM gets less than 10 successes out of 30k+ faults, and that's likely one of the more favorable scenarios. Disabling dirty logging can likely lead to a rash of collisions between vCPUs for some workloads that operate on a common set of pages, but penalizing _all_ !PRESENT faults for that one case is unlikely to be a net positive, not to mention that that problem is best solved by not zapping in the first place.

The number of spurious faults does scale with the number of vCPUs, e.g. a 255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast path (again out of 30k), but that's all of 0.2% of faults. Using legacy shadow paging does get more spurious faults, and a few more detected out of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring faults that are reflected into the guest), i.e. the extra detections are purely due to the sheer number of faults observed.

On the other hand, getting a "negative" in the fast path takes in the neighborhood of 150-250 cycles. So while it is tempting to keep/extend the current behavior, such a change needs to come with hard numbers showing that it's actually a win in the grand scheme, or any scheme for that matter.

Fixes: 995f00a ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
1 parent 91ab933 commit 54275f7
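
To make the predicate change concrete, below is a self-contained toy model, not kernel code: the globals mirror the kernel's names, but their values and the scenario are illustrative assumptions. It shows how the two checks diverge once nested EPT support keeps shadow_acc_track_mask non-zero even while KVM itself uses A/D bits:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative values, not the kernel's: KVM uses A/D bits (accessed
 * mask non-zero), yet the access-tracking mask is also configured so
 * that L2 can run with EPTP12's A/D bits disabled -- the state of the
 * world since commit 995f00a.
 */
static uint64_t shadow_accessed_mask  = 1ull << 8;
static uint64_t shadow_acc_track_mask = 3ull << 52;

static bool kvm_ad_enabled(void)
{
	return shadow_accessed_mask != 0;
}

int main(void)
{
	/* A !PRESENT read fault, the common cold-fault case. */
	bool write = false, present = false;

	/* Old check: true whenever access tracking is merely possible,
	 * so every !PRESENT fault attempts the (almost always futile)
	 * fast path when EPT is in use. */
	bool old_ok = shadow_acc_track_mask != 0 || (write && present);

	/* New check: only attempt the fast path when KVM itself runs
	 * with A/D bits disabled, i.e. when the SPTE might genuinely be
	 * access-tracked. */
	bool new_ok = !kvm_ad_enabled() || (write && present);

	printf("old predicate: %d, new predicate: %d\n", old_ok, new_ok);
	return 0;
}

Compiled and run, this prints "old predicate: 1, new predicate: 0": the old check sends every !PRESENT fault down the fast path, while the new one skips it.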

3 files changed: 41 additions & 17 deletions


arch/x86/kvm/mmu/mmu.c

Lines changed: 29 additions & 16 deletions
@@ -3056,19 +3056,20 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
 
 	/*
 	 * #PF can be fast if:
-	 * 1. The shadow page table entry is not present, which could mean that
-	 *    the fault is potentially caused by access tracking (if enabled).
-	 * 2. The shadow page table entry is present and the fault
-	 *    is caused by write-protect, that means we just need change the W
-	 *    bit of the spte which can be done out of mmu-lock.
 	 *
-	 * However, if access tracking is disabled we know that a non-present
-	 * page must be a genuine page fault where we have to create a new SPTE.
-	 * So, if access tracking is disabled, we return true only for write
-	 * accesses to a present page.
+	 * 1. The shadow page table entry is not present and A/D bits are
+	 *    disabled _by KVM_, which could mean that the fault is potentially
+	 *    caused by access tracking (if enabled).  If A/D bits are enabled
+	 *    by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
+	 *    bits for L2 and employ access tracking, but the fast page fault
+	 *    mechanism only supports direct MMUs.
+	 * 2. The shadow page table entry is present, the access is a write,
+	 *    and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
+	 *    the fault was caused by a write-protection violation.  If the
+	 *    SPTE is MMU-writable (determined later), the fault can be fixed
+	 *    by setting the Writable bit, which can be done out of mmu_lock.
 	 */
-
-	return shadow_acc_track_mask != 0 || (fault->write && fault->present);
+	return !kvm_ad_enabled() || (fault->write && fault->present);
 }
 
 /*
@@ -3183,13 +3184,25 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		new_spte = spte;
 
-		if (is_access_track_spte(spte))
+		/*
+		 * KVM only supports fixing page faults outside of MMU lock for
+		 * direct MMUs, nested MMUs are always indirect, and KVM always
+		 * uses A/D bits for non-nested MMUs.  Thus, if A/D bits are
+		 * enabled, the SPTE can't be an access-tracked SPTE.
+		 */
+		if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte))
 			new_spte = restore_acc_track_spte(new_spte);
 
 		/*
-		 * Currently, to simplify the code, write-protection can
-		 * be removed in the fast path only if the SPTE was
-		 * write-protected for dirty-logging or access tracking.
+		 * To keep things simple, only SPTEs that are MMU-writable can
+		 * be made fully writable outside of mmu_lock, e.g. only SPTEs
+		 * that were write-protected for dirty-logging or access
+		 * tracking are handled here.  Don't bother checking if the
+		 * SPTE is writable to prioritize running with A/D bits enabled.
+		 * The is_access_allowed() check above handles the common case
+		 * of the fault being spurious, and the SPTE is known to be
+		 * shadow-present, i.e. except for access tracking restoration
+		 * making the new SPTE writable, the check is wasteful.
 		 */
 		if (fault->write && is_mmu_writable_spte(spte)) {
 			new_spte |= PT_WRITABLE_MASK;
@@ -4794,7 +4807,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	role.efer_nx = true;
 	role.smm = cpu_role.base.smm;
 	role.guest_mode = cpu_role.base.guest_mode;
-	role.ad_disabled = (shadow_accessed_mask == 0);
+	role.ad_disabled = !kvm_ad_enabled();
 	role.level = kvm_mmu_get_tdp_level(vcpu);
 	role.direct = true;
 	role.has_4_byte_gpte = false;
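
One of the rewritten comments in fast_page_fault() above references is_access_allowed(), the check earlier in the retry loop that catches spurious faults. For context, a condensed sketch of that check, paraphrased from mmu.c of this era (the comment is mine, and the details should be treated as a best-effort reconstruction, not a quote):

/*
 * Paraphrased from arch/x86/kvm/mmu/mmu.c of this era: returns true if
 * the current SPTE already permits the faulting access, i.e. the fault
 * is spurious because another vCPU (or a previous loop iteration)
 * already fixed it.
 */
static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
{
	if (fault->exec)
		return is_executable_pte(spte);

	if (fault->write)
		return is_writable_pte(spte);

	/* Fault was on Read (or possibly Write) access. */
	return spte & PT_PRESENT_MASK;
}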

arch/x86/kvm/mmu/spte.h

Lines changed: 11 additions & 0 deletions
@@ -214,6 +214,17 @@ static inline bool is_shadow_present_pte(u64 pte)
 	return !!(pte & SPTE_MMU_PRESENT_MASK);
 }
 
+/*
+ * Returns true if A/D bits are supported in hardware and are enabled by KVM.
+ * When enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can
+ * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
+ * scenario where KVM is using A/D bits for L1, but not L2.
+ */
+static inline bool kvm_ad_enabled(void)
+{
+	return !!shadow_accessed_mask;
+}
+
 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 {
 	return sp->role.ad_disabled;
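
The new comment notes that "SP and SPTE variants are needed"; sp_ad_disabled() in the surrounding context is the per-shadow-page variant. Purely as a hypothetical illustration of how the layers compose (this helper exists in neither the patch nor the kernel):

/*
 * Hypothetical helper, NOT in the patch or the kernel: whether a given
 * shadow page actually uses A/D bits.  kvm_ad_enabled() is the global
 * switch; role.ad_disabled overrides it per shadow page, e.g. for an
 * L2 MMU whose EPTP12 disables A/D bits.
 */
static inline bool sp_uses_ad_bits(struct kvm_mmu_page *sp)
{
	return kvm_ad_enabled() && !sp_ad_disabled(sp);
}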

arch/x86/kvm/mmu/tdp_mmu.c

Lines changed: 1 addition & 1 deletion
@@ -1136,7 +1136,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 			   struct kvm_mmu_page *sp, bool account_nx,
 			   bool shared)
 {
-	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
+	u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
 	int ret = 0;
 
 	if (shared) {
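
For reference, the !kvm_ad_enabled() argument feeds make_nonleaf_spte()'s ad_disabled parameter, which decides how the new non-leaf SPTE handles accessed state. A paraphrased sketch of that function from spte.c of the same era (the exact mask composition is reconstructed from memory and should be treated as an assumption):

/*
 * Paraphrased from arch/x86/kvm/mmu/spte.c of this era: ad_disabled
 * decides whether the non-leaf SPTE is marked as access-tracked or
 * instead gets the hardware Accessed bit pre-set.
 */
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
{
	u64 spte = SPTE_MMU_PRESENT_MASK | __pa(child_pt) |
		   shadow_present_mask | PT_WRITABLE_MASK |
		   shadow_user_mask | shadow_x_mask | shadow_me_mask;

	if (ad_disabled)
		spte |= SPTE_TDP_AD_DISABLED_MASK;
	else
		spte |= shadow_accessed_mask;

	return spte;
}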
