Commit 8351039
Merge branch kvm-arm64/eager-page-splitting into kvmarm/next
* kvm-arm64/eager-page-splitting:
  : Eager Page Splitting, courtesy of Ricardo Koller.
  :
  : Dirty logging performance is dominated by the cost of splitting
  : hugepages to PTE granularity. On systems that mere mortals can get their
  : hands on, each fault incurs the cost of a full break-before-make
  : pattern, wherein the broadcast invalidation and ensuing serialization
  : significantly increases fault latency.
  :
  : The goal of eager page splitting is to move the cost of hugepage
  : splitting out of the stage-2 fault path and instead into the ioctls
  : responsible for managing the dirty log:
  :
  :  - If manual protection is enabled for the VM, hugepage splitting
  :    happens in the KVM_CLEAR_DIRTY_LOG ioctl. This is desirable as it
  :    provides userspace granular control over hugepage splitting.
  :
  :  - Otherwise, if userspace relies on the legacy dirty log behavior
  :    (clear on collection), hugepage splitting is done at the moment dirty
  :    logging is enabled for a particular memslot.
  :
  : Support for eager page splitting requires explicit opt-in from
  : userspace, which is realized through the
  : KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE capability.
  arm64: kvm: avoid overflow in integer division
  KVM: arm64: Use local TLBI on permission relaxation
  KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
  KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
  KVM: arm64: Split huge pages when dirty logging is enabled
  KVM: arm64: Add kvm_uninit_stage2_mmu()
  KVM: arm64: Refactor kvm_arch_commit_memory_region()
  KVM: arm64: Add kvm_pgtable_stage2_split()
  KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
  KVM: arm64: Export kvm_are_all_memslots_empty()
  KVM: arm64: Add helper for creating unlinked stage2 subtrees
  KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
  KVM: arm64: Rename free_removed to free_unlinked

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2 parents: 44c026a + 14c3555

15 files changed: 612 additions & 57 deletions

Documentation/virt/kvm/api.rst

Lines changed: 27 additions & 0 deletions
@@ -8445,6 +8445,33 @@ structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+---------------------------------------
+
+:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+:Architectures: arm64
+:Type: vm
+:Parameters: arg[0] is the new split chunk size.
+:Returns: 0 on success, -EINVAL if any memslot was already created.
+
+This capability sets the chunk size used in Eager Page Splitting.
+
+Eager Page Splitting improves the performance of dirty-logging (used
+in live migrations) when guest memory is backed by huge-pages. It
+avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing
+it eagerly when enabling dirty logging (with the
+KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using
+KVM_CLEAR_DIRTY_LOG.
+
+The chunk size specifies how many pages to break at a time, using a
+single allocation for each chunk. Bigger the chunk size, more pages
+need to be allocated ahead of time.
+
+The chunk size needs to be a valid block size. The list of acceptable
+block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
+64-bit bitmap (each bit describing a block size). The default value is
+0, to disable the eager page splitting.
+
 9. Known KVM API problems
 =========================

arch/arm64/include/asm/kvm_asm.h

Lines changed: 4 additions & 0 deletions
@@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
     __KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
     __KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
     __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
+    __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
     __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
     __KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
     __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,

@@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
 extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
                                      int level);
+extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+                                         phys_addr_t ipa,
+                                         int level);
 extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);

arch/arm64/include/asm/kvm_host.h

Lines changed: 15 additions & 0 deletions
@@ -159,6 +159,21 @@ struct kvm_s2_mmu {
     /* The last vcpu id that ran on each physical CPU */
     int __percpu *last_vcpu_ran;
 
+#define KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT 0
+    /*
+     * Memory cache used to split
+     * KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE worth of huge pages. It
+     * is used to allocate stage2 page tables while splitting huge
+     * pages. The choice of KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
+     * influences both the capacity of the split page cache, and
+     * how often KVM reschedules. Be wary of raising CHUNK_SIZE
+     * too high.
+     *
+     * Protected by kvm->slots_lock.
+     */
+    struct kvm_mmu_memory_cache split_page_cache;
+    uint64_t split_page_chunk_size;
+
     struct kvm_arch *arch;
 };

arch/arm64/include/asm/kvm_mmu.h

Lines changed: 1 addition & 0 deletions
@@ -172,6 +172,7 @@ void __init free_hyp_pgds(void);
 
 void stage2_unmap_vm(struct kvm *kvm);
 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
+void kvm_uninit_stage2_mmu(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
                           phys_addr_t pa, unsigned long size, bool writable);

arch/arm64/include/asm/kvm_pgtable.h

Lines changed: 75 additions & 4 deletions
@@ -92,6 +92,24 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
     return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
 }
 
+static inline u32 kvm_supported_block_sizes(void)
+{
+    u32 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
+    u32 r = 0;
+
+    for (; level < KVM_PGTABLE_MAX_LEVELS; level++)
+        r |= BIT(kvm_granule_shift(level));
+
+    return r;
+}
+
+static inline bool kvm_is_block_size_supported(u64 size)
+{
+    bool is_power_of_two = IS_ALIGNED(size, size);
+
+    return is_power_of_two && (size & kvm_supported_block_sizes());
+}
+
 /**
  * struct kvm_pgtable_mm_ops - Memory management callbacks.
  * @zalloc_page:		Allocate a single zeroed memory page.

@@ -104,7 +122,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  *				allocation is physically contiguous.
  * @free_pages_exact:		Free an exact number of memory pages previously
  *				allocated by zalloc_pages_exact.
- * @free_removed_table:		Free a removed paging structure by unlinking and
+ * @free_unlinked_table:	Free an unlinked paging structure by unlinking and
  *				dropping references.
  * @get_page:			Increment the refcount on a page.
  * @put_page:			Decrement the refcount on a page. When the

@@ -124,7 +142,7 @@ struct kvm_pgtable_mm_ops {
     void* (*zalloc_page)(void *arg);
     void* (*zalloc_pages_exact)(size_t size);
     void (*free_pages_exact)(void *addr, size_t size);
-    void (*free_removed_table)(void *addr, u32 level);
+    void (*free_unlinked_table)(void *addr, u32 level);
     void (*get_page)(void *addr);
     void (*put_page)(void *addr);
     int (*page_count)(void *addr);

@@ -195,13 +213,21 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  *					with other software walkers.
  * @KVM_PGTABLE_WALK_HANDLE_FAULT:	Indicates the page-table walk was
  *					invoked from a fault handler.
+ * @KVM_PGTABLE_WALK_SKIP_BBM_TLBI:	Visit and update table entries
+ *					without Break-before-make's
+ *					TLB invalidation.
+ * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
+ *					without Cache maintenance
+ *					operations required.
  */
 enum kvm_pgtable_walk_flags {
     KVM_PGTABLE_WALK_LEAF			= BIT(0),
     KVM_PGTABLE_WALK_TABLE_PRE			= BIT(1),
     KVM_PGTABLE_WALK_TABLE_POST		= BIT(2),
     KVM_PGTABLE_WALK_SHARED			= BIT(3),
     KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
+    KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
+    KVM_PGTABLE_WALK_SKIP_CMO			= BIT(6),
 };
 
 struct kvm_pgtable_visit_ctx {

@@ -441,15 +467,41 @@ int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
 void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
 
 /**
- * kvm_pgtable_stage2_free_removed() - Free a removed stage-2 paging structure.
+ * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
  * @mm_ops:	Memory management callbacks.
  * @pgtable:	Unlinked stage-2 paging structure to be freed.
  * @level:	Level of the stage-2 paging structure to be freed.
  *
  * The page-table is assumed to be unreachable by any hardware walkers prior to
  * freeing and therefore no TLB invalidation is performed.
  */
-void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
+void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
+
+/**
+ * kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @phys:	Physical address of the memory to map.
+ * @level:	Starting level of the stage-2 paging structure to be created.
+ * @prot:	Permissions and attributes for the mapping.
+ * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
+ *		page-table pages.
+ * @force_pte:	Force mappings to PAGE_SIZE granularity.
+ *
+ * Returns an unlinked page-table tree. This new page-table tree is
+ * not reachable (i.e., it is unlinked) from the root pgd and it's
+ * therefore unreachable by the hardware page-table walker. No TLB
+ * invalidation or CMOs are performed.
+ *
+ * If device attributes are not explicitly requested in @prot, then the
+ * mapping will be normal, cacheable.
+ *
+ * Return: The fully populated (unlinked) stage-2 paging structure, or
+ * an ERR_PTR(error) on failure.
+ */
+kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
+                                              u64 phys, u32 level,
+                                              enum kvm_pgtable_prot prot,
+                                              void *mc, bool force_pte);
 
 /**
  * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.

@@ -620,6 +672,25 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
  */
 int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
 
+/**
+ * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
+ *				to PAGE_SIZE guest pages.
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to split.
+ * @size:	Size of the range.
+ * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
+ *		page-table pages.
+ *
+ * The function tries to split any level 1 or 2 entry that overlaps
+ * with the input range (given by @addr and @size).
+ *
+ * Return: 0 on success, negative error code on failure. Note that
+ * kvm_pgtable_stage2_split() is best effort: it tries to break as many
+ * blocks in the input range as allowed by @mc_capacity.
+ */
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
+                             struct kvm_mmu_memory_cache *mc);
+
 /**
  * kvm_pgtable_walk() - Walk a page-table.
  * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().

arch/arm64/kvm/arm.c

Lines changed: 28 additions & 0 deletions
@@ -65,6 +65,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
                             struct kvm_enable_cap *cap)
 {
     int r;
+    u64 new_cap;
 
     if (cap->flags)
         return -EINVAL;

@@ -89,6 +90,24 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
         r = 0;
         set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
         break;
+    case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
+        new_cap = cap->args[0];
+
+        mutex_lock(&kvm->slots_lock);
+        /*
+         * To keep things simple, allow changing the chunk
+         * size only when no memory slots have been created.
+         */
+        if (!kvm_are_all_memslots_empty(kvm)) {
+            r = -EINVAL;
+        } else if (new_cap && !kvm_is_block_size_supported(new_cap)) {
+            r = -EINVAL;
+        } else {
+            r = 0;
+            kvm->arch.mmu.split_page_chunk_size = new_cap;
+        }
+        mutex_unlock(&kvm->slots_lock);
+        break;
     default:
         r = -EINVAL;
         break;

@@ -302,6 +321,15 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
     case KVM_CAP_ARM_PTRAUTH_GENERIC:
         r = system_has_full_ptr_auth();
         break;
+    case KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE:
+        if (kvm)
+            r = kvm->arch.mmu.split_page_chunk_size;
+        else
+            r = KVM_ARM_EAGER_SPLIT_CHUNK_SIZE_DEFAULT;
+        break;
+    case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES:
+        r = kvm_supported_block_sizes();
+        break;
     default:
         r = 0;
     }

arch/arm64/kvm/hyp/nvhe/hyp-main.c

Lines changed: 10 additions & 0 deletions
@@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
     __kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
 }
 
+static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
+{
+    DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+    DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
+    DECLARE_REG(int, level, host_ctxt, 3);
+
+    __kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
+}
+
 static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
     DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);

@@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
     HANDLE_FUNC(__kvm_vcpu_run),
     HANDLE_FUNC(__kvm_flush_vm_context),
     HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+    HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
     HANDLE_FUNC(__kvm_tlb_flush_vmid),
     HANDLE_FUNC(__kvm_flush_cpu_context),
     HANDLE_FUNC(__kvm_timer_set_cntvoff),

arch/arm64/kvm/hyp/nvhe/mem_protect.c

Lines changed: 3 additions & 3 deletions
@@ -91,9 +91,9 @@ static void host_s2_put_page(void *addr)
     hyp_put_page(&host_s2_pool, addr);
 }
 
-static void host_s2_free_removed_table(void *addr, u32 level)
+static void host_s2_free_unlinked_table(void *addr, u32 level)
 {
-    kvm_pgtable_stage2_free_removed(&host_mmu.mm_ops, addr, level);
+    kvm_pgtable_stage2_free_unlinked(&host_mmu.mm_ops, addr, level);
 }
 
 static int prepare_s2_pool(void *pgt_pool_base)

@@ -110,7 +110,7 @@ static int prepare_s2_pool(void *pgt_pool_base)
     host_mmu.mm_ops = (struct kvm_pgtable_mm_ops) {
         .zalloc_pages_exact = host_s2_zalloc_pages_exact,
         .zalloc_page = host_s2_zalloc_page,
-        .free_removed_table = host_s2_free_removed_table,
+        .free_unlinked_table = host_s2_free_unlinked_table,
         .phys_to_virt = hyp_phys_to_virt,
         .virt_to_phys = hyp_virt_to_phys,
         .page_count = hyp_page_count,

arch/arm64/kvm/hyp/nvhe/tlb.c

Lines changed: 52 additions & 0 deletions
@@ -130,6 +130,58 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
     __tlb_switch_to_host(&cxt);
 }
 
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+                                  phys_addr_t ipa, int level)
+{
+    struct tlb_inv_context cxt;
+
+    /* Switch to requested VMID */
+    __tlb_switch_to_guest(mmu, &cxt, true);
+
+    /*
+     * We could do so much better if we had the VA as well.
+     * Instead, we invalidate Stage-2 for this IPA, and the
+     * whole of Stage-1. Weep...
+     */
+    ipa >>= 12;
+    __tlbi_level(ipas2e1, ipa, level);
+
+    /*
+     * We have to ensure completion of the invalidation at Stage-2,
+     * since a table walk on another CPU could refill a TLB with a
+     * complete (S1 + S2) walk based on the old Stage-2 mapping if
+     * the Stage-1 invalidation happened first.
+     */
+    dsb(nsh);
+    __tlbi(vmalle1);
+    dsb(nsh);
+    isb();
+
+    /*
+     * If the host is running at EL1 and we have a VPIPT I-cache,
+     * then we must perform I-cache maintenance at EL2 in order for
+     * it to have an effect on the guest. Since the guest cannot hit
+     * I-cache lines allocated with a different VMID, we don't need
+     * to worry about junk out of guest reset (we nuke the I-cache on
+     * VMID rollover), but we do need to be careful when remapping
+     * executable pages for the same guest. This can happen when KSM
+     * takes a CoW fault on an executable page, copies the page into
+     * a page that was previously mapped in the guest and then needs
+     * to invalidate the guest view of the I-cache for that page
+     * from EL1. To solve this, we invalidate the entire I-cache when
+     * unmapping a page from a guest if we have a VPIPT I-cache but
+     * the host is running at EL1. As above, we could do better if
+     * we had the VA.
+     *
+     * The moral of this story is: if you have a VPIPT I-cache, then
+     * you should be running with VHE enabled.
+     */
+    if (icache_is_vpipt())
+        icache_inval_all_pou();
+
+    __tlb_switch_to_host(&cxt);
+}
+
 void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 {
     struct tlb_inv_context cxt;
