Commit 8b6a9d2

Merge branches 'guest_memfd', 'lam', 'misc', 'mmu' and 'pmu'
* guest_memfd: (35 commits)
  KVM: selftests: Test KVM exit behavior for private memory/access
  KVM: selftests: Add basic selftest for guest_memfd()
  KVM: selftests: Expand set_memory_region_test to validate guest_memfd()
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Add x86-only selftest for private memory conversions
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)
  KVM: selftests: Add helpers to convert guest memory b/w private and shared
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: x86: Add support for "protected VMs" that can utilize private memory
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: x86/mmu: Handle page fault for private memory
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  KVM: Add transparent hugepage support for dedicated guest memory
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
  ...

* lam: (47 commits)
  KVM: x86: Use KVM-governed feature framework to track "LAM enabled"
  KVM: x86: Advertise and enable LAM (user and supervisor)
  KVM: x86: Virtualize LAM for user pointer
  KVM: x86: Virtualize LAM for supervisor pointer
  KVM: x86: Untag addresses for LAM emulation where applicable
  KVM: x86: Introduce get_untagged_addr() in kvm_x86_ops and call it in emulator
  KVM: x86: Remove kvm_vcpu_is_illegal_gpa()
  KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legality
  KVM: x86/mmu: Drop non-PA bits when getting GFN for guest's PGD
  KVM: x86: Add X86EMUL_F_INVLPG and pass it in em_invlpg()
  KVM: x86: Add an emulation flag for implicit system access
  KVM: x86: Consolidate flags for __linearize()
  KVM: x86: hyper-v: Don't auto-enable stimer on write from user-space
  KVM: x86: Update the variable naming in kvm_x86_ops.sched_in()
  KVM: x86/mmu: Stop kicking vCPUs to sync the dirty log when PML is disabled
  KVM: x86: Use octal for file permission
  KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  KVM: x86/mmu: Zap KVM TDP when noncoherent DMA assignment starts/stops
  KVM: x86: Don't sync user-written TSC against startup values
  KVM: x86/mmu: Zap SPTEs on MTRR update iff guest MTRRs are honored
  ...

* misc:
  KVM: x86: Ignore MSR_AMD64_TW_CFG access
  KVM: x86: remove the unused assigned_dev_head from kvm_arch
  x86: KVM: Add feature flag for CPUID.80000021H:EAX[bit 1]
  KVM: x86: remove always-false condition in kvmclock_sync_fn
  KVM: x86: hyper-v: Don't auto-enable stimer on write from user-space
  KVM: x86: Update the variable naming in kvm_x86_ops.sched_in()
  KVM: x86/mmu: Stop kicking vCPUs to sync the dirty log when PML is disabled
  KVM: x86: Use octal for file permission
  KVM: x86: Don't sync user-written TSC against startup values
  KVM: selftests: Test behavior of HWCR, a.k.a. MSR_K7_HWCR
  KVM: x86: Virtualize HWCR.TscFreqSel[bit 24]
  KVM: x86: Allow HWCR.McStatusWrEn to be cleared once set
  KVM: x86: Refine calculation of guest wall clock to use a single TSC read
  KVM: x86: Add SBPB support
  KVM: x86: Add IBPB_BRTYPE support
  KVM: x86: Add CONFIG_KVM_MAX_NR_VCPUS to allow up to 4096 vCPUs
  KVM: x86: Force TLB flush on userspace changes to special registers
  KVM: x86: Remove redundant vcpu->arch.cr0 assignments

* mmu:
  KVM: x86/mmu: Remove unnecessary ‘NULL’ values from sptep
  KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  KVM: x86/mmu: Zap KVM TDP when noncoherent DMA assignment starts/stops
  KVM: x86/mmu: Zap SPTEs on MTRR update iff guest MTRRs are honored
  KVM: x86/mmu: Zap SPTEs when CR0.CD is toggled iff guest MTRRs are honored
  KVM: x86/mmu: Add helpers to return if KVM honors guest MTRRs

* pmu:
  KVM: x86: Service NMI requests after PMI requests in VM-Enter path
6 parents 2b3f232 + 881375a + b291db5 + 2770d47 + 1de9992 + fad505b commit 8b6a9d2

69 files changed

Lines changed: 3712 additions & 477 deletions

Documentation/virt/kvm/api.rst

Lines changed: 235 additions & 9 deletions
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.

+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).

+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -540,7 +559,7 @@ ioctl is useful if the in-kernel PIC is not used.
 PPC:
 ^^^^

-Queues an external interrupt to be injected. This ioctl is overleaded
+Queues an external interrupt to be injected. This ioctl is overloaded
 with 3 different irq values:

 a) KVM_INTERRUPT_SET
@@ -965,7 +984,7 @@ be set in the flags field of this ioctl:
 The KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL flag requests KVM to generate
 the contents of the hypercall page automatically; hypercalls will be
 intercepted and passed to userspace through KVM_EXIT_XEN. In this
-ase, all of the blob size and address fields must be zero.
+case, all of the blob size and address fields must be zero.

 The KVM_XEN_HVM_CONFIG_EVTCHN_SEND flag indicates to KVM that userspace
 will always use the KVM_XEN_HVM_EVTCHN_SEND ioctl to deliver event
@@ -1070,7 +1089,7 @@ Other flags returned by ``KVM_GET_CLOCK`` are accepted but ignored.
 :Extended by: KVM_CAP_INTR_SHADOW
 :Architectures: x86, arm64
 :Type: vcpu ioctl
-:Parameters: struct kvm_vcpu_event (out)
+:Parameters: struct kvm_vcpu_events (out)
 :Returns: 0 on success, -1 on error

 X86:
@@ -1193,7 +1212,7 @@ directly to the virtual CPU).
 :Extended by: KVM_CAP_INTR_SHADOW
 :Architectures: x86, arm64
 :Type: vcpu ioctl
-:Parameters: struct kvm_vcpu_event (in)
+:Parameters: struct kvm_vcpu_events (in)
 :Returns: 0 on success, -1 on error

 X86:
@@ -3063,7 +3082,7 @@ as follow::
 };

 An entry with a "page_shift" of 0 is unused. Because the array is
-organized in increasing order, a lookup can stop when encoutering
+organized in increasing order, a lookup can stop when encountering
 such an entry.

 The "slb_enc" field provides the encoding to use in the SLB for the
@@ -3455,7 +3474,7 @@ Possible features:
 - KVM_RUN and KVM_GET_REG_LIST are not available;

 - KVM_GET_ONE_REG and KVM_SET_ONE_REG cannot be used to access
-  the scalable archietctural SVE registers
+  the scalable architectural SVE registers
   KVM_REG_ARM64_SVE_ZREG(), KVM_REG_ARM64_SVE_PREG() or
   KVM_REG_ARM64_SVE_FFR;

@@ -4401,7 +4420,7 @@ This will have undefined effects on the guest if it has not already
 placed itself in a quiescent state where no vcpu will make MMU enabled
 memory accesses.

-On succsful completion, the pending HPT will become the guest's active
+On successful completion, the pending HPT will become the guest's active
 HPT and the previous HPT will be discarded.

 On failure, the guest will still be operating on its previous HPT.
@@ -5016,7 +5035,7 @@ before the vcpu is fully usable.

 Between KVM_ARM_VCPU_INIT and KVM_ARM_VCPU_FINALIZE, the feature may be
 configured by use of ioctls such as KVM_SET_ONE_REG. The exact configuration
-that should be performaned and how to do it are feature-dependent.
+that should be performed and how to do it are feature-dependent.

 Other calls that depend on a particular feature being finalized, such as
 KVM_RUN, KVM_GET_REG_LIST, KVM_GET_ONE_REG and KVM_SET_ONE_REG, will fail with
@@ -5124,6 +5143,24 @@ Valid values for 'action'::
   #define KVM_PMU_EVENT_ALLOW 0
   #define KVM_PMU_EVENT_DENY 1

+Via this API, KVM userspace can also control the behavior of the VM's fixed
+counters (if any) by configuring the "action" and "fixed_counter_bitmap" fields.
+
+Specifically, KVM follows the following pseudo-code when determining whether to
+allow the guest FixCtr[i] to count its pre-defined fixed event::
+
+  FixCtr[i]_is_allowed = (action == ALLOW) && (bitmap & BIT(i)) ||
+    (action == DENY) && !(bitmap & BIT(i));
+  FixCtr[i]_is_denied = !FixCtr[i]_is_allowed;
+
+KVM always consumes fixed_counter_bitmap; it is userspace's responsibility to
+ensure fixed_counter_bitmap is set correctly, e.g. if userspace wants to define
+a filter that only affects general purpose counters.
+
+Note, the "events" field also applies to fixed counters' hardcoded event_select
+and unit_mask values. "fixed_counter_bitmap" has higher priority than "events"
+if there is a contradiction between the two.
+
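The fixed-counter pseudo-code added in this hunk can be written out as a small C predicate. This is only an illustrative sketch: the enum, the helper name `fixed_counter_is_allowed` and the 32-bit bitmap width are assumptions made for the example, not definitions taken from the KVM headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the KVM_PMU_EVENT_ALLOW/DENY defines quoted in the hunk. */
enum pmu_filter_action { PMU_EVENT_ALLOW = 0, PMU_EVENT_DENY = 1 };

/*
 * Hypothetical helper: returns true if fixed counter i may count its
 * pre-defined event under the given action and bitmap.
 */
static bool fixed_counter_is_allowed(enum pmu_filter_action action,
                                     uint32_t bitmap, unsigned int i)
{
    assert(i < 32); /* this sketch assumes a 32-bit bitmap */

    bool bit_set = (bitmap >> i) & 1u;

    return (action == PMU_EVENT_ALLOW && bit_set) ||
           (action == PMU_EVENT_DENY && !bit_set);
}
```

Under this rule an ALLOW filter with an all-zero bitmap denies every fixed counter, which is why a filter meant to affect only general purpose counters still has to set the bitmap deliberately.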
 4.121 KVM_PPC_SVM_OFF
 ---------------------

@@ -5475,7 +5512,7 @@ KVM_XEN_ATTR_TYPE_EVTCHN
 from the guest. A given sending port number may be directed back to
 a specified vCPU (by APIC ID) / port / priority on the guest, or to
 trigger events on an eventfd. The vCPU and priority can be changed
-by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call, but but other
+by setting KVM_XEN_EVTCHN_UPDATE in a subsequent call, but other
 fields cannot change for a given sending port. A port mapping is
 removed by using KVM_XEN_EVTCHN_DEASSIGN in the flags field. Passing
 KVM_XEN_EVTCHN_RESET in the flags field removes all interception of
@@ -6070,6 +6107,137 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.

+4.139 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION are handled identically. Userspace can set
+KVM_MEM_PRIVATE in flags to have KVM bind the memory region to a given
+guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size). The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region. All standard
+bounds checks apply (use common sense).
+
+::
+
+  struct kvm_userspace_memory_region2 {
+    __u32 slot;
+    __u32 flags;
+    __u64 guest_phys_addr;
+    __u64 memory_size; /* bytes */
+    __u64 userspace_addr; /* start of the userspace allocated memory */
+    __u64 guest_memfd_offset;
+    __u32 guest_memfd;
+    __u32 pad1;
+    __u64 pad2[14];
+  };
+
+A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory). However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address. The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e. consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns. Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
+
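To make the shared/private selection described for KVM_SET_USER_MEMORY_REGION2 concrete, the sketch below resolves a guest physical address to either a host virtual address or a guest_memfd offset. It is a userspace model only: `struct region2`, `struct backing` and `resolve_backing` are hypothetical names for this example, and the struct copies just the fields of `struct kvm_userspace_memory_region2` that matter here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the relevant kvm_userspace_memory_region2 fields. */
struct region2 {
    uint64_t guest_phys_addr;
    uint64_t memory_size;        /* bytes */
    uint64_t userspace_addr;     /* shared backing */
    uint64_t guest_memfd_offset; /* private backing */
};

struct backing {
    bool is_private;   /* true: location is an offset into guest_memfd */
    uint64_t location; /* false: location is a host virtual address */
};

/* Returns false if gpa lies outside the region; otherwise fills *out. */
static bool resolve_backing(const struct region2 *r, uint64_t gpa,
                            bool gfn_is_private, struct backing *out)
{
    assert(r && out);

    if (gpa < r->guest_phys_addr ||
        gpa - r->guest_phys_addr >= r->memory_size)
        return false;

    uint64_t delta = gpa - r->guest_phys_addr;

    out->is_private = gfn_is_private;
    out->location = (gfn_is_private ? r->guest_memfd_offset
                                    : r->userspace_addr) + delta;
    return true;
}
```

The `gfn_is_private` argument models the per-gfn KVM_MEMORY_ATTRIBUTE_PRIVATE state, which starts out clear for every gfn.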
+4.140 KVM_SET_MEMORY_ATTRIBUTES
+-------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes (in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+  struct kvm_memory_attributes {
+    __u64 address;
+    __u64 size;
+    __u64 attributes;
+    __u64 flags;
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+
+The address and size must be page aligned. The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM. The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API. Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
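The constraints stated for KVM_SET_MEMORY_ATTRIBUTES (page-aligned address and size, only supported attribute bits, flags equal to zero) can be checked in userspace before issuing the ioctl. The sketch below is one way to do that. `SKETCH_PAGE_SIZE`, the local struct and `attrs_request_is_valid` are hypothetical, and rejecting empty or wrapping ranges is an extra assumption of this example that the text above does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL                 /* assumption: 4 KiB pages */
#define MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)     /* same bit as in the hunk */

/* Local copy of the layout shown for struct kvm_memory_attributes. */
struct memory_attributes {
    uint64_t address;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
};

/* 'supported' is the bitmap reported for KVM_CAP_MEMORY_ATTRIBUTES. */
static bool attrs_request_is_valid(const struct memory_attributes *a,
                                   uint64_t supported)
{
    assert(a);

    if (a->flags != 0)                       /* reserved, must be 0 */
        return false;
    if (a->attributes & ~supported)          /* unknown attribute bit */
        return false;
    if ((a->address | a->size) & (SKETCH_PAGE_SIZE - 1))
        return false;                        /* not page aligned */
    if (a->size == 0 || a->address + a->size < a->address)
        return false;                        /* sketch-only sanity check */
    return true;
}
```

Because there is no "get" API, a caller would typically record the attributes it set for each range after a successful call.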
+4.141 KVM_CREATE_GUEST_MEMFD
+----------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd (in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it. guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped. Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+    __u64 size;
+    __u64 flags;
+    __u64 reserved[6];
+  };
+
+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory. This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance. For a given guest_memfd file, there can be at
+most one mapping per page, i.e. binding multiple memory regions to a single
+guest_memfd range is not allowed (any number of memory regions can be bound to
+a single guest_memfd file, but the bound ranges must not overlap).
+
+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file. This is currently best effort. If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel.
+
+See KVM_SET_USER_MEMORY_REGION2 for additional details.
+
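The "at most one mapping per page" rule for guest_memfd means userspace has to keep bound ranges within one file disjoint. A VMM could track its own bindings and test a candidate range before calling KVM_SET_USER_MEMORY_REGION2, roughly as sketched below. `struct gmem_binding` and `can_bind` are hypothetical bookkeeping for this example and not part of the KVM API; KVM performs its own checks regardless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bound range [offset, offset + size) within a guest_memfd file. */
struct gmem_binding {
    uint64_t offset;
    uint64_t size;
};

/*
 * Returns true if 'want' fits inside a file of file_size bytes and does
 * not overlap any of the n ranges already recorded in 'bound'.
 */
static bool can_bind(const struct gmem_binding *bound, size_t n,
                     uint64_t file_size, struct gmem_binding want)
{
    assert(bound || n == 0);

    if (want.size == 0 || want.offset > file_size ||
        want.size > file_size - want.offset)
        return false;

    for (size_t i = 0; i < n; i++) {
        /* Half-open intervals overlap iff each starts before the other ends. */
        if (want.offset < bound[i].offset + bound[i].size &&
            bound[i].offset < want.offset + want.size)
            return false;
    }
    return true;
}
```

Adjacent ranges are accepted, which matches the statement that any number of memory regions may share one guest_memfd file as long as their ranges do not overlap.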
 5. The kvm_run structure
 ========================

@@ -6702,6 +6870,30 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.

+::
+
+  /* KVM_EXIT_MEMORY_FAULT */
+  struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
+    __u64 flags;
+    __u64 gpa;
+    __u64 size;
+  } memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+- KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+  on a private memory access. When clear, indicates the fault occurred on a
+  shared access.
+
+Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT; userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
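Since KVM_EXIT_MEMORY_FAULT arrives together with a failing KVM_RUN, a run loop has to look at the return value and errno before trusting exit_reason. The sketch below captures that check as a pure function. The `SKETCH_` constants stand in for the definitions in `<linux/kvm.h>`, and their numeric values, like the helper names, are assumptions of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* fallback for builds whose errno.h lacks it */
#endif

/* Assumption: placeholders for KVM_EXIT_MEMORY_FAULT and its flag. */
#define SKETCH_EXIT_MEMORY_FAULT 39u
#define SKETCH_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)

/* Local copy of the exit payload layout shown in the hunk. */
struct memory_fault {
    uint64_t flags;
    uint64_t gpa;
    uint64_t size;
};

/* True only when the fault payload may be consumed. */
static bool memory_fault_info_valid(int run_ret, int run_errno,
                                    uint32_t exit_reason)
{
    return run_ret == -1 &&
           (run_errno == EFAULT || run_errno == EHWPOISON) &&
           exit_reason == SKETCH_EXIT_MEMORY_FAULT;
}

/* True if the faulting access targeted private memory. */
static bool fault_was_private(const struct memory_fault *f)
{
    assert(f);
    return (f->flags & SKETCH_MEMORY_EXIT_FLAG_PRIVATE) != 0;
}
```

A caller would save errno immediately after the KVM_RUN ioctl returns and pass it in, so that later library calls cannot overwrite it.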
 ::

   /* KVM_EXIT_NOTIFY */
@@ -7736,6 +7928,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.

+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
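The advice to guard against receiving the same annotated fault repeatedly could be implemented with a small per-vCPU counter, as in the sketch below. The struct, the function name and the retry limit are all hypothetical; the text above does not prescribe any particular policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Remembers the last reported fault and how often it repeated in a row. */
struct fault_guard {
    uint64_t gpa;
    uint64_t size;
    uint64_t flags;
    unsigned int repeats; /* 0 means no fault recorded yet */
};

/*
 * Records a fault and returns true while retrying KVM_RUN still seems
 * reasonable, i.e. the identical fault was seen at most max_repeats times.
 */
static bool fault_guard_should_retry(struct fault_guard *g, uint64_t gpa,
                                     uint64_t size, uint64_t flags,
                                     unsigned int max_repeats)
{
    assert(g);

    if (g->repeats && g->gpa == gpa && g->size == size && g->flags == flags) {
        g->repeats++;
    } else {
        g->gpa = gpa;
        g->size = size;
        g->flags = flags;
        g->repeats = 1;
    }
    return g->repeats <= max_repeats;
}
```

When the function returns false, a VMM might stop the vCPU and report the fault instead of looping.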
 8. Other capabilities.
 ======================

@@ -8474,6 +8687,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.

+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types. The 1-setting of bit @n
+means the VM type with value @n is supported. Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM 0
+  #define KVM_X86_SW_PROTECTED_VM 1
+
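Reading the KVM_CAP_VM_TYPES bitmap comes down to a single bit test per VM type. The helper below is a minimal sketch: `vm_type_supported` and the `SKETCH_` constants are hypothetical names that mirror the two values listed in the hunk, and the bitmap argument stands for whatever KVM_CHECK_EXTENSION returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirror the VM type values listed for KVM_CAP_VM_TYPES. */
#define SKETCH_X86_DEFAULT_VM      0u
#define SKETCH_X86_SW_PROTECTED_VM 1u

/* Returns true if bit 'type' is set in the reported bitmap. */
static bool vm_type_supported(uint64_t bitmap, unsigned int type)
{
    assert(type < 64); /* the sketch holds the bitmap in 64 bits */
    return ((bitmap >> type) & 1u) != 0;
}
```

A VMM could call this before passing a non-default type to KVM_CREATE_VM and fall back to the default type when the bit is clear.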
 9. Known KVM API problems
 =========================
