Commit 5185c4d
Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next
Jason Gunthorpe says:

====================
This series is the start of adding full DMABUF support to iommufd. Currently
it is limited to only work with VFIO's DMABUF exporter. It sits on top of
Leon's series to add a DMABUF exporter to VFIO:

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but
otherwise works the same as it does today for a memfd. The user can select a
slice of the FD to map into the ioas, and if the underlying alignment
requirements are met it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used for
PCI Peer to Peer support in VMs, and is the last feature that the VFIO type 1
container has that iommufd couldn't do.

The VFIO type 1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem. Instead iommufd relies on
revocable DMABUFs. Whenever VFIO thinks there should be no access to the MMIO
it can shoot down the mapping in iommufd, which will unmap it from the
iommu_domain. There is no automatic remap; this is a safety protocol so the
kernel doesn't get stuck. Userspace is expected to know it is doing something
that will revoke the DMABUF and map/unmap it around the activity. E.g. when
QEMU goes to issue FLR it should do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it relies
on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call. The call confirms the DMABUF has revoke
semantics and delivers a phys_addr for the memory suitable for use with
iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include GPU
drivers to support DPDK/SPDK type use cases, so future series will work to
add a general concept of revoke and a general negotiation of interconnect to
remove vfio_pci_dma_buf_iommufd_map(). I also plan another series to modify
iommufd's vfio_compat to transparently pull a DMABUF out of a VFIO VMA to
emulate more of the uAPI of type 1.

The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for the design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
====================

Based on a shared branch with vfio.
* iommufd_dmabuf:
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
  vfio/nvgrace: Support get_dmabuf_phys
  vfio/pci: Add dma-buf export support for MMIO regions
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Share the core device pointer while invoking feature functions
  vfio: Export vfio device get and put registration helpers
  dma-buf: provide phys_vec to scatter-gather mapping routine
  PCI/P2PDMA: Document DMABUF model
  PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Separate the mmap() support from the core logic

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
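For orientation, here is a minimal userspace sketch of the flow the cover
letter describes, assuming the existing IOMMU_IOAS_MAP_FILE uAPI from
include/uapi/linux/iommufd.h. How the DMABUF fd is obtained from VFIO is
elided, since that ioctl comes from Leon's exporter series. Around an
operation that revokes the DMABUF (e.g. FLR), userspace would
IOMMU_IOAS_UNMAP the range first and re-issue this map afterwards.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Map a slice of a VFIO DMABUF fd into an IOAS at a fixed IOVA. */
static int map_dmabuf_slice(int iommufd, uint32_t ioas_id, int dmabuf_fd,
                            uint64_t offset, uint64_t length, uint64_t iova)
{
        struct iommu_ioas_map_file map = {
                .size = sizeof(map),
                .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                         IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                .ioas_id = ioas_id,
                .fd = dmabuf_fd,        /* detected as a DMABUF, not a memfd */
                .start = offset,        /* slice of the FD to map */
                .length = length,
                .iova = iova,
        };

        return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
}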
2 parents 81c45c6 + d2041f1 commit 5185c4d

33 files changed: 1887 additions & 211 deletions

Documentation/driver-api/pci/p2pdma.rst

Lines changed: 74 additions & 23 deletions

@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
 called Peer-to-Peer (or P2P). However, there are a number of issues that
 make P2P transactions tricky to do in a perfectly safe way.
 
-One of the biggest issues is that PCI doesn't require forwarding
-transactions between hierarchy domains, and in PCIe, each Root Port
-defines a separate hierarchy domain. To make things worse, there is no
-simple way to determine if a given Root Complex supports this or not.
-(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
-only supports doing P2P when the endpoints involved are all behind the
-same PCI bridge, as such devices are all in the same PCI hierarchy
-domain, and the spec guarantees that all transactions within the
-hierarchy will be routable, but it does not require routing
-between hierarchies.
-
-The second issue is that to make use of existing interfaces in Linux,
-memory that is used for P2P transactions needs to be backed by struct
-pages. However, PCI BARs are not typically cache coherent so there are
-a few corner case gotchas with these pages so developers need to
-be careful about what they do with them.
+For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
+until they reach a host bridge or root port. If the path includes PCIe switches
+then based on the ACS settings the transaction can route entirely within
+the PCIe hierarchy and never reach the root port. The kernel will evaluate
+the PCIe topology and always permit P2P in these well-defined cases.
+
+However, if the P2P transaction reaches the host bridge then it might have to
+hairpin back out the same root port, be routed inside the CPU SOC to another
+PCIe root port, or be routed internally to the SOC.
+
+The PCIe specification doesn't define the forwarding of transactions between
+hierarchy domains and the kernel defaults to blocking such routing. There is an
+allow list for detecting known-good HW, in which case P2P between any
+two PCIe devices will be permitted.
+
+Since P2P inherently is doing transactions between two devices it requires two
+drivers co-operating inside the kernel. The providing driver has to convey
+its MMIO to the consuming driver. To meet the driver model lifecycle rules the
+MMIO must have all DMA mappings removed, all CPU accesses prevented, and all
+page table mappings undone before the providing driver completes remove().
+
+This requires the providing and consuming driver to actively work together to
+guarantee that the consuming driver has stopped using the MMIO during a removal
+cycle. This is done by either a synchronous invalidation shutdown or waiting
+for all usage refcounts to reach zero.
+
+At the lowest level the P2P subsystem offers a naked struct p2p_provider that
+delegates lifecycle management to the providing driver. It is expected that
+drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
+to provide an invalidation shutdown. These MMIO addresses have no struct page,
+and if used with mmap() must create special PTEs. As such there are very few
+kernel uAPIs that can accept pointers to them; in particular they cannot be used
+with read()/write(), including O_DIRECT.
+
+Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
+pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of the
+pgmap ensures that when the pgmap is destroyed all other drivers have stopped
+using the MMIO. This option works with O_DIRECT flows, in some cases, if the
+underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
+FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
+it also relies on architecture support along with alignment and minimum size
+limitations.
 
 
 Driver Writer's Guide
@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
 Struct Page Caveats
 -------------------
 
-Driver writers should be very careful about not passing these special
-struct pages to code that isn't prepared for it. At this time, the kernel
-interfaces do not have any checks for ensuring this. This obviously
-precludes passing these pages to userspace.
+While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
+pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
 
-P2P memory is also technically IO memory but should never have any side
-effects behind it. Thus, the order of loads and stores should not be important
-and ioreadX(), iowriteX() and friends should not be necessary.
+The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
+KVA is still MMIO and must still be accessed through the normal
+readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
+like any other MMIO mapping. While this will actually work on some
+architectures, others will experience corruption or just crash in the kernel.
+Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
+access happens.
+
+
+Usage With DMABUF
+=================
+
+DMABUF provides an alternative to the above struct page-based
+client/provider/orchestrator system and should be used when struct page
+doesn't exist. In this mode the exporting driver will wrap
+some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
+
+Userspace can then pass the FD to an importing driver which will ask the
+exporting driver to map it to the importer.
+
+In this case the initiator and target pci_devices are known and the P2P subsystem
+is used to determine the mapping type. The phys_addr_t-based DMA API is used to
+establish the dma_addr_t.
+
+Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
+to remove() it must deliver an invalidation shutdown to all DMABUF importing
+drivers through move_notify() and synchronously DMA unmap all the MMIO.
+
+No importing driver can continue to have a DMA map to the MMIO after the
+exporting driver has destroyed its p2p_provider.
 
 
 P2P DMA Support Library
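The move_notify() lifecycle described above is the standard dynamic-importer
contract. As a hedged sketch (the driver names are hypothetical; the DMABUF
calls are the existing kernel APIs), an importer honoring the invalidation
shutdown looks roughly like:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

struct my_importer {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
};

/* Exporter is revoking: synchronously drop our DMA mapping. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
        struct my_importer *imp = attach->importer_priv;

        dma_resv_assert_held(attach->dmabuf->resv);
        if (imp->sgt) {
                dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
                imp->sgt = NULL;
        }
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,
        .move_notify = my_move_notify,
};

static int my_import(struct my_importer *imp, struct dma_buf *dmabuf,
                     struct device *dev)
{
        imp->attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp);
        if (IS_ERR(imp->attach))
                return PTR_ERR(imp->attach);

        dma_resv_lock(dmabuf->resv, NULL);
        imp->sgt = dma_buf_map_attachment(imp->attach, DMA_BIDIRECTIONAL);
        dma_resv_unlock(dmabuf->resv);
        if (IS_ERR(imp->sgt)) {
                int ret = PTR_ERR(imp->sgt);

                imp->sgt = NULL;
                dma_buf_detach(dmabuf, imp->attach);
                return ret;
        }
        return 0;
}

There is no automatic remap: after a move_notify() the importer must re-map
under the resv lock before touching the memory again.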

block/blk-mq-dma.c

Lines changed: 1 addition & 1 deletion

@@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
 {
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
 	iter->len = vec->len;
 	return true;
 }
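Both this hunk and the dma-iommu.c one below reflect the "PCI/P2PDMA: Simplify
bus address mapping API" patch: pci_p2pdma_bus_addr_map() now takes the
p2pdma_provider (the .mem member of the old map state) rather than the whole
state. Assuming that reading of the series, the new helper shape is:

dma_addr_t pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider,
                                   phys_addr_t paddr);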

drivers/dma-buf/Makefile

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
-	 dma-fence-unwrap.o dma-resv.o
+	 dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
 obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
 obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
 obj-$(CONFIG_SYNC_FILE)		+= sync_file.o

drivers/dma-buf/dma-buf-mapping.c

Lines changed: 248 additions & 0 deletions

@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * DMA BUF Mapping Helpers
+ *
+ */
+#include <linux/dma-buf-mapping.h>
+#include <linux/dma-resv.h>
+
+static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
+					 dma_addr_t addr)
+{
+	unsigned int len, nents;
+	int i;
+
+	nents = DIV_ROUND_UP(length, UINT_MAX);
+	for (i = 0; i < nents; i++) {
+		len = min_t(size_t, length, UINT_MAX);
+		length -= len;
+		/*
+		 * DMABUF abuses scatterlist to create a scatterlist
+		 * that does not have any CPU list, only the DMA list.
+		 * Always set the page related values to NULL to ensure
+		 * importers can't use it. The phys_addr based DMA API
+		 * does not require the CPU list for mapping or unmapping.
+		 */
+		sg_set_page(sgl, NULL, 0, 0);
+		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_len(sgl) = len;
+		sgl = sg_next(sgl);
+	}
+
+	return sgl;
+}
+
+static unsigned int calc_sg_nents(struct dma_iova_state *state,
+				  struct dma_buf_phys_vec *phys_vec,
+				  size_t nr_ranges, size_t size)
+{
+	unsigned int nents = 0;
+	size_t i;
+
+	if (!state || !dma_use_iova(state)) {
+		for (i = 0; i < nr_ranges; i++)
+			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+	} else {
+		/*
+		 * In the IOVA case there is only one SG entry which spans
+		 * the whole IOVA address space, but we need to make sure
+		 * that it fits sg->length, maybe we need more.
+		 */
+		nents = DIV_ROUND_UP(size, UINT_MAX);
+	}
+
+	return nents;
+}
+
+/**
+ * struct dma_buf_dma - holds DMA mapping information
+ * @sgt:   Scatter-gather table
+ * @state: DMA IOVA state relevant in IOMMU-based DMA
+ * @size:  Total size of DMA transfer
+ */
+struct dma_buf_dma {
+	struct sg_table sgt;
+	struct dma_iova_state *state;
+	size_t size;
+};
+
+/**
+ * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
+ * from arrays of physical vectors. This function is intended for MMIO memory
+ * only.
+ * @attach:	[in]	attachment whose scatterlist is to be returned
+ * @provider:	[in]	p2pdma provider
+ * @phys_vec:	[in]	array of physical vectors
+ * @nr_ranges:	[in]	number of entries in phys_vec array
+ * @size:	[in]	total size of phys_vec
+ * @dir:	[in]	direction of DMA transfer
+ *
+ * Returns an sg_table containing the scatterlist; returns ERR_PTR on error.
+ * May return -EINTR if it is interrupted by a signal.
+ *
+ * On success, the DMA addresses and lengths in the returned scatterlist are
+ * PAGE_SIZE aligned.
+ *
+ * A mapping must be unmapped by using dma_buf_free_sgt().
+ *
+ * NOTE: This function is intended for exporters. If direct traffic routing is
+ * mandatory the exporter should call pci_p2pdma_map_type() before calling
+ * this function.
+ */
+struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
+					 struct p2pdma_provider *provider,
+					 struct dma_buf_phys_vec *phys_vec,
+					 size_t nr_ranges, size_t size,
+					 enum dma_data_direction dir)
+{
+	unsigned int nents, mapped_len = 0;
+	struct dma_buf_dma *dma;
+	struct scatterlist *sgl;
+	dma_addr_t addr;
+	size_t i;
+	int ret;
+
+	if (WARN_ON(!attach || !attach->dmabuf || !provider))
+		/* This function is supposed to work on MMIO memory only */
+		return ERR_PTR(-EINVAL);
+
+	dma_resv_assert_held(attach->dmabuf->resv);
+
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma)
+		return ERR_PTR(-ENOMEM);
+
+	switch (pci_p2pdma_map_type(provider, attach->dev)) {
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		/*
+		 * There is no need for an IOVA at all for this flow.
+		 */
+		break;
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
+		if (!dma->state) {
+			ret = -ENOMEM;
+			goto err_free_dma;
+		}
+
+		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
+		break;
+	default:
+		ret = -EINVAL;
+		goto err_free_dma;
+	}
+
+	nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
+	ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
+	if (ret)
+		goto err_free_state;
+
+	sgl = dma->sgt.sgl;
+
+	for (i = 0; i < nr_ranges; i++) {
+		if (!dma->state) {
+			addr = pci_p2pdma_bus_addr_map(provider,
+						       phys_vec[i].paddr);
+		} else if (dma_use_iova(dma->state)) {
+			ret = dma_iova_link(attach->dev, dma->state,
+					    phys_vec[i].paddr, 0,
+					    phys_vec[i].len, dir,
+					    DMA_ATTR_MMIO);
+			if (ret)
+				goto err_unmap_dma;
+
+			mapped_len += phys_vec[i].len;
+		} else {
+			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
+					    phys_vec[i].len, dir,
+					    DMA_ATTR_MMIO);
+			ret = dma_mapping_error(attach->dev, addr);
+			if (ret)
+				goto err_unmap_dma;
+		}
+
+		if (!dma->state || !dma_use_iova(dma->state))
+			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
+	}
+
+	if (dma->state && dma_use_iova(dma->state)) {
+		WARN_ON_ONCE(mapped_len != size);
+		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
+		if (ret)
+			goto err_unmap_dma;
+
+		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
+	}
+
+	dma->size = size;
+
+	/*
+	 * No CPU list is included, so set orig_nents = 0. Others can detect
+	 * this via the SG table (use nents only).
+	 */
+	dma->sgt.orig_nents = 0;
+
+	/*
+	 * SGL must be NULL to indicate that the previous SGL was the last one
+	 * and we allocated the correct number of entries in sg_alloc_table().
+	 */
+	WARN_ON_ONCE(sgl);
+	return &dma->sgt;
+
+err_unmap_dma:
+	if (!i || !dma->state) {
+		; /* Do nothing */
+	} else if (dma_use_iova(dma->state)) {
+		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
+				 DMA_ATTR_MMIO);
+	} else {
+		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
+			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+	}
+	sg_free_table(&dma->sgt);
+err_free_state:
+	kfree(dma->state);
+err_free_dma:
+	kfree(dma);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");
+
+/**
+ * dma_buf_free_sgt - unmaps the buffer
+ * @attach:	[in]	attachment to unmap buffer from
+ * @sgt:	[in]	scatterlist info of the buffer to unmap
+ * @dir:	[in]	direction of DMA transfer
+ *
+ * This unmaps a DMA mapping for @attach obtained
+ * by dma_buf_phys_vec_to_sgt().
+ */
+void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
+		      enum dma_data_direction dir)
+{
+	struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
+	int i;
+
+	dma_resv_assert_held(attach->dmabuf->resv);
+
+	if (!dma->state) {
+		; /* Do nothing */
+	} else if (dma_use_iova(dma->state)) {
+		dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
+				 DMA_ATTR_MMIO);
+	} else {
+		struct scatterlist *sgl;
+
+		for_each_sgtable_dma_sg(sgt, sgl, i)
+			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+	}
+
+	sg_free_table(sgt);
+	kfree(dma->state);
+	kfree(dma);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");
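To see how these helpers are meant to be consumed, here is a hypothetical
exporter sketch (my_exporter and its fields are assumptions; only
dma_buf_phys_vec_to_sgt()/dma_buf_free_sgt() come from this file) wiring them
into a dma_buf_ops map/unmap pair:

static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                       enum dma_data_direction dir)
{
        struct my_exporter *exp = attach->dmabuf->priv;

        /* exp->provider and exp->phys_vec describe the exported MMIO ranges */
        return dma_buf_phys_vec_to_sgt(attach, exp->provider, exp->phys_vec,
                                       exp->nr_ranges, exp->size, dir);
}

static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
                             struct sg_table *sgt,
                             enum dma_data_direction dir)
{
        dma_buf_free_sgt(attach, sgt, dir);
}

static const struct dma_buf_ops my_dmabuf_ops = {
        .map_dma_buf = my_map_dma_buf,
        .unmap_dma_buf = my_unmap_dma_buf,
};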

drivers/iommu/dma-iommu.c

Lines changed: 2 additions & 2 deletions

@@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
			 * as a bus address, __finalise_sg() will copy the dma
			 * address into the output segment.
			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-								 sg_phys(s));
+			s->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(s));
			sg_dma_len(s) = sg->length;
			sg_dma_mark_bus_address(s);
			continue;
