Commit 8ce720d

David Hildenbrand (Red Hat) authored and akpm00 committed
mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
As reported, ever since commit 1013af4 ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads. In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table. Locking prevents any of these PMD tables from getting reused before we drop the lock.

(2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI is broadcast by the last sharer.

Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc. succeeded (and suppressed an IPI). But there is no difference compared to GUP-fast simply sleeping for a while after grabbing the page and re-enabling IRQs. Most importantly, GUP-fast will never walk into page tables that are no longer shared, because the last sharer will issue an IPI broadcast. (If ever required, checking in GUP-fast whether the PUD changed after grabbing the page, like we do in the PTE case, could handle this.)

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure, so we can implement these optimizations and demystify the code at least a bit.

Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation. To make initialization of the mmu_gather easier when working on a single VMA (in particular, when dealing with hugetlb), provide tlb_gather_mmu_vma().

We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables(). From hugetlb code, we call huge_pmd_unshare_flush(), where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held. Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier.

Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing:

There are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE. Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs.

Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set. The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels, and it might have been problematic before this patch.

Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine.

Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy. This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync. Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one().

This optimization can be implemented on top of this, by checking, e.g., in tlb_remove_table_sync_one(), whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs.

Notes on TLB flushing changes:

(1) Flushing for non-shared PMD tables

We're converting from flush_hugetlb_tlb_range() to tlb_remove_huge_tlb_entry(). Given that we properly initialize the MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to __unmap_hugepage_range(), that should be fine.

(2) Flushing for shared PMD tables

We're converting from various things (flush_hugetlb_tlb_range(), tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range(). tlb_flush_pmd_range() achieves the same thing that tlb_remove_huge_tlb_entry() would achieve in these scenarios. Note that tlb_remove_huge_tlb_entry() also calls __tlb_remove_tlb_entry(); however, that is only implemented on powerpc, which does not support PMD table sharing. Similar to (1), tlb_gather_mmu_vma() should make sure that TLB flushing keeps on working as expected.

Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until this is fixed.

[david@kernel.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org
Fixes: 1013af4 ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman <loberman@redhat.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
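To make the intended call sequence concrete, here is a minimal sketch of how an unmap path is expected to drive the reworked API. tlb_gather_mmu_vma(), huge_pmd_unshare() and huge_pmd_unshare_flush() are the interfaces this patch adds or changes; the loop structure, the hugetlb_walk() lookup and the locking details are simplified for illustration and are not the actual hugetlb unmap code:

/*
 * Illustrative sketch only -- not the actual hugetlb unmap code. It shows
 * the ordering the commit message requires: all unsharing happens under
 * the i_mmap lock, one huge_pmd_unshare_flush() issues the (at most one)
 * TLB flush + IPI broadcast, and only then may the lock be dropped and
 * tlb_finish_mmu() be called.
 */
static void sketch_unshare_range(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
{
        struct address_space *mapping = vma->vm_file->f_mapping;
        struct mmu_gather tlb;
        unsigned long addr;
        pte_t *ptep;

        tlb_gather_mmu_vma(&tlb, vma);          /* new single-VMA variant */

        i_mmap_lock_write(mapping);
        for (addr = start; addr < end; addr += PUD_SIZE) {
                /* Simplified lookup; real code also takes the PTL here. */
                ptep = hugetlb_walk(vma, addr, PUD_SIZE);
                if (ptep)
                        huge_pmd_unshare(&tlb, vma, addr, ptep);
        }
        /*
         * Must run before the lock is dropped: afterwards, a fully
         * unshared table could be reused+modified by its new owner.
         */
        huge_pmd_unshare_flush(&tlb, vma);
        i_mmap_unlock_write(mapping);

        /* tlb_finish_mmu() would VM_WARN_ON_ONCE() had the flush been skipped. */
        tlb_finish_mmu(&tlb);
}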
1 parent a8682d5 commit 8ce720d

6 files changed

Lines changed: 208 additions & 66 deletions


include/asm-generic/tlb.h

Lines changed: 75 additions & 2 deletions
@@ -46,7 +46,8 @@
  *
  * The mmu_gather API consists of:
  *
- *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_finish_mmu()
+ *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_gather_mmu_vma() /
+ *    tlb_finish_mmu()
  *
  *  start and finish a mmu_gather
  *
@@ -364,6 +365,20 @@ struct mmu_gather {
         unsigned int vma_huge : 1;
         unsigned int vma_pfn : 1;
 
+        /*
+         * Did we unshare (unmap) any shared page tables? For now only
+         * used for hugetlb PMD table sharing.
+         */
+        unsigned int unshared_tables : 1;
+
+        /*
+         * Did we unshare any page tables such that they are now exclusive
+         * and could get reused+modified by the new owner? When setting this
+         * flag, "unshared_tables" will be set as well. For now only used
+         * for hugetlb PMD table sharing.
+         */
+        unsigned int fully_unshared_tables : 1;
+
         unsigned int batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -400,6 +415,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
         tlb->cleared_pmds = 0;
         tlb->cleared_puds = 0;
         tlb->cleared_p4ds = 0;
+        tlb->unshared_tables = 0;
         /*
          * Do not reset mmu_gather::vma_* fields here, we do not
          * call into tlb_start_vma() again to set them if there is an
@@ -484,7 +500,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
          * these bits.
          */
         if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
-              tlb->cleared_puds || tlb->cleared_p4ds))
+              tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
                 return;
 
         tlb_flush(tlb);
@@ -773,6 +789,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
+                unsigned long addr)
+{
+        /*
+         * The caller must make sure that concurrent unsharing + exclusive
+         * reuse is impossible until tlb_flush_unshared_tables() was called.
+         */
+        VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
+        ptdesc_pmd_pts_dec(pt);
+
+        /* Clearing a PUD pointing at a PMD table with PMD leaves. */
+        tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
+
+        /*
+         * If the page table is now exclusively owned, we fully unshared
+         * a page table.
+         */
+        if (!ptdesc_pmd_is_shared(pt))
+                tlb->fully_unshared_tables = true;
+        tlb->unshared_tables = true;
+}
+
+static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
+{
+        /*
+         * As soon as the caller drops locks to allow for reuse of
+         * previously-shared tables, these tables could get modified and
+         * even reused outside of hugetlb context, so we have to make sure that
+         * any page table walkers (incl. TLB, GUP-fast) are aware of that
+         * change.
+         *
+         * Even if we are not fully unsharing a PMD table, we must
+         * flush the TLB for the unsharer now.
+         */
+        if (tlb->unshared_tables)
+                tlb_flush_mmu_tlbonly(tlb);
+
+        /*
+         * Similarly, we must make sure that concurrent GUP-fast will not
+         * walk previously-shared page tables that are getting modified+reused
+         * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
+         *
+         * We only perform this when we are the last sharer of a page table,
+         * as the IPI will reach all CPUs: any GUP-fast.
+         *
+         * Note that on configs where tlb_remove_table_sync_one() is a NOP,
+         * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
+         * required IPIs already for us.
+         */
+        if (tlb->fully_unshared_tables) {
+                tlb_remove_table_sync_one();
+                tlb->fully_unshared_tables = false;
+        }
+}
+#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
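The last-sharer logic above can be illustrated with a small self-contained userspace model (plain C; every toy_* name is invented for this sketch, and the integer share counts stand in for the ptdesc_pmd_pts_* accounting): each unshare requests a deferred TLB flush, but the expensive IPI sync is only requested once some table's sharer count actually drops to one.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the two mmu_gather bits added by this patch. */
struct toy_gather {
        bool unshared_tables;           /* any unshare happened */
        bool fully_unshared_tables;     /* some table became exclusive */
};

/* 'users' is the share count of one PMD table, including us. */
static void toy_unshare(struct toy_gather *tlb, int *users)
{
        (*users)--;                     /* models ptdesc_pmd_pts_dec() */
        tlb->unshared_tables = true;
        if (*users == 1)                /* now exclusively owned */
                tlb->fully_unshared_tables = true;
}

static void toy_flush_unshared(struct toy_gather *tlb)
{
        if (tlb->unshared_tables)
                printf("one TLB flush covering all unshared tables\n");
        if (tlb->fully_unshared_tables)
                printf("one IPI broadcast (we were some table's last sharer)\n");
        else
                printf("no IPI: the eventual last sharer will broadcast one\n");
}

int main(void)
{
        struct toy_gather tlb = { false, false };
        int table_a = 3, table_b = 2;   /* share counts including us */

        toy_unshare(&tlb, &table_a);    /* 3 -> 2: still shared */
        toy_unshare(&tlb, &table_b);    /* 2 -> 1: fully unshared */
        toy_flush_unshared(&tlb);
        return 0;
}

Run over two tables with share counts 3 and 2, this prints one flush and one IPI in total, whereas the pre-patch behavior corresponded to one IPI broadcast per unshared table.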

include/linux/hugetlb.h

Lines changed: 10 additions & 5 deletions
@@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_offset(struct mm_struct *mm,
                        unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-                unsigned long addr, pte_t *ptep);
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+                unsigned long addr, pte_t *ptep);
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                 unsigned long *start, unsigned long *end);
 
@@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
         return NULL;
 }
 
-static inline int huge_pmd_unshare(struct mm_struct *mm,
-                struct vm_area_struct *vma,
-                unsigned long addr, pte_t *ptep)
+static inline int huge_pmd_unshare(struct mmu_gather *tlb,
+                struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
         return 0;
 }
 
+static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
+                struct vm_area_struct *vma)
+{
+}
+
 static inline void adjust_range_if_pmd_sharing_possible(
                 struct vm_area_struct *vma,
                 unsigned long *start, unsigned long *end)
include/linux/mm_types.h

Lines changed: 1 addition & 0 deletions
@@ -1530,6 +1530,7 @@ static inline unsigned int mm_cid_size(void)
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
+void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
 struct vm_fault;
