
Commit cb1fa2e

hying-caritas authored and ctmarinas committed
arm64, tlbflush: don't TLBI broadcast if page reused in write fault
A multi-threaded customer workload with a large memory footprint uses fork()/exec() to run some external programs every few tens of seconds. When running the workload on an arm64 server machine, quite some CPU cycles are observed to be spent in the TLB flushing functions, while on an x86_64 server machine they are not. This makes the performance on arm64 much worse than that on x86_64.

During the workload run, after fork()/exec() write-protects all pages in the parent process, memory writes in the parent process cause write protection faults. The page fault handler then makes the PTE/PDE writable if the page can be reused, which is almost always true in this workload. On arm64, to avoid a write protection fault on other CPUs, the page fault handler flushes the TLB globally with TLBI broadcast after changing the PTE/PDE. However, this isn't always necessary. Firstly, it's safe to leave some stale read-only TLB entries as long as they will be flushed eventually. Secondly, it's quite possible that the original read-only PTE/PDEs aren't cached in the remote TLBs at all if the memory footprint is large. In fact, on x86_64 the page fault handler doesn't flush the remote TLBs in this situation, which benefits performance a lot.

To improve the performance on arm64, make the write protection fault handler flush the TLB locally instead of globally via TLBI broadcast after making the PTE/PDE writable. If there are stale read-only TLB entries on remote CPUs, the page fault handler on those CPUs will regard the fault as spurious and flush the stale TLB entries.

To test the patchset, make usemem.c from vm-scalability (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git) support calling fork()/exec() periodically. To mimic the behavior of the customer workload, run usemem with 4 threads accessing 100GB of memory, and call fork()/exec() every 40 seconds.
Test results show that with the patchset the usemem score improves by ~40.6%, and the cycles% of the TLB flush functions drops from ~50.5% to ~0.3% in the perf profile.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
1 parent 79301c7 commit cb1fa2e

4 files changed

Lines changed: 72 additions & 9 deletions


arch/arm64/include/asm/pgtable.h

Lines changed: 9 additions & 5 deletions
@@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
 #endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * Outside of a few very special situations (e.g. hibernation), we always
- * use broadcast TLB invalidation instructions, therefore a spurious page
- * fault on one CPU which has been handled concurrently by another CPU
- * does not need to perform additional invalidation.
+ * We use local TLB invalidation instruction when reusing page in
+ * write protection fault handler to avoid TLBI broadcast in the hot
+ * path. This will cause spurious page faults if stale read-only TLB
+ * entries exist.
  */
-#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
+#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
+	local_flush_tlb_page_nonotify(vma, address)
+
+#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
+	local_flush_tlb_page_nonotify(vma, address)
 
 /*
  * ZERO_PAGE is a global shared page that is always zero: used

arch/arm64/include/asm/tlbflush.h

Lines changed: 56 additions & 0 deletions
@@ -249,6 +249,19 @@ static inline unsigned long get_trans_granule(void)
  *		cannot be easily determined, the value TLBI_TTL_UNKNOWN will
  *		perform a non-hinted invalidation.
  *
+ *	local_flush_tlb_page(vma, addr)
+ *		Local variant of flush_tlb_page(). Stale TLB entries may
+ *		remain in remote CPUs.
+ *
+ *	local_flush_tlb_page_nonotify(vma, addr)
+ *		Same as local_flush_tlb_page() except MMU notifier will not be
+ *		called.
+ *
+ *	local_flush_tlb_contpte(vma, addr)
+ *		Invalidate the virtual-address range
+ *		'[addr, addr+CONT_PTE_SIZE)' mapped with contpte on local CPU
+ *		for the user address space corresponding to 'vma->mm'. Stale
+ *		TLB entries may remain in remote CPUs.
  *
  *	Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
  *	on top of these routines, since that is our interface to the mmu_gather
@@ -282,6 +295,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
+static inline void __local_flush_tlb_page_nonotify_nosync(struct mm_struct *mm,
+							  unsigned long uaddr)
+{
+	unsigned long addr;
+
+	dsb(nshst);
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
+	__tlbi(vale1, addr);
+	__tlbi_user(vale1, addr);
+}
+
+static inline void local_flush_tlb_page_nonotify(struct vm_area_struct *vma,
+						 unsigned long uaddr)
+{
+	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
+	dsb(nsh);
+}
+
+static inline void local_flush_tlb_page(struct vm_area_struct *vma,
+					unsigned long uaddr)
+{
+	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
+						    (uaddr & PAGE_MASK) + PAGE_SIZE);
+	dsb(nsh);
+}
+
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					   unsigned long uaddr)
 {
@@ -472,6 +512,22 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	unsigned long asid;
+
+	addr = round_down(addr, CONT_PTE_SIZE);
+
+	dsb(nshst);
+	asid = ASID(vma->vm_mm);
+	__flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid,
+			     3, true, lpa2_is_enabled());
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr,
+						    addr + CONT_PTE_SIZE);
+	dsb(nsh);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end)
 {

arch/arm64/mm/contpte.c

Lines changed: 1 addition & 2 deletions
@@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
 			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
 
 		if (dirty)
-			__flush_tlb_range(vma, start_addr, addr,
-					  PAGE_SIZE, true, 3);
+			local_flush_tlb_contpte(vma, start_addr);
 	} else {
 		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
 		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);

arch/arm64/mm/fault.c

Lines changed: 6 additions & 2 deletions
@@ -233,9 +233,13 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);
 
-	/* Invalidate a stale read-only entry */
+	/*
+	 * Invalidate the local stale read-only entry. Remote stale entries
+	 * may still cause page faults and be invalidated via
+	 * flush_tlb_fix_spurious_fault().
+	 */
 	if (dirty)
-		flush_tlb_page(vma, address);
+		local_flush_tlb_page(vma, address);
 	return 1;
 }
