Skip to content

Commit 0ea148a

Browse files
ryncsnakpm00
authored andcommitted
mm: userfaultfd: fix race of userfaultfd_move and swap cache
This commit fixes two kinds of races, they may have different results: Barry reported a BUG_ON in commit c50f8e6, we may see the same BUG_ON if the filemap lookup returned NULL and folio is added to swap cache after that. If another kind of race is triggered (folio changed after lookup) we may see RSS counter is corrupted: [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1 [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1 Because the folio is being accounted to the wrong VMA. I'm not sure if there will be any data corruption though, seems no. The issues above are critical already. On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache lookup, and tries to move the found folio to the faulting vma. Currently, it relies on checking the PTE value to ensure that the moved folio still belongs to the src swap entry and that no new folio has been added to the swap cache, which turns out to be unreliable. While working and reviewing the swap table series with Barry, following existing races are observed and reproduced [1]: In the example below, move_pages_pte is moving src_pte to dst_pte, where src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the swap cache: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1 ... < interrupted> ... <swapin src_pte, alloc and use folio A> // folio A is a new allocated folio // and get installed into src_pte <frees swap entry S1> // src_pte now points to folio A, S1 // has swap count == 0, it can be freed // by folio_swap_swap or swap // allocator's reclaim. <try to swap out another folio B> // folio B is a folio in another VMA. <put folio B to swap cache using S1 > // S1 is freed, folio B can use it // for swap out with no problem. ... folio = filemap_get_folio(S1) // Got folio B here !!! ... < interrupted again> ... <swapin folio B and free S1> // Now S1 is free to be used again. <swapout src_pte & folio A using S1> // Now src_pte is a swap entry PTE // holding S1 again. folio_trylock(folio) move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // Moved invalid folio B here !!! The race window is very short and requires multiple collisions of multiple rare events, so it's very unlikely to happen, but with a deliberately constructed reproducer and increased time window, it can be reproduced easily. This can be fixed by checking if the folio returned by filemap is the valid swap cache folio after acquiring the folio lock. Another similar race is possible: filemap_get_folio may return NULL, but folio (A) could be swapped in and then swapped out again using the same swap entry after the lookup. In such a case, folio (A) may remain in the swap cache, so it must be moved too: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1, and S1 is not in swap cache folio = filemap_get_folio(S1) // Got NULL ... < interrupted again> ... <swapin folio A and free S1> <swapout folio A re-using S1> move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // folio A is ignored !!! Fix this by checking the swap cache again after acquiring the src_pte lock. And to avoid the filemap overhead, we check swap_map directly [2]. The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far we don't need to worry about that, since folios can only be exposed to the swap cache in the swap out path, and this is covered in this patch by checking the swap cache again after acquiring the src_pte lock. Testing with a simple C program that allocates and moves several GB of memory did not show any observable performance change. Link: https://lkml.kernel.org/r/20250604151038.21968-1-ryncsn@gmail.com Fixes: adef440 ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Kairui Song <kasong@tencent.com> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2] Reviewed-by: Lokesh Gidra <lokeshgidra@google.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chris Li <chrisl@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 517f496 commit 0ea148a

1 file changed

Lines changed: 31 additions & 2 deletions

File tree

mm/userfaultfd.c

Lines changed: 31 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
10841084
pte_t orig_dst_pte, pte_t orig_src_pte,
10851085
pmd_t *dst_pmd, pmd_t dst_pmdval,
10861086
spinlock_t *dst_ptl, spinlock_t *src_ptl,
1087-
struct folio *src_folio)
1087+
struct folio *src_folio,
1088+
struct swap_info_struct *si, swp_entry_t entry)
10881089
{
1090+
/*
1091+
* Check if the folio still belongs to the target swap entry after
1092+
* acquiring the lock. Folio can be freed in the swap cache while
1093+
* not locked.
1094+
*/
1095+
if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
1096+
entry.val != src_folio->swap.val))
1097+
return -EAGAIN;
1098+
10891099
double_pt_lock(dst_ptl, src_ptl);
10901100

10911101
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
11021112
if (src_folio) {
11031113
folio_move_anon_rmap(src_folio, dst_vma);
11041114
src_folio->index = linear_page_index(dst_vma, dst_addr);
1115+
} else {
1116+
/*
1117+
* Check if the swap entry is cached after acquiring the src_pte
1118+
* lock. Otherwise, we might miss a newly loaded swap cache folio.
1119+
*
1120+
* Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
1121+
* We are trying to catch newly added swap cache, the only possible case is
1122+
* when a folio is swapped in and out again staying in swap cache, using the
1123+
* same entry before the PTE check above. The PTL is acquired and released
1124+
* twice, each time after updating the swap_map's flag. So holding
1125+
* the PTL here ensures we see the updated value. False positive is possible,
1126+
* e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
1127+
* cache, or during the tiny synchronization window between swap cache and
1128+
* swap_map, but it will be gone very quickly, worst result is retry jitters.
1129+
*/
1130+
if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
1131+
double_pt_unlock(dst_ptl, src_ptl);
1132+
return -EAGAIN;
1133+
}
11051134
}
11061135

11071136
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
14121441
}
14131442
err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
14141443
orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
1415-
dst_ptl, src_ptl, src_folio);
1444+
dst_ptl, src_ptl, src_folio, si, entry);
14161445
}
14171446

14181447
out:

0 commit comments

Comments
 (0)