
Commit afb1352

Hugh Dickins authored and gregkh committed
mm/thp: fix deferred split unqueue naming and locking
commit f8f931b upstream.

Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).

Just omit the folio_unqueue_deferred_split() from free_unref_folios(),
all of whose callers now call it beforehand (and if any forget then
bad_page() will tell) - except for its caller put_pages_list(), which
itself no longer has any callers (and will be deleted separately).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to
0 here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.

That goes back to 5.4 commit 87eaceb ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before
adding to deferred queue, but no check on deferred queue before adding
THP to swapcache. That worked fine with the usual sequence of events in
reclaim (though there were a couple of rare ways in which a THP on
deferred queue could have been swapped out), but 6.12 commit dafff3f
("mm: split underused THPs") avoids splitting underused THPs in reclaim,
which makes swapcache THPs on deferred queue commonplace.

Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.

The 87eaceb commit did have the code to move from one deferred list to
another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a
deferred list. As above, false in rare cases, and now commonly false.

Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included.
There is not a strong case for backports, but they can fix cornercases.

Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
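The swapout ordering problem described in the message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the types and functions below (split_queue, memcg_model, folio_model, queue_for() and so on) are hypothetical stand-ins, not kernel APIs, and the real spinlock is omitted. The one property it models is that the queue, and hence the lock, is chosen from the folio's current memcg, so clearing the memcg before unqueueing makes the unqueue operate on the wrong queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these types are the kernel's. */
struct split_queue {
	int len;		/* number of folios accounted on this queue */
};

struct memcg_model {
	struct split_queue queue;
};

struct folio_model {
	struct memcg_model *memcg;	/* stands in for folio->memcg_data */
	struct split_queue *queued_on;	/* NULL when not on a deferred list */
};

struct split_queue node_queue;	/* stands in for the pgdat's queue */

/* Queue (and hence lock) selection follows the folio's current memcg. */
struct split_queue *queue_for(struct folio_model *f)
{
	return f->memcg ? &f->memcg->queue : &node_queue;
}

void model_queue(struct folio_model *f)
{
	struct split_queue *q = queue_for(f);

	assert(f->queued_on == NULL);	/* a folio is queued at most once */
	q->len++;
	f->queued_on = q;
}

/*
 * Unqueue using the queue selected now. Returns true if that was the queue
 * the folio was really on, false if the wrong queue was updated.
 */
bool model_unqueue(struct folio_model *f)
{
	struct split_queue *locked = queue_for(f);
	bool consistent;

	if (!f->queued_on)
		return true;
	consistent = (locked == f->queued_on);
	locked->len--;		/* accounting lands on whichever queue was chosen */
	f->queued_on = NULL;
	return consistent;
}

/* Old order: reset the memcg first, unqueue later (as at free time). */
bool swapout_reset_then_unqueue(struct folio_model *f)
{
	f->memcg = NULL;
	return model_unqueue(f);
}

/* New order: unqueue while the memcg is still set, then reset it. */
bool swapout_unqueue_then_reset(struct folio_model *f)
{
	bool ok = model_unqueue(f);

	f->memcg = NULL;
	return ok;
}
```

In this model the old order leaves the memcg's queue length stale and drives the node queue's length negative, which is the model's analogue of the list corruption the message describes.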
1 parent 1e58fe6 commit afb1352

8 files changed

Lines changed: 67 additions & 24 deletions


mm/huge_memory.c

Lines changed: 26 additions & 9 deletions
@@ -3268,18 +3268,38 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }

-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;
+	bool unqueued = false;
+
+	WARN_ON_ONCE(folio_ref_count(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));

 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
 		list_del_init(&folio->_deferred_list);
+		unqueued = true;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+	return unqueued;	/* useful for debug warnings */
 }

 void deferred_split_folio(struct folio *folio)
@@ -3298,14 +3318,11 @@ void deferred_split_folio(struct folio *folio)
 		return;

 	/*
-	 * The try_to_unmap() in page reclaim path might reach here too,
-	 * this may cause a race condition to corrupt deferred split queue.
-	 * And, if page reclaim is already handling the same folio, it is
-	 * unnecessary to handle it again in shrinker.
-	 *
-	 * Check the swapcache flag to determine if the folio is being
-	 * handled by page reclaim since THP swap would add the folio into
-	 * swap cache before calling try_to_unmap().
+	 * Exclude swapcache: originally to avoid a corrupt deferred split
+	 * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+	 * but if page reclaim is already handling the same folio, it is
+	 * unnecessary to handle it again in the shrinker, so excluding
+	 * swapcache here may still be a useful optimization.
 	 */
 	if (folio_test_swapcache(folio))
 		return;

mm/internal.h

Lines changed: 5 additions & 5 deletions
@@ -631,21 +631,21 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }

-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
 {
 	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
-		return;
+		return false;

 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
 	 * to check without acquiring the split_queue_lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
+		return false;

-	__folio_undo_large_rmappable(folio);
+	return __folio_unqueue_deferred_split(folio);
 }

 static inline struct folio *page_rmappable_folio(struct page *page)
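The wrapper/callee split above follows a common pattern: an unlocked peek to skip the common empty case, then an authoritative recheck under the lock, with a bool result that callers can feed into a warning. The following is a minimal single-threaded sketch of that pattern, not kernel code. All names are hypothetical, and the lock is replaced by a counter (lock_taken) so the fast path can be observed; a real implementation needs an actual lock and the refcount guarantees described in the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, in the style of list_head. */
struct list_node {
	struct list_node *prev, *next;
};

void list_init(struct list_node *n)
{
	n->prev = n->next = n;
}

bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

/* Hypothetical queue: lock_taken counts acquisitions instead of locking. */
struct split_queue_model {
	struct list_node head;
	int len;
	int lock_taken;
};

struct folio_model {
	struct list_node deferred;
};

void enqueue(struct split_queue_model *q, struct folio_model *f)
{
	struct list_node *n = &f->deferred;

	assert(list_is_empty(n));	/* must not already be queued */
	n->prev = q->head.prev;
	n->next = &q->head;
	q->head.prev->next = n;
	q->head.prev = n;
	q->len++;
}

/* Slow path: recheck under the "lock" and report what happened. */
bool slow_unqueue(struct split_queue_model *q, struct folio_model *f)
{
	struct list_node *n = &f->deferred;
	bool unqueued = false;

	q->lock_taken++;		/* stands in for taking the queue lock */
	if (!list_is_empty(n)) {
		n->prev->next = n->next;
		n->next->prev = n->prev;
		list_init(n);
		q->len--;
		unqueued = true;
	}
	/* the queue lock would be released here */
	return unqueued;
}

/* Fast path: an unlocked peek avoids the lock when nothing is queued. */
bool unqueue(struct split_queue_model *q, struct folio_model *f)
{
	if (list_is_empty(&f->deferred))
		return false;
	return slow_unqueue(q, f);
}
```

Returning the bool lets a caller that expects the folio to be off the queue already wrap the call in a warning, which is how the commit uses WARN_ON_ONCE(folio_unqueue_deferred_split(folio)).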

mm/memcontrol-v1.c

Lines changed: 25 additions & 0 deletions
@@ -845,6 +845,8 @@ static int mem_cgroup_move_account(struct folio *folio,
 	css_get(&to->css);
 	css_put(&from->css);

+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 	folio->memcg_data = (unsigned long)to;

 	__folio_memcg_unlock(from);
@@ -1214,7 +1216,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct folio *folio;
+	bool tried_split_before = false;

+retry_pmd:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (mc.precharge < HPAGE_PMD_NR) {
@@ -1224,6 +1228,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			folio = target.folio;
+			/*
+			 * Deferred split queue locking depends on memcg,
+			 * and unqueue is unsafe unless folio refcount is 0:
+			 * split or skip if on the queue? first try to split.
+			 */
+			if (!list_empty(&folio->_deferred_list)) {
+				spin_unlock(ptl);
+				if (!tried_split_before)
+					split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (tried_split_before)
+					return 0;
+				tried_split_before = true;
+				goto retry_pmd;
+			}
+			/*
+			 * So long as that pmd lock is held, the folio cannot
+			 * be racily added to the _deferred_list, because
+			 * __folio_remove_rmap() will find !partially_mapped.
+			 */
 			if (folio_isolate_lru(folio)) {
 				if (!mem_cgroup_move_account(folio, true,
 							     mc.from, mc.to)) {
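The control flow added to mem_cgroup_move_charge_pte_range() is "try to split once, then skip if the folio is still queued". A reduced sketch of just that decision follows. It is a simplified model, not the kernel function: thp_model, try_split() and move_charge() are hypothetical, the split_succeeds field is an illustrative knob, and locking, reference counting and the PTE-level path taken after a successful split are all left out.

```c
#include <assert.h>
#include <stdbool.h>

enum move_result { MOVED, SKIPPED };

/* Hypothetical stand-in for a THP that may sit on a deferred split list. */
struct thp_model {
	bool on_deferred_list;
	bool split_succeeds;	/* illustrative knob: would a split succeed? */
	int split_attempts;
};

/* In this model a successful split takes the folio off the deferred list. */
void try_split(struct thp_model *t)
{
	assert(t->on_deferred_list);	/* only attempted for queued folios */
	t->split_attempts++;
	if (t->split_succeeds)
		t->on_deferred_list = false;
}

/*
 * Split-or-skip: a queued folio gets exactly one split attempt; if it is
 * still queued on the retry, it is skipped rather than moved.
 */
enum move_result move_charge(struct thp_model *t)
{
	bool tried_split_before = false;

retry:
	if (t->on_deferred_list) {
		if (tried_split_before)
			return SKIPPED;
		try_split(t);
		tried_split_before = true;
		goto retry;
	}
	return MOVED;	/* safe to change the memcg: folio is not queued */
}
```

The single retry keeps the loop bounded: a folio that cannot be split is left charged to its old memcg instead of being unqueued while its refcount is non-zero.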

mm/memcontrol.c

Lines changed: 5 additions & 3 deletions
@@ -4604,9 +4604,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;

 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
-			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list), folio);

 	/*
 	 * Nobody should be changing or seriously looking at
@@ -4653,6 +4650,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;

+		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 		folio->memcg_data = 0;
 	}

@@ -4769,6 +4767,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)

 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
+
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
 	old->memcg_data = 0;
 }

@@ -4955,6 +4956,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);

+	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;

 	if (!mem_cgroup_is_root(memcg))

mm/migrate.c

Lines changed: 2 additions & 2 deletions
@@ -415,7 +415,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		    folio_test_large_rmappable(folio)) {
 			if (!folio_ref_freeze(folio, expected_count))
 				return -EAGAIN;
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			folio_ref_unfreeze(folio, expected_count);
 		}

@@ -438,7 +438,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}

 	/* Take off deferred split queue while frozen and memcg set */
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);

 	/*
 	 * Now we know that no one else is looking at the folio:

mm/page_alloc.c

Lines changed: 0 additions & 1 deletion
@@ -2663,7 +2663,6 @@ void free_unref_folios(struct folio_batch *folios)
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);

-		folio_undo_large_rmappable(folio);
 		if (!free_pages_prepare(&folio->page, order))
 			continue;
 		/*

mm/swap.c

Lines changed: 2 additions & 2 deletions
@@ -123,7 +123,7 @@ void __folio_put(struct folio *folio)
 	}

 	page_cache_release(folio);
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, folio_order(folio));
 }
@@ -1020,7 +1020,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			free_huge_folio(folio);
 			continue;
 		}
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		__page_cache_release(folio, &lruvec, &flags);

 		if (j != i)

mm/vmscan.c

Lines changed: 2 additions & 2 deletions
@@ -1462,7 +1462,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 */
 		nr_reclaimed += nr_pages;

-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		if (folio_batch_add(&free_folios, folio) == 0) {
 			mem_cgroup_uncharge_folios(&free_folios);
 			try_to_unmap_flush();
@@ -1849,7 +1849,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);

-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);
