
Commit 1bc545b

yosrym93 authored and akpm00 committed
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
When memory.reclaim was introduced, it became the first case where cgroup_reclaim() is true for the root cgroup. Johannes concluded [1] that for most cases this is okay, except for one case.

Historically, kswapd would throttle reclaim on a node if a lot of pages marked for reclaim are under writeback (aka the node is congested). This occurred by setting the LRUVEC_CONGESTED bit in lruvec->flags. The bit would be cleared when the node is balanced. Similarly, cgroup reclaim would set the same bit when an lruvec is congested, and clear it on the way out of reclaim (to throttle local reclaimers).

Before the introduction of memory.reclaim, the root memcg was the only target of kswapd reclaim, and non-root memcgs were the only targets of cgroup reclaim, so they would never interfere. Using the same bit for both was fine. After memory.reclaim, it is possible for cgroup reclaim on the root cgroup to clear the bit set by kswapd. This would result in reclaim on the node being unthrottled before the node is balanced.

Fix this by introducing separate bits for cgroup-level and node-level congestion. kswapd can unthrottle an lruvec that is marked as congested by cgroup reclaim (as the entire node should no longer be congested), but not vice versa (to prevent premature unthrottling before the entire node is balanced).

[1] https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/

Link: https://lkml.kernel.org/r/20230621023101.432780-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
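The invariant the patch establishes can be sketched as a minimal userspace model in plain C (not kernel code; the helper names cgroup_reclaim_done() and node_balanced() are illustrative stand-ins for the do_try_to_free_pages() and clear_pgdat_congested() paths, and the bit helpers approximate the kernel's atomic set_bit()/clear_bit()/test_bit()):

```c
#include <stdbool.h>

/* Bit positions mirror the new enum lruvec_flags. */
enum { LRUVEC_CGROUP_CONGESTED, LRUVEC_NODE_CONGESTED };

static void set_flag(int bit, unsigned long *flags)   { *flags |=  (1UL << bit); }
static void clear_flag(int bit, unsigned long *flags) { *flags &= ~(1UL << bit); }
static bool test_flag(int bit, unsigned long *flags)  { return *flags & (1UL << bit); }

/* Cgroup reclaim exiting (e.g. root memory.reclaim) clears only its
 * own bit; it must not touch the node-level bit owned by kswapd. */
static void cgroup_reclaim_done(unsigned long *flags)
{
	clear_flag(LRUVEC_CGROUP_CONGESTED, flags);
}

/* kswapd balanced the node: the whole node is no longer congested,
 * so it may clear both bits, unthrottling cgroup reclaimers too. */
static void node_balanced(unsigned long *flags)
{
	clear_flag(LRUVEC_NODE_CONGESTED, flags);
	clear_flag(LRUVEC_CGROUP_CONGESTED, flags);
}

/* Direct reclaim throttles if either bit is set, matching the
 * updated test in shrink_node(). */
static bool should_throttle(unsigned long *flags)
{
	return test_flag(LRUVEC_CGROUP_CONGESTED, flags) ||
	       test_flag(LRUVEC_NODE_CONGESTED, flags);
}
```

With the old single bit, the cgroup_reclaim_done() step would have wiped kswapd's state; with two bits, a node marked congested by kswapd stays throttled across a root-level memory.reclaim cycle.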
1 parent 7a70447 commit 1bc545b

2 files changed: 27 additions & 10 deletions

include/linux/mmzone.h

Lines changed: 15 additions & 3 deletions
@@ -293,9 +293,21 @@ static inline bool is_active_lru(enum lru_list lru)
 #define ANON_AND_FILE 2
 
 enum lruvec_flags {
-	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
-					 * backed by a congested BDI
-					 */
+	/*
+	 * An lruvec has many dirty pages backed by a congested BDI:
+	 * 1. LRUVEC_CGROUP_CONGESTED is set by cgroup-level reclaim.
+	 *    It can be cleared by cgroup reclaim or kswapd.
+	 * 2. LRUVEC_NODE_CONGESTED is set by kswapd node-level reclaim.
+	 *    It can only be cleared by kswapd.
+	 *
+	 * Essentially, kswapd can unthrottle an lruvec throttled by cgroup
+	 * reclaim, but not vice versa. This only applies to the root cgroup.
+	 * The goal is to prevent cgroup reclaim on the root cgroup (e.g.
+	 * memory.reclaim) to unthrottle an unbalanced node (that was throttled
+	 * by kswapd).
+	 */
+	LRUVEC_CGROUP_CONGESTED,
+	LRUVEC_NODE_CONGESTED,
 };
 
 #endif /* !__GENERATING_BOUNDS_H */

mm/vmscan.c

Lines changed: 12 additions & 7 deletions
@@ -6578,10 +6578,13 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * Legacy memcg will stall in page writeback so avoid forcibly
 	 * stalling in reclaim_throttle().
 	 */
-	if ((current_is_kswapd() ||
-	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
-	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
+	if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
+		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
+			set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
+
+		if (current_is_kswapd())
+			set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
+	}
 
 	/*
 	 * Stall direct reclaim for IO completions if the lruvec is
@@ -6591,7 +6594,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 */
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
-	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
+	    (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
+	     test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
 		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
 
 	if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc))
@@ -6848,7 +6852,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
 						   zone->zone_pgdat);
-			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+			clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
 		}
 	}
 
@@ -7237,7 +7241,8 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
 {
 	struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
 
-	clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+	clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
+	clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
 	clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
