
Commit db975de

Author: Martin KaFai Lau (committed)
Merge branch 'remove-task-and-cgroup-local-storage-percpu-counters'
Amery Hung says:

====================
Remove task and cgroup local storage percpu counters

* Motivation *

The goal of this patchset is to make the bpf syscalls and helpers that update task and cgroup local storage more robust by removing the percpu counters in them.

Task local storage and cgroup storage each employ a percpu counter to prevent deadlock caused by recursion. Since the underlying bpf local storage takes spinlocks in various operations, bpf programs running recursively may try to take a spinlock that is already taken. For example, if a tracing bpf program invoked recursively during bpf_task_storage_get(..., F_CREATE) tries to call bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock if the percpu counter is not in place.

However, the percpu counter can also cause bpf syscalls or helpers to return errors spuriously as soon as another thread is updating the local storage or the local storage map. Ideally, the two threads could take turns acquiring the locks and perform their respective jobs. Due to the percpu counter, though, the syscalls and helpers can return -EBUSY even when neither of them runs recursively inside the other: all it takes is for the two threads to run on the same CPU. This happened when BPF-CI ran the task local data selftest. Since CI runs the test on a VM with 2 CPUs, bpf_task_storage_get(..., F_CREATE) can easily fail. This failure mode is bad for users, who need to add retry logic in user space or in bpf programs to avoid it. Even with retries, there is no guaranteed upper bound on the number of attempts before a call succeeds.

Therefore, this patchset removes the percpu counter and makes the related bpf syscalls and helpers more reliable, while still making sure recursion deadlock cannot happen, with the help of the resilient queued spinlock (rqspinlock).

* Implementation *

To remove the percpu counter without introducing deadlock, bpf_local_storage is refactored by changing the locks from raw_spin_lock to rqspinlock, which prevents deadlock through deadlock detection and a timeout mechanism. The refactor basically replaces the locks with rqspinlock and propagates errors returned by the locking function to the BPF helpers and syscalls.

bpf_selem_unlink_nofail() is introduced to handle rqspinlock errors in two lock-acquiring functions that cannot fail, bpf_local_storage_destroy() and bpf_local_storage_map_free() (i.e., the local storage is being freed by the subsystem or the map is being freed). The high-level idea is to use a bitfield and atomic operations to track who is referencing an selem when a lock cannot be acquired. Additional care is needed to make sure special fields are freed and owner memory is uncharged safely and correctly.

If you are not familiar with local storage, the appendix briefly describes the locks and structure of local storage. It also shows the abbreviations used in the rest of this letter.
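To make the lock conversion concrete, here is a minimal sketch of the before/after pattern (illustrative only; the function names and bodies below are simplified stand-ins, not the exact kernel code):

	/* Before: the lock never fails, so every entry point has to fence
	 * off recursion with a percpu busy counter.
	 */
	static void storage_list_op_old(struct bpf_local_storage *local_storage)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&local_storage->lock, flags);
		/* ... add/remove an selem on local_storage->list ... */
		raw_spin_unlock_irqrestore(&local_storage->lock, flags);
	}

	/* After: rqspinlock detects recursive (AA/ABBA) acquisition and
	 * times out, so the error is propagated to the BPF helper or
	 * syscall instead of being prevented by a percpu counter.
	 */
	static int storage_list_op_new(struct bpf_local_storage *local_storage)
	{
		unsigned long flags;
		int err;

		err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
		if (err)
			return err;	/* e.g. deadlock detected or timeout */
		/* ... add/remove an selem on local_storage->list ... */
		raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
		return 0;
	}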
* Test *

Task and cgroup local storage selftests already cover deadlock caused by recursion. Patch 14 updates the expected results of the task local storage selftests, since the task local storage bpf helpers can now run on the same CPU without causing deadlock.

* Benchmark *

  ./bench -p 1 local-storage-create --storage-type <socket,task> \
          --batch-size <16,32,64>

The benchmark is a microbenchmark stress-testing how fast local storage can be created. After switching to rqspinlock and bpf_selem_unlink_nofail(), socket local storage creation speed gains ~5%. For task local storage, the numbers remain the same.

Socket local storage

        batch   creation speed                             diff
        -----   --------------                             ----
Before   16     134.371 ± 0.884k/s   3.12 kmallocs/create
         32     133.032 ± 3.405k/s   3.12 kmallocs/create
         64     133.494 ± 0.862k/s   3.12 kmallocs/create
After    16     140.778 ± 1.306k/s   3.12 kmallocs/create  +4.8%
         32     140.550 ± 2.058k/s   3.11 kmallocs/create  +5.7%
         64     139.311 ± 0.911k/s   3.13 kmallocs/create  +4.4%

Task local storage

        batch   creation speed                             diff
        -----   --------------                             ----
Before   16      25.301 ± 0.089k/s   2.43 kmallocs/create
         32      23.797 ± 0.106k/s   2.51 kmallocs/create
         64      23.251 ± 0.187k/s   2.51 kmallocs/create
After    16      25.307 ± 0.080k/s   2.45 kmallocs/create  +0.0%
         32      23.889 ± 0.089k/s   2.46 kmallocs/create  +0.0%
         64      23.230 ± 0.113k/s   2.63 kmallocs/create  -0.1%

* Patchset organization *

Patches 1-4 convert local storage internal helpers to be failable.
Patch 5 changes the locks to rqspinlock and propagates the errors returned from raw_res_spin_lock_irqsave() to BPF helpers and syscalls.
Patches 6-8 remove the percpu counters in task and cgroup local storage.
Patches 9-11 address the unlikely rqspinlock errors by switching to bpf_selem_unlink_nofail() in map_free() and destroy().
Patches 12-17 update selftests.

* Appendix: local storage internals *

There are two locks in bpf_local_storage due to the ownership model illustrated in the figure below. A map value, which consists of a pointer to the map and the data, is a bpf_local_storage_data (sdata) stored in a bpf_local_storage_elem (selem). A selem belongs to a bpf_local_storage and a bpf_local_storage_map at the same time. bpf_local_storage::lock (local_storage->lock for short) protects the list in a bpf_local_storage, and bpf_local_storage_map_bucket::lock (b->lock) protects the hash bucket in a bpf_local_storage_map.

task_struct
┌ task1 ───────┐     bpf_local_storage
│ *bpf_storage │---->┌─────────┐
└──────────────┘<----│ *owner  │        bpf_local_storage_elem
                     │ *cache[16]             (selem)               selem
                     │ *smap   │         ┌──────────┐          ┌──────────┐
                     │ list    │-------->│ snode    │<-------->│ snode    │
                     │ lock    │    ┌--->│ map_node │<---┐ ┌-->│ map_node │
                     └─────────┘    │    │ sdata =  │    │ │   │ sdata =  │
task_struct                         │    │ {&mapA,} │    │ │   │ {&mapB,} │
┌ task2 ───────┐     bpf_local_storage   └──────────┘    │ │   └──────────┘
│ *bpf_storage │---->┌─────────┐    │                    │ │
└──────────────┘<----│ *owner  │    │                    │ │
                     │ *cache[16]   │      selem         │ │     selem
                     │ *smap   │    │    ┌──────────┐    │ │   ┌──────────┐
                     │ list    │----│--->│ snode    │<---│-│-->│ snode    │
                     │ lock    │    │ ┌->│ map_node │    └-│-->│ map_node │
                     └─────────┘    │ │  │ sdata =  │      │   │ sdata =  │
bpf_local_storage_map               │ │  │ {&mapB,} │      │   │ {&mapA,} │
        (smap)                      │ │  └──────────┘      │   └──────────┘
┌ mapA ───────┐                     │ │                    │
│ bpf_map map │   bpf_local_storage_map_bucket             │
│ *buckets    │---->┌ b[0] ┐        │ │                    │
└─────────────┘     │ list │--------┘ │                    │
                    │ lock │          │                    │
                    └──────┘          │                    │
      ...                             │                    │
┌ mapB ───────┐                       │                    │
│ bpf_map map │   bpf_local_storage_map_bucket             │
│ *buckets    │---->┌ b[0] ┐          │                    │
└─────────────┘     │ list │----------┘                    │
                    │ lock │                               │
                    └──────┘                               │
                    ┌ b[1] ┐                               │
                    │ list │-------------------------------┘
                    │ lock │
                    └──────┘
      ...
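To make the bitfield idea from the implementation section concrete, here is a minimal sketch, using the SELEM_* state bits added in the header diff below, of how racing unlinkers can elect exactly one freer without holding any lock (the helper name and body are hypothetical; the real logic lives in bpf_selem_unlink_nofail()):

	/* Each caller marks the link it tore down. Only the caller that
	 * observes both UNLINKED bits set races for SELEM_TOFREE, and only
	 * the winner of that race frees the selem.
	 */
	static void selem_mark_unlinked(struct bpf_local_storage_elem *selem,
					int bit) /* SELEM_MAP_UNLINKED or
						  * SELEM_STORAGE_UNLINKED */
	{
		int old = atomic_fetch_or(bit, &selem->state);

		if ((old | bit) != SELEM_UNLINKED)
			return;	/* the other link still exists */

		if (!(atomic_fetch_or(SELEM_TOFREE, &selem->state) & SELEM_TOFREE)) {
			/* This caller won: free the selem's special fields,
			 * uncharge the owner's memory, and free the selem
			 * (after an RCU grace period in the real code).
			 */
		}
	}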
* Changelog *

v6 -> v7
 - Minor comment and commit msg tweaks
 - Patch 9: Remove unused "owner" (kernel test robot)
 - Patch 13: Update comments in task_ls_recursion.c (AI)
Link: https://lore.kernel.org/bpf/20260205070208.186382-1-ameryhung@gmail.com/

v5 -> v6
 - Redo benchmark
 - Patch 9: Remove storage->smap as it is not used any more
 - Patch 17: Remove storage->smap check in selftests
 - Patch 10, 11: Pass reuse_now = true to bpf_selem_free() and bpf_local_storage_free() to allow faster memory reclaim (Martin)
 - Patch 10: Use a bitfield instead of a refcount to track selem state to be more precise, which removes the possibility of map_free missing an selem (Martin)
 - Patch 10: Allow map_free() to free local_storage and drop the change in bpf_local_storage_map_update() (Martin)
 - Patch 11: Simplify destroy() by not deferring work, as an owner is unlikely to have so many maps that it stalls RCU (Martin)
Link: https://lore.kernel.org/bpf/20260201175050.468601-1-ameryhung@gmail.com/

v4 -> v5
 - Patch 1: Fix incorrect bucket calculation (AI)
 - Patch 3: Fix memory leak in bpf_sk_storage_clone() (AI)
 - Patch 5: Fix memory leak in bpf_local_storage_update() (AI)
 - Fix typo/comment/commit msg (AI)
 - Patch 10: Replace smp_rmb() with smp_mb(), as smp_rmb() does not imply acquire semantics
Link: https://lore.kernel.org/bpf/20260131050920.2574084-1-ameryhung@gmail.com/

v3 -> v4
 - Add performance numbers
 - Avoid stale elements when calling bpf_local_storage_map_free() by allowing it to unlink selems from local_storage->list and uncharge memory. Block destroy() from returning while pending map_free() calls are uncharging
 - Fix an -EAGAIN bug in bpf_local_storage_update(), as map_free() no longer frees local storage
 - Fix possible double-free of selem by ensuring an selem is only processed once for each caller (Kumar)
 - Fix possible infinite loop in bpf_selem_unlink_nofail() when iterating b->list by replacing the while loop with hlist_for_each_entry_rcu
 - Fix unsafe iteration in destroy() by iterating local_storage->list using hlist_for_each_entry_rcu
 - Fix UAF caused by clearing storage_owner after destroy(); flip the order to fix it
 - Misc clean-ups suggested by Martin
Link: https://lore.kernel.org/bpf/20251218175628.1460321-1-ameryhung@gmail.com/

v2 -> v3
 - Rebase onto bpf-next, where the BPF memory allocator is replaced with kmalloc_nolock()
 - Revert to selecting the bucket based on the selem
 - Introduce bpf_selem_unlink_lockless() to allow unlinking and freeing an selem without taking locks
Link: https://lore.kernel.org/bpf/20251002225356.1505480-1-ameryhung@gmail.com/

v1 -> v2
 - Rebase onto bpf-next
 - Select the bucket based on local_storage instead of selem (Martin)
 - Simplify bpf_selem_unlink (Martin)
 - Change the handling of rqspinlock errors in bpf_local_storage_destroy() and bpf_local_storage_map_free(): retry instead of WARN_ON
Link: https://lore.kernel.org/bpf/20250729182550.185356-1-ameryhung@gmail.com/
====================

Link: https://patch.msgid.link/20260205222916.1788211-1-ameryhung@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 parents: b8c89f5 + 97b859b

15 files changed

Lines changed: 354 additions & 561 deletions

include/linux/bpf_local_storage.h

Lines changed: 19 additions & 10 deletions
@@ -15,12 +15,13 @@
 #include <linux/types.h>
 #include <linux/bpf_mem_alloc.h>
 #include <uapi/linux/btf.h>
+#include <asm/rqspinlock.h>
 
 #define BPF_LOCAL_STORAGE_CACHE_SIZE	16
 
 struct bpf_local_storage_map_bucket {
 	struct hlist_head list;
-	raw_spinlock_t lock;
+	rqspinlock_t lock;
 };
 
 /* Thp map is not the primary owner of a bpf_local_storage_elem.
@@ -67,6 +68,11 @@ struct bpf_local_storage_data {
 	u8 data[] __aligned(8);
 };
 
+#define SELEM_MAP_UNLINKED	(1 << 0)
+#define SELEM_STORAGE_UNLINKED	(1 << 1)
+#define SELEM_UNLINKED		(SELEM_MAP_UNLINKED | SELEM_STORAGE_UNLINKED)
+#define SELEM_TOFREE		(1 << 2)
+
 /* Linked to bpf_local_storage and bpf_local_storage_map */
 struct bpf_local_storage_elem {
 	struct hlist_node map_node;	/* Linked to bpf_local_storage_map */
@@ -79,7 +85,9 @@ struct bpf_local_storage_elem {
 	 * after raw_spin_unlock
 	 */
 	};
-	/* 8 bytes hole */
+	atomic_t state;
+	bool use_kmalloc_nolock;
+	/* 3 bytes hole */
 	/* The data is stored in another cacheline to minimize
 	 * the number of cachelines access during a cache hit.
 	 */
@@ -88,13 +96,14 @@
 
 struct bpf_local_storage {
 	struct bpf_local_storage_data __rcu *cache[BPF_LOCAL_STORAGE_CACHE_SIZE];
-	struct bpf_local_storage_map __rcu *smap;
 	struct hlist_head list;	/* List of bpf_local_storage_elem */
 	void *owner;		/* The object that owns the above "list" of
 				 * bpf_local_storage_elem.
 				 */
 	struct rcu_head rcu;
-	raw_spinlock_t lock;	/* Protect adding/removing from the "list" */
+	rqspinlock_t lock;	/* Protect adding/removing from the "list" */
+	u64 mem_charge;		/* Copy of mem charged to owner. Protected by "lock" */
+	refcount_t owner_refcnt;/* Used to pin owner when map_free is uncharging */
 	bool use_kmalloc_nolock;
 };
@@ -162,11 +171,10 @@ bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
 	return SDATA(selem);
 }
 
-void bpf_local_storage_destroy(struct bpf_local_storage *local_storage);
+u32 bpf_local_storage_destroy(struct bpf_local_storage *local_storage);
 
 void bpf_local_storage_map_free(struct bpf_map *map,
-				struct bpf_local_storage_cache *cache,
-				int __percpu *busy_counter);
+				struct bpf_local_storage_cache *cache);
 
 int bpf_local_storage_map_check_btf(const struct bpf_map *map,
 				    const struct btf *btf,
@@ -176,10 +184,11 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
 void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
 				   struct bpf_local_storage_elem *selem);
 
-void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now);
+int bpf_selem_unlink(struct bpf_local_storage_elem *selem);
 
-void bpf_selem_link_map(struct bpf_local_storage_map *smap,
-			struct bpf_local_storage_elem *selem);
+int bpf_selem_link_map(struct bpf_local_storage_map *smap,
+		       struct bpf_local_storage *local_storage,
+		       struct bpf_local_storage_elem *selem);
 
 struct bpf_local_storage_elem *
 bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,

kernel/bpf/bpf_cgrp_storage.c

Lines changed: 9 additions & 53 deletions
@@ -11,29 +11,6 @@
 
 DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
 
-static DEFINE_PER_CPU(int, bpf_cgrp_storage_busy);
-
-static void bpf_cgrp_storage_lock(void)
-{
-	cant_migrate();
-	this_cpu_inc(bpf_cgrp_storage_busy);
-}
-
-static void bpf_cgrp_storage_unlock(void)
-{
-	this_cpu_dec(bpf_cgrp_storage_busy);
-}
-
-static bool bpf_cgrp_storage_trylock(void)
-{
-	cant_migrate();
-	if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
-		this_cpu_dec(bpf_cgrp_storage_busy);
-		return false;
-	}
-	return true;
-}
-
 static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
 {
 	struct cgroup *cg = owner;
@@ -45,16 +22,14 @@ void bpf_cgrp_storage_free(struct cgroup *cgroup)
 {
 	struct bpf_local_storage *local_storage;
 
-	rcu_read_lock_dont_migrate();
+	rcu_read_lock();
 	local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
 	if (!local_storage)
 		goto out;
 
-	bpf_cgrp_storage_lock();
 	bpf_local_storage_destroy(local_storage);
-	bpf_cgrp_storage_unlock();
 out:
-	rcu_read_unlock_migrate();
+	rcu_read_unlock();
 }
 
 static struct bpf_local_storage_data *
@@ -83,9 +58,7 @@ static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
 	if (IS_ERR(cgroup))
 		return ERR_CAST(cgroup);
 
-	bpf_cgrp_storage_lock();
 	sdata = cgroup_storage_lookup(cgroup, map, true);
-	bpf_cgrp_storage_unlock();
 	cgroup_put(cgroup);
 	return sdata ? sdata->data : NULL;
 }
@@ -102,10 +75,8 @@ static long bpf_cgrp_storage_update_elem(struct bpf_map *map, void *key,
 	if (IS_ERR(cgroup))
 		return PTR_ERR(cgroup);
 
-	bpf_cgrp_storage_lock();
 	sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
 					 value, map_flags, false, GFP_ATOMIC);
-	bpf_cgrp_storage_unlock();
 	cgroup_put(cgroup);
 	return PTR_ERR_OR_ZERO(sdata);
 }
@@ -118,8 +89,7 @@ static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
-	bpf_selem_unlink(SELEM(sdata), false);
-	return 0;
+	return bpf_selem_unlink(SELEM(sdata));
 }
 
 static long bpf_cgrp_storage_delete_elem(struct bpf_map *map, void *key)
@@ -132,9 +102,7 @@ static long bpf_cgrp_storage_delete_elem(struct bpf_map *map, void *key)
 	if (IS_ERR(cgroup))
 		return PTR_ERR(cgroup);
 
-	bpf_cgrp_storage_lock();
 	err = cgroup_storage_delete(cgroup, map);
-	bpf_cgrp_storage_unlock();
 	cgroup_put(cgroup);
 	return err;
 }
@@ -151,15 +119,14 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 
 static void cgroup_storage_map_free(struct bpf_map *map)
 {
-	bpf_local_storage_map_free(map, &cgroup_cache, &bpf_cgrp_storage_busy);
+	bpf_local_storage_map_free(map, &cgroup_cache);
 }
 
 /* *gfp_flags* is a hidden argument provided by the verifier */
 BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
 	   void *, value, u64, flags, gfp_t, gfp_flags)
 {
 	struct bpf_local_storage_data *sdata;
-	bool nobusy;
 
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
 	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
@@ -168,38 +135,27 @@ BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
 	if (!cgroup)
 		return (unsigned long)NULL;
 
-	nobusy = bpf_cgrp_storage_trylock();
-
-	sdata = cgroup_storage_lookup(cgroup, map, nobusy);
+	sdata = cgroup_storage_lookup(cgroup, map, true);
 	if (sdata)
-		goto unlock;
+		goto out;
 
 	/* only allocate new storage, when the cgroup is refcounted */
 	if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
-	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) && nobusy)
+	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
 		sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
 						 value, BPF_NOEXIST, false, gfp_flags);
 
-unlock:
-	if (nobusy)
-		bpf_cgrp_storage_unlock();
+out:
 	return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
 }
 
 BPF_CALL_2(bpf_cgrp_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup)
 {
-	int ret;
-
 	WARN_ON_ONCE(!bpf_rcu_lock_held());
 	if (!cgroup)
 		return -EINVAL;
 
-	if (!bpf_cgrp_storage_trylock())
-		return -EBUSY;
-
-	ret = cgroup_storage_delete(cgroup, map);
-	bpf_cgrp_storage_unlock();
-	return ret;
+	return cgroup_storage_delete(cgroup, map);
 }
 
 const struct bpf_map_ops cgrp_storage_map_ops = {
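With the trylock gone, bpf_cgrp_storage_get() with BPF_LOCAL_STORAGE_GET_F_CREATE no longer fails just because another update happens to run on the same CPU, so callers need no retry loop. A minimal usage sketch (not part of this commit; the map and program names are made up, and the usual vmlinux.h/libbpf headers are assumed):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* One counter per cgroup, created lazily on first use. */
	struct {
		__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, long);
	} fork_counts SEC(".maps");

	SEC("tp_btf/task_newtask")
	int BPF_PROG(count_fork, struct task_struct *task, u64 clone_flags)
	{
		struct cgroup *cgrp = task->cgroups->dfl_cgrp;
		long *cnt;

		/* With this series, a failure here indicates deadlock
		 * detection or a lock timeout, not same-CPU contention.
		 */
		cnt = bpf_cgrp_storage_get(&fork_counts, cgrp, 0,
					   BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (cnt)
			__sync_fetch_and_add(cnt, 1);
		return 0;
	}

	char _license[] SEC("license") = "GPL";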

kernel/bpf/bpf_inode_storage.c

Lines changed: 2 additions & 4 deletions
@@ -110,9 +110,7 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
-	bpf_selem_unlink(SELEM(sdata), false);
-
-	return 0;
+	return bpf_selem_unlink(SELEM(sdata));
 }
 
 static long bpf_fd_inode_storage_delete_elem(struct bpf_map *map, void *key)
@@ -186,7 +184,7 @@ static struct bpf_map *inode_storage_map_alloc(union bpf_attr *attr)
 
 static void inode_storage_map_free(struct bpf_map *map)
 {
-	bpf_local_storage_map_free(map, &inode_cache, NULL);
+	bpf_local_storage_map_free(map, &inode_cache);
 }
 
 const struct bpf_map_ops inode_storage_map_ops = {
