Commit 3f7f1a6

anakryiko authored and Peter Zijlstra committed
uprobes: revamp uprobe refcounting and lifetime management
Revamp how struct uprobe is refcounted, and thus how its lifetime is managed.

Right now, there are a few possible "owners" of the uprobe refcount:

  - uprobes_tree RB tree assumes one refcount when uprobe is registered and added to the lookup tree;

  - while uprobe is triggered and kernel is handling it in the breakpoint handler code, temporary refcount bump is done to keep uprobe from being freed;

  - if we have uretprobe requested on a given struct uprobe instance, we take another refcount to keep uprobe alive until user space code returns from the function and triggers the return handler.

The uprobes_tree's extra refcount of 1 is confusing and problematic. No matter how many actual consumers are attached, they all share the same refcount, and we have extra logic to drop the "last" (which might not really be last) refcount once uprobe's consumer list becomes empty. This is unconventional and has to be kept in mind as a special case all the time. Further, because of this design we have situations where find_uprobe() will find uprobe, bump its refcount, and return it to the caller, but that uprobe will still need a uprobe_is_active() check, after which the caller is required to drop the refcount and try again. This is just too many details leaking to the higher-level logic.

This patch changes the refcounting scheme so as to not have uprobes_tree keeping an extra refcount for struct uprobe. Instead, each uprobe_consumer assumes its own refcount, which is dropped when the consumer is unregistered. Other than that, all the active users of uprobe (entry and return uprobe handling code) keep exactly the same refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to make sure that uprobe's "destructor" removes uprobe from uprobes_tree, of course. This, though, races with uprobe entry handling code in handle_swbp(), which, through the find_active_uprobe()->find_uprobe() lookup, can race with uprobe being destroyed after its refcount drops to zero (e.g., due to uprobe_consumer unregistering). So we add try_get_uprobe(), which will attempt to bump the refcount, unless it is already zero. The caller needs to guarantee that the uprobe instance won't be freed in parallel, which is the case while we hold uprobes_treelock (for read or write, doesn't matter).

Note also that the race between registration and unregistration no longer leaks out to callers, so we remove the retry logic completely. If find_uprobe() returns a valid uprobe, it's guaranteed to remain in uprobes_tree with a properly incremented refcount. The race is handled by __insert_uprobe() and put_uprobe() working together: __insert_uprobe() will remove a uprobe from the RB-tree if it can't bump its refcount, and will retry inserting the new uprobe instance; put_uprobe() won't attempt to remove a uprobe from the RB-tree if it's already not there. All of that is protected by uprobes_treelock, which keeps things simple.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-2-andrii@kernel.org
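For readers outside the kernel tree, here is a minimal user-space C model of the scheme described above. This is only a sketch, not code from this commit: struct node, table, table_lock, find(), insert_or_reuse(), try_get() and put() are invented stand-ins for struct uprobe, uprobes_tree, uprobes_treelock, find_uprobe(), __insert_uprobe(), try_get_uprobe() and put_uprobe(), with a one-slot table standing in for the RB-tree. It demonstrates the two rules the patch establishes: a lookup may only take a reference through a conditional increment that fails once the count has reached zero, and the final put is what unlinks the object from the lookup structure.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object; ref plays the role of uprobe->ref,
 * in_table the role of !RB_EMPTY_NODE(&uprobe->rb_node). */
struct node {
	atomic_int ref;
	bool in_table;
};

static struct node *table;	/* one-slot stand-in for uprobes_tree */
static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Analog of try_get_uprobe(): bump the refcount unless it already hit
 * zero. Caller must hold table_lock (read or write) so the node can't
 * be freed out from under us. */
static struct node *try_get(struct node *n)
{
	int old = atomic_load(&n->ref);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&n->ref, &old, old + 1))
			return n;
	}
	return NULL;	/* refcount already hit zero; node is dying */
}

/* Analog of put_uprobe(): whoever drops the last reference unlinks the
 * node, unless the insert path already evicted it. */
static void put(struct node *n)
{
	if (atomic_fetch_sub(&n->ref, 1) != 1)
		return;

	pthread_rwlock_wrlock(&table_lock);
	if (n->in_table) {
		n->in_table = false;	/* rb_erase() in the real code */
		table = NULL;
	}
	pthread_rwlock_unlock(&table_lock);
	free(n);
}

/* Analog of __insert_uprobe(): reuse a live existing entry, or evict a
 * dying one and retry the insert. */
static struct node *insert_or_reuse(struct node *new)
{
	pthread_rwlock_wrlock(&table_lock);
again:
	if (table) {
		struct node *u = try_get(table);

		if (!u) {	/* existing entry is dying: evict, retry */
			table->in_table = false;
			table = NULL;
			goto again;
		}
		pthread_rwlock_unlock(&table_lock);
		return u;	/* reuse existing entry */
	}
	table = new;	/* insert with its initial refcount of 1 */
	new->in_table = true;
	pthread_rwlock_unlock(&table_lock);
	return new;
}

/* Analog of find_uprobe(): lookup under the lock, conditional ref. */
static struct node *find(void)
{
	struct node *u = NULL;

	pthread_rwlock_rdlock(&table_lock);
	if (table)
		u = try_get(table);
	pthread_rwlock_unlock(&table_lock);
	return u;
}

int main(void)
{
	struct node *n = calloc(1, sizeof(*n));
	struct node *cur, *hit;

	if (!n)
		return 1;
	atomic_store(&n->ref, 1);	/* the consumer's own reference */
	cur = insert_or_reuse(n);	/* table was empty: inserts n */

	hit = find();			/* "breakpoint handler" lookup */
	if (hit)
		put(hit);		/* done handling the hit */

	put(cur);			/* "consumer unregisters": last put
					 * unlinks from the table and frees */
	return 0;
}

Because try_get() can fail only when an entry is already on its way out, a failed lookup behaves exactly as if the probe had never been there; that is why uprobe_register() below no longer needs its -EAGAIN retry loop.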
1 parent 5fe6e30 commit 3f7f1a6

1 file changed

Lines changed: 101 additions & 78 deletions

File tree

kernel/events/uprobes.c

@@ -109,6 +109,11 @@ struct xol_area {
 	unsigned long		vaddr;		/* Page(s) of instruction slots */
 };
 
+static void uprobe_warn(struct task_struct *t, const char *msg)
+{
+	pr_warn("uprobe: %s:%d failed to %s\n", current->comm, current->pid, msg);
+}
+
 /*
  * valid_vma: Verify if the specified vma is an executable vma
  * Relax restrictions while unregistering: vm_flags might have
@@ -587,25 +592,53 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+/* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
+/*
+ * uprobe should have guaranteed lifetime, which can be either of:
+ *   - caller already has refcount taken (and wants an extra one);
+ *   - uprobe is RCU protected and won't be freed until after grace period;
+ *   - we are holding uprobes_treelock (for read or write, doesn't matter).
+ */
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+	if (refcount_inc_not_zero(&uprobe->ref))
+		return uprobe;
+	return NULL;
+}
+
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
-		/*
-		 * If application munmap(exec_vma) before uprobe_unregister()
-		 * gets called, we don't get a chance to remove uprobe from
-		 * delayed_uprobe_list from remove_breakpoint(). Do it here.
-		 */
-		mutex_lock(&delayed_uprobe_lock);
-		delayed_uprobe_remove(uprobe, NULL);
-		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
-	}
+	if (!refcount_dec_and_test(&uprobe->ref))
+		return;
+
+	write_lock(&uprobes_treelock);
+
+	if (uprobe_is_active(uprobe))
+		rb_erase(&uprobe->rb_node, &uprobes_tree);
+
+	write_unlock(&uprobes_treelock);
+
+	/*
+	 * If application munmap(exec_vma) before uprobe_unregister()
+	 * gets called, we don't get a chance to remove uprobe from
+	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
+	 */
+	mutex_lock(&delayed_uprobe_lock);
+	delayed_uprobe_remove(uprobe, NULL);
+	mutex_unlock(&delayed_uprobe_lock);
+
+	kfree(uprobe);
 }
 
 static __always_inline
@@ -656,7 +689,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return try_get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
@@ -676,26 +709,44 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
-
+again:
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
-	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+	if (node) {
+		struct uprobe *u = __node_2_uprobe(node);
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+		if (!try_get_uprobe(u)) {
+			rb_erase(node, &uprobes_tree);
+			RB_CLEAR_NODE(&u->rb_node);
+			goto again;
+		}
+
+		return u;
+	}
+
+	return uprobe;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- *	increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- *	get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
  */
 static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +783,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	refcount_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,26 +974,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node);	/* for uprobe_is_active() */
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1094,17 +1127,13 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	int err;
 
 	down_write(&uprobe->register_rwsem);
-	if (WARN_ON(!consumer_del(uprobe, uc)))
+	if (WARN_ON(!consumer_del(uprobe, uc))) {
 		err = -ENOENT;
-	else
+	} else {
 		err = register_for_each_vma(uprobe, NULL);
-
-	/* TODO : cant unregister? schedule a worker thread */
-	if (!err) {
-		if (!uprobe->consumers)
-			delete_uprobe(uprobe);
-		else
-			err = -EBUSY;
+		/* TODO : cant unregister? schedule a worker thread */
+		if (unlikely(err))
+			uprobe_warn(current, "unregister, leaking uprobe");
 	}
 	up_write(&uprobe->register_rwsem);
 
@@ -1159,27 +1188,16 @@ struct uprobe *uprobe_register(struct inode *inode,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return ERR_PTR(-EINVAL);
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return uprobe;
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
 	if (ret) {
-		if (unlikely(ret == -EAGAIN))
-			goto retry;
 		uprobe_unregister(uprobe, uc);
 		return ERR_PTR(ret);
 	}
@@ -1286,15 +1304,17 @@ static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
-			list_add(&u->pending_list, head);
-			get_uprobe(u);
+			/* if uprobe went away, it's safe to ignore it */
+			if (try_get_uprobe(u))
+				list_add(&u->pending_list, head);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
-			list_add(&u->pending_list, head);
-			get_uprobe(u);
+			/* if uprobe went away, it's safe to ignore it */
+			if (try_get_uprobe(u))
+				list_add(&u->pending_list, head);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1751,6 +1771,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * get_uprobe() is safe to use here.
+		 */
 		get_uprobe(n->uprobe);
 		n->next = NULL;
 
@@ -1762,12 +1788,6 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 	return 0;
 }
 
-static void uprobe_warn(struct task_struct *t, const char *msg)
-{
-	pr_warn("uprobe: %s:%d failed to %s\n",
-			current->comm, current->pid, msg);
-}
-
 static void dup_xol_work(struct callback_head *work)
 {
 	if (current->flags & PF_EXITING)
@@ -1893,7 +1913,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-
+	/*
+	 * uprobe's refcnt is positive, held by caller, so it's safe to
+	 * unconditionally bump it one more time here
+	 */
 	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
