Skip to content

Commit b1e809e

Browse files
committed
Merge patch series "introduce PIDFD_SELF* sentinels"
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> says: If you wish to utilise a pidfd interface to refer to the current process or thread then there is a lot of ceremony involved, looking something like: pid_t pid = getpid(); int pidfd = pidfd_open(pid, PIDFD_THREAD); if (pidfd < 0) { ... error handling ... } if (process_madvise(pidfd, iovec, 8, MADV_GUARD_INSTALL, 0)) { ... cleanup pidfd ... ... error handling ... } ... ... cleanup pidfd ... This adds unnecessary overhead + system calls, complicated error handling and in addition pidfd_open() is subject to RLIMIT_NOFILE (i.e. the limit of per-process number of open file descriptors), so the call may fail spuriously on this basis. Rather than doing this we can make use of sentinels for this purpose which can be passed as the pidfd instead. This looks like: if (process_madvise(PIDFD_SELF, iovec, 8, MADV_GUARD_INSTALL, 0)) { ... error handling ... } And avoids all of the aforementioned issues. This series introduces such sentinels. It is useful to refer to both the current thread from the userland's perspective for which we use PIDFD_SELF, and the current process from the userland's perspective, for which we use PIDFD_SELF_PROCESS. There is unfortunately some confusion between the kernel and userland as to what constitutes a process - a thread from the userland perspective is a process in userland, and a userland process is a thread group (more specifically the thread group leader from the kernel perspective). We therefore alias things thusly: * PIDFD_SELF_THREAD aliased by PIDFD_SELF - use PIDTYPE_PID. * PIDFD_SELF_THREAD_GROUP alised by PIDFD_SELF_PROCESS - use PIDTYPE_TGID. In all of the kernel code we refer to PIDFD_SELF_THREAD and PIDFD_SELF_THREAD_GROUP. However we expect users to use PIDFD_SELF and PIDFD_SELF_PROCESS. This matters for cases where, for instance, a user unshare()'s FDs or does thread-specific signal handling and where the user would be hugely confused if the FDs referenced or signal processed referred to the thread group leader rather than the individual thread. Another use-case comes from Android. When installing multiple MADV_GUARD_INSTALL guard pages they considered caching the result of pidfd_open() for reuse. That however wouldn't work in shared libraries where users can fork() in between such calls. For now we only adjust pidfd_get_task() and the pidfd_send_signal() system call with specific handling for this, implementing this functionality for process_madvise(), process_mrelease() (albeit, using it here wouldn't really make sense) and pidfd_send_signal(). We defer making further changes, as this would require a significant rework of the pidfd mechanism. The motivating case here is to support PIDFD_SELF in process_madvise(), so this suffices for immediate uses. Moving forward, this can be further expanded to other uses. * patches from https://lore.kernel.org/r/cover.1738268370.git.lorenzo.stoakes@oracle.com: selftests/mm: use PIDFD_SELF in guard pages test selftests/pidfd: add tests for PIDFD_SELF_* selftests/pidfd: add new PIDFD_SELF* defines pidfd: add PIDFD_SELF* sentinels to refer to own thread/process Link: https://lore.kernel.org/r/cover.1738268370.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2 parents 2014c95 + 62d648c commit b1e809e

7 files changed

Lines changed: 193 additions & 74 deletions

File tree

include/uapi/linux/pidfd.h

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,30 @@
2323

2424
#define PIDFD_INFO_SIZE_VER0 64 /* sizeof first published struct */
2525

26+
/*
27+
* The concept of process and threads in userland and the kernel is a confusing
28+
* one - within the kernel every thread is a 'task' with its own individual PID,
29+
* however from userland's point of view threads are grouped by a single PID,
30+
* which is that of the 'thread group leader', typically the first thread
31+
* spawned.
32+
*
33+
* To cut the Gideon knot, for internal kernel usage, we refer to
34+
* PIDFD_SELF_THREAD to refer to the current thread (or task from a kernel
35+
* perspective), and PIDFD_SELF_THREAD_GROUP to refer to the current thread
36+
* group leader...
37+
*/
38+
#define PIDFD_SELF_THREAD -10000 /* Current thread. */
39+
#define PIDFD_SELF_THREAD_GROUP -20000 /* Current thread group leader. */
40+
41+
/*
42+
* ...and for userland we make life simpler - PIDFD_SELF refers to the current
43+
* thread, PIDFD_SELF_PROCESS refers to the process thread group leader.
44+
*
45+
* For nearly all practical uses, a user will want to use PIDFD_SELF.
46+
*/
47+
#define PIDFD_SELF PIDFD_SELF_THREAD
48+
#define PIDFD_SELF_PROCESS PIDFD_SELF_THREAD_GROUP
49+
2650
struct pidfd_info {
2751
/*
2852
* This mask is similar to the request_mask in statx(2).

kernel/pid.c

Lines changed: 19 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -564,15 +564,29 @@ struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
564564
*/
565565
struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags)
566566
{
567-
unsigned int f_flags;
567+
unsigned int f_flags = 0;
568568
struct pid *pid;
569569
struct task_struct *task;
570+
enum pid_type type;
570571

571-
pid = pidfd_get_pid(pidfd, &f_flags);
572-
if (IS_ERR(pid))
573-
return ERR_CAST(pid);
572+
switch (pidfd) {
573+
case PIDFD_SELF_THREAD:
574+
type = PIDTYPE_PID;
575+
pid = get_task_pid(current, type);
576+
break;
577+
case PIDFD_SELF_THREAD_GROUP:
578+
type = PIDTYPE_TGID;
579+
pid = get_task_pid(current, type);
580+
break;
581+
default:
582+
pid = pidfd_get_pid(pidfd, &f_flags);
583+
if (IS_ERR(pid))
584+
return ERR_CAST(pid);
585+
type = PIDTYPE_TGID;
586+
break;
587+
}
574588

575-
task = get_pid_task(pid, PIDTYPE_TGID);
589+
task = get_pid_task(pid, type);
576590
put_pid(pid);
577591
if (!task)
578592
return ERR_PTR(-ESRCH);

kernel/signal.c

Lines changed: 63 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -4009,6 +4009,47 @@ static struct pid *pidfd_to_pid(const struct file *file)
40094009
(PIDFD_SIGNAL_THREAD | PIDFD_SIGNAL_THREAD_GROUP | \
40104010
PIDFD_SIGNAL_PROCESS_GROUP)
40114011

4012+
static int do_pidfd_send_signal(struct pid *pid, int sig, enum pid_type type,
4013+
siginfo_t __user *info, unsigned int flags)
4014+
{
4015+
kernel_siginfo_t kinfo;
4016+
4017+
switch (flags) {
4018+
case PIDFD_SIGNAL_THREAD:
4019+
type = PIDTYPE_PID;
4020+
break;
4021+
case PIDFD_SIGNAL_THREAD_GROUP:
4022+
type = PIDTYPE_TGID;
4023+
break;
4024+
case PIDFD_SIGNAL_PROCESS_GROUP:
4025+
type = PIDTYPE_PGID;
4026+
break;
4027+
}
4028+
4029+
if (info) {
4030+
int ret;
4031+
4032+
ret = copy_siginfo_from_user_any(&kinfo, info);
4033+
if (unlikely(ret))
4034+
return ret;
4035+
4036+
if (unlikely(sig != kinfo.si_signo))
4037+
return -EINVAL;
4038+
4039+
/* Only allow sending arbitrary signals to yourself. */
4040+
if ((task_pid(current) != pid || type > PIDTYPE_TGID) &&
4041+
(kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL))
4042+
return -EPERM;
4043+
} else {
4044+
prepare_kill_siginfo(sig, &kinfo, type);
4045+
}
4046+
4047+
if (type == PIDTYPE_PGID)
4048+
return kill_pgrp_info(sig, &kinfo, pid);
4049+
4050+
return kill_pid_info_type(sig, &kinfo, pid, type);
4051+
}
4052+
40124053
/**
40134054
* sys_pidfd_send_signal - Signal a process through a pidfd
40144055
* @pidfd: file descriptor of the process
@@ -4026,9 +4067,7 @@ static struct pid *pidfd_to_pid(const struct file *file)
40264067
SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
40274068
siginfo_t __user *, info, unsigned int, flags)
40284069
{
4029-
int ret;
40304070
struct pid *pid;
4031-
kernel_siginfo_t kinfo;
40324071
enum pid_type type;
40334072

40344073
/* Enforce flags be set to 0 until we add an extension. */
@@ -4039,57 +4078,39 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
40394078
if (hweight32(flags & PIDFD_SEND_SIGNAL_FLAGS) > 1)
40404079
return -EINVAL;
40414080

4042-
CLASS(fd, f)(pidfd);
4043-
if (fd_empty(f))
4044-
return -EBADF;
4081+
switch (pidfd) {
4082+
case PIDFD_SELF_THREAD:
4083+
pid = get_task_pid(current, PIDTYPE_PID);
4084+
type = PIDTYPE_PID;
4085+
break;
4086+
case PIDFD_SELF_THREAD_GROUP:
4087+
pid = get_task_pid(current, PIDTYPE_TGID);
4088+
type = PIDTYPE_TGID;
4089+
break;
4090+
default: {
4091+
CLASS(fd, f)(pidfd);
4092+
if (fd_empty(f))
4093+
return -EBADF;
40454094

4046-
/* Is this a pidfd? */
4047-
pid = pidfd_to_pid(fd_file(f));
4048-
if (IS_ERR(pid))
4049-
return PTR_ERR(pid);
4095+
/* Is this a pidfd? */
4096+
pid = pidfd_to_pid(fd_file(f));
4097+
if (IS_ERR(pid))
4098+
return PTR_ERR(pid);
40504099

4051-
if (!access_pidfd_pidns(pid))
4052-
return -EINVAL;
4100+
if (!access_pidfd_pidns(pid))
4101+
return -EINVAL;
40534102

4054-
switch (flags) {
4055-
case 0:
40564103
/* Infer scope from the type of pidfd. */
40574104
if (fd_file(f)->f_flags & PIDFD_THREAD)
40584105
type = PIDTYPE_PID;
40594106
else
40604107
type = PIDTYPE_TGID;
4061-
break;
4062-
case PIDFD_SIGNAL_THREAD:
4063-
type = PIDTYPE_PID;
4064-
break;
4065-
case PIDFD_SIGNAL_THREAD_GROUP:
4066-
type = PIDTYPE_TGID;
4067-
break;
4068-
case PIDFD_SIGNAL_PROCESS_GROUP:
4069-
type = PIDTYPE_PGID;
4070-
break;
4071-
}
40724108

4073-
if (info) {
4074-
ret = copy_siginfo_from_user_any(&kinfo, info);
4075-
if (unlikely(ret))
4076-
return ret;
4077-
4078-
if (unlikely(sig != kinfo.si_signo))
4079-
return -EINVAL;
4080-
4081-
/* Only allow sending arbitrary signals to yourself. */
4082-
if ((task_pid(current) != pid || type > PIDTYPE_TGID) &&
4083-
(kinfo.si_code >= 0 || kinfo.si_code == SI_TKILL))
4084-
return -EPERM;
4085-
} else {
4086-
prepare_kill_siginfo(sig, &kinfo, type);
4109+
return do_pidfd_send_signal(pid, sig, type, info, flags);
4110+
}
40874111
}
40884112

4089-
if (type == PIDTYPE_PGID)
4090-
return kill_pgrp_info(sig, &kinfo, pid);
4091-
else
4092-
return kill_pid_info_type(sig, &kinfo, pid, type);
4113+
return do_pidfd_send_signal(pid, sig, type, info, flags);
40934114
}
40944115

40954116
static int

tools/testing/selftests/mm/guard-pages.c

Lines changed: 4 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@
1919
#include <sys/uio.h>
2020
#include <unistd.h>
2121

22+
#include "../pidfd/pidfd.h"
23+
2224
/*
2325
* Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1:
2426
*
@@ -50,11 +52,6 @@ static void handle_fatal(int c)
5052
siglongjmp(signal_jmp_buf, c);
5153
}
5254

53-
static int pidfd_open(pid_t pid, unsigned int flags)
54-
{
55-
return syscall(SYS_pidfd_open, pid, flags);
56-
}
57-
5855
static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec,
5956
size_t n, int advice, unsigned int flags)
6057
{
@@ -370,14 +367,10 @@ TEST_F(guard_pages, multi_vma)
370367
TEST_F(guard_pages, process_madvise)
371368
{
372369
const unsigned long page_size = self->page_size;
373-
pid_t pid = getpid();
374-
int pidfd = pidfd_open(pid, 0);
375370
char *ptr_region, *ptr1, *ptr2, *ptr3;
376371
ssize_t count;
377372
struct iovec vec[6];
378373

379-
ASSERT_NE(pidfd, -1);
380-
381374
/* Reserve region to map over. */
382375
ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
383376
MAP_ANON | MAP_PRIVATE, -1, 0);
@@ -425,7 +418,7 @@ TEST_F(guard_pages, process_madvise)
425418
ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
426419

427420
/* Now guard in one step. */
428-
count = sys_process_madvise(pidfd, vec, 6, MADV_GUARD_INSTALL, 0);
421+
count = sys_process_madvise(PIDFD_SELF, vec, 6, MADV_GUARD_INSTALL, 0);
429422

430423
/* OK we don't have permission to do this, skip. */
431424
if (count == -1 && errno == EPERM)
@@ -446,7 +439,7 @@ TEST_F(guard_pages, process_madvise)
446439
ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
447440

448441
/* Now do the same with unguard... */
449-
count = sys_process_madvise(pidfd, vec, 6, MADV_GUARD_REMOVE, 0);
442+
count = sys_process_madvise(PIDFD_SELF, vec, 6, MADV_GUARD_REMOVE, 0);
450443

451444
/* ...and everything should now succeed. */
452445

@@ -463,7 +456,6 @@ TEST_F(guard_pages, process_madvise)
463456
ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
464457
ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
465458
ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
466-
close(pidfd);
467459
}
468460

469461
/* Assert that unmapping ranges does not leave guard markers behind. */

tools/testing/selftests/pidfd/pidfd.h

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,22 @@
5050
#define PIDFD_NONBLOCK O_NONBLOCK
5151
#endif
5252

53+
#ifndef PIDFD_SELF_THREAD
54+
#define PIDFD_SELF_THREAD -10000 /* Current thread. */
55+
#endif
56+
57+
#ifndef PIDFD_SELF_THREAD_GROUP
58+
#define PIDFD_SELF_THREAD_GROUP -20000 /* Current thread group leader. */
59+
#endif
60+
61+
#ifndef PIDFD_SELF
62+
#define PIDFD_SELF PIDFD_SELF_THREAD
63+
#endif
64+
65+
#ifndef PIDFD_SELF_PROCESS
66+
#define PIDFD_SELF_PROCESS PIDFD_SELF_THREAD_GROUP
67+
#endif
68+
5369
/*
5470
* The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
5571
* That means, when it wraps around any pid < 300 will be skipped.

tools/testing/selftests/pidfd/pidfd_open_test.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -31,7 +31,7 @@
3131
#define PIDFD_INFO_CGROUPID (1UL << 0)
3232

3333
struct pidfd_info {
34-
__u64 request_mask;
34+
__u64 mask;
3535
__u64 cgroupid;
3636
__u32 pid;
3737
__u32 tgid;
@@ -148,7 +148,7 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen)
148148
int main(int argc, char **argv)
149149
{
150150
struct pidfd_info info = {
151-
.request_mask = PIDFD_INFO_CGROUPID,
151+
.mask = PIDFD_INFO_CGROUPID,
152152
};
153153
int pidfd = -1, ret = 1;
154154
pid_t pid;
@@ -227,7 +227,7 @@ int main(int argc, char **argv)
227227
getegid(), info.sgid);
228228
goto on_error;
229229
}
230-
if ((info.request_mask & PIDFD_INFO_CGROUPID) && info.cgroupid == 0) {
230+
if ((info.mask & PIDFD_INFO_CGROUPID) && info.cgroupid == 0) {
231231
ksft_print_msg("cgroupid should not be 0 when PIDFD_INFO_CGROUPID is set\n");
232232
goto on_error;
233233
}

0 commit comments

Comments
 (0)