Commit d579c46

Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:

 - User events are finally ready! After lots of collaboration between various parties, we finally locked down on a stable interface for user events that can also work with user space only tracing. This is implemented by telling the kernel (or user space library, but that part is user space only and not part of this patch set), where the variable is that the application uses to know if something is listening to the trace. There's also an interface to tell the kernel about these events, which will show up in the /sys/kernel/tracing/events/user_events/ directory, where it can be enabled. When it's enabled, the kernel will update the variable, to tell the application to start writing to the kernel. See https://lwn.net/Articles/927595/

 - Cleaned up the direct trampolines code to simplify arm64 addition of direct trampolines. Direct trampolines use the ftrace interface but instead of jumping to the ftrace trampoline, applications (mostly BPF) can register their own trampoline for performance reasons.

 - Some updates to the fprobe infrastructure. fprobes are more efficient than kprobes, as they do not need to save all the registers that kprobes on ftrace do. More work needs to be done before the fprobes will be exposed as dynamic events.

 - More updates to references to the obsolete path of /sys/kernel/debug/tracing for the new /sys/kernel/tracing path.

 - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer line by line instead of all at once. There are users in production kernels that have a large data dump that originally used printk() directly, but the data dump was larger than what printk() allowed as a single print. Using seq_buf() to do the printing fixes that.

 - Add /sys/kernel/tracing/touched_functions that shows all functions that were ever traced by ftrace or a direct trampoline. This is used for debugging issues where a traced function could have caused a crash by a bpf program or live patching.

 - Add a "fields" option that is similar to "raw" but outputs the fields of the events. It's easier to read by humans.

 - Some minor fixes and clean ups.

* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
  ring-buffer: Sync IRQ works before buffer destruction
  tracing: Add missing spaces in trace_print_hex_seq()
  ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
  recordmcount: Fix memory leaks in the uwrite function
  tracing/user_events: Limit max fault-in attempts
  tracing/user_events: Prevent same address and bit per process
  tracing/user_events: Ensure bit is cleared on unregister
  tracing/user_events: Ensure write index cannot be negative
  seq_buf: Add seq_buf_do_printk() helper
  tracing: Fix print_fields() for __dyn_loc/__rel_loc
  tracing/user_events: Set event filter_type from type
  ring-buffer: Clearly check null ptr returned by rb_set_head_page()
  tracing: Unbreak user events
  tracing/user_events: Use print_format_fields() for trace output
  tracing/user_events: Align structs with tabs for readability
  tracing/user_events: Limit global user_event count
  tracing/user_events: Charge event allocs to cgroups
  tracing/user_events: Update documentation for ABI
  tracing/user_events: Use write ABI in example
  tracing/user_events: Add ABI self-test
  ...
2 parents f20730e + 675751b commit d579c46

35 files changed

Lines changed: 1959 additions & 518 deletions

Documentation/trace/fprobe.rst

Lines changed: 12 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -87,14 +87,16 @@ returns as same as unregister_ftrace_function().
8787
The fprobe entry/exit handler
8888
=============================
8989

90-
The prototype of the entry/exit callback function is as follows:
90+
The prototype of the entry/exit callback function are as follows:
9191

9292
.. code-block:: c
9393
94-
void callback_func(struct fprobe *fp, unsigned long entry_ip, struct pt_regs *regs);
94+
int entry_callback(struct fprobe *fp, unsigned long entry_ip, struct pt_regs *regs, void *entry_data);
9595
96-
Note that both entry and exit callbacks have same ptototype. The @entry_ip is
97-
saved at function entry and passed to exit handler.
96+
void exit_callback(struct fprobe *fp, unsigned long entry_ip, struct pt_regs *regs, void *entry_data);
97+
98+
Note that the @entry_ip is saved at function entry and passed to exit handler.
99+
If the entry callback function returns !0, the corresponding exit callback will be cancelled.
98100

99101
@fp
100102
This is the address of `fprobe` data structure related to this handler.
@@ -113,6 +115,12 @@ saved at function entry and passed to exit handler.
113115
to use @entry_ip. On the other hand, in the exit_handler, the instruction
114116
pointer of @regs is set to the currect return address.
115117

118+
@entry_data
119+
This is a local storage to share the data between entry and exit handlers.
120+
This storage is NULL by default. If the user specify `exit_handler` field
121+
and `entry_data_size` field when registering the fprobe, the storage is
122+
allocated and passed to both `entry_handler` and `exit_handler`.
123+
116124
Share the callbacks with kprobes
117125
================================
118126
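The fprobe.rst text above describes two behaviours: `entry_data` is only allocated when both `exit_handler` and `entry_data_size` are set, and a non-zero return from the entry callback cancels the exit callback. The following is a minimal user-space model of those two rules, not kernel code. The `struct fprobe` and `struct pt_regs` definitions are cut-down stand-ins, and `struct call_record`, `record_entry()`, `record_exit()`, and `model_probed_call()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the kernel types; only the fields this model needs. */
struct pt_regs { unsigned long ip; };

struct fprobe {
	size_t entry_data_size;
	int (*entry_handler)(struct fprobe *fp, unsigned long entry_ip,
			     struct pt_regs *regs, void *entry_data);
	void (*exit_handler)(struct fprobe *fp, unsigned long entry_ip,
			     struct pt_regs *regs, void *entry_data);
};

/* Hypothetical per-call record shared between entry and exit handlers. */
struct call_record { unsigned long entry_ip; };

/* Last entry_ip observed by the exit handler (for demonstration only). */
unsigned long last_exit_ip;

int record_entry(struct fprobe *fp, unsigned long entry_ip,
		 struct pt_regs *regs, void *entry_data)
{
	struct call_record *rec = entry_data;

	(void)fp;
	(void)regs;
	if (!rec)
		return 1;	/* no storage: non-zero cancels the exit handler */
	rec->entry_ip = entry_ip;
	return 0;
}

void record_exit(struct fprobe *fp, unsigned long entry_ip,
		 struct pt_regs *regs, void *entry_data)
{
	struct call_record *rec = entry_data;

	(void)fp;
	(void)entry_ip;
	(void)regs;
	if (rec)
		last_exit_ip = rec->entry_ip;
}

/* Model one probed call. Returns 1 if the exit handler ran, else 0. */
int model_probed_call(struct fprobe *fp, unsigned long entry_ip)
{
	struct pt_regs regs = { .ip = entry_ip };
	void *data = NULL;
	int ran_exit = 0;

	assert(fp != NULL);
	/* Storage exists only when exit_handler and entry_data_size are set. */
	if (fp->exit_handler && fp->entry_data_size) {
		data = calloc(1, fp->entry_data_size);
		if (!data)
			return 0;
	}
	if (!fp->entry_handler ||
	    fp->entry_handler(fp, entry_ip, &regs, data) == 0) {
		if (fp->exit_handler) {
			fp->exit_handler(fp, entry_ip, &regs, data);
			ran_exit = 1;
		}
	}
	free(data);
	return ran_exit;
}
```

In this model, setting `entry_data_size` to `sizeof(struct call_record)` lets the exit handler see the value stored at entry, while leaving it at 0 makes `record_entry()` return non-zero so the exit handler is skipped.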

Documentation/trace/ftrace.rst

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1027,6 +1027,7 @@ To see what is available, simply cat the file::
10271027
nohex
10281028
nobin
10291029
noblock
1030+
nofields
10301031
trace_printk
10311032
annotate
10321033
nouserstacktrace
@@ -1110,6 +1111,11 @@ Here are the available options:
11101111
block
11111112
When set, reading trace_pipe will not block when polled.
11121113

1114+
fields
1115+
Print the fields as described by their types. This is a better
1116+
option than using hex, bin or raw, as it gives a better parsing
1117+
of the content of the event.
1118+
11131119
trace_printk
11141120
Can disable trace_printk() from writing into the buffer.
11151121

Documentation/trace/user_events.rst

Lines changed: 97 additions & 70 deletions
Original file line numberDiff line numberDiff line change
@@ -20,11 +20,10 @@ dynamic_events is the same as the ioctl with the u: prefix applied.
2020

2121
Typically programs will register a set of events that they wish to expose to
2222
tools that can read trace_events (such as ftrace and perf). The registration
23-
process gives back two ints to the program for each event. The first int is
24-
the status bit. This describes which bit in little-endian format in the
25-
/sys/kernel/tracing/user_events_status file represents this event. The
26-
second int is the write index which describes the data when a write() or
27-
writev() is called on the /sys/kernel/tracing/user_events_data file.
23+
process tells the kernel which address and bit to reflect if any tool has
24+
enabled the event and data should be written. The registration will give back
25+
a write index which describes the data when a write() or writev() is called
26+
on the /sys/kernel/tracing/user_events_data file.
2827

2928
The structures referenced in this document are contained within the
3029
/include/uapi/linux/user_events.h file in the source tree.
@@ -41,23 +40,64 @@ DIAG_IOCSREG.
4140
This command takes a packed struct user_reg as an argument::
4241

4342
struct user_reg {
44-
u32 size;
45-
u64 name_args;
46-
u32 status_bit;
47-
u32 write_index;
48-
};
43+
/* Input: Size of the user_reg structure being used */
44+
__u32 size;
45+
46+
/* Input: Bit in enable address to use */
47+
__u8 enable_bit;
48+
49+
/* Input: Enable size in bytes at address */
50+
__u8 enable_size;
51+
52+
/* Input: Flags for future use, set to 0 */
53+
__u16 flags;
54+
55+
/* Input: Address to update when enabled */
56+
__u64 enable_addr;
57+
58+
/* Input: Pointer to string with event name, description and flags */
59+
__u64 name_args;
60+
61+
/* Output: Index of the event to use when writing data */
62+
__u32 write_index;
63+
} __attribute__((__packed__));
64+
65+
The struct user_reg requires all the above inputs to be set appropriately.
66+
67+
+ size: This must be set to sizeof(struct user_reg).
4968

50-
The struct user_reg requires two inputs, the first is the size of the structure
51-
to ensure forward and backward compatibility. The second is the command string
52-
to issue for registering. Upon success two outputs are set, the status bit
53-
and the write index.
69+
+ enable_bit: The bit to reflect the event status at the address specified by
70+
enable_addr.
71+
72+
+ enable_size: The size of the value specified by enable_addr.
73+
This must be 4 (32-bit) or 8 (64-bit). 64-bit values are only allowed to be
74+
used on 64-bit kernels, however, 32-bit can be used on all kernels.
75+
76+
+ flags: The flags to use, if any. For the initial version this must be 0.
77+
Callers should first attempt to use flags and retry without flags to ensure
78+
support for lower versions of the kernel. If a flag is not supported -EINVAL
79+
is returned.
80+
81+
+ enable_addr: The address of the value to use to reflect event status. This
82+
must be naturally aligned and write accessible within the user program.
83+
84+
+ name_args: The name and arguments to describe the event, see command format
85+
for details.
86+
87+
Upon successful registration the following is set.
88+
89+
+ write_index: The index to use for this file descriptor that represents this
90+
event when writing out data. The index is unique to this instance of the file
91+
descriptor that was used for the registration. See writing data for details.
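As a rough illustration of the input rules listed above, the sketch below fills a registration request for a 32-bit enable word. It uses a local mirror of the documented layout so that it compiles without the kernel header; real programs would include `<linux/user_events.h>` and pass the real `struct user_reg` to the register ioctl(). The names `user_reg_mirror` and `fill_user_reg()` are hypothetical, and the check that the bit index fits inside the 32-bit word is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the documented struct user_reg layout (illustration only). */
struct user_reg_mirror {
	uint32_t size;		/* Input: sizeof the structure */
	uint8_t  enable_bit;	/* Input: bit in the enable word */
	uint8_t  enable_size;	/* Input: 4 or 8 bytes */
	uint16_t flags;		/* Input: must be 0 for the initial version */
	uint64_t enable_addr;	/* Input: address the kernel updates */
	uint64_t name_args;	/* Input: pointer to the command string */
	uint32_t write_index;	/* Output: filled in by the kernel */
} __attribute__((__packed__));

_Static_assert(sizeof(struct user_reg_mirror) == 28,
	       "packed layout should have no padding");

/*
 * Hypothetical helper: prepare the inputs for one event whose status is
 * reflected in bit `bit` of the 32-bit word at `enabled`.
 * Returns 0 on success, -1 if the arguments cannot be valid.
 */
int fill_user_reg(struct user_reg_mirror *reg, uint32_t *enabled,
		  unsigned int bit, const char *name_args)
{
	assert(reg != NULL);
	/* Assumption: the bit must fit in the word; the address must be
	 * naturally aligned, as the documentation requires. */
	if (!enabled || !name_args || bit >= 32 ||
	    ((uintptr_t)enabled % sizeof(*enabled)) != 0)
		return -1;

	memset(reg, 0, sizeof(*reg));
	reg->size = sizeof(*reg);
	reg->enable_bit = (uint8_t)bit;
	reg->enable_size = sizeof(*enabled);	/* 4: usable on all kernels */
	reg->flags = 0;
	reg->enable_addr = (uint64_t)(uintptr_t)enabled;
	reg->name_args = (uint64_t)(uintptr_t)name_args;
	return 0;
}
```

A 32-bit enable word is chosen here because the text above says 32-bit values can be used on all kernels, whereas 64-bit values are only allowed on 64-bit kernels.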
5492

5593
User based events show up under tracefs like any other event under the
5694
subsystem named "user_events". This means tools that wish to attach to the
5795
events need to use /sys/kernel/tracing/events/user_events/[name]/enable
5896
or perf record -e user_events:[name] when attaching/recording.
5997

60-
**NOTE:** *The write_index returned is only valid for the FD that was used*
98+
**NOTE:** The event subsystem name by default is "user_events". Callers should
99+
not assume it will always be "user_events". Operators reserve the right in the
100+
future to change the subsystem name per-process to accomodate event isolation.
61101

62102
Command Format
63103
^^^^^^^^^^^^^^
@@ -94,7 +134,7 @@ Would be represented by the following field::
94134
struct mytype myname 20
95135

96136
Deleting
97-
-----------
137+
--------
98138
Deleting an event from within a user process is done via ioctl() out to the
99139
/sys/kernel/tracing/user_events_data file. The command to issue is
100140
DIAG_IOCSDEL.
@@ -104,92 +144,79 @@ its name. Delete will only succeed if there are no references left to the
104144
event (in both user and kernel space). User programs should use a separate file
105145
to request deletes than the one used for registration due to this.
106146

107-
Status
108-
------
109-
When tools attach/record user based events the status of the event is updated
110-
in realtime. This allows user programs to only incur the cost of the write() or
111-
writev() calls when something is actively attached to the event.
112-
113-
User programs call mmap() on /sys/kernel/tracing/user_events_status to
114-
check the status for each event that is registered. The bit to check in the
115-
file is given back after the register ioctl() via user_reg.status_bit. The bit
116-
is always in little-endian format. Programs can check if the bit is set either
117-
using a byte-wise index with a mask or a long-wise index with a little-endian
118-
mask.
147+
Unregistering
148+
-------------
149+
If after registering an event it is no longer wanted to be updated then it can
150+
be disabled via ioctl() out to the /sys/kernel/tracing/user_events_data file.
151+
The command to issue is DIAG_IOCSUNREG. This is different than deleting, where
152+
deleting actually removes the event from the system. Unregistering simply tells
153+
the kernel your process is no longer interested in updates to the event.
119154

120-
Currently the size of user_events_status is a single page, however, custom
121-
kernel configurations can change this size to allow more user based events. In
122-
all cases the size of the file is a multiple of a page size.
155+
This command takes a packed struct user_unreg as an argument::
123156

124-
For example, if the register ioctl() gives back a status_bit of 3 you would
125-
check byte 0 (3 / 8) of the returned mmap data and then AND the result with 8
126-
(1 << (3 % 8)) to see if anything is attached to that event.
157+
struct user_unreg {
158+
/* Input: Size of the user_unreg structure being used */
159+
__u32 size;
127160

128-
A byte-wise index check is performed as follows::
161+
/* Input: Bit to unregister */
162+
__u8 disable_bit;
129163

130-
int index, mask;
131-
char *status_page;
164+
/* Input: Reserved, set to 0 */
165+
__u8 __reserved;
132166

133-
index = status_bit / 8;
134-
mask = 1 << (status_bit % 8);
135-
136-
...
167+
/* Input: Reserved, set to 0 */
168+
__u16 __reserved2;
137169

138-
if (status_page[index] & mask) {
139-
/* Enabled */
140-
}
170+
/* Input: Address to unregister */
171+
__u64 disable_addr;
172+
} __attribute__((__packed__));
141173

142-
A long-wise index check is performed as follows::
174+
The struct user_unreg requires all the above inputs to be set appropriately.
143175

144-
#include <asm/bitsperlong.h>
145-
#include <endian.h>
176+
+ size: This must be set to sizeof(struct user_unreg).
146177

147-
#if __BITS_PER_LONG == 64
148-
#define endian_swap(x) htole64(x)
149-
#else
150-
#define endian_swap(x) htole32(x)
151-
#endif
178+
+ disable_bit: This must be set to the bit to disable (same bit that was
179+
previously registered via enable_bit).
152180

153-
long index, mask, *status_page;
181+
+ disable_addr: This must be set to the address to disable (same address that was
182+
previously registered via enable_addr).
154183

155-
index = status_bit / __BITS_PER_LONG;
156-
mask = 1L << (status_bit % __BITS_PER_LONG);
157-
mask = endian_swap(mask);
184+
**NOTE:** Events are automatically unregistered when execve() is invoked. During
185+
fork() the registered events will be retained and must be unregistered manually
186+
in each process if wanted.
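The unregister request mirrors the registration: the same address and bit are passed back. The sketch below fills such a request using a local mirror of the documented layout; real programs would use `struct user_unreg` from `<linux/user_events.h>` with the DIAG_IOCSUNREG ioctl(). The names `user_unreg_mirror` and `fill_user_unreg()` are hypothetical, and the 32-bit enable word is an assumption carried over from how the event was registered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the documented struct user_unreg layout (illustration only). */
struct user_unreg_mirror {
	uint32_t size;		/* Input: sizeof the structure */
	uint8_t  disable_bit;	/* Input: bit that was registered */
	uint8_t  reserved;	/* Input: reserved, set to 0 */
	uint16_t reserved2;	/* Input: reserved, set to 0 */
	uint64_t disable_addr;	/* Input: address that was registered */
} __attribute__((__packed__));

_Static_assert(sizeof(struct user_unreg_mirror) == 16,
	       "packed layout should have no padding");

/*
 * Hypothetical helper: prepare an unregister request for the bit and
 * address previously registered via enable_bit and enable_addr.
 * Returns 0 on success, -1 if the arguments cannot be valid.
 */
int fill_user_unreg(struct user_unreg_mirror *unreg, const uint32_t *enabled,
		    unsigned int bit)
{
	assert(unreg != NULL);
	if (!enabled || bit >= 32)
		return -1;

	/* memset also zeroes both reserved fields. */
	memset(unreg, 0, sizeof(*unreg));
	unreg->size = sizeof(*unreg);
	unreg->disable_bit = (uint8_t)bit;
	unreg->disable_addr = (uint64_t)(uintptr_t)enabled;
	return 0;
}
```

Because registrations are retained across fork(), a child that does not want updates would need to issue its own request built this way.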
158187

159-
...
188+
Status
189+
------
190+
When tools attach/record user based events the status of the event is updated
191+
in realtime. This allows user programs to only incur the cost of the write() or
192+
writev() calls when something is actively attached to the event.
160193

161-
if (status_page[index] & mask) {
162-
/* Enabled */
163-
}
194+
The kernel will update the specified bit that was registered for the event as
195+
tools attach/detach from the event. User programs simply check if the bit is set
196+
to see if something is attached or not.
164197

165198
Administrators can easily check the status of all registered events by reading
166199
the user_events_status file directly via a terminal. The output is as follows::
167200

168-
Byte:Name [# Comments]
201+
Name [# Comments]
169202
...
170203

171204
Active: ActiveCount
172205
Busy: BusyCount
173-
Max: MaxCount
174206

175207
For example, on a system that has a single event the output looks like this::
176208

177-
1:test
209+
test
178210

179211
Active: 1
180212
Busy: 0
181-
Max: 32768
182213

183214
If a user enables the user event via ftrace, the output would change to this::
184215

185-
1:test # Used by ftrace
216+
test # Used by ftrace
186217

187218
Active: 1
188219
Busy: 1
189-
Max: 32768
190-
191-
**NOTE:** *A status bit of 0 will never be returned. This allows user programs
192-
to have a bit that can be used on error cases.*
193220

194221
Writing Data
195222
------------
@@ -217,7 +244,7 @@ For example, if I have a struct like this::
217244
int src;
218245
int dst;
219246
int flags;
220-
};
247+
} __attribute__((__packed__));
221248

222249
It's advised for user programs to do the following::
223250

fs/exec.c

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -65,6 +65,7 @@
6565
#include <linux/syscall_user_dispatch.h>
6666
#include <linux/coredump.h>
6767
#include <linux/time_namespace.h>
68+
#include <linux/user_events.h>
6869

6970
#include <linux/uaccess.h>
7071
#include <asm/mmu_context.h>
@@ -1859,6 +1860,7 @@ static int bprm_execve(struct linux_binprm *bprm,
18591860
current->fs->in_exec = 0;
18601861
current->in_execve = 0;
18611862
rseq_execve(current);
1863+
user_events_execve(current);
18621864
acct_update_integrals(current);
18631865
task_numa_free(current, false);
18641866
return retval;

include/linux/fprobe.h

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,8 @@
1313
* @nmissed: The counter for missing events.
1414
* @flags: The status flag.
1515
* @rethook: The rethook data structure. (internal data)
16+
* @entry_data_size: The private data storage size.
17+
* @nr_maxactive: The max number of active functions.
1618
* @entry_handler: The callback function for function entry.
1719
* @exit_handler: The callback function for function exit.
1820
*/
@@ -29,9 +31,13 @@ struct fprobe {
2931
unsigned long nmissed;
3032
unsigned int flags;
3133
struct rethook *rethook;
34+
size_t entry_data_size;
35+
int nr_maxactive;
3236

33-
void (*entry_handler)(struct fprobe *fp, unsigned long entry_ip, struct pt_regs *regs);
34-
void (*exit_handler)(struct fprobe *fp, unsigned long entry_ip, struct pt_regs *regs);
37+
int (*entry_handler)(struct fprobe *fp, unsigned long entry_ip,
38+
struct pt_regs *regs, void *entry_data);
39+
void (*exit_handler)(struct fprobe *fp, unsigned long entry_ip,
40+
struct pt_regs *regs, void *entry_data);
3541
};
3642

3743
/* This fprobe is soft-disabled. */

include/linux/ftrace.h

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -548,6 +548,7 @@ bool is_ftrace_trampoline(unsigned long addr);
548548
* DIRECT - there is a direct function to call
549549
* CALL_OPS - the record can use callsite-specific ops
550550
* CALL_OPS_EN - the function is set up to use callsite-specific ops
551+
* TOUCHED - A callback was added since boot up
551552
*
552553
* When a new ftrace_ops is registered and wants a function to save
553554
* pt_regs, the rec->flags REGS is set. When the function has been
@@ -567,9 +568,10 @@ enum {
567568
FTRACE_FL_DIRECT_EN = (1UL << 23),
568569
FTRACE_FL_CALL_OPS = (1UL << 22),
569570
FTRACE_FL_CALL_OPS_EN = (1UL << 21),
571+
FTRACE_FL_TOUCHED = (1UL << 20),
570572
};
571573

572-
#define FTRACE_REF_MAX_SHIFT 21
574+
#define FTRACE_REF_MAX_SHIFT 20
573575
#define FTRACE_REF_MAX ((1UL << FTRACE_REF_MAX_SHIFT) - 1)
574576

575577
#define ftrace_rec_count(rec) ((rec)->flags & FTRACE_REF_MAX)
@@ -628,6 +630,7 @@ enum {
628630
FTRACE_ITER_PROBE = (1 << 4),
629631
FTRACE_ITER_MOD = (1 << 5),
630632
FTRACE_ITER_ENABLED = (1 << 6),
633+
FTRACE_ITER_TOUCHED = (1 << 7),
631634
};
632635

633636
void arch_ftrace_update_code(int command);
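The include/linux/ftrace.h hunks above add FTRACE_FL_TOUCHED at bit 20 and lower FTRACE_REF_MAX_SHIFT from 21 to 20. One way to read that pairing is that the reference count lives in the low bits of `rec->flags`, so the count mask has to shrink to make room for the new flag. The stand-alone sketch below copies the relevant constants from the diff to show that the two no longer overlap; `rec_count()`, `rec_touched()`, and `rec_with_count()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Constants copied from the diff above. */
#define FTRACE_FL_CALL_OPS_EN	(1UL << 21)
#define FTRACE_FL_TOUCHED	(1UL << 20)
#define FTRACE_REF_MAX_SHIFT	20
#define FTRACE_REF_MAX		((1UL << FTRACE_REF_MAX_SHIFT) - 1)

/* With the old shift of 21, the count mask would have covered bit 20. */
_Static_assert((((1UL << 21) - 1) & FTRACE_FL_TOUCHED) != 0,
	       "old mask overlaps the TOUCHED flag");
_Static_assert((FTRACE_REF_MAX & FTRACE_FL_TOUCHED) == 0,
	       "new mask leaves the TOUCHED flag free");

/* Same expression as ftrace_rec_count(): the low bits hold the count. */
unsigned long rec_count(unsigned long flags)
{
	return flags & FTRACE_REF_MAX;
}

/* Non-zero when the TOUCHED flag is set. */
int rec_touched(unsigned long flags)
{
	return (flags & FTRACE_FL_TOUCHED) != 0;
}

/* Replace the count bits while leaving every flag bit untouched. */
unsigned long rec_with_count(unsigned long flags, unsigned long count)
{
	assert(count <= FTRACE_REF_MAX);
	return (flags & ~FTRACE_REF_MAX) | count;
}
```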

include/linux/sched.h

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,7 @@ struct sighand_struct;
7070
struct signal_struct;
7171
struct task_delay_info;
7272
struct task_group;
73+
struct user_event_mm;
7374

7475
/*
7576
* Task state bitmask. NOTE! These bits are also
@@ -1529,6 +1530,10 @@ struct task_struct {
15291530
union rv_task_monitor rv[RV_PER_TASK_MONITORS];
15301531
#endif
15311532

1533+
#ifdef CONFIG_USER_EVENTS
1534+
struct user_event_mm *user_event_mm;
1535+
#endif
1536+
15321537
/*
15331538
* New fields for task_struct should be added above here, so that
15341539
* they are included in the randomized portion of task_struct.

include/linux/seq_buf.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -159,4 +159,6 @@ extern int
159159
seq_buf_bprintf(struct seq_buf *s, const char *fmt, const u32 *binary);
160160
#endif
161161

162+
void seq_buf_do_printk(struct seq_buf *s, const char *lvl);
163+
162164
#endif /* _LINUX_SEQ_BUF_H */
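The commit message describes seq_buf_do_printk() as printing a large buffer line by line rather than in a single call. The kernel implementation is not part of the hunks shown here, so the following is only a user-space sketch of that idea: it walks a buffer, emits each newline-terminated piece with a level prefix, and flushes any trailing text. `emit_lines()` is a hypothetical name, and `fprintf()` stands in for printk().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Emit `len` bytes of `buf` one line at a time, prefixing each line with
 * `lvl`. Returns the number of lines written.
 */
size_t emit_lines(FILE *out, const char *lvl, const char *buf, size_t len)
{
	size_t start = 0, lines = 0;

	assert(out != NULL && lvl != NULL && buf != NULL);
	for (size_t i = 0; i < len; i++) {
		if (buf[i] == '\n') {
			/* Print the bytes since the previous newline. */
			fprintf(out, "%s%.*s\n", lvl, (int)(i - start),
				buf + start);
			start = i + 1;
			lines++;
		}
	}
	/* Trailing text without a final newline still gets its own line. */
	if (start < len) {
		fprintf(out, "%s%.*s\n", lvl, (int)(len - start), buf + start);
		lines++;
	}
	return lines;
}
```

Splitting on newlines keeps every individual print short, which is the property the commit message relies on to get around the size limit of a single printk() call.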
