Commit 27dc2ae

beaubelgrave authored and rostedt committed
tracing/user_events: Update documentation for ABI
The ABI for user_events has changed from mmap() based to remote writes.
Update the documentation to reflect these changes, add new section for
unregistering events since lifetime is now tied to tasks instead of files.

Link: https://lkml.kernel.org/r/20230328235219.203-10-beaub@linux.microsoft.com

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
1 parent 9211dda commit 27dc2ae

1 file changed

Lines changed: 97 additions & 70 deletions

Documentation/trace/user_events.rst

@@ -20,11 +20,10 @@ dynamic_events is the same as the ioctl with the u: prefix applied.
2020

2121
Typically programs will register a set of events that they wish to expose to
2222
tools that can read trace_events (such as ftrace and perf). The registration
23-
process gives back two ints to the program for each event. The first int is
24-
the status bit. This describes which bit in little-endian format in the
25-
/sys/kernel/tracing/user_events_status file represents this event. The
26-
second int is the write index which describes the data when a write() or
27-
writev() is called on the /sys/kernel/tracing/user_events_data file.
23+
process tells the kernel which address and bit to reflect if any tool has
24+
enabled the event and data should be written. The registration will give back
25+
a write index which describes the data when a write() or writev() is called
26+
on the /sys/kernel/tracing/user_events_data file.
2827

2928
The structures referenced in this document are contained within the
3029
/include/uapi/linux/user_events.h file in the source tree.
@@ -41,23 +40,64 @@ DIAG_IOCSREG.
4140
This command takes a packed struct user_reg as an argument::
4241

4342
struct user_reg {
44-
u32 size;
45-
u64 name_args;
46-
u32 status_bit;
47-
u32 write_index;
48-
};
43+
/* Input: Size of the user_reg structure being used */
44+
__u32 size;
45+
46+
/* Input: Bit in enable address to use */
47+
__u8 enable_bit;
48+
49+
/* Input: Enable size in bytes at address */
50+
__u8 enable_size;
51+
52+
/* Input: Flags for future use, set to 0 */
53+
__u16 flags;
54+
55+
/* Input: Address to update when enabled */
56+
__u64 enable_addr;
57+
58+
/* Input: Pointer to string with event name, description and flags */
59+
__u64 name_args;
60+
61+
/* Output: Index of the event to use when writing data */
62+
__u32 write_index;
63+
} __attribute__((__packed__));
64+
65+
The struct user_reg requires all the above inputs to be set appropriately.
66+
67+
+ size: This must be set to sizeof(struct user_reg).
4968

50-
The struct user_reg requires two inputs, the first is the size of the structure
51-
to ensure forward and backward compatibility. The second is the command string
52-
to issue for registering. Upon success two outputs are set, the status bit
53-
and the write index.
69+
+ enable_bit: The bit to reflect the event status at the address specified by
70+
enable_addr.
71+
72+
+ enable_size: The size of the value specified by enable_addr.
73+
This must be 4 (32-bit) or 8 (64-bit). 64-bit values are only allowed to be
74+
used on 64-bit kernels, however, 32-bit can be used on all kernels.
75+
76+
+ flags: The flags to use, if any. For the initial version this must be 0.
77+
Callers should first attempt to use flags and retry without flags to ensure
78+
support for lower versions of the kernel. If a flag is not supported -EINVAL
79+
is returned.
80+
81+
+ enable_addr: The address of the value to use to reflect event status. This
82+
must be naturally aligned and write accessible within the user program.
83+
84+
+ name_args: The name and arguments to describe the event, see command format
85+
for details.
86+
87+
Upon successful registration the following is set.
88+
89+
+ write_index: The index to use for this file descriptor that represents this
90+
event when writing out data. The index is unique to this instance of the file
91+
descriptor that was used for the registration. See writing data for details.
5492

5593
User based events show up under tracefs like any other event under the
5694
subsystem named "user_events". This means tools that wish to attach to the
5795
events need to use /sys/kernel/tracing/events/user_events/[name]/enable
5896
or perf record -e user_events:[name] when attaching/recording.
5997

60-
**NOTE:** *The write_index returned is only valid for the FD that was used*
98+
**NOTE:** The event subsystem name by default is "user_events". Callers should
99+
not assume it will always be "user_events". Operators reserve the right in the
100+
future to change the subsystem name per-process to accommodate event isolation.
61101

62102
Command Format
63103
^^^^^^^^^^^^^^
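The registration flow described in the hunk above could be sketched roughly as follows. This is only an illustration, not code from the commit: the helper names `fill_user_reg` and `register_event` are hypothetical, the struct is a local copy of the layout shown in the diff, and the `DIAG_IOCSREG` definition is an assumption meant to mirror include/uapi/linux/user_events.h (real programs should include that header instead). The event string and the choice of bit 31 are arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Local copy of the layout shown in the diff. Real code should take this
 * from <linux/user_events.h> rather than redefining it. */
struct user_reg {
	__u32 size;        /* Input: sizeof(struct user_reg) */
	__u8  enable_bit;  /* Input: bit in the enable value to use */
	__u8  enable_size; /* Input: 4 or 8 bytes */
	__u16 flags;       /* Input: must be 0 for the initial version */
	__u64 enable_addr; /* Input: address the kernel updates */
	__u64 name_args;   /* Input: pointer to the command string */
	__u32 write_index; /* Output: index to use when writing data */
} __attribute__((__packed__));

/* Assumption: intended to mirror the definitions in the uapi header. */
#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSREG _IOWR(DIAG_IOC_MAGIC, 0, struct user_reg *)

/* Hypothetical helper: fill in every input field for a 32-bit enable value. */
static void fill_user_reg(struct user_reg *reg, const char *name_args,
			  uint32_t *enable_word, uint8_t bit)
{
	/* The bit must fit in the value, and the value must be naturally aligned. */
	assert(bit < 8 * sizeof(*enable_word));
	assert(((uintptr_t)enable_word % sizeof(*enable_word)) == 0);

	memset(reg, 0, sizeof(*reg));
	reg->size = sizeof(*reg);
	reg->enable_bit = bit;
	reg->enable_size = sizeof(*enable_word);
	reg->flags = 0;
	reg->enable_addr = (__u64)(uintptr_t)enable_word;
	reg->name_args = (__u64)(uintptr_t)name_args;
}

/* Hypothetical helper: data_fd is an open descriptor for
 * /sys/kernel/tracing/user_events_data. Returns the write index on
 * success or -1 on failure (errno is set by ioctl()). */
static int register_event(int data_fd, const char *name_args,
			  uint32_t *enable_word, uint8_t bit)
{
	struct user_reg reg;

	fill_user_reg(&reg, name_args, enable_word, bit);

	if (ioctl(data_fd, DIAG_IOCSREG, &reg) == -1)
		return -1;

	return (int)reg.write_index;
}
```

A caller would typically open the user_events_data file, keep `enable_word` alive for as long as the event stays registered, and call something like `register_event(fd, "test u32 count", &enable_word, 31)`. The returned index is only meaningful for the descriptor used in that call.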
@@ -94,7 +134,7 @@ Would be represented by the following field::
94134
struct mytype myname 20
95135

96136
Deleting
97-
-----------
137+
--------
98138
Deleting an event from within a user process is done via ioctl() out to the
99139
/sys/kernel/tracing/user_events_data file. The command to issue is
100140
DIAG_IOCSDEL.
@@ -104,92 +144,79 @@ its name. Delete will only succeed if there are no references left to the
104144
event (in both user and kernel space). User programs should use a separate file
105145
to request deletes than the one used for registration due to this.
106146

107-
Status
108-
------
109-
When tools attach/record user based events the status of the event is updated
110-
in realtime. This allows user programs to only incur the cost of the write() or
111-
writev() calls when something is actively attached to the event.
112-
113-
User programs call mmap() on /sys/kernel/tracing/user_events_status to
114-
check the status for each event that is registered. The bit to check in the
115-
file is given back after the register ioctl() via user_reg.status_bit. The bit
116-
is always in little-endian format. Programs can check if the bit is set either
117-
using a byte-wise index with a mask or a long-wise index with a little-endian
118-
mask.
147+
Unregistering
148+
-------------
149+
If an event no longer needs to be updated after it has been registered, it can
150+
be disabled via ioctl() out to the /sys/kernel/tracing/user_events_data file.
151+
The command to issue is DIAG_IOCSUNREG. This is different than deleting, where
152+
deleting actually removes the event from the system. Unregistering simply tells
153+
the kernel your process is no longer interested in updates to the event.
119154

120-
Currently the size of user_events_status is a single page, however, custom
121-
kernel configurations can change this size to allow more user based events. In
122-
all cases the size of the file is a multiple of a page size.
155+
This command takes a packed struct user_unreg as an argument::
123156

124-
For example, if the register ioctl() gives back a status_bit of 3 you would
125-
check byte 0 (3 / 8) of the returned mmap data and then AND the result with 8
126-
(1 << (3 % 8)) to see if anything is attached to that event.
157+
struct user_unreg {
158+
/* Input: Size of the user_unreg structure being used */
159+
__u32 size;
127160

128-
A byte-wise index check is performed as follows::
161+
/* Input: Bit to unregister */
162+
__u8 disable_bit;
129163

130-
int index, mask;
131-
char *status_page;
164+
/* Input: Reserved, set to 0 */
165+
__u8 __reserved;
132166

133-
index = status_bit / 8;
134-
mask = 1 << (status_bit % 8);
135-
136-
...
167+
/* Input: Reserved, set to 0 */
168+
__u16 __reserved2;
137169

138-
if (status_page[index] & mask) {
139-
/* Enabled */
140-
}
170+
/* Input: Address to unregister */
171+
__u64 disable_addr;
172+
} __attribute__((__packed__));
141173

142-
A long-wise index check is performed as follows::
174+
The struct user_unreg requires all the above inputs to be set appropriately.
143175

144-
#include <asm/bitsperlong.h>
145-
#include <endian.h>
176+
+ size: This must be set to sizeof(struct user_unreg).
146177

147-
#if __BITS_PER_LONG == 64
148-
#define endian_swap(x) htole64(x)
149-
#else
150-
#define endian_swap(x) htole32(x)
151-
#endif
178+
+ disable_bit: This must be set to the bit to disable (same bit that was
179+
previously registered via enable_bit).
152180

153-
long index, mask, *status_page;
181+
+ disable_addr: This must be set to the address to disable (same address that was
182+
previously registered via enable_addr).
154183

155-
index = status_bit / __BITS_PER_LONG;
156-
mask = 1L << (status_bit % __BITS_PER_LONG);
157-
mask = endian_swap(mask);
184+
**NOTE:** Events are automatically unregistered when execve() is invoked. During
185+
fork() the registered events will be retained and must be unregistered manually
186+
in each process if wanted.
158187

159-
...
188+
Status
189+
------
190+
When tools attach/record user based events the status of the event is updated
191+
in realtime. This allows user programs to only incur the cost of the write() or
192+
writev() calls when something is actively attached to the event.
160193

161-
if (status_page[index] & mask) {
162-
/* Enabled */
163-
}
194+
The kernel will update the specified bit that was registered for the event as
195+
tools attach/detach from the event. User programs simply check if the bit is set
196+
to see if something is attached or not.
164197

165198
Administrators can easily check the status of all registered events by reading
166199
the user_events_status file directly via a terminal. The output is as follows::
167200

168-
Byte:Name [# Comments]
201+
Name [# Comments]
169202
...
170203

171204
Active: ActiveCount
172205
Busy: BusyCount
173-
Max: MaxCount
174206

175207
For example, on a system that has a single event the output looks like this::
176208

177-
1:test
209+
test
178210

179211
Active: 1
180212
Busy: 0
181-
Max: 32768
182213

183214
If a user enables the user event via ftrace, the output would change to this::
184215

185-
1:test # Used by ftrace
216+
test # Used by ftrace
186217

187218
Active: 1
188219
Busy: 1
189-
Max: 32768
190-
191-
**NOTE:** *A status bit of 0 will never be returned. This allows user programs
192-
to have a bit that can be used on error cases.*
193220

194221
Writing Data
195222
------------
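The status check and the unregister call described in the hunk above could look roughly like the sketch below. It is not taken from the commit: `event_enabled`, `fill_user_unreg` and `unregister_event` are hypothetical helper names, the struct is a local copy of the layout in the diff, and the `DIAG_IOCSUNREG` definition is an assumption meant to mirror include/uapi/linux/user_events.h. It assumes the event was registered with a 32-bit enable value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Local copy of the layout shown in the diff. Real code should take this
 * from <linux/user_events.h> rather than redefining it. */
struct user_unreg {
	__u32 size;         /* Input: sizeof(struct user_unreg) */
	__u8  disable_bit;  /* Input: bit that was registered */
	__u8  __reserved;   /* Input: reserved, set to 0 */
	__u16 __reserved2;  /* Input: reserved, set to 0 */
	__u64 disable_addr; /* Input: address that was registered */
} __attribute__((__packed__));

/* Assumption: intended to mirror the definitions in the uapi header. */
#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSUNREG _IOW(DIAG_IOC_MAGIC, 2, struct user_unreg *)

/* Returns nonzero when the registered bit is set, meaning a tool is
 * attached. The read is volatile because the kernel changes the value
 * from outside the program's normal control flow. */
static int event_enabled(const volatile uint32_t *enable_word, uint8_t bit)
{
	return (*enable_word & (UINT32_C(1) << bit)) != 0;
}

/* Hypothetical helper: fill in the inputs with the same address and bit
 * that were passed at registration time. */
static void fill_user_unreg(struct user_unreg *unreg, uint32_t *enable_word,
			    uint8_t bit)
{
	assert(bit < 8 * sizeof(*enable_word));

	memset(unreg, 0, sizeof(*unreg)); /* also zeroes the reserved fields */
	unreg->size = sizeof(*unreg);
	unreg->disable_bit = bit;
	unreg->disable_addr = (__u64)(uintptr_t)enable_word;
}

/* Hypothetical helper: data_fd is an open descriptor for
 * /sys/kernel/tracing/user_events_data. Returns 0 on success or -1 on
 * failure (errno is set by ioctl()). */
static int unregister_event(int data_fd, uint32_t *enable_word, uint8_t bit)
{
	struct user_unreg unreg;

	fill_user_unreg(&unreg, enable_word, bit);

	return ioctl(data_fd, DIAG_IOCSUNREG, &unreg) == -1 ? -1 : 0;
}
```

A program would usually guard each write() or writev() with `event_enabled()` so that it only pays for the system call while something is attached, and would call `unregister_event()` before the memory behind `enable_word` goes away.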
@@ -217,7 +244,7 @@ For example, if I have a struct like this::
217244
int src;
218245
int dst;
219246
int flags;
220-
};
247+
} __attribute__((__packed__));
221248

222249
It's advised for user programs to do the following::
223250
