|
| 1 | +========================================= |
| 2 | +user_events: User-based Event Tracing |
| 3 | +========================================= |
| 4 | + |
| 5 | +:Author: Beau Belgrave |
| 6 | + |
| 7 | +Overview |
| 8 | +-------- |
| 9 | +User based trace events allow user processes to create events and trace data |
| 10 | +that can be viewed via existing tools, such as ftrace, perf and eBPF. |
| 11 | +To enable this feature, build your kernel with CONFIG_USER_EVENTS=y. |
| 12 | + |
| 13 | +Programs can view the status of the events via |
| 14 | +/sys/kernel/debug/tracing/user_events_status and can both register and write |
| 15 | +data out via /sys/kernel/debug/tracing/user_events_data. |
| 16 | + |
| 17 | +Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and |
| 18 | +delete user based events via the u: prefix. The format of the command to |
| 19 | +dynamic_events is the same as the ioctl with the u: prefix applied. |
| 20 | + |
| 21 | +Typically programs will register a set of events that they wish to expose to |
| 22 | +tools that can read trace_events (such as ftrace and perf). The registration |
| 23 | +process gives back two ints to the program for each event. The first int is the |
| 24 | +status index. This index describes which byte in the |
| 25 | +/sys/kernel/debug/tracing/user_events_status file represents this event. The |
| 26 | +second int is the write index. This index identifies the event when write() or |
| 27 | +writev() is called on the /sys/kernel/debug/tracing/user_events_data file. |
| 28 | + |
| 29 | +The structures referenced in this document are contained within the |
| 30 | +/include/uapi/linux/user_events.h file in the source tree. |
| 31 | + |
| 32 | +**NOTE:** *Both user_events_status and user_events_data are under the tracefs |
| 33 | +filesystem and may be mounted at different paths than above.* |
| 34 | + |
| 35 | +Registering |
| 36 | +----------- |
| 37 | +Registering within a user process is done via ioctl() out to the |
| 38 | +/sys/kernel/debug/tracing/user_events_data file. The command to issue is |
| 39 | +DIAG_IOCSREG. |
| 40 | + |
| 41 | +This command takes a struct user_reg as an argument:: |
| 42 | + |
| 43 | + struct user_reg { |
| 44 | + u32 size; |
| 45 | + u64 name_args; |
| 46 | + u32 status_index; |
| 47 | + u32 write_index; |
| 48 | + }; |
| 49 | + |
| 50 | +The struct user_reg requires two inputs, the first is the size of the structure |
| 51 | +to ensure forward and backward compatibility. The second is the command string |
| 52 | +to issue for registering. Upon success two outputs are set, the status index |
| 53 | +and the write index. |
| 54 | + |
| 55 | +User based events show up under tracefs like any other event under the |
| 56 | +subsystem named "user_events". This means tools that wish to attach to the |
| 57 | +events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable |
| 58 | +or perf record -e user_events:[name] when attaching/recording. |
| 59 | + |
| 60 | +**NOTE:** *The write_index returned is only valid for the FD that was used to register.* |
| 61 | + |
| 62 | +Command Format |
| 63 | +^^^^^^^^^^^^^^ |
| 64 | +The command string format is as follows:: |
| 65 | + |
| 66 | + name[:FLAG1[,FLAG2...]] [Field1[;Field2...]] |
| 67 | + |
| 68 | +Supported Flags |
| 69 | +^^^^^^^^^^^^^^^ |
| 70 | +**BPF_ITER** - eBPF programs attached to this event will get the raw iovec |
| 71 | +struct instead of any data copies for max performance. |
| 72 | + |
| 73 | +Field Format |
| 74 | +^^^^^^^^^^^^ |
| 75 | +:: |
| 76 | + |
| 77 | + type name [size] |
| 78 | + |
| 79 | +Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc). |
| 80 | +User programs are encouraged to use clearly sized types like u32. |
| 81 | + |
| 82 | +**NOTE:** *Long is not supported since size can vary between user and kernel.* |
| 83 | + |
| 84 | +The size is only valid for types that start with a struct prefix. |
| 85 | +This allows user programs to describe custom structs out to tools, if required. |
| 86 | + |
| 87 | +For example, a struct in C that looks like this:: |
| 88 | + |
| 89 | + struct mytype { |
| 90 | + char data[20]; |
| 91 | + }; |
| 92 | + |
| 93 | +Would be represented by the following field:: |
| 94 | + |
| 95 | + struct mytype myname 20 |
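As a rough illustration of the command format above, the following C sketch joins an event name and a field list into a single command string. The helper name ``build_command``, the event name ``test`` and the field ``u32 count`` are assumptions made for this example only; they are not part of the user_events interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the kernel ABI): writes "name fields"
 * into buf. The fields argument is expected to already follow the
 * "Field1[;Field2...]" layout described above.
 * Returns 0 on success, or -1 if the result does not fit in buf.
 */
static int build_command(char *buf, size_t len, const char *name,
                         const char *fields)
{
        int n = snprintf(buf, len, "%s %s", name, fields);

        /* snprintf() reports the length it wanted; detect truncation. */
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string is what a program would point name_args at before issuing the register ioctl(); checking for truncation first avoids registering a partial field list.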
| 96 | + |
| 97 | +Deleting |
| 98 | +-------- |
| 99 | +Deleting an event from within a user process is done via ioctl() out to the |
| 100 | +/sys/kernel/debug/tracing/user_events_data file. The command to issue is |
| 101 | +DIAG_IOCSDEL. |
| 102 | + |
| 103 | +This command only requires a single string specifying the event to delete by |
| 104 | +its name. Delete will only succeed if there are no references left to the |
| 105 | +event (in both user and kernel space). For this reason, user programs should |
| 106 | +request deletes through a separate file from the one used for registration. |
| 107 | + |
| 108 | +Status |
| 109 | +------ |
| 110 | +When tools attach/record user based events the status of the event is updated |
| 111 | +in realtime. This allows user programs to only incur the cost of the write() or |
| 112 | +writev() calls when something is actively attached to the event. |
| 113 | + |
| 114 | +User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to |
| 115 | +check the status for each event that is registered. The byte to check in the |
| 116 | +file is given back after the register ioctl() via user_reg.status_index. |
| 117 | +Currently the size of user_events_status is a single page; however, custom |
| 118 | +kernel configurations can change this size to allow more user based events. In |
| 119 | +all cases the size of the file is a multiple of a page size. |
| 120 | + |
| 121 | +For example, if the register ioctl() gives back a status_index of 3 you would |
| 122 | +check byte 3 of the returned mmap data to see if anything is attached to that |
| 123 | +event. |
| 124 | + |
| 125 | +Administrators can easily check the status of all registered events by reading |
| 126 | +the user_events_status file directly via a terminal. The output is as follows:: |
| 127 | + |
| 128 | + Byte:Name [# Comments] |
| 129 | + ... |
| 130 | + |
| 131 | + Active: ActiveCount |
| 132 | + Busy: BusyCount |
| 133 | + Max: MaxCount |
| 134 | + |
| 135 | +For example, on a system that has a single event the output looks like this:: |
| 136 | + |
| 137 | + 1:test |
| 138 | + |
| 139 | + Active: 1 |
| 140 | + Busy: 0 |
| 141 | + Max: 4096 |
| 142 | + |
| 143 | +If a user enables the user event via ftrace, the output would change to this:: |
| 144 | + |
| 145 | + 1:test # Used by ftrace |
| 146 | + |
| 147 | + Active: 1 |
| 148 | + Busy: 1 |
| 149 | + Max: 4096 |
| 150 | + |
| 151 | +**NOTE:** *A status index of 0 will never be returned. This allows user |
| 152 | +programs to have an index that can be used on error cases.* |
| 153 | + |
| 154 | +Status Bits |
| 155 | +^^^^^^^^^^^ |
| 156 | +The byte being checked will be non-zero if anything is attached. Programs can |
| 157 | +check specific bits in the byte to see what mechanism has been attached. |
| 158 | + |
| 159 | +The following values are defined to aid in checking what has been attached: |
| 160 | + |
| 161 | +**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0). |
| 162 | + |
| 163 | +**EVENT_STATUS_PERF** - Bit set if perf/eBPF has been attached (Bit 1). |
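The status byte check described above can be sketched in C as follows. This is a minimal sketch, assuming the status page has already been mapped with mmap(); the helper names are hypothetical, and the two bit values are defined locally from the bit positions given in this document rather than taken from the header.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed from the text above: bit 0 is ftrace, bit 1 is perf/eBPF. */
#ifndef EVENT_STATUS_FTRACE
#define EVENT_STATUS_FTRACE (1 << 0)
#endif
#ifndef EVENT_STATUS_PERF
#define EVENT_STATUS_PERF (1 << 1)
#endif

/* Hypothetical helper: true if anything is attached to the event. */
static bool event_enabled(const unsigned char *status_page, size_t status_index)
{
        return status_page[status_index] != 0;
}

/* Hypothetical helper: true if ftrace is attached to the event. */
static bool event_used_by_ftrace(const unsigned char *status_page,
                                 size_t status_index)
{
        return (status_page[status_index] & EVENT_STATUS_FTRACE) != 0;
}

/* Hypothetical helper: true if perf/eBPF is attached to the event. */
static bool event_used_by_perf(const unsigned char *status_page,
                               size_t status_index)
{
        return (status_page[status_index] & EVENT_STATUS_PERF) != 0;
}
```

A program would typically call ``event_enabled()`` with the status_index returned at registration and skip the write() or writev() call entirely when it returns false.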
| 164 | + |
| 165 | +Writing Data |
| 166 | +------------ |
| 167 | +After registering an event the same fd that was used to register can be used |
| 168 | +to write an entry for that event. The write_index returned must be at the start |
| 169 | +of the data, then the remaining data is treated as the payload of the event. |
| 170 | + |
| 171 | +For example, if the write_index returned was 1 and the program wanted to write |
| 172 | +an int payload for the event, the data would have to be 8 bytes (2 ints) in |
| 173 | +size, with the first 4 bytes being equal to 1 and the last 4 bytes being equal |
| 174 | +to the value of the payload. |
| 175 | + |
| 176 | +In memory this would look like this:: |
| 177 | + |
| 178 | + int index; |
| 179 | + int payload; |
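The layout above can be assembled in a flat buffer before a single write() call. The following is a minimal sketch; ``pack_event`` is a hypothetical helper name, and the code assumes the index and payload are both plain ints as in the example above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copies write_index followed by the payload into buf.
 * Returns the number of bytes to pass to write(), or 0 if buf is too small.
 */
static size_t pack_event(unsigned char *buf, size_t len, int write_index,
                         int payload)
{
        size_t need = sizeof(write_index) + sizeof(payload);

        if (len < need)
                return 0;

        /* The write_index must come first, then the event payload. */
        memcpy(buf, &write_index, sizeof(write_index));
        memcpy(buf + sizeof(write_index), &payload, sizeof(payload));

        return need;
}
```

The returned length and buffer would then be handed to write() on the same fd that was used to register the event.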
| 180 | + |
| 181 | +User programs might have well known structs that they wish to emit as |
| 182 | +payloads. In those cases writev() can be used, with the first vector being |
| 183 | +the index and the following vector(s) being the actual event payload. |
| 184 | + |
| 185 | +For example, if I have a struct like this:: |
| 186 | + |
| 187 | + struct payload { |
| 188 | + int src; |
| 189 | + int dst; |
| 190 | + int flags; |
| 191 | + }; |
| 192 | + |
| 193 | +It's advised for user programs to do the following:: |
| 194 | + |
| 195 | + struct iovec io[2]; |
| 196 | + struct payload e; |
| 197 | + |
| 198 | + io[0].iov_base = &write_index; |
| 199 | + io[0].iov_len = sizeof(write_index); |
| 200 | + io[1].iov_base = &e; |
| 201 | + io[1].iov_len = sizeof(e); |
| 202 | + |
| 203 | + writev(fd, (const struct iovec*)io, 2); |
| 204 | + |
| 205 | +**NOTE:** *The write_index is not emitted out into the trace being recorded.* |
| 206 | + |
| 207 | +eBPF |
| 208 | +---- |
| 209 | +eBPF programs that attach to a user-based event tracepoint are given a pointer |
| 210 | +to a struct user_bpf_context. The bpf context contains the data type (which can |
| 211 | +be a user or kernel buffer, or can be a pointer to the iovec) and the data |
| 212 | +length that was emitted (minus the write_index). |
| 213 | + |
| 214 | +Example Code |
| 215 | +------------ |
| 216 | +See sample code in samples/user_events. |