| 1 | +perf-arm-spe(1)
| 2 | +================
| 3 | + |
| 4 | +NAME |
| 5 | +---- |
| 6 | +perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools |
| 7 | + |
| 8 | +SYNOPSIS |
| 9 | +-------- |
| 10 | +[verse] |
| 11 | +'perf record' -e arm_spe// |
| 12 | + |
| 13 | +DESCRIPTION |
| 14 | +----------- |
| 15 | + |
| 16 | +The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and |
| 17 | +events down to individual instructions. Rather than being interrupt-driven, it picks an
| 18 | +instruction to sample and then captures data for it during execution. Data includes execution time |
| 19 | +in cycles. For loads and stores it also includes data address, cache miss events, and data origin. |
| 20 | + |
| 21 | +The sampling has 5 stages: |
| 22 | + |
| 23 | + 1. Choose an operation |
| 24 | + 2. Collect data about the operation |
| 25 | + 3. Optionally discard the record based on a filter |
| 26 | + 4. Write the record to memory |
| 27 | + 5. Interrupt when the buffer is full |
| 28 | + |
| 29 | +Choose an operation |
| 30 | +~~~~~~~~~~~~~~~~~~~ |
| 31 | + |
| 32 | +The operation is chosen from a sample population; for SPE this is an IMPLEMENTATION DEFINED choice of all
| 33 | +architectural instructions or all micro-ops. Sampling happens at a programmable interval. The |
| 34 | +architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should |
| 35 | +sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random |
| 36 | +perturbation is also added to the sampling interval by default. |
| 37 | + |
| 38 | +Collect data about the operation |
| 39 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 40 | + |
| 41 | +Program counter, PMU events, timings and data addresses related to the operation are recorded. |
| 42 | +Sampling ensures that only one sampled operation is in flight at a time.
| 43 | + |
| 44 | +Optionally discard the record based on a filter |
| 45 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 46 | + |
| 47 | +Based on programmable criteria, choose whether to keep the record or discard it. If the record is |
| 48 | +discarded then the flow stops here for this sample. |
| 49 | + |
| 50 | +Write the record to memory |
| 51 | +~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 52 | + |
| 53 | +The record is appended to a memory buffer.
| 54 | + |
| 55 | +Interrupt when the buffer is full |
| 56 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 57 | + |
| 58 | +When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records. |
| 59 | +Perf saves the raw data in the perf.data file. |
| 60 | + |
| 61 | +Opening the file |
| 62 | +---------------- |
| 63 | + |
| 64 | +Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the |
| 65 | +recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding |
| 66 | +the data, Perf generates "synthetic samples" as if these were generated at the time of the |
| 67 | +recording. These samples are the same as if normal sampling was done by Perf without using SPE, |
| 68 | +although they may have more attributes associated with them. For example a normal sample may have |
| 69 | +just the instruction pointer, but an SPE sample can have data addresses and latency attributes. |
| 70 | + |
| 71 | +Why Sampling? |
| 72 | +------------- |
| 73 | + |
| 74 | + - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for |
| 75 | + hardware. Only one sampled operation is in flight at a time. |
| 76 | + |
| 77 | + - Allows precise attribution data, including the full PC of the instruction and the virtual and
| 78 | + physical data addresses.
| 79 | + |
| 80 | + - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source |
| 81 | + indicates which particular cache was hit, but the meaning is implementation defined because |
| 82 | + different implementations can have different cache configurations.) |
| 83 | + |
| 84 | +However, SPE does not provide any call-graph information, and relies on statistical methods. |
| 85 | + |
| 86 | +Collisions |
| 87 | +---------- |
| 88 | + |
| 89 | +When an operation is sampled while a previous sampled operation has not finished, a collision |
| 90 | +occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate |
| 91 | +should be set to avoid collisions. |
| 92 | + |
| 93 | +The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
| 94 | +count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number
| 95 | +of dropped samples that would have made it through the filter. It can still serve as a rough
| 96 | +guide.
| 97 | + |
| 98 | +The effect of microarchitectural sampling |
| 99 | +----------------------------------------- |
| 100 | + |
| 101 | +If an implementation samples micro-operations instead of instructions, the results of sampling must |
| 102 | +be weighted accordingly. |
| 103 | + |
| 104 | +For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it |
| 105 | +becomes twice as likely to appear in the sample population. |
| 106 | + |
| 107 | +The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be |
| 108 | +estimated from the 'sample_pop' and 'inst_retired' PMU events. |
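As a rough illustration of that estimate, the ratio of the two counters gives the average number of operations in the sample population per retired instruction. This is only a sketch: the counter values below are made-up placeholders, not measurements, and real values would have to come from the 'sample_pop' and 'inst_retired' PMU events collected over the same run.

```shell
# Hypothetical counter values (assumptions for illustration only).
sample_pop=1500000
inst_retired=1000000

# Average number of operations in the sample population per retired instruction.
# A value above 1 suggests micro-op conversion and/or speculative operations.
ops_per_inst=$(awk -v p="$sample_pop" -v r="$inst_retired" 'BEGIN { printf "%.2f", p / r }')
echo "operations per retired instruction: $ops_per_inst"
```

With these placeholder numbers the ratio is 1.50, meaning that on average each retired instruction contributes one and a half operations to the sample population.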
| 109 | + |
| 110 | +Kernel Requirements |
| 111 | +------------------- |
| 112 | + |
| 113 | +The ARM_SPE_PMU config must be set to build as either a module or statically. |
| 114 | + |
| 115 | +Depending on CPU model, the kernel may need to be booted with page table isolation disabled
| 116 | +(kpti=off). If KPTI needs to be disabled but is not, SPE will not be available and the console
| 117 | +shows the message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".
| 118 | + |
| 119 | +Capturing SPE with perf command-line tools |
| 120 | +------------------------------------------ |
| 121 | + |
| 122 | +You can record a session with SPE samples: |
| 123 | + |
| 124 | + perf record -e arm_spe// -- ./mybench |
| 125 | + |
| 126 | +The sample period is set from the -c option, and because the minimum interval is used by default |
| 127 | +it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL. |
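A minimal sketch of raising the sample period with -c, as described above. The value 10000 is an arbitrary assumption rather than a recommendation, and the command is printed instead of executed so the snippet can be tried on a machine without SPE.

```shell
# Hypothetical interval; choose a value that suits the workload.
interval=10000

# Build the command line; -c sets the sample period for the arm_spe event.
cmd="perf record -e arm_spe// -c $interval -- ./mybench"
echo "$cmd"
```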
| 128 | + |
| 129 | +Config parameters |
| 130 | +~~~~~~~~~~~~~~~~~ |
| 131 | + |
| 132 | +These are placed between the // in the event and comma separated. For example '-e |
| 133 | +arm_spe/load_filter=1,min_latency=10/' |
| 134 | + |
| 135 | + branch_filter=1 - collect branches only (PMSFCR.B) |
| 136 | + event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below |
| 137 | + jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND) |
| 138 | + load_filter=1 - collect loads only (PMSFCR.LD) |
| 139 | + min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR) |
| 140 | + pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege |
| 141 | + pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege |
| 142 | + store_filter=1 - collect stores only (PMSFCR.ST) |
| 143 | + ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS) |
| 144 | + |
| 145 | ++++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather |
| 146 | +than only the execution latency. |
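The following sketch shows how several of the parameters above could be joined into one event string. The parameter names come from the list; the chosen combination and the latency value of 50 are illustrative assumptions. The command is printed rather than run.

```shell
# Parameter names are from the list above; the values are illustrative.
params="load_filter=1 min_latency=50 jitter=1"

# Config parameters are comma separated and placed between the slashes.
config=$(echo "$params" | tr ' ' ',')
event="arm_spe/$config/"
echo "perf record -e $event -- ./mybench"
```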
| 147 | + |
| 148 | +Only some events can be filtered on; these include: |
| 149 | + |
| 150 | + bit 1 - instruction retired (i.e. omit speculative instructions) |
| 151 | + bit 3 - L1D refill |
| 152 | + bit 5 - TLB refill |
| 153 | + bit 7 - mispredict |
| 154 | + bit 11 - misaligned access |
| 155 | + |
| 156 | +So to sample just retired instructions: |
| 157 | + |
| 158 | + perf record -e arm_spe/event_filter=2/ -- ./mybench |
| 159 | + |
| 160 | +or just mispredicted branches: |
| 161 | + |
| 162 | + perf record -e arm_spe/event_filter=0x80/ -- ./mybench |
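The mask values in these examples are simply the listed bit positions turned into a bitmask. The sketch below derives a mask with two bits set. It assumes that setting several bits keeps only records in which all of the selected events occurred, which should be checked against the PMSEVFR description for the target CPU. The variable names are hypothetical and the command is printed rather than run.

```shell
# Bit positions taken from the list above.
RETIRED_BIT=1
MISPREDICT_BIT=7

# Set both bits in the mask: (1 << 1) | (1 << 7) = 0x82.
mask=$(( (1 << RETIRED_BIT) | (1 << MISPREDICT_BIT) ))
event=$(printf 'arm_spe/event_filter=0x%x/' "$mask")

# Print the resulting command instead of running it.
echo "perf record -e $event -- ./mybench"
```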
| 163 | + |
| 164 | +Viewing the data |
| 165 | +~~~~~~~~~~~~~~~~
| 166 | + |
| 167 | +By default perf report and perf script will assign samples to separate groups depending on the |
| 168 | +attributes/events of the SPE record. Because instructions can have multiple events associated with |
| 169 | +them, the samples in these groups are not necessarily unique. For example perf report shows these |
| 170 | +groups: |
| 171 | + |
| 172 | + Available samples |
| 173 | + 0 arm_spe// |
| 174 | + 0 dummy:u |
| 175 | + 21 l1d-miss |
| 176 | + 897 l1d-access |
| 177 | + 5 llc-miss |
| 178 | + 7 llc-access |
| 179 | + 2 tlb-miss |
| 180 | + 1K tlb-access |
| 181 | + 36 branch-miss |
| 182 | + 0 remote-access |
| 183 | + 900 memory |
| 184 | + |
| 185 | +The arm_spe// and dummy:u events are implementation details and are expected to be empty. |
| 186 | + |
| 187 | +To get a full list of unique samples that are not sorted into groups, set the itrace option to |
| 188 | +generate 'instruction' samples. The period option is also taken into account, so set it to 1 |
| 189 | +instruction unless you want to further downsample the already sampled SPE data: |
| 190 | + |
| 191 | + perf report --itrace=i1i |
| 192 | + |
| 193 | +Memory access details are also stored on the samples and this can be viewed with: |
| 194 | + |
| 195 | + perf report --mem-mode |
| 196 | + |
| 197 | +Common errors |
| 198 | +~~~~~~~~~~~~~ |
| 199 | + |
| 200 | + - "Cannot find PMU `arm_spe'. Missing kernel support?" |
| 201 | + |
| 202 | + The SPE module is not built or loaded, KPTI is not disabled (see above), or perf is running in a VM
| 203 | + |
| 204 | + - "Arm SPE CONTEXT packets not found in the traces." |
| 205 | + |
| 206 | + Root privilege is required to collect context packets. However, these only improve the accuracy
| 207 | + of assigning PIDs to kernel samples, so this can be ignored for userspace sampling.
| 208 | + |
| 209 | + - Excessively large perf.data file size |
| 210 | + |
| 211 | + Increase sampling interval (see above) |
| 212 | + |
| 213 | + |
| 214 | +SEE ALSO |
| 215 | +-------- |
| 216 | + |
| 217 | +linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1], |
| 218 | +linkperf:perf-inject[1] |