|
| 1 | +perf-amd-ibs(1) |
| 2 | +=============== |
| 3 | + |
| 4 | +NAME |
| 5 | +---- |
| 6 | +perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool |
| 7 | + |
| 8 | +SYNOPSIS |
| 9 | +-------- |
| 10 | +[verse] |
| 11 | +'perf record' -e ibs_op// |
| 12 | +'perf record' -e ibs_fetch// |
| 13 | + |
| 14 | +DESCRIPTION |
| 15 | +----------- |
| 16 | + |
| 17 | +Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP) |
| 18 | +profiling support on AMD platforms. IBS has two independent components: IBS |
| 19 | +Op and IBS Fetch. IBS Op sampling provides information about instruction |
| 20 | +execution (micro-op execution to be precise) with details like d-cache |
| 21 | +hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch |
| 22 | +behavior etc. IBS Fetch sampling provides information about instruction fetch |
| 23 | +with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is |
| 24 | +per-SMT-thread, i.e. each SMT hardware thread has its own standalone IBS units. |
| 25 | + |
| 26 | +Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used |
| 27 | +with the Linux perf utility. The following directories are created at boot |
| 28 | +time if IBS is supported by the hardware and kernel: |
| 29 | + |
| 30 | + /sys/bus/event_source/devices/ibs_op/ |
| 31 | + /sys/bus/event_source/devices/ibs_fetch/ |
| 32 | + |
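Before recording, it can be worth confirming that both PMUs were actually registered. The following is a minimal sketch that only tests for the two sysfs directories listed above. The function name `ibs_pmu_status` and its optional base-directory argument are hypothetical helpers introduced for illustration; they are not part of perf.

```shell
#!/bin/sh
# Report whether the IBS PMUs are registered under sysfs.
# ibs_pmu_status is a hypothetical helper, not a perf command.
# The base directory is a parameter so the check can be pointed at a
# scratch directory; it defaults to the sysfs location named in the text.
ibs_pmu_status() {
    base="${1:-/sys/bus/event_source/devices}"
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$base/$pmu" ]; then
            echo "$pmu: present"
        else
            echo "$pmu: absent"
        fi
    done
}

ibs_pmu_status
```

If either PMU is reported as absent, the corresponding `-e ibs_op//` or `-e ibs_fetch//` event should not be expected to work on that system.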
| 33 | +IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports |
| 34 | +one event: fetch ops. |
| 35 | + |
| 36 | +IBS PMUs cannot filter samples by user/kernel mode, and thus using them |
| 37 | +requires the CAP_SYS_ADMIN or CAP_PERFMON privilege. |
| 38 | + |
| 39 | +IBS VS. REGULAR CORE PMU |
| 40 | +------------------------ |
| 41 | + |
| 42 | +IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample |
| 43 | +has no skid. The IP recorded by a regular core PMU, in contrast, has some skid |
| 44 | +(the sample was generated at IP X but perf records it at IP X+n). Hence, a |
| 45 | +regular core PMU might not help for profiling with instruction-level |
| 46 | +precision. Further, IBS provides additional information about the sample in |
| 47 | +question. On the other hand, a regular core PMU has its own advantages, like |
| 48 | +a plethora of events, counting mode (less interference), up to 6 parallel |
| 49 | +counters, event grouping support, filtering capabilities etc. |
| 50 | + |
| 51 | +Three regular core PMU events are internally forwarded to the IBS Op PMU when |
| 52 | +the precise_ip attribute is set: |
| 53 | + |
| 54 | + -e cpu-cycles:p becomes -e ibs_op// |
| 55 | + -e r076:p becomes -e ibs_op// |
| 56 | + -e r0C1:p becomes -e ibs_op/cnt_ctl=1/ |
| 57 | + |
| 58 | +EXAMPLES |
| 59 | +-------- |
| 60 | + |
| 61 | +IBS Op PMU |
| 62 | +~~~~~~~~~~ |
| 63 | + |
| 64 | +System-wide profile, cycles event, sampling period: 100000 |
| 65 | + |
| 66 | + # perf record -e ibs_op// -c 100000 -a |
| 67 | + |
| 68 | +Per-cpu profile (cpu10), cycles event, sampling period: 100000 |
| 69 | + |
| 70 | + # perf record -e ibs_op// -c 100000 -C 10 |
| 71 | + |
| 72 | +Per-cpu profile (cpu10), cycles event, sampling freq: 1000 |
| 73 | + |
| 74 | + # perf record -e ibs_op// -F 1000 -C 10 |
| 75 | + |
| 76 | +System-wide profile, uOps event, sampling period: 100000 |
| 77 | + |
| 78 | + # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a |
| 79 | + |
| 80 | +Same command, but also capture IBS register raw dump along with perf sample: |
| 81 | + |
| 82 | + # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples |
| 83 | + |
| 84 | +System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward) |
| 85 | + |
| 86 | + # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a |
| 87 | + |
| 88 | +Per-process profile of a running process (upstream v6.2 onward), uOps event, sampling period: 100000 |
| 89 | + |
| 90 | + # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234 |
| 91 | + |
| 92 | +Per-process profile of a newly launched command (upstream v6.2 onward), uOps event, sampling period: 100000 |
| 93 | + |
| 94 | + # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls |
| 95 | + |
| 96 | +To analyse the recorded profile in aggregate mode |
| 97 | + |
| 98 | + # perf report |
| 99 | + /* Select a line and press 'a' to drill down at instruction level. */ |
| 100 | + |
| 101 | +To go over each sample |
| 102 | + |
| 103 | + # perf script |
| 104 | + |
| 105 | +Raw dump of IBS registers when profiled with --raw-samples |
| 106 | + |
| 107 | + # perf report -D |
| 108 | + /* Look for PERF_RECORD_SAMPLE */ |
| 109 | + |
| 110 | + Example register raw dump: |
| 111 | + |
| 112 | + ibs_op_ctl: 000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1 |
| 113 | + Val 1 CntCtl 0=cycles CurCnt 707 |
| 114 | + IbsOpRip: ffffffff8204aea7 |
| 115 | + ibs_op_data: 0000010002550001 CompToRetCtr 1 TagToRetCtr 597 |
| 116 | + BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1 |
| 117 | + ibs_op_data2: 0000000000000013 RmtNode 1 DataSrc 3=DRAM |
| 118 | + ibs_op_data3: 0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0 |
| 119 | + DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0 |
| 120 | + DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0 |
| 121 | + DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1 |
| 122 | + DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes |
| 123 | + OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0 |
| 124 | + IbsDCLinAd: ff110008a5398920 |
| 125 | + IbsDCPhysAd: 00000008a5398920 |
| 126 | + |
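The annotations on the `ibs_op_ctl` line of the dump can be reproduced from the raw hex value alone. The sketch below is illustrative only: the bit positions it uses (MaxCnt in bits 15:0 holding the count shifted right by 4 with an extension in bits 26:20, L3MissOnly at bit 16, En at bit 17, Val at bit 18, CntCtl at bit 19, CurCnt in bits 58:32) are assumptions that are not stated in this document, although they are consistent with the example value shown above. The function name `decode_ibs_op_ctl` is hypothetical.

```shell
#!/bin/sh
# Decode a few fields of a raw ibs_op_ctl value.
# Argument: 64-bit register value in hex, without a 0x prefix.
# decode_ibs_op_ctl is a hypothetical helper; the bit layout below is an
# assumption checked only against the example dump in this document.
decode_ibs_op_ctl() {
    v=$((0x$1))
    # MaxCnt: bits 15:0 hold count bits 19:4, bits 26:20 hold count bits 26:20.
    max_cnt=$(( ((v & 0xffff) << 4) | (v & 0x7f00000) ))
    l3missonly=$(( (v >> 16) & 1 ))
    en=$(( (v >> 17) & 1 ))
    val=$(( (v >> 18) & 1 ))
    cnt_ctl=$(( (v >> 19) & 1 ))
    # CurCnt: 27-bit current counter value in bits 58:32.
    cur_cnt=$(( (v >> 32) & 0x7ffffff ))
    echo "MaxCnt $max_cnt L3MissOnly $l3missonly En $en Val $val CntCtl $cnt_ctl CurCnt $cur_cnt"
}

decode_ibs_op_ctl 000002c30006186a
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

The decoded MaxCnt of 100000 matches the `-c 100000` sampling period used when recording, and CntCtl 0 corresponds to the cycles event.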
| 127 | +IBS applied in a real-world use case |
| 128 | + |
| 129 | + A ~90% regression was observed in tbench with a specific scheduler hint, |
| 130 | + which was counterintuitive. IBS profiles of the good and bad runs, captured |
| 131 | + using perf, helped in identifying the exact cause of the problem: |
| 132 | + |
| 133 | + https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com |
| 134 | + |
| 135 | +IBS Fetch PMU |
| 136 | +~~~~~~~~~~~~~ |
| 137 | + |
| 138 | +Similar commands can be used with Fetch PMU as well. |
| 139 | + |
| 140 | +System-wide profile, fetch ops event, sampling period: 100000 |
| 141 | + |
| 142 | + # perf record -e ibs_fetch// -c 100000 -a |
| 143 | + |
| 144 | +System-wide profile, fetch ops event, sampling period: 100000, Random enable |
| 145 | + |
| 146 | + # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a |
| 147 | + |
| 148 | + Random enable adds a small degree of variability to the sample period. This |
| 149 | + helps in cases like long-running loops where the PMU keeps tagging the same |
| 150 | + instruction over and over because of a fixed sample period. |
| 151 | + |
| 152 | +etc. |
| 153 | + |
| 154 | +PERF MEM AND PERF C2C |
| 155 | +--------------------- |
| 156 | + |
| 157 | +perf mem is a memory access profiler tool and perf c2c is a shared data |
| 158 | +cacheline analyser tool. Both of them internally use the IBS Op PMU on AMD. |
| 159 | +Below is a simple example of the perf mem tool. |
| 160 | + |
| 161 | + # perf mem record -c 100000 -- make |
| 162 | + # perf mem report |
| 163 | + |
| 164 | +A normal perf mem report output provides a detailed memory access profile. |
| 165 | +However, it can also be aggregated based on output fields. For example: |
| 166 | + |
| 167 | + # perf mem report -F mem,sample,snoop |
| 168 | + Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876 |
| 169 | + Memory access Samples Snoop |
| 170 | + N/A 1903343 N/A |
| 171 | + L1 hit 1056754 N/A |
| 172 | + L2 hit 75231 N/A |
| 173 | + L3 hit 9496 HitM |
| 174 | + L3 hit 2270 N/A |
| 175 | + RAM hit 8710 N/A |
| 176 | + Remote node, same socket RAM hit 3241 N/A |
| 177 | + Remote core, same node Any cache hit 1572 HitM |
| 178 | + Remote core, same node Any cache hit 514 N/A |
| 179 | + Remote node, same socket Any cache hit 1216 HitM |
| 180 | + Remote node, same socket Any cache hit 350 N/A |
| 181 | + Uncached hit 18 N/A |
| 182 | + |
| 183 | +Please refer to their man pages for more details. |
| 184 | + |
| 185 | +SEE ALSO |
| 186 | +-------- |
| 187 | + |
| 188 | +linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1], |
| 189 | +linkperf:perf-mem[1], linkperf:perf-c2c[1] |