# Disk I/O Performance Graphing for MicroShift CI

Tool for visualizing disk I/O activity from MicroShift CI job runs, primarily
aimed at investigating etcd performance issues.

## Background

MicroShift uses an embedded etcd instance that is highly sensitive to disk I/O
latency. Symptoms such as slow leader elections, "apply request took too long"
warnings, or WAL sync delays often correlate with disk I/O spikes on the CI
node. CI jobs collect [Performance Co-Pilot (PCP)](https://pcp.io/documentation.html)
archives via the `pcp-zeroconf` package throughout the test run, capturing
system-wide performance metrics at high resolution.

This tool processes those PCP archives and produces a time-series graph of
**Disk Read OPS**, **Disk Write OPS**, and **Disk Await** (`disk.dev.await`)
at 15-second intervals, making it straightforward to correlate etcd issues
with underlying disk activity.
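
Internally the extraction relies on PCP's standard export tooling. A roughly
equivalent manual invocation looks like the sketch below; it assumes the `pcp`
tools are installed locally and that `<archive>` is the basename of a
downloaded archive, and the exact flags used by `extract_io.sh` may differ:

```bash
# Export disk read/write operation rates and per-device await at a
# 15-second sampling interval from a PCP archive.
pcp2json -a <archive> -t 15s \
    disk.dev.read disk.dev.write disk.dev.await
```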

**Disk Await** is the average time (in milliseconds) that I/O requests spend
waiting to be serviced by the device, including queue time and actual service
time. It is the single most useful metric for diagnosing etcd I/O problems
because etcd requires low-latency `fdatasync` calls on its WAL and snap files.
When await rises above ~10 ms, etcd heartbeats can be missed and leader
elections may be triggered. The tool reports the **max await across all block
devices** at each sample point so that the worst-case device is always visible.
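
To see which device is driving that maximum, the raw per-device values can be
inspected straight from an archive with the stock PCP reporting tools. This is
optional and assumes the `pcp` package is installed locally (the tool itself
only needs `podman`):

```bash
# Report per-device await from the archive at the same 15-second cadence;
# the graph plots the maximum across these columns at each sample.
pmrep -a <archive> -t 15s disk.dev.await
```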

## Obtaining PCP Data from CI

PCP archives are stored with the CI artifacts under:

```text
artifacts/<test_name>/openshift-microshift-infra-pmlogs/artifacts/<ci_hostname>/
```

Download them using `gsutil`:

```bash
mkdir -p ~/pmlogs && cd ~/pmlogs
python3 -m venv .
./bin/python3 -m pip install gsutil
./bin/gsutil -m cp -r gs://<path> ~/pmlogs/
```

The directory should contain files like `yyyymmdd.hh.mm.{0,index,meta}` and a
`Latest` folio file.
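
As a quick sanity check before running the tool, the archive label (capture
host, start time, and timezone) can be printed with `pmdumplog`. Like the
other manual PCP commands above, this is optional and assumes the `pcp`
package is installed locally; `<archive>` is the basename shared by the `.0`,
`.index`, and `.meta` files:

```bash
# Dump the archive label in verbose form to confirm host and time window
pmdumplog -L <archive>
```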

## Prerequisites

- `podman` (used to build and run the analysis container)

No other local dependencies are required. The container image includes PCP
tools, Python 3, and matplotlib.

## Usage

```bash
./run_io_graph.sh [--timezone TZ] [--output-dir DIR] [pcp-data-dir]
```

| Option | Description | Default |
|---|---|---|
| `--timezone TZ` | IANA timezone for timestamps | `UTC` |
| `--output-dir DIR` | Directory for output files | Script directory |
| `pcp-data-dir` | Directory with PCP archive files | Auto-detected |

### Examples

```bash
# Auto-detect PCP data, default timezone (UTC)
./run_io_graph.sh

# Specify data directory and timezone
./run_io_graph.sh --timezone US/Eastern ./path/to/pcp-data

# Custom output directory
./run_io_graph.sh --output-dir /tmp/results --timezone UTC ./path/to/pcp-data
```

## Output

| File | Description |
|---|---|
| `io_data.json` | Extracted data with arrays: `timestamps`, `bi` (read ops/s), `bo` (write ops/s), `await` (disk await, ms) |
| `disk_io_performance.png` | Time-series chart with Read OPS (blue), Write OPS (red), and Disk Await (green, dashed, right Y-axis) |
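
For a quick look at the extracted data without opening the chart, the JSON can
be queried directly. The sketch below assumes the four arrays are parallel
(one entry per sample, as the field list above suggests) and prints the sample
with the highest await:

```bash
# Find the worst await sample in io_data.json (field names as listed above)
python3 -c '
import json
d = json.load(open("io_data.json"))
i = max(range(len(d["await"])), key=lambda k: d["await"][k])
print(d["timestamps"][i], "await:", d["await"][i], "ms")
'
```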

### Sample Graph

![Disk I/O Performance Graph](docs/sample_graph.png)

## How to Read the Graph

When investigating etcd performance problems, look for:

- **Disk Await spikes** (green, dashed, right Y-axis). These are the primary
  indicator of I/O latency problems. etcd requires `fdatasync` to complete
  within its heartbeat interval (default 100 ms). Await values above ~10 ms
  indicate the disk is under pressure; sustained values above ~50 ms almost
  always correlate with etcd warnings such as "slow fdatasync", "apply request
  took too long", or missed heartbeats leading to leader elections.
- **Write spikes** (red) coinciding with etcd "slow fdatasync" or WAL warnings
  in MicroShift journal logs. Sustained write activity above the baseline
  suggests I/O contention from concurrent workloads or test operations.
- **Read spikes** (blue) during test setup or image pulls that may starve etcd
  of I/O bandwidth.
- **Correlation with timestamps** from etcd log entries. Convert etcd log
  timestamps to the timezone used in the graph to align events; see the
  example below.
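
For example, to render a UTC timestamp copied from an etcd log line in the
timezone used for the graph (the timestamp here is just a placeholder):

```bash
# GNU date: convert a UTC etcd log timestamp into the graph's timezone
TZ=US/Eastern date -d "2025-01-15T14:23:05Z"
```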

## Files

| File | Purpose |
|---|---|
| `run_io_graph.sh` | Orchestrator: builds container, runs extraction and plotting |
| `Dockerfile` | Container image with PCP tools, Python 3, and matplotlib |
| `extract_io.sh` | Runs `pcp2json` on PCP archives, extracts all metrics in one pass |
| `parse_pcp.py` | Parses pcp2json output, aggregates per-device instances (sum read/write, max await) |
| `plot_io.py` | Generates the PNG chart from JSON data using matplotlib |