Skip to content

Commit d81e060

Browse files
Merge pull request #6483 from eslutsky/claude-tool-etcd-graph
USHIFT-6793: introduce io-performance-analyzer pcp script
2 parents 0d63bcb + 1a5fb29 commit d81e060

8 files changed

Lines changed: 3594 additions & 0 deletions

File tree

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
FROM registry.fedoraproject.org/fedora:43
2+
3+
RUN dnf install -y pcp pcp-system-tools pcp-export-pcp2json python3 python3-pip && \
4+
dnf clean all
5+
6+
RUN pip3 install matplotlib gsutil
7+
8+
WORKDIR /data
9+
10+
ENTRYPOINT ["/bin/bash"]
Lines changed: 117 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,117 @@
1+
# Disk I/O Performance Graphing for MicroShift CI
2+
3+
Tool for visualizing disk I/O activity from MicroShift CI job runs, primarily
4+
aimed at investigating etcd performance issues.
5+
6+
## Background
7+
8+
MicroShift uses an embedded etcd instance that is highly sensitive to disk I/O
9+
latency. Symptoms such as slow leader elections, "apply request took too long"
10+
warnings, or WAL sync delays often correlate with disk I/O spikes on the CI
11+
node. CI jobs collect [Performance Co-Pilot (PCP)](https://pcp.io/documentation.html)
12+
archives via the `pcp-zeroconf` package throughout the test run, capturing
13+
system-wide performance metrics at high resolution.
14+
15+
This tool processes those PCP archives and produces a time-series graph of
16+
**Disk Read OPS**, **Disk Write OPS**, and **Disk Await** (`disk.dev.await`)
17+
at 15-second intervals, making it straightforward to correlate etcd issues
18+
with underlying disk activity.
19+
20+
**Disk Await** is the average time (in milliseconds) that I/O requests spend
21+
waiting to be serviced by the device, including queue time and actual service
22+
time. It is the single most useful metric for diagnosing etcd I/O problems
23+
because etcd requires low-latency `fdatasync` calls on its WAL and snap files.
24+
When await rises above ~10 ms, etcd heartbeats can be missed and leader
25+
elections may be triggered. The tool reports the **max await across all block
26+
devices** at each sample point so that the worst-case device is always visible.
27+
28+
## Obtaining PCP Data from CI
29+
30+
PCP archives are stored with the CI artifacts under:
31+
32+
```text
33+
artifacts/<test_name>/openshift-microshift-infra-pmlogs/artifacts/<ci_hostname>/
34+
```
35+
36+
Download them using `gsutil`:
37+
38+
```bash
39+
mkdir -p ~/pmlogs && cd ~/pmlogs
40+
python3 -m venv .
41+
./bin/python3 -m pip install gsutil
42+
./bin/gsutil -m cp -r gs://<path> ~/pmlogs/
43+
```
44+
45+
The directory should contain files like `yyyymmdd.hh.mm.{0,index,meta}` and a
46+
`Latest` folio file.
47+
48+
## Prerequisites
49+
50+
- `podman` (used to build and run the analysis container)
51+
52+
No other local dependencies are required. The container image includes PCP
53+
tools, Python 3, and matplotlib.
54+
55+
## Usage
56+
57+
```bash
58+
./run_io_graph.sh [--timezone TZ] [--output-dir DIR] [pcp-data-dir]
59+
```
60+
61+
| Option | Description | Default |
62+
|---|---|---|
63+
| `--timezone TZ` | IANA timezone for timestamps | `UTC` |
64+
| `--output-dir DIR` | Directory for output files | Script directory |
65+
| `pcp-data-dir` | Directory with PCP archive files | Auto-detected |
66+
67+
### Examples
68+
69+
```bash
70+
# Auto-detect PCP data, default timezone (UTC)
71+
./run_io_graph.sh
72+
73+
# Specify data directory and timezone
74+
./run_io_graph.sh --timezone US/Eastern ./path/to/pcp-data
75+
76+
# Custom output directory
77+
./run_io_graph.sh --output-dir /tmp/results --timezone UTC ./path/to/pcp-data
78+
```
79+
80+
## Output
81+
82+
| File | Description |
83+
|---|---|
84+
| `io_data.json` | Extracted data with arrays: `timestamps`, `bi` (reads/s), `bo` (writes/s), `await` (disk await ms) |
85+
| `disk_io_performance.png` | Time-series chart with Read OPS (blue), Write OPS (red), and Disk Await (green, dashed, right Y-axis) |
86+
87+
### Sample Graph
88+
89+
![Disk I/O Performance](disk_io_performance.png)
90+
91+
## How to Read the Graph
92+
93+
When investigating etcd performance problems, look for:
94+
95+
- **Disk Await spikes** (green, dashed, right Y-axis) are the primary
96+
indicator of I/O latency problems. etcd requires `fdatasync` to complete
97+
within its heartbeat interval (default 100 ms). Await values above ~10 ms
98+
indicate the disk is under pressure; sustained values above ~50 ms almost
99+
always correlate with etcd warnings such as "slow fdatasync", "apply request
100+
took too long", or missed heartbeats leading to leader elections.
101+
- **Write spikes** (red) coinciding with etcd "slow fdatasync" or WAL warnings
102+
in MicroShift journal logs. Sustained write activity above the baseline
103+
suggests I/O contention from concurrent workloads or test operations.
104+
- **Read spikes** (blue) during test setup or image pulls that may starve etcd
105+
of I/O bandwidth.
106+
- **Correlation with timestamps** from etcd log entries. Convert etcd log
107+
timestamps to the timezone used in the graph to align events.
108+
109+
## Files
110+
111+
| File | Purpose |
112+
|---|---|
113+
| `run_io_graph.sh` | Orchestrator: builds container, runs extraction and plotting |
114+
| `Dockerfile` | Container image with PCP tools, Python 3, and matplotlib |
115+
| `extract_io.sh` | Runs `pcp2json` on PCP archives, extracts all metrics in one pass |
116+
| `parse_pcp.py` | Parses pcp2json output, aggregates per-device instances (sum read/write, max await) |
117+
| `plot_io.py` | Generates the PNG chart from JSON data using matplotlib |
386 KB
Loading
Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
#!/bin/bash
2+
# Extract disk IO data from PCP archive using pcp2json with 15-second intervals.
3+
# Outputs JSON with arrays: timestamps, bi (reads/s), bo (writes/s), await (disk await ms)
4+
#
5+
# Usage: ./extract_io.sh <pcp-archive-dir> [output-json] [timezone]
6+
7+
set -euo pipefail
8+
9+
DATA_DIR="${1:?Usage: $0 <pcp-archive-dir> [output-json] [timezone]}"
10+
OUTPUT="${2:-/data/io_data.json}"
11+
TIMEZONE="${3:-UTC}"
12+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
13+
14+
echo "Using archive directory: ${DATA_DIR}"
15+
16+
# Extract all metrics in a single pcp2json call
17+
TMPFILE=$(mktemp)
18+
trap 'rm -f "${TMPFILE}"' EXIT
19+
20+
(cd "${DATA_DIR}" && pcp2json -a . -t 15sec \
21+
disk.dev.read disk.dev.write disk.dev.await) \
22+
> "${TMPFILE}" 2>/dev/null || true
23+
24+
# Parse pcp2json output into clean plot-ready JSON
25+
python3 "${SCRIPT_DIR}/parse_pcp.py" --timezone "${TIMEZONE}" \
26+
"${TMPFILE}" "${OUTPUT}"
27+
28+
echo "Wrote $(python3 -c "import json; d=json.load(open('${OUTPUT}')); print(len(d['timestamps']))" 2>/dev/null || echo 0) data points to ${OUTPUT}"

0 commit comments

Comments
 (0)