tests/slo_workloads: align with ydb-slo-action v2#609
Open
polRk wants to merge 11 commits into
Conversation
The v2 SLO action (ydb-platform/ydb-slo-action) starts a workload
container once per run, passes config via env vars (WORKLOAD_REF,
WORKLOAD_DURATION, YDB_CONNECTION_STRING, ...), and queries
sdk_operation_latency_p{50,95,99}_seconds as pre-computed gauges
labeled by (operation_type, operation_status, ref).
Changes:
- Dockerfile: drop SLO_BRANCH_REF compile-time ref baking; a single
ENTRYPOINT binary without CMD is now sufficient.
- utils/utils.cpp: subcommand is optional — with no free-arg the
binary runs create -> run -> cleanup sequentially (cleanup always
runs, even if run errored). Option resolution follows the standard
CLI > env > default priority via .DefaultValue():
- -c from YDB_CONNECTION_STRING (or built from YDB_ENDPOINT +
YDB_DATABASE);
- --metrics-push-url from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
- --time from WORKLOAD_DURATION.
- utils/metrics.cpp: replace the OTel Histogram with HDR-backed
gauges. Only successful operations are recorded; a background
thread snapshots p50/p95/p99 every second, publishes the gauges
with operation_status="success", then resets the HDR window.
sdk_retry_attempts_total becomes a counter of (retry_attempts + 1)
per op. ref is read from WORKLOAD_REF at startup, not compile time.
- utils/CMakeLists.txt: pull HdrHistogram_c 0.11.8 via FetchContent
(opt-in via the outer YDB_SDK_TESTS flag) and link slo-utils
against hdr_histogram_static.
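The FetchContent wiring for HdrHistogram_c might look roughly like the fragment below. This is a sketch only: the repository URL and the `HDR_HISTOGRAM_BUILD_PROGRAMS` option name follow the upstream HdrHistogram_c project's conventions, but the PR's actual invocation may differ.

```cmake
include(FetchContent)

# Sketch: pull HdrHistogram_c 0.11.8 and link the static target.
FetchContent_Declare(
  hdr_histogram
  GIT_REPOSITORY https://github.com/HdrHistogram/HdrHistogram_c.git
  GIT_TAG        0.11.8
)
# Assumed upstream option; skips building the example programs.
set(HDR_HISTOGRAM_BUILD_PROGRAMS OFF CACHE BOOL "" FORCE)
FetchContent_MakeAvailable(hdr_histogram)

target_link_libraries(slo-utils PRIVATE hdr_histogram_static)
```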
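The CLI > env > default resolution described above can be sketched as a pair of helpers. This is a minimal illustration, not the PR's actual code: `GetEnvOr` and `DefaultConnectionString` are hypothetical names, and the local fallback connection string is an assumed placeholder.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper illustrating the env > default fallback used when
// computing an option's .DefaultValue(); the CLI layer still wins because
// an explicitly passed flag overrides the default.
static std::string GetEnvOr(const char* name, const std::string& fallback) {
    const char* value = std::getenv(name);
    return (value && *value) ? std::string(value) : fallback;
}

// Connection string: prefer YDB_CONNECTION_STRING, else compose it from
// YDB_ENDPOINT + YDB_DATABASE, else fall back to a local default
// (the final fallback value here is illustrative).
static std::string DefaultConnectionString() {
    std::string direct = GetEnvOr("YDB_CONNECTION_STRING", "");
    if (!direct.empty()) {
        return direct;
    }
    std::string endpoint = GetEnvOr("YDB_ENDPOINT", "");
    std::string database = GetEnvOr("YDB_DATABASE", "");
    if (!endpoint.empty() && !database.empty()) {
        return endpoint + "/?database=" + database;
    }
    return "grpc://localhost:2136/?database=/local";
}
```

The same pattern applies to `--metrics-push-url` (from `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT`) and `--time` (from `WORKLOAD_DURATION`).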
Pull request overview
This PR updates the C++ SLO workload (tests/slo_workloads/) to match the v2 ydb-slo-action contract: configuration via environment variables, a single-container lifecycle (optionally running create→run→cleanup in one process), and publishing precomputed latency percentile gauges with the expected labels.
Changes:
- Make the subcommand optional; when omitted, run create → run → cleanup sequentially in one process, ensuring cleanup always runs.
- Switch option resolution to CLI > env > default, including env-derived defaults for connection string, metrics endpoint, and workload duration.
- Replace the exported latency histogram with HDR-backed p50/p95/p99 gauge metrics and adjust retry-attempts reporting; add HdrHistogram_c via FetchContent; simplify the Docker build.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/slo_workloads/utils/utils.h | Adds All command mode to support full lifecycle execution with no subcommand. |
| tests/slo_workloads/utils/utils.cpp | Implements env-derived defaults, allows zero free args, and runs create→run→cleanup in All mode. |
| tests/slo_workloads/utils/metrics.cpp | Reworks OTLP metrics to export percentile gauges from an HDR window and updates counters/labels. |
| tests/slo_workloads/utils/CMakeLists.txt | Fetches and links HdrHistogram_c for the SLO utils library. |
| tests/slo_workloads/Dockerfile | Removes build-time REF arg and builds the workload with a single ENTRYPOINT. |
Comment on lines +192 to +203:

```cpp
void PublishPercentiles() {
    auto snapshot = Latency_.SnapshotAndReset();
    if (!snapshot.HasData) {
        return;
    }
    auto attrs = MergeAttributes({
        {"operation_type", OperationType_},
        {"operation_status", "success"},
    });
    LatencyP50_->Record(snapshot.P50, attrs);
    LatencyP95_->Record(snapshot.P95, attrs);
    LatencyP99_->Record(snapshot.P99, attrs);
```
Comment on lines 161 to 168:

```diff
 void InitMetrics() {
-    ErrorsTotal_ = Meter_->CreateUInt64Counter("sdk_errors_total",
-        "Total number of errors encountered, categorized by error type."
-    );
+    OperationsTotal_ = Meter_->CreateDoubleCounter("sdk_operations_total",
+        "Total number of operations, categorized by operation type and status."
+    );
-    OperationsTotal_ = Meter_->CreateUInt64Counter("sdk_operations_total",
-        "Total number of operations, categorized by type attempted by the SDK."
-    );
-    OperationsSuccessTotal_ = Meter_->CreateUInt64Counter("sdk_operations_success_total",
-        "Total number of successful operations, categorized by type."
-    );
+    RetryAttemptsTotal_ = Meter_->CreateDoubleCounter("sdk_retry_attempts_total",
+        "Total number of retry attempts (including the first attempt), categorized by operation type."
+    );
```
Rewrite the SLO workflows following the pattern used by ydb-java-sdk (ydb-platform/ydb-java-sdk#644).

slo.yml:
- Drop the hand-rolled docker run orchestration that spun up YDB and invoked the workload with --dont-push / explicit create/run phases. The v2 action owns that lifecycle via deploy/compose.yml — we just hand it two prebuilt images.
- Gate on the `SLO` PR label.
- Build both images with a single `docker build` per ref; if the baseline commit can't be built (missing Dockerfile or compile error on a historical SHA), fall back to the current image so the run is comparable against itself rather than silently failing.
- Drop `--build-arg REF=…` — ref is now read from the WORKLOAD_REF env var at runtime.
- Rename the matrix entry to `cpp-key-value` to match the built binary and collapse the per-compiler matrix to a single clang entry (the gcc variant can be added back when needed).

slo_report.yml:
- Pin to @v2.
- Add a second job that removes the `SLO` label from the PR after the report is published, matching js-sdk/java-sdk/go-sdk.
- Counters: switch OperationsTotal/RetryAttemptsTotal from DoubleCounter to UInt64Counter. Operation counts and retry attempts are inherently integer; float representation could drift for very high cumulative values and doesn't match the Prometheus counter convention.
- Percentile gauges: publish 0.0 on empty-window intervals instead of returning early. OTel's sync Gauge holds the last Record() value and the periodic exporter re-emits it every collection cycle — without an explicit reset, gauges would look "stuck" at the last non-empty value after load stops, contradicting the per-second HDR reset semantics.
Shared CI runners (and local docker builds) periodically hit connection timeouts on ppa.launchpadcontent.net. Telling apt itself to retry via Acquire::Retries=5 plus a 60 s connect timeout handles the blip in-place without a shell retry loop.
Two separate issues were making every SLO PR a coin flip:
1. Any transient network error (GitHub release download, PPA timeout,
abseil/protobuf/grpc tarball) kills the entire 30-min build. Each
of the 8 wget calls and both apt-get steps is now retried:
- wget gets --tries=5 --waitretry=15 --timeout=60
--retry-connrefused --retry-on-http-error=500,502,503,504
via a shared WGET_OPTS env var.
- apt already has Acquire::Retries=5.
2. Every CI run does a full cold build. The Dockerfile's toolchain and
dep layers (~25 min) never change between SDK commits, so caching
them is trivially safe.
- Switch slo.yml to docker/build-push-action@v6 with GitHub Actions (type=gha)
cache export/import, scoped per preset.
- First build on a runner still takes ~30 min; subsequent builds
where only the SDK source changed should take ~3 min.
- Baseline build uses continue-on-error + an explicit fallback
step that retags ydb-app-current as ydb-app-baseline when the
historical commit won't compile — replaces the inline shell
if-else that docker/build-push-action can't express.
The earlier Acquire::Retries=5 tweak covered HTTP-level errors but didn't handle TCP connect timeouts to ppa.launchpadcontent.net, which is the actual failure mode observed on shared CI runners. Wrap both the add-apt-repository step and the main apt-get install in shell retry loops (5 attempts, 15/30/45/60/75 s backoff) so a CDN blip no longer kills the 30-min build.
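The retry loop described above can be sketched as a small POSIX-shell helper. The function name and structure are illustrative, not the workflow's exact code; the backoff matches the 15/30/45/60/75 s pattern.

```shell
#!/bin/sh
# Generic retry-with-linear-backoff helper mirroring the 5-attempt pattern
# used for add-apt-repository and apt-get install (name is illustrative).
retry() {
    attempts=$1; shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        echo "attempt $i failed; sleeping $((i * 15))s" >&2
        sleep $((i * 15))
        i=$((i + 1))
    done
    return 1
}
```

A CI step would then wrap the flaky command, e.g. `retry 5 apt-get -y install <packages>`.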
The v2 SLO action invokes the workload as `slo-key-value <run-args>` without a subcommand keyword. The global parser used to error on unknown long options like --read-rps; now it tolerates them and the implicit All branch passes the leftover argv to the run phase.
The baseline image used to be built from the merge-base checkout's tests/slo_workloads/, so any harness change on the PR (Dockerfile, CLI parser, …) was absent from the baseline. SDK comparison should only vary the library, not the harness — so reuse current's harness on both sides.
cmake configure/build now run under ccache as the C/C++ compiler launcher, with /root/.ccache exposed as a BuildKit cache mount so state persists across runs through cache-to=type=gha,mode=max (BuildKit ≥0.13 exports cache mounts under mode=max). Cold runs incur the usual full compile; warm runs reuse object hashes and should drop cmake --build from ~14 min to a few minutes.
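A sketch of that wiring in the Dockerfile might look like the fragment below. The CMake launcher flags and the BuildKit cache-mount syntax are standard, but the paths and build layout here are assumptions, not the PR's exact contents.

```dockerfile
# Sketch: persist ccache state across builds via a BuildKit cache mount,
# and route all C/C++ compiles through ccache.
RUN --mount=type=cache,target=/root/.ccache \
    cmake -S . -B build \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
 && cmake --build build -j
```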
…ring lacks ?database=
The v2 SLO action provides YDB_CONNECTION_STRING in path form
(grpc://host:port/Root/testdb), which GetDatabase can't parse, so
prefix stayed empty and create issued CreateTable("key_value")
without the database root → BAD_REQUEST. Falling back to the
YDB_DATABASE env var (which the action also sets) restores the
correct table path.
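The fallback logic can be illustrated with a small sketch. `DatabaseFromConnectionString` is a hypothetical stand-in, not the SDK's `GetDatabase`: it extracts the `?database=` query parameter when present and otherwise falls back to the `YDB_DATABASE` env var, as the fix describes.

```cpp
#include <cstdlib>
#include <string>

// Illustrative re-implementation of the fallback: path-form connection
// strings (grpc://host:port/Root/testdb) carry no "?database=" parameter,
// so the database is taken from YDB_DATABASE instead.
static std::string DatabaseFromConnectionString(const std::string& conn) {
    const std::string key = "?database=";
    auto pos = conn.find(key);
    if (pos != std::string::npos) {
        return conn.substr(pos + key.size());
    }
    const char* env = std::getenv("YDB_DATABASE");
    return env ? std::string(env) : std::string();
}
```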
…he-dance

cache-to: type=gha,mode=max exports layer cache but not the contents of --mount=type=cache, so the ccache mount started empty on every run (0% hit rate observed). Restore /root/.ccache from a host dir backed by actions/cache, inject it into BuildKit before each build via the cache-dance action, and extract the updated state for the next run. The same scope is shared by current and baseline since the SDK code overlap dominates ccache hit potential.
Gazizonoki
reviewed
May 8, 2026
```cpp
// YDB SLO action sets YDB_CONNECTION_STRING in path form
// (grpc://host:port/Root/testdb), which GetDatabase can't parse.
// Fall back to YDB_DATABASE which the action sets alongside it.
prefix = GetEnv("YDB_DATABASE");
```
Collaborator
The database can be extracted from the driver.
Comment on lines +252 to +257:

```cpp
char programName[] = "slo";
std::vector<char*> runArgv;
runArgv.reserve(argc + 1);
runArgv.push_back(programName);
for (int i = 0; i < argc; ++i) {
    runArgv.push_back(argv[i]);
```
```shell
        break; \
    echo "apt-get install attempt $i failed; sleeping $((i * 15))s"; \
    sleep $((i * 15)); \
    apt-get -y update || true; \
```
Comment on lines +25 to +27:

```yaml
remove-slo-label:
  if: github.event.workflow_run.event == 'pull_request'
  runs-on: ubuntu-latest
```
Comment on lines +198 to +205:

```cpp
// When no successful ops landed in the last second, publish 0.0
// for all percentiles so the gauges reset with the HDR window
// rather than appearing "stuck" at the last non-empty value (the
// OTel periodic exporter would otherwise re-emit the previous
// Record() value on every collection cycle).
LatencyP50_->Record(snapshot.HasData ? snapshot.P50 : 0.0, attrs);
LatencyP95_->Record(snapshot.HasData ? snapshot.P95 : 0.0, attrs);
LatencyP99_->Record(snapshot.HasData ? snapshot.P99 : 0.0, attrs);
```
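The snapshot-and-reset semantics behind these gauges can be sketched without HdrHistogram_c. The class below is a simplified stand-in (a sorted vector instead of an HDR histogram, and illustrative names): `Record` accumulates successful-op latencies, and `SnapshotAndReset` computes p50/p95/p99 over the current window and clears it, so each published value reflects only the last interval.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

// Simplified stand-in for the HDR-backed latency window (sketch only).
class LatencyWindow {
public:
    void Record(double seconds) {
        std::lock_guard<std::mutex> lock(Mutex_);
        Samples_.push_back(seconds);
    }

    struct Snapshot {
        bool HasData = false;
        double P50 = 0.0, P95 = 0.0, P99 = 0.0;
    };

    // Take the window's percentiles, then reset it; an empty window yields
    // HasData=false so the caller can publish 0.0 instead of stale values.
    Snapshot SnapshotAndReset() {
        std::vector<double> samples;
        {
            std::lock_guard<std::mutex> lock(Mutex_);
            samples.swap(Samples_);
        }
        Snapshot s;
        if (samples.empty()) {
            return s;
        }
        std::sort(samples.begin(), samples.end());
        auto at = [&](double q) {
            std::size_t idx = static_cast<std::size_t>(q * (samples.size() - 1));
            return samples[idx];
        };
        s.HasData = true;
        s.P50 = at(0.50);
        s.P95 = at(0.95);
        s.P99 = at(0.99);
        return s;
    }

private:
    std::mutex Mutex_;
    std::vector<double> Samples_;
};
```

A real HDR histogram trades this exact-but-unbounded storage for fixed memory at a configured value precision, which is why the PR uses it for a long-running workload.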
Summary

The v2 ydb-platform/ydb-slo-action starts a workload container once per run, passes config exclusively via env vars (`WORKLOAD_REF`, `WORKLOAD_DURATION`, `YDB_CONNECTION_STRING`, `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT`, …), and queries `sdk_operation_latency_p{50,95,99}_seconds` as pre-computed gauges labeled by `(operation_type, operation_status, ref)`. The existing C++ workload does not match: it requires a subcommand + a long list of CLI flags, bakes `ref` at compile time via `SLO_BRANCH_REF`, and publishes an OTel histogram that the action no longer reads. The old CI workflow also hand-rolled the `docker run` orchestration that `ydb-slo-action/init@v2` now owns via `deploy/compose.yml`.

This PR realigns `tests/slo_workloads/` and the SLO CI workflows to the v2 contract with the minimum possible code churn, following the pattern used by the Java SDK (ydb-platform/ydb-java-sdk#644).

Workload changes (`tests/slo_workloads/`)
- Dockerfile — drop the `SLO_BRANCH_REF` build arg and the `cmake -DSLO_BRANCH_REF=…` step. Single `ENTRYPOINT`, no `CMD`.
- `utils/utils.{h,cpp}` — the subcommand becomes optional. With no free-arg, the binary executes create → run → cleanup sequentially in one process (cleanup always runs, even if run errored) — the same single-container lifecycle used by `ydb-js-sdk/tests/slo`. Option resolution follows CLI > env > default via `.DefaultValue()`:
  - `-c` from `YDB_CONNECTION_STRING` (or composed from `YDB_ENDPOINT` + `YDB_DATABASE`);
  - `--metrics-push-url` from `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT`;
  - `--time` from `WORKLOAD_DURATION`.
- `utils/metrics.cpp` — replace the OTel `Histogram<double>` with HDR-backed gauges. Only successful operations are recorded (errors are excluded from percentile stats, matching `operation_status="success"` in the action's metrics query). A background thread snapshots p50/p95/p99 every second, publishes the three `sdk_operation_latency_p*_seconds` gauges, then resets the HDR window — each gauge value reflects only the last interval. `sdk_retry_attempts_total` becomes a counter of `retry_attempts + 1` per op. `ref` is read from `WORKLOAD_REF` at startup, not compile time.
- `utils/CMakeLists.txt` — add HdrHistogram_c 0.11.8 via `FetchContent` (opt-in — only reached when `YDB_SDK_TESTS=ON`), link `slo-utils` privately against `hdr_histogram_static`.

Metrics contract after change

| Metric | Labels |
|---|---|
| `sdk_operations_total` | `operation_type`, `operation_status`, `ref` |
| `sdk_retry_attempts_total` | `operation_type`, `ref` |
| `sdk_operation_latency_p50_seconds` | `operation_type`, `operation_status`, `ref` |
| `sdk_operation_latency_p95_seconds` | `operation_type`, `operation_status`, `ref` |
| `sdk_operation_latency_p99_seconds` | `operation_type`, `operation_status`, `ref` |

Legacy metrics (`sdk_errors_total`, `sdk_operations_success_total`, `sdk_operations_failure_total`, the `sdk_operation_latency_seconds` histogram) are removed.

CI changes (`.github/workflows/`)
- `slo.yml` — delegate to `ydb-slo-action/init@v2`. No more hand-rolled `docker run` orchestration — the v2 action brings up YDB + Prometheus + Grafana + chaos-monkey via its own `deploy/compose.yml`. We just hand it two prebuilt images. Gated on the `SLO` PR label (matches js-sdk/java-sdk/go-sdk). The baseline build falls back to the current image if the historical commit can't be built.
- `slo_report.yml` — pin to `@v2` and add a second job that removes the `SLO` label after the report is published.

Triggering convention

Add the `SLO` label to a PR to opt into a full SLO run. The label is removed automatically once the report is posted.

Test plan
- `docker run --rm -e WORKLOAD_REF=current -e WORKLOAD_DURATION=15 -e YDB_CONNECTION_STRING=grpc://…:2136/?database=/local ydb-app-current` — logs show create → run (15 s) → cleanup, exit 0.
- The `workload_current_command: "--read-rps 500"` override works (passes through to the run phase).
- The `SLO` label on this PR triggers the workflow, `ydb-slo-action/init@v2` starts compose, and Prometheus serves `sdk_operation_latency_p99_seconds{ref="current",operation_type="read"}` with sensible values. Under load, `p99_seconds ≈ 0.05`; after load stops the gauge stops updating (the window resets each second with no new samples).