
tests/slo_workloads: align with ydb-slo-action v2 #609

Open
polRk wants to merge 11 commits into main from slo/align-with-action-v2

Conversation

polRk (Member) commented May 5, 2026

Summary

The v2 ydb-platform/ydb-slo-action starts a workload container once per run, passes config exclusively via env vars (WORKLOAD_REF, WORKLOAD_DURATION, YDB_CONNECTION_STRING, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, …), and queries sdk_operation_latency_p{50,95,99}_seconds as pre-computed gauges labeled by (operation_type, operation_status, ref). The existing C++ workload does not match: it requires a subcommand + a long list of CLI flags, bakes ref at compile time via SLO_BRANCH_REF, and publishes an OTel histogram that the action no longer reads. The old CI workflow also hand-rolled the docker run orchestration that ydb-slo-action/init@v2 now owns via deploy/compose.yml.

This PR realigns tests/slo_workloads/ and the SLO CI workflows to the v2 contract with the minimum possible code churn, following the pattern used by the Java SDK (ydb-platform/ydb-java-sdk#644).

Workload changes (tests/slo_workloads/)

  • Dockerfile — drop SLO_BRANCH_REF build arg and the cmake -DSLO_BRANCH_REF=… step. Single ENTRYPOINT, no CMD.
  • utils/utils.{h,cpp} — the subcommand becomes optional. With no free argument, the binary executes create → run → cleanup sequentially in one process (cleanup always runs, even if run errored) — the same single-container lifecycle used by ydb-js-sdk/tests/slo. Option resolution follows CLI > env > default via .DefaultValue(), as sketched after this list:
    • -c from YDB_CONNECTION_STRING (or composed from YDB_ENDPOINT + YDB_DATABASE);
    • --metrics-push-url from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
    • --time from WORKLOAD_DURATION.
  • utils/metrics.cpp — replace the OTel Histogram<double> with HDR-backed gauges. Only successful operations are recorded (errors are excluded from percentile stats, matching operation_status="success" in the action's metrics query). A background thread snapshots p50/p95/p99 every second, publishes the three sdk_operation_latency_p*_seconds gauges, then resets the HDR window, so each gauge value reflects only the last interval (see the publisher sketch after the metrics table below). sdk_retry_attempts_total becomes a counter of retry_attempts + 1 per op. ref is read from WORKLOAD_REF at startup, not at compile time.
  • utils/CMakeLists.txt — add HdrHistogram_c 0.11.8 via FetchContent (opt-in — only reached when YDB_SDK_TESTS=ON), link slo-utils privately against hdr_histogram_static.
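
For illustration, a minimal sketch of the CLI > env > default chain. The EnvOr helper, the struct, and the 60-second fallback are illustrative assumptions; only the idea that env-derived values seed .DefaultValue() (so an explicit CLI flag still wins) is taken from the actual change:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the env var if set and non-empty,
// otherwise the supplied fallback.
static std::string EnvOr(const char* name, const std::string& fallback) {
    const char* v = std::getenv(name);
    return (v && *v) ? std::string(v) : fallback;
}

// Defaults computed from the environment; the option parser installs these
// via .DefaultValue(), so an explicit CLI flag still overrides them:
// CLI > env > default.
struct TWorkloadDefaults {
    std::string ConnectionString;
    std::string MetricsPushUrl;
    std::string DurationSeconds;
};

static TWorkloadDefaults ComputeDefaults() {
    TWorkloadDefaults d;
    d.ConnectionString = EnvOr("YDB_CONNECTION_STRING", "");
    if (d.ConnectionString.empty()) {
        // Compose endpoint + database when only the parts are provided.
        d.ConnectionString = EnvOr("YDB_ENDPOINT", "")
                           + "/?database=" + EnvOr("YDB_DATABASE", "");
    }
    d.MetricsPushUrl = EnvOr("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "");
    d.DurationSeconds = EnvOr("WORKLOAD_DURATION", "60");  // fallback assumed
    return d;
}
```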

Metrics contract after change

| Metric | Type | Labels |
| --- | --- | --- |
| sdk_operations_total | counter | operation_type, operation_status, ref |
| sdk_retry_attempts_total | counter | operation_type, ref |
| sdk_operation_latency_p50_seconds | gauge | operation_type, operation_status, ref |
| sdk_operation_latency_p95_seconds | gauge | operation_type, operation_status, ref |
| sdk_operation_latency_p99_seconds | gauge | operation_type, operation_status, ref |

Legacy metrics (sdk_errors_total, sdk_operations_success_total, sdk_operations_failure_total, sdk_operation_latency_seconds histogram) are removed.
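
To make the snapshot-and-reset semantics concrete, here is a minimal sketch against the HdrHistogram_c C API. The class name, histogram bounds, and the publish callback are assumptions; the hdr_* calls and the per-second p50/p95/p99 snapshot-then-reset flow mirror what the PR describes:

```cpp
#include <hdr/hdr_histogram.h>

#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical per-operation latency window; the real type in metrics.cpp
// may differ in naming and bounds.
class TLatencyWindow {
public:
    TLatencyWindow() {
        // Assumed bounds: 1 us .. 60 s (in microseconds), 3 significant digits.
        hdr_init(1, 60000000LL, 3, &Hist_);
    }
    ~TLatencyWindow() { hdr_close(Hist_); }

    void RecordMicros(int64_t us) {
        std::lock_guard<std::mutex> lock(Mutex_);
        hdr_record_value(Hist_, us);
    }

    // Called once per second by the background thread: read the percentiles,
    // clear the window, hand the values to the gauges.
    void SnapshotAndPublish(const std::function<void(double, double, double)>& publish) {
        double p50, p95, p99;
        {
            std::lock_guard<std::mutex> lock(Mutex_);
            p50 = hdr_value_at_percentile(Hist_, 50.0) / 1e6;  // us -> seconds
            p95 = hdr_value_at_percentile(Hist_, 95.0) / 1e6;
            p99 = hdr_value_at_percentile(Hist_, 99.0) / 1e6;
            hdr_reset(Hist_);  // next interval starts empty
        }
        publish(p50, p95, p99);  // e.g. Record() on the three OTel gauges
    }

private:
    hdr_histogram* Hist_ = nullptr;
    std::mutex Mutex_;
};
```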

CI changes (.github/workflows/)

  • slo.yml — delegate to ydb-slo-action/init@v2. No more hand-rolled docker run orchestration — the v2 action brings up YDB + Prometheus + Grafana + chaos-monkey via its own deploy/compose.yml. We just hand it two prebuilt images. Gated on the SLO PR label (matches js-sdk/java-sdk/go-sdk). Baseline build has a fallback to the current image if the historical commit can't be built.
  • slo_report.yml — pin to @v2 and add a second job that removes the SLO label after the report is published.

Triggering convention

Add the SLO label to a PR to opt into a full SLO run. The label is removed automatically once the report is posted.

Test plan

  • Workload image builds under CI (local amd64 build blocked by Rosetta/QEMU issues on arm64 Mac).
  • Standalone smoke test with env vars only: docker run --rm -e WORKLOAD_REF=current -e WORKLOAD_DURATION=15 -e YDB_CONNECTION_STRING=grpc://…:2136/?database=/local ydb-app-current — logs show create → run (15 s) → cleanup, exit 0.
  • workload_current_command: "--read-rps 500" override works (passes through to the run phase).
  • SLO label on this PR triggers the workflow, ydb-slo-action/init@v2 starts compose, and Prometheus serves sdk_operation_latency_p99_seconds{ref="current",operation_type="read"} with sensible values.
  • Verify HDR reset: drive known 50 ms latency, confirm p99_seconds ≈ 0.05; after load stops the gauge stops updating (window resets each second with no new samples).

The v2 SLO action (ydb-platform/ydb-slo-action) starts a workload
container once per run, passes config via env vars (WORKLOAD_REF,
WORKLOAD_DURATION, YDB_CONNECTION_STRING, ...), and queries
sdk_operation_latency_p{50,95,99}_seconds as pre-computed gauges
labeled by (operation_type, operation_status, ref).

Changes:

- Dockerfile: drop SLO_BRANCH_REF compile-time ref baking; a single
  ENTRYPOINT binary without CMD is now sufficient.
- utils/utils.cpp: subcommand is optional — with no free argument the
  binary runs create -> run -> cleanup sequentially (cleanup always
  runs, even if run errored; see the sketch after this message).
  Option resolution follows the standard CLI > env > default priority
  via .DefaultValue():
  - -c from YDB_CONNECTION_STRING (or built from YDB_ENDPOINT +
    YDB_DATABASE);
  - --metrics-push-url from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
  - --time from WORKLOAD_DURATION.
- utils/metrics.cpp: replace the OTel Histogram with HDR-backed
  gauges. Only successful operations are recorded; a background
  thread snapshots p50/p95/p99 every second, publishes the gauges
  with operation_status="success", then resets the HDR window.
  sdk_retry_attempts_total becomes a counter of (retry_attempts + 1)
  per op. ref is read from WORKLOAD_REF at startup, not compile time.
- utils/CMakeLists.txt: pull HdrHistogram_c 0.11.8 via FetchContent
  (opt-in via the outer YDB_SDK_TESTS flag) and link slo-utils
  against hdr_histogram_static.
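
A sketch of the implicit all-in-one lifecycle described in the commit message above; DoCreate/DoRun/DoCleanup are hypothetical stand-ins for the real subcommand handlers. The point is that cleanup runs no matter how the run phase ends:

```cpp
#include <exception>
#include <iostream>

// Hypothetical stand-ins for the create/run/cleanup subcommand handlers.
int DoCreate();
int DoRun();
int DoCleanup();

int RunAllPhases() {
    DoCreate();
    int rc = 0;
    try {
        rc = DoRun();   // the timed workload phase
    } catch (const std::exception& e) {
        std::cerr << "run failed: " << e.what() << '\n';
        rc = 1;
    }
    DoCleanup();        // always runs, even after a failed run
    return rc;
}
```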
Copilot AI review requested due to automatic review settings May 5, 2026 13:42

Copilot AI left a comment


Pull request overview

This PR updates the C++ SLO workload (tests/slo_workloads/) to match the v2 ydb-slo-action contract: configuration via environment variables, a single-container lifecycle (optionally running create → run → cleanup in one process), and publishing precomputed latency percentile gauges with the expected labels.

Changes:

  • Make the subcommand optional; when omitted, run create → run → cleanup sequentially in one process, ensuring cleanup always runs.
  • Switch option resolution to CLI > env > default, including env-derived defaults for connection string, metrics endpoint, and workload duration.
  • Replace the exported latency histogram with HDR-backed p50/p95/p99 gauge metrics and adjust retry-attempts reporting; add HdrHistogram_c via FetchContent; simplify the Docker build.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/slo_workloads/utils/utils.h | Adds an All command mode to support full lifecycle execution with no subcommand. |
| tests/slo_workloads/utils/utils.cpp | Implements env-derived defaults, allows zero free args, and runs create → run → cleanup in All mode. |
| tests/slo_workloads/utils/metrics.cpp | Reworks OTLP metrics to export percentile gauges from an HDR window and updates counters/labels. |
| tests/slo_workloads/utils/CMakeLists.txt | Fetches and links HdrHistogram_c for the SLO utils library. |
| tests/slo_workloads/Dockerfile | Removes the build-time REF arg and builds the workload with a single ENTRYPOINT. |


Comment thread on tests/slo_workloads/utils/metrics.cpp (outdated)
Comment on lines +192 to +203

```cpp
void PublishPercentiles() {
    auto snapshot = Latency_.SnapshotAndReset();
    if (!snapshot.HasData) {
        return;
    }
    auto attrs = MergeAttributes({
        {"operation_type", OperationType_},
        {"operation_status", "success"},
    });
    LatencyP50_->Record(snapshot.P50, attrs);
    LatencyP95_->Record(snapshot.P95, attrs);
    LatencyP99_->Record(snapshot.P99, attrs);
}
```
Comment on lines 161 to 168

```diff
 void InitMetrics() {
-    ErrorsTotal_ = Meter_->CreateUInt64Counter("sdk_errors_total",
-        "Total number of errors encountered, categorized by error type."
-    );
-    OperationsTotal_ = Meter_->CreateDoubleCounter("sdk_operations_total",
-        "Total number of operations, categorized by operation type and status."
-    );
+    OperationsTotal_ = Meter_->CreateUInt64Counter("sdk_operations_total",
+        "Total number of operations, categorized by type attempted by the SDK."
+    );
-    OperationsSuccessTotal_ = Meter_->CreateUInt64Counter("sdk_operations_success_total",
-        "Total number of successful operations, categorized by type."
-    );
+    RetryAttemptsTotal_ = Meter_->CreateDoubleCounter("sdk_retry_attempts_total",
+        "Total number of retry attempts (including the first attempt), categorized by operation type."
+    );
```
Rewrite the SLO workflows following the pattern used by ydb-java-sdk
(ydb-platform/ydb-java-sdk#644).

slo.yml:
- Drop the hand-rolled docker run orchestration that spun up YDB and
  invoked the workload with --dont-push / explicit create/run phases.
  The v2 action owns that lifecycle via deploy/compose.yml — we just
  hand it two prebuilt images.
- Gate on the `SLO` PR label.
- Build both images with a single `docker build` per ref; if the
  baseline commit can't be built (missing Dockerfile or compile error
  on a historical SHA), fall back to the current image so the run is
  comparable against itself rather than silently failing.
- Drop `--build-arg REF=…` — ref is now read from WORKLOAD_REF env at
  runtime.
- Rename matrix entry to `cpp-key-value` to match the built binary and
  collapse the per-compiler matrix to a single clang entry (gcc variant
  can be added back when needed).

slo_report.yml:
- Pin to @v2.
- Add a second job that removes the `SLO` label from the PR after the
  report is published, matching js-sdk/java-sdk/go-sdk.
polRk added the SLO label May 5, 2026
polRk added 9 commits May 5, 2026 17:11
- Counters: switch OperationsTotal/RetryAttemptsTotal from DoubleCounter
  to UInt64Counter. Operation counts and retry attempts are inherently
  integer; float representation could drift for very high cumulative
  values and doesn't match the Prometheus counter convention.
- Percentile gauges: publish 0.0 on empty-window intervals instead of
  returning early. OTel's sync Gauge holds the last Record() value and
  the periodic exporter re-emits it every collection cycle — without an
  explicit reset, gauges would look "stuck" at the last non-empty value
  after load stops, contradicting the per-second HDR reset semantics.
Shared CI runners (and local docker builds) periodically hit
connection timeouts on ppa.launchpadcontent.net. Telling apt itself
to retry via Acquire::Retries=5 plus a 60 s connect timeout handles
the blip in-place without a shell retry loop.
Two separate issues were making every SLO PR a coin flip:

1. Any transient network error (GitHub release download, PPA timeout,
   abseil/protobuf/grpc tarball) kills the entire 30-min build. Each
   of the 8 wget calls and both apt-get steps is now retried:
   - wget gets --tries=5 --waitretry=15 --timeout=60
     --retry-connrefused --retry-on-http-error=500,502,503,504
     via a shared WGET_OPTS env var.
   - apt already has Acquire::Retries=5.

2. Every CI run does a full cold build. The Dockerfile's toolchain and
   dep layers (~25 min) never change between SDK commits, so caching
   them is trivially safe.
   - Switch slo.yml to docker/build-push-action@v6 with type=gha
     cache export/import, scoped per preset.
   - First build on a runner still takes ~30 min; subsequent builds
     where only the SDK source changed should take ~3 min.
   - Baseline build uses continue-on-error + an explicit fallback
     step that retags ydb-app-current as ydb-app-baseline when the
     historical commit won't compile — replaces the inline shell
     if-else that docker/build-push-action can't express.
The earlier Acquire::Retries=5 tweak covered HTTP-level errors but
didn't handle TCP connect timeouts to ppa.launchpadcontent.net, which
is the actual failure mode observed on shared CI runners. Wrap both
the add-apt-repository step and the main apt-get install in shell
retry loops (5 attempts, 15/30/45/60/75 s backoff) so a CDN blip no
longer kills the 30-min build.
The v2 SLO action invokes the workload as `slo-key-value <run-args>`
without a subcommand keyword. The global parser used to error on
unknown long options like --read-rps; now it tolerates them and the
implicit All branch passes the leftover argv to the run phase.
The baseline image used to be built from the merge-base checkout's
tests/slo_workloads/, so any harness change on the PR (Dockerfile,
CLI parser, …) was absent from the baseline. SDK comparison should
only vary the library, not the harness — so reuse current's harness
on both sides.
cmake configure/build now run under ccache as the C/C++ compiler
launcher, with /root/.ccache exposed as a BuildKit cache mount so
state persists across runs through cache-to=type=gha,mode=max
(BuildKit ≥0.13 exports cache mounts under mode=max). Cold runs
incur the usual full compile; warm runs reuse object hashes and
should drop cmake --build from ~14 min to a few minutes.
…ring lacks ?database=

The v2 SLO action provides YDB_CONNECTION_STRING in path form
(grpc://host:port/Root/testdb), which GetDatabase can't parse, so
prefix stayed empty and create issued CreateTable("key_value")
without the database root → BAD_REQUEST. Falling back to the
YDB_DATABASE env var (which the action also sets) restores the
correct table path.
…he-dance

cache-to: type=gha,mode=max exports layer cache but not the contents
of --mount=type=cache, so the ccache mount started empty on every
run (0% hit rate observed). Restore /root/.ccache from a host dir
backed by actions/cache, inject it into BuildKit before each build
via the cache-dance action, and extract the updated state for the
next run. Same scope is shared by current and baseline since the
SDK code overlap dominates ccache hit potential.
Comment thread on .github/workflows/slo.yml

```cpp
// YDB SLO action sets YDB_CONNECTION_STRING in path form
// (grpc://host:port/Root/testdb), which GetDatabase can't parse.
// Fall back to YDB_DATABASE which the action sets alongside it.
prefix = GetEnv("YDB_DATABASE");
```
Collaborator commented:
The database can be extracted from the driver.

Copilot AI left a comment
Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment on lines +252 to +257

```cpp
char programName[] = "slo";
std::vector<char*> runArgv;
runArgv.reserve(argc + 1);
runArgv.push_back(programName);
for (int i = 0; i < argc; ++i) {
    runArgv.push_back(argv[i]);
}
```

A fragment of the apt-get retry loop described in the commits above:

```sh
    break; \
    echo "apt-get install attempt $i failed; sleeping $((i * 15))s"; \
    sleep $((i * 15)); \
    apt-get -y update || true; \
```
Comment on lines +25 to +27

```yaml
remove-slo-label:
  if: github.event.workflow_run.event == 'pull_request'
  runs-on: ubuntu-latest
```
Comment on lines +198 to +205

```cpp
// When no successful ops landed in the last second, publish 0.0
// for all percentiles so the gauges reset with the HDR window
// rather than appearing "stuck" at the last non-empty value (the
// OTel periodic exporter would otherwise re-emit the previous
// Record() value on every collection cycle).
LatencyP50_->Record(snapshot.HasData ? snapshot.P50 : 0.0, attrs);
LatencyP95_->Record(snapshot.HasData ? snapshot.P95 : 0.0, attrs);
LatencyP99_->Record(snapshot.HasData ? snapshot.P99 : 0.0, attrs);
```