
tests/slo_workloads: align with ydb-slo-action v2 #609

Open
polRk wants to merge 11 commits into main from slo/align-with-action-v2

Conversation

polRk (Member) commented May 5, 2026

Summary

The v2 ydb-platform/ydb-slo-action starts a workload container once per run, passes config exclusively via env vars (WORKLOAD_REF, WORKLOAD_DURATION, YDB_CONNECTION_STRING, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT, …), and queries sdk_operation_latency_p{50,95,99}_seconds as pre-computed gauges labeled by (operation_type, operation_status, ref). The existing C++ workload does not match: it requires a subcommand + a long list of CLI flags, bakes ref at compile time via SLO_BRANCH_REF, and publishes an OTel histogram that the action no longer reads. The old CI workflow also hand-rolled the docker run orchestration that ydb-slo-action/init@v2 now owns via deploy/compose.yml.

This PR realigns tests/slo_workloads/ and the SLO CI workflows to the v2 contract with the minimum possible code churn, following the pattern used by the Java SDK (ydb-platform/ydb-java-sdk#644).

Workload changes (tests/slo_workloads/)

  • Dockerfile — drop SLO_BRANCH_REF build arg and the cmake -DSLO_BRANCH_REF=… step. Single ENTRYPOINT, no CMD.
  • utils/utils.{h,cpp} — the subcommand becomes optional. With no free argument, the binary executes create → run → cleanup sequentially in one process (cleanup always runs, even if run errored) — the same single-container lifecycle used by ydb-js-sdk/tests/slo. Option resolution follows CLI > env > default via .DefaultValue(), as sketched after this list:
    • -c from YDB_CONNECTION_STRING (or composed from YDB_ENDPOINT + YDB_DATABASE);
    • --metrics-push-url from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
    • --time from WORKLOAD_DURATION.
  • utils/metrics.cpp — replace the OTel Histogram<double> with HDR-backed gauges. Only successful operations are recorded (errors are excluded from percentile stats, matching operation_status="success" in the action's metrics query). A background thread snapshots p50/p95/p99 every second, publishes the three sdk_operation_latency_p*_seconds gauges, then resets the HDR window, so each gauge value reflects only the last interval (see the publisher sketch after the metrics table below). sdk_retry_attempts_total becomes a counter of retry_attempts + 1 per op. ref is read from WORKLOAD_REF at startup, not at compile time.
  • utils/CMakeLists.txt — add HdrHistogram_c 0.11.8 via FetchContent (opt-in — only reached when YDB_SDK_TESTS=ON), link slo-utils privately against hdr_histogram_static.
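
For illustration, a minimal sketch of the CLI > env > default chain. The EnvOr helper, the struct, and the 60-second fallback are illustrative assumptions; only the idea that env-derived values seed .DefaultValue() (so an explicit CLI flag still wins) is taken from the actual change:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the env var if set and non-empty,
// otherwise the supplied fallback.
static std::string EnvOr(const char* name, const std::string& fallback) {
    const char* v = std::getenv(name);
    return (v && *v) ? std::string(v) : fallback;
}

// Defaults computed from the environment; the option parser installs these
// via .DefaultValue(), so an explicit CLI flag still overrides them:
// CLI > env > default.
struct TWorkloadDefaults {
    std::string ConnectionString;
    std::string MetricsPushUrl;
    std::string DurationSeconds;
};

static TWorkloadDefaults ComputeDefaults() {
    TWorkloadDefaults d;
    d.ConnectionString = EnvOr("YDB_CONNECTION_STRING", "");
    if (d.ConnectionString.empty()) {
        // Compose endpoint + database when only the parts are provided.
        d.ConnectionString = EnvOr("YDB_ENDPOINT", "")
                           + "/?database=" + EnvOr("YDB_DATABASE", "");
    }
    d.MetricsPushUrl = EnvOr("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "");
    d.DurationSeconds = EnvOr("WORKLOAD_DURATION", "60");  // fallback assumed
    return d;
}
```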

Metrics contract after change

| Metric | Type | Labels |
| --- | --- | --- |
| sdk_operations_total | counter | operation_type, operation_status, ref |
| sdk_retry_attempts_total | counter | operation_type, ref |
| sdk_operation_latency_p50_seconds | gauge | operation_type, operation_status, ref |
| sdk_operation_latency_p95_seconds | gauge | operation_type, operation_status, ref |
| sdk_operation_latency_p99_seconds | gauge | operation_type, operation_status, ref |

Legacy metrics (sdk_errors_total, sdk_operations_success_total, sdk_operations_failure_total, sdk_operation_latency_seconds histogram) are removed.
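
To make the snapshot-and-reset semantics concrete, here is a minimal sketch against the HdrHistogram_c C API. The class name, histogram bounds, and the publish callback are assumptions; the hdr_* calls and the per-second p50/p95/p99 snapshot-then-reset flow mirror what the PR describes:

```cpp
#include <hdr/hdr_histogram.h>

#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical per-operation latency window; the real type in metrics.cpp
// may differ in naming and bounds.
class TLatencyWindow {
public:
    TLatencyWindow() {
        // Assumed bounds: 1 us .. 60 s (in microseconds), 3 significant digits.
        hdr_init(1, 60000000LL, 3, &Hist_);
    }
    ~TLatencyWindow() { hdr_close(Hist_); }

    void RecordMicros(int64_t us) {
        std::lock_guard<std::mutex> lock(Mutex_);
        hdr_record_value(Hist_, us);
    }

    // Called once per second by the background thread: read the percentiles,
    // clear the window, hand the values to the gauges.
    void SnapshotAndPublish(const std::function<void(double, double, double)>& publish) {
        double p50, p95, p99;
        {
            std::lock_guard<std::mutex> lock(Mutex_);
            p50 = hdr_value_at_percentile(Hist_, 50.0) / 1e6;  // us -> seconds
            p95 = hdr_value_at_percentile(Hist_, 95.0) / 1e6;
            p99 = hdr_value_at_percentile(Hist_, 99.0) / 1e6;
            hdr_reset(Hist_);  // next interval starts empty
        }
        publish(p50, p95, p99);  // e.g. Record() on the three OTel gauges
    }

private:
    hdr_histogram* Hist_ = nullptr;
    std::mutex Mutex_;
};
```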

CI changes (.github/workflows/)

  • slo.yml — delegate to ydb-slo-action/init@v2. No more hand-rolled docker run orchestration — the v2 action brings up YDB + Prometheus + Grafana + chaos-monkey via its own deploy/compose.yml. We just hand it two prebuilt images. Gated on the SLO PR label (matches js-sdk/java-sdk/go-sdk). Baseline build has a fallback to the current image if the historical commit can't be built.
  • slo_report.yml — pin to @v2 and add a second job that removes the SLO label after the report is published.

Triggering convention

Add the SLO label to a PR to opt into a full SLO run. The label is removed automatically once the report is posted.

Test plan

  • Workload image builds under CI (local amd64 build blocked by Rosetta/QEMU issues on arm64 Mac).
  • Standalone smoke test with env vars only: docker run --rm -e WORKLOAD_REF=current -e WORKLOAD_DURATION=15 -e YDB_CONNECTION_STRING=grpc://…:2136/?database=/local ydb-app-current — logs show create → run (15 s) → cleanup, exit 0.
  • workload_current_command: "--read-rps 500" override works (passes through to the run phase).
  • SLO label on this PR triggers the workflow, ydb-slo-action/init@v2 starts compose, and Prometheus serves sdk_operation_latency_p99_seconds{ref="current",operation_type="read"} with sensible values.
  • Verify HDR reset: drive known 50 ms latency, confirm p99_seconds ≈ 0.05; after load stops the gauge stops updating (window resets each second with no new samples).

The v2 SLO action (ydb-platform/ydb-slo-action) starts a workload
container once per run, passes config via env vars (WORKLOAD_REF,
WORKLOAD_DURATION, YDB_CONNECTION_STRING, ...), and queries
sdk_operation_latency_p{50,95,99}_seconds as pre-computed gauges
labeled by (operation_type, operation_status, ref).

Changes:

- Dockerfile: drop SLO_BRANCH_REF compile-time ref baking; a single
  ENTRYPOINT binary without CMD is now sufficient.
- utils/utils.cpp: subcommand is optional — with no free argument the
  binary runs create -> run -> cleanup sequentially (cleanup always
  runs, even if run errored; see the sketch after this message).
  Option resolution follows the standard CLI > env > default priority
  via .DefaultValue():
  - -c from YDB_CONNECTION_STRING (or built from YDB_ENDPOINT +
    YDB_DATABASE);
  - --metrics-push-url from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT;
  - --time from WORKLOAD_DURATION.
- utils/metrics.cpp: replace the OTel Histogram with HDR-backed
  gauges. Only successful operations are recorded; a background
  thread snapshots p50/p95/p99 every second, publishes the gauges
  with operation_status="success", then resets the HDR window.
  sdk_retry_attempts_total becomes a counter of (retry_attempts + 1)
  per op. ref is read from WORKLOAD_REF at startup, not compile time.
- utils/CMakeLists.txt: pull HdrHistogram_c 0.11.8 via FetchContent
  (opt-in via the outer YDB_SDK_TESTS flag) and link slo-utils
  against hdr_histogram_static.
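
A sketch of the implicit all-in-one lifecycle described in the commit message above; DoCreate/DoRun/DoCleanup are hypothetical stand-ins for the real subcommand handlers. The point is that cleanup runs no matter how the run phase ends:

```cpp
#include <exception>
#include <iostream>

// Hypothetical stand-ins for the create/run/cleanup subcommand handlers.
int DoCreate();
int DoRun();
int DoCleanup();

int RunAllPhases() {
    DoCreate();
    int rc = 0;
    try {
        rc = DoRun();   // the timed workload phase
    } catch (const std::exception& e) {
        std::cerr << "run failed: " << e.what() << '\n';
        rc = 1;
    }
    DoCleanup();        // always runs, even after a failed run
    return rc;
}
```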
Copilot AI review requested due to automatic review settings May 5, 2026 13:42

Copilot AI left a comment


Pull request overview

This PR updates the C++ SLO workload (tests/slo_workloads/) to match the v2 ydb-slo-action contract: configuration via environment variables, a single-container lifecycle (optionally running create → run → cleanup in one process), and publishing precomputed latency percentile gauges with the expected labels.

Changes:

  • Make the subcommand optional; when omitted, run create → run → cleanup sequentially in one process, ensuring cleanup always runs.
  • Switch option resolution to CLI > env > default, including env-derived defaults for connection string, metrics endpoint, and workload duration.
  • Replace the exported latency histogram with HDR-backed p50/p95/p99 gauge metrics and adjust retry-attempts reporting; add HdrHistogram_c via FetchContent; simplify the Docker build.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/slo_workloads/utils/utils.h | Adds an All command mode to support full lifecycle execution with no subcommand. |
| tests/slo_workloads/utils/utils.cpp | Implements env-derived defaults, allows zero free args, and runs create → run → cleanup in All mode. |
| tests/slo_workloads/utils/metrics.cpp | Reworks OTLP metrics to export percentile gauges from an HDR window and updates counters/labels. |
| tests/slo_workloads/utils/CMakeLists.txt | Fetches and links HdrHistogram_c for the SLO utils library. |
| tests/slo_workloads/Dockerfile | Removes the build-time REF arg and builds the workload with a single ENTRYPOINT. |


Comment thread on tests/slo_workloads/utils/metrics.cpp (outdated)
Comment on lines +192 to +203

```cpp
void PublishPercentiles() {
    auto snapshot = Latency_.SnapshotAndReset();
    if (!snapshot.HasData) {
        return;
    }
    auto attrs = MergeAttributes({
        {"operation_type", OperationType_},
        {"operation_status", "success"},
    });
    LatencyP50_->Record(snapshot.P50, attrs);
    LatencyP95_->Record(snapshot.P95, attrs);
    LatencyP99_->Record(snapshot.P99, attrs);
}
```
Comment on lines 161 to 168

```diff
 void InitMetrics() {
-    ErrorsTotal_ = Meter_->CreateUInt64Counter("sdk_errors_total",
-        "Total number of errors encountered, categorized by error type."
-    );
-    OperationsTotal_ = Meter_->CreateDoubleCounter("sdk_operations_total",
-        "Total number of operations, categorized by operation type and status."
-    );
+    OperationsTotal_ = Meter_->CreateUInt64Counter("sdk_operations_total",
+        "Total number of operations, categorized by type attempted by the SDK."
+    );
-    OperationsSuccessTotal_ = Meter_->CreateUInt64Counter("sdk_operations_success_total",
-        "Total number of successful operations, categorized by type."
-    );
+    RetryAttemptsTotal_ = Meter_->CreateDoubleCounter("sdk_retry_attempts_total",
+        "Total number of retry attempts (including the first attempt), categorized by operation type."
+    );
```
Rewrite the SLO workflows following the pattern used by ydb-java-sdk
(ydb-platform/ydb-java-sdk#644).

slo.yml:
- Drop the hand-rolled docker run orchestration that spun up YDB and
  invoked the workload with --dont-push / explicit create/run phases.
  The v2 action owns that lifecycle via deploy/compose.yml — we just
  hand it two prebuilt images.
- Gate on the `SLO` PR label.
- Build both images with a single `docker build` per ref; if the
  baseline commit can't be built (missing Dockerfile or compile error
  on a historical SHA), fall back to the current image so the run is
  comparable against itself rather than silently failing.
- Drop `--build-arg REF=…` — ref is now read from WORKLOAD_REF env at
  runtime.
- Rename matrix entry to `cpp-key-value` to match the built binary and
  collapse the per-compiler matrix to a single clang entry (gcc variant
  can be added back when needed).

slo_report.yml:
- Pin to @v2.
- Add a second job that removes the `SLO` label from the PR after the
  report is published, matching js-sdk/java-sdk/go-sdk.
polRk added the SLO label May 5, 2026
polRk added 9 commits May 5, 2026 17:11
- Counters: switch OperationsTotal/RetryAttemptsTotal from DoubleCounter
  to UInt64Counter. Operation counts and retry attempts are inherently
  integer; float representation could drift for very high cumulative
  values and doesn't match the Prometheus counter convention.
- Percentile gauges: publish 0.0 on empty-window intervals instead of
  returning early. OTel's sync Gauge holds the last Record() value and
  the periodic exporter re-emits it every collection cycle — without an
  explicit reset, gauges would look "stuck" at the last non-empty value
  after load stops, contradicting the per-second HDR reset semantics.
Shared CI runners (and local docker builds) periodically hit
connection timeouts on ppa.launchpadcontent.net. Telling apt itself
to retry via Acquire::Retries=5 plus a 60 s connect timeout handles
the blip in-place without a shell retry loop.
Two separate issues were making every SLO PR a coin flip:

1. Any transient network error (GitHub release download, PPA timeout,
   abseil/protobuf/grpc tarball) kills the entire 30-min build. Each
   of the 8 wget calls and both apt-get steps is now retried:
   - wget gets --tries=5 --waitretry=15 --timeout=60
     --retry-connrefused --retry-on-http-error=500,502,503,504
     via a shared WGET_OPTS env var.
   - apt already has Acquire::Retries=5.

2. Every CI run does a full cold build. The Dockerfile's toolchain and
   dep layers (~25 min) never change between SDK commits, so caching
   them is trivially safe.
   - Switch slo.yml to docker/build-push-action@v6 with type=gha
     cache export/import, scoped per preset.
   - First build on a runner still takes ~30 min; subsequent builds
     where only the SDK source changed should take ~3 min.
   - Baseline build uses continue-on-error + an explicit fallback
     step that retags ydb-app-current as ydb-app-baseline when the
     historical commit won't compile — replaces the inline shell
     if-else that docker/build-push-action can't express.
The earlier Acquire::Retries=5 tweak covered HTTP-level errors but
didn't handle TCP connect timeouts to ppa.launchpadcontent.net, which
is the actual failure mode observed on shared CI runners. Wrap both
the add-apt-repository step and the main apt-get install in shell
retry loops (5 attempts, 15/30/45/60/75 s backoff) so a CDN blip no
longer kills the 30-min build.
The v2 SLO action invokes the workload as `slo-key-value <run-args>`
without a subcommand keyword. The global parser used to error on
unknown long options like --read-rps; now it tolerates them and the
implicit All branch passes the leftover argv to the run phase.
The baseline image used to be built from the merge-base checkout's
tests/slo_workloads/, so any harness change on the PR (Dockerfile,
CLI parser, …) was absent from the baseline. SDK comparison should
only vary the library, not the harness — so reuse current's harness
on both sides.
cmake configure/build now run under ccache as the C/C++ compiler
launcher, with /root/.ccache exposed as a BuildKit cache mount so
state persists across runs through cache-to=type=gha,mode=max
(BuildKit ≥0.13 exports cache mounts under mode=max). Cold runs
incur the usual full compile; warm runs reuse object hashes and
should drop cmake --build from ~14 min to a few minutes.
…ring lacks ?database=

The v2 SLO action provides YDB_CONNECTION_STRING in path form
(grpc://host:port/Root/testdb), which GetDatabase can't parse, so
prefix stayed empty and create issued CreateTable("key_value")
without the database root → BAD_REQUEST. Falling back to the
YDB_DATABASE env var (which the action also sets) restores the
correct table path.
…he-dance

cache-to: type=gha,mode=max exports layer cache but not the contents
of --mount=type=cache, so the ccache mount started empty on every
run (0% hit rate observed). Restore /root/.ccache from a host dir
backed by actions/cache, inject it into BuildKit before each build
via the cache-dance action, and extract the updated state for the
next run. Same scope is shared by current and baseline since the
SDK code overlap dominates ccache hit potential.
Comment thread on .github/workflows/slo.yml

```cpp
// YDB SLO action sets YDB_CONNECTION_STRING in path form
// (grpc://host:port/Root/testdb), which GetDatabase can't parse.
// Fall back to YDB_DATABASE which the action sets alongside it.
prefix = GetEnv("YDB_DATABASE");
```
Collaborator commented:
The database can be extracted from the driver.

Copilot AI left a comment
Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment on lines +252 to +257

```cpp
char programName[] = "slo";
std::vector<char*> runArgv;
runArgv.reserve(argc + 1);
runArgv.push_back(programName);
for (int i = 0; i < argc; ++i) {
    runArgv.push_back(argv[i]);
}
```

A fragment of the apt-get retry loop described in the commits above:

```sh
    break; \
    echo "apt-get install attempt $i failed; sleeping $((i * 15))s"; \
    sleep $((i * 15)); \
    apt-get -y update || true; \
```
Comment on lines +25 to +27

```yaml
remove-slo-label:
  if: github.event.workflow_run.event == 'pull_request'
  runs-on: ubuntu-latest
```
Comment on lines +198 to +205

```cpp
// When no successful ops landed in the last second, publish 0.0
// for all percentiles so the gauges reset with the HDR window
// rather than appearing "stuck" at the last non-empty value (the
// OTel periodic exporter would otherwise re-emit the previous
// Record() value on every collection cycle).
LatencyP50_->Record(snapshot.HasData ? snapshot.P50 : 0.0, attrs);
LatencyP95_->Record(snapshot.HasData ? snapshot.P95 : 0.0, attrs);
LatencyP99_->Record(snapshot.HasData ? snapshot.P99 : 0.0, attrs);
```