Skip to content

Flaky test: TestVerticalShardingFuzz — sharded vs unsharded "or vector(…)" semantic divergence #7547

@sandy2008

Description

@sandy2008

Summary

TestVerticalShardingFuzz (integration/query_fuzz_test.go:665) compares results from an unsharded Cortex (c1) against one running with -frontend.query-vertical-shard-size=2 (c2). The fuzzer occasionally generates a query of the shape

(<expr-with-partial-time-coverage>) or vector(<scalar>)

For at least one shard, the LHS sub-expression is empty at certain timestamps, while in the unsharded engine the LHS has data (value 0). The or then falls through to vector(...) in the sharded path but not in the unsharded path, producing a per-timestamp value mismatch that the test's comparer flags as a results mismatch.

This is consistent with prior known semantic-divergence bugs in vertical sharding around functions whose behavior depends on what data the shard sees — see #5203, #5204, #5205 (closed) for the same pattern with absent, absent_over_time, and scalar. The Thanos query analyzer at vendor/github.com/thanos-io/thanos/pkg/querysharding/analyzer.go:111 already marks those three as non-shardable, but it does not include vector.

Hypothesis (not independently proven by replay): when the LHS is itself a complex expression involving stddev_over_time and a binary equality (==), the unsharded engine produces a {} series with value 0 at the failing timestamp, while in the sharded engine — for reasons that may involve range-vector window edges, aggregation push-down, or the shard merger collapsing empty samples — the LHS is absent at that timestamp, causing or vector(...) to fire. The end-user-visible symptom is clear; the precise upstream mechanism is not.

Most recent occurrence

Failure excerpt

query_fuzz_test.go:1967: case 1194 results mismatch.
    range query: (
        sum without (job, series) (
          (stddev_over_time({__name__="test_series_a"}[4m]) == -{__name__="test_series_a",series=~".*"})
        )
      or
        vector(((0.6746652128584755 * 0.5484783823211438) atan2 -0.17384075643661537))
    )
    res1 len: 1 data: {} =>
    2.009985198006908 @[1779304385.517]
    0                  @[1779304415.517]
    2.009985198006908 @[1779304445.517]
    2.009985198006908 @[1779304475.517]
    … (all other timestamps 2.009985198006908)
    res2 len: 1 data: {} =>
    2.009985198006908 @[1779304385.517]
    2.009985198006908 @[1779304415.517]
    2.009985198006908 @[1779304445.517]
    … (all timestamps 2.009985198006908)
--- FAIL: TestVerticalShardingFuzz (15.23s)

c1 (res1, unsharded) has a 0 at timestamp 1779304415.517; c2 (res2, vertical-shard-size=2) has vector(...)'s value 2.009985… there. atan2(0.6747*0.5485, -0.1738) ≈ 2.0099, confirming the right-hand side of the or is what fires in the sharded path at that timestamp.

Empirical flake rate

Per the prior agent's 18-day scan (~245 CI runs across master + PRs):

  • TestVerticalShardingFuzz: 1 observed failure in the scan window (≈0.4% of PR CI runs hitting integration_query_fuzz). Rare relative to TestParquetFuzz/TestExpandedPostingsCacheFuzz, but with a clear known-shape root cause that warrants a fix at the test generator rather than waiting for it to recur.
  • Arch: arm64 in the one observed sample. The broader integration_query_fuzz job has a 67/33 arm64/amd64 skew; the prevailing hypothesis (slower arm64 + amd64-only SIMD in parquet-go widening race windows) does not obviously apply to a semantic-divergence flake. It may be coincidence at n=1.

Sample prior failures

No further occurrences of this exact shape were captured in the 18-day window; if more samples are needed for confidence, leaving the test as-is and grepping for the vector( substring in future failure logs will identify them.

Root cause

  1. The test (integration/query_fuzz_test.go:778) runs the unsharded vs sharded comparison through runQueryFuzzTestCases and decides shape via shouldUseSampleNumComparer (line 1976) — which checks only for topk/bottomk. Queries containing or vector(...) fall through to the strict comparer (line 1966).
  2. isValidQuery (integration/query_fuzz_test.go:1983-2020) filters generated queries by string match for limitk, limit_ratio, --, and (when skipBackwardIncompat=true) stddev/stdvar/quantile/predict_linear/atan2. TestVerticalShardingFuzz calls runQueryFuzzTestCases(..., false) at line 778, so skipBackwardIncompat=false — meaning stddev_over_time and atan2 are not filtered. There is no filter for vector(.
  3. The upstream Thanos query-sharding analyzer (vendor/github.com/thanos-io/thanos/pkg/querysharding/analyzer.go:103-117) marks queries containing absent, absent_over_time, or scalar as non-shardable because their semantics depend on what data the shard sees. vector(...) is not in that list. As a constant generator, vector() is itself shard-safe; but when used as the RHS of or, it forms an LHS-data-dependent shape analogous to absent — and the analyzer doesn't detect this.
  4. Cortex's wrapper pkg/querysharding/util.go:90-123 (NewDisableBinaryExpressionAnalyzer) only disables binary expressions with VectorMatching.On == false; it does not special-case or vector(...).
  5. The exact upstream mechanism by which the LHS goes missing in one sharded path but not unsharded — at one timestamp — is unverified. Candidates: stddev_over_time window-edge floating-point differences when the input series set per shard differs; merger behavior for empty matrices in or; or a per-step evaluation gap. This issue does not pin down which.

Proposed fix

Primary (test-side, low-risk, ~6 lines):

Extend isValidQuery in integration/query_fuzz_test.go:1983-2020 to skip queries containing vector( regardless of skipBackwardIncompat. Justification: vector(...) as a generator is rare in production, and the divergence with vertical sharding is genuine (engine-level) — the unit-level fuzz test isn't the right vehicle for catching it.

if strings.Contains(queryStr, "vector(") {
    // or-vector fallback fires at different timestamps in sharded vs unsharded
    // engines when the LHS has partial time coverage; see #<this issue>.
    return false
}

Secondary (engine-side, larger, possibly out of scope here):

Add vector to the analyzer's non-shardable function list at vendor/github.com/thanos-io/thanos/pkg/querysharding/analyzer.go:111. This would need an upstream Thanos change and may be wider in impact than warranted — vector() is shardable on its own; only its use as a fallback RHS to or is problematic. A more precise analyzer change (mark or vector(...) non-shardable) is the right long-term fix but is upstream work.

Why not other approaches

  • Switch to sampleNumComparer for queries containing or vector(: same shape would still mismatch — the unsharded result has one 0 sample, the sharded result has one 2.0099… sample, totals are equal but the test currently uses strict comparer for non-topk cases and changing that for or vector( would mask real bugs of the same shape.
  • Relax comparer to ignore positions where values differ but counts match: too permissive; this is the test's signal channel for real correctness regressions.
  • Fix the upstream sharding semantics fully: correct in principle but out of scope of "make CI green" — the divergence is a known class of bugs (bug: vertical sharding has issue with absent_over_time #5203/bug: vertical sharding aggregation with absent displays shard by label #5204/bug: vertical sharding doesn't work well with scalar function #5205) and Thanos hasn't enumerated all of them. Filter-at-test plus a tracking issue for the analyzer fix is the practical path.
  • Disable TestVerticalShardingFuzz entirely: loses coverage of the many shapes vertical sharding does handle correctly.

Acceptance criteria

  • After fix, TestVerticalShardingFuzz runs 1000 instant + 1000 range cases without or vector( queries reaching the comparer (verifiable by grepping logs or instrumenting isValidQuery to count skipped queries).
  • No regression in TestVerticalShardingFuzz coverage of or-without-vector (e.g. … or {series="x"} still exercised — promqlsmith generates or between vector selectors too).
  • Linked tracking issue (or note) for the upstream analyzer to detect or vector(...) and mark as non-shardable. This issue can be closed once the test-side filter lands; the upstream fix is a separate follow-up.

Honest caveat: only one failure sample is available, so the root-cause attribution to or vector(...) is well-supported by the failure shape but the deeper mechanism (why the LHS goes missing per-shard at exactly that one timestamp) is unverified. The proposed test-side fix is robust regardless of which mechanism is at play, because it removes the trigger shape.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions