Skip to content

Flaky test: TestExpandedPostingsCacheFuzz — inverted isValidQuery filter lets backward-incompat queries (quantile, atan2, …) through #7545

@sandy2008

Description

@sandy2008

Summary

TestExpandedPostingsCacheFuzz is a cross-version compatibility test: cortex-1 runs quay.io/cortexproject/cortex:v1.18.0, cortex-2 runs the current build (with expanded-postings cache enabled). Both instances receive the same remote_write pushes; the test compares query results between them.

Two latent bugs in the test body cause it to flake:

Bug 1 (primary, causes the visible failures): the query-generation loop at integration/query_fuzz_test.go:552-563 filters generated PromQL with isValidQuery(expr, true), which is intended to drop queries whose semantics changed across the Prometheus versions embedded in different Cortex releases (stddev, stdvar, quantile, predict_linear, atan2, plus limitk, limit_ratio, --). However the condition is inverted: when isValidQuery returns true (i.e. the query is safe to run) the loop breaks without appending; when it returns false (i.e. the query should be skipped) the loop appends. The result is that queries is populated entirely with queries containing the very constructs isValidQuery was written to exclude — guaranteed semantic divergence between v1.18.0 and HEAD.

Every observed failure of this test in the last ~18 days fits that pattern (failing queries all contain quantile, with one additionally containing atan2).

Bug 2 (latent, masks get series regressions): at integration/query_fuzz_test.go:654, the get series branch reads cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer) — comparing the same field to itself. The result is always trivially equal, so this branch can never report a Series API mismatch. Not currently producing visible failures (the bug means real divergences go undetected, not falsely reported), but should be fixed in the same PR while we're in the function.

Most recent occurrence

Failure excerpt

    query_fuzz_test.go:652: case 1 results mismatch.
        range query: quantile without (test_label) (
          scalar(
            max by (test_label) (day_of_month({__name__="test_series_9",test_label="test_label_value_0"}))
          ),
          sort({__name__="test_series_5"})
        )
        res1: {k="0"} =>
        NaN @[1779223886.476]
        NaN @[1779223916.476]
        … (58 × NaN total) …
        res2: {k="0"} =>
        50 @[1779223886.476]
        51 @[1779223916.476]
        52 @[1779223946.476]
        … 53, 54, …, 59, 59, 59, 59, 59, 59, 59, 59, 59 …
        NaN @[1779224456.476]
        … (rest NaN past data range) …
        Error: finished query fuzzing tests
        Test:  TestExpandedPostingsCacheFuzz
--- FAIL: TestExpandedPostingsCacheFuzz (8.84s)

res1 (cortex-1 = v1.18.0) returns all-NaN; res2 (HEAD) returns the actual test_series_5 values 50–59 (then NaN past the data's time range). The φ argument to quantile is a day_of_month value (1–31), outside [0,1]; out-of-range-φ behavior in quantile differs between the Prometheus versions embedded in v1.18.0 vs HEAD. This is exactly the case isValidQuery(expr, true) was meant to filter (integration/query_fuzz_test.go:2009).

Empirical flake rate (last 18 days, ~245 PR-CI runs sampled)

Metric Count
TestExpandedPostingsCacheFuzz failures (PR + master) 3
integration_query_fuzz job failures (any test) ~16
TestExpandedPostingsCacheFuzz failures on master CI 0

~1.2 % of sampled PR-CI runs that exercise this job. Arch split for this test: 2/3 arm64, 1/3 amd64.

Sample prior failures

All three failing queries contain quantile (and one additionally contains atan2), both of which isValidQuery(expr, true) is supposed to filter (integration/query_fuzz_test.go:2005-2017).

Root cause

  1. integration/query_fuzz_test.go:552-563 runs a bounded for i := 0; i < testRun; i++ loop that walks random range queries. Its acceptance check is if isValidQuery(expr, true) { break } — it breaks the loop the first time a valid query is generated (and does not append it), and otherwise appends the invalid query to queries.
  2. Compare with the canonical pattern in this same file: TestDisableChunkTrimmingFuzz at lines 383-390 and runQueryFuzzTestCases at lines 1909-1915 both use an unbounded inner for { … if isValidQuery(...) { break } } wrapped in a counted outer loop. Those correctly retain only queries that pass the filter.
  3. Because the condition is inverted, queries is populated entirely with strings containing quantile, stddev, stdvar, predict_linear, atan2, limitk, limit_ratio, or --. These produce legitimately different results between the older Prometheus in cortex-1 (v1.18.0) and the HEAD Prometheus in cortex-2.
  4. The non-determinism that makes this a flake rather than a constant failure is promqlsmith's seeding with rand.New(rand.NewSource(now.Unix())) at line 541 — different test invocations land on different random first-queries; sometimes the seed happens to yield queries that evaluate identically (e.g. all-NaN on both, or all-empty matrices) and the test passes despite the bug.
  5. Secondary bug at line 654: cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer) should be cmp.Equal(tc.sres1, tc.sres2, labelSetsComparer). The get series branch currently never reports mismatches — a silent gap, not the cause of the observed flakes but worth fixing in the same PR.
  6. The data-freshness hypothesis floated in an earlier triage is not supported by the evidence: Push is synchronous (distributor.replication-factor=1, no async path), both instances receive the same push before any query, and the failing series (test_series_5) is fully present on res2. The divergence is in quantile semantics, not ingestion lag.

Proposed fix

Two adjacent edits, single PR:

Edit 1 — replace the bounded loop at integration/query_fuzz_test.go:552-563:

testRun := 300
queries := make([]string, 0, testRun)
matchers := make([]string, 0, testRun)
for i := 0; i < testRun; i++ {
    var expr parser.Expr
    for {
        expr = ps.WalkRangeQuery()
        if isValidQuery(expr, true) {
            break
        }
    }
    queries = append(queries, expr.Pretty(0))
    matchers = append(matchers, storepb.PromMatchersToString(
        append(
            ps.WalkSelectors(),
            labels.MustNewMatcher(labels.MatchEqual, "__name__", fmt.Sprintf("test_series_%d", i%numSeries)),
        )...))
}

This matches the pattern at integration/query_fuzz_test.go:383-390 and 1909-1915.

Edit 2 — fix the self-comparison at line 654:

} else if !cmp.Equal(tc.sres1, tc.sres2, labelSetsComparer) {

Why not other approaches

  • Add quantile / atan2 etc. to promqlsmith.WithEnabledAggrs / WithEnabledFunctions exclusions — partial fix only. atan2 is a binary operator, predict_linear is a function, the -- case is a parser quirk; isValidQuery's string filter is the existing single source of truth and is already correct — only the surrounding call site is wrong. Touching promqlsmith options would also lose coverage of these in non-cross-version tests where they're safe.
  • Bump cortex-1 to a newer stable image — defeats the point (the test exists to compare against a stable older Cortex) and will break again the next time upstream changes any of these semantics.
  • Add a NaN-tolerant output comparator — would silence real cross-version regressions in addition to the known-incompat queries.
  • Add a synchronization barrier before queries (the earlier freshness hypothesis) — not the actual cause; Push is synchronous and both backends have the data at query time.

Acceptance criteria

  • After fix, every entry in queries passes isValidQuery(expr, true); len(queries) == 300.
  • No TestExpandedPostingsCacheFuzz failure across a representative local-run sample on each of arm64 and amd64 (given the ~1 % rate, a substantial number of iterations is required for statistical confidence).
  • Coverage breadth preserved: the random generator still emits queries using enabledAggrs (SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE — the aggregator form, kept by enabledAggrs; the isValidQuery filter only drops the function form of quantile via the string check at line 2009, which is unchanged).
  • tc.sres1 vs tc.sres2 comparison wired correctly — verifiable with a unit-style assertion or by injecting a known divergence in a follow-up sanity-check commit.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions