You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
TestExpandedPostingsCacheFuzz is a cross-version compatibility test: cortex-1 runs quay.io/cortexproject/cortex:v1.18.0, cortex-2 runs the current build (with expanded-postings cache enabled). Both instances receive the same remote_write pushes; the test compares query results between them.
Two latent bugs in the test body cause it to flake:
Bug 1 (primary, causes the visible failures): the query-generation loop at integration/query_fuzz_test.go:552-563 filters generated PromQL with isValidQuery(expr, true), which is intended to drop queries whose semantics changed across the Prometheus versions embedded in different Cortex releases (stddev, stdvar, quantile, predict_linear, atan2, plus limitk, limit_ratio, --). However the condition is inverted: when isValidQuery returns true (i.e. the query is safe to run) the loop breaks without appending; when it returns false (i.e. the query should be skipped) the loop appends. The result is that queries is populated entirely with queries containing the very constructs isValidQuery was written to exclude — guaranteed semantic divergence between v1.18.0 and HEAD.
Every observed failure of this test in the last ~18 days fits that pattern (failing queries all contain quantile, with one additionally containing atan2).
Bug 2 (latent, masks get series regressions): at integration/query_fuzz_test.go:654, the get series branch reads cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer) — comparing the same field to itself. The result is always trivially equal, so this branch can never report a Series API mismatch. Not currently producing visible failures (the bug means real divergences go undetected, not falsely reported), but should be fixed in the same PR while we're in the function.
Arch: ubuntu-24.04-arm, arm64, build tag integration_query_fuzz
PR: branch querier-grpc (changes unrelated to this path)
Failure excerpt
query_fuzz_test.go:652: case 1 results mismatch.
range query: quantile without (test_label) (
scalar(
max by (test_label) (day_of_month({__name__="test_series_9",test_label="test_label_value_0"}))
),
sort({__name__="test_series_5"})
)
res1: {k="0"} =>
NaN @[1779223886.476]
NaN @[1779223916.476]
… (58 × NaN total) …
res2: {k="0"} =>
50 @[1779223886.476]
51 @[1779223916.476]
52 @[1779223946.476]
… 53, 54, …, 59, 59, 59, 59, 59, 59, 59, 59, 59 …
NaN @[1779224456.476]
… (rest NaN past data range) …
Error: finished query fuzzing tests
Test: TestExpandedPostingsCacheFuzz
--- FAIL: TestExpandedPostingsCacheFuzz (8.84s)
res1 (cortex-1 = v1.18.0) returns all-NaN; res2 (HEAD) returns the actual test_series_5 values 50–59 (then NaN past the data's time range). The φ argument to quantile is a day_of_month value (1–31), outside [0,1]; out-of-range-φ behavior in quantile differs between the Prometheus versions embedded in v1.18.0 vs HEAD. This is exactly the case isValidQuery(expr, true) was meant to filter (integration/query_fuzz_test.go:2009).
All three failing queries contain quantile (and one additionally contains atan2), both of which isValidQuery(expr, true) is supposed to filter (integration/query_fuzz_test.go:2005-2017).
Root cause
integration/query_fuzz_test.go:552-563 runs a boundedfor i := 0; i < testRun; i++ loop that walks random range queries. Its acceptance check is if isValidQuery(expr, true) { break } — it breaks the loop the first time a valid query is generated (and does not append it), and otherwise appends the invalid query to queries.
Compare with the canonical pattern in this same file: TestDisableChunkTrimmingFuzz at lines 383-390 and runQueryFuzzTestCases at lines 1909-1915 both use an unbounded inner for { … if isValidQuery(...) { break } } wrapped in a counted outer loop. Those correctly retain only queries that pass the filter.
Because the condition is inverted, queries is populated entirely with strings containing quantile, stddev, stdvar, predict_linear, atan2, limitk, limit_ratio, or --. These produce legitimately different results between the older Prometheus in cortex-1 (v1.18.0) and the HEAD Prometheus in cortex-2.
The non-determinism that makes this a flake rather than a constant failure is promqlsmith's seeding with rand.New(rand.NewSource(now.Unix())) at line 541 — different test invocations land on different random first-queries; sometimes the seed happens to yield queries that evaluate identically (e.g. all-NaN on both, or all-empty matrices) and the test passes despite the bug.
Secondary bug at line 654: cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer) should be cmp.Equal(tc.sres1, tc.sres2, labelSetsComparer). The get series branch currently never reports mismatches — a silent gap, not the cause of the observed flakes but worth fixing in the same PR.
The data-freshness hypothesis floated in an earlier triage is not supported by the evidence: Push is synchronous (distributor.replication-factor=1, no async path), both instances receive the same push before any query, and the failing series (test_series_5) is fully present on res2. The divergence is in quantile semantics, not ingestion lag.
Proposed fix
Two adjacent edits, single PR:
Edit 1 — replace the bounded loop at integration/query_fuzz_test.go:552-563:
Add quantile / atan2 etc. to promqlsmith.WithEnabledAggrs / WithEnabledFunctions exclusions — partial fix only. atan2 is a binary operator, predict_linear is a function, the -- case is a parser quirk; isValidQuery's string filter is the existing single source of truth and is already correct — only the surrounding call site is wrong. Touching promqlsmith options would also lose coverage of these in non-cross-version tests where they're safe.
Bump cortex-1 to a newer stable image — defeats the point (the test exists to compare against a stable older Cortex) and will break again the next time upstream changes any of these semantics.
Add a NaN-tolerant output comparator — would silence real cross-version regressions in addition to the known-incompat queries.
Add a synchronization barrier before queries (the earlier freshness hypothesis) — not the actual cause; Push is synchronous and both backends have the data at query time.
Acceptance criteria
After fix, every entry in queries passes isValidQuery(expr, true); len(queries) == 300.
No TestExpandedPostingsCacheFuzz failure across a representative local-run sample on each of arm64 and amd64 (given the ~1 % rate, a substantial number of iterations is required for statistical confidence).
Coverage breadth preserved: the random generator still emits queries using enabledAggrs (SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE — the aggregator form, kept by enabledAggrs; the isValidQuery filter only drops the function form of quantile via the string check at line 2009, which is unchanged).
tc.sres1 vs tc.sres2 comparison wired correctly — verifiable with a unit-style assertion or by injecting a known divergence in a follow-up sanity-check commit.
Summary
TestExpandedPostingsCacheFuzzis a cross-version compatibility test:cortex-1runsquay.io/cortexproject/cortex:v1.18.0,cortex-2runs the current build (with expanded-postings cache enabled). Both instances receive the sameremote_writepushes; the test compares query results between them.Two latent bugs in the test body cause it to flake:
Bug 1 (primary, causes the visible failures): the query-generation loop at
integration/query_fuzz_test.go:552-563filters generated PromQL withisValidQuery(expr, true), which is intended to drop queries whose semantics changed across the Prometheus versions embedded in different Cortex releases (stddev,stdvar,quantile,predict_linear,atan2, pluslimitk,limit_ratio,--). However the condition is inverted: whenisValidQueryreturnstrue(i.e. the query is safe to run) the loopbreaks without appending; when it returnsfalse(i.e. the query should be skipped) the loop appends. The result is thatqueriesis populated entirely with queries containing the very constructsisValidQuerywas written to exclude — guaranteed semantic divergence between v1.18.0 and HEAD.Every observed failure of this test in the last ~18 days fits that pattern (failing queries all contain
quantile, with one additionally containingatan2).Bug 2 (latent, masks
get seriesregressions): atintegration/query_fuzz_test.go:654, theget seriesbranch readscmp.Equal(tc.sres1, tc.sres1, labelSetsComparer)— comparing the same field to itself. The result is always trivially equal, so this branch can never report a Series API mismatch. Not currently producing visible failures (the bug means real divergences go undetected, not falsely reported), but should be fixed in the same PR while we're in the function.Most recent occurrence
ubuntu-24.04-arm,arm64, build tagintegration_query_fuzzquerier-grpc(changes unrelated to this path)Failure excerpt
res1(cortex-1 = v1.18.0) returns all-NaN;res2(HEAD) returns the actualtest_series_5values 50–59 (then NaN past the data's time range). The φ argument toquantileis aday_of_monthvalue (1–31), outside[0,1]; out-of-range-φ behavior inquantilediffers between the Prometheus versions embedded in v1.18.0 vs HEAD. This is exactly the caseisValidQuery(expr, true)was meant to filter (integration/query_fuzz_test.go:2009).Empirical flake rate (last 18 days, ~245 PR-CI runs sampled)
TestExpandedPostingsCacheFuzzfailures (PR + master)integration_query_fuzzjob failures (any test)TestExpandedPostingsCacheFuzzfailures on master CI~1.2 % of sampled PR-CI runs that exercise this job. Arch split for this test: 2/3 arm64, 1/3 amd64.
Sample prior failures
quantile without (test_label)(scalar(max by (test_label)(day_of_month(...))), sort({__name__="test_series_5"}))tanh(quantile without ()(scalar(... atan2 ...), {__name__="test_series_3"}))day_of_year(quantile by (__name__)(...))All three failing queries contain
quantile(and one additionally containsatan2), both of whichisValidQuery(expr, true)is supposed to filter (integration/query_fuzz_test.go:2005-2017).Root cause
integration/query_fuzz_test.go:552-563runs a boundedfor i := 0; i < testRun; i++loop that walks random range queries. Its acceptance check isif isValidQuery(expr, true) { break }— itbreaks the loop the first time a valid query is generated (and does not append it), and otherwiseappends the invalid query toqueries.TestDisableChunkTrimmingFuzzat lines 383-390 andrunQueryFuzzTestCasesat lines 1909-1915 both use an unbounded innerfor { … if isValidQuery(...) { break } }wrapped in a counted outer loop. Those correctly retain only queries that pass the filter.queriesis populated entirely with strings containingquantile,stddev,stdvar,predict_linear,atan2,limitk,limit_ratio, or--. These produce legitimately different results between the older Prometheus incortex-1(v1.18.0) and the HEAD Prometheus incortex-2.promqlsmith's seeding withrand.New(rand.NewSource(now.Unix()))at line 541 — different test invocations land on different random first-queries; sometimes the seed happens to yield queries that evaluate identically (e.g. all-NaN on both, or all-empty matrices) and the test passes despite the bug.cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer)should becmp.Equal(tc.sres1, tc.sres2, labelSetsComparer). Theget seriesbranch currently never reports mismatches — a silent gap, not the cause of the observed flakes but worth fixing in the same PR.Pushis synchronous (distributor.replication-factor=1, no async path), both instances receive the same push before any query, and the failing series (test_series_5) is fully present onres2. The divergence is inquantilesemantics, not ingestion lag.Proposed fix
Two adjacent edits, single PR:
Edit 1 — replace the bounded loop at
integration/query_fuzz_test.go:552-563:This matches the pattern at
integration/query_fuzz_test.go:383-390and1909-1915.Edit 2 — fix the self-comparison at line 654:
Why not other approaches
quantile/atan2etc. topromqlsmith.WithEnabledAggrs/WithEnabledFunctionsexclusions — partial fix only.atan2is a binary operator,predict_linearis a function, the--case is a parser quirk;isValidQuery's string filter is the existing single source of truth and is already correct — only the surrounding call site is wrong. Touching promqlsmith options would also lose coverage of these in non-cross-version tests where they're safe.cortex-1to a newer stable image — defeats the point (the test exists to compare against a stable older Cortex) and will break again the next time upstream changes any of these semantics.Pushis synchronous and both backends have the data at query time.Acceptance criteria
queriespassesisValidQuery(expr, true);len(queries) == 300.TestExpandedPostingsCacheFuzzfailure across a representative local-run sample on each of arm64 and amd64 (given the ~1 % rate, a substantial number of iterations is required for statistical confidence).enabledAggrs(SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE— the aggregator form, kept byenabledAggrs; theisValidQueryfilter only drops the function form ofquantilevia the string check at line 2009, which is unchanged).tc.sres1vstc.sres2comparison wired correctly — verifiable with a unit-style assertion or by injecting a known divergence in a follow-up sanity-check commit.