Flaky test: TestExpandedPostingsCacheFuzz — inverted isValidQuery filter lets backward-incompat queries (quantile, atan2, …) through

## Summary

`TestExpandedPostingsCacheFuzz` is a cross-version compatibility test: `cortex-1` runs `quay.io/cortexproject/cortex:v1.18.0`, `cortex-2` runs the current build (with expanded-postings cache enabled). Both instances receive the same `remote_write` pushes; the test compares query results between them.

Two latent bugs in the test body cause it to flake:

**Bug 1 (primary, causes the visible failures):** the query-generation loop at `integration/query_fuzz_test.go:552-563` filters generated PromQL with `isValidQuery(expr, true)`, which is intended to drop queries whose semantics changed across the Prometheus versions embedded in different Cortex releases (`stddev`, `stdvar`, `quantile`, `predict_linear`, `atan2`, plus `limitk`, `limit_ratio`, `--`). However the condition is **inverted**: when `isValidQuery` returns `true` (i.e. the query is safe to run) the loop `break`s without appending; when it returns `false` (i.e. the query *should be skipped*) the loop appends. The result is that `queries` is populated entirely with queries containing the very constructs `isValidQuery` was written to exclude — guaranteed semantic divergence between v1.18.0 and HEAD.

Every observed failure of this test in the last ~18 days fits that pattern (failing queries all contain `quantile`, with one additionally containing `atan2`).

**Bug 2 (latent, masks `get series` regressions):** at `integration/query_fuzz_test.go:654`, the `get series` branch reads `cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer)` — comparing the same field to itself. The result is always trivially equal, so this branch can never report a Series API mismatch. Not currently producing visible failures (the bug means real divergences go undetected, not falsely reported), but should be fixed in the same PR while we're in the function.

## Most recent occurrence

- Run: https://github.com/cortexproject/cortex/actions/runs/26188714336/job/77052093282
- Date: 2026-05-20T20:53:00Z
- Arch: `ubuntu-24.04-arm`, `arm64`, build tag `integration_query_fuzz`
- PR: branch `querier-grpc` (changes unrelated to this path)

## Failure excerpt

```
    query_fuzz_test.go:652: case 1 results mismatch.
        range query: quantile without (test_label) (
          scalar(
            max by (test_label) (day_of_month({__name__="test_series_9",test_label="test_label_value_0"}))
          ),
          sort({__name__="test_series_5"})
        )
        res1: {k="0"} =>
        NaN @[1779223886.476]
        NaN @[1779223916.476]
        … (58 × NaN total) …
        res2: {k="0"} =>
        50 @[1779223886.476]
        51 @[1779223916.476]
        52 @[1779223946.476]
        … 53, 54, …, 59, 59, 59, 59, 59, 59, 59, 59, 59 …
        NaN @[1779224456.476]
        … (rest NaN past data range) …
        Error: finished query fuzzing tests
        Test:  TestExpandedPostingsCacheFuzz
--- FAIL: TestExpandedPostingsCacheFuzz (8.84s)
```

`res1` (cortex-1 = v1.18.0) returns all-NaN; `res2` (HEAD) returns the actual `test_series_5` values 50–59 (then NaN past the data's time range). The φ argument to `quantile` is a `day_of_month` value (1–31), outside `[0,1]`; out-of-range-φ behavior in `quantile` differs between the Prometheus versions embedded in v1.18.0 vs HEAD. This is exactly the case `isValidQuery(expr, true)` was meant to filter (`integration/query_fuzz_test.go:2009`).

## Empirical flake rate (last 18 days, ~245 PR-CI runs sampled)

| Metric | Count |
|---|---|
| `TestExpandedPostingsCacheFuzz` failures (PR + master) | 3 |
| `integration_query_fuzz` job failures (any test) | ~16 |
| `TestExpandedPostingsCacheFuzz` failures on master CI | 0 |

~1.2 % of sampled PR-CI runs that exercise this job. Arch split for this test: 2/3 arm64, 1/3 amd64.

## Sample prior failures

- arm64, 2026-05-20: https://github.com/cortexproject/cortex/actions/runs/26188714336 — query: `quantile without (test_label)(scalar(max by (test_label)(day_of_month(...))), sort({__name__="test_series_5"}))`
- amd64, 2026-05-07: https://github.com/cortexproject/cortex/actions/runs/25476314272 — query: `tanh(quantile without ()(scalar(... atan2 ...), {__name__="test_series_3"}))`
- arm64, 2026-05-05: https://github.com/cortexproject/cortex/actions/runs/25379792817 — query: `day_of_year(quantile by (__name__)(...))`

All three failing queries contain `quantile` (and one additionally contains `atan2`), both of which `isValidQuery(expr, true)` is supposed to filter (`integration/query_fuzz_test.go:2005-2017`).

## Root cause

1. `integration/query_fuzz_test.go:552-563` runs a *bounded* `for i := 0; i < testRun; i++` loop that walks random range queries. Its acceptance check is `if isValidQuery(expr, true) { break }` — it `break`s the loop the first time a *valid* query is generated (and does **not** append it), and otherwise `append`s the *invalid* query to `queries`.
2. Compare with the canonical pattern in this same file: `TestDisableChunkTrimmingFuzz` at lines 383-390 and `runQueryFuzzTestCases` at lines 1909-1915 both use an *unbounded* inner `for { … if isValidQuery(...) { break } }` wrapped in a counted outer loop. Those correctly retain only queries that pass the filter.
3. Because the condition is inverted, `queries` is populated entirely with strings containing `quantile`, `stddev`, `stdvar`, `predict_linear`, `atan2`, `limitk`, `limit_ratio`, or `--`. These produce legitimately different results between the older Prometheus in `cortex-1` (v1.18.0) and the HEAD Prometheus in `cortex-2`.
4. The non-determinism that makes this a *flake* rather than a constant failure is `promqlsmith`'s seeding with `rand.New(rand.NewSource(now.Unix()))` at line 541 — different test invocations land on different random first-queries; sometimes the seed happens to yield queries that evaluate identically (e.g. all-NaN on both, or all-empty matrices) and the test passes despite the bug.
5. **Secondary bug** at line 654: `cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer)` should be `cmp.Equal(tc.sres1, tc.sres2, labelSetsComparer)`. The `get series` branch currently never reports mismatches — a silent gap, not the cause of the observed flakes but worth fixing in the same PR.
6. The data-freshness hypothesis floated in an earlier triage is **not supported by the evidence**: `Push` is synchronous (`distributor.replication-factor=1`, no async path), both instances receive the same push before any query, and the failing series (`test_series_5`) is fully present on `res2`. The divergence is in `quantile` semantics, not ingestion lag.

## Proposed fix

Two adjacent edits, single PR:

**Edit 1** — replace the bounded loop at `integration/query_fuzz_test.go:552-563`:

```go
testRun := 300
queries := make([]string, 0, testRun)
matchers := make([]string, 0, testRun)
for i := 0; i < testRun; i++ {
    var expr parser.Expr
    for {
        expr = ps.WalkRangeQuery()
        if isValidQuery(expr, true) {
            break
        }
    }
    queries = append(queries, expr.Pretty(0))
    matchers = append(matchers, storepb.PromMatchersToString(
        append(
            ps.WalkSelectors(),
            labels.MustNewMatcher(labels.MatchEqual, "__name__", fmt.Sprintf("test_series_%d", i%numSeries)),
        )...))
}
```

This matches the pattern at `integration/query_fuzz_test.go:383-390` and `1909-1915`.

**Edit 2** — fix the self-comparison at line 654:

```go
} else if !cmp.Equal(tc.sres1, tc.sres2, labelSetsComparer) {
```

## Why not other approaches

- **Add `quantile` / `atan2` etc. to `promqlsmith.WithEnabledAggrs` / `WithEnabledFunctions` exclusions** — partial fix only. `atan2` is a binary operator, `predict_linear` is a function, the `--` case is a parser quirk; `isValidQuery`'s string filter is the existing single source of truth and is already correct — only the surrounding call site is wrong. Touching promqlsmith options would also lose coverage of these in non-cross-version tests where they're safe.
- **Bump `cortex-1` to a newer stable image** — defeats the point (the test exists to compare against a stable older Cortex) and will break again the next time upstream changes any of these semantics.
- **Add a NaN-tolerant output comparator** — would silence real cross-version regressions in addition to the known-incompat queries.
- **Add a synchronization barrier before queries** (the earlier freshness hypothesis) — not the actual cause; `Push` is synchronous and both backends have the data at query time.

## Acceptance criteria

- [ ] After fix, every entry in `queries` passes `isValidQuery(expr, true)`; `len(queries) == 300`.
- [ ] No `TestExpandedPostingsCacheFuzz` failure across a representative local-run sample on each of arm64 and amd64 (given the ~1 % rate, a substantial number of iterations is required for statistical confidence).
- [ ] Coverage breadth preserved: the random generator still emits queries using `enabledAggrs` (`SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE` — the *aggregator* form, kept by `enabledAggrs`; the `isValidQuery` filter only drops the *function* form of `quantile` via the string check at line 2009, which is unchanged).
- [ ] `tc.sres1` vs `tc.sres2` comparison wired correctly — verifiable with a unit-style assertion or by injecting a known divergence in a follow-up sanity-check commit.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky test: TestExpandedPostingsCacheFuzz — inverted isValidQuery filter lets backward-incompat queries (quantile, atan2, …) through #7545

Summary

Most recent occurrence

Failure excerpt

Empirical flake rate (last 18 days, ~245 PR-CI runs sampled)

Sample prior failures

Root cause

Proposed fix

Why not other approaches

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Metric	Count
`TestExpandedPostingsCacheFuzz` failures (PR + master)	3
`integration_query_fuzz` job failures (any test)	~16
`TestExpandedPostingsCacheFuzz` failures on master CI	0

Flaky test: TestExpandedPostingsCacheFuzz — inverted isValidQuery filter lets backward-incompat queries (quantile, atan2, …) through #7545

Description

Summary

Most recent occurrence

Failure excerpt

Empirical flake rate (last 18 days, ~245 PR-CI runs sampled)

Sample prior failures

Root cause

Proposed fix

Why not other approaches

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions