Flaky test: TestPrometheusCompatibilityQueryFuzz / TestExperimentalPromQLFuncsWithPrometheus — error-string comparator sensitive to non-deterministic series-list ordering

## Summary

Both `TestPrometheusCompatibilityQueryFuzz` (`integration/query_fuzz_test.go:1604-1703`) and `TestExperimentalPromQLFuncsWithPrometheus` (`integration/query_fuzz_test.go:163-266`) execute random PromQL via `runQueryFuzzTestCases` against two engines (Cortex vs. upstream Prometheus) and compare results. When a generated query triggers a vector-matching duplicate (e.g. many-to-many on a `bool` binop with a `group by (...)` or `or` subtree), both engines correctly return the **same** error type and the **same** logical message — but the trailing `[X, Y]` series-list inside the message renders in an order that depends on Go map iteration. The comparator at `integration/query_fuzz_test.go:1957` is `cmp.Equal(tc.err1, tc.err2)` (an exact-string equality on `*errors.errorString`), so any byte-level divergence — including a reordered two-element list inside the message — flips the case to `error mismatch` and fails the test.

This is a test-side issue: both engines are doing the right thing; the comparator is too strict.

## Most recent occurrence

- Run: https://github.com/cortexproject/cortex/actions/runs/26005855514/job/76437467814
- Date: 2026-05-17T23:40:54Z
- Arch: `ubuntu-24.04` amd64, build tag `integration_query_fuzz`
- Test: `TestPrometheusCompatibilityQueryFuzz`
- PR #7515 changes (memberlist TCP limits) are unrelated to the failing path.

## Failure excerpt

```
    query_fuzz_test.go:1958: case 1457 error mismatch.
        range query: (
            {__name__="test_series_b"}
          <= bool
            group by (series, __name__) (
              (
                  label_replace({__name__="test_series_a"}, "__promqlsmith_dst_label__", "$1", "status_code", "(.*)")
                or
                  -{__name__="test_series_b",status_code="404"}
              )
            )
        )
        err1: execution: found duplicate series for the match group {series="2"} on the right hand-side of the operation: [{series="2"}, {__name__="test_series_a", series="2"}];many-to-many matching not allowed: matching labels must be unique on one side
        err2: execution: found duplicate series for the match group {series="2"} on the right hand-side of the operation: [{__name__="test_series_a", series="2"}, {series="2"}];many-to-many matching not allowed: matching labels must be unique on one side
    query_fuzz_test.go:1972:
        Error:      finished query fuzzing tests
        Test:       TestPrometheusCompatibilityQueryFuzz
        Messages:   1 test cases failed
--- FAIL: TestPrometheusCompatibilityQueryFuzz (18.97s)
```

Same shape on other occurrences — only the two metric strings inside `[ , ]` swap places.

## Empirical flake rate (last 18 days, ~250 CI runs)

| Test | Confirmed failures (this 18-day window) |
|------|-----------------------------------------|
| `TestPrometheusCompatibilityQueryFuzz` | 3 |
| `TestExperimentalPromQLFuncsWithPrometheus` | 1 confirmed (a second hit was reported by a prior investigation but I could not re-verify it in the logs available to this scan) |
| Combined | ~4 PR-CI hits |

All confirmed failures are the same "[X, Y]" vs "[Y, X]" inside the `duplicate series for the match group …` error. None observed on master in this window.

## Sample prior failures

- `TestPrometheusCompatibilityQueryFuzz`, amd64, 2026-05-17: https://github.com/cortexproject/cortex/actions/runs/26005855514/job/76437467814
- `TestPrometheusCompatibilityQueryFuzz`, amd64, 2026-05-08: https://github.com/cortexproject/cortex/actions/runs/25541008715/job/74967405874 (`err1: [...test_series_a, ...test_series_b]` vs `err2: [...test_series_b, ...test_series_a]`)
- `TestPrometheusCompatibilityQueryFuzz`, amd64, 2026-05-06: https://github.com/cortexproject/cortex/actions/runs/25416189989/job/74548551715
- `TestExperimentalPromQLFuncsWithPrometheus`, arm64, 2026-05-13: https://github.com/cortexproject/cortex/actions/runs/25817703267/job/75851455288

## Root cause

1. The comparator at `integration/query_fuzz_test.go:1957` is `if !cmp.Equal(tc.err1, tc.err2)`. For plain `*errors.errorString` produced by the HTTP client, `cmp.Equal` reduces to exact string equality on the error message.
2. When a generated query exercises many-to-one (e.g. `<= bool group by (...) (...)`), the Prometheus engine reaches `vendor/github.com/prometheus/prometheus/promql/engine.go:2830-2831`:
   ```go
   ev.errorf("found duplicate series for the match group %s on the %s hand-side of the operation: [%s, %s]"+
     ";many-to-many matching not allowed: matching labels must be unique on one side",
     matchedLabels.String(), oneSide, rs.Metric.String(), duplSample.Metric.String())
   ```
   Here `duplSample` is whichever entry already sits in the `rightSigs map[string]Sample` (engine.go:2809), and `rs` is the second-arrived sample being iterated from the `rhs` Vector slice.
3. `rhs` is gathered from the RHS sub-expression's output `Matrix`. For aggregations the output matrix is built by iterating the internal `seriess map` (`engine.go:1396-1399` for `rangeEval`, `engine.go:1525-1527` for `rangeEvalAgg`'s topk/bottomk/limitk branch). Map iteration in Go is randomized per-process. The two engines are two distinct processes (Cortex container vs. Prometheus container), so their map seeds differ → whichever series lands "first" in `rightSigs` differs → the order of `[%s, %s]` in the message differs.
4. The same root cause applies to both tests: both call `runQueryFuzzTestCases` and Cortex vs. Prometheus paths both delegate to the upstream Prometheus engine for this branch. The only difference is `TestExperimentalPromQLFuncsWithPrometheus` enables `promql-experimental-functions`; that flag has no bearing on the binop error path.

### Uncertainty

The order was traced back to one map iteration (`seriess` → output `Matrix`) inside the **RHS** sub-expression's evaluator. There may be additional contributing map iterations upstream of that (e.g. inside select / chunkSeries merge), but the `seriess` map alone is sufficient to produce the observed flake. The downstream message order in `engine.go:2830` itself is deterministic given a fixed input slice order — the randomness is entirely upstream, in how the RHS Matrix is assembled.

## Proposed fix

**Compare error types/codes rather than message bytes.** In `runQueryFuzzTestCases` (`integration/query_fuzz_test.go:1956-1960`), replace the exact `cmp.Equal(tc.err1, tc.err2)` with a tolerant comparator:

```go
if tc.err1 != nil || tc.err2 != nil {
    if !sameErrorClass(tc.err1, tc.err2) {
        t.Logf(...) ; failures++
    }
}
```

where `sameErrorClass` either:
- (a) checks both are non-nil and their HTTP API error type/status codes match (the Cortex/Prom HTTP clients already return typed errors with a Type field like `execution`), or
- (b) parses the message and canonicalizes the `[…, …]` segment by sorting the comma-separated label strings before comparing.

Option (a) is preferred — it's a few lines, doesn't depend on Prometheus's exact wording, and survives any future error-message rewording. Option (b) is acceptable as a regex-based normalization scoped to this one error string if (a) proves to expose other categories of mismatch.

## Why not other approaches

- **Skip queries that produce vector-matching errors**: Would require pre-classifying the generator's output, which `isValidQuery` (`integration/query_fuzz_test.go:1983`) currently does only via string-contains. We'd need to either re-run the query and inspect the error before deciding to count it (chicken-and-egg) or block large query families (e.g. all `bool` with set ops), which would drop useful coverage. Not worth it when the comparator fix is local.
- **Sort the label sets in the upstream Prometheus error message**: Would require a Prometheus PR plus a vendor bump. Higher coordination cost, longer feedback loop, and would not help us until adopted. We could carry the change in `vendor/` but modifying vendored code is against the project's convention (`CLAUDE.md`).
- **Set `GODEBUG=randmapiter=0`**: That toggle doesn't exist; map iteration randomization is unconditional since Go 1.0. Even if it did, the two engines run in separate containers so they'd still diverge across runs.
- **Re-seed both engines deterministically and run them in-process**: Out of scope; the test deliberately runs Cortex in Docker against an external Prometheus binary.

## Acceptance criteria

- [ ] No `error mismatch` failures of the `duplicate series for the match group … [X, Y]` shape in `TestPrometheusCompatibilityQueryFuzz` or `TestExperimentalPromQLFuncsWithPrometheus` across a representative local-run sample.
- [ ] When both engines genuinely return **different** error types (e.g. one nil + one non-nil; or different Prometheus error categories like `execution` vs `bad_data`), the test still fails — i.e. the comparator relaxation does not mask real divergences. Add a unit test for the comparator covering both same-class-different-message and different-class-different-message inputs.
- [ ] No change required in `vendor/github.com/prometheus/prometheus/promql/engine.go`.
- [ ] CHANGELOG entry omitted (test-only change, not user-facing).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky test: TestPrometheusCompatibilityQueryFuzz / TestExperimentalPromQLFuncsWithPrometheus — error-string comparator sensitive to non-deterministic series-list ordering #7546

Summary

Most recent occurrence

Failure excerpt

Empirical flake rate (last 18 days, ~250 CI runs)

Sample prior failures

Root cause

Uncertainty

Proposed fix

Why not other approaches

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Test	Confirmed failures (this 18-day window)
`TestPrometheusCompatibilityQueryFuzz`	3
`TestExperimentalPromQLFuncsWithPrometheus`	1 confirmed (a second hit was reported by a prior investigation but I could not re-verify it in the logs available to this scan)
Combined	~4 PR-CI hits

Flaky test: TestPrometheusCompatibilityQueryFuzz / TestExperimentalPromQLFuncsWithPrometheus — error-string comparator sensitive to non-deterministic series-list ordering #7546

Description

Summary

Most recent occurrence

Failure excerpt

Empirical flake rate (last 18 days, ~250 CI runs)

Sample prior failures

Root cause

Uncertainty

Proposed fix

Why not other approaches

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions