Summary
Both TestPrometheusCompatibilityQueryFuzz (integration/query_fuzz_test.go:1604-1703) and TestExperimentalPromQLFuncsWithPrometheus (integration/query_fuzz_test.go:163-266) execute random PromQL via runQueryFuzzTestCases against two engines (Cortex vs. upstream Prometheus) and compare results. When a generated query triggers a vector-matching duplicate (e.g. many-to-many on a bool binop with a group by (...) or or subtree), both engines correctly return the same error type and the same logical message — but the trailing [X, Y] series-list inside the message renders in an order that depends on Go map iteration. The comparator at integration/query_fuzz_test.go:1957 is cmp.Equal(tc.err1, tc.err2) (an exact-string equality on *errors.errorString), so any byte-level divergence — including a reordered two-element list inside the message — flips the case to error mismatch and fails the test.
This is a test-side issue: both engines are doing the right thing; the comparator is too strict.
Most recent occurrence
Failure excerpt
query_fuzz_test.go:1958: case 1457 error mismatch.
range query: (
{__name__="test_series_b"}
<= bool
group by (series, __name__) (
(
label_replace({__name__="test_series_a"}, "__promqlsmith_dst_label__", "$1", "status_code", "(.*)")
or
-{__name__="test_series_b",status_code="404"}
)
)
)
err1: execution: found duplicate series for the match group {series="2"} on the right hand-side of the operation: [{series="2"}, {__name__="test_series_a", series="2"}];many-to-many matching not allowed: matching labels must be unique on one side
err2: execution: found duplicate series for the match group {series="2"} on the right hand-side of the operation: [{__name__="test_series_a", series="2"}, {series="2"}];many-to-many matching not allowed: matching labels must be unique on one side
query_fuzz_test.go:1972:
Error: finished query fuzzing tests
Test: TestPrometheusCompatibilityQueryFuzz
Messages: 1 test cases failed
--- FAIL: TestPrometheusCompatibilityQueryFuzz (18.97s)
Same shape on other occurrences — only the two metric strings inside [ , ] swap places.
Empirical flake rate (last 18 days, ~250 CI runs)
| Test |
Confirmed failures (this 18-day window) |
TestPrometheusCompatibilityQueryFuzz |
3 |
TestExperimentalPromQLFuncsWithPrometheus |
1 confirmed (a second hit was reported by a prior investigation but I could not re-verify it in the logs available to this scan) |
| Combined |
~4 PR-CI hits |
All confirmed failures are the same "[X, Y]" vs "[Y, X]" inside the duplicate series for the match group … error. None observed on master in this window.
Sample prior failures
TestPrometheusCompatibilityQueryFuzz, amd64, 2026-05-17: https://github.com/cortexproject/cortex/actions/runs/26005855514/job/76437467814
TestPrometheusCompatibilityQueryFuzz, amd64, 2026-05-08: https://github.com/cortexproject/cortex/actions/runs/25541008715/job/74967405874 (err1: [...test_series_a, ...test_series_b] vs err2: [...test_series_b, ...test_series_a])
TestPrometheusCompatibilityQueryFuzz, amd64, 2026-05-06: https://github.com/cortexproject/cortex/actions/runs/25416189989/job/74548551715
TestExperimentalPromQLFuncsWithPrometheus, arm64, 2026-05-13: https://github.com/cortexproject/cortex/actions/runs/25817703267/job/75851455288
Root cause
- The comparator at
integration/query_fuzz_test.go:1957 is if !cmp.Equal(tc.err1, tc.err2). For plain *errors.errorString produced by the HTTP client, cmp.Equal reduces to exact string equality on the error message.
- When a generated query exercises many-to-one (e.g.
<= bool group by (...) (...)), the Prometheus engine reaches vendor/github.com/prometheus/prometheus/promql/engine.go:2830-2831:
ev.errorf("found duplicate series for the match group %s on the %s hand-side of the operation: [%s, %s]"+
";many-to-many matching not allowed: matching labels must be unique on one side",
matchedLabels.String(), oneSide, rs.Metric.String(), duplSample.Metric.String())
Here duplSample is whichever entry already sits in the rightSigs map[string]Sample (engine.go:2809), and rs is the second-arrived sample being iterated from the rhs Vector slice.
rhs is gathered from the RHS sub-expression's output Matrix. For aggregations the output matrix is built by iterating the internal seriess map (engine.go:1396-1399 for rangeEval, engine.go:1525-1527 for rangeEvalAgg's topk/bottomk/limitk branch). Map iteration in Go is randomized per-process. The two engines are two distinct processes (Cortex container vs. Prometheus container), so their map seeds differ → whichever series lands "first" in rightSigs differs → the order of [%s, %s] in the message differs.
- The same root cause applies to both tests: both call
runQueryFuzzTestCases and Cortex vs. Prometheus paths both delegate to the upstream Prometheus engine for this branch. The only difference is TestExperimentalPromQLFuncsWithPrometheus enables promql-experimental-functions; that flag has no bearing on the binop error path.
Uncertainty
The order was traced back to one map iteration (seriess → output Matrix) inside the RHS sub-expression's evaluator. There may be additional contributing map iterations upstream of that (e.g. inside select / chunkSeries merge), but the seriess map alone is sufficient to produce the observed flake. The downstream message order in engine.go:2830 itself is deterministic given a fixed input slice order — the randomness is entirely upstream, in how the RHS Matrix is assembled.
Proposed fix
Compare error types/codes rather than message bytes. In runQueryFuzzTestCases (integration/query_fuzz_test.go:1956-1960), replace the exact cmp.Equal(tc.err1, tc.err2) with a tolerant comparator:
if tc.err1 != nil || tc.err2 != nil {
if !sameErrorClass(tc.err1, tc.err2) {
t.Logf(...) ; failures++
}
}
where sameErrorClass either:
- (a) checks both are non-nil and their HTTP API error type/status codes match (the Cortex/Prom HTTP clients already return typed errors with a Type field like
execution), or
- (b) parses the message and canonicalizes the
[…, …] segment by sorting the comma-separated label strings before comparing.
Option (a) is preferred — it's a few lines, doesn't depend on Prometheus's exact wording, and survives any future error-message rewording. Option (b) is acceptable as a regex-based normalization scoped to this one error string if (a) proves to expose other categories of mismatch.
Why not other approaches
- Skip queries that produce vector-matching errors: Would require pre-classifying the generator's output, which
isValidQuery (integration/query_fuzz_test.go:1983) currently does only via string-contains. We'd need to either re-run the query and inspect the error before deciding to count it (chicken-and-egg) or block large query families (e.g. all bool with set ops), which would drop useful coverage. Not worth it when the comparator fix is local.
- Sort the label sets in the upstream Prometheus error message: Would require a Prometheus PR plus a vendor bump. Higher coordination cost, longer feedback loop, and would not help us until adopted. We could carry the change in
vendor/ but modifying vendored code is against the project's convention (CLAUDE.md).
- Set
GODEBUG=randmapiter=0: That toggle doesn't exist; map iteration randomization is unconditional since Go 1.0. Even if it did, the two engines run in separate containers so they'd still diverge across runs.
- Re-seed both engines deterministically and run them in-process: Out of scope; the test deliberately runs Cortex in Docker against an external Prometheus binary.
Acceptance criteria
Summary
Both
TestPrometheusCompatibilityQueryFuzz(integration/query_fuzz_test.go:1604-1703) andTestExperimentalPromQLFuncsWithPrometheus(integration/query_fuzz_test.go:163-266) execute random PromQL viarunQueryFuzzTestCasesagainst two engines (Cortex vs. upstream Prometheus) and compare results. When a generated query triggers a vector-matching duplicate (e.g. many-to-many on aboolbinop with agroup by (...)ororsubtree), both engines correctly return the same error type and the same logical message — but the trailing[X, Y]series-list inside the message renders in an order that depends on Go map iteration. The comparator atintegration/query_fuzz_test.go:1957iscmp.Equal(tc.err1, tc.err2)(an exact-string equality on*errors.errorString), so any byte-level divergence — including a reordered two-element list inside the message — flips the case toerror mismatchand fails the test.This is a test-side issue: both engines are doing the right thing; the comparator is too strict.
Most recent occurrence
ubuntu-24.04amd64, build tagintegration_query_fuzzTestPrometheusCompatibilityQueryFuzzFailure excerpt
Same shape on other occurrences — only the two metric strings inside
[ , ]swap places.Empirical flake rate (last 18 days, ~250 CI runs)
TestPrometheusCompatibilityQueryFuzzTestExperimentalPromQLFuncsWithPrometheusAll confirmed failures are the same "[X, Y]" vs "[Y, X]" inside the
duplicate series for the match group …error. None observed on master in this window.Sample prior failures
TestPrometheusCompatibilityQueryFuzz, amd64, 2026-05-17: https://github.com/cortexproject/cortex/actions/runs/26005855514/job/76437467814TestPrometheusCompatibilityQueryFuzz, amd64, 2026-05-08: https://github.com/cortexproject/cortex/actions/runs/25541008715/job/74967405874 (err1: [...test_series_a, ...test_series_b]vserr2: [...test_series_b, ...test_series_a])TestPrometheusCompatibilityQueryFuzz, amd64, 2026-05-06: https://github.com/cortexproject/cortex/actions/runs/25416189989/job/74548551715TestExperimentalPromQLFuncsWithPrometheus, arm64, 2026-05-13: https://github.com/cortexproject/cortex/actions/runs/25817703267/job/75851455288Root cause
integration/query_fuzz_test.go:1957isif !cmp.Equal(tc.err1, tc.err2). For plain*errors.errorStringproduced by the HTTP client,cmp.Equalreduces to exact string equality on the error message.<= bool group by (...) (...)), the Prometheus engine reachesvendor/github.com/prometheus/prometheus/promql/engine.go:2830-2831:duplSampleis whichever entry already sits in therightSigs map[string]Sample(engine.go:2809), andrsis the second-arrived sample being iterated from therhsVector slice.rhsis gathered from the RHS sub-expression's outputMatrix. For aggregations the output matrix is built by iterating the internalseriess map(engine.go:1396-1399forrangeEval,engine.go:1525-1527forrangeEvalAgg's topk/bottomk/limitk branch). Map iteration in Go is randomized per-process. The two engines are two distinct processes (Cortex container vs. Prometheus container), so their map seeds differ → whichever series lands "first" inrightSigsdiffers → the order of[%s, %s]in the message differs.runQueryFuzzTestCasesand Cortex vs. Prometheus paths both delegate to the upstream Prometheus engine for this branch. The only difference isTestExperimentalPromQLFuncsWithPrometheusenablespromql-experimental-functions; that flag has no bearing on the binop error path.Uncertainty
The order was traced back to one map iteration (
seriess→ outputMatrix) inside the RHS sub-expression's evaluator. There may be additional contributing map iterations upstream of that (e.g. inside select / chunkSeries merge), but theseriessmap alone is sufficient to produce the observed flake. The downstream message order inengine.go:2830itself is deterministic given a fixed input slice order — the randomness is entirely upstream, in how the RHS Matrix is assembled.Proposed fix
Compare error types/codes rather than message bytes. In
runQueryFuzzTestCases(integration/query_fuzz_test.go:1956-1960), replace the exactcmp.Equal(tc.err1, tc.err2)with a tolerant comparator:where
sameErrorClasseither:execution), or[…, …]segment by sorting the comma-separated label strings before comparing.Option (a) is preferred — it's a few lines, doesn't depend on Prometheus's exact wording, and survives any future error-message rewording. Option (b) is acceptable as a regex-based normalization scoped to this one error string if (a) proves to expose other categories of mismatch.
Why not other approaches
isValidQuery(integration/query_fuzz_test.go:1983) currently does only via string-contains. We'd need to either re-run the query and inspect the error before deciding to count it (chicken-and-egg) or block large query families (e.g. allboolwith set ops), which would drop useful coverage. Not worth it when the comparator fix is local.vendor/but modifying vendored code is against the project's convention (CLAUDE.md).GODEBUG=randmapiter=0: That toggle doesn't exist; map iteration randomization is unconditional since Go 1.0. Even if it did, the two engines run in separate containers so they'd still diverge across runs.Acceptance criteria
error mismatchfailures of theduplicate series for the match group … [X, Y]shape inTestPrometheusCompatibilityQueryFuzzorTestExperimentalPromQLFuncsWithPrometheusacross a representative local-run sample.executionvsbad_data), the test still fails — i.e. the comparator relaxation does not mask real divergences. Add a unit test for the comparator covering both same-class-different-message and different-class-different-message inputs.vendor/github.com/prometheus/prometheus/promql/engine.go.