Skip to content

fix(queryeviction): account evictions before cancelling the victim query#7616

Open
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix/queryeviction-eviction-metric-ordering
Open

fix(queryeviction): account evictions before cancelling the victim query#7616
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix/queryeviction-eviction-metric-ordering

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does:

Fixes the flaky TestPrometheusMetrics_IncrementedCorrectly in pkg/util/queryeviction (issue #7604, CI failure expected: 3, actual: 2) at its root: the query evictor performed its bookkeeping after the externally observable eviction signal. Two severable commits reorder the eviction loop so that all bookkeeping — the cortex_query_evictions_total increment and the registry removal — happens before victim.Cancel().

Why it flakes (mechanism)

The eviction loop in QueryEvictor.running() (pkg/util/queryeviction/evictor.go) did:

victim.Cancel()  →  warn log  →  evictionsTotal.Inc()

Cancel() is the externally observable commit point of an eviction: the test helper's cancel callback closes the per-query evicted channel, waitEvicted returns on that close, and the assertions at evictor_test.go:210 (and :141 in TestCPUCheckedBeforeHeap) read the counter immediately. Because the increment was sequenced after the signal, there is no happens-before edge between the third Inc() and the test's read — an observer woken by the cancellation can read a stale counter. The counter is atomic, so -race never flags it; this is purely an ordering gap.

Natural reproduction on master (exact CI signature, 0 DATA RACE): ~10/3200 failures with 8 parallel processes × GOMAXPROCS=2 × -count=400; 3/600 with -count=600 -cpu 2; ≥8/600 under a 6-busy-loop contention harness. 0 failures under the same harnesses with this fix.

There is a second, latent defect in the same loop: the evictor never deregistered its victims. The only cleanup is the deferred Deregister in trackedQuery.Exec, which runs when the cancelled query eventually unwinds. Until then, FindHeaviest could re-pick the already-cancelled victim on a later cycle: cortex_query_evictions_total was double-counted for a single query, a MaxEvictionsPerCycle slot was burned on a no-op re-cancel instead of relieving pressure, and in tests the re-picked victim's second cancel callback panicked on a double channel close.

How the fix resolves it

Two severable commits (commit 1 compiles, passes, and closes #7604 on its own; commit 2 reverts cleanly):

  1. fix(queryeviction): increment eviction metric before cancelling the victim — moves Inc() before Cancel(), plus a deterministic regression test.
  2. fix(queryeviction): deregister evicted queries to prevent double accounting — adds Deregister(victim.QueryID) before Cancel(), plus a deterministic regression test.

Final order in the victim loop:

metricValue → evictionsTotal.Inc() → registry.Deregister() → victim.Cancel() → warn log (fields and position unchanged)

The invariant: bookkeeping (metric, registry) happens-before the cancellation signal; the log is unmoved. The warn log deliberately stays after Cancel() — logging is not bookkeeping, nothing synchronizes on it, and a slow logger must not delay the pressure-relieving cancellation.

Why this is safe

  • Zero readers of the metric outside this package. grep -rn "cortex_query_evictions_total\|evictionsTotal" pkg/ cmd/ docs/ integration/ hits only pkg/util/queryeviction/evictor.go and evictor_test.go — no dashboard, alert rule, doc, or integration test references it. Its only other consumer is the asynchronous Prometheus scrape (≥1s granularity vs. a µs-scale reorder window).
  • #Inc == #Cancel is preserved. victim.Cancel is an idempotent, infallible context.CancelFunc (engine_wrapper.go); the loop is straight-line with no failure path between the two calls. The reorder only chooses which side of an unobservable window the count lands on — and removes the previously possible "eviction observed but not counted" skew.
  • Deregister is an idempotent locked delete (registry.go), so trackedQuery.Exec's own deferred Deregister remains a safe no-op when the evicted query finishes unwinding.
  • The only synchronous consumer of the eviction signal is trackedQuery.Exec's child-context check that maps the cancellation to ErrQueryEvicted (engine_wrapper.go); it reads neither the metric nor the registry.
  • The wiring site (pkg/querier/querier.go) only constructs the registry/evictor; it never reads them.

Behavior note: registry semantic shift (commit 2)

Evicted queries now leave the registry immediately at eviction time instead of when Exec unwinds:

  • registry.Len() (and the registered_queries debug log field) now means "evictable candidates" — it excludes cancelled-but-still-unwinding victims.
  • A victim that ignores context cancellation is no longer re-selected by later cycles. No remedy is lost: its cancel has already fired, and pre-fix re-picks were no-op cancels that burned MaxEvictionsPerCycle slots.
  • Where re-eviction previously double-counted, the observed eviction rate may drop — that inflation was the bug.

Tests

Both new regression tests were written first and verified to fail deterministically without their fix (output quoted in the test plan below); existing assertions at evictor_test.go:210 and :141 are byte-unchanged.

  • TestEvictionsTotal_IncrementedBeforeCancelObserved (commit 1): reads the counter inside the cancel callback — on the evictor goroutine, at the instant the signal fires — pinning exactly the "account before signal" invariant with no logger fixture or polling. sync.Once keeps the callback idempotent independently of commit 2.
  • TestEvictedVictim_RemovedFromRegistry (commit 2): sync.Once-guarded callback, cooldown 0; after the eviction is observed, a plain assert.Equal(t, 0, reg.Len()) — no Eventually needed because Deregister is sequenced before the close of the evicted channel.

Why not other approaches

  • Test-side counter polling (cortex_testutil.Poll / assert.Eventually around :210/:141): would stabilize the two existing call sites but leaves the producer-side ordering gap as a trap for every future consumer of this new package, and leaves both latent defects (production double-count, double-close panic class) in place. Rejected unanimously in review.
  • Moving the warn log before Cancel: nothing synchronizes on the log, and under saturation the pressure-relieving cancel would wait on log I/O.
  • Sleeps / timeout bumps / Eventually-wrapped asserts: mask or rescale the window without establishing a happens-before.

Scope

Production change is a reorder plus one added line in pkg/util/queryeviction/evictor.go; two regression tests in pkg/util/queryeviction/evictor_test.go; one CHANGELOG.md line. No flags, config, or API changes (make doc not required). The package was introduced by #7488.

Which issue(s) this PR fixes:
Fixes #7604

Checklist

  • Tests updated — two deterministic regression tests, each fails without its commit's fix.
  • Documentation added — N/A (no flags or config changed).
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan

  • gofmt -l / goimports -local github.com/cortexproject/cortex -l on pkg/util/queryeviction/ — clean.
  • go vet -tags "netgo slicelabels" ./pkg/util/queryeviction/ — clean.
  • Fail-without-fix, commit 1: TestEvictionsTotal_IncrementedBeforeCancelObserved on the unfixed tree — 5/5 deterministic FAIL, expected: 1, actual: 0; with the reorder: PASS (-count=5 and -race -count=50).
  • Fail-without-fix, commit 2: TestEvictedVictim_RemovedFromRegistry on the commit-1-only tree — 5/5 deterministic FAIL, expected: 0, actual: 1; with Deregister: PASS (-count=5 and -race -count=50).
  • go test -tags "netgo slicelabels" -race ./pkg/util/queryeviction/ — ok (full package; also green at the commit-1-only state, proving severability).
  • Stress, 4 affected tests: go test -tags "netgo slicelabels" -race -count=500 -cpu 1,2 -run 'Test(PrometheusMetrics_IncrementedCorrectly|CPUCheckedBeforeHeap|EvictionsTotal_IncrementedBeforeCancelObserved|EvictedVictim_RemovedFromRegistry)$' ./pkg/util/queryeviction/ok, 65.956s, 0 failures (2,000 executions).
  • Stress, full package: go test -tags "netgo slicelabels" -count=150 -cpu 1,2 ./pkg/util/queryeviction/ok, 112.3s, 0 failures (300 full-package executions).
  • Master baseline for comparison (natural repro of Flaky test: TestPrometheusMetrics_IncrementedCorrectly (pkg/util/queryeviction) — race between async eviction and evictionsTotal assertion #7604, documented during root-cause analysis): ~10/3200 and 3/600 failure rates under CPU contention, exact CI signature, no DATA RACE.

AI assistance disclosure

Per GENAI_POLICY.md: this fix was developed with substantial AI assistance (Claude Code) — including a multi-round adversarial review of the root cause and candidate designs. All code was reviewed by the contributor, and every verification command above was run locally; the regression tests were proven to fail without each commit's fix before it was applied.

🤖 Generated with Claude Code

@dosubot dosubot Bot added go Pull requests that update Go code type/bug type/flaky-test labels Jun 11, 2026
…ictim

The eviction loop in QueryEvictor.running() called victim.Cancel() before
incrementing cortex_query_evictions_total. Cancel() is the externally
observable commit point of an eviction: the cancellation callback is where
tests (and any future hook) synchronize on the event. Because the counter
was incremented only after that signal fired, an observer woken by the
cancellation could read a stale counter value - there was no happens-before
edge between the increment and the observation.

This is exactly the flake in TestPrometheusMetrics_IncrementedCorrectly
(expected: 3, actual: 2) and the same window exists for the assertions in
TestCPUCheckedBeforeHeap. The counter is atomic, so -race never flags it;
it is purely an ordering gap (reproduced naturally on master at roughly
10/3200 runs under CPU contention with -cpu 1,2).

Move the Inc() before victim.Cancel() so the accounting is complete before
the eviction becomes observable. There is no failure path between the two
calls (Cancel is an infallible, idempotent context.CancelFunc), so the
number of increments is unchanged - only which side of the cancellation
window they land on. The warn log intentionally stays after Cancel():
logging is not bookkeeping, and a slow logger must not delay resource
relief.

Add TestEvictionsTotal_IncrementedBeforeCancelObserved, which reads the
counter inside the cancel callback (on the evictor goroutine, at the
instant the signal fires). It fails deterministically without this fix
(expected: 1, actual: 0) and pins the invariant "account before signal".

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/queryeviction-eviction-metric-ordering branch 2 times, most recently from b61cef3 to 9dd449f Compare June 11, 2026 06:56
…unting

Evicted victims were never removed from the QueryRegistry by the evictor;
the only cleanup is the deferred Deregister in trackedQuery.Exec, which
runs when the query goroutine eventually unwinds. Until then the
already-cancelled victim remained in the evictable candidate set, so a
later eviction cycle could re-pick it: cortex_query_evictions_total was
double-counted for a single query, a MaxEvictionsPerCycle slot was burned
on a no-op Cancel instead of relieving pressure, and in tests the second
Cancel callback panicked on a double channel close.

Deregister the victim before cancelling it, keeping all bookkeeping
(metric, registry) ahead of the externally observable cancellation
signal. Deregister is an idempotent locked delete, so the deferred
Deregister in trackedQuery.Exec remains a safe no-op when the query
finishes unwinding.

Note the registry semantic shift: registry.Len() (and the
registered_queries debug log field) now means "evictable candidates" -
it no longer includes cancelled-but-still-unwinding victims.

Add TestEvictedVictim_RemovedFromRegistry, which fails deterministically
without this fix (expected: 0, actual: 1): the victim stayed registered
after its eviction was observed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/queryeviction-eviction-metric-ordering branch from 9dd449f to fce96a0 Compare June 11, 2026 07:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code size/M type/bug type/flaky-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestPrometheusMetrics_IncrementedCorrectly (pkg/util/queryeviction) — race between async eviction and evictionsTotal assertion

1 participant