fix(queryeviction): account evictions before cancelling the victim query#7616
Open
sandy2008 wants to merge 2 commits into
Open
fix(queryeviction): account evictions before cancelling the victim query#7616sandy2008 wants to merge 2 commits into
sandy2008 wants to merge 2 commits into
Conversation
…ictim The eviction loop in QueryEvictor.running() called victim.Cancel() before incrementing cortex_query_evictions_total. Cancel() is the externally observable commit point of an eviction: the cancellation callback is where tests (and any future hook) synchronize on the event. Because the counter was incremented only after that signal fired, an observer woken by the cancellation could read a stale counter value - there was no happens-before edge between the increment and the observation. This is exactly the flake in TestPrometheusMetrics_IncrementedCorrectly (expected: 3, actual: 2) and the same window exists for the assertions in TestCPUCheckedBeforeHeap. The counter is atomic, so -race never flags it; it is purely an ordering gap (reproduced naturally on master at roughly 10/3200 runs under CPU contention with -cpu 1,2). Move the Inc() before victim.Cancel() so the accounting is complete before the eviction becomes observable. There is no failure path between the two calls (Cancel is an infallible, idempotent context.CancelFunc), so the number of increments is unchanged - only which side of the cancellation window they land on. The warn log intentionally stays after Cancel(): logging is not bookkeeping, and a slow logger must not delay resource relief. Add TestEvictionsTotal_IncrementedBeforeCancelObserved, which reads the counter inside the cancel callback (on the evictor goroutine, at the instant the signal fires). It fails deterministically without this fix (expected: 1, actual: 0) and pins the invariant "account before signal". Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
b61cef3 to
9dd449f
Compare
…unting Evicted victims were never removed from the QueryRegistry by the evictor; the only cleanup is the deferred Deregister in trackedQuery.Exec, which runs when the query goroutine eventually unwinds. Until then the already-cancelled victim remained in the evictable candidate set, so a later eviction cycle could re-pick it: cortex_query_evictions_total was double-counted for a single query, a MaxEvictionsPerCycle slot was burned on a no-op Cancel instead of relieving pressure, and in tests the second Cancel callback panicked on a double channel close. Deregister the victim before cancelling it, keeping all bookkeeping (metric, registry) ahead of the externally observable cancellation signal. Deregister is an idempotent locked delete, so the deferred Deregister in trackedQuery.Exec remains a safe no-op when the query finishes unwinding. Note the registry semantic shift: registry.Len() (and the registered_queries debug log field) now means "evictable candidates" - it no longer includes cancelled-but-still-unwinding victims. Add TestEvictedVictim_RemovedFromRegistry, which fails deterministically without this fix (expected: 0, actual: 1): the victim stayed registered after its eviction was observed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
9dd449f to
fce96a0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does:
Fixes the flaky
TestPrometheusMetrics_IncrementedCorrectlyinpkg/util/queryeviction(issue #7604, CI failureexpected: 3, actual: 2) at its root: the query evictor performed its bookkeeping after the externally observable eviction signal. Two severable commits reorder the eviction loop so that all bookkeeping — thecortex_query_evictions_totalincrement and the registry removal — happens beforevictim.Cancel().Why it flakes (mechanism)
The eviction loop in
QueryEvictor.running()(pkg/util/queryeviction/evictor.go) did:Cancel()is the externally observable commit point of an eviction: the test helper's cancel callback closes the per-queryevictedchannel,waitEvictedreturns on that close, and the assertions atevictor_test.go:210(and:141inTestCPUCheckedBeforeHeap) read the counter immediately. Because the increment was sequenced after the signal, there is no happens-before edge between the thirdInc()and the test's read — an observer woken by the cancellation can read a stale counter. The counter is atomic, so-racenever flags it; this is purely an ordering gap.Natural reproduction on master (exact CI signature, 0 DATA RACE): ~10/3200 failures with 8 parallel processes ×
GOMAXPROCS=2×-count=400; 3/600 with-count=600 -cpu 2; ≥8/600 under a 6-busy-loop contention harness. 0 failures under the same harnesses with this fix.There is a second, latent defect in the same loop: the evictor never deregistered its victims. The only cleanup is the deferred
DeregisterintrackedQuery.Exec, which runs when the cancelled query eventually unwinds. Until then,FindHeaviestcould re-pick the already-cancelled victim on a later cycle:cortex_query_evictions_totalwas double-counted for a single query, aMaxEvictionsPerCycleslot was burned on a no-op re-cancel instead of relieving pressure, and in tests the re-picked victim's second cancel callback panicked on a double channel close.How the fix resolves it
Two severable commits (commit 1 compiles, passes, and closes #7604 on its own; commit 2 reverts cleanly):
fix(queryeviction): increment eviction metric before cancelling the victim— movesInc()beforeCancel(), plus a deterministic regression test.fix(queryeviction): deregister evicted queries to prevent double accounting— addsDeregister(victim.QueryID)beforeCancel(), plus a deterministic regression test.Final order in the victim loop:
The invariant: bookkeeping (metric, registry) happens-before the cancellation signal; the log is unmoved. The warn log deliberately stays after
Cancel()— logging is not bookkeeping, nothing synchronizes on it, and a slow logger must not delay the pressure-relieving cancellation.Why this is safe
grep -rn "cortex_query_evictions_total\|evictionsTotal" pkg/ cmd/ docs/ integration/hits onlypkg/util/queryeviction/evictor.goandevictor_test.go— no dashboard, alert rule, doc, or integration test references it. Its only other consumer is the asynchronous Prometheus scrape (≥1s granularity vs. a µs-scale reorder window).#Inc == #Cancelis preserved.victim.Cancelis an idempotent, infalliblecontext.CancelFunc(engine_wrapper.go); the loop is straight-line with no failure path between the two calls. The reorder only chooses which side of an unobservable window the count lands on — and removes the previously possible "eviction observed but not counted" skew.Deregisteris an idempotent locked delete (registry.go), sotrackedQuery.Exec's own deferredDeregisterremains a safe no-op when the evicted query finishes unwinding.trackedQuery.Exec's child-context check that maps the cancellation toErrQueryEvicted(engine_wrapper.go); it reads neither the metric nor the registry.pkg/querier/querier.go) only constructs the registry/evictor; it never reads them.Behavior note: registry semantic shift (commit 2)
Evicted queries now leave the registry immediately at eviction time instead of when
Execunwinds:registry.Len()(and theregistered_queriesdebug log field) now means "evictable candidates" — it excludes cancelled-but-still-unwinding victims.MaxEvictionsPerCycleslots.Tests
Both new regression tests were written first and verified to fail deterministically without their fix (output quoted in the test plan below); existing assertions at
evictor_test.go:210and:141are byte-unchanged.TestEvictionsTotal_IncrementedBeforeCancelObserved(commit 1): reads the counter inside the cancel callback — on the evictor goroutine, at the instant the signal fires — pinning exactly the "account before signal" invariant with no logger fixture or polling.sync.Oncekeeps the callback idempotent independently of commit 2.TestEvictedVictim_RemovedFromRegistry(commit 2):sync.Once-guarded callback, cooldown 0; after the eviction is observed, a plainassert.Equal(t, 0, reg.Len())— noEventuallyneeded becauseDeregisteris sequenced before the close of theevictedchannel.Why not other approaches
cortex_testutil.Poll/assert.Eventuallyaround:210/:141): would stabilize the two existing call sites but leaves the producer-side ordering gap as a trap for every future consumer of this new package, and leaves both latent defects (production double-count, double-close panic class) in place. Rejected unanimously in review.Cancel: nothing synchronizes on the log, and under saturation the pressure-relieving cancel would wait on log I/O.Eventually-wrapped asserts: mask or rescale the window without establishing a happens-before.Scope
Production change is a reorder plus one added line in
pkg/util/queryeviction/evictor.go; two regression tests inpkg/util/queryeviction/evictor_test.go; oneCHANGELOG.mdline. No flags, config, or API changes (make docnot required). The package was introduced by #7488.Which issue(s) this PR fixes:
Fixes #7604
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]docs/configuration/v1-guarantees.mdupdated if this PR introduces experimental flags — N/ATest plan
gofmt -l/goimports -local github.com/cortexproject/cortex -lonpkg/util/queryeviction/— clean.go vet -tags "netgo slicelabels" ./pkg/util/queryeviction/— clean.TestEvictionsTotal_IncrementedBeforeCancelObservedon the unfixed tree — 5/5 deterministic FAIL,expected: 1, actual: 0; with the reorder: PASS (-count=5and-race -count=50).TestEvictedVictim_RemovedFromRegistryon the commit-1-only tree — 5/5 deterministic FAIL,expected: 0, actual: 1; withDeregister: PASS (-count=5and-race -count=50).go test -tags "netgo slicelabels" -race ./pkg/util/queryeviction/— ok (full package; also green at the commit-1-only state, proving severability).go test -tags "netgo slicelabels" -race -count=500 -cpu 1,2 -run 'Test(PrometheusMetrics_IncrementedCorrectly|CPUCheckedBeforeHeap|EvictionsTotal_IncrementedBeforeCancelObserved|EvictedVictim_RemovedFromRegistry)$' ./pkg/util/queryeviction/— ok, 65.956s, 0 failures (2,000 executions).go test -tags "netgo slicelabels" -count=150 -cpu 1,2 ./pkg/util/queryeviction/— ok, 112.3s, 0 failures (300 full-package executions).AI assistance disclosure
Per
GENAI_POLICY.md: this fix was developed with substantial AI assistance (Claude Code) — including a multi-round adversarial review of the root cause and candidate designs. All code was reviewed by the contributor, and every verification command above was run locally; the regression tests were proven to fail without each commit's fix before it was applied.🤖 Generated with Claude Code