Skip to content

test(qwp): fix flaky port-bind race in WebSocket client tests#47

Open
puzpuzpuz wants to merge 4 commits into
mainfrom
fix-flaky-test-port-bind-race
Open

test(qwp): fix flaky port-bind race in WebSocket client tests#47
puzpuzpuz wants to merge 4 commits into
mainfrom
fix-flaky-test-port-bind-race

Conversation

@puzpuzpuz

@puzpuzpuz puzpuzpuz commented Jun 12, 2026

Copy link
Copy Markdown

Problem

The WebSocket client test suite intermittently failed with java.net.BindException: Address already in use, e.g.:

QwpQueryClientWalkTrackerTest.testWalk_426UpgradeRequiredIsTransportNotTerminal:114
  » Bind Address already in use   (at TestWebSocketServer.start)

The cause is a time-of-check-to-time-of-use port race. TestPorts.findUnusedPort() opened an ephemeral ServerSocket, read its port, then closed it — releasing the port. TestWebSocketServer.start() re-bound that port only later, so in the gap another process (or the very next findUnusedPort() call, since nothing held the port) could take it. The failure is timing- and load-dependent, so it shows up as a rare flake on busy CI runners rather than deterministically.

Fix

TestWebSocketServer now binds its loopback listener eagerly in the constructor, holds it for the server's whole lifetime, and exposes the OS-assigned port via getPort(). start() just launches the accept loop on the already-bound socket. Owning the port from allocation to teardown closes the race window entirely.

Call sites no longer pre-allocate a port: each test reads server.getPort() after constructing the server. findUnusedPort() remains only where a test points a client at a deliberately-dead endpoint (no server bound). The two raw ServerSocket fixtures that carried the same race (Always401Fixture, Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern; the other raw fixtures already used it. The change also removes two latent same-class hazards the migration surfaced: the sequential fixed-port allocPort() in DurableAckIntegrationTest and the port1 + 50 guess in CleanShutdownNoReplayTest.

Tradeoffs

Eager binding means the listener is accept-able at the TCP level from construction rather than from start(). For "server arrives late" tests this changes the cause of a pre-start() connection failure (upgrade timeout instead of
connection-refused) but not the outcome: there is no accept loop until start(), so the client still fails and retries, and the assertions are about end state. Those tests pass unchanged. A minor side effect is some benign Handshake failed server logs when a stale pre-start() connection is drained after the accept loop starts.

The diff is broad (19 files) because the helper is widely used, but each call-site change is mechanical: drop the pre-allocated port argument, read getPort() instead.

Test plan

  • Ran the full core test suite after the main merge and compile fix (ff631e4): 2326 tests, 0 failures, 0 errors, 1 skipped
  • Confirmed the originally-failing QwpQueryClientWalkTrackerTest passes
  • Verified the timing-sensitive classes (InitialConnectAsyncTest, InitialConnectRetryTest, CloseDrainTest, RecoveryReplayTest, ReconnectTest) pass, including the "server arrives late" scenarios
  • Grepped to confirm no remaining new TestWebSocketServer(port, ...) calls and no raw ServerSocket bound to a pre-selected port

TestPorts.findUnusedPort() opened an ephemeral ServerSocket, read its
port, then closed it -- releasing the port. TestWebSocketServer.start()
re-bound that port only later, so another process (or the very next
findUnusedPort() call) could grab it in the gap, surfacing as a flaky
"BindException: Address already in use" on loaded CI runners.

TestWebSocketServer now binds its loopback listener eagerly in the
constructor, holds it for the server's whole lifetime, and exposes the
OS-assigned port via getPort(); start() just launches the accept loop on
the already-bound socket. Owning the port from allocation to teardown
closes the race window entirely.

Every test that starts a server now reads server.getPort() instead of
pre-allocating a port via findUnusedPort(); findUnusedPort() stays only
where a test points a client at a deliberately-dead endpoint. The two
raw ServerSocket fixtures that shared the same race (Always401Fixture,
Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern;
the other raw fixtures already used it. The change also removes two
latent same-class hazards the migration surfaced: the sequential
fixed-port allocPort() in DurableAckIntegrationTest and the port1 + 50
guess in CleanShutdownNoReplayTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@puzpuzpuz puzpuzpuz self-assigned this Jun 12, 2026
@puzpuzpuz puzpuzpuz added the bug Something isn't working label Jun 12, 2026
@bluestreak01

Copy link
Copy Markdown
Member

@puzpuzpuz — code review (level 3). Verdict: request changes. The fix itself is correct and well-reasoned, but the branch HEAD does not compile after today's main merge.

Critical

1. The test module does not compile — BLOCKER

mvn -pl core test-compile fails:

CloseDrainTest.java:[231,66]              int cannot be converted to WebSocketServerHandler
CloseTerminalConflationTest.java:[76,62]  int cannot be converted to WebSocketServerHandler
CloseDrainDoubleSignalTest.java:[108,62]  int cannot be converted to WebSocketServerHandler

All three still call the removed new TestWebSocketServer(int port, WebSocketServerHandler handler):

  • CloseDrainTest.java:231in-diff file. Method testDrainAfterFlushWaitsForPriorUnackedFrames was missed by the migration; the PR converted 8 sibling methods in this same file but not this one.
  • CloseDrainDoubleSignalTest.java:108out-of-diff. Whole file comes from fix(qwp): fix missed terminal send errors during Sender close #42 (153cdd5).
  • CloseTerminalConflationTest.java:76out-of-diff. Same origin.

Root cause: these call sites live on origin/main with the old signature, and the main merge into this branch (17cc8f7, made today) pulled them in without reconciling them against the new getPort() API. The "2314 tests, 0 failures" plan was valid for pre-merge 8f69a13 but is now stale — nothing compiles, so nothing runs.

Fix — apply the standard pattern (construct → start() → read getPort()):

// after
GatedHaltHandler server = new GatedHaltHandler();
try (TestWebSocketServer ws = new TestWebSocketServer(server)) {
    ws.start();
    Assert.assertTrue(ws.awaitStart(5, TimeUnit.SECONDS));
    int port = ws.getPort();
    String cfg = "ws::addr=localhost:" + port + ";...";

All three enclosing methods already declare throws Exception, so the constructor's new throws IOException is covered. Drop the now-unused TestPorts.findUnusedPort() line (and its import where it becomes unused). After fixing, re-run CloseDrainTest, CloseDrainDoubleSignalTest, CloseTerminalConflationTest, and QwpQueryClientWalkTrackerTest, and refresh the test-plan numbers.

Moderate

2. Stale test-plan claim

"Ran the full core test suite … 2314 tests, 0 failures" predates the main merge that broke compilation. Re-run and update after fixing Finding 1.

Minor

3. Dead null-check on a now-final field

TestWebSocketServer.close() still guards if (serverSocket != null). The field is now final and always assigned in the constructor (or it throws), so it can never be null here. Drop the guard.

4. New tiny leak window in multi-server tests

In QwpQueryClientWalkTrackerTest / WriteFailoverTest, two servers are constructed before the try { … } finally { close(); close(); }. Since construction now binds a real ServerSocket, if the second constructor throws IOException the first server's socket leaks. Realistically unreachable (requires loopback ephemeral-socket exhaustion), so a nit — not worth restructuring.

Verified safe (checked against source — not findings)

  • Eager-bind tradeoff in the deferred-start() tests (testAsyncDeliversBufferedRowsWhenServerArrivesLate, testWithRetrySucceedsWhenServerComesUpInTime, testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain). Concern: with the listener bound from construction, the first attempt now completes the TCP handshake and blocks on the upgrade read (default auth_timeout = 15_000ms) instead of failing fast, while the test waits for awaitAtLeastOneConnectAttempt before server.start() under a 5–10s budget. Resolved: CursorWebSocketSendLoop.connectLoop() does totalReconnectAttempts.incrementAndGet() at the start of each attempt (line 834), before the blocking reconnect(). So the counter reaches 1 immediately, the main thread calls server.start() within ms, the accept loop drains the already-queued connection, and the parked upgrade read gets its 101 far inside the 15s timeout — the first attempt succeeds rather than timing out. End-state assertions hold.
  • testAuthTimeoutBoundsHungUpgradegoodPort definite-assignment is fine; good is started before the sender connects, so eager-bind is irrelevant.
  • findUnusedPort() retained only for deliberately-dead-endpoint tests — correct; the inherent dead-port race is unavoidable and acceptable there.
  • Multi-server port ordering preserved; no unused TestPorts imports; no other deleted-signature call sites beyond the three above (grep + compiler agree).

Summary

  • Request changes — correct fix, but a merge-integration regression leaves the branch uncompilable.
  • 1 critical (compile break), 1 moderate (stale test plan), 2 minor. 2 findings verified, 0 false positives.
  • In-diff vs out-of-diff: the critical finding spans both — 1 in-diff site (CloseDrainTest.java:231) and 2 out-of-diff sites (CloseDrainDoubleSignalTest, CloseTerminalConflationTest) that only a repo-wide new TestWebSocketServer( grep + a real test-compile exposed.

🤖 Generated with pi review-pr (level 3)

The main merge pulled in three call sites that still used the removed
TestWebSocketServer(int port, handler) constructor -- two from #42
(CloseDrainDoubleSignalTest, CloseTerminalConflationTest) and a new
CloseDrainTest method -- breaking test compilation. Switch them to the
construct-then-getPort() pattern used everywhere else in the suite.
@bluestreak01

Copy link
Copy Markdown
Member

@puzpuzpuz pushed ff631e4 to fix the compile break (Critical finding #1).

Migrated the three stragglers to the construct → start() → getPort() pattern, dropping the pre-allocated port:

  • CloseDrainTest#testDrainAfterFlushWaitsForPriorUnackedFrames
  • CloseDrainDoubleSignalTest#testCloseDoesNotDoubleSignalWhenAsyncHandlerOwnsErrorAndDrainRuns
  • CloseTerminalConflationTest#testCloseSurfacesHaltAfterEarlierDropFlippedTheStickyFlag

No import changes (all three share TestPorts' package). Verified locally: mvn -pl core test-compile is green and all three tests pass (3 run, 0 failures, 0 errors). The two Close* files came in via the main merge (from #42), so they're now part of this PR's footprint.

Still open from the review: re-run the full core suite and refresh the test-plan numbers (Moderate #2), plus the two optional nits (dead serverSocket != null null-check #3, multi-server construct-before-try leak window #4).

The serverSocket field is final and unconditionally assigned in the
constructor (or the constructor throws IOException and no instance
exists), so the `if (serverSocket != null)` guard in close() can never
be false. Addresses review Minor finding #3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@puzpuzpuz

Copy link
Copy Markdown
Author

Addressed the remaining review items.

Moderate #2 — test plan refreshed. Re-ran the full core suite after the main merge and the ff631e4 compile fix: 2326 tests, 0 failures, 0 errors, 1 skipped (BUILD SUCCESS). The count moved 2314 → 2326 because the merge pulled in new tests. Updated the PR description's test-plan line accordingly.

Minor #3 — dead null-check dropped (9c4af00). serverSocket is final and unconditionally assigned in the constructor (or it throws IOException and no instance exists), so the if (serverSocket != null) guard in close() could never be false. Removed it, keeping the try/catch and the "close the listener first" comment. test-compile green.

Minor #4 — multi-server construct-before-try leak: leaving as-is. Verified the window is real (in WriteFailoverTest and QwpQueryClientWalkTrackerTest, the second server is constructed before the enclosing try/finally, so a throw from its constructor would leak the first server's socket). As you noted, it's only reachable under loopback ephemeral-socket exhaustion, so not worth restructuring those call sites.

Critical #1 was already resolved in ff631e4; confirmed mvn -pl core test-compile is green and a repo-wide grep finds zero remaining old-signature new TestWebSocketServer(port, ...) calls.

@bluestreak01 bluestreak01 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve (level 3 review).

Test-only fix; closes a genuine TOCTOU port-bind race by owning the ephemeral port from allocation to teardown. Verified at HEAD 9c4af00:

  • mvn -pl core test-compile green (prior compile break resolved)
  • Ran the full footprint of all 21 changed test classes: 121 tests, 0 failures, 0 errors
  • Repo-wide grep confirms zero remaining old-signature new TestWebSocketServer(port, ...) callsites; all raw test ServerSockets use port 0; findUnusedPort() retained only for deliberately-dead endpoints
  • Validated the eager-bind tradeoff against source: deferred-start async tests gate on totalReconnectAttempts incremented at attempt-start (CursorWebSocketSendLoop:834), so they're robust to the connect-blocks-instead-of-refused shift, not timing-lucky

0 critical, 0 moderate, 4 minor (doc/cleanup nits + 2 unreachable leak-window observations already acknowledged). None blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants