test(qwp): fix flaky port-bind race in WebSocket client tests#47
puzpuzpuz wants to merge 4 commits into
TestPorts.findUnusedPort() opened an ephemeral ServerSocket, read its port, then closed it -- releasing the port. TestWebSocketServer.start() re-bound that port only later, so another process (or the very next findUnusedPort() call) could grab it in the gap, surfacing as a flaky "BindException: Address already in use" on loaded CI runners.
TestWebSocketServer now binds its loopback listener eagerly in the constructor, holds it for the server's whole lifetime, and exposes the OS-assigned port via getPort(); start() just launches the accept loop on the already-bound socket. Owning the port from allocation to teardown closes the race window entirely.
Every test that starts a server now reads server.getPort() instead of pre-allocating a port via findUnusedPort(); findUnusedPort() stays only where a test points a client at a deliberately-dead endpoint. The two raw ServerSocket fixtures that shared the same race (Always401Fixture, Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern; the other raw fixtures already used it.
The change also removes two latent same-class hazards the migration surfaced: the sequential fixed-port allocPort() in DurableAckIntegrationTest and the port1 + 50 guess in CleanShutdownNoReplayTest.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@puzpuzpuz: code review (level 3). Verdict: request changes. The fix itself is correct and well-reasoned, but the branch HEAD does not compile after today's
Critical
1. The test module does not compile (BLOCKER)
All three still call the removed
Root cause: these call sites live on
Fix: apply the standard pattern (construct → // after
GatedHaltHandler server = new GatedHaltHandler();
try (TestWebSocketServer ws = new TestWebSocketServer(server)) {
ws.start();
Assert.assertTrue(ws.awaitStart(5, TimeUnit.SECONDS));
int port = ws.getPort();
String cfg = "ws::addr=localhost:" + port + ";...";
All three enclosing methods already declare
Moderate
2. Stale test-plan claim
"Ran the full
Minor
3. Dead null-check on a now-
|
The main merge pulled in three call sites that still used the removed TestWebSocketServer(int port, handler) constructor -- two from #42 (CloseDrainDoubleSignalTest, CloseTerminalConflationTest) and a new CloseDrainTest method -- breaking test compilation. Switch them to the construct-then-getPort() pattern used everywhere else in the suite.
|
@puzpuzpuz pushed
Migrated the three stragglers to the
No import changes (all three share
Still open from the review: re-run the full
The serverSocket field is final and unconditionally assigned in the constructor (or the constructor throws IOException and no instance exists), so the `if (serverSocket != null)` guard in close() can never be false. Addresses review Minor finding #3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the remaining review items.
Moderate #2: test plan refreshed. Re-ran the full
Minor #3: dead null-check dropped (
Minor #4: multi-server construct-before-
Critical #1 was already resolved in
bluestreak01 left a comment
Approve (level 3 review).
Test-only fix; closes a genuine TOCTOU port-bind race by owning the ephemeral port from allocation to teardown. Verified at HEAD 9c4af00:
- `mvn -pl core test-compile` green (prior compile break resolved)
- Ran the full footprint of all 21 changed test classes: 121 tests, 0 failures, 0 errors
- Repo-wide grep confirms zero remaining old-signature `new TestWebSocketServer(port, ...)` callsites; all raw test ServerSockets use port 0; findUnusedPort() retained only for deliberately-dead endpoints
- Validated the eager-bind tradeoff against source: deferred-start async tests gate on totalReconnectAttempts incremented at attempt-start (CursorWebSocketSendLoop:834), so they're robust to the connect-blocks-instead-of-refused shift, not timing-lucky
0 critical, 0 moderate, 4 minor (doc/cleanup nits + 2 unreachable leak-window observations already acknowledged). None blocking.
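For readers who want to see the "connect-blocks-instead-of-refused" shift at the TCP level, here is a minimal standalone sketch. It is not code from this PR: `PreStartConnectSketch` and `probe` are hypothetical names, and it only uses plain `java.net` sockets. It assumes typical loopback behavior, where the kernel completes the TCP handshake for a bound listener even though nothing calls `accept()`, while a port with no listener rejects the connection.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration class, not part of the test suite under review.
class PreStartConnectSketch {

    // Attempts a TCP connect to the loopback port and reports what happened.
    static String probe(int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 1000);
            return "connected";
        } catch (ConnectException e) {
            return "refused";
        } catch (IOException e) {
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();

        // Old-style "dead" port: bind, read the port, release it again.
        int deadPort;
        try (ServerSocket tmp = new ServerSocket(0, 50, loopback)) {
            deadPort = tmp.getLocalPort();
        }
        System.out.println("no listener:           " + probe(deadPort));

        // Eagerly bound listener with no accept loop running yet.
        try (ServerSocket bound = new ServerSocket(0, 50, loopback)) {
            System.out.println("bound, no accept loop: " + probe(bound.getLocalPort()));
        }
    }
}
```

Under those assumptions the second probe connects at the TCP level, so any failure a client sees before the accept loop starts has to come from a later step (such as an upgrade timeout) rather than from a refused connection.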
Problem
The WebSocket client test suite intermittently failed with `java.net.BindException: Address already in use`, e.g.:
The cause is a time-of-check-to-time-of-use port race. `TestPorts.findUnusedPort()` opened an ephemeral `ServerSocket`, read its port, then closed it, releasing the port. `TestWebSocketServer.start()` re-bound that port only later, so in the gap another process (or the very next `findUnusedPort()` call, since nothing held the port) could take it. The failure is timing- and load-dependent, so it shows up as a rare flake on busy CI runners rather than deterministically.
Fix
`TestWebSocketServer` now binds its loopback listener eagerly in the constructor, holds it for the server's whole lifetime, and exposes the OS-assigned port via `getPort()`. `start()` just launches the accept loop on the already-bound socket. Owning the port from allocation to teardown closes the race window entirely.
Call sites no longer pre-allocate a port: each test reads `server.getPort()` after constructing the server. `findUnusedPort()` remains only where a test points a client at a deliberately-dead endpoint (no server bound). The two raw `ServerSocket` fixtures that carried the same race (`Always401Fixture`, `Auth401AfterFirstConnectionFixture`) adopt the same eager-bind pattern; the other raw fixtures already used it. The change also removes two latent same-class hazards the migration surfaced: the sequential fixed-port `allocPort()` in `DurableAckIntegrationTest` and the `port1 + 50` guess in `CleanShutdownNoReplayTest`.
Tradeoffs
Eager binding means the listener is accept-able at the TCP level from construction rather than from `start()`. For "server arrives late" tests this changes the cause of a pre-`start()` connection failure (upgrade timeout instead of connection-refused) but not the outcome: there is no accept loop until `start()`, so the client still fails and retries, and the assertions are about end state. Those tests pass unchanged. A minor side effect is some benign `Handshake failed` server logs when a stale pre-`start()` connection is drained after the accept loop starts.
The diff is broad (19 files) because the helper is widely used, but each call-site change is mechanical: drop the pre-allocated port argument, read `getPort()` instead.
Test plan
- `core` test suite after the `main` merge and compile fix (ff631e4): 2326 tests, 0 failures, 0 errors, 1 skipped
- `QwpQueryClientWalkTrackerTest` passes
- `InitialConnectAsyncTest`, `InitialConnectRetryTest`, `CloseDrainTest`, `RecoveryReplayTest`, `ReconnectTest`) pass, including the "server arrives late" scenarios
- `new TestWebSocketServer(port, ...)` calls and no raw `ServerSocket` bound to a pre-selected port
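As a rough illustration of the race and the eager-bind pattern this PR describes, the following is a minimal sketch rather than the actual `TestWebSocketServer` or `TestPorts` code. `EagerBindSketch`, `EagerServer`, `findUnusedPortRacy`, and `rebindFails` are hypothetical names introduced here, and the sketch assumes the default `ServerSocket` behavior in which a second listener cannot bind to an address and port that another listener already holds.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical illustration class, not the fixture changed in this PR.
class EagerBindSketch {

    // Racy pattern: the probe socket is closed before the caller re-binds,
    // so nothing owns the returned port between this call and the later bind.
    static int findUnusedPortRacy() throws IOException {
        try (ServerSocket probe = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            return probe.getLocalPort();
        }
    }

    // Eager-bind pattern: bind to port 0 in the constructor and keep the
    // socket open until close(), so the port is owned for the whole lifetime.
    static final class EagerServer implements AutoCloseable {
        private final ServerSocket serverSocket;

        EagerServer() throws IOException {
            this.serverSocket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        }

        // OS-assigned port, valid from construction onward.
        int getPort() {
            return serverSocket.getLocalPort();
        }

        @Override
        public void close() throws IOException {
            // The field is final and always assigned, so no null check is needed.
            serverSocket.close();
        }
    }

    // Returns true if another listener cannot bind the given loopback port.
    static boolean rebindFails(int port) throws IOException {
        try (ServerSocket intruder = new ServerSocket(port, 50, InetAddress.getLoopbackAddress())) {
            return false;
        } catch (BindException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        int racyPort = findUnusedPortRacy();
        System.out.println("racy port can be taken by anyone: " + !rebindFails(racyPort));

        try (EagerServer server = new EagerServer()) {
            System.out.println("held port is protected: " + rebindFails(server.getPort()));
        }
    }
}
```

The design point is simply that the object handing out the port is the same object that keeps it bound, so there is no window in which the port number is known but unowned.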