fix(qwp): prevent JVM crash when closing a QWP sender by jerrinot · Pull Request #43 · questdb/java-questdb-client

jerrinot · 2026-06-09T15:13:09Z

Closing a QWP sender (on shutdown, reconnect, or sender churn) could
crash the entire JVM with a SIGSEGV when it raced the background segment
manager. Under load this showed up as rare, hard-to-reproduce process
deaths.

implementation details for reviewers
Two native-memory races are fixed:

- Watermark SIGSEGV. The worker services rings off a snapshot taken
  under lock, then writes the acked-FSN watermark outside the lock. If a
  sender unmapped that file in the same window, the worker wrote through a
  dangling address → SIGSEGV. Fix: the watermark write + totalBytes
  accounting now run under lock, gated on a lock-guarded
  RingEntry.registered flag that deregister() clears before close()
  unmaps.
- pathScratch use-after-free. close() uses a bounded join; a
  timed-out join could leave the worker alive while its scratch buffer was
  freed. Fix: only free worker-owned native state once the worker is
  observed dead, else retry on a later close().

Closing a QWP sender while its background segment manager was mid-tick could crash the whole process. The manager's worker thread persists the acknowledged-FSN watermark into a memory-mapped file on each tick; if a sender closed and unmapped that file in the same instant, a stale worker could write to the now-unmapped address and abort the JVM with a SIGSEGV. The worker now re-checks, under the manager lock, whether the ring is still registered before it touches the watermark or the byte accounting. deregister() flips a lock-guarded `registered` flag, so once close() returns the worker can no longer write through the unmapped watermark. The watermark write and the totalBytes subtraction are both gated on the flag; drainTrimmable() and the segment close/unlink stay unconditional, so a stale snapshot still unlinks fully-acked segments as before. The O(1) flag replaces the previous O(n) scan of the rings list.
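The gating described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual SegmentManager code: `RegisteredGateSketch`, `FakeWatermark`, `Entry`, and `persistAck` are hypothetical names, and the memory-mapped watermark is modelled by a plain object that throws where the real one would fault. The byte accounting is also simplified (deregister() here does not touch totalBytes).

```java
/** Hypothetical model of the registered-flag gate; names are illustrative only. */
final class RegisteredGateSketch {

    /** Stand-in for the memory-mapped watermark. `mapped` models whether the address is still valid. */
    static final class FakeWatermark {
        boolean mapped = true;
        long value = -1;

        void write(long fsn) {
            // The real failure mode is a SIGSEGV; the sketch throws instead.
            if (!mapped) throw new IllegalStateException("write through unmapped watermark");
            value = fsn;
        }

        void unmap() { mapped = false; }
    }

    /** Stand-in for a ring's slot in the manager. */
    static final class Entry {
        final FakeWatermark watermark = new FakeWatermark();
        boolean registered; // guarded by the manager lock
    }

    private final Object lock = new Object();
    private long totalBytes; // guarded by lock

    void register(Entry e, long bytes) {
        synchronized (lock) {
            e.registered = true;
            totalBytes += bytes;
        }
    }

    /** Worker side: the entry may come from a stale snapshot, so re-check under the lock. */
    boolean persistAck(Entry e, long ackedFsn, long trimmedBytes) {
        synchronized (lock) {
            if (!e.registered) {
                return false; // sender already deregistered; do not touch its watermark or the accounting
            }
            e.watermark.write(ackedFsn);
            totalBytes -= trimmedBytes;
            return true;
        }
    }

    /** Sender side: once this returns, the worker can no longer write the watermark. */
    void deregister(Entry e) {
        synchronized (lock) {
            e.registered = false;
        }
    }

    long totalBytes() {
        synchronized (lock) {
            return totalBytes;
        }
    }
}
```

In this model the closing thread calls deregister() first and unmaps afterwards. Any persistAck() that acquires the lock later sees `registered == false`, and any that ran earlier finished its write before deregister() could take the lock.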

Keep the bounded close wait, but only free worker-owned native state after the segment-manager worker is observed dead. A timed-out or interrupted join can leave the worker alive inside a service tick. In that state pathScratch may still be used for spare path creation or native-path cleanup, so closing it immediately risks a native use-after-free. Leave workerThread set and pathScratch allocated when the worker is still alive, allowing a later close() to retry cleanup.
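A rough sketch of that close() policy, assuming a hypothetical `BoundedCloseSketch` class with a `Scratch` object standing in for the native pathScratch buffer. The real close() also signals the worker to stop, which is omitted here; the worker body is supplied by the caller so the timed-out case is easy to reproduce.

```java
/** Hypothetical model of "free worker-owned state only once the worker is observed dead". */
final class BoundedCloseSketch {

    /** Stand-in for a native buffer such as pathScratch. */
    static final class Scratch {
        volatile boolean freed;

        void free() { freed = true; }
    }

    final Scratch scratch = new Scratch();
    private Thread workerThread; // stays non-null while the worker may still be alive

    BoundedCloseSketch(Runnable workerBody) {
        workerThread = new Thread(workerBody, "segment-manager-sketch");
        workerThread.setDaemon(true);
        workerThread.start();
    }

    /**
     * Waits at most joinTimeoutMillis (must be > 0, since join(0) waits forever).
     * Returns true if worker-owned state was freed, false if cleanup was deferred.
     */
    synchronized boolean close(long joinTimeoutMillis) {
        Thread t = workerThread;
        if (t != null) {
            try {
                t.join(joinTimeoutMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            }
            if (t.isAlive()) {
                // The worker may still be using scratch: leave workerThread set and
                // scratch allocated so that a later close() can retry the cleanup.
                return false;
            }
            workerThread = null;
        }
        if (!scratch.freed) {
            scratch.free();
        }
        return true;
    }
}
```

The trade-off is a small bounded leak when the worker never exits, in exchange for never freeing memory that a live thread might still write.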


The durable-ack tests assert on the in-memory engine.ackedFsn(), and the recovery tests forge the .ack-watermark by hand, so nothing observed the SegmentManager worker actually writing the watermark on its trim tick. A regression that silently stopped that write (e.g. an inverted `registered` gate) would pass the whole suite while reintroducing re-replay of durable-acked frames on restart. Add a positive twin of testRecoveryAdvancesAckedFsnPastWatermark: drive a real, started manager to persist the watermark from real acks, block until the worker has written it to disk, then assert a second session recovers that manager-written value rather than the bare lowestBase - 1 seed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
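The "block until the worker has written it to disk" step could be done with a small polling helper along these lines. This is only a sketch: `AwaitSketch` and `awaitAtLeast` are hypothetical names, and the `LongSupplier` stands in for whatever the test uses to read the persisted watermark, since the on-disk format is not described here.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical test helper: poll a value until it reaches a target or a deadline passes. */
final class AwaitSketch {

    /** Returns true once reader reports a value >= expected, false on timeout or interrupt. */
    static boolean awaitAtLeast(LongSupplier reader, long expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (reader.getAsLong() < expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // the worker never persisted the expected value in time
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A test would then assert on the return value before starting the second session, so a regression that stops the watermark write fails loudly instead of passing on the seed value.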

mtopolnik · 2026-06-17T09:29:31Z

[PR Coverage check]

😍 pass : 42 / 43 (97.67%)

file detail

| | path | covered line | new line | coverage |
|---|---|---|---|---|
| 🔵 | io/questdb/client/cutlass/qwp/client/sf/cursor/SegmentManager.java | 38 | 39 | 97.44% |
| 🔵 | io/questdb/client/cutlass/qwp/client/sf/cursor/CursorSendEngine.java | 4 | 4 | 100.00% |

bluestreak01

Approving after a full level-3 adversarial review.

This is a correct, well-tested fix for two real pre-existing native-memory use-after-free hazards, both reachable in the production owned-manager path when the bounded join(5s) times out under load:

- Watermark write-through-unmapped: the worker's watermark.write() now runs inside synchronized(lock) gated on the new RingEntry.registered flag, which deregister() flips false under the same lock before the engine unmaps the watermark. This correctly routes around AckWatermark's non-volatile 'closed' boolean (the original bug) by supplying a lock happens-before edge.
- pathScratch use-after-free: close() now early-returns on t.isAlive() without freeing worker-owned native scratch, trading a bounded ~256B leak for not freeing memory a live worker still writes.

Verified against source: drainTrimmable() and SegmentRing.close() are both synchronized(this) (clean ownership transfer, no double-free); needsHotSpare()/nextSeqHint() read only heap state (stale snapshots never touch freed native memory); register() reordered so nothing throwable runs after rings.add (no half-registered entry); totalBytes accounting cannot drift under any deregister/trim interleaving. All production callers (CursorSendEngine.close, QwpWebSocketSender, BackgroundDrainer, Sender, reconnect loop) walked and SAFE. Touched + surrounding test suite is green (26/26 plus the full sf.cursor.* package), and the crash-capable regressions confirm both fixed branches actually fire.

Non-blocking follow-ups (optional):

- close() called from an already-interrupted thread always early-returns and leaks pathScratch regardless of worker/disk health (SegmentManager.java:160-176). Strictly safer than the prior crash; consider clearing interrupt status or a non-interruptible bounded wait so a clean stop is still attempted.
- 'a later close() retries' overstates production reality (engine closes the owned manager exactly once); tighten the comment.
- Shared-manager ctor catch never deregisters; safe only because register() can't throw after rings.add. This is a maintenance hazard worth the existing invariant comment.
- Test gaps: no e2e production timed-out-join through engine close; ctor-catch reorder intent effectively untested; watermark test is crash-only signal; two ctor-failure tests lack @Test(timeout).
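For the first follow-up, a non-interruptible bounded wait might look like the sketch below. `UninterruptibleJoin.joinUninterruptibly` is a hypothetical helper, not something in the PR; it relies only on `Thread.join(long)`, which throws InterruptedException and clears the interrupt flag if the calling thread is interrupted.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: bounded join that survives interrupts and restores the flag afterwards. */
final class UninterruptibleJoin {

    /** Waits up to timeoutMillis for t to die. Returns true if t is dead on return. */
    static boolean joinUninterruptibly(Thread t, long timeoutMillis) {
        boolean interrupted = false;
        try {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (t.isAlive()) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    break; // deadline reached; also avoids join(0), which would wait forever
                }
                try {
                    t.join(remainingMillis);
                } catch (InterruptedException e) {
                    // join cleared the flag when it threw; remember it and keep waiting
                    interrupted = true;
                }
            }
            return !t.isAlive();
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // hand the interrupt back to the caller
            }
        }
    }
}
```

With a helper like this, close() invoked from an already-interrupted thread would still wait out the full bound and could free pathScratch when the worker exits in time, while the caller still sees its interrupt status afterwards.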

jerrinot added the bug label Jun 9, 2026

jerrinot changed the title ~~fix(qwp): prevent JVM crash when closing a QWP sender~~ fix(qwp): prevent JVM crash when closing a QWP sender [DO NOT MERGE] Jun 9, 2026

jerrinot added 7 commits June 9, 2026 18:09

refactor(qwp): remove dead register rollback hook

72513cd

test(qwp): finish SegmentManager hook migration

4d0bd6b

docs(qwp): avoid stale hook caller list

f368f99

refactor(qwp): align constructor cleanup order

d62e488

docs(qwp): clarify register publish invariant

fcf28e9

Merge remote-tracking branch 'origin/main' into jh_segment_manager_se…

d7430b6

…gfault

jerrinot changed the title ~~fix(qwp): prevent JVM crash when closing a QWP sender [DO NOT MERGE]~~ fix(qwp): prevent JVM crash when closing a QWP sender Jun 15, 2026

jerrinot and others added 6 commits June 15, 2026 16:56

refactor(qwp): remove inert constructor deregister

1aef779

test(qwp): cover constructor cleanup on register failure

889da1f

comment cleanup

b1edcbe

ordering is important

210cafd

Merge branch 'main' into jh_segment_manager_segfault

d74c1e9

bluestreak01 approved these changes Jun 17, 2026

bluestreak01 enabled auto-merge (squash) June 17, 2026 09:51

bluestreak01 merged commit 2f4d7c7 into main Jun 17, 2026
12 checks passed

bluestreak01 deleted the jh_segment_manager_segfault branch June 17, 2026 09:52

bluestreak01 merged 14 commits into main from jh_segment_manager_segfault

3 participants
