fix(qwp): prevent JVM crash when closing a QWP sender#43
Conversation
Closing a QWP sender while its background segment manager was mid-tick could crash the whole process. The manager's worker thread persists the acknowledged-FSN watermark into a memory-mapped file on each tick; if a sender closed and unmapped that file in the same instant, a stale worker could write to the now-unmapped address and abort the JVM with a SIGSEGV. The worker now re-checks, under the manager lock, whether the ring is still registered before it touches the watermark or the byte accounting. deregister() flips a lock-guarded `registered` flag, so once close() returns the worker can no longer write through the unmapped watermark. The watermark write and the totalBytes subtraction are both gated on the flag; drainTrimmable() and the segment close/unlink stay unconditional, so a stale snapshot still unlinks fully-acked segments as before. The O(1) flag replaces the previous O(n) scan of the rings list.
Keep the bounded close wait, but only free worker-owned native state after the segment-manager worker is observed dead. A timed-out or interrupted join can leave the worker alive inside a service tick. In that state pathScratch may still be used for spare path creation or native-path cleanup, so closing it immediately risks a native use-after-free. Leave workerThread set and pathScratch allocated when the worker is still alive, allowing a later close() to retry cleanup.
The durable-ack tests assert on the in-memory engine.ackedFsn(), and the recovery tests forge the .ack-watermark by hand, so nothing observed the SegmentManager worker actually writing the watermark on its trim tick. A regression that silently stopped that write (e.g. an inverted `registered` gate) would pass the whole suite while reintroducing re-replay of durable-acked frames on restart. Add a positive twin of testRecoveryAdvancesAckedFsnPastWatermark: drive a real, started manager to persist the watermark from real acks, block until the worker has written it to disk, then assert a second session recovers that manager-written value rather than the bare lowestBase - 1 seed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
[PR Coverage check]😍 pass : 42 / 43 (97.67%) file detail
|
bluestreak01
left a comment
There was a problem hiding this comment.
Approving after a full level-3 adversarial review.
This is a correct, well-tested fix for two real pre-existing native-memory use-after-free hazards, both reachable in the production owned-manager path when the bounded join(5s) times out under load:
- Watermark write-through-unmapped: the worker's watermark.write() now runs inside synchronized(lock) gated on the new RingEntry.registered flag, which deregister() flips false under the same lock before the engine unmaps the watermark. This correctly routes around AckWatermark's non-volatile 'closed' boolean (the original bug) by supplying a lock happens-before edge.
- pathScratch use-after-free: close() now early-returns on t.isAlive() without freeing worker-owned native scratch, trading a bounded ~256B leak for not freeing memory a live worker still writes.
Verified against source: drainTrimmable() and SegmentRing.close() are both synchronized(this) (clean ownership transfer, no double-free); needsHotSpare()/nextSeqHint() read only heap state (stale snapshots never touch freed native memory); register() reordered so nothing throwable runs after rings.add (no half-registered entry); totalBytes accounting cannot drift under any deregister/trim interleaving. All production callers (CursorSendEngine.close, QwpWebSocketSender, BackgroundDrainer, Sender, reconnect loop) walked and SAFE. Touched + surrounding test suite is green (26/26 plus the full sf.cursor.* package), and the crash-capable regressions confirm both fixed branches actually fire.
Non-blocking follow-ups (optional):
- close() called from an already-interrupted thread always early-returns and leaks pathScratch regardless of worker/disk health (SegmentManager.java:160-176). Strictly safer than the prior crash; consider clearing interrupt status or a non-interruptible bounded wait so a clean stop is still attempted.
- 'a later close() retries' overstates production reality (engine closes the owned manager exactly once); tighten the comment.
- Shared-manager ctor catch never deregisters; safe only because register() can't throw after rings.add — a maintenance hazard worth the existing invariant comment.
- Test gaps: no e2e production timed-out-join through engine close; ctor-catch reorder intent effectively untested; watermark test is crash-only signal; two ctor-failure tests lack @test(timeout).
Closing a QWP sender (on shutdown, reconnect, or sender churn) could
crash the entire JVM with a SIGSEGV when it raced the background segment
manager. Under load this showed up as rare, hard-to-reproduce process
deaths.
implementation details for reviewers
Two native-memory races are fixed:
Watermark SIGSEGV. The worker services rings off a snapshot taken
under lock, then writes the acked-FSN watermark outside the lock. If a
sender unmapped that file in the same window, the worker wrote through a
dangling address → SIGSEGV. Fix: the watermark write +
totalBytesaccounting now run under lock, gated on a lock-guarded
RingEntry.registeredflag thatderegister()clears before close()unmaps.
pathScratchuse-after-free.close()uses a bounded join; atimed-out join could leave the worker alive while its scratch buffer was
freed. Fix: only free worker-owned native state once the worker is
observed dead, else retry on a later close().