Skip to content

Fix use-after-free when a terminated worker has children still starting#31951

Open
robobun wants to merge 8 commits into
mainfrom
farm/07b3df71/fix-nested-worker-teardown-uaf
Open

Fix use-after-free when a terminated worker has children still starting#31951
robobun wants to merge 8 commits into
mainfrom
farm/07b3df71/fix-nested-worker-teardown-uaf

Conversation

@robobun
Copy link
Copy Markdown
Collaborator

@robobun robobun commented Jun 7, 2026

What does this PR do?

Fixes a use-after-free found by Fuzzilli (fingerprint Address:heap-use-after-free, flaky, on a Worker thread).

Root cause: a worker's children hold a BackRef to the parent's VirtualMachine and read it throughout start_vm() (transform options clone, env map clone, standalone_module_graph in VirtualMachine::init_worker). When the parent is itself a worker and gets terminated, its shutdown() freed that VM immediately. Children still starting up then read freed memory:

READ of size 8 ... thread T42 (Worker)
    #0 <bun_jsc::virtual_machine::VirtualMachine>::init_worker src/jsc/VirtualMachine.rs:3675
    #1 <bun_jsc::web_worker::WebWorker>::start_vm src/jsc/web_worker.rs:914
    #2 <bun_jsc::web_worker::WebWorker>::thread_main src/jsc/web_worker.rs:772
freed by thread T9 (Worker):
    #5 <bun_jsc::web_worker::WebWorker>::shutdown src/jsc/web_worker.rs:1317
0x... is located 28112 bytes inside of 56064-byte region (the parent worker's VirtualMachine)

This was the "Known gap" documented in the web_worker.rs header: the main thread waits for all workers at exit (terminate_all_and_wait), but a worker parent did not wait for its own children. The first regression test below crashes release builds too, so this is reachable in production, not only under ASAN.

The fix: the fixing line is the terminate_children_and_wait(vm_ptr, 10_000) call in shutdown() (step 3.5): before freeing its VM, an exiting worker terminates the workers whose parent VM is its own and futex-waits until each has unlinked from the live-workers registry, which is past every parent-VM read. It reuses the existing terminate_all_and_wait sweep, parameterized with a parent filter. This mirrors Node's Environment::stop_sub_worker_contexts(). The placement after JSC teardown means no JS can run to register new children, so the sweep is complete.

Two more crashes on the same termination race, surfaced by the same repro once the UAF was fixed:

  • flush_logs() hit panic!("unhandled exception") when a child's entry-point resolution failed (revoked blob URL) while its termination was in flight: the pending TerminationException makes the log-to-JS conversion return JsError::Terminated. Termination is a normal event, not an invariant violation, so skip dispatching the error event. A genuine JsError::Thrown from the conversion is reported through report_uncaught_exception instead of crashing.
  • JSC clears the termination request at VM-entry-scope exit while leaving the TerminationException pending (VM.cpp:1799, handleTraps re-arms from the trap bit). Re-entering JS in that state trips ASSERT(vm.hasTerminationRequest()) in VMTraps::deferTerminationSlow (debug builds). The worker death path now clears the stale exception at its JS re-entry points, using the existing clear_termination_exception() (same pattern as the test runner and repl).

How did you verify your code works?

Two regression tests in test/js/web/workers/worker-terminate-lifetime.test.ts, each a spawned fixture where a worker creates children and is terminated while they start:

  • "terminating a worker while its nested children are starting does not UAF": on the unfixed ASAN build, crashes with the heap-use-after-free above in 5/5 runs; also fails on the release build (crash report in stderr). Passes on the fixed build (25/25 stress runs clean).
  • "terminating a worker whose children fail entry resolution does not crash": covers the flush_logs panic and the VMTraps assertion path; crashes 5/5 on the unfixed ASAN build.

Also ran the existing worker suites (worker.test.ts, worker_blob.test.ts, worker-terminate-lifetime.test.ts, worker_threads.test.ts, worker-async-dispose.test.ts, message-port-context-destroy-leak.test.ts) with the debug+ASAN build. The pre-existing failures in worker.test.ts ("worker with event listeners doesn't close event loop", 1000ms budget) and worker_threads.test.ts ("eval does not leak source code") reproduce identically on an unfixed build on this runner; they are unrelated timing flakes on slow debug machines.

The file is also added to test/no-validate-leaksan.txt: the debian-13 x64-asan lane runs tests with detect_leaks=1, and worker teardown intentionally leaks when a parent context is gone before the close task runs (the thread-held Worker ref documented in web_worker.rs), which these tests exercise by design. Reproduced on a local release-asan build (flaky 32-byte leak report in the fixture's stderr, about one run in five); worker.test.ts and worker_blob.test.ts are already excluded for the same reason. AddressSanitizer's use-after-free detection is unaffected by the exclusion.

A worker's children hold a BackRef to its VirtualMachine and read it
throughout start_vm() (transform options, env clone, standalone module
graph). When the parent worker was terminated, its shutdown() freed that
VM without waiting for children still starting up, a use-after-free
caught by ASAN in VirtualMachine::init_worker.

shutdown() now terminates its own children and waits for each to unlink
from the live-workers registry before freeing the VM, the per-worker
analogue of the main thread's terminate_all_and_wait() and of Node's
Environment::stop_sub_worker_contexts().

Also fix two crashes on the same termination race:
- flush_logs() panicked with "unhandled exception" when the log-to-JS
  conversion hit a pending TerminationException; skip dispatching the
  error event instead.
- A TerminationException left pending after JSC clears the termination
  request at VM-entry-scope exit tripped
  ASSERT(vm.hasTerminationRequest()) in VMTraps::deferTerminationSlow
  when the death path re-entered JS; clear the stale exception at the
  re-entry points (flush_logs and the exit-handler step of shutdown).
@robobun
Copy link
Copy Markdown
Collaborator Author

robobun commented Jun 7, 2026

@github-actions github-actions Bot added the claude label Jun 7, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

Found 3 issues this PR may fix:

  1. panic: Segmentation fault at address 0xD — "multiple threads are crashing" under Worker spawn/terminate churn (1.3.14, long-running server) #31880 - Segfault under Worker spawn/terminate churn matches the UAF where parent VM is freed while children are still starting
  2. Worker create+terminate cycle aborts process after ~100k–900k iterations on macOS arm64 #30421 - Worker create+terminate cycle abort matches both the UAF pattern and the flush_logs panic / VMTraps assertion paths
  3. Bun terminates script with "abort" when an error is thrown in an async worker onmessage handler #20911 - Blob URL worker aborting on async error matches the flush_logs panic path fixed by the JsError::Terminated handling

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #31880
Fixes #30421
Fixes #20911

🤖 Generated with Claude Code

@robobun
Copy link
Copy Markdown
Collaborator Author

robobun commented Jun 7, 2026

Checked all three against this PR's builds before adding any Fixes lines. None of them verify, so I am not claiming them:

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 7, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Stops and waits for nested child workers during parent shutdown (step 3.5), refactors termination into a filtered terminate-and-wait helper, changes futex wake semantics to wake-all, clears stale JSC termination exceptions during shutdown/log flush, and adds two regression tests for termination lifetimes.

Changes

Nested worker termination and shutdown guarantees

Layer / File(s) Summary
Termination lifetime contract and documentation
src/jsc/VirtualMachine.rs, src/jsc/web_worker.rs
Updated documentation to define that parent VM remains valid through termination-and-wait of nested children, and that nested workers are stopped when the parent shuts down. Covers parent_vm() lifetime, file header docs, parent VM validity, shutdown sequence, and related field documentation.
Futex wake and live-worker coordination
src/jsc/web_worker.rs
Changed futex wake calls in live_workers::register and live_workers::mark_exited to wake-all so multiple concurrent waiters re-sweep when workers register or exit.
Filtered termination helper and matching logic
src/jsc/web_worker.rs
Introduced terminate_and_wait(parent_filter, timeout_ms) helper and terminate_children_and_wait(parent_vm, timeout_ms) variant. Added matching-counter tracking and optional parent-VM filtering to the termination sweep loop, allowing selective termination of nested children by parent reference.
Shutdown path integration and exception cleanup
src/jsc/web_worker.rs
Integrated terminate_children_and_wait() into WebWorker::shutdown before VM teardown (shutdown step 3.5). Changed JSC termination cleanup to clear stale termination exceptions on the global object via clear_termination_exception().
Log flushing robustness and error handling
src/jsc/web_worker.rs
Hardened flush_logs by clearing termination exceptions before log conversion and reworking error mapping: explicitly skip error dispatch for JsError::Terminated, safely report exceptions from JsError::Thrown instead of panicking, preventing crashes during termination-path error reporting.
Regression tests for termination lifetimes
test/js/web/workers/worker-terminate-lifetime.test.ts
Added two regression tests: one verifying nested worker termination does not cause use-after-free when children are still spawning, another verifying termination during child entry-resolution failures (revoked blob URLs) does not crash the error path. Both assert clean process exit.

Possibly related issues

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main use-after-free bug being fixed: a terminated worker with children still starting up.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description provides comprehensive coverage of what the fix does and includes detailed verification details with regression tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/jsc/web_worker.rs Outdated
The filtered wait derives its done-condition from the list sweep
(matching) but passed a futex expected value loaded after the mutex was
released. If the last matching child ran unlink() and mark_exited() in
that gap, matching stayed stale at nonzero while OUTSTANDING already
equaled the loaded value, so Futex::wait slept through the already-fired
wake until the timeout. Loading OUTSTANDING before releasing the mutex
guarantees every counted child's decrement is still ahead of the
expected value, so the wait either observes the change or is woken.
Comment thread src/jsc/web_worker.rs
terminate_children_and_wait means the main thread and any number of
worker parents can wait on OUTSTANDING at the same time. A futex waiter
queued with a stale expected value is only released by a wake or its
timeout, so waking a single waiter can hand the event to a thread whose
own condition is unmet while the one that could make progress stays
asleep until its deadline (e.g. a three-level tree where the
grandchild's exit wakes the grandparent instead of the middle worker).
Wake all waiters in register() and mark_exited(); each re-sweeps under
the mutex and re-checks its own condition, and the waiter count is
bounded by the worker-tree depth.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/jsc/web_worker.rs (1)

419-436: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Wait on a monotonic wake sequence, not OUTSTANDING itself.

Between Line 420 and Line 436, OUTSTANDING can go n → n-1 → n before this thread actually enters Futex::wait(): one worker exits, a mid-WebWorker__create worker registers, both wake() calls fire, and the counter is back at the original n. At that point Futex::wait(&OUTSTANDING, n, ...) will sleep on a stale value, so the newly registered worker is never re-swept and never gets requested_terminate = true. If the wait then times out, the caller can still tear down the parent/main VM while that worker is in start_vm().

Possible direction
 pub(super) static OUTSTANDING: AtomicU32 = AtomicU32::new(0);
+pub(super) static WAKE_SEQ: AtomicU32 = AtomicU32::new(0);

 pub(super) fn register(worker: *mut WebWorker) {
     MUTEX.lock();
     ...
     OUTSTANDING.fetch_add(1, Ordering::Release);
-    Futex::wake(&OUTSTANDING, u32::MAX);
+    WAKE_SEQ.fetch_add(1, Ordering::AcqRel);
+    Futex::wake(&WAKE_SEQ, u32::MAX);
     MUTEX.unlock();
 }

 pub(super) fn mark_exited() {
     OUTSTANDING.fetch_sub(1, Ordering::Release);
-    Futex::wake(&OUTSTANDING, u32::MAX);
+    WAKE_SEQ.fetch_add(1, Ordering::AcqRel);
+    Futex::wake(&WAKE_SEQ, u32::MAX);
 }

-        let n = live_workers::OUTSTANDING.load(Ordering::Acquire);
+        let n = live_workers::OUTSTANDING.load(Ordering::Acquire);
+        let seq = live_workers::WAKE_SEQ.load(Ordering::Acquire);
         live_workers::MUTEX.unlock();
         ...
-        let _ = Futex::wait(&live_workers::OUTSTANDING, n, Some(deadline_ns - elapsed));
+        let _ = Futex::wait(&live_workers::WAKE_SEQ, seq, Some(deadline_ns - elapsed));

As per coding guidelines, "Rust code: fix the whole bug class in the same PR."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/jsc/web_worker.rs` around lines 419 - 436, The wait is currently sleeping
on the mutable counter live_workers::OUTSTANDING (via
Futex::wait(&live_workers::OUTSTANDING, n, ...)), which can change n→n-1→n and
cause a stale-wait race; instead introduce and wait on a monotonic
sequence/version (e.g., live_workers::WAKE_SEQ or similar) that you increment
whenever workers are added/removed or when requested_terminate is toggled, read
that sequence under live_workers::MUTEX, compute the done condition the same way
(using parent_filter or outstanding==0), then call
Futex::wait(&live_workers::WAKE_SEQ, seq, Some(...)) so wakes are not lost (also
update all corresponding wake() calls to futex::wake(&live_workers::WAKE_SEQ)
when registering/unregistering workers or setting requested_terminate, and
ensure WebWorker__create and any code that toggles requested_terminate
increments the sequence while holding live_workers::MUTEX).

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/jsc/web_worker.rs`:
- Around line 419-436: The wait is currently sleeping on the mutable counter
live_workers::OUTSTANDING (via Futex::wait(&live_workers::OUTSTANDING, n, ...)),
which can change n→n-1→n and cause a stale-wait race; instead introduce and wait
on a monotonic sequence/version (e.g., live_workers::WAKE_SEQ or similar) that
you increment whenever workers are added/removed or when requested_terminate is
toggled, read that sequence under live_workers::MUTEX, compute the done
condition the same way (using parent_filter or outstanding==0), then call
Futex::wait(&live_workers::WAKE_SEQ, seq, Some(...)) so wakes are not lost (also
update all corresponding wake() calls to futex::wake(&live_workers::WAKE_SEQ)
when registering/unregistering workers or setting requested_terminate, and
ensure WebWorker__create and any code that toggles requested_terminate
increments the sequence while holding live_workers::MUTEX).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7892402e-59aa-4913-a414-fb0fff0d8faf

📥 Commits

Reviewing files that changed from the base of the PR and between b6b37a7 and 9e0980f.

📒 Files selected for processing (1)
  • src/jsc/web_worker.rs

OUTSTANDING as the futex word is ABA-prone: one worker exit plus one
registration in the gap between a waiter's snapshot and its Futex::wait
restores the exact expected value while both wakes fire with no waiter
queued, so the wait sleeps through events it needed to re-sweep for
(a newly registered worker then misses requested_terminate until the
deadline forces a re-sweep). Add WAKE_SEQ, bumped on register() and
mark_exited(), and wait on that; it only increments, so it cannot alias.
The sequence is snapshotted under the registry mutex before the sweep
and the OUTSTANDING load, so any event invalidating those inputs changes
the futex word afterwards.
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 src/jsc/web_worker.rs:420-428 — nit: this timeout log still says terminateAllAndWait and prints the global n = OUTSTANDING, but the function is now also reached via terminate_children_and_wait from shutdown() step 3.5 — in that path the relevant figure is matching (the global count always includes the caller itself plus unrelated workers). Consider branching on parent_filter to emit a distinct message with matching, or at least genericizing the prefix.

    Extended reasoning...

    What this is

    This PR refactored terminate_all_and_wait into a shared terminate_and_wait(parent_filter, timeout_ms) with a new filtered caller, terminate_children_and_wait, invoked from every exiting worker's shutdown() step 3.5. The timeout-path debug log was not updated to reflect the generalization:

    if elapsed >= deadline_ns {
        log!("terminateAllAndWait: timed out with {} outstanding", n);
        return;
    }

    When reached via the filtered path, this is misleading on two axes:

    1. Prefix names the wrong entry point — it says terminateAllAndWait even when the caller is terminate_children_and_wait from a worker's shutdown.
    2. Reports the wrong countn is the global OUTSTANDING count, which in the filtered path always includes the caller worker itself (it unlinks after step 3.5 returns) plus any unrelated workers in the process. The figure the filtered wait actually conditions on is matching.

    Concrete example

    Take Main → M → G, with M terminated and G stuck somewhere past the cooperative checkpoints for 10 s:

    • M's step 3.5 sweep finds G: matching = 1.
    • n = OUTSTANDING.load() under the mutex → n = 2 (M itself + G; M is still registered until after step 3.5).
    • 10 s elapses → log emits terminateAllAndWait: timed out with 2 outstanding.

    A developer with BUN_DEBUG_Worker=1 enabled who is chasing a per-worker child-wait stall sees a message that (a) names the main-thread global-exit sweep and (b) reports 2 outstanding when only 1 child of M is actually blocking the wait. If there were sibling worker trees in the process, n would be larger still while matching stayed at 1.

    Why this isn't entirely academic

    This is a hidden scoped debug log (define_scoped_log!(log, Worker, hidden) — only emits when the Worker debug scope is explicitly enabled via env var) on a 10 s timeout path, so there is zero functional impact and it is invisible to users. The counter-argument that this is below the nit threshold is reasonable.

    That said, this PR's own review history went through two rounds of debugging timeout stalls in exactly this code path (the lost-wakeup snapshot race fixed in b6b37a7, and the wake-one-vs-multiple-waiters issue fixed in 9e0980f — the latter explicitly cited a 10 s deadline stall on the debian-13 ASAN job). Anyone debugging the next stall here with the Worker scope enabled will land on this line, and "terminateAllAndWait timed out with N" pointing at the wrong sweep with an inflated count is a small but real speed bump. Since the PR is what generalized the function, updating the one diagnostic it contains to match seems worth a one-line follow-up.

    Suggested fix

    if elapsed >= deadline_ns {
        match parent_filter {
            Some(_) => log!("terminateChildrenAndWait: timed out with {} matching ({} outstanding)", matching, n),
            None => log!("terminateAllAndWait: timed out with {} outstanding", n),
        }
        return;
    }

    (or just genericize the prefix to terminate_and_wait and print both counts).

The timeout log always named terminateAllAndWait and printed the global
OUTSTANDING count, which on the filtered path includes the calling
worker itself plus unrelated workers; matching is the number the wait
actually conditions on.
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 src/jsc/web_worker.rs:710-714 — The parallel nested-worker comment in create() (line 631) still says "parent is itself a worker, not joined on exit" — the PR updated this rationale at the file header, the parent field doc, and set_ref(), but missed this 4th site. With step 3.5 in place, worker parents do terminate-and-wait for children, so this wording now contradicts the rest of the file; it should get the same update set_ref() did.

    Extended reasoning...

    What the issue is

    This PR updates the nested-worker keepalive rationale to reflect that worker parents now terminate-and-wait for their children via the new shutdown() step 3.5. It updates that rationale at three sites:

    • the file header (lines 54-60): "Nested workers ARE stopped when their WORKER parent tears down: shutdown() step 3.5 terminates and waits for them…"
    • the parent field doc (lines 88-94): "When the parent is itself a worker, its shutdown() step 3.5 terminates us and waits for our unlink() before freeing its VM…"
    • the set_ref() comment (lines 710-714): changed from "worker parents aren't joined on exit" to "keeping the parent's loop alive avoids the child being terminated by the parent's natural exit (shutdown() step 3.5)"

    But the parallel comment in create() was missed:

    // src/jsc/web_worker.rs:628-632
    // Keep the parent's event loop alive until the close task releases this.
    // If the user passed `{ ref: false }` we skip — they've opted out of the
    // worker keeping the process alive. Exception: a nested worker (parent is
    // itself a worker, not joined on exit) must hold the parent-loop keepalive
    // regardless, because the child holds a non-owning `BackRef` to the parent VM.

    Why it's now wrong

    The phrase "not joined on exit" was the old rationale: before this PR, a worker parent did not wait for its children, so the keepalive was the only thing standing between the child's BackRef and a freed parent VM. After this PR, step 3.5 does wait for children before freeing the parent VM — so "not joined on exit" is no longer true, and the implied UAF-prevention rationale is no longer the operative one. The keepalive's purpose for nested workers is now to keep the parent's loop alive so the child isn't terminated by the parent's natural exit (exactly what the updated set_ref() comment says).

    Step-by-step

    1. Pre-PR: file header documented "nested workers are not stopped when their WORKER parent's context tears down" → both create() and set_ref() said "not joined on exit" → consistent.
    2. PR adds step 3.5 (terminate_children_and_wait) → worker parents now DO wait for children.
    3. PR updates file header, parent field doc, and set_ref() to match → 3 of 4 sites consistent.
    4. create() at line 631 still says "not joined on exit" → contradicts the other 3 sites and the actual behavior.

    Impact

    Pure documentation inconsistency — no runtime effect. But these comments document a memory-safety invariant (the parent-VM BackRef lifetime), and a future reader who lands on create() first would form the wrong model of why the nested-worker keepalive exists and what guarantees it provides.

    Fix

    Give create() the same wording set_ref() got, e.g.:

    // … Exception: a nested worker (parent is itself a worker) must hold the
    // parent-loop keepalive regardless: the child holds a non-owning `BackRef`
    // to the parent VM, and keeping the parent's loop alive avoids the child
    // being terminated by the parent's natural exit (`shutdown()` step 3.5).

Comment thread src/jsc/web_worker.rs
Each child is a full JSC VM; 12 children x 4 rounds per test overloaded
saturated ASAN runners into the test timeout. 6 children x 3 rounds
still crashed the unfixed build on every verification run (the race
window spans the whole child VM startup), and the fixed suite drops to
about 2.3s per test locally.

Also update the one remaining create() comment that still described
worker parents as not waiting for their children on exit.
Comment thread src/jsc/web_worker.rs
The debian-13 x64-asan lane runs tests with detect_leaks=1 and
BUN_DESTRUCT_VM_ON_EXIT=1 unless the file is listed in
test/no-validate-leaksan.txt. Worker teardown intentionally leaks when a
parent context is gone before the close task runs (the thread-held
Worker ref and detached thread bookkeeping documented in
src/jsc/web_worker.rs), and the new regression tests terminate workers
while nested children are starting, which is exactly that window.
Reproduced on a local release-asan build with the lane environment: a
flaky 32-byte direct leak report in the spawned fixture's stderr failed
the test about one run in five. worker.test.ts and worker_blob.test.ts
are already listed for the same reason. Leak exclusion does not affect
the tests' crash coverage; AddressSanitizer UAF detection stays on.
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four of my earlier findings (snapshot race, wake-one vs wake-all, timeout log, stale create() comment) are addressed and the current WAKE_SEQ approach looks sound, but this reworks the cross-thread worker shutdown/futex protocol enough that a human should sign off.

Extended reasoning...

Overview

This PR fixes a use-after-free where a terminated worker frees its VirtualMachine while nested children are still inside start_vm() reading it via BackRef. The fix adds step 3.5 to WebWorker::shutdown(): terminate_children_and_wait(vm_ptr, 10_000) terminates direct children and futex-waits for each to unlink from the live-workers registry before the parent VM is freed. Supporting changes: terminate_all_and_wait is refactored into a generic terminate_and_wait(parent_filter, timeout); a new monotonic WAKE_SEQ futex word replaces waiting on OUTSTANDING (avoids ABA where exit+register restores the expected value); both register() and mark_exited() now wake-all since multiple threads can wait concurrently; flush_logs and shutdown() step 2 now clear stale TerminationExceptions and handle JsError::Terminated/Thrown instead of panic!. Two regression tests are added, and the test file is added to the LeakSanitizer exclusion list.

Security risks

None identified. This is internal lifetime/synchronization machinery for worker teardown; no auth, crypto, untrusted input parsing, or external surface.

Level of scrutiny

High. The change sits at the intersection of cross-thread raw-pointer lifetime management (extensive unsafe, BackRef validity invariants documented in load-bearing comments), a futex wait/wake protocol that this PR substantially redesigns (new WAKE_SEQ word, wake-one → wake-all, snapshot-under-mutex ordering), and the worker shutdown() sequence whose step ordering is a documented hard constraint. Earlier review rounds on this PR surfaced two real concurrency bugs (lost-wakeup race; wrong-waiter with wake-one) that were fixed, which underscores the subtlety. The 10s timeout fallback bounds any remaining protocol error to a stall rather than a UAF, but that's still user-visible.

Other factors

All four of my prior inline findings are resolved in the current diff. Since my last review the author landed 50fff74 (the WAKE_SEQ monotonic-sequence redesign — a stronger fix than the one I suggested) and c02cb19 (LeakSanitizer exclusion for the new test, with a documented rationale matching the file-header's intentional-leak note). The robobun CI comment for c99f30c shows the new test failing on x64-asan and an unrelated bunx.test.ts failure across all platforms; c02cb19 should address the former, but I haven't seen green CI for HEAD. Given the scope — new synchronization primitive, modified shutdown ordering, and a leaksan exclusion — this warrants human review rather than auto-approval.

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four of my earlier findings have been addressed and I didn't spot anything new in the WAKE_SEQ rework, but this is cross-thread futex synchronization on the worker VM teardown path — worth a human sign-off.

Extended reasoning...

Overview

This PR fixes a use-after-free in nested Web Worker teardown by adding a terminate_children_and_wait step (3.5) to WebWorker::shutdown(), so a worker parent terminates and waits for its own children before freeing the VirtualMachine they back-reference. To support this it refactors terminate_all_and_wait into a shared terminate_and_wait(parent_filter, timeout) helper, replaces the futex word with a monotonic WAKE_SEQ counter (avoiding ABA on OUTSTANDING), switches register()/mark_exited() to wake-all, and hardens flush_logs() / the shutdown path against stale JSC TerminationExceptions. Two regression tests and a LeakSan exclusion are added; the VirtualMachine.rs change is doc-only.

Security risks

None in the conventional sense (no auth/crypto/input-handling). The change is memory-safety-relevant though: it edits the cross-thread lifetime contract that justifies every BackRef<VirtualMachine> deref in start_vm(), and the ordering of steps in shutdown() is load-bearing for soundness.

Level of scrutiny

High. This is core runtime concurrency code: an intrusive linked list walked under a process-global mutex, futex wait/wake protocol with a newly-introduced sequence word, and a multi-step teardown where each step is a barrier for the next. The PR already went through four review iterations here (lost-wakeup snapshot race, wake-one vs wake-all with multiple waiters, timeout-log accuracy, stale doc comment) — all fixed — which itself signals the subtlety. The final WAKE_SEQ design looks correct to me (snapshot taken under MUTEX before the sweep; every state change bumps it and wakes all), but a human reviewer familiar with this file's history should confirm the step-3.5 placement (after JSC teardown, before unlink()/dispatchExit) and the 10-second bounded-wait fallback are acceptable.

Other factors

  • All four of my prior inline comments are resolved by commits on this branch, and the current bug-hunting pass found nothing new.
  • Regression tests reproduce the original UAF and the flush_logs panic on unfixed builds per the PR description.
  • The no-validate-leaksan.txt addition is justified (intentional leak documented in the file header) and matches existing exclusions for worker.test.ts/worker_blob.test.ts.
  • I'm not approving because the change is neither simple nor mechanical and sits squarely on a critical, hard-to-test code path.

@robobun
Copy link
Copy Markdown
Collaborator Author

robobun commented Jun 7, 2026

CI status for reviewers: everything this PR touches is green. The worker test suites pass on all lanes, including debian-13 x64-asan after the LeakSanitizer exclusion (that lane's earlier failures were a flaky 32-byte leak report from the intentional worker-teardown leak, reproduced and verified locally on a release-asan build). The remaining red is unrelated: bunx.test.ts fails on every OS in every recent build of this branch, including builds whose diff was a text file or an empty commit, and the two single-platform failures in build 61283 (sql-mysql-bind-blob-borrow on alpine, install migration complex-workspace on ubuntu 25.04) are in suites this PR does not touch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant