usockets: decouple dns_ready_head and closed_connecting_head lists#29099
Conversation
Updated 10:57 PM PT - Apr 9th, 2026
❌ @autofix-ci[bot], your commit 7e39306 has 3 failures.
🧪 To try this PR locally: `bunx bun-pr 29099`. That installs a local version of the PR; run it with `bun-29099 --bun`.
Status: defensive hardening, not a proven root-cause fix. Rebased; CI (build #44832): all build jobs pass on every platform, and all test lanes pass except Local (debug+ASAN), whose gate fails on container-local issues (release WebKit fetch). Kept as Draft: the analysis in the PR body is the deliverable; the diff turns the cross-function invariant into a structural guarantee.
This PR may be a duplicate of:
🤖 Generated with Claude Code
Not a duplicate of #29064; the CI build failures are from #29092.
Give us_connecting_socket_t a dedicated next_closed link for the deferred-free list instead of reusing the dns_ready_head next pointer, and make us_connecting_socket_free idempotent via a scheduled_for_free bit. Drain skips entries that are already scheduled for free. The two lists previously shared c->next, so enqueuing c for deferred free while it was still reachable via a dns_ready_head drain snapshot silently spliced the remaining drain entries onto the close list. The pending_resolve_callback invariant is supposed to prevent that overlap, but the v1.3.11 loop.c:238 SIGSEGV @ 0x0 (c->context == NULL in after_resolve) shows a recycled us_connecting_socket_t reaching the drain loop, and neither #29067 (context ref) nor #29068 (cancel) addresses that fault mode. This change makes the separation structural so a future invariant violation degrades to a skipped entry rather than a use-after-free.
Force-pushed 1e28538 to 7e39306
What

Give `us_connecting_socket_t` a dedicated `next_closed` link for the deferred-free list (`closed_connecting_head`) instead of reusing the `dns_ready_head` `next` pointer, make `us_connecting_socket_free` idempotent via a `scheduled_for_free` bit, and have the drain loop skip entries that are already scheduled for free.

Why
The v1.3.11 linux-x64-baseline crash signature is a SIGSEGV @ 0x0 at loop.c:238 (`c->context == NULL` dereferenced in `after_resolve`).
In the disassembly of the callee prologue, `c->context` (offset 8 in `us_connecting_socket_t`) reads as 0. Nothing ever writes `c->context = NULL` explicitly, so the block at `c` was freed and recycled by a subsequent `us_calloc`. #29067 added an extra context ref, which keeps the context alive but does nothing when `c` itself is the recycled block: the field is zero regardless. #29068 added `Bun__addrinfo_cancel`, which correctly refuses once `request.result` is set, so it doesn't change the guarded paths either.

Analysis of the `pending_resolve_callback` invariant

`us_connecting_socket_free(c)` is reachable from four sites, and at every one `c->pending_resolve_callback == 0`:

1. `after_resolve(c)`, `c->closed` branch: `pending` was cleared two lines earlier; `c` has just been dequeued from the drain snapshot.
2. `us_connecting_socket_close(c)`, non-pending branch: explicit `pending == 0`.
3. `us_connecting_socket_close(c)`, cancel-success branch: `Bun__addrinfo_cancel` only returns 1 while `request.result == NULL`, i.e. before the worker has snapshotted `notify` (both sides hold `global_cache.lock`), so `us_internal_dns_callback(c)` will never fire and `c` never reaches `dns_ready_head`.
4. `after_open(s)` success: `s` was created by `start_connections(c)`, called only from `after_resolve(c)` after it cleared `pending`.
`us_internal_dns_callback(c)` fires exactly once per `c` (`Bun__addrinfo_set` either enqueues directly or appends to `notify`, never both; `afterResult` moves `notify` out under the same lock). So `c` is never simultaneously on `closed_connecting_head` and reachable from a future `dns_ready_head` snapshot; the invariant holds through every 2/3/4-socket interleaving traced.

The production workload (fetch via proxy + WebSocket + spawn + AbortSignal, ~3 req/s, one crash per several process-hours) nevertheless hits it, so either there is an interleaving outside the traced set or another writer is corrupting `c->next`. Separately, `us_loop_run_bun_tick` has no re-entrancy guard, and `waitForPromise` / `HTMLRewriter.transform` / `expect().resolves` can nest it on the same loop from inside `on_connect_error`, which runs `free_closed_sockets` inside a drain. Even that path leaves the outer `next` untouched, though, because `next` is captured before `after_resolve(s)`.

The structural hazard this removes
The two lists previously shared `c->next`: `us_connecting_socket_free(c)` pushed `c` onto `closed_connecting_head` through that same pointer, so the correctness of the drain loop depended entirely on the invariant above. With a dedicated `next_closed` link, enqueuing `c` for deferred free can no longer truncate or splice the drain chain; `us_connecting_socket_free` becomes safely idempotent (a double call early-returns instead of double-unlinking and cycling the close list); and the drain loop skips anything already scheduled for free (valid memory until `free_closed_sockets` runs) instead of dereferencing a released `context`/`addrinfo_req`.

This turns a subtle cross-function invariant into a structural guarantee: even if a future change violates `pending_resolve_callback`, the failure mode is a skipped drain entry, not a recycled pointer in `after_resolve`.

Testing
`resolve-dns.test.ts` 71/71; `node-net/socket.test.ts` / `tcp-server.test.ts` unchanged vs the `USE_SYSTEM_BUN=1` baseline (failures there are environmental: proxy cert / ECONNREFUSED to external hosts). The new `dns-connecting-socket-stress.test.ts` (ASAN-only, 25k concurrent `net.connect("localhost")` + random-phase abort, `quarantine_size_mb=0`) runs clean on the patched build; it also ran clean on a pre-#29067 ASAN build, consistent with the field hit rate. No deterministic fail-before: the exact interleaving is not reproduced here. The stress test is included as a regression guard for the lifecycle rather than a gate.