Summary
coderd head pods exhibit steady, linear memory growth on long-uptime processes. Heap profiles attribute the growth to tailnet.(*Conn).pingWithType and related wgengine ping allocations, but live goroutine counts in the wgengine and magicsock packages remain in the low tens. The retainer is an upstream tailscale.com/wgengine bug that was identified and fixed upstream on 2025-12-02 but has not yet been cherry-picked into Coder's coder/tailscale fork.
Upstream context
From the upstream issue text:
While auditing some tangential code, I happened to notice that TSMP+ICMP ping callbacks in wgengine.userspaceEngine leak. I verified with some logging that the callback map size just grows forever upon success and only cleans up after itself on failed pings.
Root cause
In wgengine/userspace.go, the pongCallback and icmpEchoResponseCallback maps in userspaceEngine are populated when a TSMP or ICMP ping is initiated. They are only deleted on the failure path (cb == nil). On the success path — i.e. when a pong/ICMP response arrives — the callback is invoked but never removed from the map.
Each retained callback closure pins:
The *ipnstate.PingResult buffer
The associated time.Timer from the ping timeout
The destination netip.Addr (and its string form)
The setTSMPPongCallback callback wrapper
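The retention pattern described above can be modelled in a few lines of Go. This is a simplified sketch, not the wgengine code: engine, pingResult, ping, onPong, and simulate are hypothetical names, and the [8]byte map key is an assumption standing in for the ping payload that identifies a pending ping.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
	"sync"
	"time"
)

// pingResult is a hypothetical stand-in for *ipnstate.PingResult.
type pingResult struct {
	IP  string
	Buf [256]byte
}

// engine is a simplified stand-in for userspaceEngine; only the
// callback map is modelled.
type engine struct {
	mu           sync.Mutex
	pongCallback map[[8]byte]func()
}

// ping registers a callback whose closure captures the result buffer,
// the timeout timer, and the destination address.
func (e *engine) ping(id [8]byte, dst netip.Addr) {
	res := &pingResult{IP: dst.String()}
	timer := time.AfterFunc(time.Hour, func() {})
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pongCallback[id] = func() {
		timer.Stop()
		_ = res // kept alive for as long as the map entry exists
		_ = dst
	}
}

// onPong mirrors the buggy success path: the callback runs, but its
// map entry is never deleted.
func (e *engine) onPong(id [8]byte) {
	e.mu.Lock()
	cb := e.pongCallback[id]
	e.mu.Unlock()
	if cb != nil {
		cb()
	}
}

// retained reports how many callbacks the map still holds.
func (e *engine) retained() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return len(e.pongCallback)
}

// simulate runs n pings that all succeed and returns the number of
// callbacks still held afterwards.
func simulate(n int) int {
	e := &engine{pongCallback: make(map[[8]byte]func())}
	dst := netip.MustParseAddr("100.64.0.1")
	for i := 0; i < n; i++ {
		var id [8]byte
		binary.BigEndian.PutUint64(id[:], uint64(i))
		e.ping(id, dst)
		e.onPong(id)
	}
	return e.retained()
}

func main() {
	fmt.Println("retained callbacks after 1000 successful pings:", simulate(1000))
}
```

In this model every successful ping leaves exactly one entry behind, so 1000 successful pings leave 1000 retained closures, each keeping its captured values reachable.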
Why this hits Coder specifically
Coder uses a fork at github.com/coder/tailscale, currently pinned via replace tailscale.com => github.com/coder/tailscale in coder/coder's go.mod. The fork's wgengine/userspace.go does not contain the upstream fix: the OnTSMPPongReceived and OnICMPEchoResponseReceived handlers invoke their callbacks but never call delete(...) on the corresponding map entry.
In coderd, a long-lived singleton ServerTailnet (coderd/tailnet.go) handles every workspace-agent dial from the head process. Every dial calls conn.AwaitReachable(ctx), which spawns (per the existing in-code comment) 10–20 parallel TSMP pings on an exponential-backoff ticker. Every successful ping leaves callback state pinned in the upstream map. Over the process lifetime, retained allocations dominate the heap.
Reproduction signal
On a coderd head pod after several days of uptime under workspace-dial traffic (AI tasks, web terminals, port-forwarding, VS Code Coder Extension, etc.):
curl -s http://localhost:6060/debug/pprof/heap > heap.out
go tool pprof -top -cum heap.out
go tool pprof heap.out
(pprof) peek pingWithType
Expected signature:
pprof peek pingWithType shows ~100% of allocations attributed to tailnet.(*Conn).AwaitReachable.func1. Live go_goroutines remains flat or near-flat while process_resident_memory_bytes climbs linearly.
A copy of Canva's support bundle, which includes pprof data, is attached: coder-support-1779082181.zip
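The metric half of this signature (rising resident memory, flat goroutine count) can be checked mechanically against scraped samples. The following is a minimal sketch under stated assumptions: the sample type, slope, and matchesSignature are hypothetical helpers, the thresholds are illustrative, and the data points in main are made up for demonstration rather than taken from the support bundle.

```go
package main

import (
	"fmt"
	"math"
)

// sample is one scrape of the two metrics named above.
type sample struct {
	UnixSec    float64
	RSSBytes   float64 // process_resident_memory_bytes
	Goroutines float64 // go_goroutines
}

// slope returns the least-squares slope of ys over xs, computed around
// the means to stay numerically stable for large timestamps.
func slope(xs, ys []float64) float64 {
	if len(xs) < 2 || len(xs) != len(ys) {
		return 0
	}
	n := float64(len(xs))
	var mx, my float64
	for i := range xs {
		mx += xs[i]
		my += ys[i]
	}
	mx /= n
	my /= n
	var num, den float64
	for i := range xs {
		dx := xs[i] - mx
		num += dx * (ys[i] - my)
		den += dx * dx
	}
	if den == 0 {
		return 0
	}
	return num / den
}

// matchesSignature reports whether RSS grows by at least minRSSPerHour
// bytes per hour while the goroutine count drifts by no more than
// maxGoroutinesPerHour in either direction.
func matchesSignature(samples []sample, minRSSPerHour, maxGoroutinesPerHour float64) bool {
	ts := make([]float64, len(samples))
	rss := make([]float64, len(samples))
	gr := make([]float64, len(samples))
	for i, s := range samples {
		ts[i], rss[i], gr[i] = s.UnixSec, s.RSSBytes, s.Goroutines
	}
	rssPerHour := slope(ts, rss) * 3600
	grPerHour := slope(ts, gr) * 3600
	return rssPerHour >= minRSSPerHour && math.Abs(grPerHour) <= maxGoroutinesPerHour
}

func main() {
	// Illustrative samples only: about 10 MB/hour of RSS growth, flat goroutines.
	samples := []sample{
		{0, 1.00e9, 500},
		{3600, 1.01e9, 502},
		{7200, 1.02e9, 499},
	}
	fmt.Println(matchesSignature(samples, 5e6, 5))
}
```

A check like this only flags the symptom; the heap profile attribution is still needed to tie the growth to pingWithType.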
Proposed remediation
1. Cherry-pick the upstream fix into coder/tailscale
Apply the 2-line diff from tailscale/tailscale#18113 to wgengine/userspace.go:
@@ OnTSMPPongReceived @@
     if cb != nil {
+        delete(e.pongCallback, pong.Data)
         go cb(pong)
     }
@@ OnICMPEchoResponseReceived @@
     if cb == nil {
         return false
     }
+    delete(e.icmpEchoResponseCallback, idSeq)
     e.logf("wgengine: got diagnostic ICMP response %02x", idSeq)
     go cb()
A PR with a test that aims to fix this has been opened: coder/tailscale#122.
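The effect of the two added delete calls can be illustrated with a small remove-then-invoke registry. This is a sketch of the general technique, not the test from coder/tailscale#122: registry, add, take, and drainCheck are hypothetical names, and the callbacks run synchronously here whereas the real handlers start them with go.

```go
package main

import (
	"fmt"
	"sync"
)

// registry holds one-shot callbacks keyed by a ping identifier.
type registry struct {
	mu  sync.Mutex
	cbs map[uint64]func()
}

// add registers a callback for id.
func (r *registry) add(id uint64, cb func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cbs[id] = cb
}

// take removes and returns the callback for id, so the entry is
// released on the success path and can fire at most once.
func (r *registry) take(id uint64) (func(), bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	cb, ok := r.cbs[id]
	if ok {
		delete(r.cbs, id)
	}
	return cb, ok
}

// size reports how many callbacks are still registered.
func (r *registry) size() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.cbs)
}

// drainCheck registers n callbacks, delivers one response for each,
// and returns how many fired and how many entries remain.
func drainCheck(n int) (fired, remaining int) {
	r := &registry{cbs: make(map[uint64]func())}
	for i := 0; i < n; i++ {
		r.add(uint64(i), func() { fired++ })
	}
	for i := 0; i < n; i++ {
		if cb, ok := r.take(uint64(i)); ok {
			cb()
		}
	}
	return fired, r.size()
}

func main() {
	fired, remaining := drainCheck(1000)
	fmt.Println("fired:", fired, "remaining:", remaining)
}
```

Deleting the entry before invoking the callback means a duplicate response for the same identifier finds nothing to run, and the map returns to empty once every outstanding ping has been answered.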
2. Tag and bump
Tag a new coder/tailscale version after merge.
Bump the replace tailscale.com => github.com/coder/tailscale ... pin in coder/coder's go.mod.
3. Backport
Backport the go.mod bump to active Coder release branches:
release/2.29
release/2.30
release/2.31
release/2.32
release/2.33
4. Optional follow-ups
Cross-reference #14881 ("Coder pods running out of memory"): this leak may be a contributing cause of that long-standing issue, so it is worth re-checking against affected customers after the fix lands.
Audit other coder/tailscale packages (wgengine/, wgengine/magicsock/, derp/, net/netcheck/) for upstream bug fixes that landed after the most recent rebase of the fork. tailscale/tailscale#18113 was found by upstream while auditing tangential code; there may be similar quietly fixed leaks not yet cherry-picked.
Consider establishing a periodic upstream-sync process for coder/tailscale to reduce the risk of similar gaps.
Impact assessment
All Coder deployments share this code path. Severity scales with:
Pod uptime between restarts (the longer, the worse).
Volume of workspace-agent dials per pod (the higher, the worse). Workloads that increase dial volume: AI tasks, AI agents (chatd, 2.32+), web terminals, port-forwarding, VS Code Coder Extension, JetBrains plugins, coder ssh, MCP agent connections.
Network stability between coderd and workspace agents (less stable → more pings per AwaitReachable invocation → higher allocation rate).
Container memory limits (tighter → OOM sooner).
Deployments with frequent restarts (Helm upgrades, node recycling, autoscaling churn) may have masked this entirely. Deployments with long-uptime heads will see it.
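How these factors combine can be sketched as simple arithmetic: retained bytes grow at roughly dials per second × successful pings per dial × bytes pinned per callback, and the time to reach a memory limit is the remaining headroom divided by that rate. The sketch below is an estimate under loud assumptions: leakInputs and secondsToLimit are hypothetical names, and the dial rate, per-callback size, and headroom in main are placeholder values, not measurements. Only the figure of 10 pings per dial comes from the lower bound quoted earlier.

```go
package main

import (
	"fmt"
	"time"
)

// leakInputs captures the severity factors as numbers. All values are
// supplied by the caller; none are measured by this sketch.
type leakInputs struct {
	DialsPerSecond   float64 // workspace-agent dials handled by one pod
	PingsPerDial     float64 // successful pings per AwaitReachable call
	BytesPerCallback float64 // assumed bytes pinned per retained callback
}

// bytesPerSecond is the modelled growth rate of retained memory.
func (in leakInputs) bytesPerSecond() float64 {
	return in.DialsPerSecond * in.PingsPerDial * in.BytesPerCallback
}

// secondsToLimit returns how long it takes to consume headroomBytes at
// the modelled rate. ok is false when the rate is not positive.
func secondsToLimit(in leakInputs, headroomBytes float64) (seconds float64, ok bool) {
	rate := in.bytesPerSecond()
	if rate <= 0 {
		return 0, false
	}
	return headroomBytes / rate, true
}

func main() {
	// Placeholder inputs for illustration only.
	in := leakInputs{DialsPerSecond: 2, PingsPerDial: 10, BytesPerCallback: 512}
	headroom := float64(1 << 30) // 1 GiB between current RSS and the limit
	if s, ok := secondsToLimit(in, headroom); ok {
		d := time.Duration(s * float64(time.Second)).Round(time.Minute)
		fmt.Println("modelled time to limit:", d)
	}
}
```

With these placeholder inputs the model gives roughly 29 hours to consume 1 GiB. Halving the headroom or doubling the dial rate halves that time, which is why tighter limits and busier pods surface the problem sooner.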
Severity
Recommend treating this as an s2 / customer-visible bug with a clear, low-risk upstream fix already available. The 2-line cherry-pick carries minimal risk; the backport across active release lines covers exposure.