Bug: TSMP/ICMP callback leak in tailnet causes steady memory growth in long-lived heads #25380

@bjornrobertsson

Summary

coderd head pods exhibit steady, linear memory growth in long-running processes. Heap profiles attribute the growth to tailnet.(*Conn).pingWithType and related wgengine ping allocations, but live goroutine counts in the wgengine and magicsock packages remain in the low tens. The retainer is an upstream tailscale.com/wgengine bug that was identified and fixed upstream on 2025-12-02 but has not yet been cherry-picked into Coder's coder/tailscale fork.

Upstream context

From upstream issue text:

While auditing some tangential code, I happened to notice that TSMP+ICMP ping callbacks in wgengine.userspaceEngine leak. I verified with some logging that the callback map size just grows forever upon success and only cleans up after itself on failed pings.

Root cause

In wgengine/userspace.go, the pongCallback and icmpEchoResponseCallback maps in userspaceEngine are populated when a TSMP or ICMP ping is initiated. Entries are deleted only on the failure path (cb == nil). On the success path, i.e. when a pong/ICMP response arrives, the callback is invoked but its entry is never removed from the map.

Each retained callback closure pins:

  • The *ipnstate.PingResult buffer
  • The associated time.Timer from the ping timeout
  • The destination netip.Addr (and its string form)
  • The setTSMPPongCallback callback wrapper
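
The retention pattern can be reproduced in isolation. The following is a minimal sketch, not the tailscale code: pongRegistry, set, onPongLeaky and retainedAfter are hypothetical names, the 1 KiB buffer stands in for the state each closure pins, and the callback is invoked synchronously for simplicity.

```go
package main

import "fmt"

// pongRegistry is a hypothetical stand-in for the callback map on
// userspaceEngine; it is not the real tailscale type.
type pongRegistry struct {
	callbacks map[[8]byte]func()
}

func newPongRegistry() *pongRegistry {
	return &pongRegistry{callbacks: make(map[[8]byte]func())}
}

// set stores a callback, or deletes the entry when cb is nil
// (assumed here to model the failure/timeout path).
func (r *pongRegistry) set(id [8]byte, cb func()) {
	if cb == nil {
		delete(r.callbacks, id)
		return
	}
	r.callbacks[id] = cb
}

// onPongLeaky models the success path described above: the callback
// runs, but its map entry is left behind.
func (r *pongRegistry) onPongLeaky(id [8]byte) {
	if cb := r.callbacks[id]; cb != nil {
		cb()
	}
}

// retainedAfter registers n pings, completes each one on either the
// success or the timeout path, and returns how many entries remain.
func retainedAfter(n int, succeed bool) int {
	r := newPongRegistry()
	for i := 0; i < n; i++ {
		id := [8]byte{byte(i), byte(i >> 8)} // unique for n < 65536
		buf := make([]byte, 1024)            // state pinned by the closure
		r.set(id, func() { _ = buf })
		if succeed {
			r.onPongLeaky(id)
		} else {
			r.set(id, nil) // timeout path cleans up
		}
	}
	return len(r.callbacks)
}

func main() {
	fmt.Println("retained after 1000 successes:", retainedAfter(1000, true))
	fmt.Println("retained after 1000 timeouts: ", retainedAfter(1000, false))
}
```

In this sketch every successful ping leaves one entry behind, while timeouts leave none, which matches the upstream observation that the map only cleans up after failed pings.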

Why this hits Coder specifically

Coder uses a fork at github.com/coder/tailscale, currently pinned via replace tailscale.com => github.com/coder/tailscale in coder/coder go.mod. The fork's wgengine/userspace.go does not contain the upstream fix — the OnTSMPPongReceived and OnICMPEchoResponseReceived handlers invoke their callbacks but never call delete(...) on the corresponding map entry.

In coderd, a long-lived singleton ServerTailnet (coderd/tailnet.go) handles every workspace-agent dial from the head process. Every dial calls conn.AwaitReachable(ctx), which spawns (per the existing in-code comment) 10–20 parallel TSMP pings on an exponential-backoff ticker. Every successful ping leaves callback state pinned in the upstream map. Over the process lifetime, retained allocations dominate the heap.

Reproduction signal

On a coderd head pod after several days of uptime under workspace-dial traffic (AI tasks, web terminals, port-forwarding, VS Code Coder Extension, etc.):

curl -s http://localhost:6060/debug/pprof/heap > heap.out
go tool pprof -top -cum heap.out
go tool pprof heap.out
(pprof) peek pingWithType

Expected signature:

      flat  flat%   sum%        cum   cum%
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  tailnet.(*Conn).pingWithType
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  wgengine.(*userspaceEngine).Ping
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  wgengine.(*userspaceEngine).sendTSMPPing
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  time.newTimer
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  net/netip.Addr.string6
   XX.XXMB   X.XX% XX.XX%    XX.XXMB  X.XX%  wgengine.(*userspaceEngine).setTSMPPongCallback

pprof peek pingWithType shows ~100% of allocations attributed to tailnet.(*Conn).AwaitReachable.func1. Live go_goroutines remains flat or near-flat while process_resident_memory_bytes climbs linearly.
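
The "memory climbs while goroutines stay flat" signal can also be checked mechanically. Below is a rough sketch that uses in-process runtime counters as a stand-in for the process_resident_memory_bytes and go_goroutines metrics; sample, takeSample, leakSignature and the thresholds are all hypothetical, and the check only compares the first and last sample rather than fitting a line.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sample is one observation of heap usage and goroutine count.
type sample struct {
	heapInuse  uint64
	goroutines int
}

// takeSample reads the current process's heap-in-use bytes and
// goroutine count from the Go runtime.
func takeSample() sample {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return sample{heapInuse: ms.HeapInuse, goroutines: runtime.NumGoroutine()}
}

// leakSignature reports whether the heap grew by at least minGrowth
// bytes between the first and last sample while the goroutine count
// moved by no more than maxGoroutineDelta.
func leakSignature(s []sample, minGrowth uint64, maxGoroutineDelta int) bool {
	if len(s) < 2 {
		return false
	}
	first, last := s[0], s[len(s)-1]
	if last.heapInuse < first.heapInuse+minGrowth {
		return false
	}
	d := last.goroutines - first.goroutines
	if d < 0 {
		d = -d
	}
	return d <= maxGoroutineDelta
}

func main() {
	// A real check would sample over hours or days; the interval here
	// is short only so the sketch terminates quickly.
	var samples []sample
	for i := 0; i < 3; i++ {
		samples = append(samples, takeSample())
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("leak signature:", leakSignature(samples, 64<<20, 5))
}
```

A goroutine leak would fail the second condition, so a positive result points toward retained heap objects such as the callback map entries rather than stuck goroutines.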

A copy of Canva's support bundle which includes pprof data is also included below.

coder-support-1779082181.zip

Proposed remediation

1. Cherry-pick the upstream fix into coder/tailscale

Apply the 2-line diff from tailscale/tailscale#18113 to wgengine/userspace.go:

@@ OnTSMPPongReceived @@
 if cb != nil {
+    delete(e.pongCallback, pong.Data)
     go cb(pong)
 }

@@ OnICMPEchoResponseReceived @@
 if cb == nil {
     return false
 }
+delete(e.icmpEchoResponseCallback, idSeq)
 e.logf("wgengine: got diagnostic ICMP response %02x", idSeq)
 go cb()

A PR with a test that aims to fix this has been opened: coder/tailscale#122.
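
The shape of a regression check for the delete-before-invoke behaviour can be sketched as follows. This is an illustration only, not the test from coder/tailscale#122: echoRegistry, register, onResponse and pending are hypothetical names, and the uint32 key merely stands in for idSeq.

```go
package main

import (
	"fmt"
	"sync"
)

// echoRegistry is a hypothetical callback table keyed by an ID/sequence
// value, guarded by a mutex.
type echoRegistry struct {
	mu        sync.Mutex
	callbacks map[uint32]func()
}

// register stores cb under idSeq, allocating the map lazily.
func (r *echoRegistry) register(idSeq uint32, cb func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.callbacks == nil {
		r.callbacks = make(map[uint32]func())
	}
	r.callbacks[idSeq] = cb
}

// onResponse removes the entry while holding the lock and only then
// invokes the callback, so a delivered response never leaves state
// behind. It reports whether a callback was found.
func (r *echoRegistry) onResponse(idSeq uint32) bool {
	r.mu.Lock()
	cb := r.callbacks[idSeq]
	delete(r.callbacks, idSeq)
	r.mu.Unlock()
	if cb == nil {
		return false
	}
	cb()
	return true
}

// pending returns the number of callbacks still registered.
func (r *echoRegistry) pending() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.callbacks)
}

func main() {
	r := &echoRegistry{}
	calls := 0
	for i := uint32(0); i < 100; i++ {
		r.register(i, func() { calls++ })
		r.onResponse(i)
	}
	fmt.Println("calls:", calls, "pending:", r.pending())
}
```

The property worth asserting is that the table is empty after every registered ping has received its response, and that a duplicate response finds no callback.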

2. Tag and bump

  • Tag a new coder/tailscale version after merge.
  • Bump the replace tailscale.com => github.com/coder/tailscale ... pin in coder/coder go.mod.

3. Backport

Backport the go.mod bump to active Coder release branches:

  • release/2.29
  • release/2.30
  • release/2.31
  • release/2.32
  • release/2.33

4. Optional follow-ups

  • Cross-reference #14881 ("Coder pods running out of memory"): this long-standing issue may have this leak as a contributing cause and is worth re-checking against affected customers after the fix lands.
  • Audit other coder/tailscale packages (wgengine/, wgengine/magicsock/, derp/, net/netcheck/) for upstream bug fixes that landed after the most recent rebase of the fork. tailscale/tailscale#18113 was discovered by upstream while auditing tangential code; there may be similar quietly fixed leaks not yet cherry-picked.
  • Consider establishing a periodic upstream-sync process for coder/tailscale to reduce the risk of similar gaps.

Impact assessment

All Coder deployments share this code path. Severity scales with:

  1. Pod uptime between restarts (the longer, the worse).
  2. Volume of workspace-agent dials per pod (the higher, the worse). Workloads that increase dial volume: AI tasks, AI agents (chatd, 2.32+), web terminals, port-forwarding, VS Code Coder Extension, JetBrains plugins, coder ssh, MCP agent connections.
  3. Network stability between coderd and workspace agents (less stable → more pings per AwaitReachable invocation → higher allocation rate).
  4. Container memory limits (tighter → OOM sooner).

Deployments with frequent restarts (Helm upgrades, node recycling, autoscaling churn) may have masked this entirely. Deployments with long-uptime heads will see it.

Severity

Recommend treating as s2 / customer-visible bug with a clear, low-risk upstream fix already available. The 2-line cherry-pick is minimal risk; the backport across active release lines covers exposure.
