Bug: TSMP/ICMP callback leak in tailnet causes steady memory growth in long-lived heads #25380

## Summary

`coderd` head pods exhibit steady, linear memory growth on long-uptime processes. Heap profiles attribute the growth to `tailnet.(*Conn).pingWithType` and related `wgengine` ping allocations, but live goroutine counts in the `wgengine` and `magicsock` packages remain in the low tens. The retainer is an upstream `tailscale.com/wgengine` bug that was identified and fixed upstream on 2025-12-02 but has not yet been picked into Coder's `coder/tailscale` fork.

## Upstream context

* Upstream issue (root cause, identified by Brad Fitzpatrick): [https://github.com/tailscale/tailscale/issues/18112](<https://github.com/tailscale/tailscale/issues/18112>)
* Upstream fix (2-line patch, merged 2025-12-02): [https://github.com/tailscale/tailscale/pull/18113](<https://github.com/tailscale/tailscale/pull/18113>)
* Upstream merge commit: [https://github.com/tailscale/tailscale/commit/b8c58ca7c1a49fb772d095c65693cdab06488047](<https://github.com/tailscale/tailscale/commit/b8c58ca7c1a49fb772d095c65693cdab06488047>)

From the upstream issue text:

> While auditing some tangential code, I happened to notice that TSMP+ICMP ping callbacks in `wgengine.userspaceEngine` leak. I verified with some logging that the callback map size just grows forever upon success and only cleans up after itself on failed pings.

## Root cause

In `wgengine/userspace.go`, the `pongCallback` and `icmpEchoResponseCallback` maps in `userspaceEngine` are populated when a TSMP or ICMP ping is initiated. They are only deleted on the **failure** path (`cb == nil`). On the **success** path — i.e. when a pong/ICMP response arrives — the callback is invoked but **never removed from the map**.

Each retained callback closure pins:

* The `*ipnstate.PingResult` buffer
* The associated `time.Timer` from the ping timeout
* The destination `netip.Addr` (and its string form)
* The `setTSMPPongCallback` callback wrapper

## Why this hits Coder specifically

Coder uses a fork at `github.com/coder/tailscale`, currently pinned via `replace tailscale.com => github.com/coder/tailscale` in `coder/coder` `go.mod`. The fork's `wgengine/userspace.go` does not contain the upstream fix — the `OnTSMPPongReceived` and `OnICMPEchoResponseReceived` handlers invoke their callbacks but never call `delete(...)` on the corresponding map entry.

In `coderd`, a long-lived singleton `ServerTailnet` (`coderd/tailnet.go`) handles every workspace-agent dial from the head process. Every dial calls `conn.AwaitReachable(ctx)`, which spawns (per the existing in-code comment) 10–20 parallel TSMP pings on an exponential-backoff ticker. Every successful ping leaves callback state pinned in the upstream map. Over the process lifetime, retained allocations dominate the heap.

## Reproduction signal

On a coderd head pod after several days of uptime under workspace-dial traffic (AI tasks, web terminals, port-forwarding, VS Code Coder Extension, etc.):

```
curl -s http://localhost:6060/debug/pprof/heap > heap.out
go tool pprof -top -cum heap.out
go tool pprof heap.out
(pprof) peek pingWithType
```

Expected signature:

```
      flat  flat%   sum%        cum   cum%
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  tailnet.(*Conn).pingWithType
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  wgengine.(*userspaceEngine).Ping
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  wgengine.(*userspaceEngine).sendTSMPPing
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  time.newTimer
  XXX.XXMB  XX.XX% XX.XX%   XXX.XXMB XX.XX%  net/netip.Addr.string6
   XX.XXMB   X.XX% XX.XX%    XX.XXMB  X.XX%  wgengine.(*userspaceEngine).setTSMPPongCallback
```

`pprof peek pingWithType` shows \~100% of allocations attributed to `tailnet.(*Conn).AwaitReachable.func1`. Live `go_goroutines` remains flat or near-flat while `process_resident_memory_bytes` climbs linearly.

A copy of Canva's support bundle which includes pprof data is also included below.

[coder-support-1779082181.zip](https://uploads.linear.app/e62091d9-44f5-421c-8e5c-df481fc99003/e484fd2d-a6ff-4096-ba2a-d3e2bf0a28c2/85d81587-1fc3-41f5-b0a5-a15bc3ea7524?signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJwYXRoIjoiL2U2MjA5MWQ5LTQ0ZjUtNDIxYy04ZTVjLWRmNDgxZmM5OTAwMy9lNDg0ZmQyZC1hNmZmLTQwOTYtYmEyYS1kM2UyYmYwYTI4YzIvODVkODE1ODctMWZjMy00MWY1LWIwYTUtYTE1YmMzZWE3NTI0IiwiaWF0IjoxNzc5MjAyMDEwLCJleHAiOjE4MTA3NzI1NzB9.bQ8XffEsv5d_HBn9fXPKRYDp-A16zIExJiiSPUjZ4zM)

## Proposed remediation

### 1\. Cherry-pick the upstream fix into `coder/tailscale`

Apply the 2-line diff from [https://github.com/tailscale/tailscale/pull/18113](<https://github.com/tailscale/tailscale/pull/18113>) to `wgengine/userspace.go`:

```diff
@@ OnTSMPPongReceived @@
 if cb != nil {
+    delete(e.pongCallback, pong.Data)
     go cb(pong)
 }

@@ OnICMPEchoResponseReceived @@
 if cb == nil {
     return false
 }
+delete(e.icmpEchoResponseCallback, idSeq)
 e.logf("wgengine: got diagnostic ICMP response %02x", idSeq)
 go cb()
```

A PR that applies this fix and adds a test has been opened: coder/tailscale#122.

### 2\. Tag and bump

* Tag a new `coder/tailscale` version after merge.
* Bump the `replace tailscale.com => github.com/coder/tailscale ...` pin in `coder/coder` `go.mod`.

### 3\. Backport

Backport the `go.mod` bump to active Coder release branches:

* `release/2.29`
* `release/2.30`
* `release/2.31`
* `release/2.32`
* `release/2.33`

### 4\. Optional follow-ups

* Cross-reference [https://github.com/coder/coder/issues/14881](<https://github.com/coder/coder/issues/14881>). The long-standing "coder pods running out of memory" issue may have this leak as a contributing cause, so it is worth re-checking against affected customers after the fix lands.
* Audit other `coder/tailscale` packages (`wgengine/`, `wgengine/magicsock/`, `derp/`, `net/netcheck/`) for upstream bug fixes that landed after the most recent rebase of the fork. The leak fixed in tailscale/tailscale#18113 was found by upstream while auditing tangential code; there may be similar quietly fixed leaks that have not yet been cherry-picked.
* Consider establishing a periodic upstream-sync process for `coder/tailscale` to reduce the risk of similar gaps.

## Impact assessment

All Coder deployments share this code path. Severity scales with:

1. Pod uptime between restarts (the longer, the worse).
2. Volume of workspace-agent dials per pod (the higher, the worse). Workloads that increase dial volume: AI tasks, AI agents (chatd, 2.32+), web terminals, port-forwarding, VS Code Coder Extension, JetBrains plugins, `coder ssh`, MCP agent connections.
3. Network stability between coderd and workspace agents (less stable → more pings per `AwaitReachable` invocation → higher allocation rate).
4. Container memory limits (tighter → OOM sooner).

Deployments with frequent restarts (Helm upgrades, node recycling, autoscaling churn) may have masked this entirely. Deployments with long-uptime heads will see it.

## Severity

Recommend treating as `s2` / customer-visible bug with a clear, low-risk upstream fix already available. The 2-line cherry-pick is minimal risk; the backport across active release lines covers exposure.
