Skip to content

fix(coderd/externalauth): detect concurrent refresh race to prevent cache poisoning#24228

Merged
f0ssel merged 2 commits into
coder:mainfrom
altana-ai:fix/external-auth-refresh-race
May 4, 2026
Merged

fix(coderd/externalauth): detect concurrent refresh race to prevent cache poisoning#24228
f0ssel merged 2 commits into
coder:mainfrom
altana-ai:fix/external-auth-refresh-race

Conversation

@jasonwbarnett
Copy link
Copy Markdown
Contributor

@jasonwbarnett jasonwbarnett commented Apr 9, 2026

Fixes #17069

Builds on #24332 and #24334 which addressed token persistence and rate limit handling.

Problem

When multiple concurrent requests race to refresh an expiring external auth token, providers with single-use refresh tokens (e.g., GitHub Apps) reject all but the first refresh attempt with bad_refresh_token. The losing request caches this transient error in the oauth_refresh_failure_reason database column and clears the refresh token, blocking all subsequent refresh attempts until the user manually re-authenticates.

This is common for users with multiple terminals, IDE connections, or workspaces open, all of which poll the external auth endpoint and trigger concurrent refreshes when the token nears expiry. Database analysis showed 5 of 7 affected users failed within 5-10 seconds of token expiry, matching the Go oauth2 library's expiryDelta window.

Fix

Before caching a bad_refresh_token failure, re-read the external auth link from the database. If the refresh token has changed (indicating a concurrent caller already refreshed successfully), return the winner's updated link instead of writing a failure. An empty-string guard ensures a token cleared by another loser isn't mistaken for a winner's successful refresh.

@github-actions github-actions Bot added the community Pull Requests and issues created by the community. label Apr 9, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 9, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@jasonwbarnett jasonwbarnett force-pushed the fix/external-auth-refresh-race branch from 8845515 to f89a424 Compare April 9, 2026 21:33
@jasonwbarnett jasonwbarnett marked this pull request as ready for review April 9, 2026 21:34
@jasonwbarnett
Copy link
Copy Markdown
Contributor Author

jasonwbarnett commented Apr 10, 2026

We built a custom release binary from a fork of v2.31.9 with these two fix commits cherry-picked on top and deployed it to our production environment. Release: https://github.com/altana-ai/coder/releases/tag/v2.31.9-fix

We reproduced the race condition using the same method as before (setting oauth_expiry = NOW() + interval '5 seconds' and firing 10 concurrent requests). With the fix applied, the database remains clean after the race — no cached bad_refresh_token failure, and the refresh token is preserved. Before the fix, the same test consistently poisoned the database.

Before fix After fix
oauth_refresh_failure_reason bad_refresh_token (poisoned) empty (clean)
refresh_token_cleared true (lost forever) false (preserved)
Token usable after race No — stuck until manual re-auth Yes — works immediately

Copy link
Copy Markdown
Contributor

@geokat geokat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR - thank you for submitting!

Comment thread coderd/externalauth/externalauth.go
Comment thread coderd/externalauth/externalauth.go Outdated
@jasonwbarnett jasonwbarnett force-pushed the fix/external-auth-refresh-race branch from 94cb3cf to f984ef3 Compare April 15, 2026 22:23
jasonwbarnett added a commit to altana-ai/coder that referenced this pull request Apr 15, 2026
…rom request context

Address review feedback from geokat on coder#24228:

1. Add empty-string check to concurrent refresh detection: a refresh
   token cleared by another loser (empty string) should not be treated
   as a successful concurrent refresh by a winner.

2. Use context.WithoutCancel for post-refresh work: once a single-use
   refresh token has been consumed, the validation and DB persistence
   must complete even if the caller's HTTP request context is cancelled.
   Use a 10-second timeout detached from the parent context.

3. Persist the new token to the database BEFORE validation: the refresh
   token has already been consumed by the provider, so if validation
   fails (network error, rate limit, context cancellation), the new
   token would be lost forever. Persist first, then validate.

Refs coder#17069

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jasonwbarnett jasonwbarnett requested a review from geokat April 15, 2026 22:25
jasonwbarnett added a commit to altana-ai/coder that referenced this pull request Apr 15, 2026
…rom request context

Address review feedback from geokat on coder#24228:

1. Add empty-string check to concurrent refresh detection: a refresh
   token cleared by another loser (empty string) should not be treated
   as a successful concurrent refresh by a winner.

2. Use context.WithoutCancel for post-refresh work: once a single-use
   refresh token has been consumed, the validation and DB persistence
   must complete even if the caller's HTTP request context is cancelled.
   Use a 10-second timeout detached from the parent context.

3. Persist the new token to the database BEFORE validation: the refresh
   token has already been consumed by the provider, so if validation
   fails (network error, rate limit, context cancellation), the new
   token would be lost forever. Persist first, then validate.

Refs coder#17069

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jasonwbarnett
Copy link
Copy Markdown
Contributor Author

Cut a new release (v2.31.9-fix2) incorporating the review feedback and shipped it to production. This includes the empty-string guard, context.WithoutCancel, and persist-before-validate changes. Will report back if there are any issues.

@gorangasic
Copy link
Copy Markdown

Hi @jasonwbarnett, we're also seeing intermittent GitHub auth failures at my workplace and your PR looks quite promising. I see that you've already rebased on top of #24332. Do you think it's ready for another review?

@geokat would you be the best person to review it? Or should we pull in @johnstcn and @hugodutka?

@jasonwbarnett
Copy link
Copy Markdown
Contributor Author

Hi @jasonwbarnett, we're also seeing intermittent GitHub auth failures at my workplace and your PR looks quite promising. I see that you've already rebased on top of #24332. Do you think it's ready for another review?

@gorangasic Yes, I think it's ready for another review.

I'm not sure if they want to merge #24334 first or not; it's still in draft.

@mafredri
Copy link
Copy Markdown
Member

Hey @jasonwbarnett, really appreciate the contribution and sorry about your PR getting stuck in limbo (I independently discovered this issue because it was starting to block my work). Both #24332 and #24334 are now merged, but the concurrent race detection in your PR was a nice find and not covered by mine. Are you still up for working on this and rebasing on main? Happy to take a look if you have get the chance. 👍🏻

@jasonwbarnett jasonwbarnett force-pushed the fix/external-auth-refresh-race branch from f7a1de2 to 7789460 Compare April 29, 2026 12:13
@jasonwbarnett
Copy link
Copy Markdown
Contributor Author

@mafredri rebased and pushed. Let me know if there are any issues. Great work in #24332 and #24334 🎉

@jasonwbarnett jasonwbarnett changed the title fix(coderd/externalauth): prevent concurrent token refresh from poisoning cache fix(coderd/externalauth): detect concurrent refresh race to prevent cache poisoning Apr 29, 2026
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revert the ctx change, but otherwise LGTM! Thanks!

Comment thread coderd/externalauth/externalauth.go Outdated
…ache poisoning

When multiple concurrent requests race to refresh an expiring external
auth token, providers with single-use refresh tokens (e.g., GitHub Apps)
reject all but the first refresh attempt with "bad_refresh_token". The
losing request was caching this transient error in the database, which
cleared the refresh token and blocked all subsequent refresh attempts
until the user manually re-authenticated.

Before caching a refresh failure, re-read the external auth link from
the database. If the refresh token has changed (indicating a concurrent
caller already refreshed successfully), return the winner's updated
token instead of writing a failure. An empty-string guard ensures a
token cleared by another loser is not mistaken for a winner's
successful refresh.

Fixes coder#17069

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jasonwbarnett jasonwbarnett force-pushed the fix/external-auth-refresh-race branch from 2723885 to e3e87b5 Compare April 29, 2026 13:01
@jasonwbarnett
Copy link
Copy Markdown
Contributor Author

@mafredri let's go! 🚀

Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@f0ssel f0ssel merged commit da6e708 into coder:main May 4, 2026
26 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 4, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

backport cherry-pick community Pull Requests and issues created by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: github external auth intermittently fails to refresh token

5 participants