fix(coderd/externalauth): detect concurrent refresh race to prevent cache poisoning#24228
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
8845515 to
f89a424
Compare
|
We built a custom release binary from a fork of We reproduced the race condition using the same method as before (setting
|
geokat
left a comment
There was a problem hiding this comment.
Great PR - thank you for submitting!
94cb3cf to
f984ef3
Compare
…rom request context Address review feedback from geokat on coder#24228: 1. Add empty-string check to concurrent refresh detection: a refresh token cleared by another loser (empty string) should not be treated as a successful concurrent refresh by a winner. 2. Use context.WithoutCancel for post-refresh work: once a single-use refresh token has been consumed, the validation and DB persistence must complete even if the caller's HTTP request context is cancelled. Use a 10-second timeout detached from the parent context. 3. Persist the new token to the database BEFORE validation: the refresh token has already been consumed by the provider, so if validation fails (network error, rate limit, context cancellation), the new token would be lost forever. Persist first, then validate. Refs coder#17069 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…rom request context Address review feedback from geokat on coder#24228: 1. Add empty-string check to concurrent refresh detection: a refresh token cleared by another loser (empty string) should not be treated as a successful concurrent refresh by a winner. 2. Use context.WithoutCancel for post-refresh work: once a single-use refresh token has been consumed, the validation and DB persistence must complete even if the caller's HTTP request context is cancelled. Use a 10-second timeout detached from the parent context. 3. Persist the new token to the database BEFORE validation: the refresh token has already been consumed by the provider, so if validation fails (network error, rate limit, context cancellation), the new token would be lost forever. Persist first, then validate. Refs coder#17069 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Cut a new release (v2.31.9-fix2) incorporating the review feedback and shipped it to production. This includes the empty-string guard, |
f984ef3 to
f7a1de2
Compare
|
Hi @jasonwbarnett, we're also seeing intermittent GitHub auth failures at my workplace and your PR looks quite promising. I see that you've already rebased on top of #24332. Do you think it's ready for another review? @geokat would you be the best person to review it? Or should we pull in @johnstcn and @hugodutka? |
@gorangasic Yes, I think it's ready for another review. I'm not sure if they want to merge #24334 first or not; it's still in draft. |
|
Hey @jasonwbarnett, really appreciate the contribution and sorry about your PR getting stuck in limbo (I independently discovered this issue because it was starting to block my work). Both #24332 and #24334 are now merged, but the concurrent race detection in your PR was a nice find and not covered by mine. Are you still up for working on this and rebasing on |
f7a1de2 to
7789460
Compare
mafredri
left a comment
There was a problem hiding this comment.
We should revert the ctx change, but otherwise LGTM! Thanks!
…ache poisoning When multiple concurrent requests race to refresh an expiring external auth token, providers with single-use refresh tokens (e.g., GitHub Apps) reject all but the first refresh attempt with "bad_refresh_token". The losing request was caching this transient error in the database, which cleared the refresh token and blocked all subsequent refresh attempts until the user manually re-authenticated. Before caching a refresh failure, re-read the external auth link from the database. If the refresh token has changed (indicating a concurrent caller already refreshed successfully), return the winner's updated token instead of writing a failure. An empty-string guard ensures a token cleared by another loser is not mistaken for a winner's successful refresh. Fixes coder#17069 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
2723885 to
e3e87b5
Compare
|
@mafredri let's go! 🚀 |
Fixes #17069
Builds on #24332 and #24334 which addressed token persistence and rate limit handling.
Problem
When multiple concurrent requests race to refresh an expiring external auth token, providers with single-use refresh tokens (e.g., GitHub Apps) reject all but the first refresh attempt with
bad_refresh_token. The losing request caches this transient error in theoauth_refresh_failure_reasondatabase column and clears the refresh token, blocking all subsequent refresh attempts until the user manually re-authenticates.This is common for users with multiple terminals, IDE connections, or workspaces open, all of which poll the external auth endpoint and trigger concurrent refreshes when the token nears expiry. Database analysis showed 5 of 7 affected users failed within 5-10 seconds of token expiry, matching the Go oauth2 library's
expiryDeltawindow.Fix
Before caching a
bad_refresh_tokenfailure, re-read the external auth link from the database. If the refresh token has changed (indicating a concurrent caller already refreshed successfully), return the winner's updated link instead of writing a failure. An empty-string guard ensures a token cleared by another loser isn't mistaken for a winner's successful refresh.