Skip to content

fix(codex): force real token refresh on 401 and retry transient refresh failures#587

Open
Bowl42 wants to merge 2 commits into
mainfrom
fix/codex-refresh-token
Open

fix(codex): force real token refresh on 401 and retry transient refresh failures#587
Bowl42 wants to merge 2 commits into
mainfrom
fix/codex-refresh-token

Conversation

@Bowl42

@Bowl42 Bowl42 commented May 31, 2026

Copy link
Copy Markdown
Collaborator

问题

Codex adapter 的 getAccessToken 没有 forceRefresh 路径。收到 401 时它只清空内存 cache,然后重新读取持久化的 access_token——也就是上游刚刚拒绝的那个凭据——于是重试又撞上同一个 401。刷新失败时还会回退到这个失效 token,并且没有任何重试,也不区分永久性错误(refresh_token_reused / invalid_grant)。

参考 CLIProxyAPI(internal/auth/codex/openai_auth.goRefreshTokensWithRetry / isNonRetryableRefreshErr)以及同仓库已正确实现的 Claude adapter(getAccessToken(ctx, forceRefresh bool))对齐修复。

改动

internal/adapter/provider/codex/oauth.go

  • 新增 RefreshAccessTokenWithRetry(线性退避,支持 ctx 取消)
  • 新增 isNonRetryableRefreshErr(对 refresh_token_reused / invalid_grant / invalid_request / invalid_client 快速失败)

internal/adapter/provider/codex/adapter.go

  • getAccessToken(ctx)getAccessToken(ctx, forceRefresh bool):forceRefresh 时跳过 cache 和持久化 token,且刷新失败不回退被拒 token
  • 新增 refreshMu 串行化刷新(避免并发 refresh_token 复用),取锁后二次复查 cache
  • 401 重试路径改用 getAccessToken(ctx, true)
  • access token TTL 加 >= 1s 下限保护

internal/service/codex_task.go / internal/handler/codex.go

  • 配额路径统一改用 RefreshAccessTokenWithRetry,与请求路径保持一致

测试

  • go build ./internal/...
  • go vet ./internal/adapter/provider/codex/... ./internal/service/... ./internal/handler/...
  • go test ./internal/adapter/provider/codex/... ./internal/service/... ./internal/handler/... ✅(全部 ok)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • 新增刷新互斥锁以序列化同一账户/refresh_token 的刷新;新增 provider 重载注入点与 SetProviderReloadFunc。
    • 新增带重试与线性退避的 RefreshAccessTokenWithRetry,并在刷新成功后更改过期计算与最佳努力持久化行为。
    • 将 provider 更新回调类型调整以支持路由侧注入。
  • Bug Fixes

    • 引入“强制刷新”与锁内双重检查,避免复用被拒绝或短期过期令牌;强制刷新失败不再回退持久化令牌;统一内存缓存与 60s 缓冲策略,改进重试/并发稳定性。
  • Tests

    • 增加刷新锁、适配器注入与重载相关单元测试。

@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

为 Codex 引入带重试的刷新函数、进程级刷新互斥锁;重构适配器令牌复用与强制刷新流程(锁内重读 provider、二次检查、重试刷新并持久化);调用方在 quota/定时任务路径按锁串行化刷新并处理失败。

Changes

Codex 令牌重试与并发控制

Layer / File(s) Summary
令牌重试函数与不可重试错误检测
internal/adapter/provider/codex/oauth.go
新增 RefreshAccessTokenWithRetry(ctx, refreshToken, maxRetries)isNonRetryableRefreshErr,实现线性退避重试、context 取消处理及对不可重试错误的快速失败。
全进程刷新互斥锁与测试
internal/adapter/provider/codex/refresh_lock.go, internal/adapter/provider/codex/refresh_lock_test.go
新增进程内锁注册表、AcquireRefreshLock(key) func()RefreshLockKey(accountID,refreshToken);添加锁行为单元测试(键格式、互斥、清理、幂等、不同键不阻塞)。
适配器强制刷新与双重检查逻辑
internal/adapter/provider/codex/adapter.go
重构 getAccessToken(ctx, forceRefresh, rejectedToken)Execute/WarmToken 改为显式非强制调用;收到 401 时使用强制刷新且绕过被拒绝凭据复用;刷新在锁内通过 providerReload 重读 prov 并调用 RefreshAccessTokenWithRetry(...,3),成功后更新内存并对锁内快照做最佳努力持久化。
处理器与定时任务的刷新调用替换
internal/handler/codex.go, internal/service/codex_task.go
在 quota 获取与定时刷新路径中按 AccountID/RefreshToken 获取刷新锁并在锁内二次检查 token;需刷新时调用 RefreshAccessTokenWithRetry(...,3),成功则更新并持久化 provider,失败时解锁并跳过该 provider。
路由注入与适配器接线测试
internal/router/router.go, internal/adapter/provider/codex/reload_wiring_test.go
路由注入新增对 SetProviderReloadFunc 能力的检测并注入回调;新增适配器 wiring 测试验证注入回调被设置并按预期返回 provider 快照。

Sequence Diagram

sequenceDiagram
  participant Client
  participant Execute
  participant getAccessToken
  participant AcquireLock
  participant RefreshWithRetry
  participant CodexAPI
  participant Store

  Client->>Execute: 发起请求
  Execute->>getAccessToken: getAccessToken(ctx, false, "")
  getAccessToken-->>Execute: 若缓存/持久化令牌在缓冲期内则返回令牌
  alt 需要刷新
    getAccessToken->>AcquireLock: AcquireRefreshLock(key)
    AcquireLock-->>getAccessToken: 锁已获得
    getAccessToken->>RefreshWithRetry: 调用 RefreshAccessTokenWithRetry(refreshToken,3)
    RefreshWithRetry->>CodexAPI: 发起刷新请求(可能多次重试)
    CodexAPI-->>RefreshWithRetry: 返回新令牌或错误
    RefreshWithRetry-->>Store: 更新缓存与持久化(若成功)
    getAccessToken->>AcquireLock: Unlock()
  end
  Execute->>CodexAPI: 使用令牌调用 API
  alt API 返回 401
    CodexAPI-->>Execute: 401 Unauthorized
    Execute->>getAccessToken: getAccessToken(ctx, true, 被拒绝token)
    getAccessToken->>AcquireLock: AcquireRefreshLock(key)
    getAccessToken->>RefreshWithRetry: 强制刷新(绕过被拒凭据复用)
    RefreshWithRetry->>CodexAPI: 刷新请求
    CodexAPI-->>RefreshWithRetry: 返回新令牌
    RefreshWithRetry-->>Store: 更新
    getAccessToken->>AcquireLock: Unlock()
    Execute->>CodexAPI: 使用新令牌重试请求
  end
Loading

估计代码审查工作量

🎯 4 (Complex) | ⏱️ ~45 分钟

可能相关的 PRs

建议审查者

  • awsl233777
  • ymkiux
  • whhjdi

🐰 小兔报喜:
锁护令牌免重争,三次退避稳刷新,
401 呼唤强制换,快照锁内再复审。

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed 标题准确总结了本次变更的主要目标:在收到 401 响应时强制执行真实的 token 刷新,并针对可重试的刷新失败进行重试机制
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/codex-refresh-token

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/adapter/provider/codex/adapter.go (2)

322-342: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

锁内仍然沿用了锁前读取的 refresh_token 快照。

这里拿到全局刷新锁后只重查了 a.tokenCache,但 Line 342 真正用于刷新的是锁前的 config.RefreshToken。如果等待锁期间是 quota handler / task 完成了刷新并轮换了 refresh_token,这个旧 adapter 实例并不会像 internal/handler/codex.go 那样在锁内重读 provider,于是仍可能拿已经失效的 refresh_token 去调刷新,refresh_token_reused 会在跨路径并发下继续出现。建议在锁内重新获取最新 provider 配置,或引入能把最新 refresh token 同步回当前 adapter 的共享状态。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/provider/codex/adapter.go` around lines 322 - 342, After
acquiring the refresh lock via AcquireRefreshLock(RefreshLockKey(...)) you
currently still call RefreshAccessTokenWithRetry using the pre-lock snapshot
config.RefreshToken; instead, inside the locked section re-read the
authoritative refresh token from the provider/shared state (e.g., reload
a.provider or a.provider.Config or call the same lookup used by
internal/handler/codex.go) or sync the latest token into a.tokenCache before
calling RefreshAccessTokenWithRetry, and then pass that fresh token to
RefreshAccessTokenWithRetry (and update a.tokenCache/a.provider with the new
token on success) so you never attempt refresh with a stale config.RefreshToken.

365-404: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要把适配器内存态的 token 更新绑在 providerUpdate 上。

providerUpdate 在这个类型里是可选的,但现在 config.AccessTokenconfig.ExpiresAtconfig.RefreshToken 只会在它非空时更新。这样一旦本次刷新返回了新的 refresh_token,没有回调的 adapter 下一次过期后仍会继续使用旧 token 刷新,等于把轮换语义丢掉了。内存态应始终先更新,再按需持久化。

建议修改
-	// Persist token to database if update function is set
-	if a.providerUpdate != nil {
-		config.AccessToken = tokenResp.AccessToken
-		config.ExpiresAt = expiresAt.Format(time.RFC3339)
-		if tokenResp.RefreshToken != "" {
-			config.RefreshToken = tokenResp.RefreshToken
-		}
+	config.AccessToken = tokenResp.AccessToken
+	config.ExpiresAt = expiresAt.Format(time.RFC3339)
+	if tokenResp.RefreshToken != "" {
+		config.RefreshToken = tokenResp.RefreshToken
+	}
+
+	// Persist token to database if update function is set
+	if a.providerUpdate != nil {
 		if tokenResp.IDToken != "" {
 			if claims, parseErr := ParseIDToken(tokenResp.IDToken); parseErr == nil && claims != nil {
 				if v := strings.TrimSpace(claims.GetAccountID()); v != "" {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/provider/codex/adapter.go` around lines 365 - 404, The code
currently only updates in-memory token fields when a.providerUpdate is non-nil,
which loses refresh_token rotation for adapters without a callback; change the
logic in the refresh handling (the block that references config, tokenResp,
expiresAt, ParseIDToken and a.providerUpdate) so that you always update the
in-memory config.AccessToken, config.ExpiresAt, and—if
present—config.RefreshToken and all ID token-derived fields (claims.Email, Name,
Picture, AccountID, UserID, PlanType, SubscriptionStart, SubscriptionEnd) before
checking providerUpdate; then, if a.providerUpdate != nil, call
a.providerUpdate(a.provider) to persist and log any error (same best-effort
behavior), leaving ParseIDToken usage and error handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/handler/codex.go`:
- Around line 628-632: The call to h.svc.GetByID does not exist on the service
interface and causes typecheck failure; replace it with the existing service
method used elsewhere in this file (e.g. h.svc.GetProvider or
h.svc.GetProviders) so the code compiles. Locate the block using
codex.AcquireRefreshLock and change the call to use the same service method
signature/return values used by other paths in this file (match variable names
and nil checks to the existing GetProvider/GetProviders return types),
preserving the logic that refreshes provider and config when a fresh record is
returned.
- Around line 646-653: The UpdateProvider call's error is ignored which can drop
a rotated refresh_token; change the block after setting config
(AccessToken/ExpiresAt/RefreshToken) to check the error returned by
h.svc.UpdateProvider(provider) and treat it as a refresh failure: if
UpdateProvider returns an error, do not proceed as if refresh succeeded —
log/return the error (propagate it from the current refresh function), avoid
releasing the lock as if success until cleanup, and ensure any adapter cache
refresh that depends on persistence is not considered done; reference the
h.svc.UpdateProvider call, the local config/refresh_token assignment, and the
unlock() call to implement this control flow change.

---

Outside diff comments:
In `@internal/adapter/provider/codex/adapter.go`:
- Around line 322-342: After acquiring the refresh lock via
AcquireRefreshLock(RefreshLockKey(...)) you currently still call
RefreshAccessTokenWithRetry using the pre-lock snapshot config.RefreshToken;
instead, inside the locked section re-read the authoritative refresh token from
the provider/shared state (e.g., reload a.provider or a.provider.Config or call
the same lookup used by internal/handler/codex.go) or sync the latest token into
a.tokenCache before calling RefreshAccessTokenWithRetry, and then pass that
fresh token to RefreshAccessTokenWithRetry (and update a.tokenCache/a.provider
with the new token on success) so you never attempt refresh with a stale
config.RefreshToken.
- Around line 365-404: The code currently only updates in-memory token fields
when a.providerUpdate is non-nil, which loses refresh_token rotation for
adapters without a callback; change the logic in the refresh handling (the block
that references config, tokenResp, expiresAt, ParseIDToken and a.providerUpdate)
so that you always update the in-memory config.AccessToken, config.ExpiresAt,
and—if present—config.RefreshToken and all ID token-derived fields
(claims.Email, Name, Picture, AccountID, UserID, PlanType, SubscriptionStart,
SubscriptionEnd) before checking providerUpdate; then, if a.providerUpdate !=
nil, call a.providerUpdate(a.provider) to persist and log any error (same
best-effort behavior), leaving ParseIDToken usage and error handling intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84b83891-9c9c-4f86-a9fb-7d7646b1815b

📥 Commits

Reviewing files that changed from the base of the PR and between e934c89 and 641fbc9.

📒 Files selected for processing (3)
  • internal/adapter/provider/codex/adapter.go
  • internal/adapter/provider/codex/refresh_lock.go
  • internal/handler/codex.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Backend Checks
  • GitHub Check: Frontend Checks
  • GitHub Check: e2e
  • GitHub Check: playwright
🧰 Additional context used
🪛 golangci-lint (2.12.2)
internal/handler/codex.go

[error] 629-629: : # github.com/awsl-project/maxx/internal/handler [github.com/awsl-project/maxx/internal/handler.test]
internal/handler/codex.go:629:28: h.svc.GetByID undefined (type *service.AdminService has no field or method GetByID)

(typecheck)

🔇 Additional comments (1)
internal/adapter/provider/codex/refresh_lock.go (1)

9-42: LGTM!

Comment thread internal/handler/codex.go Outdated
Comment thread internal/handler/codex.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/router/router.go`:
- Around line 582-588: The function injectProviderUpdate currently references an
undefined variable p; change its signature to injectProviderUpdate(a
provider.ProviderAdapter, p *domain.Provider) and update its internal use of p
(used to capture tenantID and id and to create the reload closure via
repo.GetByID) so the provider pointer is supplied rather than relying on a
nonexistent outer variable; then update the two call sites that invoke
injectProviderUpdate—inside InitAdapters (the loop at the previous line ~113)
and inside RefreshAdapter (the call at the previous line ~129)—to pass the
corresponding *domain.Provider instance into injectProviderUpdate, ensuring
types match provider.ProviderAdapter and *domain.Provider and keeping the
providerReloader/SetProviderReloadFunc logic unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3db96733-dbfc-4c85-8645-af22cfad0b25

📥 Commits

Reviewing files that changed from the base of the PR and between 641fbc9 and fecba1c.

📒 Files selected for processing (7)
  • internal/adapter/provider/codex/adapter.go
  • internal/adapter/provider/codex/refresh_lock.go
  • internal/adapter/provider/codex/refresh_lock_test.go
  • internal/adapter/provider/codex/reload_wiring_test.go
  • internal/handler/codex.go
  • internal/router/router.go
  • internal/service/codex_task.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/handler/codex.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: e2e
  • GitHub Check: playwright
🧰 Additional context used
🪛 GitHub Actions: PR Checks / 0_Backend Checks.txt
internal/router/router.go

[error] 582-582: go vet failed: internal/router/router.go:582:42: undefined: p

🪛 GitHub Actions: PR Checks / Backend Checks
internal/router/router.go

[error] 582-582: go vet reported undefined identifier 'p' at internal/router/router.go:582:42. Step: go vet ./cmd/... ./internal/...

🪛 GitHub Check: Backend Checks
internal/router/router.go

[failure] 584-584:
undefined: p


[failure] 582-582:
undefined: p

🪛 golangci-lint (2.12.2)
internal/router/router.go

[error] 582-582: : # github.com/awsl-project/maxx/internal/router
internal/router/router.go:582:42: undefined: p
internal/router/router.go:584:19: undefined: p

(typecheck)

🔇 Additional comments (11)
internal/adapter/provider/codex/refresh_lock_test.go (1)

9-73: LGTM!

internal/adapter/provider/codex/reload_wiring_test.go (1)

9-48: LGTM!

internal/adapter/provider/codex/adapter.go (5)

57-64: LGTM!

Also applies to: 81-85


133-133: LGTM!

Also applies to: 215-215, 267-267


279-315: LGTM!


341-373: LGTM!


375-454: LGTM!

internal/adapter/provider/codex/refresh_lock.go (3)

8-19: LGTM!


35-59: LGTM!


68-76: LGTM!

internal/service/codex_task.go (1)

150-185: LGTM!

Comment thread internal/router/router.go
@Bowl42

Bowl42 commented May 31, 2026

Copy link
Copy Markdown
Collaborator Author

多方多轮 Review 总结

对本 PR 进行了多方、多轮 review(本地 Claude + 本地 Codex + CodeRabbit),逐轮修复。最终 go build ./internal/... / go vet / go test / go test -race(codex 包)全部通过。

各方结论

Reviewer 状态
CodeRabbit 首轮(commit a2087138)"No actionable comments 🎉"。其后额度耗尽(usage credits run out),本会话无法再触发复审。
本地 Claude (Opus) 最终轮 NO ACTIONABLE FINDINGS(独立验证:无死锁、无锁泄漏、无指针竞争、wiring 正确)。
本地 Codex (gpt-5.5) 多轮中发现并修复多个真实 bug;最终仅余 1 项 pre-existing 竞争(见下「已知后续项」)。

修复要点(在原始 401 刷新修复之上,经 review 迭代)

  1. per-account 刷新锁refresh_lock.go):进程级、按账户(ChatGPT account ID,回退 refresh_token)加锁,带 refcount 空闲驱逐、sync.Once 幂等 unlock。OpenAI 每次刷新轮换 refresh_token 并拒绝复用,故所有可能并发刷新同一账户的路径必须串行。
  2. 覆盖全部 5 条 refresh_token 轮换路径:请求适配器 401 重试、后台配额任务、批量配额 handler、单 provider 用量、provider-info 刷新——全部「加锁 → 锁内重读最新 provider → 复用或刷新一次 → 持久化」。
  3. 修复死的 token 持久化 wiringProviderUpdateFunc/ProviderReloadFunc 改为 type alias,使 router 的 duck-typed 断言真正匹配(此前 defined type 静默失配,刷新后的 token 从未落库——main 上即存在的潜在 bug)。新增 reload_wiring_test.go 回归覆盖。
  4. 适配器锁内重读 provider:等锁期间别的路径可能已轮换并持久化新 token,重读后采用,避免用过期 refresh_token 再次刷新。
  5. 消除 a.provider 指针竞争:getAccessToken 改用局部快照,不再重写共享字段。
  6. isNonRetryableRefreshErr 收窄到 refresh_token_reused/invalid_grant;TTL 钳制;移除重复 60s buffer。

已验证安全(曾被怀疑、经核实不成立)

  • 无死锁:锁内调用的 repo.GetByID/ValidateRefreshToken/RefreshAccessTokenWithRetry 均不重入该锁;refreshLocksMu/entry.mutokenMu/刷新锁均无持有顺序倒置。
  • 无锁泄漏:每处加锁点每条分支恰好 unlock 一次(sync.Once 兜底)。

已知后续项(建议单独 PR)

本地 Codex 指出 adapter.goconfig := ensureCodexConfig(prov) 仍指向共享provider.Config.Codex:锁外对 token 字段的读取与另一 goroutine 锁内的写入存在 field 级 data race。这是先于本 PR 即存在的竞争类型(Execute/applyCodexHeaders 等多处都无锁读取该共享结构),彻底修复需「锁内 deep-copy provider + 同步 Execute 的字段读取」的更大重构,超出本 PR「修复 codex refresh token」范围,记为后续项。本 PR 已将指针竞争与刷新串行化处理到位,相对 main 是明确改进。

🤖 Generated with Claude Code

The Codex adapter never truly refreshed on a 401: it cleared the in-memory
cache then re-read the persisted access_token — the same credential the upstream
had just rejected — so the retry hit the same 401, and refresh failures fell back
to that dead token. There was no retry, no handling of permanent errors, and
concurrent refreshes could reuse a rotated refresh_token (refresh_token_reused).

Aligns with CLIProxyAPI and the sibling Claude adapter:

- oauth.go: add RefreshAccessTokenWithRetry (linear backoff, ctx-aware) and
  isNonRetryableRefreshErr (fast-fail on refresh_token_reused / invalid_grant).
- adapter.go: getAccessToken(ctx, forceRefresh, rejectedToken) forces a real
  refresh on 401, never falling back to the rejected token; works against a local
  provider snapshot (no a.provider mutation race); re-reads the freshest provider
  under the lock to adopt a token another path just rotated.
- refresh_lock.go: process-wide per-account refresh lock (keyed by ChatGPT
  account ID, falling back to refresh_token) with refcount eviction and an
  idempotent unlock. OpenAI rotates the refresh_token on every refresh and
  rejects reuse, so all paths that may refresh the same account serialize.
- All five refresh_token rotation paths now take the lock and re-read under it:
  request adapter, quota task, batch-quota handler, per-provider usage, and
  provider-info refresh.
- router.go: inject a provider-reload callback (repo.GetByID) alongside the
  existing update hook. ProviderUpdateFunc/ProviderReloadFunc are type aliases so
  the router's duck-typed assertions match — previously the defined ProviderUpdateFunc
  silently failed the assertion, leaving refreshed tokens unpersisted.
- handler/codex.go, service/codex_task.go: quota paths use the retry helper and
  persist rotated refresh tokens.
- tests: refresh_lock_test.go (key derivation, mutual exclusion, eviction,
  idempotent unlock) and reload_wiring_test.go (router wiring type assertions).

Reviewed across multiple rounds (local Claude + local Codex + CodeRabbit).
go build ./internal/... / vet / test / -race all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Bowl42 Bowl42 force-pushed the fix/codex-refresh-token branch from 3061bf3 to 210e2be Compare May 31, 2026 16:36
The cached ProviderRepository hands the same *domain.Provider pointer to every
caller, and the request hot path reads token fields off provider.Config.Codex
without a lock (Execute reads AccountID; getAccessToken's fast path reads
AccessToken/ExpiresAt/RefreshToken). The refresh paths mutated that shared
struct in place under the refresh lock — which serializes writers but not the
lock-free readers — a data race flagged in review.

Switch every token-persisting path to copy-on-write: clone the provider
(CloneForTokenPersist), mutate the clone, and persist it via repo.Update, which
atomically swaps the cache pointer. Readers holding the old pointer see a
consistent (if briefly stale) struct; no reader and writer ever touch the same
struct. Applied in the adapter (refresh + fallback), the quota task, and all
three handler paths (RefreshProviderInfo, GetProviderUsage, GetBatchQuotas);
each post-persist read is repointed at the clone.

Tests: persist_test.go covers clone isolation, nil-safety, and a concurrent
persist-vs-read scenario that is clean under -race. go build / vet / test /
-race all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant