fix(codex): force real token refresh on 401 and retry transient refresh failures#587
fix(codex): force real token refresh on 401 and retry transient refresh failures#587Bowl42 wants to merge 2 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthrough为 Codex 引入带重试的刷新函数、进程级刷新互斥锁;重构适配器令牌复用与强制刷新流程(锁内重读 provider、二次检查、重试刷新并持久化);调用方在 quota/定时任务路径按锁串行化刷新并处理失败。 ChangesCodex 令牌重试与并发控制
Sequence DiagramsequenceDiagram
participant Client
participant Execute
participant getAccessToken
participant AcquireLock
participant RefreshWithRetry
participant CodexAPI
participant Store
Client->>Execute: 发起请求
Execute->>getAccessToken: getAccessToken(ctx, false, "")
getAccessToken-->>Execute: 若缓存/持久化令牌在缓冲期内则返回令牌
alt 需要刷新
getAccessToken->>AcquireLock: AcquireRefreshLock(key)
AcquireLock-->>getAccessToken: 锁已获得
getAccessToken->>RefreshWithRetry: 调用 RefreshAccessTokenWithRetry(refreshToken,3)
RefreshWithRetry->>CodexAPI: 发起刷新请求(可能多次重试)
CodexAPI-->>RefreshWithRetry: 返回新令牌或错误
RefreshWithRetry-->>Store: 更新缓存与持久化(若成功)
getAccessToken->>AcquireLock: Unlock()
end
Execute->>CodexAPI: 使用令牌调用 API
alt API 返回 401
CodexAPI-->>Execute: 401 Unauthorized
Execute->>getAccessToken: getAccessToken(ctx, true, 被拒绝token)
getAccessToken->>AcquireLock: AcquireRefreshLock(key)
getAccessToken->>RefreshWithRetry: 强制刷新(绕过被拒凭据复用)
RefreshWithRetry->>CodexAPI: 刷新请求
CodexAPI-->>RefreshWithRetry: 返回新令牌
RefreshWithRetry-->>Store: 更新
getAccessToken->>AcquireLock: Unlock()
Execute->>CodexAPI: 使用新令牌重试请求
end
估计代码审查工作量🎯 4 (Complex) | ⏱️ ~45 分钟 可能相关的 PRs
建议审查者
诗
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/adapter/provider/codex/adapter.go (2)
322-342:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift锁内仍然沿用了锁前读取的
refresh_token快照。这里拿到全局刷新锁后只重查了
a.tokenCache,但 Line 342 真正用于刷新的是锁前的config.RefreshToken。如果等待锁期间是 quota handler / task 完成了刷新并轮换了refresh_token,这个旧 adapter 实例并不会像internal/handler/codex.go那样在锁内重读 provider,于是仍可能拿已经失效的refresh_token去调刷新,refresh_token_reused会在跨路径并发下继续出现。建议在锁内重新获取最新 provider 配置,或引入能把最新 refresh token 同步回当前 adapter 的共享状态。🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapter/provider/codex/adapter.go` around lines 322 - 342, After acquiring the refresh lock via AcquireRefreshLock(RefreshLockKey(...)) you currently still call RefreshAccessTokenWithRetry using the pre-lock snapshot config.RefreshToken; instead, inside the locked section re-read the authoritative refresh token from the provider/shared state (e.g., reload a.provider or a.provider.Config or call the same lookup used by internal/handler/codex.go) or sync the latest token into a.tokenCache before calling RefreshAccessTokenWithRetry, and then pass that fresh token to RefreshAccessTokenWithRetry (and update a.tokenCache/a.provider with the new token on success) so you never attempt refresh with a stale config.RefreshToken.
365-404:⚠️ Potential issue | 🟠 Major | ⚡ Quick win不要把适配器内存态的 token 更新绑在
providerUpdate上。
providerUpdate在这个类型里是可选的,但现在config.AccessToken、config.ExpiresAt和config.RefreshToken只会在它非空时更新。这样一旦本次刷新返回了新的refresh_token,没有回调的 adapter 下一次过期后仍会继续使用旧 token 刷新,等于把轮换语义丢掉了。内存态应始终先更新,再按需持久化。建议修改
- // Persist token to database if update function is set - if a.providerUpdate != nil { - config.AccessToken = tokenResp.AccessToken - config.ExpiresAt = expiresAt.Format(time.RFC3339) - if tokenResp.RefreshToken != "" { - config.RefreshToken = tokenResp.RefreshToken - } + config.AccessToken = tokenResp.AccessToken + config.ExpiresAt = expiresAt.Format(time.RFC3339) + if tokenResp.RefreshToken != "" { + config.RefreshToken = tokenResp.RefreshToken + } + + // Persist token to database if update function is set + if a.providerUpdate != nil { if tokenResp.IDToken != "" { if claims, parseErr := ParseIDToken(tokenResp.IDToken); parseErr == nil && claims != nil { if v := strings.TrimSpace(claims.GetAccountID()); v != "" {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapter/provider/codex/adapter.go` around lines 365 - 404, The code currently only updates in-memory token fields when a.providerUpdate is non-nil, which loses refresh_token rotation for adapters without a callback; change the logic in the refresh handling (the block that references config, tokenResp, expiresAt, ParseIDToken and a.providerUpdate) so that you always update the in-memory config.AccessToken, config.ExpiresAt, and—if present—config.RefreshToken and all ID token-derived fields (claims.Email, Name, Picture, AccountID, UserID, PlanType, SubscriptionStart, SubscriptionEnd) before checking providerUpdate; then, if a.providerUpdate != nil, call a.providerUpdate(a.provider) to persist and log any error (same best-effort behavior), leaving ParseIDToken usage and error handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/handler/codex.go`:
- Around line 628-632: The call to h.svc.GetByID does not exist on the service
interface and causes typecheck failure; replace it with the existing service
method used elsewhere in this file (e.g. h.svc.GetProvider or
h.svc.GetProviders) so the code compiles. Locate the block using
codex.AcquireRefreshLock and change the call to use the same service method
signature/return values used by other paths in this file (match variable names
and nil checks to the existing GetProvider/GetProviders return types),
preserving the logic that refreshes provider and config when a fresh record is
returned.
- Around line 646-653: The UpdateProvider call's error is ignored which can drop
a rotated refresh_token; change the block after setting config
(AccessToken/ExpiresAt/RefreshToken) to check the error returned by
h.svc.UpdateProvider(provider) and treat it as a refresh failure: if
UpdateProvider returns an error, do not proceed as if refresh succeeded —
log/return the error (propagate it from the current refresh function), avoid
releasing the lock as if success until cleanup, and ensure any adapter cache
refresh that depends on persistence is not considered done; reference the
h.svc.UpdateProvider call, the local config/refresh_token assignment, and the
unlock() call to implement this control flow change.
---
Outside diff comments:
In `@internal/adapter/provider/codex/adapter.go`:
- Around line 322-342: After acquiring the refresh lock via
AcquireRefreshLock(RefreshLockKey(...)) you currently still call
RefreshAccessTokenWithRetry using the pre-lock snapshot config.RefreshToken;
instead, inside the locked section re-read the authoritative refresh token from
the provider/shared state (e.g., reload a.provider or a.provider.Config or call
the same lookup used by internal/handler/codex.go) or sync the latest token into
a.tokenCache before calling RefreshAccessTokenWithRetry, and then pass that
fresh token to RefreshAccessTokenWithRetry (and update a.tokenCache/a.provider
with the new token on success) so you never attempt refresh with a stale
config.RefreshToken.
- Around line 365-404: The code currently only updates in-memory token fields
when a.providerUpdate is non-nil, which loses refresh_token rotation for
adapters without a callback; change the logic in the refresh handling (the block
that references config, tokenResp, expiresAt, ParseIDToken and a.providerUpdate)
so that you always update the in-memory config.AccessToken, config.ExpiresAt,
and—if present—config.RefreshToken and all ID token-derived fields
(claims.Email, Name, Picture, AccountID, UserID, PlanType, SubscriptionStart,
SubscriptionEnd) before checking providerUpdate; then, if a.providerUpdate !=
nil, call a.providerUpdate(a.provider) to persist and log any error (same
best-effort behavior), leaving ParseIDToken usage and error handling intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 84b83891-9c9c-4f86-a9fb-7d7646b1815b
📒 Files selected for processing (3)
internal/adapter/provider/codex/adapter.gointernal/adapter/provider/codex/refresh_lock.gointernal/handler/codex.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Backend Checks
- GitHub Check: Frontend Checks
- GitHub Check: e2e
- GitHub Check: playwright
🧰 Additional context used
🪛 golangci-lint (2.12.2)
internal/handler/codex.go
[error] 629-629: : # github.com/awsl-project/maxx/internal/handler [github.com/awsl-project/maxx/internal/handler.test]
internal/handler/codex.go:629:28: h.svc.GetByID undefined (type *service.AdminService has no field or method GetByID)
(typecheck)
🔇 Additional comments (1)
internal/adapter/provider/codex/refresh_lock.go (1)
9-42: LGTM!
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/router/router.go`:
- Around line 582-588: The function injectProviderUpdate currently references an
undefined variable p; change its signature to injectProviderUpdate(a
provider.ProviderAdapter, p *domain.Provider) and update its internal use of p
(used to capture tenantID and id and to create the reload closure via
repo.GetByID) so the provider pointer is supplied rather than relying on a
nonexistent outer variable; then update the two call sites that invoke
injectProviderUpdate—inside InitAdapters (the loop at the previous line ~113)
and inside RefreshAdapter (the call at the previous line ~129)—to pass the
corresponding *domain.Provider instance into injectProviderUpdate, ensuring
types match provider.ProviderAdapter and *domain.Provider and keeping the
providerReloader/SetProviderReloadFunc logic unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3db96733-dbfc-4c85-8645-af22cfad0b25
📒 Files selected for processing (7)
internal/adapter/provider/codex/adapter.gointernal/adapter/provider/codex/refresh_lock.gointernal/adapter/provider/codex/refresh_lock_test.gointernal/adapter/provider/codex/reload_wiring_test.gointernal/handler/codex.gointernal/router/router.gointernal/service/codex_task.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/handler/codex.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: e2e
- GitHub Check: playwright
🧰 Additional context used
🪛 GitHub Actions: PR Checks / 0_Backend Checks.txt
internal/router/router.go
[error] 582-582: go vet failed: internal/router/router.go:582:42: undefined: p
🪛 GitHub Actions: PR Checks / Backend Checks
internal/router/router.go
[error] 582-582: go vet reported undefined identifier 'p' at internal/router/router.go:582:42. Step: go vet ./cmd/... ./internal/...
🪛 GitHub Check: Backend Checks
internal/router/router.go
[failure] 584-584:
undefined: p
[failure] 582-582:
undefined: p
🪛 golangci-lint (2.12.2)
internal/router/router.go
[error] 582-582: : # github.com/awsl-project/maxx/internal/router
internal/router/router.go:582:42: undefined: p
internal/router/router.go:584:19: undefined: p
(typecheck)
🔇 Additional comments (11)
internal/adapter/provider/codex/refresh_lock_test.go (1)
9-73: LGTM!internal/adapter/provider/codex/reload_wiring_test.go (1)
9-48: LGTM!internal/adapter/provider/codex/adapter.go (5)
57-64: LGTM!Also applies to: 81-85
133-133: LGTM!Also applies to: 215-215, 267-267
279-315: LGTM!
341-373: LGTM!
375-454: LGTM!internal/adapter/provider/codex/refresh_lock.go (3)
8-19: LGTM!
35-59: LGTM!
68-76: LGTM!internal/service/codex_task.go (1)
150-185: LGTM!
多方多轮 Review 总结对本 PR 进行了多方、多轮 review(本地 Claude + 本地 Codex + CodeRabbit),逐轮修复。最终 各方结论
修复要点(在原始 401 刷新修复之上,经 review 迭代)
已验证安全(曾被怀疑、经核实不成立)
已知后续项(建议单独 PR)本地 Codex 指出 🤖 Generated with Claude Code |
The Codex adapter never truly refreshed on a 401: it cleared the in-memory cache then re-read the persisted access_token — the same credential the upstream had just rejected — so the retry hit the same 401, and refresh failures fell back to that dead token. There was no retry, no handling of permanent errors, and concurrent refreshes could reuse a rotated refresh_token (refresh_token_reused). Aligns with CLIProxyAPI and the sibling Claude adapter: - oauth.go: add RefreshAccessTokenWithRetry (linear backoff, ctx-aware) and isNonRetryableRefreshErr (fast-fail on refresh_token_reused / invalid_grant). - adapter.go: getAccessToken(ctx, forceRefresh, rejectedToken) forces a real refresh on 401, never falling back to the rejected token; works against a local provider snapshot (no a.provider mutation race); re-reads the freshest provider under the lock to adopt a token another path just rotated. - refresh_lock.go: process-wide per-account refresh lock (keyed by ChatGPT account ID, falling back to refresh_token) with refcount eviction and an idempotent unlock. OpenAI rotates the refresh_token on every refresh and rejects reuse, so all paths that may refresh the same account serialize. - All five refresh_token rotation paths now take the lock and re-read under it: request adapter, quota task, batch-quota handler, per-provider usage, and provider-info refresh. - router.go: inject a provider-reload callback (repo.GetByID) alongside the existing update hook. ProviderUpdateFunc/ProviderReloadFunc are type aliases so the router's duck-typed assertions match — previously the defined ProviderUpdateFunc silently failed the assertion, leaving refreshed tokens unpersisted. - handler/codex.go, service/codex_task.go: quota paths use the retry helper and persist rotated refresh tokens. - tests: refresh_lock_test.go (key derivation, mutual exclusion, eviction, idempotent unlock) and reload_wiring_test.go (router wiring type assertions). Reviewed across multiple rounds (local Claude + local Codex + CodeRabbit). go build ./internal/... / vet / test / -race all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3061bf3 to
210e2be
Compare
The cached ProviderRepository hands the same *domain.Provider pointer to every caller, and the request hot path reads token fields off provider.Config.Codex without a lock (Execute reads AccountID; getAccessToken's fast path reads AccessToken/ExpiresAt/RefreshToken). The refresh paths mutated that shared struct in place under the refresh lock — which serializes writers but not the lock-free readers — a data race flagged in review. Switch every token-persisting path to copy-on-write: clone the provider (CloneForTokenPersist), mutate the clone, and persist it via repo.Update, which atomically swaps the cache pointer. Readers holding the old pointer see a consistent (if briefly stale) struct; no reader and writer ever touch the same struct. Applied in the adapter (refresh + fallback), the quota task, and all three handler paths (RefreshProviderInfo, GetProviderUsage, GetBatchQuotas); each post-persist read is repointed at the clone. Tests: persist_test.go covers clone isolation, nil-safety, and a concurrent persist-vs-read scenario that is clean under -race. go build / vet / test / -race all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
问题
Codex adapter 的
getAccessToken没有forceRefresh路径。收到 401 时它只清空内存 cache,然后重新读取持久化的access_token——也就是上游刚刚拒绝的那个凭据——于是重试又撞上同一个 401。刷新失败时还会回退到这个失效 token,并且没有任何重试,也不区分永久性错误(refresh_token_reused/invalid_grant)。参考
CLIProxyAPI(internal/auth/codex/openai_auth.go的RefreshTokensWithRetry/isNonRetryableRefreshErr)以及同仓库已正确实现的 Claude adapter(getAccessToken(ctx, forceRefresh bool))对齐修复。改动
internal/adapter/provider/codex/oauth.goRefreshAccessTokenWithRetry(线性退避,支持 ctx 取消)isNonRetryableRefreshErr(对refresh_token_reused/invalid_grant/invalid_request/invalid_client快速失败)internal/adapter/provider/codex/adapter.gogetAccessToken(ctx)→getAccessToken(ctx, forceRefresh bool):forceRefresh时跳过 cache 和持久化 token,且刷新失败不回退被拒 tokenrefreshMu串行化刷新(避免并发refresh_token复用),取锁后二次复查 cachegetAccessToken(ctx, true)>= 1s下限保护internal/service/codex_task.go/internal/handler/codex.goRefreshAccessTokenWithRetry,与请求路径保持一致测试
go build ./internal/...✅go vet ./internal/adapter/provider/codex/... ./internal/service/... ./internal/handler/...✅go test ./internal/adapter/provider/codex/... ./internal/service/... ./internal/handler/...✅(全部 ok)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests