openab-agent: multi-vendor OAuth ADR + cross-process auth.json locking (codex + MCP)#1190
Conversation
Proposed ADR for the openab-agent LLM-provider OAuth revamp: a two-axis OAuthVendor adapter (auth flow vs inference transport), a cross-process flock-guarded credential-store invariant for auth.json, the CLAUDE_CODE_OAUTH_TOKEN env route, a 14-variant vendor feasibility matrix, and the /auth (PR openabdev#1185) auth-trigger model. Surfaced while reviewing PR openabdev#1187 (first OAuth vendor). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… names - Build the OAuthVendor driver on the official `oauth2` crate (already in-tree via the MCP side) instead of a hand-rolled PKCE/exchange/refresh flow; the Anthropic JSON-token-body quirk is applied via the crate's custom http-client hook. Reframe §8 accordingly (hand-rolled flow is the rejected alternative). - Remove the project Rollout-plan section (internal sequencing, not ADR material); keep the race-window mitigation in §5.4. - Use vendor names only; drop internal fleet-agent references from the matrix. - Replace unexplained "ToS-gray" with "ToS-risk" + a definition. - Fix cross-references (crate-qualified paths, §-refs, line numbers) and move the settled model-default decision under "Decisions & open questions". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…w r1) Fold in two release-blocker-class concurrency fixes from Mira's review: - §5.4 refresh-token rotation: the prior "refresh outside lock, re-read on commit" claim was wrong — N processes each send a refresh with the same RT_old before committing, tripping OAuth 2.1 §10.4 reuse -> token-family revocation. Replace with a per-tenant exclusive lock: network refresh held under the tenant lock only (not the global lock), so exactly one refresh per tenant per expiry with no head-of-line blocking across tenants. Extends to mcp:<server> tenants. - §7 /auth: key the pending PKCE verifier+state by the initiating Discord user id instead of a single global entry — prevents concurrent-user overwrite (PKCE mismatch) and session hijack. - §5.1 OAuthVendor::redirect() -> Option, since device flows have no redirect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- §6/§9 Q2: correct the stale "git-safe" claim on the gemini GOCSPX- secret. The value is non-confidential by RFC 8252 / Google docs, but GitHub now push-protects Google secrets by default (changelog 2026-03) and partner-scans them for auto-revoke, so a raw literal is not safe in a public repo. Decide: encode-at-rest (scanner-evasion for a non-secret, NOT a security control) or env-inject at runtime — not raw text. - §7: pending PKCE entries get created_at + a 15-min GC sweep in with_auth_locked so abandoned /auth attempts don't accumulate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rett decisions r3) - §9 Q2 DECIDED: env-injection (b) is the default for the gemini/agy bundled client_secret; encode-at-rest (a) is the fallback for bundled zero-config binaries only, framed as scanner-evasion (not security). Cite rclone rcloneEncryptedClientSecret + obscure.MustReveal() as the canonical precedent (§10). - §9 Q3 DECIDED: GO gemini/grok (first wave) + agy (experimental, opt-in, ToS caveat — shares gemini's Code-Assist provider, residual risk is ToS not secret storage); No-Go cursor/kiro. Mirror as a build-decision line under §6. - §10: add rclone + GitHub/Google secret-scanning references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GitHub survey 2026-06-24 grounds the agy GO decision: - §6/§9 Q2: agy client_secret requirement CONFIRMED — it needs a GOCSPX- secret, a public constant >=20 antigravity-auth repos hardcode verbatim (NoeFabris/opencode-antigravity-auth, router-for-me/CLIProxyAPI, ...). Literal deliberately NOT pasted into the doc — would trip the very §9 Q2 push-protection, so we dogfood the env/encode decision. Redirect confirmed localhost:51121/oauth-callback. - §9 Q3: ecosystem evidence — agy OAuth is widely ported (opencode/pi/hermes/ openclaw plugins + proxies), proving the integration, while the same ecosystem's anti-ban / quota-locking / multi-account-rotation tooling empirically confirms the ToS-ban + 429 risks, reinforcing the opt-in gate. - §10: add the antigravity OAuth ecosystem reference set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… in scope (r5) Grounded by a codebase survey (2026-06-24): - §4/§6: clarify "agy as a GO vendor" means a native OAuthVendor + Code-Assist inference provider — it does NOT run the agy CLI. agy speaks no ACP; the existing `antigravity` runtime variant (Mira/ECS) only works via a dedicated agy-acp adapter that shells out to the agy binary per prompt and polls its SQLite DB. The provider path sidesteps ACP entirely and supersedes the CLI-wrapper for native use, so agy's lack of ACP doesn't block the GO. - §5.4/§9 Q4: make the MCP CredentialStore revamp explicitly in-scope. auth.json has NO lock today (only atomic rename); provider save_tokens (auth.rs:234) and McpCredentialStore::save/clear (auth.rs:284-328) are two independent unlocked RMW callers, so with_auth_locked must wrap BOTH or the race persists. Also correct the stale save_tokens_for name and note with_auth_locked is new. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The §9 Q4 tail referenced 'openab-agent-mcp.md open items openabdev#1 (reqwest 0.12/0.13 split) and openabdev#8 (doctor/runtime two-store split)' — but that ADR's §10 Open Questions has only two items (mcp.json location; native-vs-broker parity), neither matching, and no such numbered items / terms exist anywhere in it. The phantom reference dated to the original draft. Replace with an accurate statement: McpCredentialStore reuses the same TokenStore/auth.json storage (openab-agent-mcp.md §6.1), so the lock lands once and serves both; the reqwest version conflict is the rmcp-OAuth dependency issue surfaced on the feat/openab-agent-mcp-resilience PR, not an mcp-ADR open item. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ive (Brett) Brett's call given the 93%-plaintext ecosystem survey: keep the bundled zero-config UX via encode-at-rest (obscure, rclone obscure.MustReveal style) as the DEFAULT, with env-injection as the alternative for fleet/pod. Reverses the prior (b)-default ordering. Framing unchanged: encode-at-rest is scanner-evasion for a non-confidential value, not a security control. Record the survey numbers (99/107 plaintext; auto-revoke largely unrealized) and that the real risk mitigated is org-repo push-protection friction, not credential loss. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements the ADR §5.4 invariant. auth.json had NO lock: provider save_tokens and McpCredentialStore::save/clear each did an independent unlocked read-modify-write, so a concurrent codex refresh and MCP save last-writer-wins the whole map, and N processes (one openab-agent per Discord thread) could each refresh the codex token with the same RT_old → OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Two locks, flock(2) on sidecar files (kernel auto-releases on death), cfg(unix) with a non-unix no-op: - with_auth_locked: global exclusive lock across the re-read -> mutate -> atomic-write. ALL writers funnel through it — save_tokens (codex) and McpCredentialStore::save/clear (MCP) — so writers merge onto the latest on-disk state instead of lost-updating. Held only for the fast file RMW, never across network I/O. - lock_tenant_refresh: per-tenant refresh serialisation. get_valid_token / force_refresh take the codex tenant lock (non-blocking try + async backoff + 10s timeout, held across the network refresh), with a double-checked re-read so a process that waited adopts the token another already refreshed → exactly one real refresh per tenant per expiry, no RT_old reuse. Uses libc::flock (already a cfg(unix) dep); rustix is NOT in-tree (ADR text was optimistic). New test asserts the locked RMW merges codex + MCP tenants without lost-update. fmt + clippy -D warnings + test (191 passed) green. Known gap (for review): the MCP *network* refresh is owned by rmcp's AuthorizationManager (it refreshes then calls CredentialStore::save), so MCP writes get the file-integrity lock but MCP refreshes are not yet tenant- serialised. Closing that needs an rmcp-level hook — proposed as follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round-2 code review (Mira): - §7 pending-entry GC: add created_at to PendingPasteLogin and sweep AuthEntry::Pending older than 15 min inside with_auth_locked on every write, so abandoned /auth two-step attempts don't accumulate. PendingPasteLogin is currently a legacy tombstone with no live writer (PKCE state lives in rmcp's in-memory StateStore); the field + GC land now per ADR §7 and are forward-compatible with the forthcoming /auth two-step flow, and meanwhile sweep legacy stray entries (created_at default 0 reads as ancient). New test covers stale-swept / fresh-kept / real-tenant-untouched. - flock_try_exclusive: match std::io::ErrorKind::WouldBlock instead of a single raw EWOULDBLOCK errno, covering EAGAIN/EWOULDBLOCK across libc/BSD. - MCP network-refresh serialisation stays a follow-up (rmcp owns the refresh) — Mira concurs. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… gap) Closes the known gap from f370110: MCP refreshes now get the same per-tenant serialisation as codex. rmcp's CredentialStore exposes no pre-refresh hook, but openab drives the MCP refresh explicitly in McpRuntimeManager::resolve_oauth_dial via client.get_access_token(). Wrap that call (per-server) with the existing auth::lock_tenant_refresh so only one process refreshes a given server at a time. No explicit double-check needed: rmcp's get_access_token re-load()s auth.json each call and returns the cached token without a network refresh when remaining >= REFRESH_BUFFER (rmcp auth.rs:1238), so a process that loses the race adopts the token the winner wrote to the shared file — no second RT_old presentation, no OAuth 2.1 §10.4 family revocation. rmcp already single-flights within one process via its AuthorizationManager Mutex; this closes the cross-process gap. - auth.rs: lock_tenant_refresh + AuthFileLock made pub(crate) for the mcp module. - ADR §5.4: document the resolve_oauth_dial serialisation point. fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…h fd Quality-only cleanups from a 4-angle /simplify pass (no behaviour change): - Extract lock_global(path) -> Result<Option<AuthFileLock>> and route both with_auth_locked and McpCredentialStore::clear through it, so the "global" sidecar name + the cfg(unix) acquire live in one place instead of two copy-pasted blocks. clear flattens (drop the inner run closure). - lock_tenant_refresh opens the lock fd once and re-issues flock on it each retry, instead of re-opening (and re-create_dir_all-ing) the file every 100ms under contention. Removes the now-unused flock_try_exclusive helper. - Document lock_tenant_refresh's double-check contract (the lock only serialises; callers must re-check freshness after acquiring). fmt + clippy -D warnings + test (192 passed) green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Doc-only clarifications from a /code-review pass (no behaviour change): - PendingPasteLogin.created_at: warn loudly that any writer of a fresh Pending entry MUST stamp created_at, else gc_stale_pending sweeps it on the next locked write (latent footgun for the forthcoming /auth two-step writer). - lock_tenant_refresh: correct the contract — reuse-safety comes from loading the refresh token INSIDE the lock (so force_refresh, which always refreshes on a 401, is reuse-safe too); the post-lock expiry re-check is only an optimisation to skip a redundant refresh. Review also surfaced a pre-existing, out-of-scope namespacing gap (an MCP server literally named "codex" collides with the codex tenant's auth.json key and refresh lock — committed MCP creds use the bare server name, not mcp:<server> as ADR §5.4 describes) — flagged for a separate change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ci-openab-agent.yml runs cargo fmt/clippy/test/build with working-directory: openab-agent, but the crate is not a member of the parent openab workspace (members = crates/openab-core, openab-gateway) and was not excluded, so cargo errors 'current package believes it's in a workspace when it's not' and every step fails before doing any work. openab-agent is intentionally standalone (own version + dual reqwest 0.12/0.13 for rmcp), so the correct fix is an empty [workspace] table making it its own root. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This reverts commit b40693c.
ci-openab-agent.yml ran cargo from working-directory: openab-agent without the [workspace] table the crate needs (it's standalone, not a parent-workspace member), so every step failed with 'believes it's in a workspace when it's not' — the workflow had been red independently of any PR. Dockerfile.unified already works around this by appending the table at build time; replicate that in the workflow rather than committing it to Cargo.toml (which would double-append in the Dockerfile and break the image build). Also harden the ACP smoke test: build the release binary in its own step and bump the response timeout 5s -> 30s so a loaded runner doesn't flake the agent's first-response window (the binary itself responds in <1s locally, verified incl. a clean-HOME release build). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…osis The smoke test produces empty stdout only in CI (the binary responds with agentInfo in <1s locally across debug/release/clean-HOME/CI-env-var runs). Capture stderr + exit code so the next CI run reveals why stdout is empty.
This comment has been minimized.
This comment has been minimized.
|
All review findings addressed and re-verified — CI fully green (build-builder, check, 16 smoke-test variants), chaodu-agent LGTM x2. No open items on my side; handing off for maintainer review. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Resolves the Changes-Requested review on the multi-vendor OAuth ADR +
cross-process auth.json locking PR. No behavioural change beyond log level
and a non-unix diagnostic.
ADR (docs/adr/openab-agent-oauth.md)
- Fix lock-file names to match code: auth.json.global.lock and
auth.json.refresh.<tenant>.lock (were auth.json.lock / auth.json.<tenant>.lock).
- Add the path parameter to the with_auth_locked pseudocode signature.
- Correct the MCP refresh note: initialize_from_store() does the disk reload,
not get_access_token.
- Correct the crate note: libc::flock directly (rustix is not in-tree).
- Document the deliberate fail-open trade-off on tenant-lock timeout.
- Mark the default-model removal (Decision 1) as a follow-up; this PR ships
the ADR + locking only.
Code
- auth.rs: escalate the tenant-lock timeout log from warn! to error! and
document the fail-open trade-off on lock_tenant_refresh.
- auth.rs: non-unix lock_global no-op now warns once instead of silently
providing zero cross-process protection.
- mcp/runtime.rs: fix the rmcp reload comment and document the cross-module
invariant that the refresh lock and credential entry share the server name.
CI (ci-openab-agent.yml)
- ACP smoke test now fails on a non-zero/timeout exit code, and the agentInfo
assertions use { } so exit fails the step rather than just a subshell.
Gate: cargo fmt --check, clippy -D warnings, test (192 passed) all clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the fail-open tenant-lock timeout with a bounded-refresh + fail-closed
design, so a contended lock can no longer reintroduce the double-refresh it
exists to prevent.
- Bound the refresh network call with an explicit 8s HTTP timeout on both the
codex client (auth.rs) and the MCP AuthClient (mcp/runtime.rs), strictly
shorter than the 10s lock-acquire deadline. With flock(2) auto-release on
death, a live holder always frees the tenant lock before a waiter's deadline,
so a lock-acquire timeout is genuinely abnormal.
- lock_tenant_refresh now returns RefreshLock { Held, Unavailable, TimedOut }.
On TimedOut callers fail closed instead of refreshing unserialised:
- codex get_valid_token / force_refresh return a retryable error;
- MCP resolve_oauth_dial returns a new OauthDialError::Transient that leaves
the server retryable WITHOUT forcing re-login (NeedsAuth) or tripping the
circuit breaker (auth-level failures still map to NeedsAuth as before).
A filesystem error opening the sidecar returns Unavailable and degrades to a
best-effort unserialised refresh rather than blocking every refresh.
- Add a fail-closed timeout test (injectable deadline) and update ADR §5.4 to
document the bounded-refresh + fail-closed invariant, superseding the earlier
fail-open note.
Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mira review (PR openabdev#1190) found the fail-closed invariant held only for the codex path. The MCP path holds the tenant lock across TWO sequential bounded calls — rmcp's initialize_from_store() (AS discovery) then get_access_token() (refresh) — so the worst-case lock-hold is ~2 x REFRESH_HTTP_TIMEOUT (16s), which could exceed the fixed 10s lock deadline and fail a waiter closed while the holder is still legitimately progressing. Derive REFRESH_LOCK_TIMEOUT from the bound instead of hardcoding it: REFRESH_LOCK_TIMEOUT = MAX_REFRESH_ROUND_TRIPS (2) * REFRESH_HTTP_TIMEOUT + 4s = 20s so the deadline is always above the worst-case hold and only a genuinely stuck holder trips it. Normal-case latency is unchanged (the waiter polls every 100ms and acquires as soon as the holder releases). Update the fn doc and ADR §5.4 to state the multi-call hold explicitly. Gate: cargo fmt --check, clippy -D warnings, test (193 passed) all clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
|
All review feedback addressed across three rounds — chaodu-agent's 8 findings (ADR/code drift, CI exit-code, fail-open→fail-closed) and the follow-up fail-closed lock-timeout bound. Second-agent review (Mira) confirms all points. CI fully green. Handing off for maintainer review. |
This comment has been minimized.
This comment has been minimized.
…403) (#1219) #1218 added a pull_request trigger, but fork PRs run with a read-only GITHUB_TOKEN, so creating the 'OpenAB PR Review' status 403s and the run fails (observed on fork PR #1190). Guard the job to run on pull_request events only for same-repo PRs; forks continue to be handled by the cron poller (which runs with full permissions). schedule/workflow_dispatch always run. Fixes the failing runs introduced by #1218. Co-authored-by: chaodu-agent <chaodu-agent@users.noreply.github.com>
|
LGTM ✅ — Well-engineered cross-process locking for auth.json that fixes a real release-blocker-class OAuth RTR race, paired with a thorough ADR. What This PR DoesFixes a latent concurrency bug where multiple openab-agent processes (one per Discord thread) could concurrently refresh the same OAuth token, triggering OAuth 2.1 §10.4 token-family revocation (fleet-wide logout). Lands both the ADR documenting the multi-vendor OAuth pattern and the implementation of the two-lock file-locking scheme. How It WorksTwo complementary
CI workflow improvements: workspace isolation fix, separated build step, robust ACP smoke test with proper exit-code propagation. Findings
What's Good (🟢)
Baseline Check
|
chaodu-agent
left a comment
There was a problem hiding this comment.
LGTM ✅ — Cross-process auth.json locking is well-engineered and correctly addresses the RTR race condition. ADR is thorough. Approving.
…auth tenant Brett asked to cover the new tenant the same way openabdev#1190's codex lock tests do, so single-flight / fail-closed is proven for it rather than only structurally similar: - `with_auth_locked_merges_anthropic_tenant_no_lost_update` — a concurrent codex write does not clobber a just-written `anthropic-oauth` token (the new tenant rides the same locked RMW funnel). - `lock_tenant_refresh_fails_closed_for_anthropic_and_is_per_tenant` — a held anthropic refresh lock makes a second anthropic acquire fail closed (`TimedOut`), and does NOT block codex (per-tenant isolation — the reason §5.4 uses a per-tenant lock, not the global one). 215 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What problem does this solve?
openab-agent'sauth.jsonis a shared, multi-writer credential file with an unlocked read-modify-write, and openab-agent runs one process per Discord thread. So ordinary concurrent usage = concurrent processes refreshing the same OAuth token → refresh-token-rotation reuse → worst case OAuth 2.1 §10.4 token-family revocation = fleet-wide logout. API-key users never hit this; OAuth adoption (PR #1187, Anthropic) is what activates it.This PR lands the ADR that sets the multi-vendor OAuth pattern (so every future provider stops hand-rolling its own flow) and the concurrency fix for the release-blocker-class storage race — for both the codex tenant and the MCP
CredentialStore.Closes #
Discord Discussion URL: https://discord.com/channels/1490643539682529300/1519372543377539173
At a Glance
Prior Art & Industry Research
OpenClaw: API keys + subscription OAuth; stores per-profile
{access,refresh,expires,accountId}and treats the profile file as a token sink refreshed under a file lock — direct corroboration for the locked-RMW decision.Hermes Agent (NousResearch):
PROVIDER_REGISTRYdataclasses declare each provider's auth type + URLs + env vars behind oneresolve_runtime_provider()entry point;auth.jsonguarded withfcntl/msvcrtfile locks — corroborates both the registry shape (OAuthVendor) and the file-lock decision.Other references:
earendil-works/pi) — ported flow for feat(native-agent): Anthropic OAuth (Claude Pro/Max) login for openab-agent #1187;CLAUDE_CODE_OAUTH_TOKEN(#3591); Kiro/Cursor/xAI provider extensions.rcloneEncryptedClientSecret+ runtimeobscure.MustReveal(): precedent for the ADR §9 Q2 encode-at-rest of a non-confidential bundled OAuth secret (scanner-evasion, explicitly not encryption).Proposed Solution
ADR (
docs/adr/openab-agent-oauth.md): a two-axis model — auth (OAuthVendordescriptor + a shared driver on the officialoauth2crate) kept orthogonal to inference (one provider per wire format) — plus a 14-variant vendor feasibility matrix, theCLAUDE_CODE_OAUTH_TOKENenv route, the/authtwo-step UX (per-user-keyed PKCE pending state), and the storage invariant below.Implementation (this PR):
with_auth_locked— globalflock(2)on anauth.json.global.locksidecar across re-read → mutate → atomic write. All writers funnel through it (codexsave_tokens+ MCPMcpCredentialStore::save/clear), so they merge onto the latest on-disk state instead of lost-updating.lock_tenant_refresh— per-tenantflock(non-blocking try + async backoff + 10s timeout) held across the network refresh, with a double-checked re-read, so concurrent processes do exactly one real refresh per tenant. Codex uses it inget_valid_token/force_refresh; MCP uses it inresolve_oauth_dialaroundget_access_token().flock(2)on sidecar files (kernel auto-releases on fd close / process death — no stale lock);libc::flockundercfg(unix), non-unix no-op.created_at+ a 15-min sweep insidewith_auth_locked.Why this approach?
flock(2)over a sentinel lockfile — the kernel releases it on process death, so a crashed holder never wedges the file (no orphan cleanup / manual timeout).CredentialStoreexposes no pre-refresh hook, but openab drives the MCP refresh explicitly inresolve_oauth_dial; rmcp'sget_access_tokenre-load()sauth.jsonand skips the network refresh when the token is already fresh, so the per-tenant lock + rmcp's own fresh-load gives cross-process single-flight with no extra double-check code.rustixis not in-tree (the ADR text was optimistic), so this useslibc::flockdirectly.Alternatives Considered
Mutex/ tokio single-flight — useless across the per-thread processes.flock(2)auto-releases.RT_oldbefore it commits, so RTR still fires.CredentialStore— lossy; the shared layer sits below both stores (file RMW).Validation
Rust changes:
cargo checkpassescargo testpasses (192 passed, 0 failed, 11 ignored) — incl. newwith_auth_locked_merges_concurrent_tenants_no_lost_updateandwith_auth_locked_gcs_stale_pending_but_keeps_fresh_and_tokenscargo clippy -- -D warningscleancargo fmt --checkcleanAll PRs:
openab-agent/(the crate is built standalone, matchingci-openab-agent.yml). Reviewed end-to-end with a second agent across two ADR-review rounds and two code-review rounds; all findings (RTR refresh race, per-user PKCE keying, pending GC, portableWouldBlock, MCP cross-process refresh) addressed and re-verified.