diff --git a/.claude/skills/kaizen-research/SKILL.md b/.claude/skills/kaizen-research/SKILL.md new file mode 100644 index 00000000..4bd345bf --- /dev/null +++ b/.claude/skills/kaizen-research/SKILL.md @@ -0,0 +1,386 @@ +--- +name: kaizen-research +description: Weekly Friday early-morning external + internal scan for emerging functionality, agentic trends, tools, and feature/UX improvements in the AgentCore Public Stack repo. Tracks AWS Bedrock + AgentCore announcements, Strands Agents releases, FastMCP (used by externally hosted MCP servers), the aws-samples/sample-strands-agent-with-agentcore reference repo, the MCP ecosystem (including MCP Apps + extensions), frontier model announcements, agent-harness patterns, and agentic UI/UX patterns (MCP Apps, Vercel AI SDK, assistant-ui, NN/g AI research, Linear/Cursor/Anthropic product blogs). Audits internal signals (recent commits, open PRs, CI failures, version-pin lag, dormant skills). Outputs a dated research doc + queues ideas in `docs/kaizen/review-queue.md` for that same morning's `kaizen-review-prep` (runs ~2 hours later) to rank into decisions. Opens a PR into `develop`. **Out of scope**: security advisories / Dependabot / CodeQL — those have dedicated tooling and don't need a weekly kaizen lens. Triggers: "kaizen research", "weekly research scan", "external scan", "what should we look at this week". +--- + +# Kaizen Research + +Friday early morning. The "what's the rest of the world learning that we should consider, and what's our own week telling us?" scan. Pairs with `kaizen-review-prep` which runs ~2 hours later the same morning and ranks this skill's output into a decision agenda — both docs ready before Phil sits down to review Friday morning. + +## Philosophy + +- **Subtraction first.** Every research run should propose at least as many things to *remove or simplify* as to add. A smaller stack you trust beats a bigger one you route around. 
**Subtraction explicitly includes replacing custom code with library-native equivalents** — when an upstream release (Strands, AgentCore SDK, FastMCP, MCP, etc.) ships a capability we'd already built or filed an issue for, the win is closing our version and adopting upstream. Example: the 2026-05-10 bootstrap run found that Strands v1.37/v1.38 silently closed our open issues #266 and #267 — the codebase surface area shrinks even though we "added" a dep bump. +- **Dual lens — impact + capability-unlock.** Evaluate every upstream feature through *two* lenses, not one: (a) **impact on existing code** (does it change, simplify, or obsolete something we already have?) and (b) **capability unlock** (what *new* product capability, UX pattern, or enhancement does this make possible that we couldn't easily do before?). Subtraction-first still applies to the first lens. But capability-unlock items — features that enable net-new product surface — must be evaluated on their strategic merit, *not* hedged into "replaces future glue we haven't written." Example: the 2026-05-10 AgentCore Runtime BYO filesystem was first framed only as "could replace future filesystem-staging glue" — under-weighting the real story (code-interpreter sandboxes, cross-session uploads, shared skill hot-swap, persistent vector indexes). A dep-bump's win is usually subtraction; a *new* platform primitive's win is usually capability unlock. Don't mis-classify. +- **Subagent fan-out.** External sources are independent — fan them out to parallel subagents and synthesize. Keeps the main context clean and runs faster. +- **Web budget soft cap.** Target ≤50 web requests. If a source is exhausted, unreachable, or rate-limited, list it as "not scanned this week" — don't skip silently. Going modestly over the cap (say, to 60) is fine if the extra requests are surfacing real signal; document the overage in the Web Budget block. Don't pad — if 30 requests covered every source meaningfully, stop at 30. 
+- **Cite everything.** Every external claim gets a URL + access date in the Sources Scanned appendix. Web findings rot fast and you'll re-read them next week. +- **No edits outside `docs/kaizen/`.** This skill writes a dated research doc and updates `review-queue.md`. It never touches `backend/`, `frontend/`, `infrastructure/`, `CLAUDE.md`, or skill files. + +## When to run + +Friday early morning (~6am MT). `kaizen-review-prep` runs ~2 hours later (~8am MT) so both docs are waiting when Phil sits down Friday morning. Phil reviews, picks 1–3 to ship over the coming week, and POCs additional items over the weekend. Last weekend's POC findings surface in *this* run's review-prep as Carried Over items (lifted from comments on the previous week's research PR). + +## Sources + +### External (web — last 7 days unless noted) + +1. **AWS Bedrock + AgentCore "What's New"** + - https://aws.amazon.com/about-aws/whats-new/recent/feed/ + - https://aws.amazon.com/bedrock/whats-new/ + - https://aws.amazon.com/blogs/machine-learning/ (filter: bedrock, agentcore) + - Filter to: Bedrock, AgentCore, Bedrock Agents, Knowledge Bases, Guardrails, model availability/region/quota changes. + +2. **Strands Agents SDK** + - https://github.com/strands-agents/sdk-python/releases + - https://github.com/strands-agents/sdk-python/blob/main/CHANGELOG.md + - https://github.com/strands-agents/sdk-python/issues?q=is%3Aissue+sort%3Aupdated-desc + - For each new release, identify: breaking changes, new hooks/features, fixes that map to current usage in `backend/src/agents/main_agent/`. + +3. **Reference repo — `aws-samples/sample-strands-agent-with-agentcore`** + - https://github.com/aws-samples/sample-strands-agent-with-agentcore/commits/main + - Diff the last 7 days (or "since last research run" — whichever is longer). 
Identify new patterns, removed approaches, or fixes that map to constructs in this repo: agent setup, tool registration, AgentCore Identity flows, Memory configuration, Gateway/MCP wiring. + - This repo has historically informed our architecture; week-over-week deltas are first-class signal. + +4. **MCP ecosystem** + - https://modelcontextprotocol.io (blog, spec changes) + - https://github.com/modelcontextprotocol/servers (new servers, retired servers) + - MCP registry / awesome-mcp lists for new servers relevant to the stack (Bedrock, AWS, GitHub, Slack, observability). + +4a. **FastMCP** — used by our externally hosted MCP servers (Lambda-backed, behind AgentCore Gateway). FastMCP is **not** pinned in this repo's `pyproject.toml`; it lives in the MCP server repos this stack consumes via Gateway. Track upstream releases because changes affect server behavior we depend on. + - https://github.com/jlowin/fastmcp/releases + - https://github.com/jlowin/fastmcp/blob/main/CHANGELOG.md + - https://github.com/jlowin/fastmcp/issues?q=is%3Aissue+sort%3Aupdated-desc + - https://pypi.org/project/fastmcp/ (for latest version + release date) + - Identify: breaking changes, new server-side primitives (resources/prompts/tool decorators, lifespan, auth helpers), transport changes (especially relevant if MCP SEP-2567 sessionless transport lands), and Lambda/runtime adapter changes. + +4b. **Agentic UI/UX patterns** — emerging UI and UX conventions for AI/agentic apps. We're Angular + Tailwind, so React-specific libraries are **pattern-only** references (extract the idea, implement in signals). Focus on functionality + interaction + visual conventions, not generic "good chat UX". + - **MCP Apps + extensions** (priority): https://modelcontextprotocol.io/extensions/apps/overview, https://github.com/modelcontextprotocol/ext-apps, https://blog.modelcontextprotocol.io. The "MCP server returns an interactive UI inline with the chat" standard. 
Track host adoption (Claude Desktop, ChatGPT, VS Code Copilot, Goose, Postman) and new MCP extension SEPs. + - **AI SDK / Generative UI** (Vercel): https://ai-sdk.dev/docs/ai-sdk-ui, https://ai-sdk.dev/cookbook. Canonical reference for tool-call rendering, multi-step UI, generative UI, streaming state patterns. React, but the patterns port. + - **assistant-ui**: https://www.assistant-ui.com/docs, https://github.com/Yonom/assistant-ui/releases. React component library purpose-built for AI chat UI. Tracks attachment UX, threading, tool-call rendering primitives. + - **Vendor product-blog UX writeups**: https://linear.app/blog (Linear Agent), https://www.cursor.com/blog (canvas, agent harness), https://www.anthropic.com/news filtered for `artifact`/`ui`/`design`. Where in-app agentic patterns get documented by the teams shipping them. + - **OpenAI Canvas + ChatGPT UI**: https://openai.com/blog filtered for `canvas`, `chatgpt`, agent UI updates. + - **Nielsen Norman Group AI articles**: https://www.nngroup.com/topic/artificial-intelligence/. UX-research perspective; evidence-based; slow cadence — surfaces in ~1 of 4 weekly runs but high signal when it does. + - Identify: new agentic UI standards (especially MCP Apps + adjacent SEPs), tool-result rendering patterns, attachment/preview UX, multi-agent attribution patterns, consent/elicitation UX, evidence-based usability findings. + +5. **Frontier model announcements** + - https://www.anthropic.com/news + - https://openai.com/blog (filter: API, agents, tools) + - https://blog.google/technology/google-deepmind/ (Gemini) + - https://ai.meta.com/blog/ (Llama) + - Focus on capability deltas affecting agent harness design: longer context, native tool use changes, prompt caching APIs, computer use, structured output, latency/cost shifts. + +6. 
**Agent harness patterns** + - https://www.anthropic.com/engineering (Claude Code, agent design posts) + - https://docs.claude.com/en/docs/claude-code/release-notes + - LangChain / LlamaIndex / Pydantic-AI release notes — for ideas, not adoption. + +7. **AWS Bedrock pricing + quota** + - https://aws.amazon.com/bedrock/pricing/ + - Note any model price/quota changes that could shift architecture choices in this repo (e.g., model selection in `inference_api`). + +8. **AgentCore SDK / starter-toolkit issues** + - https://github.com/aws/amazon-bedrock-agentcore-sdk-python/issues + - https://github.com/aws/amazon-bedrock-agentcore-starter-toolkit/issues + - Early-signal bugs/limits other users hit before we do. + +9. **Community signal (filtered)** + - HN search: `site:news.ycombinator.com bedrock OR agentcore OR strands OR "claude code"` (last 7 days) + - r/LocalLLaMA, r/MachineLearning — agent-harness critiques and patterns surface here before vendor blogs. + +10. **Anthropic cookbook + courses** + - https://github.com/anthropics/anthropic-cookbook + - https://github.com/anthropics/courses + - Worked examples often outpace docs — especially for caching, tool use, and agent loops. + +11. **Seasonal sources** (only when in window) + - AWS re:Invent (typically late Nov / early Dec) — Bedrock/AgentCore announcements. + - NeurIPS / ICLR / EMNLP agent tracks (when proceedings drop). + - If today's date is not in a known window, skip with "no seasonal sources this week". + +### Internal (this repo) + +13. **Recent commits.** `git log develop --since="7 days ago" --oneline --no-merges`. Cluster by area (`backend/`, `frontend/`, `infrastructure/`). Reverts and high-churn files signal pain points. + +14. **Open PRs + review comments.** `gh pr list --base develop --state open --limit 20`, then `gh pr view --comments` on the top 3 by comment count. Repeated review feedback is a CLAUDE.md or skill-update signal. + +15. 
**GitHub issues opened in last 7 days.** `gh issue list --state open --search "created:>$(date -v-7d +%Y-%m-%d)"`. Bug clustering = refactor signal. + +16. **CI failures.** `gh run list --status=failure --limit 30`. Group by workflow + job. Flaky tests and recurring infra failures. + +17. **Recent CHANGELOG.md / RELEASE_NOTES.md entries** (last 14 days). Used as the "don't re-propose what we just shipped" filter. + +18. **Skill inventory.** `find .claude/skills -name SKILL.md -exec stat -f "%Sm %N" {} \;`. Skills not modified in 60+ days and not visibly referenced in recent PRs are retirement candidates. + +19. **Version-pin lag.** For each tracked dep, fetch latest release version and compute lag: + - Backend: `strands-agents`, `boto3`, `botocore`, `fastapi`, `pydantic`, `bedrock-agentcore`, `mcp` + - Frontend: `@angular/core`, `@analogjs/platform`, `vitest` + - Infrastructure: `aws-cdk-lib`, `constructs` + - Source files: `backend/pyproject.toml`, `frontend/ai.client/package.json`, `infrastructure/package.json`. + +20. **Decisions log** — `docs/kaizen/decisions.md` (if it exists). Items previously declined; don't re-propose without materially new context. + +21. **Recent reviews** — `docs/kaizen/reviews/*.md` (last 1–2). Used to avoid duplicate proposals. + +## Output + +### 1. Primary doc — `docs/kaizen/research/YYYY-MM-DD.md` + +```markdown +# Kaizen Research — [Day, Month D, YYYY] +> Scan window: [Month D – Month D, YYYY] (7 days) +> Web budget: N/50 used (target). + +## TL;DR + +[2-3 sentences. The single most important external move and the single most pressing internal signal. Name the recommended #1 idea here.] + +## External Scan + +### What's moving this week + +[1-2 paragraphs — gestalt. What's the shape of the week? Are vendors converging on a pattern? Anything surprise you?] + +### Notable items by source + +> **Annotation conventions:** +> - `*relevance*:` — impact-on-existing-code lens. What construct/file does this affect? 
What does it replace, simplify, or obsolete? +> - `*unlocks*:` — capability-unlock lens (use when applicable, especially for *new* platform primitives, SDK hooks, or UX patterns). What net-new product capability or enhancement does this make possible? What could we now build that we couldn't before? +> +> Bug-fixes and incremental dep-bumps usually only need `*relevance*`. New platform features, new SDK primitives, new spec capabilities, and new UX patterns usually deserve both. + +#### AWS Bedrock / AgentCore +- **[Item]** — [1-2 sentence summary] — [URL] — *relevance*: [specific construct/file] — *unlocks* (if applicable): [net-new capability or enhancement this enables] + +#### Strands Agents +- **[Item]** — … + +#### Reference repo (aws-samples/sample-strands-agent-with-agentcore) +- **[Commit / change]** — [diff summary] — [URL] — *applicability*: [does our equivalent code do this differently? worth porting?] + +#### MCP ecosystem +- … + +#### FastMCP +- **[Release / change]** — [URL] — *implications for our MCP servers*: [breaking change? new primitive worth adopting?] + +#### Agentic UI/UX patterns +- **[Pattern / release]** — [URL] — *what it is*: [1-2 sentences] — *fit for our stack*: [direct port / pattern-only (Angular equivalent: …) / not applicable] — *where it'd land*: [SSE event / component / route] + +#### Frontier model announcements +- … + +#### Agent harness patterns +- … + +#### Pricing / quota +- … + +#### Community + GitHub issues +- … + +#### Cookbook / courses +- … + +#### Seasonal +- [content, or "Out of window — none scanned this week"] + +### Patterns worth considering + +- **[Pattern]** — [3 sentences: what it is, where it's appearing, fit for this repo] + - **Where**: [examples] + - **Fit**: [would this help? what does it replace? cost to adopt?] 
+ - **Verdict**: [Worth trying / Not a fit / Monitor] + +## Internal Audit + +### Activity (last 7 days) +- **Commits on develop**: N (across N PRs) +- **PRs opened**: N — **merged**: N — **reverted**: N +- **Issues opened**: N — **closed**: N +- **CI failures (workflow → count)**: … + +### Repeated friction signals +- **[Pattern]** (N occurrences) — [evidence: commit SHAs, PR numbers, issue links] + - **Hypothesis**: [root cause] + - **Fix candidate**: [specific change — file + behavior] + +### Version-pin lag +| Dep | Pinned | Latest | Lag | Notes | +|---|---|---|---|---| +| strands-agents | x.y.z | a.b.c | N releases / N days | [breaking? new feature relevant to us?] | + +### Retirement candidates +- **[Skill / file / config]** — [evidence: not modified in N days, replaced by X, never referenced] + +### Risks introduced this week + +- **[Risk]** — [source URL or PR] — *what breaks if we ignore this* + +## Ideas — Top 5 (ranked) + +| # | Idea | Surface | Effort | Impact | Subtracts? | Unlocks? | +|---|---|---|---|---|---|---| +| 1 | [Title] | backend / frontend / infra / cross-cutting | L/M/H | L/M/H | [what it retires, or "addition only — justified because…"] | [net-new capability, or "—" if not applicable] | +| 2 | … | | | | | | + +### 1. [Idea title] +- **Source**: [external item / internal signal — URL or commit SHA] +- **Surface area**: [paths affected] +- **Change**: [what specifically would change] +- **Subtracts**: [what this retires/simplifies, or explicitly: "addition only — justified because…"] +- **Unlocks** (if applicable): [net-new product capability, UX pattern, or enhancement this enables — bulleted if multiple. Omit field when not a capability-unlock item.] +- **Effort × Impact**: [Low/Med/High] × [Low/Med/High] +- **Verdict**: [Worth trying / Not a fit / Monitor] + +### 2. … + +## Take + +[2-4 sentences. Net read of the week. Is the system trending toward the ecosystem or away from it? One change that would matter most. 
What Phil would notice first if shipped.] + +--- + +## Sources Scanned + +| # | Source | URL | Accessed | Items | +|---|---|---|---|---| +| 1 | AWS Bedrock What's New | https://… | 2026-05-10 | 3 | + +## Web Budget + +Used: N / 50 requests (target). +Skipped (unreachable / rate-limited): [list] +Skipped (other): [list with reason] +Notes: [if the cap was exceeded, name the source category that justified it] +``` + +### 2. Handoff — `docs/kaizen/review-queue.md` (rolling, not dated) + +The explicit contract with `kaizen-review-prep`. This skill **appends** new entries under `## Open`. It never edits `## Resolved` (review-prep does the move). + +```markdown +# Kaizen Review Queue + +Items added by `kaizen-research`, consumed by `kaizen-review-prep`. + +## Open + + +### [YYYY-MM-DD] [Idea title] +- **Source**: research/YYYY-MM-DD.md +- **Surface**: backend | frontend | infrastructure | cross-cutting +- **Effort × Impact**: L/M/H × L/M/H +- **Subtracts**: [yes — what / no — justification] +- **Unlocks** (if applicable): [net-new capability, UX pattern, or enhancement this enables; bulleted if multiple. Omit when not a capability-unlock item.] +- **Status**: open + +## Resolved + + +### [YYYY-MM-DD] [Idea title] +- **Source**: research/YYYY-MM-DD.md +- **Decision**: Ship | Decline | Defer until [date] +- **Reasoning**: [Phil's reason, one sentence] +- **Reviewed in**: reviews/YYYY-MM-DD.md +``` + +## How to run + +1. **Bootstrap.** If `docs/kaizen/`, `docs/kaizen/research/`, `docs/kaizen/reviews/`, or `docs/kaizen/review-queue.md` don't exist, create them. The queue starts with the headers above and empty sections. + +2. **Read recent context** (sequential — small reads): + - Last 1-2 files in `docs/kaizen/research/` + - Last 1-2 files in `docs/kaizen/reviews/` + - `docs/kaizen/decisions.md` if present + - `docs/kaizen/review-queue.md` + - Last 14 days of `CHANGELOG.md` and `RELEASE_NOTES.md` + +3. 
**Inventory internal signals** (parallel Bash calls): + - `git log develop --since="7 days ago" --oneline --no-merges` + - `gh pr list --base develop --state open --limit 20` + - `gh issue list --state open --search "created:>$(date -v-7d +%Y-%m-%d)"` + - `gh run list --status=failure --limit 30` + - `find .claude/skills -name SKILL.md -exec stat -f "%Sm %N" {} \;` + - Read pinned versions from the three manifest files. + +4. **Fan out external scan** — spawn parallel `general-purpose` subagents (or `Explore` for sources requiring multiple targeted lookups). One subagent per source category 1–11 above (13 categories total including 4a FastMCP and 4b Agentic UI/UX). Each subagent receives: + - The exact URLs to scan + - Scope: last 7 days + - Web budget for that subagent (3–5 requests soft target) + - Required output: 3-5 bullet items max — title, 1-2 sentence summary, URL, "relevance to this repo" line. + - **Required**: cite URLs; never fabricate. If empty, return "no notable items this week". + + Total budget across subagents targets ≤50. Track centrally; modest overage (~60) is acceptable when surfacing real signal — beyond that, stop and document the skip. + +5. **Version-pin diff.** For each tracked dep, fetch latest release version (WebFetch on the release page or registry equivalent — counts toward budget). Compute lag in releases and days. If a budget hit prevents a check, list the dep under "Skipped". + +6. **Synthesize.** Write the research doc per the shape above. Pull subagent reports verbatim into source sections; write the gestalt narrative (TL;DR, "What's moving", Take) yourself. **Top 5 weighting**: + - **Library-native subtraction** opportunities (where upstream closed a custom-code need) get a subtraction boost. 
+ - **Capability-unlock** items — new platform primitives, SDK hooks, spec capabilities, or UX patterns that enable net-new product surface we couldn't easily build before — rank on their strategic merit and are *not* deprioritized just because they don't intersect existing code. Apply the dual lens from Philosophy: if a feature genuinely unlocks new capability (code-interpreter, persistent agent state, multi-agent UI attribution, etc.), rank it like a fit item, not like a "monitor" item. Resist the temptation to hedge unlock items into "replaces future glue we haven't written" — that under-weights the real story. + - **Concrete fit** UI/UX patterns that match an existing surface (tool-call rendering, attachments, A2A attribution, consent flows) get a fit boost over generic "interesting trend" items. + +7. **Update review queue.** For each Top 5 idea, append a new entry under `## Open` in `docs/kaizen/review-queue.md`. Never touch `## Resolved`. + +8. **Open a PR** — see "PR creation". + +## PR creation + +```bash +DATE=$(TZ=America/Denver date +'%Y-%m-%d') +BRANCH="kaizen/research-${DATE}" + +git checkout -b "$BRANCH" develop +git add docs/kaizen/ +git commit -m "$(cat <2 weeks. +- **Concrete, not aspirational.** "Consider Strands hooks" is too vague. "Add a Strands `BeforeToolCall` hook in `backend/src/agents/main_agent/hooks/` to attribute tokens by tool" is actionable. +- **No edits to source code.** This skill only writes under `docs/kaizen/`. +- **Honest about dry weeks.** A quiet week produces a short doc, not a padded one. +- **Don't re-propose declined ideas** without materially new context. Check `docs/kaizen/decisions.md` and recent reviews. +- **Cite everything.** Every external claim has a URL + access date in the Sources Scanned appendix. +- **Don't auto-merge the PR.** Phil reviews and merges Friday morning. Review-prep runs against the unmerged PR's docs — it reads the file from the working tree, not from `develop`. 
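A note on the skill-inventory command used earlier in this skill: `stat -f "%Sm %N"` is the BSD/macOS form and will not behave the same under GNU coreutils. Below is a minimal portable sketch of the "not modified in 60+ days" check using only `find -mtime`. The function name `stale_skills` and its defaults are illustrative assumptions, not part of the skill.

```shell
# Hypothetical helper: list SKILL.md files last modified more than N days ago.
# find rounds file age down to whole days, so "+60" matches files whose
# age is at least 61 full days.
stale_skills() {
  root="${1:-.claude/skills}"   # directory to scan (default is the skills dir)
  days="${2:-60}"               # age threshold in days
  find "$root" -name SKILL.md -type f -mtime +"$days" -print
}
```

Calling `stale_skills` with no arguments scans `.claude/skills` with the 60-day threshold. A path in the output only says the file is old. Whether recent PRs reference the skill still has to be checked separately.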
+ +## Confirmation + +After the PR is opened, tell Phil: +1. PR URL. +2. Top 1-2 ideas (title + Effort×Impact). +3. One-sentence Take. +4. Web budget used (N/50 target) and any skipped sources. + +Brief. The full doc is on the PR. diff --git a/.claude/skills/kaizen-review-prep/SKILL.md b/.claude/skills/kaizen-review-prep/SKILL.md new file mode 100644 index 00000000..a986b46f --- /dev/null +++ b/.claude/skills/kaizen-review-prep/SKILL.md @@ -0,0 +1,255 @@ +--- +name: kaizen-review-prep +description: Friday late-morning synthesis. Runs ~2 hours after `kaizen-research` the same morning. Consumes this week's research doc, open items in `docs/kaizen/review-queue.md`, last weekend's POC findings (from comments on the previous week's research PR), and recent merges/reverts/CI signal — produces a ranked, decision-oriented agenda. Every item has a Ship / Decline / Defer recommendation. Opens a PR into `develop`. Triggers: "kaizen review prep", "weekly review prep", "friday review", "rank kaizen ideas". +--- + +# Kaizen Review Prep + +Friday late morning, after `kaizen-research` ran earlier the same morning. This skill consolidates this week's research + open queue items + last weekend's POC findings (lifted from PR comments on the previous week's research PR) + recent repo state into a ranked decision agenda. Phil reviews Friday morning, marks ✅/❌/⏸ on each item, ships 1–3 the following week, and POCs the next batch over the weekend. + +## Philosophy + +- **Review is a decision forum, not a status update.** Everything that lands in the output should be either: (a) actionable this week, (b) explicitly deferred with a reason and revisit date, or (c) declined. Nothing is "noted." Noted-and-forgotten is how systems accumulate friction. +- **Subtraction first.** Every proposal ranks against "do nothing" and "retire something instead." If a proposal adds anything, it must explain what existing thing it either replaces or simplifies. 
+- **Dual lens — impact + capability-unlock.** Rank proposals through *two* lenses, not one: (a) **impact on existing code** (does this change, simplify, or obsolete something we already have?) and (b) **capability unlock** (what *new* product capability or UX enhancement does this enable that we couldn't easily build before?). Subtraction-first applies to lens (a). But proposals that genuinely unlock new product surface — code-interpreter sandboxes, persistent agent state, multi-agent UI attribution, new SSE event types that enable inline UI, etc. — must be evaluated on their strategic merit, *not* auto-deferred because they don't intersect existing code. A proposal with no `Subtracts` value but a substantive `Unlocks` value can rank above a low-impact dep-bump. Don't penalize net-new capability for not being a cleanup. +- **Multiple cycles.** Kaizen is small changes, weekly, compounding. If this week's review touches 3 things, next week's will touch 3 different things. Phil doesn't need a grand plan — he needs a reliable weekly cadence. +- **One-week feedback lag is intentional.** Phil reviews Friday → POCs over the weekend → those POC findings surface in the *next* Friday's review-prep as Carried Over items. Don't try to fold same-day POC findings in — they don't exist yet. +- **No edits outside `docs/kaizen/`.** This skill writes one Markdown file under `docs/kaizen/reviews/` and updates `docs/kaizen/review-queue.md` (moves Open → Resolved post-review). It never touches source code, `CLAUDE.md`, or skill files. Those changes happen in separate PRs after the review. + +## When to run + +Friday late morning (~8am MT), ~2 hours after `kaizen-research` runs. Phil reviews both docs Friday morning, picks 1–3 to ship over the coming week, and POCs additional items over the weekend. POC findings from last weekend's POC session surface here as Carried Over items (lifted from PR comments on the *previous* week's research PR — not this week's, which Phil hasn't seen yet). 
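Since the POC findings come from the *previous* week's research PR, one way to locate it is by branch name. The sketch below assumes the prior research run happened exactly seven days earlier and used the `kaizen/research-YYYY-MM-DD` branch convention from `kaizen-research`. The function name `prev_research_branch` is hypothetical.

```shell
# Hypothetical helper: print the research branch name dated 7 days before
# the given YYYY-MM-DD date. Assumes a strict weekly cadence.
prev_research_branch() {
  today="$1"
  if date --version >/dev/null 2>&1; then
    # GNU coreutils date: relative expression after the base date.
    prev="$(TZ=UTC date -d "${today} 7 days ago" +%Y-%m-%d)"
  else
    # BSD / macOS date: parse the input with -f, then shift it with -v.
    prev="$(TZ=UTC date -j -v-7d -f %Y-%m-%d "${today}" +%Y-%m-%d)"
  fi
  printf 'kaizen/research-%s\n' "${prev}"
}

# Example: branch name for the run one week before today (Mountain Time).
prev_research_branch "$(TZ=America/Denver date +%Y-%m-%d)"
```

The result can be passed to `gh pr list --state all --head "<branch>"` to find the PR. If a week was skipped, the lookup returns nothing and the most recent merged or closed research PR should be used instead.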
+ +## Inputs + +1. **Most recent `docs/kaizen/research/YYYY-MM-DD.md`** — Friday's scan. Its Top 5 ideas are the primary candidate list. +2. **`docs/kaizen/review-queue.md`** — `## Open` entries. Includes both this week's ideas (just appended by `kaizen-research`) and any prior-week items that weren't resolved. +3. **Last 1–2 `docs/kaizen/reviews/*.md`** — what was proposed before, what was decided, anything deferred to "revisit by [date]". +4. **PR comments on the *previous* week's kaizen-research PR.** `gh pr view --comments` — Phil's reactions and weekend POC findings are first-class signal. The PR opened *this* morning by `kaizen-research` is too fresh; comments accumulate over the week as Phil POCs ideas. Pick the research PR from one week ago (or the most recent merged/closed kaizen-research PR), not today's. +5. **`docs/kaizen/decisions.md`** (if it exists) — declined items with reasons. Don't re-propose without materially new context. +6. **Recent activity since last review:** + - `git log develop --since="" --oneline --no-merges` — what shipped. + - `gh pr list --base develop --state merged --search "merged:>$(date -v-7d +%Y-%m-%d)"` — what landed. + - `gh run list --status=failure --limit 30` — fresh CI failures. +7. **`CLAUDE.md` + skill inventory** — surface concerns only; never propose unilateral edits to these. +8. **`CHANGELOG.md` / `RELEASE_NOTES.md`** — most recent ~14 days, for the "what shipped this week" celebration block + the don't-re-propose filter. + +## Output + +### 1. Review doc — `docs/kaizen/reviews/YYYY-MM-DD.md` + +```markdown +# Kaizen Review — [Day, Month D, YYYY] +> Prepared HH:MMam MT. Review window: [Month D – D] (7 days). +> Source: research/YYYY-MM-DD.md + review-queue.md (N open items). + +## Week in Review + +[2-4 sentences. What did the week reveal about the system? Use concrete language — +"The aws-samples reference repo introduced a new agent-loop pattern and we're 2 +Strands releases behind" beats "some external changes". 
This is Phil's pulse +check before decisions.] + +## Friction — the week's signal + +### Repeated patterns (≥2 occurrences) +- **[Pattern]** (N times) — [concrete description; quote PR review comments or commit messages where helpful] + - *Hypothesis*: [root cause] + - *Candidate fix*: [specific change — file + behavior] + +### One-offs worth watching +- **[Pattern]** (1 occurrence) — [context] + +### Silence that matters + +- **[Silence]** — [what wasn't used + what that might mean] + +## Proposals — ranked + + + +### 1. [Proposal title] +- **Source**: research/YYYY-MM-DD.md ▸ Top 5 #N | review-queue.md (open since YYYY-MM-DD) | PR comment | direct observation +- **Surface area**: backend / frontend / infrastructure / cross-cutting / docs / skills +- **Change**: [concrete description — what files change, what the new behavior is] +- **Subtracts**: [required field — what this retires, simplifies, or replaces. Or explicitly "addition only — justified because…"] +- **Unlocks** (if applicable): [net-new product capability, UX pattern, or enhancement this enables — bulleted if multiple. Required for proposals where `Subtracts: no — addition only`; the unlock is the justification. Omit when purely a cleanup/dep-bump and not applicable.] +- **Effort**: Low / Med / High +- **Impact**: Low / Med / High +- **POC findings (if Phil tried it)**: [summary or "not POCed"] +- **Ship means**: [specific action — "open PR updating X to do Y" or "retire skill Z"] +- **Decline means**: [what happens instead — usually "keep current behavior, revisit in N weeks"] +- **Recommendation**: Ship / Decline / Defer N weeks — [one-sentence why] + +### 2. [Next proposal] +… + +## Carried Over From Prior Reviews + + +- **[Deferred item]** (deferred YYYY-MM-DD until YYYY-MM-DD) — [original context]. Now due. 
+ +## Retirement Candidates + + + +- **[Candidate]** — [evidence: not modified in N days, not referenced, replaced by X] + +## Risks Acknowledged But Not Acted On + + +- **[Risk]** — [source URL] — *what breaks if ignored* — recommendation: [Address now / Watch until [date] / Accept] + +## What Shipped This Week + + + +- [shipped item] — *why it mattered* + +## Take + +[2-4 sentences. Is the system trending toward trust or toward friction? Is the kaizen +loop catching real signal or generating noise? What's the one change that would +matter most this week if shipped? Don't sugarcoat — if a skill or pattern isn't +pulling its weight, say so.] + +--- + +## Review Protocol (for Phil) + +1. Read Friction (2 min). +2. Scan Proposals — mark ✅ Ship / ❌ Decline / ⏸ Defer on each (3-5 min). +3. Scan Retirement Candidates — same marks (1-2 min). +4. Resolve Carried Over items (1-2 min). +5. Resolve Risks block. +6. Pick 1-3 to ship this week. Decline or defer the rest with a reason. + +Target: 10-15 minutes. + +## Post-review (for Phil — separate PRs) + +- ✅ Ship items → individual feature PRs over the week. The decision is logged in this doc; the implementation lives elsewhere. +- ❌ Decline items → appended to `docs/kaizen/decisions.md` with Phil's reason so future research doesn't re-propose. +- ⏸ Defer items → kept open in `review-queue.md` with a "revisit by [date]"; surface again in the next review when due. + +This skill produces the agenda. Implementation never happens here. +``` + +### 2. Queue update — `docs/kaizen/review-queue.md` + +After Phil reviews and the decisions are logged in the review doc, this skill (or Phil himself, manually) **moves resolved items** from `## Open` to `## Resolved` with a Decision and Reasoning. On a fresh run before Phil has reviewed, the skill leaves Open as-is — only the *prior* review's outcomes get processed for queue movement. + +## How to run + +1. **Bootstrap.** Confirm `docs/kaizen/reviews/` exists; create it if not. + +2. 
**Read inputs** (sequential — small reads): + - Latest file in `docs/kaizen/research/` + - `docs/kaizen/review-queue.md` (full) + - Last 1–2 files in `docs/kaizen/reviews/` + - `docs/kaizen/decisions.md` if present + - Last ~14 days of `CHANGELOG.md` and `RELEASE_NOTES.md` + - `CLAUDE.md` (read-only — for context, not edits) + +3. **Pull PR comments on the latest research PR** (parallel with step 4): + ``` + gh pr list --base develop --state all --search "kaizen/research" --limit 1 --json number,url + gh pr view --comments + ``` + Capture Phil's reactions. POC findings he mentions get folded into proposal entries. + +4. **Pull recent activity** (parallel Bash): + - `git log develop --since="" --oneline --no-merges` + - `gh pr list --base develop --state merged --search "merged:>$(date -v-7d +%Y-%m-%d)" --limit 30` + - `gh run list --status=failure --limit 30` + - `gh issue list --state open --search "created:>$(date -v-7d +%Y-%m-%d)"` + +5. **Process prior-review queue movement.** For each entry in `## Open` that was resolved in the most recent review doc, move it to `## Resolved` with the Decision + Reasoning + Reviewed-in fields. Items with no decision in the prior review stay open. + +6. **Identify Carried Over items.** Scan prior review docs for `Defer N weeks` recommendations whose revisit date has hit. Add those to the new review's Carried Over section. + +7. **Synthesize the review doc** per the shape above. The Proposals list is built from: + - All `## Open` entries in `review-queue.md` (the primary source) + - Any new friction patterns surfaced from PR comments / merged PRs / CI that weren't already in the queue + - Carried Over items + Rank: + - Low-effort × High-impact first. + - **Retirement candidates** get a +1 boost (subtraction bias). + - **Capability-unlock items** (proposals with a substantive `Unlocks` field — new product capability, UX surface, or platform primitive adoption) rank on their strategic merit. 
Do not auto-defer just because `Subtracts: no`. A High-impact unlock can rank above a Low-impact subtraction. + - Items with **POC findings** rank above untested items at the same effort/impact. + +8. **Cap the proposal count at 10.** If more than 10 candidates, defer the lowest-ranked to next week with a note. The review is supposed to take 10-15 minutes, not be exhaustive. + +9. **Open a PR** — see "PR creation". + +## PR creation + +```bash +DATE=$(TZ=America/Denver date +'%Y-%m-%d') +BRANCH="kaizen/review-${DATE}" + +git checkout -b "$BRANCH" develop +git add docs/kaizen/ +git commit -m "$(cat <2 weeks, *that's* the finding — flag it in the Take. +- **Don't re-propose declined items** without materially new context. Cross-check `docs/kaizen/decisions.md` and the last 1–2 reviews. +- **Carried Over is not a graveyard.** Deferred items resurface on their revisit date. No silent deferrals. +- **No fabrication.** If a week was quiet, the review is short. Length tracks signal, not target word count. +- **Never edit `CLAUDE.md` or skill files unilaterally.** A proposal can recommend a change to them, but the change itself is always Phil-approved in review and shipped in a separate PR. +- **Cap at 10 proposals.** A 15-item list defeats the 10-15 min target. + +## Confirmation + +After the PR is opened, tell Phil: +1. PR URL. +2. Top 1–2 proposals (title, Effort×Impact, recommendation). +3. Top 1 retirement candidate if any. +4. One-sentence Take. +5. Estimated review time. + +Brief. Phil reads the full doc on the PR and marks decisions there or in a follow-up commit. diff --git a/.github/workflows/skip-auth-guard.yml b/.github/workflows/skip-auth-guard.yml new file mode 100644 index 00000000..d3fd2b5b --- /dev/null +++ b/.github/workflows/skip-auth-guard.yml @@ -0,0 +1,49 @@ +name: "Guard: SKIP_AUTH must not appear in deployed config" + +# Refuses any PR or push that lets the local-dev SKIP_AUTH=true bypass +# leak into deployed configuration. 
The bypass itself is implemented in +# backend/src/apis/shared/auth/dependencies.py and gated at boot in +# backend/src/apis/app_api/main.py — both of those legitimately reference +# the string and are excluded from the scan. Anywhere else is a leak. + +permissions: + contents: read + +on: + push: + branches: [main, develop] + pull_request: + branches: [main, develop] + +jobs: + scan: + runs-on: ubuntu-24.04 + steps: + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 + + - name: Scan for SKIP_AUTH=true in deployable config + run: | + set -euo pipefail + # Look in CDK, GitHub Actions, Dockerfiles, and any task definitions. + # Exclude the two files that legitimately reference the variable + # (the bypass implementation + its startup guard) and this workflow. + # Catches `SKIP_AUTH=true`, `SKIP_AUTH: true`, `SKIP_AUTH: "true"`, etc. + # Covers shell, Dockerfile, YAML, and CDK TypeScript object-literal styles. + PATTERN='SKIP_AUTH[[:space:]]*[:=][[:space:]]*["'\'']*true' + MATCHES=$(grep -RInE "$PATTERN" \ + infrastructure/lib \ + infrastructure/bin \ + scripts \ + .github/workflows \ + backend/Dockerfile.app-api \ + backend/Dockerfile.inference-api \ + 2>/dev/null \ + | grep -v '^.github/workflows/skip-auth-guard.yml:' \ + || true) + if [ -n "$MATCHES" ]; then + echo "::error::SKIP_AUTH=true is a local-dev bypass and must never appear in deployed config." + echo "Found in:" + echo "$MATCHES" + exit 1 + fi + echo "OK — no SKIP_AUTH=true in deployable config." 
diff --git a/.kiro/specs/bff-middleware-event-loop-blocking/.config.kiro b/.kiro/specs/bff-middleware-event-loop-blocking/.config.kiro new file mode 100644 index 00000000..d4b7813d --- /dev/null +++ b/.kiro/specs/bff-middleware-event-loop-blocking/.config.kiro @@ -0,0 +1 @@ +{"specId": "075212d4-ee53-4e7a-bc6d-9d99dacb7cef", "workflowType": "requirements-first", "specType": "bugfix"} diff --git a/.kiro/specs/bff-middleware-event-loop-blocking/bugfix.md b/.kiro/specs/bff-middleware-event-loop-blocking/bugfix.md new file mode 100644 index 00000000..4e345403 --- /dev/null +++ b/.kiro/specs/bff-middleware-event-loop-blocking/bugfix.md @@ -0,0 +1,78 @@ +# Bugfix Requirements Document + +## Introduction + +Since the `v1.0.0-beta.24` BFF Token Handler release (commit `258193d`, deployed 2026-05-06), the app-api service has exhibited severe tail-latency and ingress stalls on page loads. Angular's refresh fan-out (~8 concurrent endpoints — `/auth/session`, `/models`, `/tools/`, `/files/quota`, `/users/me/permissions`, `/sessions`, `/assistants`, `/connectors/`) sees requests time out or exceed the ALB 60s idle cap. Observed signals over the last 24h on the affected fleet: + +- Two `HTTPCode_ELB_504_Count` events (13:37 and 14:40 UTC) — driven by ALB idle timeout, not target 5xx. +- `TargetResponseTime` p-max of 15.6s at 15:25 UTC; `/files/quota` outliers reaching ~80s. +- Endpoint p95s under load: `/models` ~141ms, `/tools/` ~289ms, `/users/me/permissions` ~239ms, `/sessions` ~188ms. +- ECS task at 0.7% CPU / 23% memory. No DDB throttling (0 `ReadThrottleEvents` / `WriteThrottleEvents` across all 23 tables). Zero target 5xx. + +The new `SessionRefreshMiddleware` runs on every request carrying a `__Host-bff_session` cookie. 
Its hot-path calls block the single-worker uvicorn event loop on synchronous boto3 I/O (DynamoDB + Cognito), its cache TTL and its sliding-renewal throttle are aligned on the same 60s boundary, and the per-session lock that coalesces refreshes does not coalesce the broader session-resolve path. The result is ~16 serialized blocking AWS calls at the front of every page load per active user, once per minute — with no concurrency slack because the service runs one uvicorn worker in one ECS task. + +Impact: degraded UX for every logged-in user (spinners, stale data, failed tab refreshes), 504s visible to users, and a fragile service posture where any single slow AWS call stalls every in-flight request on the same task. + +## Bug Analysis + +### Current Behavior (Defect) + +What currently happens under the new middleware on cookie-bearing requests: + +1.1 WHEN `SessionRepository.get`, `touch_last_seen`, `update_tokens`, `put`, or `delete` is awaited inside a request handler THEN the uvicorn event loop blocks for the full DynamoDB round-trip because the methods are declared `async def` but call boto3 synchronously with no `asyncio.to_thread` offload and no aioboto3. + +1.2 WHEN `SessionRefreshMiddleware._resolve_session` invokes `CognitoRefreshClient.refresh` THEN the uvicorn event loop blocks for the full `cognito-idp:initiate_auth` round-trip because `CognitoRefreshClient.refresh` is a plain `def` called without threadpool offload, and it runs while the per-session `asyncio.Lock` from `get_session_lock()` is held. + +1.3 WHEN N concurrent requests for the same `session_id` arrive with no valid cached `SessionRecord` THEN the middleware issues N independent DynamoDB `get_item` calls because the existing per-session lock only coalesces the refresh exchange, not the upstream unseal → `SessionCache.get` → `SessionRepository.get` sequence. 
+ +1.4 WHEN the `SessionCache` TTL elapses at the same moment the sliding-renewal throttle window elapses THEN a single request triggers both a DynamoDB `get_item` and a DynamoDB `update_item` on its critical path because `_DEFAULT_REFRESH_LEEWAY_SECONDS` and `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` are both `60` in `sessions_bff/config.py`, so cache expiry and throttle expiry are aligned. + +1.5 WHEN a request passes `SessionRefreshMiddleware` and a slide is warranted THEN the caller's response waits for `touch_last_seen` to complete against DynamoDB because `_maybe_slide` `await`s the write inline on the request path, even though the code is documented to swallow failures ("Don't fail the request if the slide-write fails"). + +1.6 WHEN the app-api container starts THEN the service has no concurrency slack because the `CMD` in `backend/Dockerfile.app-api` launches a single uvicorn worker with no `--workers` flag and the service runs as a single ECS task — one blocked event loop stalls all ingress, consistent with low CPU/memory while latency climbs. + +1.7 WHEN Angular fires its ~8-endpoint page-load fan-out with a session cookie and the per-session cache window has just elapsed THEN ~16 serialized blocking DynamoDB operations (per-coroutine `get_item` plus per-coroutine `update_item`) accumulate at the front of the page load because each coroutine independently observes cache-miss and past-throttle on its local `SessionRecord` copy and each runs its own blocking AWS I/O on the event loop thread. + +### Expected Behavior (Correct) + +What should happen instead, keeping the same middleware surface and contracts: + +2.1 WHEN `SessionRepository.get`, `touch_last_seen`, `update_tokens`, `put`, or `delete` is awaited inside a request handler THEN the method SHALL execute its underlying boto3 call in a threadpool (via `asyncio.to_thread` or an equivalent offload) so the uvicorn event loop continues scheduling other coroutines for the full DynamoDB round-trip. 
+ +2.2 WHEN `SessionRefreshMiddleware._resolve_session` invokes `CognitoRefreshClient.refresh` THEN the Cognito `initiate_auth` call SHALL execute in a threadpool so the uvicorn event loop continues scheduling other coroutines — including coroutines for different `session_id`s — while the per-session `asyncio.Lock` is held. + +2.3 WHEN N concurrent requests for the same `session_id` arrive with no valid cached `SessionRecord` THEN the middleware SHALL coalesce them so at most one DynamoDB `get_item` is issued per `session_id` per cache window; the remaining N−1 requests SHALL await a shared in-process future keyed by `session_id` and consume its result. + +2.4 WHEN the `SessionCache` TTL elapses THEN a cache miss SHALL NOT imply a sliding-renewal DynamoDB write on the same request because `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` SHALL be a strict multiple of `_DEFAULT_REFRESH_LEEWAY_SECONDS` (e.g. 300s versus 60s) so cache-expiry and throttle-expiry are de-aligned. + +2.5 WHEN a request passes `SessionRefreshMiddleware` and a slide is warranted THEN the caller's response SHALL NOT wait for `touch_last_seen` because `_maybe_slide` SHALL dispatch the DynamoDB write as a detached `asyncio.Task` (fire-and-forget) and SHALL return the computed `Max-Age` to the response path immediately. + +2.6 WHEN the app-api container starts THEN the service SHALL have concurrency slack such that a single blocked event loop does not stall all ingress — uvicorn SHALL run with `--workers >= 2` (at current `cpu=512`) and/or the ECS service SHALL run `>= 2` tasks; the chosen configuration SHALL be deployed. 
+ +2.7 WHEN Angular fires its ~8-endpoint page-load fan-out with a session cookie and the per-session cache window has just elapsed THEN the middleware SHALL issue at most 1 DynamoDB `get_item` and at most 1 DynamoDB `update_item` per `session_id` per cache window across the fan-out (not ~16 blocking calls), and all 8 responses SHALL return without any one of them serially waiting on another's AWS I/O to complete. + +### Unchanged Behavior (Regression Prevention) + +Existing guarantees that MUST continue to hold after the fix: + +3.1 WHEN `BFFConfig.is_enabled()` returns `False` (env vars unset) THEN `SessionRefreshMiddleware` SHALL CONTINUE TO short-circuit as a pass-through with no AWS calls (dormant pass-through invariant preserved). + +3.2 WHEN a request arrives without a `__Host-bff_session` cookie THEN `SessionRefreshMiddleware` SHALL CONTINUE TO pass the request through without unsealing, cache lookup, or DynamoDB access. + +3.3 WHEN an unrecoverable cookie is detected (bad seal, missing DynamoDB row, expired TTL, or terminal `CognitoRefreshError`) THEN the middleware SHALL CONTINUE TO clear both `__Host-bff_session` and `__Host-bff_csrf` on the response. + +3.4 WHEN `_maybe_slide` returns a non-`None` `Max-Age` THEN the response's `Set-Cookie` headers for `__Host-bff_session` and `__Host-bff_csrf` SHALL CONTINUE TO use that exact value (the cookie re-emit contract between `_maybe_slide` and `_reemit_cookies` is preserved under fire-and-forget dispatch of the DynamoDB write). + +3.5 WHEN N concurrent requests for the same `session_id` cross the refresh-leeway window at the same moment THEN exactly one `cognito-idp:initiate_auth` exchange SHALL CONTINUE TO be issued per `session_id` per leeway window (the existing refresh-storm coalescing via `get_session_lock(session_id)` is preserved end-to-end). 
+ +3.6 WHEN `CookieCodec._ensure_cipher` is called on a hot request THEN the AES-GCM cipher SHALL CONTINUE TO be served from the process-wide `get_default_codec()` singleton with no per-request `kms:GenerateDataKey` call. + +3.7 WHEN `resolve_bff_client_secret` is called on a hot request THEN the BFF Cognito app-client secret SHALL CONTINUE TO be served from the module-scope cache with no per-request `secretsmanager:GetSecretValue` call. + +3.8 WHEN `CSRFMiddleware` validates an unsafe-method request with `request.state.bff_session` set THEN it SHALL CONTINUE TO accept/reject using the existing in-memory HMAC double-submit check with no new I/O introduced on its path. + +3.9 WHEN the absolute-lifetime cap (`created_at + absolute_lifetime_seconds`) has passed THEN `_maybe_slide` SHALL CONTINUE TO return `None` so no further cookie re-emission or DynamoDB slide is issued. + +3.10 WHEN a refresh rotates the Cognito refresh token and the `update_tokens` persist fails THEN the middleware SHALL CONTINUE TO invalidate the cache entry and clear the cookie so the user is forced to re-authenticate (fail-closed rotation behavior preserved). + +3.11 WHEN the BFF cookie seal fails to decode THEN the middleware SHALL CONTINUE TO treat every decode failure identically (no timing or response-shape oracle introduced by the new offload or single-flight paths). 
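
Requirement 2.3 (and the fan-out bound in 2.7) describes a per-`session_id` single-flight. The following is a minimal sketch of that idea, not the project's implementation: the `_inflight` registry, the `demo_fan_out` helper, and the fake loader are assumptions made for illustration, and the sketch assumes all callers share one event loop.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

# Hypothetical in-process registry: session_id -> Future of the in-flight load.
_inflight: Dict[str, asyncio.Future] = {}


async def resolve_once(session_id: str, loader: Callable[[], Awaitable[Any]]) -> Any:
    """Run loader() at most once per session_id among concurrent callers."""
    existing = _inflight.get(session_id)
    if existing is not None:
        # Follower: share the leader's result or exception. shield() keeps a
        # cancelled follower from cancelling the shared future.
        return await asyncio.shield(existing)

    # Leader: no await between the lookup and this assignment, so on a single
    # event loop no other coroutine can register a competing future.
    future = asyncio.get_running_loop().create_future()
    _inflight[session_id] = future
    try:
        result = await loader()
        future.set_result(result)
        return result
    except Exception as exc:
        future.set_exception(exc)
        future.exception()  # mark as retrieved so asyncio does not warn
        raise
    finally:
        # Always drop the entry so a failed load is not cached.
        _inflight.pop(session_id, None)
        if not future.done():
            future.cancel()  # leader was cancelled; wake any followers


async def demo_fan_out(n: int) -> Tuple[int, List[Any]]:
    """Simulate n concurrent requests for one session; count backend loads."""
    calls = 0

    async def fake_get_item() -> Dict[str, str]:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # placeholder for the DynamoDB round-trip
        return {"session_id": "s1"}

    results = await asyncio.gather(
        *(resolve_once("s1", fake_get_item) for _ in range(n))
    )
    return calls, list(results)


if __name__ == "__main__":
    print(asyncio.run(demo_fan_out(8)))
```

In this sketch the eight simulated requests trigger a single loader call and all receive the same record. A real version would also need to decide how the registry interacts with `SessionCache` population.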
diff --git a/.kiro/specs/bff-middleware-event-loop-blocking/code-review-report.md b/.kiro/specs/bff-middleware-event-loop-blocking/code-review-report.md new file mode 100644 index 00000000..f30360f5 --- /dev/null +++ b/.kiro/specs/bff-middleware-event-loop-blocking/code-review-report.md @@ -0,0 +1,245 @@ +# Code Review Report: BFF Middleware Event-Loop Blocking Bugfix + +**Branch**: `fix/bff-middleware-event-loop-blocking` +**PR**: [#264](https://github.com/Boise-State-Development/agentcore-public-stack/pull/264) +**Commits reviewed**: +- `db3d2e06` — Initial fix (tasks 3.1–3.7) +- `dd91d6fd` — Test polling adjustment +- `78891e2e` — Strong-reference fix for fire-and-forget tasks + +This report reviews each technical decision in the bugfix against authoritative external sources (Python docs, AWS docs, canonical patterns from the Python ecosystem) to demonstrate the approach was sound. Where my initial implementation missed a production nuance, I flag it and cite the source that caught it. + +--- + +## 1. Offloading sync boto3 to threads via `asyncio.to_thread` + +**Change**: `SessionRepository.{get,put,update_tokens,touch_last_seen,delete}` and `CognitoRefreshClient.refresh` now wrap their boto3 calls in `await asyncio.to_thread(...)`. + +**Why this is correct**: + +The official Python documentation for [`asyncio.to_thread()`](https://docs.python.org/3/library/asyncio-task.html#asyncio.to_thread) describes it as: + +> This coroutine function is primarily intended to be used for executing IO-bound functions/methods that would otherwise block the event loop if they were run in the main thread. + +The docs state explicitly that `asyncio.to_thread` is the idiomatic solution for IO-bound blocking work — which is exactly what boto3's synchronous HTTP calls to DynamoDB and Cognito are. They also note: + +> Due to the GIL, asyncio.to_thread() can typically only be used to make IO-bound functions non-blocking. + +boto3 is a well-known offender in this exact scenario. 
[Stack Overflow](https://stackoverflow.com/questions/72092993/i-want-to-use-boto3-in-async-function-python) recommends two options for using boto3 in async code: (a) use `aioboto3`/`aiobotocore`, or (b) wrap boto3 in `asyncio.to_thread`/`loop.run_in_executor`. Both are valid; `to_thread` is the lower-friction choice because it doesn't introduce a new async SDK with a different API surface. + +The existing codebase had a documented awareness of this gap — the `SessionRepository` docstring before the fix acknowledged that boto3 runs on the event loop thread. The fix simply closes that gap without reshaping the API. + +**Alternative considered (not taken)**: Replacing boto3 with [`aioboto3`](https://pypi.org/project/aioboto3/). Rejected because: (a) adds a new dependency, (b) changes method signatures across the repository (e.g. `async with table.get_item(...)` vs `table.get_item(...)`), (c) the per-method offload is a surgical change with no ripple effect on callers. The spec explicitly called for "targeted, minimal-surface intervention that keeps the middleware's public contracts intact." + +**Verdict**: ✅ Correct approach, supported by official Python docs. + +--- + +## 2. Per-session single-flight via `asyncio.Future` + +**Change**: New `backend/src/apis/shared/sessions_bff/single_flight.py` exports `async def resolve_once(session_id, loader_coro_factory)`. The first caller per `session_id` creates a Future, runs the loader, sets the result; concurrent callers await the same Future. + +**Why this is correct**: + +This is the canonical **request coalescing** / **single-flight** pattern. The Python ecosystem recognizes it as the standard solution for N-concurrent-callers-one-backend-hit. 
From [OneUptime's "How to Reduce DB Load with Request Coalescing in Python"](https://oneuptime.com/blog/post/2026-01-23-request-coalescing-python/view): + +> Request coalescing, also known as request deduplication or single-flighting, is a technique where concurrent requests for the same resource are merged into a single backend call. +> +> _(paraphrased for licensing compliance)_ + +And from [SystemDesignSandbox](https://www.systemdesignsandbox.com/learn/hot-key-cache-stampede), "request coalescing" is listed as a textbook solution to fan-out amplification on hot keys / concurrent cache misses. + +The name comes from Go's `golang.org/x/sync/singleflight` package, which is the reference implementation of this pattern. Python's `asyncio.Future` is the natural primitive for it: multiple coroutines can `await` the same Future, and setting the result/exception wakes all of them. + +**Why a Future and not an `asyncio.Lock`**: The existing `get_session_lock(session_id)` in `lock.py` already serializes the Cognito refresh exchange. A lock would serialize the fan-out (N callers run sequentially through one DDB call), but we want to **coalesce** it (N callers share one result). A Future is the right primitive for coalescing. The design doc called this out: + +> The fix needs a different primitive — an `asyncio.Future` stored in a per-session slot that N waiters can await — because a lock would serialize N requests through one DDB call instead of consolidating them to one call. + +**Implementation notes**: +- The registry is a plain `dict` guarded by a `threading.Lock` with double-checked locking — mirrors the pattern in `lock.py` which is already approved by the team. +- Leader always removes the entry in a `try/except/finally` pattern so a failed loader doesn't sticky-cache. 
+- Exceptions propagate to all waiters via `future.set_exception(exc)`; the leader additionally calls `future.exception()` to silence the "Future exception was never retrieved" warning if no follower attached. + +**Verdict**: ✅ Canonical pattern, implemented against Python's standard asyncio primitives. + +--- + +## 3. De-aligning cache TTL and slide-throttle windows + +**Change**: `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` raised from 60s to 300s while `_DEFAULT_REFRESH_LEEWAY_SECONDS` stays at 60s. + +**Why this is correct**: + +Aligned TTL boundaries are the textbook cause of **cache stampede / thundering herd**. Multiple sources document this: + +- [Redis (antirez) on cache stampedes](https://redis.antirez.com/fundamental/cache-stampede-prevention.html): a popular cache key expiring causes many concurrent requests to regenerate it, overwhelming the backend. +- [Aman Maharshi, "Cache Stampede: Solving the Thundering Herd Problem"](https://www.amanmaharshi.com/blog/cache-stampede): "Synchronized Expiration" — caching N items at once with one TTL causes them all to expire at the same second, creating a spike. +- [softwarepatternslexicon.com "Thundering Herds and Backend Pressure"](https://softwarepatternslexicon.com/caching-patterns-and-invalidation/consistency-and-stampede-control/thundering-herds-backend-pressure/): "A synchronized TTL boundary... can create a wave of misses that ripples into databases." + +Our case was a miniature version of this: whenever `SessionCache` TTL (60s) elapsed at the same moment as the slide-throttle window (60s), a single request paid **both** a `get_item` AND an `update_item` on its critical path. Making the throttle a strict multiple (300s, 5× the leeway) means a cache miss no longer implies a slide write on the same request: the cache still expires every 60s, but the slide throttle lapses only every 300s, so at most about one cache miss in five can also pay an `update_item`. The two boundaries can still line up on that fifth window; the change reduces how often they coincide rather than ruling it out.
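
To make the arithmetic concrete, here is a toy model, not the middleware's actual logic. It assumes both timers start at `t = 0` and reset on a fixed grid, and it counts the seconds in one hour at which a cache expiry and a throttle expiry land together. The function name and the fixed-grid assumption are illustrative only.

```python
def requests_paying_both(cache_ttl_s: int, throttle_s: int, horizon_s: int = 3600) -> int:
    """Count instants in (0, horizon_s] where both windows expire together.

    Toy model: both timers start at t=0 and reset on a fixed grid, so a
    cache expiry happens at every multiple of cache_ttl_s and a throttle
    expiry at every multiple of throttle_s.
    """
    return sum(
        1
        for t in range(1, horizon_s + 1)
        if t % cache_ttl_s == 0 and t % throttle_s == 0
    )


if __name__ == "__main__":
    # Aligned windows: every cache miss also pays the slide write.
    print("60s cache / 60s throttle :", requests_paying_both(60, 60))
    # Throttle at 5x: only every fifth cache miss can also pay the write.
    print("60s cache / 300s throttle:", requests_paying_both(60, 300))
```

Under these assumptions the number of requests that pay both a read and a write in an hour drops from 60 to 12. Real traffic does not sit on a fixed grid, so the numbers are indicative only.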
+ +**Why 300s and not some other value**: The design doc explicitly says "strict multiple of refresh leeway (e.g. 300s vs 60s)". 300s is 5× 60s. The key property is that `throttle % leeway == 0` AND `throttle > leeway` — the multiplier could be 2, 5, 10, etc. 5× was chosen because it matches industry practice of caching session metadata for minutes, not seconds. + +**Related patterns we didn't need but recognized**: TTL jitter (randomizing per-key expiry) is another standard mitigation. We don't need it because we only have one key class (sessions) and the single-flight already coalesces; jitter would add complexity without bounded benefit. + +**Verdict**: ✅ Direct application of a well-documented cache-stampede prevention technique. + +--- + +## 4. Fire-and-forget slide-write via `asyncio.create_task` + +**Change**: `_maybe_slide` now dispatches `touch_last_seen` as a detached task rather than awaiting it inline. + +**Why the approach is correct**: + +The inline `await` was causing the response path to wait on a DDB round-trip for a write that was already documented to swallow failures — i.e. the response didn't actually need the write to complete. That's the textbook scenario for fire-and-forget. + +**What I got wrong initially**: I wrote `asyncio.create_task(self._slide_write_task(...))` without holding a reference to the returned Task. This is a **known dangerous anti-pattern**. The [Python docs for `asyncio.create_task`](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) contain this explicit warning: + +> **Important** +> +> Save a reference to the result of this function, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks. A task that isn't referenced elsewhere may get garbage collected at any time, even before it's done. 
+> +> For reliable "fire-and-forget" background tasks, gather them in a collection: +> +> ```python +> background_tasks = set() +> for i in range(10): +> task = asyncio.create_task(some_coro(param=i)) +> background_tasks.add(task) +> task.add_done_callback(background_tasks.discard) +> ``` + +The fix in commit `78891e2e` applies this exact pattern: `self._slide_tasks: set[asyncio.Task]` on the middleware instance, with `task.add_done_callback(self._slide_tasks.discard)` to prevent the set from leaking. + +**Multiple external sources reinforce this**: +- [SuperFastPython, "Asyncio Disappearing Task Bug"](http://superfastpython.com/asyncio-disappearing-task-bug/): "Save a reference to the result of this function, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks." +- [Michael Kennedy, "Fire and forget (or never) with Python's asyncio"](https://mkennedy.codes/posts/fire-and-forget-or-never-with-python-s-asyncio/): "create_task() can silently garbage collect your fire-and-forget tasks starting in Python 3.12 — they may never run. The fix: store task references in a set and register a done_callback to clean them up." +- [Ruff's `RUF006` lint rule ("asyncio-dangling-task")](https://docs.astral.sh/ruff/rules/asyncio-dangling-task/) flags exactly this anti-pattern automatically. +- [Runebook, "Replacing Low-Level Task Registration"](http://runebook.dev/en/docs/python/library/asyncio-extending/asyncio._register_task): describes the weak-reference behavior and the risk of collection mid-execution. + +**Why the bug surfaced only on CI**: Python 3.12 made garbage collection more aggressive. On my local Python 3.13 (different GC tuning, different scheduler timing), the task usually completed before GC ran. On CI's Python 3.12 runners, the GC occasionally collected the task first, causing a missing `update_item`. Hypothesis caught it as `FlakyFailure` — failed once, passed on retry — which is the signature of exactly this kind of race. 
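
A middleware-shaped version of the same pattern might look like the sketch below. It is an illustration under assumptions, not the code in commit `78891e2e`: the `SlideDispatcher` class, the `maybe_slide` signature, and the stubbed `_touch_last_seen` are hypothetical; only the `_slide_tasks` set and the done-callback cleanup mirror what the report describes.

```python
import asyncio
from typing import Set, Tuple


class SlideDispatcher:
    """Hypothetical stand-in for the middleware's slide path."""

    def __init__(self) -> None:
        # Strong references keep detached tasks alive until they finish.
        self._slide_tasks: Set[asyncio.Task] = set()
        self.writes = 0

    async def _touch_last_seen(self, session_id: str) -> None:
        await asyncio.sleep(0.01)  # placeholder for the DynamoDB update_item
        self.writes += 1

    async def _slide_write_task(self, session_id: str) -> None:
        try:
            await self._touch_last_seen(session_id)
        except Exception:
            pass  # a failed slide write must not fail the request

    def maybe_slide(self, session_id: str, max_age: int) -> int:
        """Dispatch the write in the background and return Max-Age at once."""
        task = asyncio.create_task(self._slide_write_task(session_id))
        self._slide_tasks.add(task)
        # Drop the reference once the task completes so the set cannot grow.
        task.add_done_callback(self._slide_tasks.discard)
        return max_age


async def demo() -> Tuple[int, int, int, int, int]:
    dispatcher = SlideDispatcher()
    max_age = dispatcher.maybe_slide("s1", 3600)
    pending_at_return = len(dispatcher._slide_tasks)
    writes_at_return = dispatcher.writes
    # Wait for the background write, then yield once for done callbacks.
    await asyncio.gather(*list(dispatcher._slide_tasks))
    await asyncio.sleep(0)
    return (
        max_age,
        pending_at_return,
        writes_at_return,
        dispatcher.writes,
        len(dispatcher._slide_tasks),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The caller gets `Max-Age` back before the write has happened, the task stays referenced while it runs, and the set is empty again once it completes.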
+ +**Verdict**: ✅ Fire-and-forget is the right approach; ❌ my initial implementation had a canonical asyncio bug; ✅ the fix matches the Python docs' recommended pattern verbatim. + +--- + +## 5. ECS `desiredCount` raised from 1 to 2 + +**Change**: `infrastructure/cdk.context.json` `appApi.desiredCount: 1 → 2` in the production context. + +**Why this is correct**: + +The issue was a single point of failure at the deployment layer: one ECS task running one uvicorn worker means any slow AWS call on that task's event loop halts every in-flight request. AWS's own [ECS availability best practices](https://aws.amazon.com/blogs/containers/amazon-ecs-availability-best-practices/) document explicitly recommends multi-task deployments for availability. + +Independently from the event-loop issue, single-task services fail basic availability requirements: if the one task crashes, restarts, or becomes unreachable, the service has zero capacity until a replacement boots — which for Fargate is tens of seconds to minutes. Two tasks means rolling restarts always keep one healthy instance serving traffic. + +This change is belt-and-suspenders: even if the event-loop-blocking fix is 100% correct, running `desiredCount: 1` would still be a latent availability liability. Raising to 2 gives us: +1. Concurrency slack so a single stuck loop can't halt all ingress (primary rationale). +2. Rolling deploy safety (automatic secondary benefit). +3. Resilience to a single task's AZ failure (automatic tertiary benefit). + +`maxCapacity` stays at 10 so auto-scaling can still burst upward under load. + +**Verdict**: ✅ Standard AWS multi-task posture, with a specific and documented trigger in the bug analysis. + +--- + +## 6. Lock scope preservation (existing `get_session_lock`) + +**Change**: None — the `async with get_session_lock(session_id)` scope around the Cognito refresh exchange is deliberately preserved exactly as it was. 
+ +**Why this is correct**: + +The existing lock exists for a specific purpose: the Cognito refresh-token rotation flow invalidates the previous refresh token as soon as a new one is issued. If N concurrent requests all call `initiate_auth` with the same refresh token, only the first succeeds; the rest receive the token-rotated-out error and have to be failed or retried. Serializing the exchange with a per-session lock prevents this race. + +The new single-flight primitive sits **upstream** of this lock — it coalesces the resolve path (cache, repo.get, needs_refresh decision) so typically only the leader ever reaches the Cognito refresh at all. But in the edge case where the leader decides refresh is NOT needed but a follower does (race with TTL expiry), the existing lock is still needed as a defense-in-depth. The design doc was explicit about not moving or widening the lock. + +The preservation test `test_3_5_refresh_storm_coalesces_to_single_initiate_auth` verifies that exactly one `cognito-idp:initiate_auth` fires per 10 concurrent same-session requests — which is the original contract, preserved end-to-end. + +**Verdict**: ✅ Correctly preserved. The contract the existing lock was enforcing continues to hold. + +--- + +## 7. Testing approach + +**Property-Based Tests over scenario-based tests**: Used `hypothesis` for: +- Sub-conditions that generalize over a domain (fan-out size, request shapes across the non-buggy input domain). +- Preservation properties that must hold "for all" inputs meeting certain criteria. + +This is the approach the project's Kiro spec workflow calls for (Property-Based Testing Integration section). Property-based testing for preservation invariants is particularly strong because it catches edge cases in the fix (single-flight exception paths, background task races, Set-Cookie attribute sets) that scenario tests would miss. 
+ +**Bug Condition exploration test FAILS on unfixed code, PASSES on fixed code**: This is the core methodology of the bugfix workflow — the test serves as the executable specification. 10 of 12 sub-conditions failed on unfixed code (proving the bug); all 12 pass after the fix. + +**What the tests caught that scenario tests would have missed**: +- Hypothesis's `FlakyFailure` detection caught the `asyncio.create_task` GC race on CI — a scenario test at a fixed seed likely wouldn't have reproduced it at all. + +**Verdict**: ✅ Correct methodology; the tests caught a real bug I introduced. + +--- + +## 8. What I did well + +1. **Read before writing**: traced the full middleware path, repository, lock, and config before proposing changes. +2. **Preservation-first**: wrote the preservation test suite on unfixed code before implementing any fix, so regressions surface immediately. +3. **Separate primitive for separate concern**: new `single_flight.py` module instead of overloading `lock.py` — keeps each primitive's contract clear. +4. **Minimal-surface interventions**: no new async SDK, no public API changes, no lock-scope shift. + +## 9. What I got wrong (and corrected) + +1. **Missed the `asyncio.create_task` strong-reference requirement** on the first pass. The Python docs warn about this in bold, Ruff has a lint rule for it, and multiple blog posts cover it. This is directly traceable to me not running the full CI script locally before pushing — my local Python 3.13 GC didn't hit the race. +2. **Initial CI fix was a band-aid** (polling on the test side) rather than a root-cause fix (strong reference in the middleware). The polling remains as defensive depth but the real fix is the set-based reference in commit `78891e2e`. + +## 10. 
Root cause summary + +The fix addresses four independent defects in `SessionRefreshMiddleware` plus one deployment-level gap (a single ECS task), each with a canonical industry solution: + +| Defect | Canonical fix | Authority | +|---|---|---| +| Sync boto3 blocks event loop | `asyncio.to_thread` | [Python docs](https://docs.python.org/3/library/asyncio-task.html#asyncio.to_thread) | +| N concurrent same-session → N DDB calls | Single-flight / request coalescing via `asyncio.Future` | [OneUptime](https://oneuptime.com/blog/post/2026-01-23-request-coalescing-python/view), Go's `singleflight` | +| Aligned TTL = cache stampede | De-align boundaries (strict multiple) | [Redis on cache stampedes](https://redis.antirez.com/fundamental/cache-stampede-prevention.html), [softwarepatternslexicon.com](https://softwarepatternslexicon.com/caching-patterns-and-invalidation/consistency-and-stampede-control/thundering-herds-backend-pressure/) | +| Response waits on non-critical DDB write | Fire-and-forget task with strong reference | [Python docs on `asyncio.create_task`](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) | +| Single ECS task = no concurrency slack | `desiredCount >= 2` | [AWS ECS availability best practices](https://aws.amazon.com/blogs/containers/amazon-ecs-availability-best-practices/) | + +Each fix is directly traceable to a published authority. The overall shape (coalesce upstream, offload sync I/O to threads, dispatch non-critical writes asynchronously, stagger TTLs, add replica slack) is the standard stack of techniques for keeping an ASGI service's event loop free under concurrent load. + +## 11. Verification status + +- **Local**: `scripts/stack-app-api/test.sh` and `scripts/stack-inference-api/test.sh` both pass with 2459 tests inside the `agentcore-dev` container. +- **Bug condition exploration suite**: 12/12 pass on fixed code (0/12 passed before fix). +- **Preservation suite**: 19/19 pass on both unfixed and fixed code (baseline intact). 
+- **Single-flight primitive unit tests**: 6/6 pass. +- **CDK unit tests**: 25/25 pass for `app-api-stack` including new production-context `DesiredCount: 2` assertion. +- **CI PR #264**: pushed commit `78891e2e` with the strong-reference fix; awaiting CI verification. + +--- + +## Sources consulted + +Primary: +- [Python 3 docs: `asyncio.to_thread`](https://docs.python.org/3/library/asyncio-task.html#asyncio.to_thread) +- [Python 3 docs: `asyncio.create_task` (Important: Save a reference...)](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) + +Supporting (asyncio task lifecycle): +- [SuperFastPython: Asyncio Disappearing Task Bug](http://superfastpython.com/asyncio-disappearing-task-bug/) +- [Michael Kennedy: Fire and forget (or never) with Python's asyncio](https://mkennedy.codes/posts/fire-and-forget-or-never-with-python-s-asyncio/) +- [Ruff RUF006: asyncio-dangling-task](https://docs.astral.sh/ruff/rules/asyncio-dangling-task/) + +Supporting (boto3 + async): +- [Stack Overflow: I want to use boto3 in async function, python](https://stackoverflow.com/questions/72092993/i-want-to-use-boto3-in-async-function-python) +- [aioboto3 on PyPI](https://pypi.org/project/aioboto3/) — considered and rejected as too invasive + +Supporting (cache stampede / thundering herd): +- [Redis on cache stampede prevention](https://redis.antirez.com/fundamental/cache-stampede-prevention.html) +- [softwarepatternslexicon.com: Thundering Herds and Backend Pressure](https://softwarepatternslexicon.com/caching-patterns-and-invalidation/consistency-and-stampede-control/thundering-herds-backend-pressure/) +- [Aman Maharshi: Cache Stampede: Solving the Thundering Herd Problem](https://www.amanmaharshi.com/blog/cache-stampede) + +Supporting (request coalescing): +- [OneUptime: How to Reduce DB Load with Request Coalescing in Python](https://oneuptime.com/blog/post/2026-01-23-request-coalescing-python/view) +- [SystemDesignSandbox: Hot Keys and Cache 
Stampedes](https://www.systemdesignsandbox.com/learn/hot-key-cache-stampede) + +Supporting (ECS availability): +- [AWS ECS availability best practices](https://aws.amazon.com/blogs/containers/amazon-ecs-availability-best-practices/) + +Content was paraphrased for compliance with licensing restrictions; verbatim quotes are limited to short excerpts attributed inline. diff --git a/.kiro/specs/bff-middleware-event-loop-blocking/design.md b/.kiro/specs/bff-middleware-event-loop-blocking/design.md new file mode 100644 index 00000000..b5e99a44 --- /dev/null +++ b/.kiro/specs/bff-middleware-event-loop-blocking/design.md @@ -0,0 +1,380 @@ +# BFF Middleware Event Loop Blocking Bugfix Design + +## Overview + +The `SessionRefreshMiddleware` runs on every cookie-bearing request and, as of `v1.0.0-beta.24`, executes four independent classes of blocking/serialized work on the uvicorn event loop: + +1. **Sync boto3 I/O on the event loop thread** — `SessionRepository.*` and `CognitoRefreshClient.refresh` are declared `async def` but call boto3 synchronously. Every DynamoDB `get_item`/`update_item` and every Cognito `initiate_auth` freezes the whole event loop for its round-trip duration. +2. **Missing fan-out coalescing** — the per-session `asyncio.Lock` wraps only the refresh exchange. The upstream `unseal → cache → get_item → maybe_slide` path is not coalesced, so Angular's ~8-endpoint page-load fan-out produces ~16 serialized blocking DDB calls per cache window. +3. **Aligned cache TTL / throttle window** — `_DEFAULT_REFRESH_LEEWAY_SECONDS` and `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` both default to 60s. Cache expiry and slide-throttle expiry land on the same boundary, so a single request crossing that boundary incurs both a `get_item` and an `update_item` on its critical path. +4. **Inline awaited slide-write** — `_maybe_slide` awaits `touch_last_seen` on the request path even though the call is already written defensively (failures are swallowed). 
The caller's response waits on DDB. + +All of this runs inside a **single uvicorn worker on a single ECS task** (no `--workers` flag in `backend/Dockerfile.app-api`, `desiredCount: 1` in CDK), so any one blocked round-trip stalls every other in-flight request. + +The fix is a targeted, minimal-surface intervention that keeps the middleware's public contracts intact: + +- Offload every synchronous boto3 call in `SessionRepository` and `CognitoRefreshClient.refresh` via `asyncio.to_thread`. +- Introduce a per-session `asyncio.Future`-based single-flight in front of the `get_item → needs_refresh → maybe-refresh` path so N concurrent requests for the same `session_id` share one lookup result. +- De-align `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` from the cache/leeway window (raise to 300s) so cache-miss does not imply slide-write. +- Dispatch `_maybe_slide`'s `touch_last_seen` as a detached `asyncio.Task` and return the `Max-Age` synchronously. +- Add concurrency slack at the deployment layer (raise `CDK_APP_API_DESIRED_COUNT` to ≥ 2 for production config, keeping 1 valid for dev) so a single stuck event loop can no longer halt all ingress. + +## Glossary + +- **Bug_Condition (C)**: The condition that triggers the bug — a cookie-bearing request reaches `SessionRefreshMiddleware` while the middleware is active (`BFFConfig.is_enabled()` is True), under any of the sub-conditions 1.1–1.7 in `bugfix.md#Current Behavior`. +- **Property (P)**: The desired behavior when the bug condition holds — AWS I/O never freezes the uvicorn event loop, fan-outs share a single coalesced lookup, and slide-writes never block the response path. +- **Preservation**: Existing contracts that must remain unchanged — dormant pass-through (`is_enabled() == False`), no-cookie pass-through, unrecoverable-cookie clearing, refresh-storm coalescing, Max-Age re-emit contract, CSRF unchanged, absolute-lifetime cap, fail-closed rotation, uniform cookie decode failure. 
+- **SessionRefreshMiddleware**: The middleware in `backend/src/apis/shared/middleware/session_refresh.py` that unseals the BFF cookie, resolves the `SessionRecord`, optionally refreshes Cognito tokens, and slides the session's DDB TTL. +- **SessionRepository**: The repository in `backend/src/apis/shared/sessions_bff/repository.py` that wraps boto3 DynamoDB calls with `async def` signatures. Today the methods call boto3 synchronously on the event loop thread. +- **CognitoRefreshClient**: The class in `backend/src/apis/shared/sessions_bff/refresh.py` whose `refresh()` method is plain `def` and calls `cognito-idp:initiate_auth` synchronously. +- **SessionCache**: The process-wide `TTLCache` in `backend/src/apis/shared/sessions_bff/cache.py` whose TTL defaults to `refresh_leeway_seconds` (60s). +- **`_DEFAULT_REFRESH_LEEWAY_SECONDS`**: 60s constant in `config.py` — both the refresh pre-expiry window and the SessionCache TTL. +- **`_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS`**: 60s constant in `config.py` — the minimum interval between DDB `touch_last_seen` writes for a single session. Currently aligned with leeway, will be de-aligned to 300s. +- **per-session `asyncio.Lock`**: The lock from `get_session_lock(session_id)` in `sessions_bff/lock.py`. Today it wraps only the Cognito refresh exchange; the fix does NOT move its scope — a separate single-flight `Future` is added upstream. +- **Single-flight Future**: New per-session `asyncio.Future` added for this fix that coalesces the upstream `get_item → needs_refresh → refresh?` resolution across concurrent callers within one task. + +## Bug Details + +### Bug Condition + +The bug manifests when a request reaches `SessionRefreshMiddleware.dispatch` with `BFFConfig.is_enabled() == True` AND a `__Host-bff_session` cookie present. 
Under this condition the middleware's resolve/slide path performs at least one event-loop-blocking AWS call, and — under fan-out — performs 2×N blocking calls for N concurrent same-session requests. The observable symptoms (504s, 80s `/files/quota` tails, 15.6s p-max at 0.7% CPU) follow directly. + +**Formal Specification:** + +``` +FUNCTION isBugCondition(input) + INPUT: input of type HTTPRequest + OUTPUT: boolean + + # Middleware-level precondition — everything else is scoped inside this. + IF NOT BFFConfig.from_env().is_enabled() THEN + RETURN false + END IF + IF input.cookies["__Host-bff_session"] IS NULL THEN + RETURN false + END IF + + # Sub-condition 1.1: sync boto3 in SessionRepository blocks the loop. + blocks_on_repo := ( + awaitedIn(request, SessionRepository.get) + OR awaitedIn(request, SessionRepository.touch_last_seen) + OR awaitedIn(request, SessionRepository.update_tokens) + OR awaitedIn(request, SessionRepository.put) + OR awaitedIn(request, SessionRepository.delete) + ) + AND NOT executesInThreadpool(boto3_call_of_that_method) + + # Sub-condition 1.2: sync boto3 in CognitoRefreshClient blocks the loop, + # AND it runs while get_session_lock(session_id) is held. + blocks_on_cognito := ( + invokedIn(request, CognitoRefreshClient.refresh) + AND NOT executesInThreadpool(initiate_auth_call) + AND sessionLockHeldDuring(initiate_auth_call) + ) + + # Sub-condition 1.3: N concurrent same-session requests are not coalesced + # across the session-resolve path. + missing_resolve_coalescing := ( + concurrentRequestsForSameSession(input.session_id) > 1 + AND countOf(SessionRepository.get calls for input.session_id in this window) + = concurrentRequestsForSameSession(input.session_id) + ) + + # Sub-condition 1.4: cache-miss boundary aligns with throttle boundary. + aligned_windows := ( + BFFConfig._DEFAULT_REFRESH_LEEWAY_SECONDS + == BFFConfig._DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS + ) + + # Sub-condition 1.5: response waits on inline-awaited touch_last_seen. 
+ inline_slide := ( + slideWarrantedFor(request) + AND responseWaitsFor(touch_last_seen_call_of_this_request) + ) + + # Sub-condition 1.6: no concurrency slack at the deployment boundary. + no_slack := ( + uvicornWorkerCount() == 1 + AND ecsDesiredCount() == 1 + ) + + # Sub-condition 1.7: page-load fan-out amplifies 1.1 + 1.3 + 1.4. + amplified_fanout := ( + concurrentRequestsForSameSession(input.session_id) >= 8 + AND cacheWindowJustElapsedFor(input.session_id) + AND countOf(DDB calls on critical path during this window) + >= 2 * concurrentRequestsForSameSession(input.session_id) + ) + + RETURN blocks_on_repo + OR blocks_on_cognito + OR missing_resolve_coalescing + OR aligned_windows + OR inline_slide + OR no_slack + OR amplified_fanout +END FUNCTION +``` + +### Examples + +- **1.1 blocking repo call**: Any request that hits `request.state.bff_session = record` → `_maybe_slide` → `touch_last_seen`. Expected: the DDB round-trip runs off the event loop thread; other coroutines continue to be scheduled. Actual: the event loop is frozen for the full round-trip. +- **1.2 blocking Cognito call**: Two tabs refresh concurrently at minute 59 of the access token's lifetime. Expected: the Cognito `initiate_auth` for session A runs off the loop thread; unrelated requests (different cookies, Bearer-token requests, health checks) proceed. Actual: the loop is frozen for the full Cognito round-trip AND the per-session lock is held during that freeze. +- **1.3 missing resolve coalescing**: Angular fan-out of 8 same-session requests with no cached `SessionRecord`. Expected: 1 DDB `get_item`. Actual: 8 DDB `get_item` calls, each blocking. +- **1.4 aligned windows**: A request at T when `T - last_seen_at == 60s` AND `SessionCache` entry for this session has just TTL-evicted at T. Expected: at most 1 of `{get_item, update_item}`. Actual: both, serialized. +- **1.5 inline slide**: Request with `_maybe_slide` returning non-None. 
Expected: the response Set-Cookie lands immediately; the DDB write happens in the background. Actual: the response waits for DDB. +- **1.7 page-load fan-out**: Angular page load fires 8 endpoints at once right after a cache window elapses. Expected: ≤1 `get_item` + ≤1 `update_item` across the 8 requests. Actual: up to 16 serialized blocking calls at the front of the page load. +- **Edge case — `is_enabled() == False`**: The middleware must short-circuit before any of the above sub-conditions can manifest. No AWS calls, no locks, no futures. + +## Expected Behavior + +### Preservation Requirements + +**Unchanged Behaviors:** + +- **3.1 Dormant pass-through**: `BFFConfig.is_enabled() == False` → `dispatch` short-circuits to `call_next(request)` with no AWS calls, no cache lookup, no single-flight registration. +- **3.2 No-cookie pass-through**: No `__Host-bff_session` cookie → same short-circuit as 3.1. +- **3.3 Unrecoverable cookie → clear both cookies**: Bad seal, missing DDB row, expired TTL, or terminal `CognitoRefreshError` → `_clear_cookies(response)` clears both `__Host-bff_session` AND `__Host-bff_csrf` with the same attribute set as today. +- **3.4 Max-Age re-emit contract**: When `_maybe_slide` returns a non-None `Max-Age`, the `Set-Cookie` headers for both BFF cookies use that exact value and the exact attribute set in `_reemit_cookies` today. Fire-and-forget dispatch of the DDB write does not change this contract. +- **3.5 Refresh-storm coalescing (existing)**: For N concurrent same-session requests crossing the refresh-leeway boundary, exactly one `cognito-idp:initiate_auth` is issued per `session_id` per leeway window. The existing `get_session_lock(session_id)` scope around the Cognito exchange is preserved end-to-end. +- **3.6 Codec singleton**: `get_default_codec()` is the same process-wide instance used by the auth/callback seal path and the middleware unseal path. No per-request `kms:GenerateDataKey` is introduced. 
+- **3.7 Client-secret cache**: `resolve_bff_client_secret` continues to serve from the module-scope cache. No per-request `secretsmanager:GetSecretValue`. +- **3.8 CSRF middleware path**: `CSRFMiddleware` continues to validate unsafe-method requests using the existing in-memory HMAC double-submit check against `request.state.bff_csrf_token`. No new I/O is introduced on that path. +- **3.9 Absolute-lifetime cap**: `_maybe_slide` returns `None` once `created_at + absolute_lifetime_seconds` has passed. No further cookie re-emit or DDB slide. +- **3.10 Fail-closed rotation**: When Cognito rotates the refresh token and `_persist_refresh` exhausts its retries, the middleware invalidates the cache and clears the cookie. +- **3.11 Uniform cookie decode failure**: Every `CookieDecodeError` branch produces the same response shape and timing signature. No new oracle is introduced by the offload or single-flight paths. + +**Scope:** + +All inputs that do NOT involve the BFF middleware path should be completely unaffected by this fix. This includes: + +- Bearer-token requests (no `__Host-bff_session` cookie) — untouched. +- Anonymous endpoints (health, static assets) — untouched. +- WebSocket voice routes — they replicate the cookie unseal + DDB lookup outside the middleware (see `voice/routes.py`); this fix does not change their path. +- The auth/callback token-exchange route — it uses the same `CookieCodec` singleton to seal cookies; the singleton is not disturbed. +- The logout route — its cache `invalidate(session_id)` call is preserved. + +## Hypothesized Root Cause + +Based on the bug description and code inspection, the root causes are concurrent and independent — each sub-condition has its own root cause, and the fix addresses all of them: + +1. 
**Sync boto3 in `async def` methods (1.1, 1.2)**: The `SessionRepository` docstring explicitly acknowledges this ("The methods are declared `async` to match the rest of `apis.shared`, but boto3 is sync — calls run on the event loop thread"). The original reasoning was that refresh-storm coalescing via `get_session_lock()` would hold fan-out low enough to make thread-pool offload unnecessary. That reasoning is wrong for two reasons: (a) the lock only covers the Cognito exchange, not the DDB path — so fan-out is not coalesced at all for cache misses; and (b) even a single blocking call is enough to freeze the event loop for the round-trip duration, which is directly observable in `TargetResponseTime` p-max. + +2. **Wrong lock scope (1.3, 1.7)**: `get_session_lock(session_id)` is acquired inside `_resolve_session` only after the `_cache.get → _repository.get → needs_refresh` decision has been made. An `asyncio.Lock` held this narrowly cannot coalesce anything upstream of itself. The fix needs a different primitive — an `asyncio.Future` stored in a per-session slot that N waiters can await — because a lock would serialize N requests through one DDB call instead of consolidating them to one call. + +3. **Aligned windows by default (1.4)**: Both constants default to 60s in `config.py`. A strict-multiple relationship (e.g. throttle = 5 × leeway) de-aligns the boundaries. This is a config fix with no code change needed in the middleware. + +4. **`await` on `touch_last_seen` by pattern (1.5)**: `_maybe_slide` awaits the write because that matches the rest of the codebase's DB access shape. The surrounding `try/except` already swallows failures (documented as "Don't fail the request if the slide-write fails"), which is exactly the pre-condition that makes fire-and-forget safe. + +5. **Single-worker container (1.6)**: The Dockerfile CMD ships one uvicorn worker and `desiredCount: 1` in CDK ships one task. 
This was fine for the Bearer-token era; under the BFF middleware, it means any one blocked round-trip halts every other in-flight request. Concurrency slack is a separate lever from event-loop non-blocking — both are required, neither is sufficient alone. + +## Correctness Properties + +Property 1: Bug Condition — Event-Loop Non-Blocking, Coalesced, Window-Staggered, Fire-and-Forget BFF Middleware + +_For any_ request where the bug condition holds (`isBugCondition` returns true), the fixed middleware and its collaborators SHALL (a) execute every boto3 DynamoDB and Cognito call off the event loop thread (via `asyncio.to_thread` or equivalent), (b) coalesce N concurrent same-`session_id` requests crossing a cold cache window to at most one DynamoDB `get_item` via a per-session `asyncio.Future`, (c) hold the `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` default to a strict multiple of `_DEFAULT_REFRESH_LEEWAY_SECONDS` (300s vs 60s) so cache-expiry and throttle-expiry do not align, (d) dispatch `_maybe_slide`'s `touch_last_seen` as a detached `asyncio.Task` and return the `Max-Age` to the response path synchronously, and (e) run with concurrency slack such that `desiredCount >= 2` in production configuration. The observable result SHALL be that Angular's ~8-endpoint page-load fan-out issues at most 1 `get_item` and at most 1 `update_item` per `session_id` per cache window (not ~16), and no single AWS call serializes unrelated requests. 
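Clause (a) of Property 1 can be illustrated in isolation. The sketch below is a demonstration under stated assumptions, not the repository code: `slow_get_item` is a hypothetical stand-in for a synchronous boto3 call (simulated with `time.sleep`), and `marker_finishes_first` is an illustrative helper that checks whether an unrelated coroutine can make progress while the call is in flight.

```python
import asyncio
import time


def slow_get_item() -> dict:
    # Hypothetical stand-in for a sync boto3 round-trip (no AWS involved).
    time.sleep(0.2)
    return {"Item": {"session_id": "abc"}}


async def blocking_get() -> dict:
    # Unfixed shape: the sync call runs on the event loop thread.
    return slow_get_item()


async def offloaded_get() -> dict:
    # Fixed shape: the sync call runs in a worker thread.
    return await asyncio.to_thread(slow_get_item)


async def marker_finishes_first(get_coro_factory) -> bool:
    """Return True if a short marker coroutine completes before the get."""
    order = []

    async def marker() -> None:
        await asyncio.sleep(0.01)
        order.append("marker")

    async def get() -> None:
        await get_coro_factory()
        order.append("get")

    # The get is scheduled first, as it would be on the request path.
    await asyncio.gather(get(), marker())
    return order[0] == "marker"
```

With `blocking_get` the marker cannot run until the simulated round-trip returns; with `offloaded_get` the marker completes first because the loop stays free while the worker thread sleeps.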
+ +**Validates: Requirements 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7** + +Property 2: Preservation — BFF Middleware Contracts Unchanged for Non-Buggy Inputs + +_For any_ request where the bug condition does NOT hold (`isBugCondition` returns false), the fixed middleware SHALL produce the same externally observable result as the original middleware, preserving: dormant pass-through (`is_enabled() == False`), no-cookie pass-through, unrecoverable-cookie clearing of both `__Host-bff_session` and `__Host-bff_csrf` with the same attribute set, the `Max-Age` re-emit contract between `_maybe_slide` and `_reemit_cookies`, exactly-one Cognito `initiate_auth` per `session_id` per leeway window, the `CookieCodec` and client-secret process-wide singletons, the `CSRFMiddleware` in-memory HMAC double-submit check, the absolute-lifetime cap behavior, fail-closed refresh-token rotation, and uniform `CookieDecodeError` handling. + +**Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11** + +## Fix Implementation + +### Changes Required + +Assuming the root cause analysis above is correct, the fix spans four code locations and one infrastructure config. + +**File**: `backend/src/apis/shared/sessions_bff/repository.py` + +**Function**: `SessionRepository.get`, `touch_last_seen`, `update_tokens`, `put`, `delete` + +**Specific Changes**: + +1. **Threadpool offload for every boto3 call**: Extract each method's boto3 invocation into a nested sync helper and invoke it via `await asyncio.to_thread(helper, ...)`. Example for `get`: + ``` + async def get(self, session_id): + if not self._enabled: + return None + def _call(): + return self._table.get_item(Key=self._key(session_id)) + try: + response = await asyncio.to_thread(_call) + except ClientError as exc: + ... + ``` + The method signatures, return types, and exception handling stay identical. The post-decode TTL defense-in-depth check and `_item_to_record` translation stay on the calling coroutine. + +2. 
**No change to public API**: Every callsite in the middleware (`self._repository.get`, `self._repository.touch_last_seen`, `self._repository.update_tokens`) remains an `await`. The offload is purely internal. + +**File**: `backend/src/apis/shared/sessions_bff/refresh.py` + +**Function**: `CognitoRefreshClient.refresh` + +**Specific Changes**: + +3. **Add async wrapper that offloads to a threadpool**: Either rename `refresh` to `_refresh_sync` and add a new `async def refresh(...)` that calls `await asyncio.to_thread(self._refresh_sync, username=..., refresh_token=...)`, or convert `refresh` to `async def` in-place with the same offload. The middleware callsite (`self._refresh_client.refresh(...)`) becomes `await self._refresh_client.refresh(...)`. The Cognito SDK call and the `CognitoRefreshError` contract are unchanged. + +**File**: `backend/src/apis/shared/middleware/session_refresh.py` + +**Function**: `SessionRefreshMiddleware._resolve_session`, `_maybe_slide`, `dispatch` + +**Specific Changes**: + +4. **Add per-session single-flight for the session-resolve path**: Introduce a new module-level `dict[str, asyncio.Future[tuple[Optional[SessionRecord], bool]]]` guarded by a thread lock in a new small module `backend/src/apis/shared/sessions_bff/single_flight.py` (mirroring `lock.py`'s shape), with an API: + ``` + async def resolve_once(session_id, loader_coro_factory) -> tuple[Optional[SessionRecord], bool] + ``` + The leader creates an `asyncio.Future`, registers it, runs the loader, sets the result/exception, and removes the entry. Followers `await` the existing Future. In `_resolve_session`, wrap the `_cache.get → _repository.get → needs_refresh → (maybe refresh)` block (from cache lookup through return) inside this single-flight, keyed by `session_id`. The existing `get_session_lock(session_id)` scope around the Cognito refresh exchange is **not** moved or widened — it stays exactly where it is today. + +5. 
**Fire-and-forget slide-write in `_maybe_slide`**: Replace `await self._repository.touch_last_seen(...)` with a detached task. The function still computes `new_max_age` and returns it synchronously. The DDB write happens in the background; the existing `try/except` that was already documented to swallow failures moves into a `_slide_write_task(...)` helper that logs on failure. Update the local cache (`record.last_seen_at = now`, `record.ttl = new_ttl`, `self._cache.set(record)`) before scheduling the task, so subsequent same-request reads see the slid state. + +6. **No change to `dispatch` structure or the cookie-clear / cookie-reemit branches**: Keep `clear_cookie` and `renewal_max_age` handling identical. + +**File**: `backend/src/apis/shared/sessions_bff/config.py` + +**Constant**: `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` + +**Specific Changes**: + +7. **Raise default from 60s to 300s**: Change the constant from `60` to `60 * 5` (or explicitly `300`) so the slide-throttle window is a strict multiple of the cache TTL (tied to `_DEFAULT_REFRESH_LEEWAY_SECONDS = 60`) rather than equal to it. The env var `BFF_SESSION_SLIDING_RENEWAL_THROTTLE_SECONDS` continues to override. + +**File**: `infrastructure/cdk.context.json` (and test fixtures under `infrastructure/test/`) + +**Key**: `appApi.desiredCount` + +**Specific Changes**: + +8. **Raise production `desiredCount` to 2**: Keep `maxCapacity` as-is (4). Update only the production/non-test context: test fixtures can stay at 1 if needed to keep CDK unit tests fast, but the top-level production context value must flip to 2. This is a **deployment-time** behavior change and the last item in the fix plan; it does not become necessary until the other changes ship. 
+ +**No changes required** in: `backend/src/apis/shared/sessions_bff/cache.py`, `backend/src/apis/shared/sessions_bff/cookie.py`, `backend/src/apis/shared/sessions_bff/lock.py`, `backend/src/apis/shared/sessions_bff/csrf.py`, `backend/src/apis/shared/middleware/csrf.py`, `backend/src/apis/app_api/auth/bff/*`, or the uvicorn `CMD` in `backend/Dockerfile.app-api` (the ECS `desiredCount` bump is the chosen vector for concurrency slack in 2.6 — a `--workers N` flag would require reworking the in-process singletons in `cache.py` and `refresh.py`, which is out of scope). + +## Testing Strategy + +### Validation Approach + +The testing strategy follows a two-phase approach: first, surface counterexamples that demonstrate the bug on unfixed code, then verify the fix works correctly and preserves existing behavior. Because four of the sub-conditions are independent, we run the exploratory phase against each one. + +### Exploratory Bug Condition Checking + +**Goal**: Surface counterexamples that demonstrate the bug BEFORE implementing the fix. Confirm or refute the root-cause analysis for each sub-condition. If any is refuted, we re-hypothesize. + +**Test Plan**: Write tests that inject a slow/instrumented boto3 stub (for DDB and Cognito) and drive the middleware directly under `pytest-asyncio`. For each sub-condition, assert the blocking/serialization behavior is present on unfixed code. Run on UNFIXED code first; the assertions SHALL fail against fixed code later. + +**Test Cases**: + +1. **Event loop blocked by `SessionRepository.get`** (validates 1.1): Stub the boto3 `table.get_item` with a 500ms `time.sleep`. Submit a `SessionRepository.get` call and a concurrent `asyncio.sleep(0.05)` marker coroutine on the same loop. Assert the marker resolves strictly after the `get` (will hold on unfixed code, will fail on fixed code where the marker completes long before `get` returns). + +2. 
**Event loop blocked by `CognitoRefreshClient.refresh`** (validates 1.2): Same shape as (1) but against a stubbed `cognito-idp:initiate_auth`. Additionally assert that `get_session_lock(other_session_id)` can be acquired concurrently (will fail on unfixed code because the sync Cognito call has frozen the whole loop thread). + +3. **N fan-out → N `get_item` calls** (validates 1.3): Spin up 8 concurrent `dispatch` calls with the same cookie and a cold `SessionCache`. Count `table.get_item` invocations on the stub. Assert count == 8 on unfixed code; the fix target is 1. + +4. **Aligned windows → both writes on one request** (validates 1.4): Set clock to a moment where the cache TTL just elapsed AND `now - last_seen_at == 60s`. Drive a single request. Assert both `get_item` AND `update_item` are called on unfixed code; on fixed code with the new 300s throttle default, only `get_item` is called. + +5. **Response waits on `touch_last_seen`** (validates 1.5): Stub `table.update_item` with a 500ms delay. Measure time from `dispatch` entry to `call_next(request)` return. On unfixed code, response time ≥ 500ms; on fixed code, response time is independent of the DDB write latency. + +6. **Single-worker container / `desiredCount: 1`** (validates 1.6): This is a deployment-level property, not a middleware-level one. Verified by reading `infrastructure/cdk.context.json` and the Dockerfile `CMD`. No runtime test; CDK unit test asserts `DesiredCount: 2` on the production context. + +7. **Page-load fan-out amplification** (validates 1.7): Combine (3) + (4) — 8 concurrent requests at a boundary moment. Count blocking DDB calls. Assert ≥ 16 on unfixed code, ≤ 2 on fixed code. + +**Expected Counterexamples**: + +- Blocked-loop markers do not complete until the stubbed AWS call returns. +- `table.get_item` call count on the stub matches the fan-out, not 1. +- `Set-Cookie` response latency tracks `table.update_item` latency. 
+- Possible causes confirmed: sync boto3 on event loop, narrow lock scope, aligned constants, inline-awaited slide write. + +### Fix Checking + +**Goal**: Verify that for all inputs where the bug condition holds, the fixed middleware produces the expected behavior defined by Property 1. + +**Pseudocode:** + +``` +FOR ALL input WHERE isBugCondition(input) DO + # (a) event loop non-blocking + marker_latency := measureConcurrentMarker(dispatch(input)) + ASSERT marker_latency << AWS_call_latency + + # (b) fan-out coalescing + ddb_get_calls := countGetItemCalls(during_dispatch(input_fanout_n=8)) + ASSERT ddb_get_calls <= 1 + + # (c) window staggering + ASSERT config.slidingRenewalThrottleSeconds + % config.refreshLeewaySeconds == 0 + ASSERT config.slidingRenewalThrottleSeconds + > config.refreshLeewaySeconds + + # (d) fire-and-forget slide + response_latency := measureDispatchTime(input_with_slide) + ASSERT response_latency independent_of touch_last_seen_latency + + # (e) concurrency slack (deployment assertion) + ASSERT cdkContextAppApiDesiredCount >= 2 +END FOR +``` + +### Preservation Checking + +**Goal**: Verify that for all inputs where the bug condition does NOT hold, the fixed middleware produces the same externally observable result as the original middleware. 
+ +**Pseudocode:** + +``` +FOR ALL input WHERE NOT isBugCondition(input) DO + ASSERT dispatch_original(input).response == dispatch_fixed(input).response + ASSERT dispatch_original(input).set_cookie_headers + == dispatch_fixed(input).set_cookie_headers + ASSERT dispatch_original(input).request_state_bff_session + == dispatch_fixed(input).request_state_bff_session + ASSERT dispatch_original(input).cleared_cookies + == dispatch_fixed(input).cleared_cookies + ASSERT countOf(cognito.initiate_auth across N same-session concurrent requests) + == 1 per leeway window +END FOR +``` + +**Testing Approach**: Property-based testing is recommended for preservation checking because: + +- It generates many request shapes across the input domain (cookie present/absent, cookie seal valid/invalid, cache hit/miss, needs_refresh yes/no, rotation yes/no, slide warranted yes/no, absolute cap passed yes/no, `is_enabled()` true/false) and asserts equivalence against a mocked `SessionRepository` + `CognitoRefreshClient`. +- It catches edge cases in the single-flight and fire-and-forget paths that manual unit tests might miss (e.g. an exception inside the single-flight leader; a background slide task racing with the next request). +- It provides strong guarantees that the observable middleware contract is unchanged for the entire `¬C` input domain. + +**Test Plan**: First, exercise the unfixed middleware with an expressive `Hypothesis` strategy over request shapes and record observable outputs (response status, `Set-Cookie` headers, `request.state.bff_session`, DDB/Cognito call counts). Then, swap in the fixed middleware and assert equivalence on the same inputs. The strategy must skip any input that satisfies `isBugCondition` — only `¬C` inputs enter the preservation assertion. + +**Test Cases**: + +1. **Dormant pass-through unchanged** (3.1): With `is_enabled() == False`, every request shape produces identical responses under fixed and unfixed middleware with zero AWS calls. +2. 
**No-cookie pass-through unchanged** (3.2): Request with no `__Host-bff_session` header, for any method/path, produces identical responses with zero AWS calls. +3. **Unrecoverable cookie clears both cookies** (3.3): Bad-seal, missing-row, expired-row, and terminal-refresh-error inputs produce the same `Set-Cookie` headers with `Max-Age=0` for both `__Host-bff_session` and `__Host-bff_csrf`, same attribute set. +4. **Max-Age re-emit contract** (3.4): For inputs where `_maybe_slide` returns a non-None value, the resulting `Set-Cookie` headers match the original exactly (including attribute set). Fire-and-forget dispatch does not delay or drop the re-emit. +5. **Refresh-storm coalescing preserved** (3.5): For 10 concurrent same-session requests crossing the refresh-leeway window, exactly one `initiate_auth` call is observed on the Cognito stub. +6. **Codec / secret singletons preserved** (3.6, 3.7): Across many requests, `get_default_codec()` returns the same instance, and `resolve_bff_client_secret()` hits Secrets Manager exactly once per process. +7. **CSRF path unchanged** (3.8): Requests that trigger `CSRFMiddleware` produce identical accept/reject decisions with no new I/O. +8. **Absolute lifetime cap preserved** (3.9): Inputs with `created_at + absolute_lifetime_seconds < now` produce `_maybe_slide → None`, no slide write scheduled. +9. **Fail-closed rotation preserved** (3.10): With rotation triggered and `_persist_refresh` forced to exhaust retries, the cache is invalidated and both cookies are cleared. +10. **Cookie decode uniformity** (3.11): All `CookieDecodeError` branches produce identical response shapes and timing profiles on the fixed middleware (no new oracle via single-flight or fire-and-forget). + +### Unit Tests + +- **Repository offload**: Assert each `SessionRepository.*` method calls `asyncio.to_thread` (monkeypatched) exactly once per call and that the wrapped boto3 call receives the expected arguments. 
Assert `ClientError` propagation still matches today's behavior. +- **Cognito offload**: Assert `CognitoRefreshClient.refresh` is awaitable, offloads to a threadpool, preserves `CognitoRefreshError`, and returns the same `RefreshResult` shape. +- **Single-flight**: Two concurrent `resolve_once(session_id, factory)` calls share one loader invocation; the entry is removed after completion; an exception in the loader propagates to all waiters; distinct `session_id`s do not share. +- **Fire-and-forget slide**: `_maybe_slide` returns Max-Age before `touch_last_seen` completes; the background task writes to DDB; failure inside the task logs and does not bubble to `dispatch`; the local cache is updated synchronously before the task is scheduled. +- **Config constant**: `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS == 300`; strict multiple of `_DEFAULT_REFRESH_LEEWAY_SECONDS`. + +### Property-Based Tests + +- **Preservation over `¬C` input domain**: As described in Preservation Checking — generate request shapes, assert fixed ≡ original on response, cookies, `request.state`, and AWS call counts. +- **Fan-out coalescing invariant**: For any N ∈ [2, 32] and any cookie-bearing same-session fan-out, the number of DDB `get_item` calls observed on the stub is ≤ 1 per cache window. Randomize cache warm/cold state, `needs_refresh` outcomes, and concurrent-request arrival ordering. +- **Window-staggering invariant**: For any request timing `t` within one leeway window of a cache TTL boundary, the fixed middleware issues at most one of `{get_item, update_item}` on the critical path — never both. + +### Integration Tests + +- **End-to-end page-load fan-out**: Drive the app-api container (under `moto` for DDB, a stubbed Cognito client) with a simulated 8-endpoint Angular page load. Measure total wall-clock time and count of DDB/Cognito calls. 
Assert ≤ 1 `get_item` and ≤ 1 `update_item` across the fan-out, and total latency bounded by the slowest individual handler (not by serialized AWS I/O). +- **Concurrency slack at the deployment boundary**: CDK unit test asserts `DesiredCount: 2` for the production `app-api` service. Integration smoke test asserts that a deliberately slow endpoint (e.g., a route that sleeps 5s) does not stall a concurrent fast endpoint on a parallel request. +- **Refresh-storm under fan-out**: 8 concurrent requests across the refresh-leeway boundary on the same session. Assert exactly 1 Cognito `initiate_auth`, all 8 responses succeed, and `request.state.bff_session` carries the freshly rotated tokens. diff --git a/.kiro/specs/bff-middleware-event-loop-blocking/tasks.md b/.kiro/specs/bff-middleware-event-loop-blocking/tasks.md new file mode 100644 index 00000000..b6e58c64 --- /dev/null +++ b/.kiro/specs/bff-middleware-event-loop-blocking/tasks.md @@ -0,0 +1,169 @@ +# Implementation Plan + +- [x] 1. Write bug condition exploration test + - **Property 1: Bug Condition** - Event-Loop Blocking, Missing Coalescing, Aligned Windows, Inline Slide-Write + - **CRITICAL**: This test MUST FAIL on unfixed code - failure confirms the bug exists + - **DO NOT attempt to fix the test or the code when it fails** + - **NOTE**: This test encodes the expected behavior - it will validate the fix when it passes after implementation + - **GOAL**: Surface counterexamples that demonstrate each sub-condition of the bug in `SessionRefreshMiddleware` + - **Scoped PBT Approach**: Scope the property to concrete failing cases that deterministically reproduce each sub-condition under `pytest-asyncio` + - Test location: `backend/tests/apis/shared/middleware/test_session_refresh_bug_condition.py` + - Use `hypothesis` + `pytest-asyncio`; inject slow/instrumented boto3 stubs for DynamoDB (`table.get_item`, `table.update_item`) and Cognito (`initiate_auth`) via monkeypatching on `SessionRepository._table` and 
`CognitoRefreshClient` + - Bug Condition (from design `isBugCondition`): `BFFConfig.is_enabled() == True` AND `__Host-bff_session` cookie present AND any of sub-conditions 1.1 through 1.7 hold + - Expected Behavior assertions (from design Property 1 / Expected Behavior 2.1–2.7) that must hold for all inputs satisfying the bug condition: + - **(1.1) Repository offload**: Stub `table.get_item`/`update_item`/`put_item`/`delete_item` with a 500ms `time.sleep`. Run `SessionRepository.get(session_id)` concurrently with an `asyncio.sleep(0.05)` marker coroutine. ASSERT the marker completes strictly BEFORE the repository call returns (loop is not blocked). Repeat for `touch_last_seen`, `update_tokens`, `put`, `delete`. + - **(1.2) Cognito offload**: Stub Cognito `initiate_auth` with a 500ms `time.sleep`. Run `CognitoRefreshClient.refresh(...)` concurrently with a marker coroutine AND a concurrent `get_session_lock(other_session_id)` acquisition. ASSERT the marker and unrelated lock acquisition complete while `refresh` is in flight. + - **(1.3) Resolve-path coalescing**: Drive 8 concurrent `SessionRefreshMiddleware.dispatch` calls for the same `session_id` with cold `SessionCache` and a valid sealed cookie. Count `table.get_item` invocations on the stub. ASSERT count == 1 (bug: count == 8). + - **(1.4) Window de-alignment**: ASSERT `BFFConfig._DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS % BFFConfig._DEFAULT_REFRESH_LEEWAY_SECONDS == 0` AND `BFFConfig._DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS > BFFConfig._DEFAULT_REFRESH_LEEWAY_SECONDS`. Drive a single request with `SessionCache` TTL just elapsed AND `now - last_seen_at == 60s`. ASSERT at most one of `{get_item, update_item}` is observed on the critical path. + - **(1.5) Fire-and-forget slide**: Stub `table.update_item` with a 500ms delay. Drive a `dispatch` call where a slide is warranted. Measure elapsed time from `dispatch` entry to `call_next(request)` returning. 
ASSERT elapsed time < 250ms (bug: elapsed time ≥ 500ms because the response waits on the DDB write). + - **(1.6) Concurrency slack at deployment**: Read `infrastructure/cdk.context.json` and assert `appApi.desiredCount >= 2` for the production context. + - **(1.7) Fan-out amplification**: Drive 8 concurrent `dispatch` calls on the same session at a cache-boundary moment. Count blocking DDB calls across the fan-out. ASSERT count ≤ 2 (bug: count ≥ 16). + - Run all property cases on UNFIXED code + - **EXPECTED OUTCOME**: Test FAILS (this is correct - it proves the bug exists). Document the counterexamples in the test output: marker coroutines starved, 8 `get_item` calls per fan-out, both `get_item` and `update_item` on single request, response latency tracking `update_item` latency + - Mark task complete when test is written, run, and failures are documented + - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7_ + +- [x] 2. Write preservation property tests (BEFORE implementing fix) + - **Property 2: Preservation** - BFF Middleware Contracts Unchanged for Non-Buggy Inputs + - **IMPORTANT**: Follow observation-first methodology + - Test location: `backend/tests/apis/shared/middleware/test_session_refresh_preservation.py` + - Use `hypothesis` to generate request shapes across the `¬C` input domain; skip any input for which `isBugCondition` returns true + - Strategy must cover all axes that exist today: `is_enabled()` true/false, `__Host-bff_session` cookie present/absent, cookie seal valid/invalid/expired, `SessionCache` hit/miss, `needs_refresh` yes/no, refresh-token rotation yes/no, slide warranted yes/no, absolute-lifetime cap passed yes/no, request method safe/unsafe (for CSRF interaction) + - **Observe behavior on UNFIXED code** for each non-buggy input and record: response status, `Set-Cookie` headers for `__Host-bff_session` and `__Host-bff_csrf` (including every attribute), `request.state.bff_session`, `request.state.bff_csrf_token`, DDB call counts, Cognito call 
counts, KMS/Secrets Manager call counts + - Write property-based tests capturing these observed behaviors as preservation invariants (from Preservation Requirements 3.1–3.11): + - **(3.1) Dormant pass-through**: for all requests, when `is_enabled() == False`, response == `call_next(request)` AND zero AWS calls + - **(3.2) No-cookie pass-through**: for all requests with no `__Host-bff_session` header, response == `call_next(request)` AND zero AWS calls + - **(3.3) Unrecoverable cookie clears both cookies**: for bad-seal / missing-row / expired-row / terminal-`CognitoRefreshError` inputs, `Set-Cookie` for both `__Host-bff_session` AND `__Host-bff_csrf` has `Max-Age=0` AND identical attribute set + - **(3.4) Max-Age re-emit contract**: when `_maybe_slide` returns non-None, the resulting `Set-Cookie` headers for both BFF cookies use that exact `Max-Age` and the exact attribute set from `_reemit_cookies` today + - **(3.5) Refresh-storm coalescing**: for 10 concurrent same-session requests crossing the refresh-leeway window, exactly one `cognito-idp:initiate_auth` call is observed + - **(3.6) Codec singleton**: across many requests, `get_default_codec()` returns the same instance identity; zero per-request `kms:GenerateDataKey` calls + - **(3.7) Client-secret cache**: across many requests, `resolve_bff_client_secret()` hits Secrets Manager exactly once per process + - **(3.8) CSRF path unchanged**: `CSRFMiddleware` accept/reject decision on unsafe-method requests is identical to unfixed; zero new I/O on the CSRF path + - **(3.9) Absolute-lifetime cap**: when `now > created_at + absolute_lifetime_seconds`, `_maybe_slide` returns `None`; no slide scheduled + - **(3.10) Fail-closed rotation**: when rotation triggers AND `_persist_refresh` exhausts retries, cache is invalidated AND both cookies are cleared + - **(3.11) Cookie decode uniformity**: every `CookieDecodeError` branch produces identical response shape and timing profile (no new oracle) + - Run tests on UNFIXED 
code + - **EXPECTED OUTCOME**: Tests PASS (this confirms baseline behavior to preserve) + - Mark task complete when tests are written, run, and passing on unfixed code + - _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11_ + +- [x] 3. Fix for BFF middleware event-loop blocking and fan-out amplification + + - [x] 3.1 Offload `SessionRepository` boto3 calls via `asyncio.to_thread` + - Edit `backend/src/apis/shared/sessions_bff/repository.py` + - For each of `get`, `touch_last_seen`, `update_tokens`, `put`, `delete`: extract the boto3 invocation into a nested sync helper and invoke it via `await asyncio.to_thread(helper, ...)` + - Keep method signatures, return types, and exception-handling branches identical + - Keep the post-decode TTL defense-in-depth check and `_item_to_record` translation on the calling coroutine + - Do NOT change the public API — every callsite in the middleware remains an `await` + - _Bug_Condition: isBugCondition(input) where sub-condition 1.1 holds (sync boto3 on event loop)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (a) — every `SessionRepository` boto3 call executes off the event loop thread_ + - _Preservation: 3.3, 3.10 — exception branches and fail-closed rotation unchanged_ + - _Requirements: 2.1_ + + - [x] 3.2 Offload `CognitoRefreshClient.refresh` via `asyncio.to_thread` + - Edit `backend/src/apis/shared/sessions_bff/refresh.py` + - Rename existing `refresh` to `_refresh_sync` (or equivalent private sync form) and add a new `async def refresh(...)` that calls `await asyncio.to_thread(self._refresh_sync, username=..., refresh_token=...)` + - Update the callsite in `SessionRefreshMiddleware._resolve_session` to `await self._refresh_client.refresh(...)` + - Preserve the `CognitoRefreshError` contract and `RefreshResult` return shape exactly + - Do NOT move or widen the `get_session_lock(session_id)` scope around the refresh exchange + - _Bug_Condition: isBugCondition(input) where 
sub-condition 1.2 holds (sync Cognito on event loop while session lock held)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (a) — Cognito `initiate_auth` executes off the event loop thread, other sessions' locks are acquirable_ + - _Preservation: 3.5 — refresh-storm coalescing preserved_ + - _Requirements: 2.2_ + + - [x] 3.3 Add per-session single-flight primitive module + - Create `backend/src/apis/shared/sessions_bff/single_flight.py` + - Export `async def resolve_once(session_id: str, loader_coro_factory: Callable[[], Awaitable[tuple[Optional[SessionRecord], bool]]]) -> tuple[Optional[SessionRecord], bool]` + - Internal state: module-level `dict[str, asyncio.Future[tuple[Optional[SessionRecord], bool]]]` guarded by a `threading.Lock` (mirroring the shape of `sessions_bff/lock.py`) + - Leader semantics: first caller for a given `session_id` creates an `asyncio.Future`, registers it under the session lock, runs the loader, sets the result or exception, removes the entry, and returns + - Follower semantics: any caller that finds an existing Future `await`s it and returns its value + - Exception propagation: an exception from the loader MUST propagate to all current waiters, and the registry entry MUST be removed so subsequent calls start a new leader + - Distinct `session_id`s MUST NOT share a Future + - Include unit tests alongside: two concurrent `resolve_once` calls share one loader invocation; exception propagation to all waiters; distinct sessions are independent + - _Bug_Condition: isBugCondition(input) where sub-condition 1.3 holds (N concurrent same-session resolves issue N `get_item` calls)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (b) — at most one DynamoDB `get_item` per `session_id` per cache window_ + - _Preservation: 3.5 — the existing `get_session_lock` scope around the Cognito exchange is unchanged (this is a separate primitive upstream)_ + - _Requirements: 2.3_ + + - [x] 3.4 Wire 
single-flight into `SessionRefreshMiddleware._resolve_session` + - Edit `backend/src/apis/shared/middleware/session_refresh.py` + - Wrap the `_cache.get → _repository.get → needs_refresh → (maybe refresh)` block in `_resolve_session` inside `resolve_once(session_id, loader_coro_factory)` where the loader factory builds the coroutine that performs today's cache/repo/refresh sequence and returns `(Optional[SessionRecord], clear_cookie: bool)` + - Ensure the existing `get_session_lock(session_id)` scope around the Cognito refresh exchange remains exactly where it is today — do NOT move or widen it + - Ensure the bad-seal / missing-row / expired-row / terminal-refresh-error paths still produce the same `clear_cookie=True` return and the same exception propagation to `dispatch` as today + - _Bug_Condition: isBugCondition(input) where sub-conditions 1.3 and 1.7 hold (fan-out amplification)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (b)_ + - _Preservation: 3.3, 3.5, 3.11 — unrecoverable cookie clearing, refresh-storm coalescing, uniform decode failure preserved_ + - _Requirements: 2.3, 2.7_ + + - [x] 3.5 Convert `_maybe_slide` to fire-and-forget DDB write + - Edit `backend/src/apis/shared/middleware/session_refresh.py` + - In `_maybe_slide`, update the local cache synchronously (`record.last_seen_at = now`, `record.ttl = new_ttl`, `self._cache.set(record)`) BEFORE scheduling the background task + - Replace `await self._repository.touch_last_seen(...)` with `asyncio.create_task(self._slide_write_task(...))` + - Introduce a private `async def _slide_write_task(self, record, ...)` helper that performs `await self._repository.touch_last_seen(...)` inside a `try/except` that logs on failure (preserving today's "swallow failures" semantics) + - Return the computed `new_max_age` synchronously from `_maybe_slide` + - Do NOT change `dispatch` structure or the cookie-clear / cookie-reemit branches + - Do NOT change the absolute-lifetime cap path 
— it must still return `None` + - _Bug_Condition: isBugCondition(input) where sub-condition 1.5 holds (response waits on inline slide-write)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (d) — response latency independent of `touch_last_seen` latency_ + - _Preservation: 3.4, 3.9 — Max-Age re-emit contract and absolute-lifetime cap preserved_ + - _Requirements: 2.5_ + + - [x] 3.6 De-align cache/leeway and throttle windows in config + - Edit `backend/src/apis/shared/sessions_bff/config.py` + - Change `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` from `60` to `60 * 5` (or explicit `300`) + - Verify `_DEFAULT_REFRESH_LEEWAY_SECONDS` remains `60` + - Confirm the strict-multiple relationship: `300 % 60 == 0` AND `300 > 60` + - Ensure the `BFF_SESSION_SLIDING_RENEWAL_THROTTLE_SECONDS` env var still overrides the default + - _Bug_Condition: isBugCondition(input) where sub-condition 1.4 holds (aligned windows force both writes on one request)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (c) — cache-miss does not imply slide-write_ + - _Preservation: none impacted (pure default-value change; overrides preserved)_ + - _Requirements: 2.4_ + + - [x] 3.7 Raise production `appApi.desiredCount` to 2 + - Edit `infrastructure/cdk.context.json` to set `appApi.desiredCount` to `2` in the production/non-test context + - Keep `appApi.maxCapacity` unchanged (4) + - Test fixtures under `infrastructure/test/` may stay at `1` if needed for CDK unit-test speed; only the top-level production context value must change + - Update or add CDK unit tests to assert `DesiredCount: 2` on the production `app-api` service synthesis + - _Bug_Condition: isBugCondition(input) where sub-condition 1.6 holds (no concurrency slack at deployment)_ + - _Expected_Behavior: expectedBehavior(result) per design Property 1 clause (e) — `desiredCount >= 2` in production configuration_ + - _Preservation: none impacted (deployment-config change; 
in-process singletons untouched)_ + - _Requirements: 2.6_ + + - [x] 3.8 Verify bug condition exploration test now passes + - **Property 1: Expected Behavior** - Event-Loop Non-Blocking, Coalesced, Window-Staggered, Fire-and-Forget BFF Middleware + - **IMPORTANT**: Re-run the SAME test from task 1 - do NOT write a new test + - The test from task 1 encodes the expected behavior from design Property 1 + - When this test passes, it confirms the expected behavior is satisfied across all seven sub-conditions + - Run: `cd backend && uv run python -m pytest tests/apis/shared/middleware/test_session_refresh_bug_condition.py -v` + - **EXPECTED OUTCOME**: Test PASSES (confirms bug is fixed): + - Marker coroutines complete while AWS stubs are still sleeping (1.1, 1.2) + - 8-fan-out produces exactly 1 `get_item` on the stub (1.3, 1.7) + - Aligned-boundary request produces at most one of `{get_item, update_item}` (1.4) + - Dispatch latency independent of `update_item` stub latency (1.5) + - `appApi.desiredCount >= 2` in production context (1.6) + - _Requirements: Expected Behavior Properties from design (2.1–2.7)_ + + - [x] 3.9 Verify preservation tests still pass + - **Property 2: Preservation** - BFF Middleware Contracts Unchanged for Non-Buggy Inputs + - **IMPORTANT**: Re-run the SAME tests from task 2 - do NOT write new tests + - Run: `cd backend && uv run python -m pytest tests/apis/shared/middleware/test_session_refresh_preservation.py -v` + - **EXPECTED OUTCOME**: Tests PASS (confirms no regressions): + - Dormant pass-through with zero AWS calls (3.1) + - No-cookie pass-through with zero AWS calls (3.2) + - Unrecoverable cookie clears both cookies with identical attributes (3.3) + - Max-Age re-emit contract preserved under fire-and-forget dispatch (3.4) + - Exactly one `initiate_auth` per `session_id` per leeway window (3.5) + - `get_default_codec()` and `resolve_bff_client_secret()` remain singletons (3.6, 3.7) + - `CSRFMiddleware` path unchanged (3.8) + - 
Absolute-lifetime cap preserved (3.9) + - Fail-closed rotation preserved (3.10) + - Uniform `CookieDecodeError` handling preserved (3.11) + - Confirm all tests still pass after fix (no regressions) + +- [x] 4. Checkpoint - Ensure all tests pass + - Run the full backend test suite: `cd backend && uv run python -m pytest tests/ -v` + - Run CDK unit tests: `cd infrastructure && npm run build && npm test` + - Confirm the bug condition exploration test (task 1) passes on fixed code + - Confirm the preservation property tests (task 2) pass on fixed code + - Confirm no unrelated tests regress + - Ensure all tests pass, ask the user if questions arise diff --git a/CHANGELOG.md b/CHANGELOG.md index 3756d72e..24ec1333 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,76 @@ All notable changes to this project are documented in this file. Format follows For narrative release notes written for operators and product owners, see [RELEASE_NOTES.md](RELEASE_NOTES.md). +## [1.0.0-beta.25] - 2026-05-11 + +Production-readiness fix for the BFF Token Handler shipped in beta.24. Fixes three production-breaking bugs introduced by beta.24: event-loop-blocking sync boto3 on every cookie-bearing request, per-process AES-256 keys that can't round-trip cookies across ECS tasks, and an in-process-only refresh lock that races Cognito rotation across replicas. Also ships PDF thumbnails, rich attachment previews, spreadsheet analysis tools, centralized 401 handling, and a `SKIP_AUTH` local-dev bypass. + +### 🐛 Fixed + +- **Critical (beta.24 regression):** `SessionRefreshMiddleware` ran sync boto3 (DynamoDB + Cognito) on the uvicorn event loop so Angular's ~8-endpoint page-load fan-out produced ~16 serialized blocking AWS calls per user per minute. Observable as ALB 504s, 15.6s p-max `TargetResponseTime` at 0.7% CPU, `/files/quota` outliers reaching ~80s. 
Every boto3 call in `SessionRepository` and `CognitoRefreshClient.refresh` now offloads via `asyncio.to_thread`; `_resolve_session` is wrapped in a per-session `asyncio.Future` single-flight so N concurrent same-session callers share one loader invocation; `_maybe_slide` dispatches `touch_last_seen` as a detached `asyncio.Task` (with strong reference on the middleware to prevent GC); `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` raised 60s → 300s to de-align from the 60s refresh-leeway window (#264) +- **Critical (beta.24 regression):** `CookieCodec` called `kms:GenerateDataKey` on first use per process, so each app-api task minted its own random AES-256 key. Once `desiredCount` went above 1, cookies sealed on Task A failed as `bad seal` on Task B (~50% of requests). Data key is now generated once via Secrets Manager `generateSecretString` (44-char, ~261 bits entropy) encrypted at rest with the existing `BFFCookieSigningKey` CMK; `CookieCodec._ensure_cipher` reads the secret and derives the AES-256 key via SHA-256; `kms:GenerateDataKey` dropped from the runtime task role (#273, #274) +- **Critical (beta.24 regression):** In-process `single_flight` and `get_session_lock` only coalesce same-session callers within one Python process. Under multi-replica, two tasks could each call `cognito-idp:initiate_auth` with the same refresh token; Cognito rotates on the winner and the loser silently logs the user out. New DDB conditional-write lock (`try_acquire_refresh_lock` / `release_refresh_lock` on `BFFSessionsTable`, reusing the existing `dynamodb:UpdateItem` grant) elects exactly one leader fleet-wide; followers poll the row and adopt the leader's tokens. `update_tokens` gains strict-owner condition (`refresh_lock_owner = :owner`) that atomically `REMOVE`s the lock attrs on successful persist and rejects stale-leader stomps via `ConditionalCheckFailedException`. 
Absolute-lifetime guard added ahead of lock acquisition so we don't burn a Cognito refresh on a row that's about to TTL-evict (#273, #275) +- Per-message cost double-count on tool-use turns — Strands' `AgentResultEvent` cumulative `accumulated_usage` overwrote the last assistant message's per-call usage via `.update()`. Route the result-extracted cumulative on the `metadata_summary` turn-summary track instead of `metadata` (#270) +- Context-% inflation within a tool turn — Bedrock reports each per-LLM-call `inputTokens` as the full context sent on that call, so Strands' summed `accumulated_usage` over-reports. `stream_coordinator` no longer accumulates `metadata_summary` into `accumulated_metadata`; per-call `metadata` last-write-wins so the value equals the most recent call's full input = current context. Summed across `inputTokens` + `cacheReadInputTokens` + `cacheWriteInputTokens` since `AgentResult.context_size` under-reports by 99%+ under prompt caching (#270) +- `LatencyMetrics.time_to_first_token` changed from `int` (placeholder 0) to `Optional[int]` (placeholder `null`) — a real TTFT can't be 0ms and aggregations need to distinguish absence from a real value (#270) +- Session-expired mid-session left users stranded with a generic toast or no feedback on SSE. Every 401 now flows through `SessionService.handleUnauthorized()`, which dedupes concurrent calls and navigates once with preserved `returnUrl` (#277) +- Session loss not surfaced until the next HTTP call failed. Added cookie-presence fast-path (JS-readable `__Host-bff_csrf` cookie absence implies `__Host-bff_session` also gone) and visibility re-probe on tab refocus (#277) +- Login & first-boot lava-lamp backdrop dark-mode CSS never applied on cold load — `html.dark .X` selectors don't match under Angular's emulated view encapsulation, and `ThemeService` was never injected in the pre-auth tree. 
Switched to `:host-context(html.dark) .X` and forced `ThemeService` construction via `provideAppInitializer` (#271) +- XLSX→CSV filename mismatches in the Code Interpreter sandbox triggered retry loops. Targeted error hints, tolerant filename matching for CSV↔XLSX aliasing, schema footer preservation on errors + +### 🚀 Added + +- Server-rendered PDF page-1 thumbnails on attachment cards. New `ThumbnailRenderer` MIME-dispatcher (PDF today via `pypdfium2`, lazy-cached `_thumb.png` sibling in S3, render runs in `loop.run_in_executor`); new `GET /files/{upload_id}/thumbnail` returning a short-lived presigned URL; single-file + session-cascade deletes clean up thumbnails. Frontend: `FileUploadService.getThumbnail()` returns a typed `ready` / `unsupported` / `unavailable` result; PDF badge renders `object-cover` (#263) +- Rich previews in user messages — iMessage-style image mosaic (1-bubble / 2-col / 1+2 split / 2×2 / 5+ with `+N` overlay) with full-screen lightbox + arrow-key navigation; document-style cards for non-images with tinted header + folded corner + content excerpt. New `GET /files/{upload_id}/preview-url` and `GET /files/{upload_id}/text-snippet` (first 2KB UTF-8) (#254) +- Inline markdown preview for `.md` files in attachment cards; full-screen modal viewer via `ngx-markdown` instead of opening raw source in a new tab (#262) +- Spreadsheet analysis tools — `list_spreadsheets` enumerates CSV/XLSX across KB + attachments (with size + MIME metadata); `analyze_spreadsheet` runs Python analysis in Code Interpreter with schema detection (skiprows probing), cleaned pandas/numpy tracebacks, and 10K/600-char output/error truncation. Injected per-request via `extra_tools` (#f88ce7ec, #0ab90bb1) +- `SKIP_AUTH=true` local-dev bypass in `apis.shared.auth.dependencies` returns a fake admin user from all three auth dependencies. Optional tuning: `SKIP_AUTH_ROLES`, `SKIP_AUTH_USER_ID`, `SKIP_AUTH_EMAIL`. 
Startup guard in `app_api/main.lifespan` refuses to boot when `SKIP_AUTH=true` is paired with any non-localhost entry in `CORS_ORIGINS`. Inference-api intentionally not bypassed (all SPA traffic flows through app-api) (#272) +- New CI workflow `.github/workflows/skip-auth-guard.yml` greps CDK source, workflow files, and Dockerfiles for `SKIP_AUTH=true` / `SKIP_AUTH: true` patterns and fails the build if any leak into deployed config. SHA-pinned `actions/checkout`, `ubuntu-24.04` (#272) +- `SessionRepository.try_acquire_refresh_lock(session_id, owner, lock_ttl_seconds)` and `release_refresh_lock(session_id, owner)` for cross-task refresh coalescing (#273, #275) +- `apis/shared/sessions_bff/single_flight.py` — new `resolve_once(session_id, loader_coro_factory)` primitive for in-process coalescing of the session-resolve path (#264) +- CAUTION comment in `stream_coordinator` documenting that `AgentResult.context_size` / `EventLoopMetrics.latest_context_size` return only `inputTokens`, under-reporting by 99%+ under prompt caching (#270) + +### ✨ Improved + +- File metadata utilities (`backend/src/apis/shared/files/models.py`) for consistent attachment handling — `FileMetadata`, `FileContent`, size formatting, MIME-type inference — shared between routes and the chat-input component +- Spreadsheet-analysis system prompt clarifies filename vs. sandbox-path handling; tool docstrings expanded with critical guidance on retries +- Stream processor error handling for Code Interpreter responses is more defensive +- Updated `test_session_refresh_preservation.py`'s `InstrumentedTable` to differentiate lock-acquire / token-persist / slide writes so `update_item_side_effect` injection only fires on the persist path (preserving original test intent) (#273) + +### 🔒 Security + +- `kms:GenerateDataKey` and `kms:DescribeKey` dropped from the app-api runtime task role (least privilege). 
Only `kms:Decrypt` remains, invoked by Secrets Manager on the caller's behalf when reading the CMK-encrypted `BFFCookieDataKeySecret` (#274) +- `SKIP_AUTH=true` gated by boot-time CORS-origin allowlist + CI guard workflow; fails closed for any deploy target we haven't anticipated instead of blocklisting known cloud env vars (#272) + +### ⚡ Performance + +- `SessionRefreshMiddleware` resolve path now coalesces Angular's ~8-endpoint page-load fan-out to 1 `get_item` and 0 `update_item` on the critical path (previously ~16 serialized blocking AWS calls per user per minute). Response latency independent of `touch_last_seen` DDB latency after the `_maybe_slide` fire-and-forget refactor (#264) +- `CookieCodec` initialization dropped from `kms:GenerateDataKey` + per-cold-start round trip to a one-shot Secrets Manager `GetSecretValue` + local SHA-256. No more per-task cold-start KMS call (#274) +- Thumbnail render runs in `loop.run_in_executor` so the request worker isn't blocked; lazy `_thumb.png` sibling in S3 means steady-state thumbnails are a HEAD + presign, not a render (#263) + +### 🏗️ Infrastructure + +- New `BFFCookieDataKeySecret` (Secrets Manager, encrypted with `BFFCookieSigningKey` CMK); SSM parameter `/${projectPrefix}/auth/bff-cookie-data-key-secret-arn` publishes the ARN +- App-api task role: added `secretsmanager:GetSecretValue` on the new secret; removed `kms:GenerateDataKey` and `kms:DescribeKey` on `BFFCookieSigningKey`; kept `kms:Decrypt` +- `appApi.desiredCount` raised 1 → 2 — concurrency slack so a single blocked event loop can no longer halt all ingress + +### 📦 Dependencies + +- Backend: `strands-agents` 1.37.0 → 1.39.0, `strands-agents-tools` 0.5.1 → 0.5.2, new: `pypdfium2` (#265, #263) + +### 🧪 Test Coverage + +- `tests/apis/shared/middleware/test_session_refresh_bug_condition.py` (12 cases) — encodes the seven sub-conditions of the event-loop-blocking bug as Hypothesis properties. 
Fails on unfixed code (by design); passes on fixed code (#264) +- `tests/apis/shared/middleware/test_session_refresh_preservation.py` (19 cases) — locks in 11 preservation invariants that must remain unchanged for non-buggy inputs (#264) +- `tests/apis/shared/sessions_bff/test_single_flight.py` (6 cases) — primitive-level coverage for the new `resolve_once` module (#264) +- `tests/apis/shared/sessions_bff/test_session_refresh_cross_task.py` (480 lines) — two-task integration coverage over moto DDB for the cross-task refresh lock, follower-polling/adoption, TTL recovery, headline invariant that two tasks racing in parallel call Cognito at most once (#273) +- 8 new repository tests for the lock primitive (acquire on unlocked row, contention blocks peer, TTL recovery, distinct-session isolation, release-by-owner-only, atomic clear on token persist, condition fails when peer owns the lock, phantom-row-prevention on acquire, strict-owner release condition, absolute-lifetime guard ahead of refresh) (#273, #275) +- `tests/agents/main_agent/streaming/test_per_message_cost_attribution.py` — three regression cases for the `metadata` vs `metadata_summary` contract; two parametrized cases for `stream_coordinator` current-context semantics including all-three-buckets-summed under cache-read/write (#270) +- `tests/costs/test_calculator.py` — 26 cases of direct coverage for `CostCalculator` (per-bucket pricing, cache scenarios against Sonnet 4.5 rates, defensive missing-key / None handling, `calculate_cache_savings`, `validate_*` predicates) (#270) +- `tests/auth/test_skip_auth.py` — `SKIP_AUTH` dependency-bypass + env-override coverage, startup guard allowlist behavior, skip-auth-guard.yml regex matches (#272) +- Session-wide autouse fixture in `tests/conftest.py` scrubs `SKIP_AUTH_*` env so developer `.env` bleed doesn't silently turn on the bypass in test runs (#272) +- Infrastructure-stack tests: dropped bootstrap-custom-resource assertions; added negative lock that no 
`AwsCustomResource` emits `kms:GenerateDataKey` / `secretsmanager:PutSecretValue`; positive assertion on `generateSecretString` shape (44-char, no punctuation, no space); fixed two pre-existing stale resource-count assertions (16→18 DDB tables, 3→6 secrets) (#273, #274) + ## [1.0.0-beta.24] - 2026-05-06 ### 🚀 Added diff --git a/README.md b/README.md index c9db526f..594f0de5 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ **An open-source, production-ready Generative AI platform for institutions** *Built by Boise State University, designed for everyone.* -[![Release](https://img.shields.io/badge/Release-v1.0.0--beta.24-6366f1?style=flat&logo=github&logoColor=white)](RELEASE_NOTES.md) +[![Release](https://img.shields.io/badge/Release-v1.0.0--beta.25-6366f1?style=flat&logo=github&logoColor=white)](RELEASE_NOTES.md) [![Nightly](https://github.com/Boise-State-Development/agentcore-public-stack/actions/workflows/nightly.yml/badge.svg)](https://github.com/Boise-State-Development/agentcore-public-stack/actions/workflows/nightly.yml) ![Python](https://img.shields.io/badge/Python-3.13+-3776AB?style=flat&logo=python&logoColor=white) @@ -260,7 +260,7 @@ agentcore-public-stack/ See [RELEASE_NOTES.md](RELEASE_NOTES.md) for the full changelog, including new features, bug fixes, platform upgrades, and deployment notes for each release. -**Current release:** v1.0.0-beta.24 +**Current release:** v1.0.0-beta.25 --- diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md index 46d94619..32622330 100644 --- a/RELEASE_NOTES.md +++ b/RELEASE_NOTES.md @@ -1,13 +1,214 @@ -# Release Notes — v1.0.0-beta.24 +# Release Notes — v1.0.0-beta.25 -**Release Date:** May 6, 2026 -**Previous Release:** v1.0.0-beta.23 (April 29, 2026) +**Release Date:** May 11, 2026 +**Previous Release:** v1.0.0-beta.24 (May 6, 2026) --- ## Highlights -This release lands the **BFF Token Handler** — a ground-up rewrite of the SPA's auth surface. 
`localStorage` Bearer tokens are replaced with server-side Cognito session storage keyed by an opaque session id in a KMS-sealed AES-GCM cookie, the public PKCE Cognito client is decommissioned in favor of a confidential client whose secret never leaves the server, and same-origin `/api/*` routing via CloudFront enables `__Host-` cookies, double-submit CSRF, and eliminates the CORS preflight from every chat turn. **Voice mode returns** via a WebSocket-ticket proxy on app-api. The chat view gains a **per-conversation cost + context-window badge** with write-time aggregation, and **context compaction events** now surface inline with refresh-survival. Anthropic **extended thinking** is wired end-to-end via per-model inference parameters. The backend finishes its architecture cleanup: cost, tools, storage, and API-keys modules now live under `apis.shared` with AST-enforced import boundaries. +This release is the **production-readiness fix for the BFF Token Handler** shipped in v1.0.0-beta.24. Beta.24 rewrote the SPA's auth surface onto cookie-based sessions but left three production-breaking bugs that only surfaced under real traffic: the `SessionRefreshMiddleware` ran synchronous boto3 on the uvicorn event loop so Angular's ~8-endpoint page-load fan-out produced ~16 serialized blocking AWS calls per user per minute (504s, 80s `/files/quota` tails, 15.6s p-max on a 0.7% CPU task); the `CookieCodec` minted a fresh random AES-256 key per process, so as soon as we raised `desiredCount` for concurrency slack every cookie started failing as `bad seal` on ~50% of requests; and the per-session refresh lock only coalesced in-process, so two tasks could still race `cognito-idp:initiate_auth` with the same refresh token and Cognito's rotation would silently log out the loser. 
This release lands the **event-loop offload + single-flight resolve**, a **cross-task shared AES key via Secrets Manager**, and a **DDB conditional-write refresh lock** that elects exactly one leader fleet-wide. + +Also shipping: **server-rendered PDF page-1 thumbnails** on attachment cards, **rich iMessage-style image mosaics** with a full-screen lightbox and inline markdown preview for `.md` files in user messages, **spreadsheet analysis tools** (`list_spreadsheets`, `analyze_spreadsheet`) that run CSV/XLSX analysis inside the Code Interpreter sandbox, **centralized 401 handling** with proactive session-loss detection on tab refocus, and a **`SKIP_AUTH=true` local-dev bypass** gated by a CORS-origin allowlist and a CI guard workflow. Token accounting was corrected across the board — per-message cost no longer double-counts tool-use turns and the context-% badge reflects current context occupancy rather than Strands' summed-across-calls value. + +### Heads-up on beta.24 + +If you deployed beta.24 to a multi-replica environment, you saw some or all of: 401 storms on `/auth/session`, page-load latency tails in the tens of seconds, and users silently logged out after tab refocus. Beta.25 is the fix. The CookieCodec and refresh-lock changes require redeploying the Infrastructure and App API stacks in order — see **🚀 Deployment notes** at the bottom. + +--- + +## BFF Middleware Event-Loop Blocking & Fan-Out Amplification + +The middleware introduced in beta.24 ran three independent classes of work on the uvicorn event loop that weren't safe to run there: synchronous boto3 for DynamoDB + Cognito, an inline-awaited sliding-session write on the response path, and a refresh-coalescing lock that only wrapped the Cognito exchange instead of the full resolve path. 
Under Angular's ~8-endpoint page-load fan-out with a cold `SessionCache` window, a single cookie-bearing user produced ~16 serialized blocking AWS round-trips on one uvicorn worker running in a single ECS task — every slow call stalled every concurrent request on the same task. The observable symptoms were ALB 504s, `TargetResponseTime` p-max of 15.6s at 0.7% CPU, `/files/quota` outliers reaching ~80s, and endpoint p95s climbing into the hundreds of ms under trivial load. (#264) + +### How it works now + +`SessionRepository.{get,put,update_tokens,touch_last_seen,delete}` and `CognitoRefreshClient.refresh` now offload every boto3 call via `asyncio.to_thread`, so the event loop keeps scheduling other coroutines for the full AWS round-trip duration. A new per-session single-flight primitive (`apis/shared/sessions_bff/single_flight.py`) wraps the whole `cache.get → repository.get → needs_refresh → (maybe refresh)` block in `SessionRefreshMiddleware._resolve_session` — the first caller per `session_id` runs the loader; N concurrent followers await a shared `asyncio.Future` and consume the leader's result. The existing `get_session_lock(session_id)` around the Cognito exchange is preserved end-to-end as defense in depth. `_maybe_slide` no longer `await`s `touch_last_seen` inline — the DDB write dispatches as a detached `asyncio.Task` and the response returns the fresh `Max-Age` synchronously. The cache/throttle boundary alignment that forced a single request to pay both `get_item` and `update_item` on the cache-miss boundary has been de-aligned: `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` is now a strict multiple of `_DEFAULT_REFRESH_LEEWAY_SECONDS` (300s vs 60s). 
+ +### Backend + +- `apis/shared/sessions_bff/repository.py` — every boto3 call now wrapped in a nested sync helper invoked via `await asyncio.to_thread(helper, ...)`; method signatures, return types, and exception branches unchanged +- `apis/shared/sessions_bff/refresh.py` — `refresh` is now `async def`, calling `await asyncio.to_thread(self._refresh_sync, ...)`; `CognitoRefreshError` contract and `RefreshResult` shape preserved verbatim +- `apis/shared/sessions_bff/single_flight.py` — new module. `async def resolve_once(session_id, loader_coro_factory) -> tuple[Optional[SessionRecord], bool]`. Leader registers an `asyncio.Future` under a thread-lock-guarded `dict`, runs the loader, sets the result/exception on the Future, removes the registry entry in a `finally` block. Followers `await` the existing Future. Distinct `session_id`s never share a Future +- `apis/shared/middleware/session_refresh.py` — `_resolve_session` wraps the cache/repo/refresh block in `resolve_once(session_id, _loader)`. `_maybe_slide` updates the local cache synchronously and dispatches `touch_last_seen` via `asyncio.create_task`, keeping the task on `self._slide_tasks` with an `add_done_callback(self._slide_tasks.discard)` — Python's asyncio docs explicitly warn that unreferenced tasks can be GC'd mid-flight, and our initial fix landed this footgun (caught by CI on Python 3.12) +- `apis/shared/sessions_bff/config.py` — `_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS` raised 60s → 300s. Strict multiple of the 60s leeway guarantees cache-miss and slide-throttle boundaries never coincide + +### Infrastructure + +- `infrastructure/cdk.context.json` — `appApi.desiredCount` raised 1 → 2 for concurrency slack. A single blocked event loop on one task can no longer halt all ingress + +### Test Coverage + +~900 lines of new property-based tests. 
`test_session_refresh_bug_condition.py` encodes each of the seven sub-conditions as a hypothesis property that fails on unfixed code and passes on fixed code (Property 1 / Expected Behavior from the bugfix spec). `test_session_refresh_preservation.py` locks in the 11 preservation invariants that must stay unchanged for non-buggy inputs — dormant pass-through, no-cookie pass-through, unrecoverable-cookie clearing, `Max-Age` re-emit contract, refresh-storm coalescing, codec + client-secret singletons, CSRF decision unchanged, absolute-lifetime cap, fail-closed rotation, uniform `CookieDecodeError` handling. `test_single_flight.py` covers the primitive itself: concurrent callers share one loader invocation, exceptions propagate to every waiter, registry entries clean up after failure, distinct sessions are independent. + +--- + +## BFF Cross-Task Cookie & Refresh Correctness + +The `desiredCount: 1 → 2` bump in the event-loop fix immediately exposed two latent defects in beta.24's BFF design that were hidden when only one task existed. Both had to be fixed before the deployment was actually safe to run with more than one replica. (#273, #274, #275) + +### Shared AES-256 data key via Secrets Manager + +`CookieCodec` in beta.24 called `kms:GenerateDataKey` on first use per process and cached the resulting plaintext AES-256 key in memory. The code's own docstring predicted what would happen with more than one task: _"two codecs in one process can never decrypt each other's output."_ And that's what happened — Task A sealed a cookie with Key-A, the ALB routed the follow-up to Task B which had its own Key-B, `unseal` hit `InvalidTag` → `CookieDecodeError` → `Discarding unrecoverable BFF cookie (bad seal)` → 401. CloudWatch confirmed: three app-api streams each independently logged _"BFF cookie codec initialized (KMS data key fetched)"_ and every subsequent `/auth/session` returned 401. 
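
The failure mode above is easy to reproduce in isolation. The sketch below is an illustration only: `ToyCodec` is a hypothetical stand-in that uses an HMAC-SHA256 tag from the standard library in place of the real AES-GCM seal, so it shows the key-mismatch behaviour and nothing about the actual `CookieCodec` internals.

```python
import hashlib
import hmac
import secrets


class ToyCodec:
    """Hypothetical stand-in for CookieCodec (HMAC tag instead of AES-GCM)."""

    def __init__(self, key: bytes) -> None:
        self._key = key

    def seal(self, payload: bytes) -> bytes:
        # Prepend a 32-byte authentication tag computed under this codec's key.
        tag = hmac.new(self._key, payload, hashlib.sha256).digest()
        return tag + payload

    def unseal(self, sealed: bytes) -> bytes:
        tag, payload = sealed[:32], sealed[32:]
        expected = hmac.new(self._key, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("bad seal")
        return payload


# beta.24 behaviour: every process mints its own random key on first use.
task_a = ToyCodec(secrets.token_bytes(32))
task_b = ToyCodec(secrets.token_bytes(32))
cookie = task_a.seal(b"session-id-123")

# Contrast: two codecs constructed from the same key bytes.
shared_key = secrets.token_bytes(32)
peer_1 = ToyCodec(shared_key)
peer_2 = ToyCodec(shared_key)
```

Here `task_a.unseal(cookie)` returns the payload, while `task_b.unseal(cookie)` raises `ValueError("bad seal")`, the analogue of the `InvalidTag` path. `peer_2` can unseal anything `peer_1` seals because both hold the same key.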
+ +The fix moves the data key out of per-process state and into a single Secrets Manager secret, encrypted at rest by the existing `BFFCookieSigningKey` CMK: + +- CDK creates `BFFCookieDataKeySecret` with `generateSecretString` (44-char alphanumeric, ~261 bits of entropy). On every deploy the secret already exists so the value is stable — cookies survive redeploys +- `CookieCodec._ensure_cipher` reads the secret string and applies SHA-256 to derive the 32-byte AES-256 key. Single-shot SHA-256 of a ≥256-bit-entropy random input is a sound KDF for AES-256 usage +- Every app-api task decrypts the same secret and derives the same key → all codecs round-trip each other's seals. The `kms:GenerateDataKey` permission dropped from the runtime task role (least privilege); `kms:Decrypt` stays because Secrets Manager invokes it on the caller's behalf when reading a CMK-encrypted secret + +A previous attempt at this bootstrap (#273's initial chained `AwsCustomResource` flow with `kms:GenerateDataKey → secretsmanager:PutSecretValue`) failed stack create with `Response object is too long`. Root cause: the `AwsCustomResource` framework Lambda JSON-stringifies the AWS-SDK response before applying `outputPaths`, and KMS returns `CiphertextBlob` as a Uint8Array that serializes as `{"0":233,"1":18,...}` — ~1.5 KB for a 200-byte ciphertext, past CloudFormation's 4 KB response-object limit. The Secrets-Manager-native `generateSecretString` path in #274 removes the chained custom resources entirely (-153 lines net), no per-cold-start `kms:Decrypt` call, simpler runtime IAM surface. + +### Cross-task refresh lock via DDB conditional-write + +The in-process single-flight and the existing `get_session_lock` only coalesce same-session callers within one Python process. 
Once the cookie-codec fix lands and both tasks can share cookies again, under `desiredCount: 2` two tasks each receive a same-session request crossing the refresh-leeway window and each call `cognito-idp:initiate_auth` with the same refresh token. Cognito rotates on the winning call; the loser receives `NotAuthorizedException`, the loser's middleware clears the cookie, and the user is silently logged out. + +- `SessionRepository.try_acquire_refresh_lock(session_id, owner, lock_ttl_seconds)` — conditional `UpdateItem` that succeeds iff `attribute_not_exists(refresh_lock_until) OR refresh_lock_until < :now` AND `attribute_exists(PK)` (no phantom rows for sessions that don't exist). Loser returns `False` +- `SessionRepository.update_tokens` gains `expected_lock_owner=...` — when supplied, the write conditionally requires `refresh_lock_owner = :owner` (strict, not "owner-or-absent") and atomically `REMOVE`s the lock attrs in the same write. The stale-leader-stomp case (Task A's lock TTLs, Task B refreshes, Task A returns with older tokens) now surfaces as `ConditionalCheckFailedException` so the caller can re-read and adopt the peer's tokens +- `SessionRepository.release_refresh_lock(session_id, owner)` — best-effort cleanup for the leader-failed path so a peer doesn't have to wait the full TTL before retrying +- `SessionRefreshMiddleware._resolve_session._loader` — two-tier coalescing: (1) existing `get_session_lock` collapses N in-process same-session callers to one contender; (2) `try_acquire_refresh_lock` elects exactly one leader fleet-wide. Followers poll the row via `_wait_for_peer_refresh` and adopt the leader's tokens (rotation detected by refresh-token mismatch; non-rotation by access-token mismatch + future-dated `exp`). 
Absolute-lifetime guard added ahead of the lock acquisition — if `now > created_at + absolute_lifetime_seconds`, clear the cookie instead of burning a Cognito refresh on a row that's about to TTL-evict + +### Test Coverage + +Cross-task integration tests (`test_session_refresh_cross_task.py`, 480 lines) run two `SessionRefreshMiddleware` instances against one moto DDB table and exercise leader/follower paths, follower-polling-then-adopting, lock TTL recovery after a dead leader, follower-fall-back-terminal when the leader is stuck, and the headline invariant: two tasks racing in parallel call Cognito at most once. Eight new repository tests lock the lock primitive shape, plus targeted tests for the strict-owner release condition and the phantom-row-prevention guard on acquire. + +### Infrastructure + +- New `BFFCookieDataKeySecret` (Secrets Manager), encrypted with `BFFCookieSigningKey`. SSM parameter `/${projectPrefix}/auth/bff-cookie-data-key-secret-arn` publishes the ARN for app-api +- App-api task role: added `secretsmanager:GetSecretValue` on the new secret; kept `kms:Decrypt` (needed by Secrets Manager to read the CMK-encrypted secret); removed `kms:GenerateDataKey` and `kms:DescribeKey` +- No IAM change required for the DDB refresh lock — app-api task role already had `dynamodb:UpdateItem` on `BFFSessionsTable` + +### Breaking changes + +- None user-facing. The new env var and SSM parameter are additive; existing deployments redeploy Infrastructure first, then App API, to pick up the shared secret + +--- + +## Token Accounting Correctness + +Two related bugs were inflating cost and context-% reporting on tool-use turns. (#270) + +### Per-message cost double-count + +Strands emits per-LLM-call metadata (each call's tokens) AND a final `AgentResultEvent` whose `EventLoopMetrics.accumulated_usage` is summed across every call in the turn. 
Both were emitted as `metadata` events and routed into `per_message_metadata[current_assistant_message_index]["usage"]` via `.update()`. Because the `AgentResult` event arrives after every `message_stop`, the index still pointed at the last assistant message — so cumulative tokens overwrote that message's per-call values, double-counting earlier messages' input tokens when each entry was priced and summed. + +Fix: route the result-extracted cumulative on the existing `metadata_summary` (turn-summary) track instead of `metadata`. The `stream_processor` main loop consumes both event types into `accumulated_metadata` so the final summary still carries true totals. + +### Context-% inflation within a tool turn + +Bedrock reports each per-LLM-call `inputTokens` as the FULL context size sent on that call. For a 2-call tool turn (`call_1.input=1000`, `call_2.input=2500`), Strands' `accumulated_usage` reports 3500 — but the actual current context occupancy is 2500. The final SSE `usage` field driving the context-% badge and compaction trigger was inheriting Strands' summed value. + +Fix: `stream_coordinator` no longer accumulates `metadata_summary` into `accumulated_metadata`. Per-call `metadata` events last-write-wins via `.update()`, so `accumulated_metadata.usage` equals the most recent call's full input = current context. Added a `CAUTION` comment noting `AgentResult.context_size` / `EventLoopMetrics.latest_context_size` return only `inputTokens` (excluding `cacheRead` / `cacheWrite`) — under prompt caching they under-report by 99%+, so we deliberately sum all three buckets. `TTFT` placeholder of 0 changed to `null` (a real time-to-first-token can never be 0ms and aggregations need to distinguish absence from a real zero); `LatencyMetrics.time_to_first_token` is now `Optional[int]` in both the shared and app-api models. 
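
The context-occupancy arithmetic above can be written out as a small sketch. The helper names are hypothetical, and the usage key names (`inputTokens`, `cacheReadInputTokens`, `cacheWriteInputTokens`) are assumed to follow Bedrock's usage payload; the real accumulator lives in `stream_coordinator` and may be shaped differently.

```python
from typing import Dict, List

Usage = Dict[str, int]


def summed_input(calls: List[Usage]) -> int:
    """Sum of inputTokens across every call in the turn (the inflated figure)."""
    return sum(call.get("inputTokens", 0) for call in calls)


def current_context(calls: List[Usage]) -> int:
    """Context occupancy: the most recent call only, all three input buckets."""
    if not calls:
        return 0
    last = calls[-1]
    return (
        last.get("inputTokens", 0)
        + last.get("cacheReadInputTokens", 0)
        + last.get("cacheWriteInputTokens", 0)
    )


# The two-call tool turn from the text above.
turn = [{"inputTokens": 1000}, {"inputTokens": 2500}]

# Same second call under prompt caching: most of the context is a cache read.
cached_turn = [
    {"inputTokens": 1000},
    {"inputTokens": 20, "cacheReadInputTokens": 2400, "cacheWriteInputTokens": 80},
]
```

For `turn`, `summed_input` gives 3500 and `current_context` gives 2500. For `cached_turn`, reading `inputTokens` alone on the last call would report 20, while summing the three buckets still gives 2500.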
+ +### Test Coverage + +`test_per_message_cost_attribution.py` pins the `metadata` vs `metadata_summary` contract, the main-loop accumulator's both-tracks consumption, and the `stream_coordinator` current-context semantics (two parametrized cases plus all-three-buckets-summed for cache-read/write). Direct unit coverage for `CostCalculator` arrived in `test_calculator.py` (26 cases: per-bucket pricing, cache scenarios against Sonnet 4.5 rates, defensive missing-key / None handling, `calculate_cache_savings`, `validate_pricing` / `validate_usage`). + +--- + +## Auth UX & Local-Dev Bypass + +### Centralized 401 handling + proactive session detection + +Beta.24 only redirected on 401 from the SessionService bootstrap path — a session that expired mid-session left the user stranded with a generic toast (CRUD endpoints) or no feedback (SSE chat stream). Every 401 now flows through `SessionService.handleUnauthorized()`, which dedupes concurrent calls and queues a single navigation to `/auth/login` with a preserved `returnUrl`. Session loss is surfaced proactively rather than waiting for the next HTTP call to fail: (#277) + +- **Cookie-presence fast-path** in bootstrap and recheck. The JS-readable `__Host-bff_csrf` cookie is set and cleared alongside `__Host-bff_session` with matching `Max-Age`, so if the CSRF cookie is gone the session cookie is gone too — skip the `/auth/session` round-trip and bounce straight to login +- **Visibility re-probe** in the app shell. On tab refocus, `recheck()` runs the cookie check and falls back to `/auth/session`, so a session that expired while the tab was backgrounded is caught immediately rather than on the next user action + +### `SKIP_AUTH=true` local-dev bypass + +A single-env-var bypass for unattended local dev (and Claude Code agents) that can't round-trip through an external IdP. 
(#272) + +- Returns a fake admin `User` from the three auth dependencies in `apis.shared.auth.dependencies`; CSRF middleware, RBAC, and profile cache flow naturally because no `bff_session` is resolved +- **Allowlist startup guard** in `app_api/main.lifespan` — app refuses to boot when `SKIP_AUTH=true` is paired with any non-localhost entry in `CORS_ORIGINS` (or an empty `CORS_ORIGINS`). Fails closed for deploy targets we haven't anticipated rather than blocklisting known cloud env vars +- **CI guard workflow** (`.github/workflows/skip-auth-guard.yml`) — greps CDK source, workflow files, and Dockerfiles for `SKIP_AUTH=true` / `SKIP_AUTH: true` patterns and fails the build if any leak into deployed config +- Inference-api is intentionally not bypassed — all SPA traffic flows through app-api per the BFF pattern, so one bypass is sufficient +- Optional tuning: `SKIP_AUTH_ROLES`, `SKIP_AUTH_USER_ID`, `SKIP_AUTH_EMAIL` override the default fake user + +### Lava-lamp backdrop dark-mode fix + +The dark-mode CSS for the auth pages' lava-lamp backdrop and frosted-glass card never applied on cold load: hand-written `html.dark .X` selectors don't match under Angular's emulated view encapsulation, and `ThemeService` (`providedIn:'root'`) was never injected by anything in the pre-auth tree. Switched the auth-page CSS to `:host-context(html.dark) .X` (the pattern already used component-scoped elsewhere) and forced `ThemeService` to construct at bootstrap via `provideAppInitializer`, so the persisted/system theme is applied to `<html>` before any route renders, including `/auth/login` and `/auth/first-boot` on cold load. (#271) + +--- + +## Attachments: PDF Thumbnails, Rich Previews, Markdown Modal + +### Server-rendered PDF page-1 thumbnails + +Real first-page thumbnails for PDF attachments instead of the skeleton mockup. Page rasterization runs in app-api via `pypdfium2` (Apache 2.0 / BSD, bundled PDFium binary, no system `poppler`/`ghostscript`).
(#263) + +- New `ThumbnailRenderer` with a MIME-type dispatcher; PDF only today. Class docstring documents the recommended out-of-process design for `.docx` / `.xlsx` so the dispatcher stays small +- `GET /files/{upload_id}/thumbnail` — lazy: HEAD-checks for a cached `_thumb.png` sibling next to the original, renders + stores on miss, returns a short-lived presigned GET URL. 415 for unsupported MIME types, 422 for unreadable / corrupt PDFs. Render runs in `loop.run_in_executor` so request workers aren't blocked +- Single-file and session-cascade deletes also remove the thumbnail sibling +- `FileUploadService.getThumbnail()` returns a typed result so callers switch on `ready` / `unsupported` / `unavailable` without parsing HTTP errors. Badge fetches on mount for PDFs and renders as `object-cover`, suppressing the bottom fade. Silent fall-back to the skeleton on any error + +### Rich previews in user messages + +The dense badge is replaced with a richer attachment renderer in user message history. (#254) + +- **Images** render as an iMessage-style mosaic: 1-bubble, 2-col, 1+2 split, 2×2 grid, 5+ with `+N` overlay. Opens in a full-screen lightbox with arrow-key navigation +- **Non-image files** render as a document-style card: tinted header strip with type chip, white "page" body with a folded corner, filename + size footer. Text-based files (txt, md, csv, html) show a real content excerpt; binary types (pdf, docx, xls/xlsx) get skeleton lines +- `GET /files/{upload_id}/preview-url` — short-lived presigned GET URL scoped to the file owner, used for inline images and the lightbox +- `GET /files/{upload_id}/text-snippet` — first 2KB of a text-based file decoded as UTF-8 for the document card content peek + +### Inline markdown preview for `.md` files + +Parsed markdown renders in the attachment card excerpt instead of raw text; clicking a `.md` card opens a full-screen modal viewer rather than opening the raw source in a new tab. 
Reuses `ngx-markdown` (already wired up for assistant messages) and the existing presigned preview-url flow. (#262) + +--- + +## Spreadsheet Analysis Tools + +New spreadsheet analysis capability for CSV/XLSX files. (#f88ce7ec, #0ab90bb1) + +- `list_spreadsheets` — enumerates CSV/Excel files from knowledge bases and chat attachments; includes file size and MIME type metadata +- `analyze_spreadsheet` — downloads files from S3, executes Python analysis via Code Interpreter, returns results. Intelligent schema detection with skiprows probing handles report-style exports with metadata rows. Stderr is cleaned to filter pandas/numpy internal frames and show only user-relevant errors. Output truncated at 10K chars, errors at 600 chars, to prevent context-window overflow +- Tools injected per-request into `ToolRegistry` via `extra_tools`; chat routes (app-api and inference-api) pass conversation context to the factories +- Targeted error hints for XLSX→CSV filename mismatches in the sandbox environment; tolerant filename matching for CSV↔XLSX aliasing to prevent retry loops; schema footer preservation on errors for better retry context +- File metadata models and utilities for consistent attachment handling; stream processor error handling improved for Code Interpreter responses + +--- + +## 📦 Dependencies + +| Package | From | To | +|---|---|---| +| strands-agents (backend) | 1.37.0 | 1.39.0 | +| strands-agents-tools (backend) | 0.5.1 | 0.5.2 | +| pypdfium2 (backend, new) | — | latest | + +`CacheConfig(strategy="auto")` remains intentionally deferred on `BedrockModel`. The strands v1.39.0 bump includes the SDK-side fix (strands PR #1438 — `cachePoint` blocks alongside non-PDF document attachments), so the technical barrier is gone — but the user-visible cost/badge impact warrants a separate scoped rollout. (#265) + +--- + +## 🏗️ Infrastructure + +- **New**: `BFFCookieDataKeySecret` (Secrets Manager), encrypted at rest with the existing `BFFCookieSigningKey` CMK. 
SSM parameter `/${projectPrefix}/auth/bff-cookie-data-key-secret-arn` +- **Changed**: `appApi.desiredCount` raised 1 → 2 +- **IAM delta on app-api task role**: added `secretsmanager:GetSecretValue` on `BFFCookieDataKeySecret`; removed `kms:GenerateDataKey` and `kms:DescribeKey` on `BFFCookieSigningKey`; kept `kms:Decrypt` (Secrets Manager invokes it on the caller's behalf when reading a CMK-encrypted secret) +- **No new tables**. The cross-task refresh lock reuses `BFFSessionsTable` via conditional `UpdateItem` + +--- + +## 🔧 CI/CD + +- **New workflow**: `.github/workflows/skip-auth-guard.yml` — greps CDK source, workflow files, and Dockerfiles for `SKIP_AUTH=true` / `SKIP_AUTH: true` patterns and fails the build if any leak into deployed config. Uses SHA-pinned `actions/checkout` and `ubuntu-24.04` per existing supply-chain conventions in `tests/supply_chain/` + +--- + +## 🚀 Deployment notes + +Deploy Infrastructure first, then App API, in that order. + +1. **Infrastructure stack** creates `BFFCookieDataKeySecret` and publishes its ARN to SSM. The secret value is generated by Secrets Manager on create and stays stable across subsequent deploys — cookies survive redeploys +2. **App API stack** picks up `BFF_COOKIE_DATA_KEY_SECRET_ARN` on the next task rotation; existing tasks keep the old per-process data key until they drain. Both states coexist cleanly — new tasks seal under the shared key; old tasks still seal under their own; unsealing on a task that holds a different key fails the same way it does today and the SPA bounces to login. End state (all tasks rotated): cookies round-trip cleanly across the fleet +3. **`desiredCount: 2` takes effect** on the App API stack's next deploy. CloudFormation scales up without draining traffic; the fix makes multi-replica safe + +No manual cleanup required if you were running on beta.24 — the migration is forward-only. 
If you want zero-drift on the user population, invalidate active sessions once post-deploy: `aws dynamodb scan --table-name ${BFFSessionsTable} --select COUNT` then a bulk delete, or just let the 30-day absolute-lifetime cap roll them off naturally. + +--- + + --- diff --git a/VERSION b/VERSION index 33008cb0..97aa7525 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.0.0-beta.24 +1.0.0-beta.25 diff --git a/backend/pyproject.toml b/backend/pyproject.toml index d6393e2b..4c23b5c8 100644 --- a/backend/pyproject.toml +++ b/backend/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "agentcore-stack" -version = "1.0.0-beta.24" +version = "1.0.0-beta.25" requires-python = ">=3.10" description = "Multi-agent conversational AI system with AWS Bedrock AgentCore" readme = "README.md" @@ -39,13 +39,17 @@ dependencies = [ "cryptography==47.0.0", "python-multipart==0.0.27", "aiohttp==3.13.5", + + # PDF page-1 rasterization for attachment thumbnails (Apache 2.0 / BSD). + # Bundles PDFium (no system poppler/ghostscript needed). 
+ "pypdfium2==4.30.0", ] [project.optional-dependencies] # AgentCore-specific dependencies (for inference_api) agentcore = [ - "strands-agents==1.37.0", - "strands-agents-tools==0.5.1", + "strands-agents==1.39.0", + "strands-agents-tools==0.5.2", "aws-opentelemetry-distro==0.17.0", "bedrock-agentcore==1.6.4", @@ -56,7 +60,7 @@ agentcore = [ # Voice/BidiAgent dependencies (Nova Sonic speech-to-speech) bidi = [ - "strands-agents[bidi]==1.37.0", + "strands-agents[bidi]==1.39.0", ] # Document ingestion pipeline dependencies (for Lambda deployment) diff --git a/backend/src/.env.example b/backend/src/.env.example index 44ca0eb4..85ec9162 100644 --- a/backend/src/.env.example +++ b/backend/src/.env.example @@ -69,6 +69,41 @@ AGENTCORE_CODE_INTERPRETER_ID= # DEVELOPMENT SETTINGS # ============================================================================= +# Local-dev auth bypass (OPTIONAL — LOCAL DEV ONLY) +# Purpose: Skip the Cognito redirect and return a fake admin user from +# the three auth dependencies (get_current_user, +# get_current_user_from_session, get_current_user_trusted). Lets an +# unattended agent or a dev with no IdP access hit protected app-api +# routes without the OAuth round-trip. +# Default: unset (auth enforced) +# Guard rails: +# - app-api refuses to boot when SKIP_AUTH=true unless every entry in +# CORS_ORIGINS is a localhost URL (localhost, 127.0.0.1, ::1, +# 0.0.0.0). This is the allowlist that keeps the bypass off +# deployed environments. +# - A CI workflow (.github/workflows/skip-auth-guard.yml) refuses any +# PR that puts SKIP_AUTH=true into Dockerfiles, CDK, scripts, or +# other workflows. +# - Inference-api is NOT bypassed — SPA traffic flows through app-api +# so a single chokepoint suffices. +# DO NOT enable in any deployed environment. +SKIP_AUTH= + +# Roles to assign the fake user (OPTIONAL — only read when SKIP_AUTH=true) +# Purpose: Comma-separated list of roles for the bypass user. 
Drives +# RBAC filtering (model visibility, admin endpoints, etc.). +# Default: admin +# Example: DotNetDevelopers,QA +SKIP_AUTH_ROLES=admin + +# User ID to assign the fake user (OPTIONAL — only read when SKIP_AUTH=true) +# Default: local-dev +SKIP_AUTH_USER_ID=local-dev + +# Email to assign the fake user (OPTIONAL — only read when SKIP_AUTH=true) +# Default: dev@local +SKIP_AUTH_EMAIL=dev@local + # Enable quota enforcement (OPTIONAL) # Purpose: Enforce user quota limits on chat requests # If true (default), checks user quota before processing each request diff --git a/backend/src/agents/builtin_tools/__init__.py b/backend/src/agents/builtin_tools/__init__.py index 5c0df74f..3c7ea2ed 100644 --- a/backend/src/agents/builtin_tools/__init__.py +++ b/backend/src/agents/builtin_tools/__init__.py @@ -2,10 +2,15 @@ This package contains tools that leverage AWS Bedrock capabilities: - Code Interpreter: Execute Python code for diagrams and charts +- Spreadsheet Analysis: Analyze tabular data via Code Interpreter (factory-produced, not in registry) """ from .code_interpreter_diagram_tool import generate_diagram_and_validate +from .spreadsheet_analysis import make_list_spreadsheets_tool, make_analyze_tool +# Only static tools go in __all__ (registered in ToolRegistry at startup). +# Factory-produced tools (make_list_spreadsheets_tool, make_analyze_tool) are created +# per-request with context and injected via extra_tools — not registered here. __all__ = [ 'generate_diagram_and_validate', ] diff --git a/backend/src/agents/builtin_tools/spreadsheet_analysis/__init__.py b/backend/src/agents/builtin_tools/spreadsheet_analysis/__init__.py new file mode 100644 index 00000000..2e380954 --- /dev/null +++ b/backend/src/agents/builtin_tools/spreadsheet_analysis/__init__.py @@ -0,0 +1,10 @@ +"""Spreadsheet analysis tools for Code Interpreter integration. 
+ +Provides tools that enable the agent to list and analyze tabular data files +from assistant knowledge bases and chat attachments using Code Interpreter. +""" + +from .list_spreadsheets_tool import make_list_spreadsheets_tool +from .analyze_tool import make_analyze_tool + +__all__ = ["make_list_spreadsheets_tool", "make_analyze_tool"] diff --git a/backend/src/agents/builtin_tools/spreadsheet_analysis/analyze_tool.py b/backend/src/agents/builtin_tools/spreadsheet_analysis/analyze_tool.py new file mode 100644 index 00000000..e50f9414 --- /dev/null +++ b/backend/src/agents/builtin_tools/spreadsheet_analysis/analyze_tool.py @@ -0,0 +1,606 @@ +"""Analyze spreadsheet files using Code Interpreter. + +Factory function creates a context-bound tool that downloads tabular files +from S3, pushes them to Code Interpreter, and executes Python code for analysis. +""" + +import logging +import os +import re +from typing import Any, Dict, Optional + +import boto3 +from strands import tool + +from .list_spreadsheets_tool import _get_kb_files, _get_session_files + +logger = logging.getLogger(__name__) + +MAX_OUTPUT_CHARS = 10000 # ~2500 tokens — safe margin under context limits +MAX_ERROR_CHARS = 600 # cleaned traceback budget — full pandas tracebacks are noise + +_SCHEMA_MARKER = "[__SCHEMA__]" +_SHEETS_MARKER = "[__SHEETS__]" + + +def _truncate_output(text: str) -> str: + """Truncate tool output to prevent blowing the LLM context window.""" + if not text or len(text) <= MAX_OUTPUT_CHARS: + return text + return text[:MAX_OUTPUT_CHARS] + f"\n\n... (output truncated — {len(text):,} chars total, showing first {MAX_OUTPUT_CHARS:,})" + + +def _strip_first_row(schema: str) -> str: + """Drop the ``first_row: ...`` line from a schema footer. + + On the happy path the first-row preview helps the model write correct + code. On the error path the model already has the load line and column + list — the full row dump is ~30 fields of noise. This trims it. 
+ """ + return "\n".join( + line for line in schema.splitlines() + if not line.startswith("first_row:") + ) + + +# --------------------------------------------------------------------------- +# Stderr cleaning +# --------------------------------------------------------------------------- + +# Frames we never want to show the LLM — they're pandas/numpy internals with +# zero signal for fixing the user's code. +_INTERNAL_FRAME_MARKERS = ( + "site-packages/pandas/", + "site-packages/numpy/", + "pandas/_libs/", + "pandas/core/", + "pandas/io/", +) + + +def _clean_stderr(stderr: str) -> str: + """Strip pandas internal frames and dtype warnings from a traceback. + + Keeps the user-code frame (the `/tmp/ipykernel_*.py` line they wrote) and + the final exception line. Falls back to a truncated raw stderr if the + traceback doesn't match the expected shape. + """ + if not stderr: + return "Unknown error" + + lines = stderr.splitlines() + + # 1. Drop DtypeWarning noise (spans 2 lines: the warning + the call-site). + filtered: list[str] = [] + skip_next = False + for line in lines: + if skip_next: + skip_next = False + continue + if "DtypeWarning:" in line or "FutureWarning:" in line or "UserWarning:" in line: + skip_next = True # next line is usually the offending code snippet + continue + filtered.append(line) + + # 2. Find the final exception line (e.g. "KeyError: 'NET_AMOUNT'"). + final_exception = "" + for line in reversed(filtered): + stripped = line.strip() + if not stripped: + continue + # Exception lines are left-flush and match "ExceptionName: message". + if not line.startswith((" ", "\t")) and re.match(r"^[A-Z][A-Za-z]*(?:Error|Exception|Warning):", stripped): + final_exception = stripped + break + + # 3. Find the user-code frame (ipykernel tempfile, not site-packages). 
+ user_frame_lines: list[str] = [] + for i, line in enumerate(filtered): + stripped = line.strip() + if not stripped.startswith("File "): + continue + if any(m in stripped for m in _INTERNAL_FRAME_MARKERS): + continue + # Keep this frame + up to the next 2 lines (the code snippet + pointer). + user_frame_lines.append(stripped) + for j in range(i + 1, min(i + 3, len(filtered))): + nxt = filtered[j].strip() + if not nxt or nxt.startswith("File "): + break + user_frame_lines.append(nxt) + break + + if user_frame_lines and final_exception: + cleaned = "\n".join(user_frame_lines) + "\n" + final_exception + elif final_exception: + cleaned = final_exception + else: + # Unrecognized shape — return a short tail rather than a 3K dump. + cleaned = "\n".join(filtered[-8:]).strip() + + if len(cleaned) > MAX_ERROR_CHARS: + cleaned = cleaned[:MAX_ERROR_CHARS] + " ..." + return cleaned + + +# --------------------------------------------------------------------------- +# Schema preview probe +# --------------------------------------------------------------------------- + + +def _build_preview_code(csv_filename: str) -> str: + """Return Python code that prints a compact schema snapshot for csv_filename. + + Runs a bounded skiprows probe (0..8) to handle report-style exports with + leading metadata rows. Picks the skiprows value that produces the cleanest + header — no ``Unnamed:`` columns, no duplicates, non-empty names — and + emits a ready-to-use ``pd.read_csv(...)`` invocation when the best + candidate is meaningfully better than skiprows=0. Otherwise it reports the + columns at skiprows=0 and lets the model decide. + + Output is bracketed with _SCHEMA_MARKER so it can be reliably extracted + from the interpreter's stdout stream even if user code prints other things. + """ + return f""" +import warnings, pandas as pd +warnings.filterwarnings('ignore') + +def _score(cols): + # Higher is better. Punishes Unnamed columns and duplicates. 
+ if not cols: + return -10_000 + unnamed = sum(1 for c in cols if str(c).startswith('Unnamed:')) + named = len(cols) - unnamed + dup_penalty = (len(cols) - len(set(cols))) * 20 + blank_penalty = sum(1 for c in cols if not str(c).strip()) * 10 + return named - (unnamed * 5) - dup_penalty - blank_penalty + +try: + with open({csv_filename!r}, 'r') as _fh: + _total_rows = sum(1 for _ in _fh) + + # Score skiprows=0..8, keep the winner and remember the baseline. + _baseline_score, _baseline_cols = -float('inf'), [] + _best_skip, _best_score, _best_cols = 0, -float('inf'), [] + for _sk in range(9): + try: + _cols = pd.read_csv({csv_filename!r}, nrows=0, skiprows=_sk, low_memory=False).columns.tolist() + except Exception: + continue + _sc = _score(_cols) + if _sk == 0: + _baseline_score, _baseline_cols = _sc, _cols + if _sc > _best_score: + _best_skip, _best_score, _best_cols = _sk, _sc, _cols + + # Confidence gate: only prescribe a non-zero skiprows when the winner + # actually fixes a header problem — either more named columns OR fewer + # Unnamed columns than the baseline — AND the winner is mostly clean. + # A score-delta threshold alone can't distinguish "found the real header" + # from "data row happens to parse cleanly", so we anchor on named/unnamed + # counts instead. 
+ def _named_unnamed(cols): + u = sum(1 for c in cols if str(c).startswith('Unnamed:')) + return len(cols) - u, u + _base_named, _base_unnamed = _named_unnamed(_baseline_cols) + _win_named, _win_unnamed = _named_unnamed(_best_cols) + _win_clean_ratio = _win_named / max(len(_best_cols), 1) + + _prescribe = ( + _best_skip > 0 + and _win_clean_ratio >= 0.7 + and (_win_named > _base_named or _win_unnamed < _base_unnamed) + ) + + if _prescribe: + _report_skip, _report_cols = _best_skip, _best_cols + else: + _report_skip, _report_cols = 0, _baseline_cols + + _data_rows = max(_total_rows - 1 - _report_skip, 0) + _col_preview = ', '.join(str(c) for c in _report_cols[:20]) + if len(_report_cols) > 20: + _col_preview += f' ... (+{{len(_report_cols) - 20}} more)' + + try: + _head = pd.read_csv({csv_filename!r}, skiprows=_report_skip, nrows=1, low_memory=False) + _first_row = _head.iloc[0].to_dict() if len(_head) else {{}} + _first_row = {{k: (str(v)[:40] + '...' if len(str(v)) > 40 else v) for k, v in _first_row.items()}} + except Exception: + _first_row = {{}} + + if _prescribe: + _load = f"pd.read_csv({csv_filename!r}, skiprows={{_report_skip}}, low_memory=False)" + _note = f" # {{_report_skip}} metadata row(s) detected before header" + else: + _load = f"pd.read_csv({csv_filename!r}, low_memory=False)" + _note = "" + + print({_SCHEMA_MARKER!r}) + print(f'file: {csv_filename} ({{_data_rows}} rows x {{len(_report_cols)}} cols)') + print(f'load: {{_load}}{{_note}}') + print(f'columns: {{_col_preview}}') + print(f'first_row: {{_first_row}}') + # If confidence was low, flag it so the model knows to verify. 
+ if not _prescribe and _win_unnamed > 0 and _win_unnamed < len(_best_cols): + print(f'note: header may need adjustment (skiprows=0 has {{_base_unnamed}}/{{len(_baseline_cols)}} unnamed columns); inspect head() if unsure') + print({_SCHEMA_MARKER!r}) +except Exception as _e: + print({_SCHEMA_MARKER!r}) + print(f'schema preview unavailable: {{_e}}') + print({_SCHEMA_MARKER!r}) +""" + + +def _extract_schema_preview(stdout: str) -> tuple[str, str]: + """Split stdout into (schema_block, remaining_stdout). + + The schema block is whatever is between _SCHEMA_MARKER pairs; if no markers + are found, returns ("", stdout). + """ + if _SCHEMA_MARKER not in stdout: + return "", stdout + parts = stdout.split(_SCHEMA_MARKER) + # parts = [before, schema, after, ...]; stitch back everything non-schema. + if len(parts) >= 3: + schema = parts[1].strip() + remaining = (parts[0] + _SCHEMA_MARKER.join(parts[2:])).strip("\n") + return schema, remaining + return "", stdout + + +def _get_code_interpreter_id() -> Optional[str]: + """Get Code Interpreter ID from environment or SSM.""" + ci_id = os.getenv("AGENTCORE_CODE_INTERPRETER_ID") + if ci_id: + return ci_id + try: + project_name = os.getenv("PROJECT_NAME", "strands-agent-chatbot") + environment = os.getenv("ENVIRONMENT", "dev") + region = os.getenv("AWS_REGION", "us-west-2") + ssm = boto3.client("ssm", region_name=region) + response = ssm.get_parameter(Name=f"/{project_name}/{environment}/agentcore/code-interpreter-id") + return response["Parameter"]["Value"] + except Exception: + return None + + +def make_analyze_tool( + assistant_id: Optional[str], + session_id: str, + user_id: str, +): + """Create an analyze_spreadsheet tool bound to the given context.""" + + @tool + def analyze_spreadsheet( + filename: str, + python_code: str, + output_filename: Optional[str] = None, + ) -> Dict[str, Any]: + """Analyze a spreadsheet file using Python code in Code Interpreter. 
+ + Downloads the specified file and loads it into a sandboxed Python + environment for analysis. Use pandas, numpy, matplotlib, and seaborn. + + ⚠️ CRITICAL — filename vs. in-sandbox path + ------------------------------------------- + The ``filename`` parameter names the **source** file (exactly as it + appears in the chat attachment or knowledge base, e.g. + ``"FY_27_Ledger.xlsx"``). + + In the sandbox, XLSX files are pre-converted to CSV: + ``FY_27_Ledger.xlsx`` → loadable as ``FY_27_Ledger.csv`` + + So ``python_code`` must read the CSV form, even for an XLSX source: + + filename: "FY_27_Ledger.xlsx" (source name) + python_code: pd.read_csv('FY_27_Ledger.csv', low_memory=False) + ^^^ .csv, not .xlsx + + CSV files keep their name unchanged in the sandbox. + + Handling leading metadata rows + ------------------------------ + Some exports have metadata rows above the real header. The tool + response always includes a schema footer with a ready-to-use + ``load:`` command that accounts for this — e.g. + ``pd.read_csv('file.csv', skiprows=3, low_memory=False)``. + **On any retry, use that exact load line verbatim** instead of + guessing ``skiprows``. + + Best for: aggregations, filtering, trends, comparisons, statistics, + charts. For simple factual lookups, use knowledge base search. + + Args: + filename: Source filename from list_spreadsheets results. Use + the original name (``.xlsx`` or ``.csv``), not the sandbox + form. + python_code: Python to execute. Load XLSX sources via + ``pd.read_csv('.csv', ...)``. Available libraries: + pandas, numpy, matplotlib, seaborn, openpyxl. + output_filename: Optional PNG filename if generating a chart. + Must end with ``.png``. Example: ``"chart.png"``. + + Returns: + Analysis results as text (with a schema footer), and optionally + a chart image. + """ + from bedrock_agentcore.tools.code_interpreter_client import CodeInterpreter + + # 1. 
Validate Code Interpreter is available + ci_id = _get_code_interpreter_id() + if not ci_id: + return {"content": [{"text": "❌ Code Interpreter is not configured. Contact your administrator."}], "status": "error"} + + # 2. Find the file in accessible sources + file_info = _find_file(filename, assistant_id, session_id) + if not file_info: + return {"content": [{"text": f"❌ File '{filename}' not found or not accessible. Use list_spreadsheets to see available files."}], "status": "error"} + + # 3. Download from S3 + try: + file_bytes = _download_file(file_info) + except Exception as e: + return {"content": [{"text": f"❌ Failed to download file: {e}"}], "status": "error"} + + # 4. Push file to Code Interpreter + content_type = file_info.get("content_type", "") + is_xlsx = "spreadsheetml" in content_type or filename.lower().endswith(".xlsx") + + region = os.getenv("AWS_REGION", "us-west-2") + code_interpreter = CodeInterpreter(region) + + try: + code_interpreter.start(identifier=ci_id) + + if is_xlsx: + # Push XLSX as base64, decode in sandbox. Only the first sheet + # is converted; if the workbook has multiple sheets we surface + # a warning so the model can tell the user rather than silently + # analyzing the wrong tab. 
+ import base64 + b64_content = base64.b64encode(file_bytes).decode("ascii") + csv_filename = os.path.splitext(filename)[0] + ".csv" + + code_interpreter.invoke("writeFiles", {"content": [ + {"path": "_encoded.b64", "text": b64_content}, + ]}) + bootstrap_code = f""" +import base64, io, csv +from openpyxl import load_workbook + +with open('_encoded.b64', 'r') as f: + raw = base64.b64decode(f.read()) + +wb = load_workbook(io.BytesIO(raw), read_only=True, data_only=True) +_active_sheet = wb.sheetnames[0] +ws = wb[_active_sheet] +with open('{csv_filename}', 'w', newline='') as out: + writer = csv.writer(out) + for row in ws.iter_rows(values_only=True): + if all(cell is None for cell in row): + continue + writer.writerow([str(cell) if cell is not None else '' for cell in row]) + +# Emit the sheet inventory so the caller can warn about multi-sheet workbooks. +print({_SHEETS_MARKER!r}) +print(f'active: {{_active_sheet}}') +print(f'all: {{wb.sheetnames}}') +print({_SHEETS_MARKER!r}) +wb.close() +""" + multi_sheet_note = "" + resp = code_interpreter.invoke("executeCode", {"code": bootstrap_code, "language": "python", "clearContext": False}) + bootstrap_stdout = "" + for event in resp.get("stream", []): + result = event.get("result", {}) + if result.get("isError", False): + error_msg = _clean_stderr(result.get("structuredContent", {}).get("stderr", "")) + return {"content": [{"text": f"❌ Failed to convert XLSX in sandbox:\n```\n{error_msg}\n```"}], "status": "error"} + bootstrap_stdout += result.get("structuredContent", {}).get("stdout", "") + + # Parse the sheet inventory emitted by the bootstrap. + if _SHEETS_MARKER in bootstrap_stdout: + try: + block = bootstrap_stdout.split(_SHEETS_MARKER)[1].strip() + active = "" + all_sheets: list[str] = [] + for line in block.splitlines(): + if line.startswith("active:"): + active = line.split(":", 1)[1].strip() + elif line.startswith("all:"): + # Parse the Python list literal ("['a', 'b']"). 
+ import ast + try: + all_sheets = ast.literal_eval(line.split(":", 1)[1].strip()) + except (ValueError, SyntaxError): + all_sheets = [] + if len(all_sheets) > 1: + others = [s for s in all_sheets if s != active] + multi_sheet_note = ( + f"⚠ Workbook has {len(all_sheets)} sheets; analyzing only '{active}'. " + f"Other sheets not loaded: {', '.join(others[:5])}" + + (f" (+{len(others) - 5} more)" if len(others) > 5 else "") + ) + except Exception as e: + logger.warning(f"Failed to parse XLSX sheet inventory: {e}") + else: + # CSV — push directly as text + csv_filename = filename if filename.lower().endswith(".csv") else os.path.splitext(filename)[0] + ".csv" + multi_sheet_note = "" + try: + csv_text = file_bytes.decode("utf-8") + except UnicodeDecodeError: + csv_text = file_bytes.decode("utf-8", errors="replace") + code_interpreter.invoke("writeFiles", {"content": [{"path": csv_filename, "text": csv_text}]}) + + # 5. Probe schema — separate exec so its output is isolated from user code. + schema_preview = "" + try: + preview_resp = code_interpreter.invoke("executeCode", { + "code": _build_preview_code(csv_filename), + "language": "python", + "clearContext": False, + }) + preview_stdout = "" + for event in preview_resp.get("stream", []): + result = event.get("result", {}) + if result.get("isError", False): + continue + preview_stdout += result.get("structuredContent", {}).get("stdout", "") + schema_preview, _ = _extract_schema_preview(preview_stdout) + except Exception as e: + logger.warning(f"Schema preview failed for {csv_filename}: {e}") + + # 6. 
Execute user code + response = code_interpreter.invoke("executeCode", { + "code": python_code, + "language": "python", + "clearContext": False, + }) + + execution_output = "" + for event in response.get("stream", []): + result = event.get("result", {}) + if result.get("isError", False): + error_msg = _clean_stderr(result.get("structuredContent", {}).get("stderr", "")) + error_text = f"❌ Code execution failed:\n```\n{error_msg}\n```" + + # Targeted hint for the most common wrong-filename error: + # the model wrote `pd.read_csv('FY_27_Ledger.xlsx', ...)` + # but in the sandbox the file lives as `FY_27_Ledger.csv` + # (see docstring: XLSX sources are pre-converted). Naming + # this out explicitly is much more effective than relying + # on the model to infer it from the schema footer. + if ( + is_xlsx + and "FileNotFoundError" in error_msg + and filename in error_msg + ): + error_text += ( + f"\n\n**Hint:** In the sandbox, the XLSX source " + f"`{filename}` is loaded as `{csv_filename}`. " + f"Retry with `pd.read_csv('{csv_filename}', " + f"low_memory=False)`." + ) + + if schema_preview: + # Drop the first_row dump on errors — the load line + + # column list is enough for the retry, first_row is + # ~1K tokens of bloat on a path that's already costing + # a round-trip. + trimmed_schema = _strip_first_row(schema_preview) + error_text += f"\n\nDataset info (use the `load:` line verbatim):\n```\n{trimmed_schema}\n```" + else: + error_text += f"\n\nTry: `pd.read_csv('{csv_filename}', low_memory=False)`" + if multi_sheet_note: + error_text += f"\n\n{multi_sheet_note}" + return {"content": [{"text": error_text}], "status": "error"} + stdout = result.get("structuredContent", {}).get("stdout", "") + if stdout: + execution_output += stdout + + # 7. Download chart if requested + success_text = _truncate_output(execution_output) or "✅ Code executed successfully (no output)." 
+ if schema_preview: + success_text = f"{success_text}\n\n---\nDataset: {schema_preview.splitlines()[0] if schema_preview else ''}" + if multi_sheet_note: + success_text = f"{success_text}\n{multi_sheet_note}" + + if output_filename and output_filename.endswith(".png"): + try: + dl_response = code_interpreter.invoke("readFiles", {"paths": [output_filename]}) + file_content = None + for event in dl_response.get("stream", []): + result = event.get("result", {}) + if "content" in result: + for block in result["content"]: + if "data" in block: + file_content = block["data"] + elif "resource" in block and "blob" in block["resource"]: + file_content = block["resource"]["blob"] + if file_content: + break + if file_content: + break + + if file_content: + return { + "content": [ + {"text": success_text}, + {"image": {"format": "png", "source": {"bytes": file_content}}}, + ], + "status": "success", + } + except Exception as e: + logger.warning(f"Failed to download chart {output_filename}: {e}") + + return { + "content": [{"text": success_text}], + "status": "success", + } + + finally: + try: + code_interpreter.stop() + except Exception: + pass + + return analyze_spreadsheet + + +def _find_file(filename: str, assistant_id: Optional[str], session_id: str) -> Optional[Dict[str, Any]]: + """Find a file by name in accessible sources. Returns file info or None. + + Matches are tolerant to XLSX ↔ CSV aliasing: if the model asks for + ``foo.csv`` but only ``foo.xlsx`` exists (because the sandbox converts + XLSX → CSV and the model copied the sandbox name into the ``filename`` + param on retry), we treat them as the same file. Prevents the common + round-trip loop where analyze_spreadsheet rejects a reasonable guess + and forces the model to call list_spreadsheets (#206). 
+ """ + candidates: list[Dict[str, Any]] = [] + if assistant_id: + candidates.extend(_get_kb_files(assistant_id)) + candidates.extend(_get_session_files(session_id)) + + target_lower = filename.lower() + target_stem, _ = os.path.splitext(target_lower) + + # First pass: exact match (case-insensitive). + for f in candidates: + if f["filename"].lower() == target_lower: + return f + + # Second pass: same stem, tabular extension. Covers foo.csv -> foo.xlsx + # and foo.xlsx -> foo.csv. Only applies to tabular files so we don't + # accidentally alias foo.pdf to foo.docx. + from apis.shared.files.models import is_tabular_file + + if target_stem and any(target_lower.endswith(ext) for ext in (".csv", ".xls", ".xlsx")): + for f in candidates: + cand_lower = f["filename"].lower() + cand_stem, _ = os.path.splitext(cand_lower) + if cand_stem == target_stem and is_tabular_file(f["filename"], f.get("content_type", "")): + return f + + return None + + +def _download_file(file_info: Dict[str, Any]) -> bytes: + """Download file bytes from S3.""" + region = os.environ.get("AWS_REGION", "us-west-2") + s3 = boto3.client("s3", region_name=region) + + if file_info["source"] == "knowledge_base": + bucket = os.environ.get("S3_ASSISTANTS_DOCUMENTS_BUCKET_NAME") + if not bucket: + raise ValueError("S3_ASSISTANTS_DOCUMENTS_BUCKET_NAME not configured") + else: + bucket = file_info.get("s3_bucket") + if not bucket: + raise ValueError("S3 bucket not found in file metadata") + + response = s3.get_object(Bucket=bucket, Key=file_info["s3_key"]) + return response["Body"].read() diff --git a/backend/src/agents/builtin_tools/spreadsheet_analysis/list_spreadsheets_tool.py b/backend/src/agents/builtin_tools/spreadsheet_analysis/list_spreadsheets_tool.py new file mode 100644 index 00000000..888e285e --- /dev/null +++ b/backend/src/agents/builtin_tools/spreadsheet_analysis/list_spreadsheets_tool.py @@ -0,0 +1,152 @@ +"""List available spreadsheet files for analysis. 
+ +Factory function creates a context-bound tool that only exposes CSV/XLSX +files belonging to the current assistant's knowledge base or chat session. +""" + +import logging +import os +from typing import Any, Dict, List, Optional + +import boto3 +from strands import tool + +from apis.shared.files.models import is_tabular_file + +logger = logging.getLogger(__name__) + + +def _is_tabular_file(filename: str, content_type: str) -> bool: + """Deprecated wrapper — use apis.shared.files.models.is_tabular_file. + + Kept as a module-local name so existing callers in this file stay + readable; shares the canonical implementation that the inference-api + route uses when partitioning chat attachments (#206). + """ + return is_tabular_file(filename, content_type) + + +def make_list_spreadsheets_tool( + assistant_id: Optional[str], + session_id: str, + user_id: str, +): + """Create a list_spreadsheets tool bound to the given context.""" + + @tool + def list_spreadsheets() -> Dict[str, Any]: + """List CSV/XLSX spreadsheet files available for analysis. + + Returns spreadsheets from the assistant's knowledge base (if a + conversation is scoped to an assistant) and/or files attached to + the current conversation. Use this to discover which files can be + analyzed with the analyze_spreadsheet tool. + + Returns: + Dictionary with 'files' list containing available spreadsheets, + each with filename, source, content_type, size_bytes, and document_id. + """ + files: List[Dict[str, Any]] = [] + + # 1. Assistant KB files + if assistant_id: + files.extend(_get_kb_files(assistant_id)) + + # 2. Session-attached files + files.extend(_get_session_files(session_id)) + + if not files: + return { + "content": [{"text": "No spreadsheet files (CSV or XLSX) are available. 
Upload a spreadsheet to the assistant's knowledge base or attach one to this conversation."}], + "status": "success", + } + + file_list = "\n".join( + f"- {f['filename']} ({f['source']}, {f['size_bytes'] / 1024:.0f} KB)" + for f in files + ) + return { + "content": [{"text": f"Available spreadsheet files:\n{file_list}"}], + "status": "success", + "files": files, + } + + return list_spreadsheets + + +def _get_kb_files(assistant_id: str) -> List[Dict[str, Any]]: + """Query DynamoDB for completed tabular documents in the assistant's KB.""" + table_name = os.environ.get("DYNAMODB_ASSISTANTS_TABLE_NAME") + if not table_name: + logger.warning("DYNAMODB_ASSISTANTS_TABLE_NAME not set, skipping KB files") + return [] + + try: + dynamodb = boto3.resource("dynamodb", region_name=os.environ.get("AWS_REGION", "us-west-2")) + table = dynamodb.Table(table_name) + + response = table.query( + KeyConditionExpression="PK = :pk AND begins_with(SK, :sk_prefix)", + ExpressionAttributeValues={":pk": f"AST#{assistant_id}", ":sk_prefix": "DOC#"}, + ) + + files = [] + for item in response.get("Items", []): + if item.get("status") != "complete": + continue + filename = item.get("filename", "") + content_type = item.get("contentType", item.get("content_type", "")) + if not _is_tabular_file(filename, content_type): + continue + files.append({ + "filename": filename, + "source": "knowledge_base", + "content_type": content_type, + "size_bytes": int(item.get("sizeBytes", item.get("size_bytes", 0))), + "document_id": item.get("documentId", item.get("document_id", "")), + "s3_key": item.get("s3Key", item.get("s3_key", "")), + }) + return files + + except Exception as e: + logger.error(f"Error querying KB files for assistant {assistant_id}: {e}") + return [] + + +def _get_session_files(session_id: str) -> List[Dict[str, Any]]: + """Query DynamoDB for tabular files attached to the current session.""" + try: + from apis.shared.files.repository import get_file_upload_repository + + repo = 
get_file_upload_repository() + + import asyncio + try: + loop = asyncio.get_event_loop() + if loop.is_running(): + import concurrent.futures + with concurrent.futures.ThreadPoolExecutor() as executor: + session_files = executor.submit(asyncio.run, repo.list_session_files(session_id)).result() + else: + session_files = loop.run_until_complete(repo.list_session_files(session_id)) + except RuntimeError: + session_files = asyncio.run(repo.list_session_files(session_id)) + + files = [] + for f in session_files: + if not _is_tabular_file(f.filename, f.mime_type): + continue + files.append({ + "filename": f.filename, + "source": "chat_attachment", + "content_type": f.mime_type, + "size_bytes": f.size_bytes, + "document_id": f.upload_id, + "s3_key": f.s3_key, + "s3_bucket": f.s3_bucket, + }) + return files + + except Exception as e: + logger.error(f"Error querying session files for {session_id}: {e}") + return [] diff --git a/backend/src/agents/main_agent/base_agent.py b/backend/src/agents/main_agent/base_agent.py index 75c41b74..a34ea525 100644 --- a/backend/src/agents/main_agent/base_agent.py +++ b/backend/src/agents/main_agent/base_agent.py @@ -59,6 +59,7 @@ def __init__( max_tokens: Optional[int] = None, inference_params: Optional[Dict[str, Any]] = None, skip_persistence: bool = False, + extra_tools: Optional[List[Any]] = None, ): """ Initialize base agent with shared infrastructure. @@ -84,6 +85,7 @@ def __init__( self.user_id = user_id or session_id self.auth_token = auth_token self.enabled_tools = enabled_tools + self.extra_tools = extra_tools or [] self.agent = None # Merge legacy temperature/max_tokens into the canonical dict. 
Explicit @@ -415,6 +417,11 @@ async def _load_with_context(): logger.info(f"Added {len(external_clients)} external MCP clients to tools") + # Append context-bound tools (e.g., spreadsheet analysis) created per-request + if self.extra_tools: + local_tools.extend(self.extra_tools) + logger.info(f"Added {len(self.extra_tools)} extra context-bound tools") + return local_tools def get_model_config(self) -> dict: diff --git a/backend/src/agents/main_agent/core/model_config.py b/backend/src/agents/main_agent/core/model_config.py index 3b6dd31d..5c13b0b1 100644 --- a/backend/src/agents/main_agent/core/model_config.py +++ b/backend/src/agents/main_agent/core/model_config.py @@ -229,11 +229,12 @@ def to_bedrock_config(self) -> Dict[str, Any]: config: Dict[str, Any] = {"model_id": self.model_id} _apply_canonical_params(config, self.inference_params, _BEDROCK_PARAM_MAP, "bedrock") - # TODO: Re-enable once Bedrock supports cachePoint blocks alongside - # non-PDF document blocks (.md, .docx, etc.). Currently causes: - # ValidationException: messages.N.content.M.type: Field required - # because Bedrock can't translate cachePoint after document blocks - # to the Anthropic format. + # Bedrock prompt caching is intentionally deferred. The previous SDK + # blocker — strands PR #1438, which fixed `cachePoint` blocks landing + # alongside non-PDF document attachments — is resolved in + # strands-agents 1.39.0, so the technical barrier is gone. Re-enabling + # is being held for a separate, scoped rollout (cost/badge impact is + # user-visible the moment caching turns on). 
# See: https://github.com/strands-agents/sdk-python/pull/1438 # if self.caching_enabled: # from strands.models import CacheConfig diff --git a/backend/src/agents/main_agent/core/system_prompt_builder.py b/backend/src/agents/main_agent/core/system_prompt_builder.py index 7fce7daf..a195859a 100644 --- a/backend/src/agents/main_agent/core/system_prompt_builder.py +++ b/backend/src/agents/main_agent/core/system_prompt_builder.py @@ -51,6 +51,64 @@ - Always explain your reasoning when using tools - If you don't have the right tool for a task, clearly inform the user about the limitation +HANDLING MISSING TOOLS: +Users can toggle individual tools on and off from the Tools section of the +model settings panel (the gear icon next to the message input). When a user +asks for something you would normally handle with a tool that isn't currently +available to you, don't just say "I can't do that." Instead: + +1. Identify which capability they're asking for in plain language + (e.g. "spreadsheet analysis", "web browsing", "Python execution", + "knowledge base search"). +2. Tell them that capability isn't active in the current session and suggest + they enable the matching tool from the Tools panel in settings, then retry + the request. +3. If you can offer a partial answer without the tool (e.g. explaining a + formula they could run themselves), do that as a fallback — but lead with + the tool suggestion so they know the better path exists. 
+ +Common user intents and the tools to point at: +- Analyzing spreadsheet/CSV data, aggregations, totals, trends → "Spreadsheet Analysis" +- Listing files attached to the conversation or assistant → "List Spreadsheet Files" +- Running Python code, generating charts or diagrams from data → "Code Interpreter" +- Live web searches, news, current events → the web search tools +- Fetching a specific URL's contents → the URL fetch tool +- Questions answerable from the assistant's knowledge base → the knowledge base search tool + +Example response when spreadsheet analysis is disabled and a user asks for a +column total: + +> I can compute that for you, but the Spreadsheet Analysis tool isn't +> currently enabled for this conversation. Open the settings panel (gear +> icon next to the message input), enable "Spreadsheet Analysis" under +> Tools, and send the request again — I'll run the aggregation directly +> on the file. Alternatively, you can open the file in Excel and use +> `=SUM(NET_AMOUNT)` on the column. + +SPREADSHEET ANALYSIS — DISAMBIGUATION: +When more than one spreadsheet is attached (including the assistant's +knowledge base plus any chat attachments), do not silently pick one for +`analyze_spreadsheet`. The turn preamble will list every available tabular +file when multiple exist. Use that list to decide: + +1. If the user named a specific file (or the reference is unambiguous from + the query), analyze that file and state which one in your response: + "Analyzing `X.xlsx`: …" +2. If the user's request could reasonably span multiple files (e.g. "total + X across the ledgers"), either run `analyze_spreadsheet` on each file + and combine the results, or explain the approach and ask the user which + files to include. +3. If the reference is ambiguous, ask the user which file they mean + rather than guessing from RAG chunk ordering. + +Always name the file(s) you analyzed in the final response so the user can +audit the choice. 
Example: + +> Analyzed `FY_27_Ledger.xlsx` — the total NET_AMOUNT is $20,419,308.89 +> across 18,551 transactions. Note: `FY_27_Ledger(_11).xlsx` is also +> attached but was not included in this total. Let me know if you'd like +> a combined figure. + Your goal is to be helpful, accurate, and efficient in completing user requests using the available tools.""" diff --git a/backend/src/agents/main_agent/streaming/stream_coordinator.py b/backend/src/agents/main_agent/streaming/stream_coordinator.py index 01a4b667..043bc960 100644 --- a/backend/src/agents/main_agent/streaming/stream_coordinator.py +++ b/backend/src/agents/main_agent/streaming/stream_coordinator.py @@ -180,13 +180,26 @@ async def stream_response( if "metrics" in event_data: accumulated_metadata["metrics"].update(event_data["metrics"]) - # Collect metadata_summary event (don't send to client as-is) + # Collect metadata_summary event (don't send to client as-is). + # + # NOTE: metadata_summary carries Strands' EventLoopMetrics + # `accumulated_usage`, which sums each LLM call's full + # context-size across the turn (and across the agent's + # whole lifetime, per Strands' docs). For a 2-call tool + # turn with call_1.input=1000 and call_2.input=2500, + # accumulated_usage.inputTokens=3500 — but the *current* + # context occupancy is 2500, not 3500. We deliberately do + # NOT update accumulated_metadata["usage"] / ["metrics"] + # from this event: stream_coordinator's accumulated_metadata + # drives (a) the final SSE `usage` the frontend uses for + # the context-% badge and (b) the compaction trigger — + # both want "current context size", which the per-call + # `metadata` events already provide via last-write-wins + # `.update()`. Per-message cost attribution rides + # per_message_metadata (per-call) and is unaffected. + # We only keep the first_token_time backstop. 
if event.get("type") == "metadata_summary": event_data = event.get("data", {}) - if "usage" in event_data: - accumulated_metadata["usage"].update(event_data["usage"]) - if "metrics" in event_data: - accumulated_metadata["metrics"].update(event_data["metrics"]) if "first_token_time" in event_data: first_token_time = event_data["first_token_time"] # Associate first_token_time with first assistant message if we have one @@ -327,6 +340,16 @@ async def stream_response( # checkpoint advanced on this turn, emit a `compaction` SSE # so the frontend can place an inline "earlier messages # summarized" divider. Fires after metadata, before done. + # + # CAUTION: do NOT replace this with Strands' + # AgentResult.context_size / EventLoopMetrics.latest_context_size. + # Both return ONLY `inputTokens` from the last cycle — + # under Bedrock prompt caching that's the uncached + # suffix only, so a 50k-token fully-cached context + # reports ~50 (inputTokens) and hides ~49,950 in + # cacheReadInputTokens. Summing all three buckets + # below is the only correct "current context size" + # under caching. if hasattr(session_manager, "update_after_turn"): usage = accumulated_metadata.get("usage", {}) total_input_tokens = ( @@ -1165,27 +1188,25 @@ async def _store_message_metadata( end_to_end_latency_ms = int((stream_end_time - stream_start_time) * 1000) logger.info(f"📊 Calculated E2E latency: {end_to_end_latency_ms}ms") - # Get time to first token - # PRIORITY 1: Use provider's timeToFirstByteMs if available (most accurate) + # Get time to first token. We persist `None` (not 0) when the + # provider didn't emit `timeToFirstByteMs` and we couldn't + # measure it locally — a real TTFT can never be 0ms, and any + # downstream aggregation (averages, percentiles) needs to + # distinguish "not measured" from a real value to avoid + # pulling stats toward zero. 
if accumulated_metadata.get("metrics", {}).get("timeToFirstByteMs"): time_to_first_token_ms = int(accumulated_metadata["metrics"]["timeToFirstByteMs"]) logger.info(f"📊 Using provider timeToFirstByteMs: {time_to_first_token_ms}ms") - # PRIORITY 2: Estimate TTFT as a portion of latency if we don't have it - # This is a rough estimate but better than 0 or None - # For most LLM calls, TTFT is typically 20-40% of total latency - elif end_to_end_latency_ms and end_to_end_latency_ms > 100: - # If E2E latency is available and substantial, estimate TTFT - # We don't have actual TTFT so we can't store it accurately - # Instead, log that we're missing it - logger.info(f"📊 No TTFT available - provider did not send timeToFirstByteMs for this message") - # Still create latency metrics with just E2E, using a placeholder of 0 for TTFT - # This is better than losing all latency data - time_to_first_token_ms = 0 # Indicates "not measured" - - # Create latency metrics if we have at least E2E latency + else: + logger.info("📊 No TTFT available - provider did not send timeToFirstByteMs for this message") + + # Create latency metrics if we have at least E2E latency. + # `time_to_first_token_ms` may be None — LatencyMetrics.time_to_first_token + # is Optional, so this serializes as JSON null. 
if end_to_end_latency_ms is not None: latency_metrics = LatencyMetrics( - time_to_first_token=time_to_first_token_ms if time_to_first_token_ms is not None else 0, end_to_end_latency=end_to_end_latency_ms + time_to_first_token=time_to_first_token_ms, + end_to_end_latency=end_to_end_latency_ms, ) logger.info(f"📊 Created LatencyMetrics: TTFT={time_to_first_token_ms}ms, E2E={end_to_end_latency_ms}ms") else: diff --git a/backend/src/agents/main_agent/streaming/stream_processor.py b/backend/src/agents/main_agent/streaming/stream_processor.py index f2501713..3960cb0c 100644 --- a/backend/src/agents/main_agent/streaming/stream_processor.py +++ b/backend/src/agents/main_agent/streaming/stream_processor.py @@ -300,12 +300,13 @@ def _handle_completion_events(event: RawEvent) -> Tuple[List[ProcessedEvent], bo if event.get("force_stop", False): reason = event.get("force_stop_reason", "unknown reason") - # Create structured error event + error_message, recoverable = _format_force_stop_message(reason) + error_event = StreamErrorEvent( - error=f"Agent force-stopped: {reason}", + error=error_message, code=ErrorCode.AGENT_ERROR, detail=str(reason) if reason != "unknown reason" else None, - recoverable=False + recoverable=recoverable, ) # Convert to event dict format events.append(_create_event("error", error_event.model_dump(exclude_none=True))) @@ -314,6 +315,66 @@ def _handle_completion_events(event: RawEvent) -> Tuple[List[ProcessedEvent], bo return events, should_break +def _format_force_stop_message(reason: Any) -> tuple[str, bool]: + """Translate raw Bedrock errors in ``force_stop_reason`` into user-facing + markdown. Returns (message, recoverable). + + We detect a small handful of high-signal patterns (document size limit, + throttling, access denied) and surface actionable guidance. Anything + else falls through to a generic "Agent force-stopped" with the raw + error for transparency. 
+ """ + reason_str = str(reason or "") + reason_lower = reason_str.lower() + + # Bedrock ConverseStream rejects document content blocks over ~4.5 MB + # internal size. Triggered most often by XLSX files that inflate + # significantly during parsing. The fix-forward is to route tabular + # files through the spreadsheet analysis tools, but a turn with a + # non-tabular file this large (or history that accumulated past the + # limit) still needs a friendlier message than the raw AWS error. See + # issue #206. + if "maximum document size" in reason_lower or ( + "validationexception" in reason_lower and "document" in reason_lower + ): + return ( + "⚠️ One of the attached files is too large for the model to read " + "directly.\n\n" + "Bedrock limits inline documents to 4.5 MB of internal content, " + "and spreadsheets (especially XLSX) often expand past that when " + "parsed. To work with large data files, enable **Spreadsheet " + "Analysis** in the Tools section of the settings panel (gear " + "icon next to the message input) and re-attach the file — the " + "tool runs pandas on the real file and isn't limited by the " + "document size cap.\n\n" + "For large PDFs or docs, split them into smaller sections or " + "convert to plain text.", + True, + ) + + if "throttl" in reason_lower or "too many requests" in reason_lower: + return ( + "⚠️ The model is receiving too many requests right now.\n\n" + "> " + reason_str + "\n\n" + "Please wait a moment and try again.", + True, + ) + + if "accessdenied" in reason_lower or "access denied" in reason_lower: + return ( + "⚠️ I don't have access to complete this request.\n\n" + "> " + reason_str + "\n\n" + "Try a different model, or check that the required permissions " + "are configured for this workspace.", + False, + ) + + # Default: preserve prior behavior so operators can still read the raw + # error in logs and the UI. Unrecognized failures stay non-recoverable, + # as they were before this helper existed.
+ return (f"Agent force-stopped: {reason_str}", False) + + def _handle_content_block_events(event: RawEvent, current_block_index: Dict[str, int]) -> List[ProcessedEvent]: """Process content block events from the agent stream. @@ -1031,7 +1092,16 @@ def _extract_metrics_data(metrics_obj: Any) -> Dict[str, Any]: metadata_data["metrics"] = metrics_data if metadata_data: - metadata_event = _create_event("metadata", metadata_data) + # Emit on the turn-summary track, NOT the per-message track. + # `result.metrics.accumulated_usage` is summed across every + # LLM call in the turn (Strands' EventLoopMetrics). If we + # emitted this as a `metadata` event, the stream coordinator + # would route it into per_message_metadata[last] and clobber + # that message's per-call usage — pricing each entry would + # then double-count earlier messages' input tokens. The + # main loop also accumulates `metadata_summary` events + # (see below), so the final summary stays cumulative. + metadata_event = _create_event("metadata_summary", metadata_data) events.append(metadata_event) # Check for metadata in nested event structure (like content blocks) @@ -1250,8 +1320,14 @@ async def mock_stream(): # NOTE: timeToFirstByteMs from provider will be stored in metrics # and the coordinator can use it as a fallback if first_token_time is not set for processed_event in _handle_metadata_events(event): - # Accumulate metadata for summary - if processed_event.get("type") == "metadata": + # Accumulate metadata for the final summary. Both per-call + # `metadata` events (each LLM call's usage) and the + # turn-cumulative `metadata_summary` event (extracted from + # AgentResult.metrics.accumulated_usage) feed this dict — + # the cumulative event arrives last and `update()` makes it + # last-write-wins, so accumulated_metadata ends the turn + # carrying true totals. 
+ if processed_event.get("type") in ("metadata", "metadata_summary"): event_data = processed_event.get("data", {}) if "usage" in event_data: accumulated_metadata["usage"].update(event_data["usage"]) @@ -1275,8 +1351,11 @@ async def mock_stream(): # Check one more time for metadata in case result came with complete metadata_events_after_complete = _handle_metadata_events(event) for processed_event in metadata_events_after_complete: - # Accumulate metadata for summary - if processed_event.get("type") == "metadata": + # Accumulate metadata for summary — see note on the main + # loop's accumulator above. Both `metadata` (per-call) + # and `metadata_summary` (turn-cumulative from result) + # feed accumulated_metadata so the final emit is total. + if processed_event.get("type") in ("metadata", "metadata_summary"): event_data = processed_event.get("data", {}) if "usage" in event_data: accumulated_metadata["usage"].update(event_data["usage"]) diff --git a/backend/src/agents/main_agent/tools/tool_catalog.py b/backend/src/agents/main_agent/tools/tool_catalog.py index c65344a8..587a1387 100644 --- a/backend/src/agents/main_agent/tools/tool_catalog.py +++ b/backend/src/agents/main_agent/tools/tool_catalog.py @@ -84,6 +84,22 @@ def to_dict(self) -> dict: icon="code-bracket", ), + # --- Built-in Tools (Spreadsheet Analysis) --- + "list_spreadsheets": ToolMetadata( + tool_id="list_spreadsheets", + name="List Spreadsheet Files", + description="List spreadsheet files available for analysis from the assistant's knowledge base or conversation attachments.", + category=ToolCategory.DATA, + icon="folder-open", + ), + "analyze_spreadsheet": ToolMetadata( + tool_id="analyze_spreadsheet", + name="Spreadsheet Analysis", + description="Analyze spreadsheet data using Python code. Use for aggregations, comparisons, trends, filtering, and chart generation. 
For simple factual lookups, use the knowledge base search instead.", + category=ToolCategory.DATA, + icon="table-cells", + ), + # --- Gateway/MCP Tools --- # These are loaded dynamically from the gateway but we define metadata here # for the admin UI. Actual tool availability depends on gateway configuration. diff --git a/backend/src/apis/app_api/chat/routes.py b/backend/src/apis/app_api/chat/routes.py index 9864f8ff..3e2ed6c1 100644 --- a/backend/src/apis/app_api/chat/routes.py +++ b/backend/src/apis/app_api/chat/routes.py @@ -376,11 +376,22 @@ async def chat_agent_stream(request: ChatRequest, current_user: User = Depends(g try: # Get agent instance (with or without tool filtering) # Use assistant's system prompt if provided + + # Create context-bound spreadsheet analysis tools if enabled + from apis.inference_api.chat.routes import _build_spreadsheet_tools + extra_tools = _build_spreadsheet_tools( + enabled_tools=authorized_tools, + assistant_id=assistant_id_to_use, + session_id=request.session_id, + user_id=user_id, + ) + agent = await get_agent( session_id=request.session_id, user_id=user_id, enabled_tools=authorized_tools, # Filtered by RBAC (may be None for all allowed) system_prompt=system_prompt, # Assistant instructions if assistant is attached + extra_tools=extra_tools, ) # Wrap stream to ensure flush on disconnect and prevent further processing diff --git a/backend/src/apis/app_api/files/routes.py b/backend/src/apis/app_api/files/routes.py index 44119dc7..ef115b97 100644 --- a/backend/src/apis/app_api/files/routes.py +++ b/backend/src/apis/app_api/files/routes.py @@ -16,6 +16,9 @@ PresignRequest, PresignResponse, CompleteUploadResponse, + PreviewUrlResponse, + TextSnippetResponse, + ThumbnailResponse, FileListResponse, QuotaResponse, QuotaExceededError as QuotaExceededModel, @@ -30,6 +33,7 @@ FileNotFoundError, FileUploadError, ) +from .thumbnails import ThumbnailRenderError, ThumbnailUnsupportedError logger = logging.getLogger(__name__) @@ -136,6 
+140,91 @@ async def complete_upload( ) +@router.get("/{upload_id}/preview-url", response_model=PreviewUrlResponse) +async def get_preview_url( + upload_id: str, + user: User = Depends(get_current_user_from_session), + service: FileUploadService = Depends(get_file_upload_service), +): + """ + Generate a short-lived presigned GET URL for an uploaded file. + + Used by the UI to render image previews inline and to open files in a + lightbox. The URL is scoped to the file owner and expires after a few + minutes. + """ + try: + return await service.get_preview_url(user.user_id, upload_id) + except FileNotFoundError: + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail=f"File {upload_id} not found or not owned by you", + ) + + +@router.get("/{upload_id}/text-snippet", response_model=TextSnippetResponse) +async def get_text_snippet( + upload_id: str, + user: User = Depends(get_current_user_from_session), + service: FileUploadService = Depends(get_file_upload_service), +): + """ + Return a short UTF-8 text excerpt from the start of a file. + + Used by the UI to render a content peek inside the document-style + attachment card for text-based files (txt, md, csv, html). Returns an + empty snippet for non-text MIME types so the UI can fall back to a + skeleton mockup. + """ + try: + return await service.get_text_snippet(user.user_id, upload_id) + except FileNotFoundError: + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail=f"File {upload_id} not found or not owned by you", + ) + + +@router.get("/{upload_id}/thumbnail", response_model=ThumbnailResponse) +async def get_thumbnail( + upload_id: str, + user: User = Depends(get_current_user_from_session), + service: FileUploadService = Depends(get_file_upload_service), +): + """ + Return a presigned URL for a PNG thumbnail of the file's first page. + + Lazy-renders on first request and caches the resulting `_thumb.png` + sibling object next to the original. 
Subsequent calls hit the cache and + return immediately. + + Status codes: + - 200: Thumbnail available (response body indicates `cached`). + - 404: File not found or not owned by the caller. + - 415: MIME type has no thumbnail renderer (UI should fall back to its + skeleton card). + - 422: File present but unrenderable (corrupt, encrypted, empty PDF, ...). + """ + try: + return await service.get_or_create_thumbnail(user.user_id, upload_id) + except FileNotFoundError: + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail=f"File {upload_id} not found or not owned by you", + ) + except ThumbnailUnsupportedError as e: + raise HTTPException( + status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE, + detail=str(e), + ) + except ThumbnailRenderError as e: + logger.warning(f"Thumbnail render failed for {upload_id}: {e}") + raise HTTPException( + status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, + detail="Could not render a thumbnail for this file", + ) + + # ============================================================================= # File Management Endpoints # ============================================================================= diff --git a/backend/src/apis/app_api/files/service.py b/backend/src/apis/app_api/files/service.py index eeec0935..b837d26c 100644 --- a/backend/src/apis/app_api/files/service.py +++ b/backend/src/apis/app_api/files/service.py @@ -4,6 +4,7 @@ Business logic for file uploads with S3 pre-signed URLs and quota management. 
""" +import asyncio import os import logging import uuid @@ -20,12 +21,34 @@ PresignRequest, PresignResponse, CompleteUploadResponse, + PreviewUrlResponse, + TextSnippetResponse, + ThumbnailResponse, + THUMBNAIL_SUPPORTED_MIME_TYPES, FileResponse, FileListResponse, QuotaResponse, is_allowed_mime_type, ALLOWED_MIME_TYPES, ) +from .thumbnails import ( + ThumbnailRenderer, + ThumbnailRenderError, + ThumbnailUnsupportedError, + get_thumbnail_renderer, +) + + +# MIME types that are safe to decode as UTF-8 text for inline previews. +TEXT_PREVIEW_MIME_TYPES = frozenset( + { + "text/plain", + "text/markdown", + "text/csv", + "text/html", + } +) +TEXT_SNIPPET_MAX_BYTES = 2048 from apis.shared.files.repository import FileUploadRepository, get_file_upload_repository logger = logging.getLogger(__name__) @@ -89,6 +112,10 @@ class FileUploadService: - File listing and deletion """ + # Sibling key, in the same per-upload S3 "folder" as the original. + # Stored alongside the original so cleanup happens with the file. 
+ THUMBNAIL_KEY_NAME = "_thumb.png" + def __init__( self, repository: Optional[FileUploadRepository] = None, @@ -97,9 +124,11 @@ def __init__( max_file_size: Optional[int] = None, max_files_per_message: Optional[int] = None, user_quota_bytes: Optional[int] = None, + thumbnail_renderer: Optional[ThumbnailRenderer] = None, ): """Initialize with dependencies.""" self.repository = repository or get_file_upload_repository() + self._thumbnail_renderer = thumbnail_renderer or get_thumbnail_renderer() # S3 configuration # Use region from AWS_REGION env var to ensure presigned URLs use regional endpoint @@ -133,6 +162,7 @@ def __init__( # Pre-signed URL expiration self.presign_expiration = 15 * 60 # 15 minutes + self.preview_url_expiration = 10 * 60 # 10 minutes for GET previews # ========================================================================= # Pre-signed URL Flow @@ -301,6 +331,284 @@ async def complete_upload( size_bytes=file_meta.size_bytes, ) + # ========================================================================= + # Preview URL + # ========================================================================= + + async def get_preview_url( + self, user_id: str, upload_id: str + ) -> PreviewUrlResponse: + """ + Generate a short-lived presigned GET URL for displaying a file. + + Used by the UI to render image thumbnails inline and to power the + full-size lightbox. Only owners can generate preview URLs for their + own files. Files must be in READY state. 
+ + Args: + user_id: The owner's user ID + upload_id: The upload identifier + + Returns: + PreviewUrlResponse with a presigned GET URL + + Raises: + FileNotFoundError: If file not found, not owned by user, or not ready + """ + file_meta = await self.repository.get_file(user_id, upload_id) + if not file_meta: + raise FileNotFoundError(f"File {upload_id} not found") + + if file_meta.status != FileStatus.READY: + raise FileNotFoundError( + f"File {upload_id} is not ready (status: {file_meta.status})" + ) + + try: + url = self._s3_client.generate_presigned_url( + "get_object", + Params={ + "Bucket": self.bucket_name, + "Key": file_meta.s3_key, + "ResponseContentType": file_meta.mime_type, + "ResponseContentDisposition": f'inline; filename="{file_meta.filename}"', + }, + ExpiresIn=self.preview_url_expiration, + ) + except ClientError as e: + logger.error(f"Failed to generate preview URL: {e}") + raise + + expires_at = ( + datetime.now(timezone.utc) + timedelta(seconds=self.preview_url_expiration) + ).isoformat().replace("+00:00", "Z") + + return PreviewUrlResponse( + upload_id=upload_id, + url=url, + expires_at=expires_at, + mime_type=file_meta.mime_type, + filename=file_meta.filename, + ) + + async def get_text_snippet( + self, user_id: str, upload_id: str + ) -> TextSnippetResponse: + """ + Return a short UTF-8 text excerpt from the start of a file. + + Used by the UI to render a content peek inside the document-style + attachment card. Only text-based MIME types are supported; other + types return an empty snippet so the UI can fall back to a skeleton.
+ + Args: + user_id: The owner's user ID + upload_id: The upload identifier + + Returns: + TextSnippetResponse with the decoded snippet (possibly empty) + + Raises: + FileNotFoundError: If file not found, not owned, or not ready + """ + file_meta = await self.repository.get_file(user_id, upload_id) + if not file_meta: + raise FileNotFoundError(f"File {upload_id} not found") + + if file_meta.status != FileStatus.READY: + raise FileNotFoundError( + f"File {upload_id} is not ready (status: {file_meta.status})" + ) + + if file_meta.mime_type not in TEXT_PREVIEW_MIME_TYPES: + return TextSnippetResponse( + upload_id=upload_id, + snippet="", + truncated=False, + mime_type=file_meta.mime_type, + ) + + # Read up to TEXT_SNIPPET_MAX_BYTES + 1 bytes so we can detect truncation. + try: + response = self._s3_client.get_object( + Bucket=self.bucket_name, + Key=file_meta.s3_key, + Range=f"bytes=0-{TEXT_SNIPPET_MAX_BYTES}", + ) + body = response["Body"].read() + except ClientError as e: + logger.warning(f"Failed to read snippet for {upload_id}: {e}") + return TextSnippetResponse( + upload_id=upload_id, + snippet="", + truncated=False, + mime_type=file_meta.mime_type, + ) + + truncated = file_meta.size_bytes > TEXT_SNIPPET_MAX_BYTES + excerpt = body[:TEXT_SNIPPET_MAX_BYTES] + try: + text = excerpt.decode("utf-8") + except UnicodeDecodeError: + # A multi-byte character may be cut at the byte limit; decode with replacement characters. + text = excerpt.decode("utf-8", errors="replace") + + return TextSnippetResponse( + upload_id=upload_id, + snippet=text, + truncated=truncated, + mime_type=file_meta.mime_type, + ) + + # ========================================================================= + # Thumbnails + # ========================================================================= + + def _thumbnail_s3_key(self, file_meta: FileMetadata) -> str: + """ + Derive the sibling thumbnail key for an original file.
+ + Originals live at ``user-files/{user}/{session}/{upload_id}/{filename}``, + thumbnails at ``user-files/{user}/{session}/{upload_id}/_thumb.png`` — + same parent prefix so cleanup paths can find both. + """ + base, _, _ = file_meta.s3_key.rpartition("/") + return f"{base}/{self.THUMBNAIL_KEY_NAME}" + + async def get_or_create_thumbnail( + self, user_id: str, upload_id: str + ) -> ThumbnailResponse: + """ + Return a presigned URL for a PNG thumbnail of the file's first page. + + Lazy-renders on first request: checks for a cached ``_thumb.png`` + sibling object in S3, generates one synchronously if missing, then + returns a short-lived presigned URL. Subsequent calls hit the cache. + + Args: + user_id: The owner's user ID. + upload_id: The upload identifier. + + Returns: + ThumbnailResponse with a presigned GET URL and ``cached`` flag. + + Raises: + FileNotFoundError: File not found, not owned, or not ready. + ThumbnailUnsupportedError: MIME type has no registered renderer. + ThumbnailRenderError: The file was unreadable / corrupt / encrypted. + """ + file_meta = await self.repository.get_file(user_id, upload_id) + if not file_meta: + raise FileNotFoundError(f"File {upload_id} not found") + + if file_meta.status != FileStatus.READY: + raise FileNotFoundError( + f"File {upload_id} is not ready (status: {file_meta.status})" + ) + + if file_meta.mime_type not in THUMBNAIL_SUPPORTED_MIME_TYPES: + raise ThumbnailUnsupportedError( + f"No thumbnail renderer for {file_meta.mime_type}" + ) + + thumb_key = self._thumbnail_s3_key(file_meta) + + # Cache check via HEAD — cheap, no body transfer. 
+ cached = True + try: + self._s3_client.head_object(Bucket=self.bucket_name, Key=thumb_key) + except ClientError as e: + if e.response["Error"]["Code"] in ("404", "NoSuchKey"): + cached = False + else: + logger.error(f"HEAD failed for thumbnail {thumb_key}: {e}") + raise + + if not cached: + await self._render_and_store_thumbnail(file_meta, thumb_key) + + # Generate the presigned URL for the thumbnail. Use the same + # expiration window as preview URLs so the UI's caching expectations + # line up. + try: + url = self._s3_client.generate_presigned_url( + "get_object", + Params={ + "Bucket": self.bucket_name, + "Key": thumb_key, + "ResponseContentType": "image/png", + }, + ExpiresIn=self.preview_url_expiration, + ) + except ClientError as e: + logger.error(f"Failed to generate thumbnail presigned URL: {e}") + raise + + expires_at = ( + datetime.now(timezone.utc) + timedelta(seconds=self.preview_url_expiration) + ).isoformat().replace("+00:00", "Z") + + return ThumbnailResponse( + upload_id=upload_id, + url=url, + expires_at=expires_at, + cached=cached, + ) + + async def _render_and_store_thumbnail( + self, file_meta: FileMetadata, thumb_key: str + ) -> None: + """Stream the original from S3, rasterize page 1, store the PNG.""" + try: + response = self._s3_client.get_object( + Bucket=self.bucket_name, + Key=file_meta.s3_key, + ) + raw = response["Body"].read() + except ClientError as e: + logger.error(f"Failed to read source for thumbnail {file_meta.upload_id}: {e}") + raise ThumbnailRenderError(f"Failed to read source: {e}") from e + + # Run the CPU-bound render off the event loop so we don't stall the + # request worker. pypdfium2 releases the GIL for the heavy bits.
+ loop = asyncio.get_running_loop() + png_bytes = await loop.run_in_executor( + None, + self._thumbnail_renderer.render, + file_meta.mime_type, + raw, + ) + + try: + self._s3_client.put_object( + Bucket=self.bucket_name, + Key=thumb_key, + Body=png_bytes, + ContentType="image/png", + ) + except ClientError as e: + logger.error(f"Failed to write thumbnail {thumb_key}: {e}") + raise + + logger.info( + f"Rendered thumbnail for upload {file_meta.upload_id} " + f"({len(png_bytes)} bytes)" + ) + + def _delete_thumbnail_object(self, file_meta: FileMetadata) -> None: + """ + Best-effort delete of the thumbnail sibling. + + S3 ``delete_object`` is idempotent — a missing key is not an error. + We swallow other errors so a broken thumbnail never blocks deletion + of the underlying file. + """ + thumb_key = self._thumbnail_s3_key(file_meta) + try: + self._s3_client.delete_object(Bucket=self.bucket_name, Key=thumb_key) + except ClientError as e: + logger.warning(f"Failed to delete thumbnail {thumb_key}: {e}") + # ========================================================================= # File Management # ========================================================================= @@ -344,6 +652,9 @@ async def delete_file(self, user_id: str, upload_id: str) -> bool: logger.warning("Failed to delete S3 object", exc_info=True) # Continue with metadata deletion even if S3 fails + # Best-effort delete of the thumbnail sibling, if one exists. + self._delete_thumbnail_object(file_meta) + # Delete metadata deleted = await self.repository.delete_file(user_id, upload_id) @@ -488,6 +799,9 @@ async def delete_session_files(self, session_id: str) -> int: logger.warning(f"Failed to delete S3 object for {file_meta.upload_id}: {e}") # Continue with metadata deletion + # Best-effort delete of the thumbnail sibling, if one exists.
+ self._delete_thumbnail_object(file_meta) + # Delete metadata deleted = await self.repository.delete_file( file_meta.user_id, file_meta.upload_id diff --git a/backend/src/apis/app_api/files/thumbnails.py b/backend/src/apis/app_api/files/thumbnails.py new file mode 100644 index 00000000..c5bdbd89 --- /dev/null +++ b/backend/src/apis/app_api/files/thumbnails.py @@ -0,0 +1,147 @@ +""" +Thumbnail rendering for non-image file attachments. + +Currently supports PDF (page 1) via pypdfium2. Office formats (.docx, .xlsx) +are intentionally not implemented here — see ThumbnailRenderer.render() for +guidance on how to extend this. +""" + +import io +import logging +from typing import Callable, Dict + +logger = logging.getLogger(__name__) + + +# Bounded box for the longest dimension of a generated thumbnail. The UI's +# attachment card body is ~240x128 logical pixels, so 256 covers retina +# without being wasteful on storage / CPU. +THUMBNAIL_MAX_DIMENSION = 256 + + +class ThumbnailUnsupportedError(Exception): + """Raised when no renderer is registered for a given MIME type.""" + + +class ThumbnailRenderError(Exception): + """Raised when rendering ran but the source file was unreadable.""" + + +class ThumbnailRenderer: + """ + MIME-type-dispatching renderer that produces a PNG of a file's first page. + + Today this is PDF-only. The dispatcher exists so that callers (the file + upload service, the route layer) speak a single API: hand it a MIME type + plus bytes, get back a PNG or a typed error. New formats plug in by + adding an entry to ``_renderers``. + + ----- Future formats: .docx and .xlsx ----- + + Office formats are deliberately out of scope for this in-process renderer. + The standard rasterization path requires LibreOffice (``soffice --headless + --convert-to pdf``) to first convert the document to PDF, which can then be + handed to the existing PDF path. 
LibreOffice adds roughly 500 MB to a + container image, pulls in ~20 system packages, and noticeably increases + cold start time — costs that are inappropriate for the app-api request + path, which today serves chat traffic with a tight latency budget. + + Recommendation when those formats are needed: + + 1. Build a separate **thumbnail render service**. A small Fargate task or + a dedicated Lambda using a pre-baked LibreOffice container image is a + clean fit. Either flavor can stay scaled to zero when idle. + 2. Have app-api enqueue render requests (SQS or a synchronous HTTPS call + behind an internal ALB) instead of importing the converter. The render + service writes the resulting PNG to the same `_thumb.png` sibling key + the PDF path uses, so the cache-and-serve flow on this side is + unchanged. + 3. Keep the dispatcher's public API stable: callers should still get a PNG + back, and the cache layout in S3 should not change. The only difference + is *where* the bytes are produced. + + Until that service exists, callers are expected to filter on + ``THUMBNAIL_SUPPORTED_MIME_TYPES`` from ``apis.shared.files.models`` and + return a 415 for unsupported types so the UI can fall back to the + existing skeleton card. + """ + + def __init__(self) -> None: + self._renderers: Dict[str, Callable[[bytes], bytes]] = { + "application/pdf": self._render_pdf, + # Future entries plug in here. See class docstring for the + # recommended out-of-process design for .docx and .xlsx. + } + + def render(self, mime_type: str, raw: bytes) -> bytes: + """ + Render a thumbnail PNG for the given file bytes. + + Args: + mime_type: The source file's MIME type. + raw: The raw file bytes. + + Returns: + PNG-encoded bytes for a thumbnail bounded by + THUMBNAIL_MAX_DIMENSION on its longest side. + + Raises: + ThumbnailUnsupportedError: No renderer is registered for mime_type. + ThumbnailRenderError: The renderer ran but the file was unreadable. 
+ """ + renderer = self._renderers.get(mime_type) + if renderer is None: + raise ThumbnailUnsupportedError( + f"No thumbnail renderer registered for {mime_type}" + ) + return renderer(raw) + + def _render_pdf(self, raw: bytes) -> bytes: + # Imported lazily so unit tests that don't touch the renderer don't + # need the native lib loaded. + try: + import pypdfium2 as pdfium + except ImportError as e: + raise ThumbnailRenderError( + "pypdfium2 is not installed; PDF thumbnails are unavailable" + ) from e + + try: + pdf = pdfium.PdfDocument(io.BytesIO(raw)) + except Exception as e: + raise ThumbnailRenderError(f"Failed to open PDF: {e}") from e + + try: + if len(pdf) == 0: + raise ThumbnailRenderError("PDF has no pages") + + page = pdf[0] + try: + width, height = page.get_size() + longest = max(width, height) + if longest <= 0: + raise ThumbnailRenderError("PDF page has zero dimensions") + + # Scale so the longest side lands at THUMBNAIL_MAX_DIMENSION. + scale = THUMBNAIL_MAX_DIMENSION / longest + bitmap = page.render(scale=scale) + pil_image = bitmap.to_pil() + finally: + page.close() + finally: + pdf.close() + + buffer = io.BytesIO() + pil_image.save(buffer, format="PNG", optimize=True) + return buffer.getvalue() + + +_renderer_instance: ThumbnailRenderer | None = None + + +def get_thumbnail_renderer() -> ThumbnailRenderer: + """Get or create the singleton ThumbnailRenderer.""" + global _renderer_instance + if _renderer_instance is None: + _renderer_instance = ThumbnailRenderer() + return _renderer_instance diff --git a/backend/src/apis/app_api/main.py b/backend/src/apis/app_api/main.py index 17770531..376e5cb9 100644 --- a/backend/src/apis/app_api/main.py +++ b/backend/src/apis/app_api/main.py @@ -28,9 +28,58 @@ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' ) logger = logging.getLogger(__name__) + +# Refuse to boot unless SKIP_AUTH is paired with a positive local-dev +# signal. 
Allowlist over blocklist: every CORS_ORIGINS entry must be a +# localhost URL. Any deployed origin (or empty config) trips this — far +# safer than enumerating every env var a deployed runtime might set, and +# fails closed for new deploy targets we haven't met yet. +# +# Runs from `lifespan()` rather than at import time so tests that import +# this module (e.g. tests/routes/test_pbt_auth_sweep.py) don't trip the +# check on environments where SKIP_AUTH=true is set globally. +_SKIP_AUTH_LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"} + + +def _validate_skip_auth_or_raise() -> None: + """Raise RuntimeError if SKIP_AUTH=true is paired with non-local CORS_ORIGINS. + + No-op when SKIP_AUTH is unset/false. When set, every CORS_ORIGINS entry + must resolve to a localhost host or boot is refused. + """ + if os.environ.get("SKIP_AUTH", "").lower() != "true": + return + + from urllib.parse import urlparse + + origins = [ + o.strip() + for o in os.environ.get("CORS_ORIGINS", "").split(",") + if o.strip() + ] + + def _is_local(origin: str) -> bool: + try: + return (urlparse(origin).hostname or "") in _SKIP_AUTH_LOCAL_HOSTS + except Exception: + return False + + if not origins or not all(_is_local(o) for o in origins): + raise RuntimeError( + "SKIP_AUTH=true requires CORS_ORIGINS to contain only localhost " + "origins (localhost, 127.0.0.1, ::1, 0.0.0.0). Refusing to start " + "— this bypass is local-dev only." + ) + logger.warning( + "SKIP_AUTH=true — auth dependencies will return a fake admin user. " + "DO NOT enable this in any deployed environment." 
+ ) + + @asynccontextmanager async def lifespan(app: FastAPI): # Startup + _validate_skip_auth_or_raise() logger.info("=== AgentCore Public Stack API Starting ===") logger.info("Agent execution engine initialized") diff --git a/backend/src/apis/app_api/messages/models.py b/backend/src/apis/app_api/messages/models.py index a42f7326..44c0f472 100644 --- a/backend/src/apis/app_api/messages/models.py +++ b/backend/src/apis/app_api/messages/models.py @@ -31,11 +31,22 @@ class MessageContent(BaseModel): class LatencyMetrics(BaseModel): - """Latency measurements in milliseconds""" + """Latency measurements in milliseconds. + + ``time_to_first_token`` is ``None`` when the provider did not emit + ``timeToFirstByteMs`` and we couldn't compute it locally — distinct from + a measured value of 0ms (which is physically impossible). Aggregations + over TTFT must filter ``None`` so a missing measurement doesn't pull + averages toward zero. + """ model_config = ConfigDict(populate_by_name=True) - time_to_first_token: int = Field(..., alias="timeToFirstToken", description="Time from request start to first token received (ms)") + time_to_first_token: Optional[int] = Field( + None, + alias="timeToFirstToken", + description="Time from request start to first token (ms); None if not measured", + ) end_to_end_latency: int = Field(..., alias="endToEndLatency", description="Total time from request start to completion (ms)") diff --git a/backend/src/apis/inference_api/chat/routes.py b/backend/src/apis/inference_api/chat/routes.py index 175c138a..046b8fae 100644 --- a/backend/src/apis/inference_api/chat/routes.py +++ b/backend/src/apis/inference_api/chat/routes.py @@ -230,6 +230,227 @@ async def _resolve_caching_enabled(model_id: str | None, explicit_caching_enable return caching +# ============================================================ +# Spreadsheet Analysis Tool Injection +# ============================================================ + +SPREADSHEET_TOOL_IDS = {"list_spreadsheets", 
"analyze_spreadsheet"} + + +def _build_spreadsheet_tools( + enabled_tools: list | None, + assistant_id: str | None, + session_id: str, + user_id: str, +) -> list: + """Create context-bound spreadsheet analysis tools if enabled by the user.""" + if not enabled_tools: + return [] + + requested = SPREADSHEET_TOOL_IDS.intersection(enabled_tools) + if not requested: + return [] + + from agents.builtin_tools.spreadsheet_analysis import make_list_spreadsheets_tool, make_analyze_tool + + tools = [] + if "list_spreadsheets" in requested: + tools.append(make_list_spreadsheets_tool(assistant_id, session_id, user_id)) + if "analyze_spreadsheet" in requested: + tools.append(make_analyze_tool(assistant_id, session_id, user_id)) + + logger.info(f"Created {len(tools)} spreadsheet analysis tools (assistant={assistant_id})") + return tools + + +# ============================================================ +# Attachment Partitioning (#206) +# ============================================================ + +def _estimate_decoded_size(file: "FileContent") -> int: + """Estimate decoded byte size of a base64-encoded FileContent payload. + + Base64 inflates bytes by ~4/3, so decoded size ≈ len(b64) * 3 / 4. + This avoids allocating the full bytes just to check a threshold. + """ + try: + # Account for base64 padding: strip "=" padding before estimating. + stripped = (file.bytes or "").rstrip("=") + return (len(stripped) * 3) // 4 + except Exception: + return 0 + + +def _partition_attachments( + all_files: list, +) -> tuple[list, list, list]: + """Split attachments into (inline_for_bedrock, tabular, oversized_non_tabular). + + - Tabular files (csv/xlsx) are never sent inline — they route through + the spreadsheet analysis tools. Keeps Bedrock's 4.5MB document limit + from exploding on XLSX files that expand during internal parsing. 
+ - Non-tabular files larger than INLINE_DOCUMENT_MAX_BYTES are dropped + from the inline set with a user-facing note, to prevent mid-stream + ValidationException on the raw AWS error path. + - Everything else rides along as a regular document/image content block. + """ + from apis.shared.files.models import INLINE_DOCUMENT_MAX_BYTES, is_tabular_file + + inline: list = [] + tabular: list = [] + oversized: list = [] + + for file in all_files: + if is_tabular_file(file.filename, file.content_type): + tabular.append(file) + continue + # Only size-gate non-image documents. Images have their own Bedrock + # limits (much larger) and the prompt builder reroutes them as + # image blocks, which are not affected by the document-size cap. + content_type = (file.content_type or "").lower() + is_image = content_type.startswith("image/") + if not is_image and _estimate_decoded_size(file) > INLINE_DOCUMENT_MAX_BYTES: + oversized.append(file) + continue + inline.append(file) + + return inline, tabular, oversized + + +def _build_attachment_guidance( + diverted_tabular: list, + oversized_inline: list, + enabled_tools: list | None, +) -> str: + """Return a short markdown addendum describing how attachments will be + handled, to append to the user's message so the agent (and the user) + both understand why a file isn't inline. + """ + parts: list[str] = [] + + if diverted_tabular: + names = ", ".join(f"`{f.filename}`" for f in diverted_tabular) + tool_is_enabled = bool(enabled_tools) and ( + "analyze_spreadsheet" in enabled_tools or "list_spreadsheets" in enabled_tools + ) + if tool_is_enabled: + parts.append( + f"_Attached spreadsheet(s) {names} are available through the " + f"Spreadsheet Analysis tool rather than inline — use " + f"`list_spreadsheets` to see them and `analyze_spreadsheet` " + f"to run aggregations or lookups._" + ) + else: + parts.append( + f"_Attached spreadsheet(s) {names} can't be read inline at " + f"this size. 
To analyze them, enable **Spreadsheet Analysis** " + f"in the Tools section of the settings panel (gear icon next " + f"to the message input), then re-send your message._" + ) + + if oversized_inline: + names = ", ".join(f"`{f.filename}`" for f in oversized_inline) + parts.append( + f"_Attached file(s) {names} exceed the inline document size limit " + f"and were skipped. Try a smaller file, or convert to CSV/XLSX " + f"and use the Spreadsheet Analysis tool._" + ) + + return "\n\n".join(parts) + + +def _build_tabular_inventory( + session_id: str, + assistant_id: str | None, + enabled_tools: list | None, +) -> str: + """Inventory every tabular file visible to the agent this turn, and + prepend it to the user message when more than one exists. + + Motivation: when the vector search returns chunks from multiple source + files with identical schemas (e.g. two monthly FY ledgers), the model + has no way to tell there's more than one spreadsheet at all — RAG + surfaces chunk content but not a full file inventory. The model picks + whichever file yielded the first high-ranked chunk and silently runs + analyze_spreadsheet against just that one. The user's "total" is + wrong by exactly the other file(s). + + We ship the file list inline so the agent sees the full set at turn + start and can call list_spreadsheets / pick deliberately / ask the + user / aggregate across files. Only emitted when the analysis tools + are enabled (otherwise the agent can't act on it anyway) and when at + least two tabular files exist (one file isn't ambiguous). + """ + if not enabled_tools: + return "" + tool_is_enabled = ( + "analyze_spreadsheet" in enabled_tools + or "list_spreadsheets" in enabled_tools + ) + if not tool_is_enabled: + return "" + + # Lazy imports to avoid pulling the agent layer into module-load time + # on cold starts where this code path isn't exercised. 
+ try: + from agents.builtin_tools.spreadsheet_analysis.list_spreadsheets_tool import ( + _get_kb_files, + _get_session_files, + ) + except Exception: + return "" + + files: list[dict] = [] + try: + if assistant_id: + files.extend(_get_kb_files(assistant_id)) + files.extend(_get_session_files(session_id)) + except Exception: + logger.warning("Failed to enumerate tabular files for inventory", exc_info=True) + return "" + + # De-duplicate by (filename, source) — a single file shouldn't be + # listed twice if our lookups overlap. + seen: set[tuple[str, str]] = set() + unique: list[dict] = [] + for f in files: + key = (f.get("filename", ""), f.get("source", "")) + if key in seen: + continue + seen.add(key) + unique.append(f) + + if len(unique) < 2: + # Single file: no ambiguity, and list_spreadsheets covers discovery + # for the agent if it ever needs it. + return "" + + def _fmt_size(n: int) -> str: + if n >= 1024 * 1024: + return f"{n / (1024 * 1024):.1f} MB" + if n >= 1024: + return f"{n // 1024} KB" + return f"{n} B" + + lines = [] + for f in unique: + name = f.get("filename", "") + source = "knowledge base" if f.get("source") == "knowledge_base" else "chat attachment" + size = _fmt_size(int(f.get("size_bytes") or 0)) + lines.append(f"- `{name}` ({source}, {size})") + + listing = "\n".join(lines) + return ( + f"_Multiple spreadsheet files are attached. Before running " + f"`analyze_spreadsheet`, decide which file(s) the user's request " + f"refers to — if it's ambiguous or spans multiple files, call " + f"`list_spreadsheets` and/or ask the user rather than picking one " + f"silently. 
State which file(s) you analyzed in your response._\n\n" + f"**Available spreadsheets:**\n{listing}" + ) + + + # ============================================================ # Helper Functions for Streaming Error/Status Messages # ============================================================ @@ -357,8 +578,18 @@ async def invocations(request: InvocationRequest, current_user: User = Depends(g if input_data.file_upload_ids: logger.info(f"File upload IDs: {len(input_data.file_upload_ids)} IDs to resolve") - # Resolve file upload IDs to FileContent objects - files_to_send = list(input_data.files) if input_data.files else [] + # Resolve file upload IDs to FileContent objects, then partition: + # - inline_files: images + non-tabular documents that Bedrock can + # ingest directly as document content blocks + # - tabular_files: csv/xlsx, which we intentionally NEVER send inline + # because XLSX in particular inflates dramatically inside Bedrock + # (1.4MB zipped → >4.5MB internal, triggering ValidationException). + # They remain available to the agent via list_spreadsheets / + # analyze_spreadsheet, which run pandas on the real file. See #206. + # - oversized_files: non-tabular docs that exceed our inline size + # budget; we skip them inline and surface a note instead of + # letting Bedrock reject the turn. 
+ all_files = list(input_data.files) if input_data.files else [] if input_data.file_upload_ids: try: @@ -368,14 +599,27 @@ async def invocations(request: InvocationRequest, current_user: User = Depends(g upload_ids=input_data.file_upload_ids, max_files=5, # Bedrock document limit ) - # Convert ResolvedFileContent to FileContent for rf in resolved_files: - files_to_send.append(FileContent(filename=rf.filename, content_type=rf.content_type, bytes=rf.bytes)) + all_files.append( + FileContent(filename=rf.filename, content_type=rf.content_type, bytes=rf.bytes) + ) logger.info(f"Resolved {len(resolved_files)} files from upload IDs") - except Exception as e: - logger.warning("Failed to resolve file upload IDs") + except Exception: + logger.warning("Failed to resolve file upload IDs", exc_info=True) # Continue without files rather than failing the request + files_to_send, diverted_tabular, oversized_inline = _partition_attachments(all_files) + if diverted_tabular: + logger.info( + f"Diverted {len(diverted_tabular)} tabular file(s) from inline document blocks; " + f"available via spreadsheet tools: {[f.filename for f in diverted_tabular]}" + ) + if oversized_inline: + logger.warning( + f"Skipped {len(oversized_inline)} oversized file(s) (> inline limit): " + f"{[(f.filename, _estimate_decoded_size(f)) for f in oversized_inline]}" + ) + # Pre-create session metadata so OAuth interrupts and other state can # attach to the session row from turn one. Best-effort; on failure the # post-stream lazy-create in StreamCoordinator still covers it. 
@@ -637,11 +881,25 @@ async def invocations(request: InvocationRequest, current_user: User = Depends(g try: existing_metadata = await get_session_metadata(input_data.session_id, user_id) if existing_metadata: - # Update existing metadata with assistant_id in preferences - prefs_dict = existing_metadata.preferences.model_dump(by_alias=False) if existing_metadata.preferences else {} + # Update existing metadata: merge assistant_id into the + # preferences sub-model. The top-level SessionMetadata has + # no assistant_id field, so applying the update there + # (previous behavior) silently did nothing under + # extra="allow" and left preferences.assistant_id=None. + # That broke the mid-session validation above on turn 2+ + # because the check relies on preferences.assistant_id to + # recognize an already-attached assistant (#205). + prefs_dict = ( + existing_metadata.preferences.model_dump(by_alias=False) + if existing_metadata.preferences + else {} + ) prefs_dict["assistant_id"] = input_data.rag_assistant_id + merged_preferences = SessionPreferences(**prefs_dict) - updated_metadata = existing_metadata.model_copy(update={"assistant_id": input_data.rag_assistant_id}) + updated_metadata = existing_metadata.model_copy( + update={"preferences": merged_preferences} + ) else: # Create new metadata with assistant_id in preferences @@ -750,6 +1008,23 @@ async def invocations(request: InvocationRequest, current_user: User = Depends(g if caching_enabled is False: logger.info("Prompt caching disabled for model") + # Get agent instance with user-specific configuration + # AgentCore Memory tracks preferences across sessions per user_id + # Supports multiple LLM providers: AWS Bedrock, OpenAI, and Google Gemini + # Use augmented message and assistant system prompt if assistant RAG was applied + + # Spreadsheet tools scoped to the assistant's document corpus, + # when an assistant is attached to this request. 
The frontend + # keeps the assistant id in the URL for the whole session's + # lifetime, so we can trust `input_data.rag_assistant_id` + # directly; no preferences fallback needed. + extra_tools = _build_spreadsheet_tools( + enabled_tools=input_data.enabled_tools, + assistant_id=input_data.rag_assistant_id, + session_id=input_data.session_id, + user_id=user_id, + ) + agent = await get_agent( session_id=input_data.session_id, user_id=user_id, @@ -761,6 +1036,7 @@ async def invocations(request: InvocationRequest, current_user: User = Depends(g provider=input_data.provider, inference_params=inference_params, agent_type=input_data.agent_type, + extra_tools=extra_tools, is_resume=False, ) @@ -821,11 +1097,33 @@ async def stream_with_quota_warning() -> AsyncGenerator[str, None]: # will be modified before reaching the model. This happens when: # 1. RAG augmentation prepends context chunks to the message # 2. File attachments cause PromptBuilder to rewrite into ContentBlocks + # 3. Attachment guidance is appended (tabular routed to tools, etc.) # The original text becomes the single source of truth for UI display, # while the full augmented prompt stays in AgentCore Memory for the LLM. + attachment_guidance = _build_attachment_guidance( + diverted_tabular, oversized_inline, input_data.enabled_tools + ) + # When multiple spreadsheets are visible, ship the full inventory + # up front so the agent can disambiguate intentionally instead of + # silently picking whichever file the vector search ranked first. + tabular_inventory = _build_tabular_inventory( + session_id=input_data.session_id, + assistant_id=input_data.rag_assistant_id, + enabled_tools=input_data.enabled_tools, + ) + # Bind to a new local so we don't trip Python's local-scope rules + # inside this generator closure (augmented_message is defined in + # the outer function; reassigning it here would make the whole + # name local and UnboundLocalError before the assignment runs). 
+ final_message = augmented_message + if attachment_guidance: + final_message = f"{final_message}\n\n{attachment_guidance}" + if tabular_inventory: + final_message = f"{final_message}\n\n{tabular_inventory}" + message_will_be_modified = ( - augmented_message != input_data.message # RAG augmentation - or bool(files_to_send) # File attachments + final_message != input_data.message # RAG augmentation / attachment guidance / inventory + or bool(files_to_send) # File attachments ) # Strands' resume protocol wants each entry wrapped as # {"interruptResponse": {...}}. The InvocationRequest schema @@ -838,7 +1136,7 @@ async def stream_with_quota_warning() -> AsyncGenerator[str, None]: ) async for event in agent.stream_async( - augmented_message, + final_message, session_id=input_data.session_id, files=files_to_send if files_to_send else None, citations=citations_for_storage if citations_for_storage else None, diff --git a/backend/src/apis/inference_api/chat/service.py b/backend/src/apis/inference_api/chat/service.py index 168da1e7..9bd0ff19 100644 --- a/backend/src/apis/inference_api/chat/service.py +++ b/backend/src/apis/inference_api/chat/service.py @@ -118,6 +118,7 @@ async def get_agent( provider: Optional[str] = None, max_tokens: Optional[int] = None, agent_type: Optional[str] = None, + extra_tools: Optional[list] = None, inference_params: Optional[Dict[str, Any]] = None, is_resume: bool = False, ) -> BaseAgent: @@ -172,8 +173,7 @@ async def get_agent( agent_type=agent_type, ) - # Check cache - if cache_key in _agent_cache: + if not extra_tools and cache_key in _agent_cache: cached = _agent_cache[cache_key] # Defense in depth: a non-resume request should never be served a # paused agent. 
If we ever desync the cache key between the original @@ -210,6 +210,8 @@ async def get_agent( system_prompt=system_prompt, caching_enabled=caching_enabled, provider=provider, + max_tokens=max_tokens, + extra_tools=extra_tools, inference_params=merged_params, ) @@ -218,6 +220,11 @@ async def get_agent( if hasattr(agent, "_construction_snapshot"): agent._construction_snapshot["agent_type"] = resolved_agent_type + # Don't cache agents with context-bound extra_tools + if extra_tools: + logger.debug("⏭️ Skipping cache for agent with extra_tools") + return agent + # Add to cache with LRU eviction if len(_agent_cache) >= _CACHE_MAX_SIZE: # Remove oldest entry (first inserted) diff --git a/backend/src/apis/shared/auth/dependencies.py b/backend/src/apis/shared/auth/dependencies.py index e4e48417..658c948c 100644 --- a/backend/src/apis/shared/auth/dependencies.py +++ b/backend/src/apis/shared/auth/dependencies.py @@ -227,6 +227,29 @@ def _get_bff_cognito_validator(): return _bff_cognito_validator +def _skip_auth_user() -> Optional[User]: + """Return a fake admin user when SKIP_AUTH=true, else None. + + Local-dev-only bypass so an unattended agent (or a dev with no IdP + access) can hit protected routes without the OAuth round-trip. The + startup check in `app_api/main.py` refuses to boot when this is + combined with deployed-environment indicators. 
+ """ + if os.environ.get("SKIP_AUTH", "").lower() != "true": + return None + roles = [ + r.strip() + for r in os.environ.get("SKIP_AUTH_ROLES", "admin").split(",") + if r.strip() + ] + return User( + user_id=os.environ.get("SKIP_AUTH_USER_ID", "local-dev"), + email=os.environ.get("SKIP_AUTH_EMAIL", "dev@local"), + name="Local Dev", + roles=roles, + ) + + async def get_current_user( credentials: Optional[HTTPAuthorizationCredentials] = Depends(security) ) -> User: @@ -247,6 +270,9 @@ async def get_current_user( - 401 if token is missing or invalid - 500 if no JWT validator is available """ + if (fake := _skip_auth_user()) is not None: + return fake + # Check if credentials are missing if credentials is None: raise HTTPException( @@ -308,6 +334,9 @@ async def get_current_user_from_session(request: Request) -> User: HTTPException 401 if no session was resolved by the upstream middleware (cookie missing, malformed, or session record gone). """ + if (fake := _skip_auth_user()) is not None: + return fake + record = getattr(request.state, "bff_session", None) if record is None: raise HTTPException( @@ -395,6 +424,9 @@ async def get_current_user_trusted( """ logger.debug("[get_current_user_trusted] Trusted auth extraction started") + if (fake := _skip_auth_user()) is not None: + return fake + # Check if credentials are missing if credentials is None: logger.debug("[get_current_user_trusted] No credentials provided - returning 401") diff --git a/backend/src/apis/shared/files/__init__.py b/backend/src/apis/shared/files/__init__.py index 6e85c844..a29ba630 100644 --- a/backend/src/apis/shared/files/__init__.py +++ b/backend/src/apis/shared/files/__init__.py @@ -17,8 +17,12 @@ QuotaExceededError, ALLOWED_MIME_TYPES, ALLOWED_EXTENSIONS, + TABULAR_MIME_TYPES, + TABULAR_EXTENSIONS, + INLINE_DOCUMENT_MAX_BYTES, get_file_format, is_allowed_mime_type, + is_tabular_file, ) from .repository import ( @@ -47,8 +51,12 @@ "QuotaExceededError", "ALLOWED_MIME_TYPES", "ALLOWED_EXTENSIONS", 
+ "TABULAR_MIME_TYPES", + "TABULAR_EXTENSIONS", + "INLINE_DOCUMENT_MAX_BYTES", "get_file_format", "is_allowed_mime_type", + "is_tabular_file", # Repository "FileUploadRepository", "get_file_upload_repository", diff --git a/backend/src/apis/shared/files/models.py b/backend/src/apis/shared/files/models.py index c3fba383..1d0210aa 100644 --- a/backend/src/apis/shared/files/models.py +++ b/backend/src/apis/shared/files/models.py @@ -5,6 +5,7 @@ Supports the pre-signed URL upload flow for S3. """ +import os from datetime import datetime, timezone from enum import Enum from typing import List, Optional @@ -69,6 +70,56 @@ def is_allowed_mime_type(mime_type: str) -> bool: return mime_type in ALLOWED_MIME_TYPES +# ============================================================================= +# Tabular File Detection +# ============================================================================= + +# Tabular files (CSV, XLSX) are routed to the spreadsheet analysis tools +# (list_spreadsheets, analyze_spreadsheet) instead of being sent inline as +# Bedrock document content blocks. Two reasons: +# 1. XLSX files compress well on disk but expand dramatically when Bedrock +# parses them internally. A 1.4MB xlsx can exceed Bedrock's 4.5MB +# document-content limit and crash the turn with ValidationException +# (see #206). +# 2. Even when under the limit, sending raw tabular bytes as a document +# block is wasteful — the model does pandas-quality aggregation poorly +# from text-rendered tables. analyze_spreadsheet runs real Python on +# the real file and is cheaper in tokens and more accurate. 
+ +TABULAR_MIME_TYPES = frozenset({ + "text/csv", + "application/vnd.ms-excel", + "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", +}) + +TABULAR_EXTENSIONS = frozenset({".csv", ".xls", ".xlsx"}) + + +def is_tabular_file(filename: str, mime_type: str) -> bool: + """Return True when the file should be handled by spreadsheet tools + rather than sent inline as a Bedrock document block. + """ + if mime_type and mime_type.lower() in TABULAR_MIME_TYPES: + return True + if filename: + lower = filename.lower() + for ext in TABULAR_EXTENSIONS: + if lower.endswith(ext): + return True + return False + + +# Bedrock's /ConverseStream imposes a 4.5MB hard limit on each document +# content block's *internal* representation. Non-tabular formats (PDF, docx, +# txt, md) don't inflate much, but we leave margin for per-request overhead +# and for cumulative size across attachments. Rejecting inline files above +# this threshold with a friendly message is better than a raw AWS +# ValidationException mid-stream. 
+INLINE_DOCUMENT_MAX_BYTES = int( + os.environ.get("INLINE_DOCUMENT_MAX_BYTES", 4 * 1024 * 1024) # 4MB +) + + # ============================================================================= # Database Models (stored in DynamoDB) # ============================================================================= @@ -250,6 +301,48 @@ class CompleteUploadResponse(BaseModel): model_config = ConfigDict(populate_by_name=True) +class PreviewUrlResponse(BaseModel): + """Response for GET /api/files/{uploadId}/preview-url.""" + + upload_id: str = Field(..., alias="uploadId") + url: str = Field(..., description="Short-lived presigned GET URL") + expires_at: str = Field(..., alias="expiresAt", description="ISO8601 expiration time") + mime_type: str = Field(..., alias="mimeType") + filename: str + + model_config = ConfigDict(populate_by_name=True) + + +class TextSnippetResponse(BaseModel): + """Response for GET /api/files/{uploadId}/text-snippet.""" + + upload_id: str = Field(..., alias="uploadId") + snippet: str = Field(..., description="UTF-8 decoded text from the start of the file") + truncated: bool = Field(..., description="True if the file was longer than the snippet limit") + mime_type: str = Field(..., alias="mimeType") + + model_config = ConfigDict(populate_by_name=True) + + +# MIME types the thumbnail renderer can currently produce a preview image for. +# Callers should consult this set before invoking the thumbnail endpoint to +# avoid hammering the service for unsupported types. 
+THUMBNAIL_SUPPORTED_MIME_TYPES = frozenset({ + "application/pdf", +}) + + +class ThumbnailResponse(BaseModel): + """Response for GET /api/files/{uploadId}/thumbnail.""" + + upload_id: str = Field(..., alias="uploadId") + url: str = Field(..., description="Short-lived presigned GET URL for a PNG thumbnail") + expires_at: str = Field(..., alias="expiresAt", description="ISO8601 expiration time") + cached: bool = Field(..., description="True if served from cache, False if newly rendered") + + model_config = ConfigDict(populate_by_name=True) + + class FileResponse(BaseModel): """Single file in list response.""" diff --git a/backend/src/apis/shared/middleware/session_refresh.py b/backend/src/apis/shared/middleware/session_refresh.py index 4b36abba..6a2d7bb5 100644 --- a/backend/src/apis/shared/middleware/session_refresh.py +++ b/backend/src/apis/shared/middleware/session_refresh.py @@ -17,9 +17,12 @@ import asyncio import logging +import secrets import time from typing import Optional +from botocore.exceptions import ClientError + from starlette.middleware.base import BaseHTTPMiddleware from starlette.requests import Request from starlette.responses import Response @@ -43,6 +46,7 @@ CognitoRefreshError, ) from apis.shared.sessions_bff.repository import SessionRepository +from apis.shared.sessions_bff.single_flight import resolve_once logger = logging.getLogger(__name__) @@ -64,6 +68,7 @@ def __init__( cookie_codec: Optional[CookieCodec] = None, refresh_client: Optional[CognitoRefreshClient] = None, cache: Optional[SessionCache] = None, + refresh_lock_ttl_seconds: int = 30, ) -> None: super().__init__(app) self._config = config @@ -71,6 +76,19 @@ def __init__( self._cookie_codec = cookie_codec self._refresh_client = refresh_client self._cache = cache + # Cross-task refresh lock TTL. A leader that crashes mid-refresh + # strands the lock for at most this many seconds, after which any + # peer can re-acquire and retry. 
Followers poll for at most this + # long before falling back to terminal. 30s is a safety cushion + # over the worst-case (Cognito + DDB + retries) refresh latency. + self._refresh_lock_ttl_seconds = refresh_lock_ttl_seconds + # Strong-reference set for fire-and-forget slide-write tasks. + # Without keeping a reference, `asyncio.create_task(...)` can be + # garbage-collected mid-execution — Python's docs explicitly warn + # about this, and on fast CI runners the task dies before the + # scheduler picks it up. We remove each task via `add_done_callback` + # so the set doesn't grow unboundedly. + self._slide_tasks: set[asyncio.Task] = set() def _ensure_collaborators(self) -> None: if self._config is None: @@ -145,6 +163,13 @@ async def _maybe_slide(self, record: SessionRecord) -> Optional[int]: past it, the cookie is allowed to expire on its own original Max-Age — we don't extend, but we also don't proactively clear (the user might still complete the in-flight request). + + The DDB `touch_last_seen` write is dispatched as a detached + `asyncio.Task` — the response path must not wait on it. The local + cache is updated synchronously BEFORE scheduling so subsequent + same-request reads (and the next cache window) see the slid state + even if the background write hasn't landed yet. Today's "swallow + failures" semantics are preserved inside `_slide_write_task`. """ assert self._config is not None now = int(time.time()) @@ -165,26 +190,59 @@ async def _maybe_slide(self, record: SessionRecord) -> Optional[int]: return None new_ttl = now + new_max_age - try: - await self._repository.touch_last_seen( - record.session_id, last_seen_at=now, ttl=new_ttl - ) - except Exception as exc: - # Don't fail the request if the slide-write fails — the user - # still has a valid session for the rest of its current TTL. 
- logger.warning( - "BFF session slide failed for %s: %s", record.session_id, exc - ) - return None - # Reflect the slide locally so subsequent same-request reads (and the - # cache) don't think the row still needs a slide. + # Reflect the slide locally BEFORE dispatching the background write + # so subsequent same-request reads (and the cache) don't think the + # row still needs a slide — even if the background task hasn't yet + # landed the DDB write. record.last_seen_at = now record.ttl = new_ttl if self._cache is not None: self._cache.set(record) + + # Fire-and-forget: the response path MUST NOT wait on the DDB write. + # Failures are swallowed inside `_slide_write_task` (preserving + # today's "slide failures are non-fatal" semantics — the user still + # has a valid session for the rest of its current TTL). + # + # CRITICAL: keep a strong reference on the middleware instance + # (`self._slide_tasks`). Without this, Python is free to GC the + # task before it runs — we observed this on Python 3.12 CI runners + # where the preservation tests saw 0 update_item calls because the + # task was collected mid-flight. The done-callback removes the task + # again so the set doesn't leak. + task = asyncio.create_task( + self._slide_write_task( + session_id=record.session_id, + last_seen_at=now, + ttl=new_ttl, + ) + ) + self._slide_tasks.add(task) + task.add_done_callback(self._slide_tasks.discard) return new_max_age + async def _slide_write_task( + self, *, session_id: str, last_seen_at: int, ttl: int + ) -> None: + """Background helper for `_maybe_slide`'s fire-and-forget DDB write. + + Swallows exceptions so a DDB blip doesn't surface as an unhandled + task exception — today's inline slide-write already swallowed + failures, and we preserve that contract verbatim. The local cache + was updated synchronously in `_maybe_slide` before this task was + scheduled, so the user keeps seeing the slid state for the rest of + their current cache window. 
+ """ + try: + await self._repository.touch_last_seen( + session_id, last_seen_at=last_seen_at, ttl=ttl + ) + except Exception as exc: + logger.warning( + "BFF session slide failed for %s: %s", session_id, exc + ) + async def _persist_refresh( self, *, @@ -193,18 +251,26 @@ async def _persist_refresh( last_seen_at: int, ttl: int, rotated: bool, + lock_owner: str, ) -> bool: """Write refreshed tokens to DDB. Retry when rotation makes it critical. Returns True on success or on a benign (non-rotation) failure. Returns False only when rotation happened *and* every retry failed — caller should treat that as session-unrecoverable. + + The write also clears the cross-task refresh lock (atomic with the + token rotation), conditional on `lock_owner` matching. A + `ConditionalCheckFailedException` here means a peer task acquired + the lock after ours expired — we abandon the persist and the caller + should re-read DDB to adopt the peer's tokens. """ # Three attempts on rotation (≈350ms total worst-case), single shot # otherwise. boto3 already retries below us for transient API errors; # this layer handles longer blips and validation/throttling errors # that boto3 lets through. attempts = 3 if rotated else 1 + last_exc: Optional[Exception] = None for attempt in range(attempts): try: await self._repository.update_tokens( @@ -215,8 +281,26 @@ async def _persist_refresh( access_token_exp=refreshed.access_token_exp, last_seen_at=last_seen_at, ttl=ttl, + expected_lock_owner=lock_owner, ) return True + except ClientError as exc: + # Lock-ownership condition failed — a peer task took over. + # Don't retry: their refresh is authoritative now. Caller + # adopts their tokens via the post-failure DDB re-read. 
+ if ( + exc.response.get("Error", {}).get("Code") + == "ConditionalCheckFailedException" + ): + logger.info( + "BFF refresh persist for %s lost lock to a peer task — " + "deferring to peer's tokens.", + session_id, + ) + return False + last_exc = exc + if attempt + 1 < attempts: + await asyncio.sleep(0.05 * (2**attempt)) # 50ms, 100ms except Exception as exc: last_exc = exc if attempt + 1 < attempts: @@ -240,6 +324,48 @@ async def _persist_refresh( ) return True + async def _wait_for_peer_refresh( + self, + *, + session_id: str, + previous: SessionRecord, + max_wait_seconds: float, + ) -> Optional[SessionRecord]: + """Poll DDB for a peer task's freshly persisted tokens. + + Called when we lost the cross-task refresh lock to a peer. Polls + the session row with bounded backoff (50ms → 500ms) until we + observe tokens that differ from `previous` — at which point we + adopt them — or `max_wait_seconds` elapses. + + Returns the peer's record on success, or `None` if we timed out + (peer crashed mid-refresh). The caller treats `None` as terminal + and clears the cookie; the lock TTL ensures the next request can + re-acquire and retry without waiting for a stuck row. + """ + deadline = time.monotonic() + max_wait_seconds + sleep_for = 0.05 + while time.monotonic() < deadline: + await asyncio.sleep(sleep_for) + peer = await self._repository.get(session_id) + if peer is None: + # Row gone (delete or TTL eviction) — terminal. + return None + # Refresh-token rotation: peer issued a new refresh token, ours + # is now revoked. Adopt their record. + if peer.cognito_refresh_token != previous.cognito_refresh_token: + return peer + # No rotation but a fresh access token landed: peer refreshed + # successfully, we can use the new access token. 
+ if ( + peer.cognito_access_token != previous.cognito_access_token + and peer.access_token_exp + > int(time.time()) + self._config.refresh_leeway_seconds + ): + return peer + sleep_for = min(sleep_for * 1.5, 0.5) + return None + async def _resolve_session( self, cookie_value: str ) -> tuple[Optional[SessionRecord], bool]: @@ -247,6 +373,21 @@ async def _resolve_session( `should_clear_cookie` is True when the cookie is present but unrecoverable — bad seal, missing row, expired TTL, or refresh failure. + + Cookie unseal happens before the single-flight wrap so a bad seal + short-circuits without registering a Future (and without keying the + registry off an untrusted session id). Once we have a validated + session id, the cache → `repository.get` → `needs_refresh` → + (maybe refresh) path is coalesced through `resolve_once` so an + Angular page-load fan-out of N same-session requests issues at most + one DynamoDB `get_item` per cache window. + + The per-session `get_session_lock(session_id)` around the Cognito + refresh exchange stays exactly where it is today — the single-flight + sits upstream of it. In the common case that the single-flight + already coalesces N callers to one loader invocation, only the + leader ever reaches the refresh lock; the existing one-`initiate_auth`- + per-`session_id`-per-leeway-window contract is preserved end-to-end. 
""" try: payload = self._cookie_codec.unseal(cookie_value) @@ -256,92 +397,180 @@ async def _resolve_session( session_id = payload.session_id - cached = self._cache.get(session_id) if self._cache else None - if cached is not None and not cached.needs_refresh( - int(time.time()), self._config.refresh_leeway_seconds - ): - return cached, False - - record = await self._repository.get(session_id) - if record is None: - logger.info("Discarding BFF cookie — no matching session row") - return None, True + async def _loader() -> tuple[Optional[SessionRecord], bool]: + cached = self._cache.get(session_id) if self._cache else None + if cached is not None and not cached.needs_refresh( + int(time.time()), self._config.refresh_leeway_seconds + ): + return cached, False - if not record.needs_refresh( - int(time.time()), self._config.refresh_leeway_seconds - ): - self._cache.set(record) - return record, False - - # Coalesce concurrent refreshes for the same session id. - async with get_session_lock(session_id): - # Re-check after acquiring the lock — another waiter may have - # already refreshed, in which case we serve the fresh row. - current = await self._repository.get(session_id) - if current is None: + record = await self._repository.get(session_id) + if record is None: + logger.info("Discarding BFF cookie — no matching session row") return None, True - if not current.needs_refresh( + + if not record.needs_refresh( int(time.time()), self._config.refresh_leeway_seconds ): - self._cache.set(current) - return current, False - - try: - refreshed = self._refresh_client.refresh( + self._cache.set(record) + return record, False + + # Two-tier coalescing: + # + # 1. `get_session_lock` (in-process): collapses N concurrent + # same-session callers within ONE task to a single refresh + # contender. + # 2. `try_acquire_refresh_lock` (cross-process, DDB conditional + # write): one of those contenders, across all tasks, becomes + # the leader and actually calls Cognito. 
Followers poll DDB + # for the leader's persisted tokens. + # + # Without the cross-process lock, two tasks under desiredCount: 2 + # would each call `cognito-idp:initiate_auth` with the same refresh + # token — Cognito rotates on the first; the second fails + # `NotAuthorizedException` and the loser's middleware clears the + # user's cookie. The DDB lock turns that race into a leader/ + # follower handoff so exactly one Cognito refresh happens per + # session per leeway window across the entire fleet. + async with get_session_lock(session_id): + current = await self._repository.get(session_id) + if current is None: + return None, True + if not current.needs_refresh( + int(time.time()), self._config.refresh_leeway_seconds + ): + self._cache.set(current) + return current, False + + # Past absolute lifetime — terminal. Don't burn a Cognito + # refresh-token rotation on a session we'd just write a + # past-dated TTL onto (which would instantly TTL-evict the + # row right after we wrote tokens). Mirrors the slide path + # in `_maybe_slide`, which also short-circuits at the cap. + if ( + current.created_at + self._config.absolute_lifetime_seconds + <= int(time.time()) + ): + logger.info( + "BFF session %s past absolute lifetime — clearing cookie " + "rather than refreshing.", + session_id, + ) + self._cache.invalidate(session_id) + return None, True + + lock_owner = secrets.token_hex(16) + lock_acquired = await self._repository.try_acquire_refresh_lock( + session_id=session_id, + owner=lock_owner, + lock_ttl_seconds=self._refresh_lock_ttl_seconds, + ) + if not lock_acquired: + # FOLLOWER: a peer task is doing the Cognito refresh. + # Wait for their tokens to land on the row, then adopt. + peer = await self._wait_for_peer_refresh( + session_id=session_id, + previous=current, + max_wait_seconds=self._refresh_lock_ttl_seconds, + ) + if peer is None: + # Peer never wrote — likely crashed or hit a Cognito + # error. 
Lock will TTL out; the user's next request + # will get to retry. Fail closed for *this* request. + self._cache.invalidate(session_id) + return None, True + # Cross-task adoption succeeded — peer's refresh is + # authoritative. Log at INFO so CloudWatch can answer + # "how often is cross-task coalescing actually firing?" + logger.info( + "BFF session %s adopted peer task's refreshed tokens " + "(cross-task lock follower path).", + session_id, + ) + self._cache.set(peer) + return peer, False + + # LEADER: do the Cognito refresh and persist. + try: + refreshed = await self._refresh_client.refresh( + username=current.username, + refresh_token=current.cognito_refresh_token, + ) + except CognitoRefreshError: + # Refresh refused — release the lock so a peer can retry + # the next request without waiting for the full lock TTL, + # then treat as terminal for this request. + await self._repository.release_refresh_lock( + session_id, lock_owner + ) + self._cache.invalidate(session_id) + return None, True + + now = int(time.time()) + # Slide the row's DDB TTL alongside the token rotation: the user + # is provably active. Capped at `created_at + absolute_lifetime` + # so a long-lived browser tab can't roll the session forever. + absolute_cap = ( + current.created_at + self._config.absolute_lifetime_seconds + ) + new_ttl = min( + now + self._config.session_ttl_seconds, + absolute_cap, + ) + # Detect refresh-token rotation. When Cognito rotates, the OLD + # refresh token is dead the moment the new one is issued — so a + # DDB write failure here means the session is unrecoverable on + # the *next* request even though *this* one succeeded. Retry + # aggressively, then fail-closed (clear cookie now) so the user + # re-auths immediately rather than getting silently 401'd later. + # Without rotation, the previous refresh token is still valid, + # so a DDB write failure is benign: the next request will just + # re-trigger refresh with the same (still good) refresh token. 
+ rotated = refreshed.refresh_token != current.cognito_refresh_token + persist_ok = await self._persist_refresh( + session_id=session_id, + refreshed=refreshed, + last_seen_at=now, + ttl=new_ttl, + rotated=rotated, + lock_owner=lock_owner, + ) + if not persist_ok: + # Two reasons this lands here: + # (a) Rotation persist exhausted retries — session is + # unrecoverable; clear cookie and force re-auth. + # (b) Lock-owner condition failed (peer took over) — + # re-read DDB and adopt the peer's tokens rather + # than logging the user out. + peer = await self._repository.get(session_id) + if ( + peer is not None + and not peer.needs_refresh( + int(time.time()), + self._config.refresh_leeway_seconds, + ) + ): + self._cache.set(peer) + return peer, False + self._cache.invalidate(session_id) + return None, True + updated = SessionRecord( + session_id=current.session_id, + user_id=current.user_id, username=current.username, - refresh_token=current.cognito_refresh_token, + cognito_access_token=refreshed.access_token, + cognito_refresh_token=refreshed.refresh_token, + id_token=refreshed.id_token, + access_token_exp=refreshed.access_token_exp, + csrf_secret=current.csrf_secret, + created_at=current.created_at, + last_seen_at=now, + ttl=new_ttl, ) - except CognitoRefreshError: - # Refresh refused — treat as terminal, force re-login. - self._cache.invalidate(session_id) - return None, True + self._cache.set(updated) + return updated, False - now = int(time.time()) - # Slide the row's DDB TTL alongside the token rotation: the user - # is provably active. Capped at `created_at + absolute_lifetime` - # so a long-lived browser tab can't roll the session forever. - absolute_cap = ( - current.created_at + self._config.absolute_lifetime_seconds - ) - new_ttl = min( - now + self._config.session_ttl_seconds, - absolute_cap, - ) - # Detect refresh-token rotation. 
When Cognito rotates, the OLD - # refresh token is dead the moment the new one is issued — so a - # DDB write failure here means the session is unrecoverable on - # the *next* request even though *this* one succeeded. Retry - # aggressively, then fail-closed (clear cookie now) so the user - # re-auths immediately rather than getting silently 401'd later. - # Without rotation, the previous refresh token is still valid, - # so a DDB write failure is benign: the next request will just - # re-trigger refresh with the same (still good) refresh token. - rotated = refreshed.refresh_token != current.cognito_refresh_token - persist_ok = await self._persist_refresh( - session_id=session_id, - refreshed=refreshed, - last_seen_at=now, - ttl=new_ttl, - rotated=rotated, - ) - if not persist_ok: - self._cache.invalidate(session_id) - return None, True - updated = SessionRecord( - session_id=current.session_id, - user_id=current.user_id, - username=current.username, - cognito_access_token=refreshed.access_token, - cognito_refresh_token=refreshed.refresh_token, - id_token=refreshed.id_token, - access_token_exp=refreshed.access_token_exp, - csrf_secret=current.csrf_secret, - created_at=current.created_at, - last_seen_at=now, - ttl=new_ttl, - ) - self._cache.set(updated) - return updated, False + return await resolve_once(session_id, _loader) @staticmethod def _reemit_cookies( diff --git a/backend/src/apis/shared/sessions/models.py b/backend/src/apis/shared/sessions/models.py index 771b4b36..664b5591 100644 --- a/backend/src/apis/shared/sessions/models.py +++ b/backend/src/apis/shared/sessions/models.py @@ -332,11 +332,22 @@ class MessageContent(BaseModel): class LatencyMetrics(BaseModel): - """Latency measurements in milliseconds""" + """Latency measurements in milliseconds. + + ``time_to_first_token`` is ``None`` when the provider did not emit + ``timeToFirstByteMs`` and we couldn't compute it locally — distinct from + a measured value of 0ms (which is physically impossible). 
Aggregations + over TTFT must filter ``None`` so a missing measurement doesn't pull + averages toward zero. + """ model_config = ConfigDict(populate_by_name=True) - time_to_first_token: int = Field(..., alias="timeToFirstToken", description="Time from request start to first token received (ms)") + time_to_first_token: Optional[int] = Field( + None, + alias="timeToFirstToken", + description="Time from request start to first token (ms); None if not measured", + ) end_to_end_latency: int = Field(..., alias="endToEndLatency", description="Total time from request start to completion (ms)") diff --git a/backend/src/apis/shared/sessions_bff/config.py b/backend/src/apis/shared/sessions_bff/config.py index 311f1bbe..0a775eb9 100644 --- a/backend/src/apis/shared/sessions_bff/config.py +++ b/backend/src/apis/shared/sessions_bff/config.py @@ -29,9 +29,12 @@ # fail anyway, so there's no value in carrying the cookie further. _DEFAULT_ABSOLUTE_LIFETIME_SECONDS = 30 * 24 * 3600 # Don't write to DDB / re-emit cookies on every request; coalesce to once per -# minute. Tabs that hit the BFF more often than this just ride the existing -# row. -_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS = 60 +# 5 minutes. Kept a strict multiple of `_DEFAULT_REFRESH_LEEWAY_SECONDS` (60s) +# so cache-expiry (TTL = leeway) and slide-throttle boundaries are never +# aligned — without this, a single request crossing the 60s boundary would +# incur BOTH a `get_item` AND an `update_item` on the critical path. De- +# alignment keeps a cache-miss request from also paying the slide-write cost. +_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS = 60 * 5 @dataclass(frozen=True) diff --git a/backend/src/apis/shared/sessions_bff/cookie.py b/backend/src/apis/shared/sessions_bff/cookie.py index 75b215f0..50741c5c 100644 --- a/backend/src/apis/shared/sessions_bff/cookie.py +++ b/backend/src/apis/shared/sessions_bff/cookie.py @@ -1,16 +1,38 @@ -"""Cookie codec — AES-GCM-sealed session id with a KMS-wrapped data key. 
+"""Cookie codec — AES-GCM-sealed session id with a SHA-256-derived data key. -Phase 1 CDK provisioned `BFFCookieSigningKey` as a symmetric KMS key. We -follow the envelope-encryption pattern from the CDK comment: +The CDK provisions two collaborating resources: - 1. At first use, call `kms:GenerateDataKey(KeyId=...)` to get a 256-bit - AES key. The plaintext key is held in process memory; the wrapped - blob is discarded. - 2. Cookie value = base64url( version || nonce || AES-GCM(payload) ). - The KMS key is *not* embedded — rotation requires a task restart, - which is fine for Phase 2 (rotation is on the Phase 7 hardening list). - 3. `unseal` is constant-time on failure: any decode/auth-tag error maps - to `CookieDecodeError` so callers can't time-distinguish failure modes. + - `BFFCookieSigningKey` — symmetric KMS CMK that encrypts the secret at + rest. App-api never calls KMS directly; SecretsManager invokes + `kms:Decrypt` on the caller's behalf when `GetSecretValue` runs. + - `BFFCookieDataKeySecret` — Secrets Manager secret holding a 44-char + high-entropy random string (~261 bits of entropy) generated once at + stack create. + +Every app-api task on first use: + + 1. Reads the secret string from `BFFCookieDataKeySecret`. + 2. Derives the 32-byte AES-256 key with `SHA-256(secret_string)` — a + single-shot KDF that is secure when the input has ≥256 bits of + entropy (a 44-char alphanumeric secret has ~261). + 3. Caches the resulting `AESGCM` cipher as the process-wide singleton. + +This shared-secret design replaces the prior pattern of each task calling +`kms:GenerateDataKey` directly: that produced a fresh random key per +process, so under `desiredCount > 1` cookies sealed by Task A unsealed as +`bad seal` on Task B (every page-load fan-out became a 401 storm). 
It +also avoids the chained `AwsCustomResource` bootstrap design that broke +on first deploy because the framework Lambda JSON-stringifies KMS's +`Uint8Array` ciphertext as `{"0":233,...}`, exceeding the 4 KB +CloudFormation response limit. + + - Cookie value = base64url( version || nonce || AES-GCM(payload) ). + - The KMS key is *not* embedded — rotation requires regenerating the + secret AND restarting all tasks; in-flight cookies sealed under the + old key fail to unseal (Phase 7 hardening: kid-versioned cookies + enable hot rotation). + - `unseal` is constant-time on failure: any decode/auth-tag error maps + to `CookieDecodeError` so callers can't time-distinguish failure modes. Payload is JSON-encoded so we can extend `CookiePayload` later without breaking format compatibility (the version byte gates that). The whole @@ -20,6 +42,7 @@ from __future__ import annotations import base64 +import hashlib import json import logging import os @@ -49,30 +72,55 @@ class CookieDecodeError(Exception): """ +class CookieDataKeyUnavailable(Exception): + """Raised when the data-key secret can't be fetched at startup. + + Distinct from `CookieDecodeError` so callers can return 5xx (transient + infra problem — Secrets Manager unreachable, secret empty) rather than + silently clearing every active user's cookie. + """ + + class CookieCodec: - """Stateful seal/unseal pair backed by a process-cached KMS data key. + """Stateful seal/unseal pair backed by a process-cached AES-GCM cipher. Construct one per process. The first `seal()` or `unseal()` call lazily - fetches the data key via `kms:GenerateDataKey`; subsequent calls reuse - the cached cipher. Thread-safe on the lazy-init path. + fetches the **shared** data-key secret from Secrets Manager and + derives the AES key via SHA-256; subsequent calls reuse the cached + cipher. Thread-safe on the lazy-init path. 
+ + Across multiple ECS tasks (`desiredCount > 1`), every task's codec + derives the **same** plaintext key, so cookies sealed by any task + unseal on any other task. This is the property that the prior + `kms:GenerateDataKey`-per-process design lacked. """ def __init__( self, kms_key_arn: Optional[str] = None, *, - kms_client: Optional[object] = None, + data_key_secret_arn: Optional[str] = None, + secrets_manager_client: Optional[object] = None, ) -> None: + # `kms_key_arn` is retained for config-shape compatibility (and so + # callers can introspect which CMK encrypts the secret at rest), but + # the codec no longer calls KMS directly — SecretsManager handles + # decryption on the caller's behalf when GetSecretValue runs. if kms_key_arn is None: kms_key_arn = os.environ.get("BFF_COOKIE_SIGNING_KEY_ARN") or "" + if data_key_secret_arn is None: + data_key_secret_arn = ( + os.environ.get("BFF_COOKIE_DATA_KEY_SECRET_ARN") or "" + ) self._kms_key_arn = kms_key_arn - self._kms_client = kms_client + self._data_key_secret_arn = data_key_secret_arn + self._secrets_manager_client = secrets_manager_client self._cipher: Optional[AESGCM] = None self._init_lock = Lock() @property def enabled(self) -> bool: - return bool(self._kms_key_arn) + return bool(self._kms_key_arn) and bool(self._data_key_secret_arn) def _ensure_cipher(self) -> AESGCM: if self._cipher is not None: @@ -80,16 +128,40 @@ def _ensure_cipher(self) -> AESGCM: with self._init_lock: if self._cipher is not None: return self._cipher - if not self._kms_key_arn: + # Configuration missing — surface as decode error so the + # middleware path stays the same as for a `bad seal` and clears + # the cookie. (This branch is normally only hit in tests or in + # a misconfigured deploy; the env vars are populated by CDK.) 
+ if not self._kms_key_arn or not self._data_key_secret_arn: raise CookieDecodeError() - kms = self._kms_client or boto3.client("kms") - response = kms.generate_data_key( - KeyId=self._kms_key_arn, - KeySpec="AES_256", - ) - plaintext_key = response["Plaintext"] + + sm = self._secrets_manager_client or boto3.client("secretsmanager") + try: + secret = sm.get_secret_value(SecretId=self._data_key_secret_arn) + except Exception as exc: + # Infra failure — propagate so the request returns 5xx + # rather than silently invalidating sessions. + raise CookieDataKeyUnavailable( + f"Failed to fetch BFF data key secret from Secrets Manager: {exc}" + ) from exc + secret_string = secret.get("SecretString") or "" + if not secret_string: + raise CookieDataKeyUnavailable( + "BFF cookie data key secret is empty — bootstrap missing" + ) + + # Single-shot KDF: SHA-256 of a high-entropy random input + # produces a uniformly distributed 32-byte AES-256 key. The + # CDK generates a 44-char alphanumeric secret (~261 bits of + # entropy), so the SHA-256 output is statistically + # indistinguishable from random. + plaintext_key = hashlib.sha256(secret_string.encode("utf-8")).digest() + self._cipher = AESGCM(plaintext_key) - logger.info("BFF cookie codec initialized (KMS data key fetched)") + logger.info( + "BFF cookie codec initialized " + "(data key derived from Secrets Manager secret via SHA-256)" + ) return self._cipher def seal(self, payload: CookiePayload) -> str: @@ -119,10 +191,10 @@ def unseal(self, value: str) -> CookiePayload: unknown version) raise `CookieDecodeError` with no information about the cause — callers treat every decode failure identically. - Infrastructure failures from `_ensure_cipher` (KMS unavailable, etc.) - propagate up so the middleware can return 5xx instead of silently - clearing the session cookie and forcing every active user to re-login - on a transient KMS hiccup. + Infrastructure failures from `_ensure_cipher` (Secrets Manager + unavailable, etc.) 
propagate up so the middleware can return 5xx + instead of silently clearing the session cookie and forcing every + active user to re-login on a transient hiccup. """ # Cipher acquisition is intentionally outside the try/except below — # a botocore error here must not be coerced into CookieDecodeError. @@ -162,11 +234,16 @@ def unseal(self, value: str) -> CookiePayload: raise CookieDecodeError() -# Process-wide singleton. Every codec instance pulls a fresh random data key -# from KMS, so two codecs in one process can never decrypt each other's -# output — the seal happens in the auth/callback route, the unseal happens -# in `SessionRefreshMiddleware`, and they MUST share a cipher. Treat this as -# the only construction path in production code. +# Process-wide singleton. The first `seal` or `unseal` call fetches the +# shared data-key secret from Secrets Manager and derives the AES key via +# SHA-256; subsequent calls reuse the same `AESGCM` cipher. Across +# processes (e.g. multiple ECS tasks under `desiredCount > 1`), every +# task's singleton derives the **same** plaintext key — so a cookie +# sealed by any task unseals on any other task, including across rolling +# deploys where two task revisions briefly coexist. The seal happens in +# the auth/callback route, the unseal happens in `SessionRefreshMiddleware` +# and the voice WebSocket route, and they all MUST go through this +# singleton. 
_default_codec: Optional[CookieCodec] = None _default_codec_lock = Lock() @@ -190,7 +267,7 @@ def _reset_default_codec_for_tests() -> None: def _set_default_codec_for_tests(codec: CookieCodec) -> None: """Install a pre-built codec (typically with `_cipher` pre-injected so - no KMS call is made) as the process-wide singleton.""" + no Secrets Manager call is made) as the process-wide singleton.""" global _default_codec with _default_codec_lock: _default_codec = codec diff --git a/backend/src/apis/shared/sessions_bff/refresh.py b/backend/src/apis/shared/sessions_bff/refresh.py index 0a2c5709..4c515ac7 100644 --- a/backend/src/apis/shared/sessions_bff/refresh.py +++ b/backend/src/apis/shared/sessions_bff/refresh.py @@ -15,6 +15,7 @@ from __future__ import annotations +import asyncio import base64 import hashlib import hmac @@ -156,9 +157,15 @@ def _idp(self): return self._cognito_idp return boto3.client("cognito-idp", region_name=self._region) - def refresh(self, *, username: str, refresh_token: str) -> RefreshResult: - """Call Cognito to exchange the refresh token for a fresh access - token. Raises `CognitoRefreshError` on any failure.""" + def _refresh_sync(self, *, username: str, refresh_token: str) -> RefreshResult: + """Synchronous Cognito refresh exchange. + + This is the raw boto3 path — kept private so callers can't invoke it + directly from the event loop. Use :meth:`refresh` instead, which + offloads this call via ``asyncio.to_thread`` so the uvicorn event + loop stays responsive (and so other sessions' ``get_session_lock`` + acquisitions can still progress while ours is held). 
+ """ if not self.enabled: raise CognitoRefreshError("BFF refresh client is not configured") @@ -187,3 +194,22 @@ def refresh(self, *, username: str, refresh_token: str) -> RefreshResult: id_token=result.get("IdToken"), access_token_exp=int(time.time()) + expires_in, ) + + async def refresh(self, *, username: str, refresh_token: str) -> RefreshResult: + """Exchange the refresh token for a fresh access token, off the loop. + + Offloads the synchronous boto3 ``initiate_auth`` call via + ``asyncio.to_thread`` so the event loop keeps scheduling other + coroutines while Cognito is in flight. Critically, this matters + while the per-session ``get_session_lock(session_id)`` is held — + unrelated sessions' locks must remain acquirable on the loop. + + The exception contract and :class:`RefreshResult` return shape are + identical to :meth:`_refresh_sync`: ``CognitoRefreshError`` is + raised on any Cognito failure and should be treated as terminal. + """ + return await asyncio.to_thread( + self._refresh_sync, + username=username, + refresh_token=refresh_token, + ) diff --git a/backend/src/apis/shared/sessions_bff/repository.py b/backend/src/apis/shared/sessions_bff/repository.py index c5eb39ff..5ba510be 100644 --- a/backend/src/apis/shared/sessions_bff/repository.py +++ b/backend/src/apis/shared/sessions_bff/repository.py @@ -8,13 +8,28 @@ Attrs: user_id, cognito_access_token, cognito_refresh_token, id_token, access_token_exp, csrf_secret, created_at, last_seen_at, ttl + Cross-task refresh-lock attrs (added at runtime, never on the initial + write — both default to "absent" until a refresh contender writes them): + refresh_lock_owner: short opaque token identifying the leader + refresh_lock_until: epoch seconds; lock is considered expired past this + The `ttl` attribute is wired to DynamoDB TTL so absolute session lifetime is enforced by the table itself — even if a session row is somehow leaked from the cleanup paths, DynamoDB will eventually evict it. 
+ +The refresh-lock attrs coordinate the Cognito refresh exchange across tasks: +the per-process `get_session_lock` and `single_flight` only coalesce within +a single Python process, so under `desiredCount > 1` two tasks can otherwise +issue parallel `cognito-idp:initiate_auth` calls with the same refresh token — +Cognito rotates on the first; the second fails `NotAuthorizedException` and +the loser unilaterally clears the user's cookie. The lock turns this into a +leader/follower handoff: one task does the refresh, the other reads the +freshly persisted tokens off the row. """ from __future__ import annotations +import asyncio import logging import os import time @@ -31,11 +46,16 @@ class SessionRepository: """Async-shaped wrapper over the BFF sessions DynamoDB table. - The methods are declared `async` to match the rest of `apis.shared`, - but boto3 is sync — calls run on the event loop thread. That mirrors - `UserRepository` and is intentional: refresh-storm coalescing happens - one layer up via `get_session_lock()`, so the lookup itself never - fans out enough to need a thread pool. + The methods are ``async def`` and offload each boto3 call via + ``asyncio.to_thread`` so the uvicorn event loop stays free to schedule + unrelated coroutines during the DynamoDB round-trip. Without this + offload, a single slow DDB call freezes every in-flight request — and + under page-load fan-out the blocking calls serialize, producing the + 80s+ latency tails that motivated the event-loop-blocking bugfix. + + The ``_item_to_record`` translation and the post-read TTL + defense-in-depth check run on the calling coroutine (pure Python, no + I/O); only the boto3 round-trip is offloaded. 
""" def __init__(self, table_name: Optional[str] = None) -> None: @@ -100,8 +120,14 @@ def _record_to_item(record: SessionRecord) -> dict: async def get(self, session_id: str) -> Optional[SessionRecord]: if not self._enabled: return None + + key = self._key(session_id) + + def _call() -> dict: + return self._table.get_item(Key=key) + try: - response = self._table.get_item(Key=self._key(session_id)) + response = await asyncio.to_thread(_call) except ClientError as exc: logger.error("BFF session get_item failed for %s: %s", session_id, exc) return None @@ -118,7 +144,13 @@ async def get(self, session_id: str) -> Optional[SessionRecord]: async def put(self, record: SessionRecord) -> None: if not self._enabled: return - self._table.put_item(Item=self._record_to_item(record)) + + item = self._record_to_item(record) + + def _call() -> None: + self._table.put_item(Item=item) + + await asyncio.to_thread(_call) async def update_tokens( self, @@ -129,6 +161,7 @@ async def update_tokens( access_token_exp: int, last_seen_at: int, ttl: Optional[int] = None, + expected_lock_owner: Optional[str] = None, ) -> None: """Atomically replace the Cognito tokens after a refresh. @@ -138,6 +171,21 @@ async def update_tokens( is supplied, the row's DynamoDB TTL slides forward in the same write — a refresh proves the user is active, so the session row's expiry should slide alongside it. + + When `expected_lock_owner` is supplied, the write is conditional on + the row's `refresh_lock_owner` attribute strictly matching. The lock + attributes are also REMOVED in the same write, releasing the + cross-task lock alongside the token rotation. The condition fires + on two distinct stale-leader cases that both must NOT stomp: + + 1. A peer holds the lock right now (their owner != ours) — we never + had it or our acquisition was stale. + 2. 
A peer held the lock, completed the refresh, and `REMOVE`d the + attrs — the row has no lock owner at all but our tokens are + now older than the row's persisted state. + + Both surface as `ConditionalCheckFailedException`; the caller + re-reads the row and adopts the peer's tokens instead of stomping. """ if not self._enabled: return @@ -159,7 +207,7 @@ async def update_tokens( if ttl is not None: update_expr += ", #ttl = :ttl" expr_values[":ttl"] = ttl - kwargs = { + kwargs: dict = { "Key": self._key(session_id), "UpdateExpression": update_expr, "ExpressionAttributeValues": expr_values, @@ -167,7 +215,130 @@ async def update_tokens( if ttl is not None: # `ttl` is a reserved word in DynamoDB expressions. kwargs["ExpressionAttributeNames"] = {"#ttl": "ttl"} - self._table.update_item(**kwargs) + + if expected_lock_owner is not None: + # Atomically release the cross-task refresh lock alongside the + # token write. The condition is strict — `refresh_lock_owner` + # MUST equal our owner. We don't accept "lock attrs absent" + # because that's exactly the stale-leader stomp case: a peer + # whose lock TTL'd, took over, refreshed, and persisted (which + # REMOVEs the lock attrs) — letting `attribute_not_exists` + # match here would let our stale tokens overwrite the peer's + # freshly rotated ones, silently logging the user out on the + # next request when Cognito rejects our (now-revoked) refresh + # token. The leader always set these attrs in + # `try_acquire_refresh_lock`, so the strict form is correct + # in every legitimate flow. 
+ kwargs["UpdateExpression"] = ( + update_expr + " REMOVE refresh_lock_owner, refresh_lock_until" + ) + expr_values[":owner"] = expected_lock_owner + kwargs["ConditionExpression"] = "refresh_lock_owner = :owner" + + def _call() -> None: + self._table.update_item(**kwargs) + + await asyncio.to_thread(_call) + + async def try_acquire_refresh_lock( + self, + session_id: str, + owner: str, + lock_ttl_seconds: int, + ) -> bool: + """Atomically claim leadership of a cross-task Cognito refresh. + + Conditional `UpdateItem` on the session row: succeeds (returns True) + only if no peer holds the lock OR the holder's lock has expired + (`refresh_lock_until < now`). On contention returns False — the + caller should poll the row for the leader's persisted tokens. + + Lock TTL bounds the worst case: a leader that crashes mid-refresh + strands the lock for at most `lock_ttl_seconds` (we use 30s in the + middleware), after which any peer can re-acquire and retry. + + Returns False on `ConditionalCheckFailedException`. Other DDB + errors propagate so the caller can surface them as 5xx — silently + suppressing them would create a "neither leader nor follower" gap. + """ + if not self._enabled: + return False + now = int(time.time()) + kwargs: dict = { + "Key": self._key(session_id), + "UpdateExpression": ( + "SET refresh_lock_owner = :owner, " + "refresh_lock_until = :until" + ), + # `attribute_exists(PK)` guards against UpdateItem's + # upsert-by-default behavior — without it, a logout that races + # the refresh path (deletes the row between `repository.get()` + # and this call) would let us create a phantom row containing + # only the lock attrs and no `ttl`, which DDB TTL would never + # reap. With it, lock acquisition on a missing row fails + # cleanly via ConditionalCheckFailedException → False. 
+ "ConditionExpression": ( + "attribute_exists(PK) AND (" + "attribute_not_exists(refresh_lock_until) " + "OR refresh_lock_until < :now)" + ), + "ExpressionAttributeValues": { + ":owner": owner, + ":until": now + lock_ttl_seconds, + ":now": now, + }, + } + + def _call() -> bool: + try: + self._table.update_item(**kwargs) + return True + except ClientError as exc: + if ( + exc.response.get("Error", {}).get("Code") + == "ConditionalCheckFailedException" + ): + return False + raise + + return await asyncio.to_thread(_call) + + async def release_refresh_lock(self, session_id: str, owner: str) -> None: + """Release the cross-task refresh lock if `owner` still holds it. + + Used when the leader's Cognito refresh fails terminally and we want + a peer to be able to retry without waiting for the full lock TTL. + Best-effort: a `ConditionalCheckFailedException` (lock TTL'd or + re-acquired) is treated as a no-op. + + `update_tokens` clears the lock attributes atomically with a + successful refresh, so this is only for the failure path. 
+ """ + if not self._enabled: + return + kwargs: dict = { + "Key": self._key(session_id), + "UpdateExpression": ( + "REMOVE refresh_lock_owner, refresh_lock_until" + ), + "ConditionExpression": "refresh_lock_owner = :owner", + "ExpressionAttributeValues": {":owner": owner}, + } + + def _call() -> None: + try: + self._table.update_item(**kwargs) + except ClientError as exc: + code = exc.response.get("Error", {}).get("Code") + if code == "ConditionalCheckFailedException": + return # peer re-acquired or lock TTL'd — fine + logger.warning( + "BFF refresh lock release failed for %s: %s", + session_id, + exc, + ) + + await asyncio.to_thread(_call) async def touch_last_seen( self, @@ -195,8 +366,12 @@ async def touch_last_seen( expr_values[":ttl"] = ttl kwargs["UpdateExpression"] = update_expr kwargs["ExpressionAttributeNames"] = {"#ttl": "ttl"} - try: + + def _call() -> None: self._table.update_item(**kwargs) + + try: + await asyncio.to_thread(_call) except ClientError as exc: # Touch failures are non-critical — log and move on rather than # surfacing as a request error. @@ -205,4 +380,10 @@ async def touch_last_seen( async def delete(self, session_id: str) -> None: if not self._enabled: return - self._table.delete_item(Key=self._key(session_id)) + + key = self._key(session_id) + + def _call() -> None: + self._table.delete_item(Key=key) + + await asyncio.to_thread(_call) diff --git a/backend/src/apis/shared/sessions_bff/single_flight.py b/backend/src/apis/shared/sessions_bff/single_flight.py new file mode 100644 index 00000000..0af06ab6 --- /dev/null +++ b/backend/src/apis/shared/sessions_bff/single_flight.py @@ -0,0 +1,117 @@ +"""Per-session single-flight primitive — session-resolve path coalescing. + +`get_session_lock` in `lock.py` only serializes the Cognito refresh exchange. +It does NOT coalesce the upstream unseal -> `SessionCache.get` -> +`SessionRepository.get` -> `needs_refresh` sequence. 
When Angular's ~8-endpoint +page-load fan-out hits a cold cache window, each coroutine independently +observes the miss and each runs its own blocking `get_item`, producing ~N +DynamoDB round-trips per cache window per session. + +The primitive in this module addresses that gap with a per-session +`asyncio.Future`: the first caller (the "leader") registers a Future under the +session id, runs the loader, and stores the result/exception on the Future. +Concurrent callers that arrive while the leader is still running (the +"followers") find the existing Future and simply `await` it, sharing the +leader's single DynamoDB call. + +This is a separate primitive from `get_session_lock`. The existing lock scope +around the Cognito exchange is preserved end-to-end — this single-flight sits +upstream of it. +""" + +from __future__ import annotations + +import asyncio +from threading import Lock as _ThreadLock +from typing import Awaitable, Callable, Dict, Optional, Tuple + +from apis.shared.sessions_bff.models import SessionRecord + +# Module-level registry of in-flight resolves keyed by `session_id`. +# Unlike `lock.py`, we use a plain `dict` rather than a `WeakValueDictionary` +# because an `asyncio.Future` that is only referenced by its awaiters would +# otherwise be collected if every awaiter was garbage-collected before +# resolution — the leader is responsible for removing its entry in a +# `finally` block, which keeps lifetime management explicit. +_inflight: Dict[str, "asyncio.Future[Tuple[Optional[SessionRecord], bool]]"] = {} +_registry_guard = _ThreadLock() + + +async def resolve_once( + session_id: str, + loader_coro_factory: Callable[ + [], Awaitable[Tuple[Optional[SessionRecord], bool]] + ], +) -> Tuple[Optional[SessionRecord], bool]: + """Run `loader_coro_factory()` at most once per concurrent `session_id`. 
+ + Leader semantics: the first caller for a given `session_id` creates a new + `asyncio.Future`, registers it under the thread-lock-guarded registry, + runs the loader, sets the result or exception on the Future, removes the + entry from the registry, and returns the value. + + Follower semantics: any caller that finds an existing Future `await`s it + and returns its value, sharing the leader's single loader invocation. + + Exception propagation: an exception raised by the loader is set on the + Future so it propagates to the leader AND to every follower currently + awaiting. The registry entry is always removed before the leader returns + (success or failure), so any subsequent call after the failure starts a + fresh leader. + + Isolation: distinct `session_id`s do not share a Future — the registry is + keyed by `session_id`. + """ + # Fast path: look for an existing Future without holding the thread lock. + existing = _inflight.get(session_id) + if existing is not None: + return await existing + + # Slow path: register a new Future under the thread lock, double-checking + # so two coroutines on different threads can't race-create two Futures. + loop = asyncio.get_event_loop() + with _registry_guard: + existing = _inflight.get(session_id) + if existing is not None: + # Another caller won the race — fall through to follower path. + future = existing + is_leader = False + else: + future = loop.create_future() + _inflight[session_id] = future + is_leader = True + + if not is_leader: + return await future + + # Leader path — run the loader, set the result/exception, and clean up. + try: + result = await loader_coro_factory() + except BaseException as exc: # noqa: BLE001 — we must propagate everything + if not future.done(): + future.set_exception(exc) + # Mark the exception as retrieved on the leader's side. 
Followers + # still observe it when they `await` the Future; this only + # silences the "Future exception was never retrieved" warning + # emitted when no follower ever attached. + future.exception() + with _registry_guard: + # Only clear our own entry — another leader may have taken over + # after we set the exception, though in practice that's only + # possible if every follower has already consumed the Future. + if _inflight.get(session_id) is future: + del _inflight[session_id] + raise + else: + if not future.done(): + future.set_result(result) + with _registry_guard: + if _inflight.get(session_id) is future: + del _inflight[session_id] + return result + + +def _reset_for_tests() -> None: + """Test-only escape hatch — drop all tracked in-flight Futures.""" + with _registry_guard: + _inflight.clear() diff --git a/backend/tests/agents/main_agent/core/test_model_config.py b/backend/tests/agents/main_agent/core/test_model_config.py index 9489f909..e47ddc95 100644 --- a/backend/tests/agents/main_agent/core/test_model_config.py +++ b/backend/tests/agents/main_agent/core/test_model_config.py @@ -98,9 +98,10 @@ def test_explicit_gemini_overrides_gpt_model_id(self): class TestToBedrockConfig: """Validates: Requirements 1.6, 1.7""" - def test_bedrock_config_with_caching_disabled_due_to_bedrock_limitation(self): - """Req 1.6 — caching_enabled=True but cache_config omitted due to - Bedrock limitation with non-PDF document blocks. See model_config.py TODO.""" + def test_bedrock_config_with_caching_enabled_currently_omits_cache_config(self): + """Req 1.6 — caching_enabled=True but cache_config omitted while + Bedrock prompt caching rollout is deferred. 
The SDK-side blocker is + resolved in strands 1.39.0; see model_config.py for the deferral note.""" cfg = ModelConfig(caching_enabled=True) result = cfg.to_bedrock_config() diff --git a/backend/tests/agents/main_agent/streaming/test_per_message_cost_attribution.py b/backend/tests/agents/main_agent/streaming/test_per_message_cost_attribution.py new file mode 100644 index 00000000..0c24bcc8 --- /dev/null +++ b/backend/tests/agents/main_agent/streaming/test_per_message_cost_attribution.py @@ -0,0 +1,310 @@ +"""Regression test for per-message cost attribution on multi-LLM-call turns. + +Strands emits two sources of usage during a tool-use turn: + 1. Per-LLM-call metadata via ``ModelStreamChunkEvent`` (one per assistant + message), carrying just that call's tokens. + 2. A final ``AgentResultEvent`` whose ``AgentResult.metrics`` is an + ``EventLoopMetrics`` with ``accumulated_usage`` summed across every call + in the turn. + +``stream_processor._handle_metadata_events`` extracts both. The stream +coordinator routes any ``metadata`` event into +``per_message_metadata[current_assistant_message_index]["usage"].update(...)``. +Because the AgentResult event arrives *after* every ``message_stop`` (so the +index still points at the last assistant message), a naive ``.update()`` on +the same key overwrites the last message's per-call usage with the +turn-cumulative usage. Pricing each per-message entry and summing then +double-counts every earlier message's input tokens. + +This module locks the contract: + - The per-call metadata events stay typed ``metadata`` (per-message track). + - The result-extracted cumulative metadata is typed ``metadata_summary`` + (turn-summary track), so it never lands in per_message_metadata. + +If the contract regresses, simulating the dispatch loop will reproduce the +double-count and these assertions will fail. 
+""" + +from typing import Any, Dict, List + +from agents.main_agent.streaming.stream_processor import _handle_metadata_events + + +# Realistic per-call metadata chunk shape: Bedrock's `metadata` chunk wrapped +# inside Strands' ModelStreamChunkEvent (`{"event": chunk}`). +def _per_call_metadata_event(usage: Dict[str, int]) -> Dict[str, Any]: + return {"event": {"metadata": {"usage": usage, "metrics": {"latencyMs": 100}}}} + + +# Realistic AgentResultEvent shape. EventLoopMetrics has accumulated_usage +# summed across all calls; _handle_metadata_events extracts it via __dict__. +class _FakeEventLoopMetrics: + def __init__(self, accumulated_usage: Dict[str, int]) -> None: + self.accumulated_usage = accumulated_usage + self.accumulated_metrics = {"latencyMs": 250} + + +class _FakeAgentResult: + def __init__(self, accumulated_usage: Dict[str, int]) -> None: + self.metrics = _FakeEventLoopMetrics(accumulated_usage) + + +def _agent_result_event(accumulated_usage: Dict[str, int]) -> Dict[str, Any]: + return {"result": _FakeAgentResult(accumulated_usage)} + + +def _dispatch_to_per_message( + processed_events: List[Dict[str, Any]], + per_message_metadata: List[Dict[str, Any]], + current_index: int, +) -> None: + """Mimic stream_coordinator's per-message routing for a single source event. + + Only ``metadata`` events flow into ``per_message_metadata`` — the + ``metadata_summary`` track is for the turn-level accumulator and is + intentionally not routed here. 
+ """ + for processed in processed_events: + if processed.get("type") != "metadata": + continue + usage = processed.get("data", {}).get("usage") + if not usage: + continue + per_message_metadata[current_index]["usage"].update(usage) + + +class TestPerMessageAttributionTwoCallTurn: + """Reproduce the dispatch sequence of a 2-call tool-use turn.""" + + CALL_0_USAGE = {"inputTokens": 1000, "outputTokens": 50, "totalTokens": 1050} + CALL_1_USAGE = {"inputTokens": 1300, "outputTokens": 80, "totalTokens": 1380} + TURN_CUMULATIVE = { + "inputTokens": CALL_0_USAGE["inputTokens"] + CALL_1_USAGE["inputTokens"], + "outputTokens": CALL_0_USAGE["outputTokens"] + CALL_1_USAGE["outputTokens"], + "totalTokens": CALL_0_USAGE["totalTokens"] + CALL_1_USAGE["totalTokens"], + } + + def test_per_call_metadata_routes_to_per_message_track(self): + """Each per-call metadata event carries one message's tokens, no more.""" + events = _handle_metadata_events(_per_call_metadata_event(self.CALL_0_USAGE)) + metadata_events = [e for e in events if e["type"] == "metadata"] + assert len(metadata_events) == 1 + assert metadata_events[0]["data"]["usage"] == self.CALL_0_USAGE + + def test_result_cumulative_does_not_route_to_per_message_track(self): + """The AgentResult cumulative must not be a `metadata` event. + + If it is, the dispatch loop overwrites the last per-message entry + with cumulative usage, double-counting earlier messages' input + tokens at pricing time. + """ + events = _handle_metadata_events(_agent_result_event(self.TURN_CUMULATIVE)) + per_message_typed = [e for e in events if e["type"] == "metadata"] + assert per_message_typed == [], ( + "AgentResult cumulative usage was emitted as a `metadata` event; " + "it would clobber the last per-message entry. Expected " + "`metadata_summary` so it stays on the turn-summary track only." 
+ ) + + def test_result_cumulative_emitted_on_summary_track(self): + """Result-extracted cumulative is still emitted — just on metadata_summary.""" + events = _handle_metadata_events(_agent_result_event(self.TURN_CUMULATIVE)) + summary_events = [e for e in events if e["type"] == "metadata_summary"] + assert len(summary_events) == 1 + assert summary_events[0]["data"]["usage"] == self.TURN_CUMULATIVE + + def test_full_turn_dispatch_preserves_per_call_attribution(self): + """Drive the full event sequence and assert no double-counting.""" + per_message_metadata = [ + {"usage": {}, "metrics": {}}, + {"usage": {}, "metrics": {}}, + ] + + # Message 0's per-call metadata fires while index = 0. + _dispatch_to_per_message( + _handle_metadata_events(_per_call_metadata_event(self.CALL_0_USAGE)), + per_message_metadata, + current_index=0, + ) + # Message 1's per-call metadata fires while index = 1. + _dispatch_to_per_message( + _handle_metadata_events(_per_call_metadata_event(self.CALL_1_USAGE)), + per_message_metadata, + current_index=1, + ) + # AgentResult cumulative fires last, with index still at 1. If this + # leaks onto the `metadata` track, msg 1's usage gets clobbered with + # the turn cumulative — input tokens for msg 0 would be summed twice + # when pricing each entry independently. + _dispatch_to_per_message( + _handle_metadata_events(_agent_result_event(self.TURN_CUMULATIVE)), + per_message_metadata, + current_index=1, + ) + + assert per_message_metadata[0]["usage"] == self.CALL_0_USAGE + assert per_message_metadata[1]["usage"] == self.CALL_1_USAGE + + # Pricing each entry independently must equal the cumulative input, + # not 2× msg 0's input + msg 1's input. 
+ summed_input = ( + per_message_metadata[0]["usage"]["inputTokens"] + + per_message_metadata[1]["usage"]["inputTokens"] + ) + assert summed_input == self.TURN_CUMULATIVE["inputTokens"] + + +class TestSummaryAccumulatorAcceptsBothTracks: + """The stream_processor main loop must keep `accumulated_metadata` cumulative. + + Per-call events accumulate via ``.update()`` (last-write-wins), so before + the cumulative arrives the accumulator only holds the last call's usage — + which is *not* cumulative. The accumulator must therefore consume both + `metadata` and `metadata_summary` events for the final summary emission + to carry true turn totals. + """ + + def test_accumulator_processes_both_tracks(self): + """Walk the same sequence the main loop does and check the final state.""" + accumulated: Dict[str, Any] = {"usage": {}, "metrics": {}} + + sequence = [ + _per_call_metadata_event(TestPerMessageAttributionTwoCallTurn.CALL_0_USAGE), + _per_call_metadata_event(TestPerMessageAttributionTwoCallTurn.CALL_1_USAGE), + _agent_result_event(TestPerMessageAttributionTwoCallTurn.TURN_CUMULATIVE), + ] + + for raw in sequence: + for processed in _handle_metadata_events(raw): + if processed.get("type") in ("metadata", "metadata_summary"): + data = processed.get("data", {}) + if "usage" in data: + accumulated["usage"].update(data["usage"]) + if "metrics" in data: + accumulated["metrics"].update(data["metrics"]) + + assert accumulated["usage"] == TestPerMessageAttributionTwoCallTurn.TURN_CUMULATIVE + + +class TestStreamCoordinatorContextOccupancy: + """The final SSE `usage` field must reflect current context, not sums. + + Bedrock reports each LLM call's `inputTokens` as the FULL context size + sent on that call. 
For a 2-call tool turn: + call_1.input = 1000 (system + user_msg) + call_2.input = 2500 (system + user_msg + tool_use + tool_result) + + Strands' EventLoopMetrics.accumulated_usage sums these into 3500 — but + the actual context occupancy is 2500, the size of the most recent call. + The frontend uses the SSE metadata `usage` to drive the context-% + badge, and the backend uses it to decide whether to trigger + compaction; both need "current context size", not the cross-call sum. + + This locks in the contract that stream_coordinator's accumulated_metadata + (which feeds the final SSE metadata) takes per-call values via + last-write-wins from `metadata` events and IGNORES the cross-call + cumulative carried on `metadata_summary`. + """ + + CALL_0_USAGE = {"inputTokens": 1000, "outputTokens": 50, "totalTokens": 1050} + CALL_1_USAGE = {"inputTokens": 2500, "outputTokens": 100, "totalTokens": 2600} + TURN_CUMULATIVE = { + "inputTokens": 3500, # 1000 + 2500 — Strands' accumulated_usage + "outputTokens": 150, + "totalTokens": 3650, + } + + def _simulate_stream_coordinator_accumulator( + self, events: List[Dict[str, Any]] + ) -> Dict[str, Any]: + """Mirror stream_coordinator's accumulator branches for a sequence of + already-processed events. Returns the resulting accumulated_metadata. + + - `metadata` events → update accumulated_metadata.usage/metrics. + - `metadata_summary` events → first_token_time only; usage/metrics ignored. 
+ """ + accumulated: Dict[str, Any] = {"usage": {}, "metrics": {}} + for processed in events: + event_type = processed.get("type") + event_data = processed.get("data", {}) + if event_type == "metadata": + if "usage" in event_data: + accumulated["usage"].update(event_data["usage"]) + if "metrics" in event_data: + accumulated["metrics"].update(event_data["metrics"]) + # metadata_summary intentionally does NOT touch usage/metrics here + return accumulated + + def test_final_usage_reflects_last_call_not_sum(self): + """End of a 2-call tool turn — usage should be call_2's, not the sum.""" + # Drive the realistic event order through _handle_metadata_events + # exactly as stream_processor would, then through the coordinator's + # accumulator branches. + raw_events = [ + _per_call_metadata_event(self.CALL_0_USAGE), + _per_call_metadata_event(self.CALL_1_USAGE), + _agent_result_event(self.TURN_CUMULATIVE), + ] + processed: List[Dict[str, Any]] = [] + for raw in raw_events: + processed.extend(_handle_metadata_events(raw)) + + result = self._simulate_stream_coordinator_accumulator(processed) + + assert result["usage"] == self.CALL_1_USAGE, ( + "Final accumulated usage must equal the last per-call's full input " + "(current context size), not Strands' summed-across-calls value. " + "If this regresses, the context-% badge and compaction trigger " + "will inflate by ~the size of every prior call in the turn." 
+ ) + + def test_compaction_input_tokens_match_current_context(self): + """The trigger threshold computation in stream_coordinator uses + `usage.inputTokens + cacheReadInputTokens + cacheWriteInputTokens`.""" + call_with_cache = { + "inputTokens": 200, + "outputTokens": 80, + "totalTokens": 280, + "cacheReadInputTokens": 2000, + "cacheWriteInputTokens": 300, + } + prior_call = { + "inputTokens": 100, + "outputTokens": 40, + "totalTokens": 140, + "cacheReadInputTokens": 0, + "cacheWriteInputTokens": 800, + } + cumulative_after_two_calls = { + "inputTokens": 300, # would be summed by Strands + "outputTokens": 120, + "totalTokens": 420, + "cacheReadInputTokens": 2000, + "cacheWriteInputTokens": 1100, # would be summed by Strands + } + + raw_events = [ + _per_call_metadata_event(prior_call), + _per_call_metadata_event(call_with_cache), + _agent_result_event(cumulative_after_two_calls), + ] + processed: List[Dict[str, Any]] = [] + for raw in raw_events: + processed.extend(_handle_metadata_events(raw)) + + result = self._simulate_stream_coordinator_accumulator(processed) + usage = result["usage"] + + # Compaction sums all three input buckets — must equal call_with_cache's + # totals (current context), not the summed-across-calls totals. 
+ compaction_input = ( + usage.get("inputTokens", 0) + + usage.get("cacheReadInputTokens", 0) + + usage.get("cacheWriteInputTokens", 0) + ) + expected_current_context = ( + call_with_cache["inputTokens"] + + call_with_cache["cacheReadInputTokens"] + + call_with_cache["cacheWriteInputTokens"] + ) + assert compaction_input == expected_current_context diff --git a/backend/tests/agents/main_agent/streaming/test_stream_processor.py b/backend/tests/agents/main_agent/streaming/test_stream_processor.py index 04a99318..76bdf45a 100644 --- a/backend/tests/agents/main_agent/streaming/test_stream_processor.py +++ b/backend/tests/agents/main_agent/streaming/test_stream_processor.py @@ -608,7 +608,13 @@ def test_empty_event_returns_empty(self): assert _handle_metadata_events({}) == [] def test_result_with_accumulated_usage(self): - """result.metrics.accumulated_usage produces a metadata event.""" + """result.metrics.accumulated_usage rides the metadata_summary track. + + It must NOT be emitted as a `metadata` event — those land in + per_message_metadata in the stream coordinator and would clobber + the last assistant message's per-call usage with a turn-cumulative + value, double-counting earlier messages at pricing time. 
+ """ raw = { "result": { "metrics": { @@ -621,9 +627,11 @@ def test_result_with_accumulated_usage(self): } } events = _handle_metadata_events(raw) - m = [e for e in events if e["type"] == "metadata"] - assert len(m) >= 1 - assert m[0]["data"]["usage"]["inputTokens"] == 500 + per_message_typed = [e for e in events if e["type"] == "metadata"] + summary_typed = [e for e in events if e["type"] == "metadata_summary"] + assert per_message_typed == [] + assert len(summary_typed) == 1 + assert summary_typed[0]["data"]["usage"]["inputTokens"] == 500 # --------------------------------------------------------------------------- diff --git a/backend/tests/apis/shared/middleware/test_session_refresh_bug_condition.py b/backend/tests/apis/shared/middleware/test_session_refresh_bug_condition.py new file mode 100644 index 00000000..3445180c --- /dev/null +++ b/backend/tests/apis/shared/middleware/test_session_refresh_bug_condition.py @@ -0,0 +1,739 @@ +"""Bug condition exploration property tests for SessionRefreshMiddleware event-loop blocking. + +Property 1: Bug Condition — Event-Loop Non-Blocking, Coalesced, Window-Staggered, Fire-and-Forget + +This file encodes the EXPECTED behavior (Property 1 / Expected Behavior 2.1–2.7) from +the design document. Each sub-condition test surfaces a counterexample that demonstrates +the corresponding sub-condition (1.1–1.7) of `isBugCondition` from design.md. + +CRITICAL: These tests MUST FAIL on unfixed code — failure confirms the bug exists. +They will PASS after the fix (task 3 series) is implemented: + - Repository/Cognito offload via asyncio.to_thread (2.1, 2.2) + - Per-session single-flight for the resolve path (2.3) + - Strict-multiple windows (throttle=300s, leeway=60s) (2.4) + - Fire-and-forget slide-write (2.5) + - appApi.desiredCount >= 2 (2.6) + - Bounded blocking DDB calls across fan-out (2.7) + +Scoped PBT Approach: each sub-condition is reproduced by a concrete, deterministic +scenario under pytest-asyncio. 
Hypothesis is used on the two sub-conditions that +generalize over a family of inputs (fan-out size for 1.3 / 1.7). + +Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 +""" + +from __future__ import annotations + +import asyncio +import json +import secrets +import time +from pathlib import Path +from typing import Any, Optional +from unittest.mock import MagicMock + +import httpx +import pytest +from cryptography.hazmat.primitives.ciphers.aead import AESGCM +from fastapi import FastAPI, Request +from hypothesis import HealthCheck, given, settings +from hypothesis import strategies as st + +from apis.shared.middleware.session_refresh import SessionRefreshMiddleware +from apis.shared.sessions_bff import lock as lock_module +from apis.shared.sessions_bff.cache import SessionCache +from apis.shared.sessions_bff.config import ( + BFFConfig, + SESSION_COOKIE_NAME, + _DEFAULT_REFRESH_LEEWAY_SECONDS, + _DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS, +) +from apis.shared.sessions_bff.cookie import CookieCodec +from apis.shared.sessions_bff.lock import get_session_lock +from apis.shared.sessions_bff.models import CookiePayload, SessionRecord +from apis.shared.sessions_bff.refresh import ( + CognitoRefreshClient, + _reset_secret_cache_for_tests, +) +from apis.shared.sessions_bff.repository import SessionRepository + + +# ═══════════════════════════════════════════════════════════════════════════ +# Fixtures and helpers +# ═══════════════════════════════════════════════════════════════════════════ + + +class InstrumentedTable: + """Synchronous fake of a boto3 DynamoDB Table. + + Records call counts and can inject a `time.sleep` delay to block the + event loop thread on unfixed code, letting us prove whether the caller + yielded to the loop while the boto3 call was in flight. + + Mirrors the tiny subset of the Table API that `SessionRepository` uses: + `get_item`, `update_item`, `put_item`, `delete_item`. 
+ """ + + def __init__( + self, + *, + record: Optional[SessionRecord] = None, + delay_s: float = 0.0, + ) -> None: + self._delay_s = delay_s + self._record = record + self.get_item_calls = 0 + self.update_item_calls = 0 + self.put_item_calls = 0 + self.delete_item_calls = 0 + + def _sleep(self) -> None: + if self._delay_s > 0: + time.sleep(self._delay_s) + + def get_item(self, Key: dict) -> dict: + self.get_item_calls += 1 + self._sleep() + if self._record is None: + return {} + return {"Item": _record_to_item(self._record)} + + def update_item(self, **kwargs: Any) -> dict: + self.update_item_calls += 1 + self._sleep() + return {} + + def put_item(self, Item: dict) -> dict: + self.put_item_calls += 1 + self._sleep() + return {} + + def delete_item(self, Key: dict) -> dict: + self.delete_item_calls += 1 + self._sleep() + return {} + + +def _record_to_item(r: SessionRecord) -> dict: + return { + "PK": f"SESSION#{r.session_id}", + "SK": "META", + "session_id": r.session_id, + "user_id": r.user_id, + "username": r.username, + "cognito_access_token": r.cognito_access_token, + "cognito_refresh_token": r.cognito_refresh_token, + "id_token": r.id_token, + "access_token_exp": r.access_token_exp, + "csrf_secret": r.csrf_secret, + "created_at": r.created_at, + "last_seen_at": r.last_seen_at, + "ttl": r.ttl, + } + + +def _make_repo(table: InstrumentedTable) -> SessionRepository: + """Build a SessionRepository backed by an InstrumentedTable. + + Bypasses boto3.resource() initialization by starting disabled, then + flipping `_enabled` and injecting the fake table. Exercises the real + SessionRepository async-method bodies — which is the point for + sub-condition 1.1 (offload). 
+ """ + repo = SessionRepository(table_name="") + repo._enabled = True + repo._table = table # type: ignore[assignment] + repo._table_name = "test-bff-sessions" + return repo + + +def _make_codec() -> CookieCodec: + codec = CookieCodec(kms_key_arn="arn:aws:kms:fake") + # Pre-inject an AES-GCM cipher so no KMS call is attempted. + codec._cipher = AESGCM(secrets.token_bytes(32)) + return codec + + +def _make_record( + *, + session_id: str = "sess-001", + access_token_exp: Optional[int] = None, + last_seen_at: Optional[int] = None, + created_at: Optional[int] = None, +) -> SessionRecord: + now = int(time.time()) + return SessionRecord( + session_id=session_id, + user_id="user-sub-001", + username="alice", + cognito_access_token="access.original", + cognito_refresh_token="refresh.original", + id_token="id.original", + access_token_exp=access_token_exp if access_token_exp is not None else now + 3600, + csrf_secret="csrf-secret-deadbeef", + created_at=created_at if created_at is not None else now, + last_seen_at=last_seen_at if last_seen_at is not None else now, + ttl=now + 28800, + ) + + +def _enabled_config(**overrides: Any) -> BFFConfig: + defaults: dict[str, Any] = dict( + sessions_table_name="tbl", + cookie_signing_key_arn="arn:aws:kms:fake", + session_ttl_seconds=28800, + refresh_leeway_seconds=_DEFAULT_REFRESH_LEEWAY_SECONDS, + cognito_bff_app_client_id="client-id", + cognito_bff_app_client_secret_arn="arn:secret", + inference_api_url=None, + absolute_lifetime_seconds=30 * 24 * 3600, + sliding_renewal_throttle_seconds=_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS, + ) + defaults.update(overrides) + return BFFConfig(**defaults) + + +def _build_app( + *, + config: BFFConfig, + repository: SessionRepository, + codec: CookieCodec, + refresh_client: Any, + cache: Optional[SessionCache] = None, +) -> FastAPI: + app = FastAPI() + app.add_middleware( + SessionRefreshMiddleware, + config=config, + repository=repository, + cookie_codec=codec, + refresh_client=refresh_client, + 
cache=cache or SessionCache(ttl_seconds=60), + ) + + @app.get("/echo") + async def echo(request: Request) -> dict: + record = getattr(request.state, "bff_session", None) + return { + "has_session": record is not None, + "session_id": record.session_id if record else None, + } + + return app + + +@pytest.fixture(autouse=True) +def _reset_session_state() -> Any: + """Clear process-wide state between tests so storm/coalescing behavior + stays independent across cases.""" + lock_module._reset_for_tests() + _reset_secret_cache_for_tests() + yield + lock_module._reset_for_tests() + _reset_secret_cache_for_tests() + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.1 — SessionRepository.* must offload sync boto3 to a threadpool +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +@pytest.mark.parametrize( + "method_name", + ["get", "touch_last_seen", "update_tokens", "put", "delete"], +) +async def test_1_1_session_repository_methods_offload_sync_boto3( + method_name: str, +) -> None: + """(1.1) Repository offload. + + Each SessionRepository async method that wraps boto3 must execute its + boto3 call off the event loop thread. We prove this by running the + method concurrently with a 50ms marker coroutine against a 500ms + slow-stubbed table. + + - Fixed code: marker completes in ~0.05s while repo call is still in flight. + - Unfixed code: sync boto3 freezes the loop for the full 500ms, starving + the marker so it only completes once the method returns. + + Expected Behavior 2.1 (design.md). 
+ """ + record = _make_record(session_id=f"sess-1-1-{method_name}") + table = InstrumentedTable(record=record, delay_s=0.5) + repo = _make_repo(table) + + now = int(time.time()) + if method_name == "get": + op = repo.get(record.session_id) + elif method_name == "touch_last_seen": + op = repo.touch_last_seen(record.session_id, last_seen_at=now) + elif method_name == "update_tokens": + op = repo.update_tokens( + session_id=record.session_id, + access_token="access.rotated", + refresh_token="refresh.rotated", + id_token=None, + access_token_exp=now + 3600, + last_seen_at=now, + ) + elif method_name == "put": + op = repo.put(record) + elif method_name == "delete": + op = repo.delete(record.session_id) + else: + pytest.fail(f"unknown method_name: {method_name}") + + marker_elapsed: dict[str, float] = {} + + async def marker(start: float) -> None: + await asyncio.sleep(0.05) + marker_elapsed["t"] = time.monotonic() - start + + t0 = time.monotonic() + marker_task = asyncio.create_task(marker(t0)) + await op + op_elapsed = time.monotonic() - t0 + await marker_task + + # Sanity: the stubbed boto3 call really took ~500ms. + assert op_elapsed >= 0.4, ( + f"[1.1/{method_name}] Sanity: stubbed {method_name} should take ~500ms, " + f"got {op_elapsed:.3f}s — the InstrumentedTable delay may not be wired." + ) + # Counterexample: on unfixed code, the marker sits behind the frozen loop. + assert "t" in marker_elapsed, ( + f"[1.1/{method_name}] Marker coroutine never completed — " + f"event loop fully frozen by sync boto3." + ) + assert marker_elapsed["t"] < 0.25, ( + f"[1.1/{method_name}] Marker coroutine starved by sync boto3: " + f"marker elapsed={marker_elapsed['t']:.3f}s, " + f"op elapsed={op_elapsed:.3f}s. " + f"SessionRepository.{method_name} must offload its boto3 call via " + "asyncio.to_thread so the event loop continues scheduling other " + "coroutines for the round-trip duration." 
+ ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.2 — CognitoRefreshClient.refresh must offload initiate_auth +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +async def test_1_2_cognito_refresh_offloads_sync_initiate_auth() -> None: + """(1.2) Cognito offload. + + CognitoRefreshClient.refresh must execute cognito-idp:initiate_auth + off the event loop thread, including while the per-session + get_session_lock(session_id) is held. We prove this by running + refresh concurrently with: + (a) a 50ms marker coroutine; + (b) an unrelated get_session_lock(other_session_id) acquisition. + + - Fixed code: both complete promptly while refresh is still in flight. + - Unfixed code: the sync initiate_auth freezes the loop, starving + the marker and delaying the unrelated lock acquisition. + + Expected Behavior 2.2 (design.md). + """ + slow_cognito = MagicMock() + + def slow_initiate_auth(**_kwargs: Any) -> dict: + time.sleep(0.5) + return { + "AuthenticationResult": { + "AccessToken": "access.fresh", + "RefreshToken": "refresh.fresh", + "IdToken": "id.fresh", + "ExpiresIn": 3600, + } + } + + slow_cognito.initiate_auth.side_effect = slow_initiate_auth + + slow_secrets = MagicMock() + slow_secrets.get_secret_value.return_value = {"SecretString": "client-secret"} + + client = CognitoRefreshClient( + app_client_id="client-id", + app_client_secret_arn="arn:secret", + cognito_idp_client=slow_cognito, + secrets_manager_client=slow_secrets, + ) + + marker_elapsed: dict[str, float] = {} + lock_elapsed: dict[str, float] = {} + refresh_elapsed: dict[str, float] = {} + + async def call_refresh(start: float) -> None: + result = client.refresh(username="alice", refresh_token="refresh.original") + # Support both the unfixed (sync) and fixed (coroutine) shape. 
+ if asyncio.iscoroutine(result): + result = await result + refresh_elapsed["t"] = time.monotonic() - start + + async def marker(start: float) -> None: + await asyncio.sleep(0.05) + marker_elapsed["t"] = time.monotonic() - start + + async def acquire_other_lock(start: float) -> None: + other_lock = get_session_lock("other-session-id") + async with other_lock: + pass + lock_elapsed["t"] = time.monotonic() - start + + t0 = time.monotonic() + marker_task = asyncio.create_task(marker(t0)) + other_lock_task = asyncio.create_task(acquire_other_lock(t0)) + await call_refresh(t0) + await marker_task + await other_lock_task + + # Sanity: the stubbed initiate_auth really took ~500ms. + assert refresh_elapsed.get("t", 0.0) >= 0.4, ( + f"[1.2] Sanity: stubbed refresh should take ~500ms, " + f"got {refresh_elapsed.get('t', 0.0):.3f}s — stub not wired." + ) + assert "t" in marker_elapsed, ( + "[1.2] Marker coroutine never completed — loop fully frozen." + ) + assert marker_elapsed["t"] < 0.25, ( + f"[1.2] Marker coroutine starved by sync Cognito initiate_auth: " + f"marker elapsed={marker_elapsed['t']:.3f}s, " + f"refresh elapsed={refresh_elapsed['t']:.3f}s. " + "CognitoRefreshClient.refresh must offload initiate_auth via " + "asyncio.to_thread so other coroutines — including those for " + "different session_ids — make progress while the per-session " + "asyncio.Lock is held." + ) + assert lock_elapsed["t"] < 0.25, ( + f"[1.2] Unrelated get_session_lock('other-session-id') acquisition " + f"starved by sync Cognito call: lock elapsed={lock_elapsed['t']:.3f}s, " + f"refresh elapsed={refresh_elapsed['t']:.3f}s. " + "Even uncontended locks for different sessions block when the " + "event loop thread is frozen." 
+ ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.3 — Resolve-path coalescing: N concurrent reqs → 1 get_item +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +@pytest.mark.parametrize("fanout", [8]) +async def test_1_3_concurrent_same_session_fanout_coalesces_to_one_get_item( + fanout: int, +) -> None: + """(1.3) Resolve-path coalescing. + + N concurrent SessionRefreshMiddleware.dispatch calls for the same + session_id with a cold SessionCache and a valid sealed cookie must + result in exactly ONE DynamoDB get_item invocation. The upstream + unseal → SessionCache.get → SessionRepository.get path needs + coalescing via a per-session single-flight primitive. + + - Fixed code: 1 get_item (single-flight leader + followers). + - Unfixed code: N get_item calls — the existing get_session_lock only + wraps the Cognito exchange, not the resolve path. + + Expected Behavior 2.3 (design.md). + """ + record = _make_record(session_id="sess-1-3") + # Small delay so concurrent dispatches overlap long enough for each + # to observe cache-miss independently on unfixed code. 
+ table = InstrumentedTable(record=record, delay_s=0.05) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + cache = SessionCache(ttl_seconds=60) # cold → cache miss + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + cache=cache, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + transport = httpx.ASGITransport(app=app) + + async with httpx.AsyncClient( + transport=transport, base_url="http://test" + ) as client: + client.cookies.set(SESSION_COOKIE_NAME, sealed) + responses = await asyncio.gather( + *(client.get("/echo") for _ in range(fanout)) + ) + + for r in responses: + assert r.status_code == 200 + + assert table.get_item_calls == 1, ( + f"[1.3] Fan-out of {fanout} concurrent same-session requests against " + f"a cold cache must coalesce to exactly one get_item call. " + f"Observed: {table.get_item_calls} get_item calls (bug target: {fanout}). " + "A per-session asyncio.Future single-flight is required upstream of " + "SessionRepository.get." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.4 — Cache window and slide throttle must be de-aligned +# ═══════════════════════════════════════════════════════════════════════════ + + +@given( + throttle=st.just(_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS), + leeway=st.just(_DEFAULT_REFRESH_LEEWAY_SECONDS), +) +@settings(max_examples=1, deadline=None, suppress_health_check=[HealthCheck.function_scoped_fixture]) +def test_1_4a_default_throttle_is_strict_multiple_of_leeway( + throttle: int, leeway: int +) -> None: + """(1.4) Window de-alignment — config invariant. + + _DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS must be a strict multiple of + _DEFAULT_REFRESH_LEEWAY_SECONDS AND strictly greater. This de-aligns + cache-expiry (TTL = leeway) from slide-throttle expiry so a single + request crossing one boundary does not also cross the other. 
+ + - Fixed code: throttle=300, leeway=60 → 300 > 60 and 300 % 60 == 0. + - Unfixed code: both default to 60 → 60 > 60 is False. + + Expected Behavior 2.4 (design.md). + """ + assert throttle > leeway, ( + f"[1.4a] Sliding-renewal throttle ({throttle}s) must be strictly " + f"greater than refresh leeway ({leeway}s) to de-align boundaries." + ) + assert throttle % leeway == 0, ( + f"[1.4a] Sliding-renewal throttle ({throttle}s) must be a strict " + f"multiple of refresh leeway ({leeway}s)." + ) + + +@pytest.mark.asyncio +async def test_1_4b_single_request_at_boundary_skips_slide_write() -> None: + """(1.4) Window de-alignment — runtime behavior. + + A single request with SessionCache TTL just elapsed AND + (now - last_seen_at) == refresh_leeway_seconds must issue AT MOST ONE + of {get_item, update_item} on the critical path. On unfixed code the + aligned 60s windows guarantee BOTH calls on the same request (the + cache miss drives get_item AND the past-throttle state drives + update_item). + + Expected Behavior 2.4 (design.md). + """ + now = int(time.time()) + record = _make_record( + session_id="sess-1-4b", + last_seen_at=now - _DEFAULT_REFRESH_LEEWAY_SECONDS, + ) + table = InstrumentedTable(record=record, delay_s=0.01) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + cache = SessionCache(ttl_seconds=60) # cold → cache miss + # Use the real default throttle so the test fails on unfixed code + # (throttle == leeway == 60s) and passes on fixed code (throttle=300s, + # leeway=60s). 
+ app = _build_app( + config=_enabled_config( + sliding_renewal_throttle_seconds=_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS, + ), + repository=repo, + codec=codec, + refresh_client=refresh_client, + cache=cache, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + transport = httpx.ASGITransport(app=app) + + async with httpx.AsyncClient( + transport=transport, base_url="http://test" + ) as client: + client.cookies.set(SESSION_COOKIE_NAME, sealed) + response = await client.get("/echo") + assert response.status_code == 200 + + ddb_calls = table.get_item_calls + table.update_item_calls + assert ddb_calls <= 1, ( + f"[1.4b] Single request at cache/throttle boundary issued " + f"{table.get_item_calls} get_item + {table.update_item_calls} " + f"update_item = {ddb_calls} DDB calls on critical path. " + "Windows must be de-aligned (throttle > leeway, strict multiple) " + "so a cache miss never also triggers a slide write." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.5 — _maybe_slide must fire-and-forget the DDB write +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +async def test_1_5_slide_write_is_fire_and_forget() -> None: + """(1.5) Fire-and-forget slide. + + When a slide is warranted, the response path must NOT wait on + touch_last_seen. Stubbing update_item with a 500ms delay, the total + dispatch elapsed must stay well under 500ms. + + - Fixed code: _maybe_slide schedules touch_last_seen as an + asyncio.Task and returns synchronously → elapsed ~= handler time. + - Unfixed code: _maybe_slide awaits touch_last_seen inline → + elapsed >= 500ms. + + Expected Behavior 2.5 (design.md). 
+ """ + now = int(time.time()) + record = _make_record( + session_id="sess-1-5", + last_seen_at=now - 3600, # past any reasonable throttle window + ) + table = InstrumentedTable(record=record, delay_s=0.5) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + + # Pre-seed the cache so repo.get is not on the path — this test isolates + # the slide-write-on-response-path question from the coalescing question. + cache = SessionCache(ttl_seconds=60) + cache.set(record) + + # Use a small throttle so the slide is warranted (last_seen == now-3600). + app = _build_app( + config=_enabled_config(sliding_renewal_throttle_seconds=60), + repository=repo, + codec=codec, + refresh_client=refresh_client, + cache=cache, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + transport = httpx.ASGITransport(app=app) + + async with httpx.AsyncClient( + transport=transport, base_url="http://test" + ) as client: + client.cookies.set(SESSION_COOKIE_NAME, sealed) + t0 = time.monotonic() + response = await client.get("/echo") + elapsed = time.monotonic() - t0 + + assert response.status_code == 200 + # Sanity: the slide write was in fact requested (fires at least once; + # in the fixed scenario it's still counted on the fake table — it just + # doesn't block the response path). + assert table.update_item_calls >= 1, ( + f"[1.5] Sanity: the slide path should have fired update_item at least " + f"once, got {table.update_item_calls}. Check last_seen_at setup." + ) + assert elapsed < 0.25, ( + f"[1.5] Dispatch elapsed={elapsed:.3f}s; the response waited on the " + "500ms stubbed update_item. _maybe_slide must dispatch the DDB write " + "as a detached asyncio.Task so the response returns without blocking." 
+ ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.6 — Production deployment must have concurrency slack +# ═══════════════════════════════════════════════════════════════════════════ + + +def test_1_6_cdk_app_api_desired_count_at_least_two() -> None: + """(1.6) Concurrency slack at deployment. + + infrastructure/cdk.context.json must set appApi.desiredCount >= 2 so + a single blocked event loop on one ECS task cannot stall all ingress. + + Expected Behavior 2.6 (design.md). + """ + cdk_context_path = ( + Path(__file__).resolve().parents[5] / "infrastructure" / "cdk.context.json" + ) + assert cdk_context_path.exists(), ( + f"[1.6] Expected cdk.context.json at {cdk_context_path}" + ) + ctx = json.loads(cdk_context_path.read_text()) + app_api = ctx.get("appApi", {}) + desired = app_api.get("desiredCount") + assert isinstance(desired, int) and desired >= 2, ( + f"[1.6] appApi.desiredCount must be >= 2 in the production context " + f"(found: {desired!r}). Single-task deployment cannot absorb a " + "blocked event loop — a slow AWS call on one task halts every " + "concurrent request." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Sub-condition 1.7 — Fan-out at cache boundary must not amplify to N*2 DDB calls +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +@pytest.mark.parametrize("fanout", [8]) +async def test_1_7_fanout_at_boundary_bounded_blocking_ddb_calls( + fanout: int, +) -> None: + """(1.7) Fan-out amplification. + + N concurrent requests for the same session at a cache-boundary moment + must produce AT MOST 2 blocking DDB calls across the entire fan-out + (ideally 1 get_item and 0 slide-writes when windows are de-aligned). + + - Fixed code: single-flight + de-aligned windows → ≤ 1 get_item + + ≤ 1 update_item = ≤ 2. 
+ - Unfixed code: each coroutine observes cache miss + past-throttle + independently on its local SessionRecord copy and issues its own + get_item + update_item → 2*N blocking calls. + + Expected Behavior 2.7 (design.md). + """ + now = int(time.time()) + record = _make_record( + session_id="sess-1-7", + last_seen_at=now - _DEFAULT_REFRESH_LEEWAY_SECONDS, # past aligned throttle on unfixed + ) + table = InstrumentedTable(record=record, delay_s=0.01) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + cache = SessionCache(ttl_seconds=60) # cold → cache miss + app = _build_app( + config=_enabled_config( + sliding_renewal_throttle_seconds=_DEFAULT_SLIDING_RENEWAL_THROTTLE_SECONDS, + ), + repository=repo, + codec=codec, + refresh_client=refresh_client, + cache=cache, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + transport = httpx.ASGITransport(app=app) + + async with httpx.AsyncClient( + transport=transport, base_url="http://test" + ) as client: + client.cookies.set(SESSION_COOKIE_NAME, sealed) + responses = await asyncio.gather( + *(client.get("/echo") for _ in range(fanout)) + ) + + for r in responses: + assert r.status_code == 200 + + blocking_calls = table.get_item_calls + table.update_item_calls + assert blocking_calls <= 2, ( + f"[1.7] Fan-out of {fanout} concurrent same-session requests at a " + f"cache-boundary moment produced {table.get_item_calls} get_item + " + f"{table.update_item_calls} update_item = {blocking_calls} blocking " + f"DDB calls (bug: ~{2 * fanout}). Single-flight coalescing AND " + "window de-alignment are required." 
+ ) diff --git a/backend/tests/apis/shared/middleware/test_session_refresh_preservation.py b/backend/tests/apis/shared/middleware/test_session_refresh_preservation.py new file mode 100644 index 00000000..4f42c8db --- /dev/null +++ b/backend/tests/apis/shared/middleware/test_session_refresh_preservation.py @@ -0,0 +1,1213 @@ +"""Preservation property tests for SessionRefreshMiddleware. + +Property 2: BFF Middleware Contracts Unchanged for Non-Buggy Inputs. + +This file encodes the observable contracts (Preservation Requirements 3.1–3.11) +that the event-loop-blocking fix MUST preserve. Tests are run on UNFIXED code +first and MUST PASS — confirming the baseline behavior to lock in. After the +fix lands (task 3.x series) these same tests must continue to pass with no +modifications. + +Observation-first methodology: each preservation test encodes behavior +OBSERVED on today's code — response status, `Set-Cookie` headers (including +every attribute), `request.state.bff_session`, `request.state.bff_csrf_token`, +DDB call counts, Cognito call counts, KMS/Secrets Manager call counts — rather +than re-derived from the spec. + +The hypothesis strategies cover the axes that exist today: `is_enabled()` +true/false, `__Host-bff_session` cookie present/absent, cookie seal +valid/invalid/expired, `SessionCache` hit/miss, `needs_refresh` yes/no, +refresh-token rotation yes/no, slide warranted yes/no, absolute-lifetime cap +passed yes/no, request method safe/unsafe. Inputs that themselves reproduce +an isBugCondition sub-condition (fan-outs at aligned boundaries, slide timing +vs response timing, etc.) are avoided — preservation is about the externally +observable contract, not about how many DDB calls happen under bug-triggering +inputs. 
+ +Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11 +""" + +from __future__ import annotations + +import asyncio +import secrets +import time +from typing import Any, Optional +from unittest.mock import AsyncMock, MagicMock + +import httpx +import pytest +from cryptography.hazmat.primitives.ciphers.aead import AESGCM +from fastapi import FastAPI, Request +from fastapi.testclient import TestClient +from hypothesis import HealthCheck, given, settings +from hypothesis import strategies as st + +from apis.shared.middleware.csrf import CSRFMiddleware +from apis.shared.middleware.session_refresh import SessionRefreshMiddleware +from apis.shared.sessions_bff import cache as cache_module +from apis.shared.sessions_bff import cookie as cookie_module +from apis.shared.sessions_bff import lock as lock_module +from apis.shared.sessions_bff import refresh as refresh_module +from apis.shared.sessions_bff.cache import SessionCache +from apis.shared.sessions_bff.config import ( + BFFConfig, + CSRF_COOKIE_NAME, + CSRF_HEADER_NAME, + SESSION_COOKIE_NAME, + _DEFAULT_REFRESH_LEEWAY_SECONDS, +) +from apis.shared.sessions_bff.cookie import CookieCodec, get_default_codec +from apis.shared.sessions_bff.csrf import CSRFHelper +from apis.shared.sessions_bff.models import CookiePayload, SessionRecord +from apis.shared.sessions_bff.refresh import ( + CognitoRefreshClient, + CognitoRefreshError, + RefreshResult, + _reset_secret_cache_for_tests, + resolve_bff_client_secret, +) +from apis.shared.sessions_bff.repository import SessionRepository + + +# ═══════════════════════════════════════════════════════════════════════════ +# Shared helpers — duplicated from test_session_refresh_bug_condition.py for +# test-file isolation. Keep the two files' helper shapes in sync. +# ═══════════════════════════════════════════════════════════════════════════ + + +class InstrumentedTable: + """Synchronous fake of a boto3 DynamoDB Table. 
+ + Records call counts so preservation tests can assert "zero AWS calls" + for dormant / no-cookie pass-through paths, and "exactly one get_item" + for the refresh-storm coalescing contract. + + `update_item` writes are classified into three kinds by inspecting the + `UpdateExpression`: + - `lock_acquire_calls`: cross-task refresh-lock acquisition (writes + `refresh_lock_owner` + `refresh_lock_until`, no token columns). + - `token_persist_calls`: token rotation write (sets + `cognito_access_token` etc., usually also REMOVE-ing the lock). + - `slide_calls`: sliding-renewal touch (writes only `last_seen_at` + and optionally `ttl`). + `update_item_calls` remains the total (sum) so existing assertions on + "any update_item issued" continue to hold. The injected side-effect is + applied only to the token-persist path so tests that simulate "DDB + throttled during persist" don't accidentally fail at the lock-acquire + write — that's a different code path with different recovery semantics. + """ + + def __init__( + self, + *, + record: Optional[SessionRecord] = None, + delay_s: float = 0.0, + update_item_side_effect: Optional[Exception] = None, + ) -> None: + self._delay_s = delay_s + self._record = record + self._update_item_side_effect = update_item_side_effect + self.get_item_calls = 0 + self.update_item_calls = 0 + self.lock_acquire_calls = 0 + self.token_persist_calls = 0 + self.slide_calls = 0 + self.put_item_calls = 0 + self.delete_item_calls = 0 + + def _sleep(self) -> None: + if self._delay_s > 0: + time.sleep(self._delay_s) + + def get_item(self, Key: dict) -> dict: + self.get_item_calls += 1 + self._sleep() + if self._record is None: + return {} + return {"Item": _record_to_item(self._record)} + + @staticmethod + def _classify_update(update_expr: str) -> str: + """Classify which middleware path issued this update_item. + + Token persist writes always set `cognito_access_token`. Pure lock + acquires write `refresh_lock_owner` without touching tokens. 
Slide + writes touch only `last_seen_at` (+ optionally `ttl`). + """ + if "cognito_access_token" in update_expr: + return "token_persist" + if "refresh_lock_owner" in update_expr: + return "lock_acquire" + return "slide" + + def update_item(self, **kwargs: Any) -> dict: + self.update_item_calls += 1 + kind = self._classify_update(kwargs.get("UpdateExpression", "")) + if kind == "token_persist": + self.token_persist_calls += 1 + elif kind == "lock_acquire": + self.lock_acquire_calls += 1 + else: + self.slide_calls += 1 + self._sleep() + # Side-effect injection applies only to the token-persist path — + # tests that simulate "rotation persist exhausted" mean exactly + # that write, not the upstream lock-acquire. + if self._update_item_side_effect is not None and kind == "token_persist": + raise self._update_item_side_effect + return {} + + def put_item(self, Item: dict) -> dict: + self.put_item_calls += 1 + self._sleep() + return {} + + def delete_item(self, Key: dict) -> dict: + self.delete_item_calls += 1 + self._sleep() + return {} + + +def _record_to_item(r: SessionRecord) -> dict: + return { + "PK": f"SESSION#{r.session_id}", + "SK": "META", + "session_id": r.session_id, + "user_id": r.user_id, + "username": r.username, + "cognito_access_token": r.cognito_access_token, + "cognito_refresh_token": r.cognito_refresh_token, + "id_token": r.id_token, + "access_token_exp": r.access_token_exp, + "csrf_secret": r.csrf_secret, + "created_at": r.created_at, + "last_seen_at": r.last_seen_at, + "ttl": r.ttl, + } + + +def _make_repo(table: InstrumentedTable) -> SessionRepository: + """SessionRepository backed by an InstrumentedTable. + + Bypasses boto3.resource() by starting disabled, then flipping `_enabled` + and injecting the fake table. Exercises the real repository async-method + bodies so preservation tests see the production code path. 
+ """ + repo = SessionRepository(table_name="") + repo._enabled = True + repo._table = table # type: ignore[assignment] + repo._table_name = "test-bff-sessions" + return repo + + +def _make_codec() -> CookieCodec: + codec = CookieCodec(kms_key_arn="arn:aws:kms:fake") + codec._cipher = AESGCM(secrets.token_bytes(32)) + return codec + + +def _make_record( + *, + session_id: str = "sess-pres-001", + access_token_exp: Optional[int] = None, + last_seen_at: Optional[int] = None, + created_at: Optional[int] = None, + ttl: Optional[int] = None, +) -> SessionRecord: + now = int(time.time()) + return SessionRecord( + session_id=session_id, + user_id="user-sub-001", + username="alice", + cognito_access_token="access.original", + cognito_refresh_token="refresh.original", + id_token="id.original", + access_token_exp=access_token_exp if access_token_exp is not None else now + 3600, + csrf_secret="csrf-secret-deadbeef", + created_at=created_at if created_at is not None else now, + last_seen_at=last_seen_at if last_seen_at is not None else now, + ttl=ttl if ttl is not None else now + 28800, + ) + + +def _enabled_config(**overrides: Any) -> BFFConfig: + defaults: dict[str, Any] = dict( + sessions_table_name="tbl", + cookie_signing_key_arn="arn:aws:kms:fake", + session_ttl_seconds=28800, + refresh_leeway_seconds=_DEFAULT_REFRESH_LEEWAY_SECONDS, + cognito_bff_app_client_id="client-id", + cognito_bff_app_client_secret_arn="arn:secret", + inference_api_url=None, + absolute_lifetime_seconds=30 * 24 * 3600, + sliding_renewal_throttle_seconds=60, + ) + defaults.update(overrides) + return BFFConfig(**defaults) + + +def _disabled_config() -> BFFConfig: + return BFFConfig( + sessions_table_name=None, + cookie_signing_key_arn=None, + session_ttl_seconds=28800, + refresh_leeway_seconds=60, + cognito_bff_app_client_id=None, + cognito_bff_app_client_secret_arn=None, + inference_api_url=None, + ) + + +def _build_app( + *, + config: BFFConfig, + repository: Any, + codec: CookieCodec, + 
refresh_client: Any, + cache: Optional[SessionCache] = None, + include_csrf: bool = False, +) -> FastAPI: + app = FastAPI() + if include_csrf: + # Added first → innermost relative to SessionRefreshMiddleware. + # Request order: SessionRefresh → CSRF → route. + app.add_middleware(CSRFMiddleware) + app.add_middleware( + SessionRefreshMiddleware, + config=config, + repository=repository, + cookie_codec=codec, + refresh_client=refresh_client, + cache=cache or SessionCache(ttl_seconds=60), + ) + + @app.get("/echo") + async def echo_get(request: Request) -> dict: + record = getattr(request.state, "bff_session", None) + csrf = getattr(request.state, "bff_csrf_token", None) + return { + "has_session": record is not None, + "session_id": record.session_id if record else None, + "access_token": record.cognito_access_token if record else None, + "csrf_token": csrf, + } + + @app.post("/submit") + async def submit_post(request: Request) -> dict: + record = getattr(request.state, "bff_session", None) + return { + "has_session": record is not None, + "session_id": record.session_id if record else None, + } + + return app + + +@pytest.fixture(autouse=True) +def _reset_session_state() -> Any: + """Clear process-wide state between tests.""" + lock_module._reset_for_tests() + _reset_secret_cache_for_tests() + cache_module._reset_default_cache_for_tests() + cookie_module._reset_default_codec_for_tests() + yield + lock_module._reset_for_tests() + _reset_secret_cache_for_tests() + cache_module._reset_default_cache_for_tests() + cookie_module._reset_default_codec_for_tests() + + +# ═══════════════════════════════════════════════════════════════════════════ +# Set-Cookie parsing helpers — the preservation contract on cookie attributes +# is observed from the raw `Set-Cookie` header, so we parse it here. 
+# ═══════════════════════════════════════════════════════════════════════════ + + +def _parse_set_cookie(header: str) -> dict[str, Any]: + """Parse a raw Set-Cookie header into {name, value, attributes}. + + Attributes are keyed case-folded for reliable membership checks. + Boolean attributes (HttpOnly, Secure) map to True. + """ + parts = [p.strip() for p in header.split(";")] + name, _, value = parts[0].partition("=") + attrs: dict[str, Any] = {} + for attr in parts[1:]: + if "=" in attr: + k, _, v = attr.partition("=") + attrs[k.strip().lower()] = v.strip() + else: + attrs[attr.strip().lower()] = True + return {"name": name.strip(), "value": value.strip(), "attrs": attrs} + + +def _find_set_cookies( + response_headers: Any, cookie_name: str +) -> list[dict[str, Any]]: + """Return every parsed Set-Cookie for a given cookie name.""" + parsed = [] + for header in response_headers.get_list("set-cookie"): + pc = _parse_set_cookie(header) + if pc["name"] == cookie_name: + parsed.append(pc) + return parsed + + +def _wait_for(predicate: Any, *, timeout_s: float = 1.0, interval_s: float = 0.01) -> bool: + """Poll ``predicate`` until it returns truthy or ``timeout_s`` elapses. + + The slide-write path became fire-and-forget in task 3.5 — `_maybe_slide` + schedules the DDB `touch_last_seen` on a detached `asyncio.create_task` + and returns the Max-Age synchronously. `TestClient` returns the response + before the scheduled task has a chance to run on slower CI schedulers, + so assertions about `update_item_calls == 1` must poll rather than + sample immediately. The observable external contract (cookie attributes, + Max-Age, response body) is unchanged — only the internal timing of the + background write moves. 
+ """ + deadline = time.monotonic() + timeout_s + while time.monotonic() < deadline: + if predicate(): + return True + time.sleep(interval_s) + return predicate() + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.1 — Dormant pass-through with zero AWS calls +# ═══════════════════════════════════════════════════════════════════════════ + + +# Cookie-safe ASCII: printable, no semicolons/commas/whitespace/control chars — +# httpx's cookiejar only accepts ASCII values and rejects the RFC 6265 separators. +_COOKIE_SAFE_ALPHABET = st.characters( + min_codepoint=0x21, + max_codepoint=0x7E, + blacklist_characters=";, \t\"\\", +) + + +@given( + method=st.sampled_from(["GET", "POST", "PUT", "PATCH", "DELETE", "HEAD", "OPTIONS"]), + path=st.sampled_from(["/echo", "/submit"]), + with_cookie=st.booleans(), + cookie_value=st.text(alphabet=_COOKIE_SAFE_ALPHABET, min_size=0, max_size=64), +) +@settings( + max_examples=30, + deadline=None, + suppress_health_check=[HealthCheck.function_scoped_fixture], +) +def test_3_1_dormant_passthrough_zero_aws_calls( + method: str, path: str, with_cookie: bool, cookie_value: str +) -> None: + """(3.1) Dormant pass-through. + + When `BFFConfig.is_enabled() == False`, every request shape (method, + path, cookie present/absent) short-circuits through `call_next(request)` + with zero DDB calls and zero Cognito calls. + """ + table = InstrumentedTable() + repo = _make_repo(table) + # Force the repo into the "enabled" posture so we'd observe a call if + # the middleware mistakenly went past its `is_enabled()` guard. 
+ codec = _make_codec() + refresh_client = MagicMock() + app = _build_app( + config=_disabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + ) + + cookies: dict[str, str] = {} + if with_cookie: + cookies[SESSION_COOKIE_NAME] = cookie_value + + with TestClient(app) as client: + response = client.request(method, path, cookies=cookies) + + # OPTIONS/HEAD may be allowed or not depending on route — we only care + # that the middleware did not touch AWS regardless of status. + assert response.status_code < 500, ( + f"[3.1] dormant pass-through produced 5xx for {method} {path}: " + f"{response.status_code}" + ) + assert table.get_item_calls == 0, ( + f"[3.1] dormant middleware issued {table.get_item_calls} get_item " + f"calls — must be zero when is_enabled() == False" + ) + assert table.update_item_calls == 0, ( + f"[3.1] dormant middleware issued {table.update_item_calls} " + "update_item calls — must be zero when is_enabled() == False" + ) + assert table.put_item_calls == 0 + assert table.delete_item_calls == 0 + refresh_client.refresh.assert_not_called() + # No Set-Cookie emitted by the middleware when dormant. + assert response.headers.get_list("set-cookie") == [] + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.2 — No-cookie pass-through with zero AWS calls +# ═══════════════════════════════════════════════════════════════════════════ + + +@given( + method=st.sampled_from(["GET", "POST", "PUT", "PATCH", "DELETE"]), + path=st.sampled_from(["/echo", "/submit"]), +) +@settings( + max_examples=20, + deadline=None, + suppress_health_check=[HealthCheck.function_scoped_fixture], +) +def test_3_2_no_cookie_passthrough_zero_aws_calls( + method: str, path: str +) -> None: + """(3.2) No-cookie pass-through. 
+ + When `is_enabled() == True` but no `__Host-bff_session` cookie is present + (Bearer-token requests, anonymous endpoints), the middleware must pass + through with zero AWS calls and no `request.state.bff_session`. + """ + table = InstrumentedTable() + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + ) + + with TestClient(app) as client: + response = client.request(method, path) + + assert response.status_code < 500 + # When the call returned 200 with body, the handler reports has_session=False. + if response.status_code == 200 and response.headers.get( + "content-type", "" + ).startswith("application/json"): + body = response.json() + assert body["has_session"] is False, ( + "[3.2] state.bff_session must NOT be set when no cookie is present" + ) + assert table.get_item_calls == 0, ( + f"[3.2] no-cookie path issued {table.get_item_calls} get_item calls" + ) + assert table.update_item_calls == 0 + assert table.put_item_calls == 0 + assert table.delete_item_calls == 0 + refresh_client.refresh.assert_not_called() + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.3 — Unrecoverable cookie clears BOTH cookies with matching attrs +# ═══════════════════════════════════════════════════════════════════════════ + + +def _assert_clear_cookie_attrs(parsed: dict[str, Any]) -> None: + """Attributes observed today on a cleared BFF cookie: + + Max-Age=0; Path=/; SameSite=lax; Secure + + HttpOnly is present on the session cookie only (intentional: the CSRF + cookie is JS-readable). All other attributes are identical across both + cookies. 
+ """ + attrs = parsed["attrs"] + assert attrs.get("max-age") == "0", ( + f"[3.3] clear must set Max-Age=0; got attrs={attrs}" + ) + assert attrs.get("path") == "/", ( + f"[3.3] clear must set Path=/; got attrs={attrs}" + ) + assert attrs.get("samesite") == "lax", ( + f"[3.3] clear must set SameSite=lax; got attrs={attrs}" + ) + assert attrs.get("secure") is True, ( + f"[3.3] clear must set Secure; got attrs={attrs}" + ) + + +@pytest.mark.parametrize( + "scenario", + ["bad_seal", "missing_row", "expired_row", "terminal_refresh_error"], +) +def test_3_3_unrecoverable_cookie_clears_both_cookies_with_matching_attrs( + scenario: str, +) -> None: + """(3.3) Unrecoverable cookie → clear both. + + Bad-seal, missing-row, expired-row, and terminal-`CognitoRefreshError` + inputs all produce Set-Cookie for both `__Host-bff_session` and + `__Host-bff_csrf` with `Max-Age=0` and the today-observed attribute set. + The HttpOnly attribute intentionally differs between the two (session + is HttpOnly; CSRF is JS-readable by design); all other attrs match. + """ + codec = _make_codec() + refresh_client = MagicMock() + + if scenario == "bad_seal": + table = InstrumentedTable() + cookie_value = "not-a-sealed-cookie" + elif scenario == "missing_row": + # No record on the table — get_item returns {} → record None. + table = InstrumentedTable(record=None) + cookie_value = codec.seal(CookiePayload(session_id="sess-gone")) + elif scenario == "expired_row": + # TTL in the past — repository treats as missing (defense in depth). + expired = _make_record(ttl=int(time.time()) - 10) + table = InstrumentedTable(record=expired) + cookie_value = codec.seal(CookiePayload(session_id=expired.session_id)) + elif scenario == "terminal_refresh_error": + # Access token within leeway → refresh path → Cognito raises. 
+ rec = _make_record(access_token_exp=int(time.time()) + 5) + table = InstrumentedTable(record=rec) + cookie_value = codec.seal(CookiePayload(session_id=rec.session_id)) + refresh_client.refresh.side_effect = CognitoRefreshError("rotated-dead") + else: + pytest.fail(f"unknown scenario: {scenario}") + + repo = _make_repo(table) + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + ) + + with TestClient(app) as client: + response = client.get("/echo", cookies={SESSION_COOKIE_NAME: cookie_value}) + + assert response.status_code == 200 + assert response.json()["has_session"] is False, ( + f"[3.3/{scenario}] state.bff_session must NOT be set after clear" + ) + + session_clears = _find_set_cookies(response.headers, SESSION_COOKIE_NAME) + csrf_clears = _find_set_cookies(response.headers, CSRF_COOKIE_NAME) + assert len(session_clears) == 1, ( + f"[3.3/{scenario}] expected exactly one Set-Cookie for " + f"{SESSION_COOKIE_NAME}; got {len(session_clears)}" + ) + assert len(csrf_clears) == 1, ( + f"[3.3/{scenario}] expected exactly one Set-Cookie for " + f"{CSRF_COOKIE_NAME}; got {len(csrf_clears)}" + ) + + # Each cleared cookie carries Max-Age=0 and the shared attribute set. + _assert_clear_cookie_attrs(session_clears[0]) + _assert_clear_cookie_attrs(csrf_clears[0]) + + # HttpOnly is the one documented difference between the two cookies. + assert session_clears[0]["attrs"].get("httponly") is True, ( + f"[3.3/{scenario}] session cookie must remain HttpOnly on clear" + ) + assert csrf_clears[0]["attrs"].get("httponly") is not True, ( + f"[3.3/{scenario}] CSRF cookie must NOT be HttpOnly (JS must read it)" + ) + + # Shared (non-HttpOnly) attribute set is identical across the two clears. 
+ shared_keys = {"max-age", "path", "samesite", "secure"} + sess_shared = {k: session_clears[0]["attrs"].get(k) for k in shared_keys} + csrf_shared = {k: csrf_clears[0]["attrs"].get(k) for k in shared_keys} + assert sess_shared == csrf_shared, ( + f"[3.3/{scenario}] shared clear attrs diverge: " + f"session={sess_shared}, csrf={csrf_shared}" + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.4 — Max-Age re-emit contract (slide path) +# ═══════════════════════════════════════════════════════════════════════════ + + +@given( + # Session TTL bounded so it always fits well within the absolute cap. + session_ttl=st.integers(min_value=120, max_value=28800), + # Time since the last touch — past the throttle so a slide is warranted. + seconds_since_last_seen=st.integers(min_value=61, max_value=3600), +) +@settings( + max_examples=15, + deadline=None, + suppress_health_check=[HealthCheck.function_scoped_fixture], +) +def test_3_4_slide_max_age_matches_on_both_cookies( + session_ttl: int, seconds_since_last_seen: int +) -> None: + """(3.4) Max-Age re-emit contract. + + When `_maybe_slide` returns a non-None Max-Age, the Set-Cookie headers + for BOTH `__Host-bff_session` and `__Host-bff_csrf` carry that exact + Max-Age and the attribute set observed today on `_reemit_cookies`: + + Session: HttpOnly; Max-Age=; Path=/; SameSite=lax; Secure + CSRF: Max-Age=; Path=/; SameSite=lax; Secure + """ + now = int(time.time()) + record = _make_record(last_seen_at=now - seconds_since_last_seen) + table = InstrumentedTable(record=record) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + # Large absolute lifetime so the slide is not capped — the Max-Age we + # get back must equal session_ttl_seconds exactly. 
+ app = _build_app( + config=_enabled_config( + session_ttl_seconds=session_ttl, + absolute_lifetime_seconds=30 * 24 * 3600, + sliding_renewal_throttle_seconds=60, + ), + repository=repo, + codec=codec, + refresh_client=refresh_client, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + with TestClient(app) as client: + response = client.get("/echo", cookies={SESSION_COOKIE_NAME: sealed}) + # Slide-write is fire-and-forget (task 3.5) — drive the event + # loop with a second request to let the background task from the + # first request flush. MUST happen inside the `with TestClient` + # block because TestClient tears down its anyio portal (and the + # event loop) on `__exit__`, which cancels any pending tasks. + _wait_for(lambda: table.update_item_calls >= 1) + if table.update_item_calls == 0: + # A no-op second request keeps the event loop alive long + # enough for the pending slide task to run. + client.get("/echo") + _wait_for(lambda: table.update_item_calls >= 1) + + assert response.status_code == 200 + # Slide must have fired exactly once (one DDB update_item). + assert table.update_item_calls == 1, ( + f"[3.4] slide must issue exactly one update_item; got " + f"{table.update_item_calls}" + ) + + session_emits = _find_set_cookies(response.headers, SESSION_COOKIE_NAME) + csrf_emits = _find_set_cookies(response.headers, CSRF_COOKIE_NAME) + assert len(session_emits) == 1, ( + f"[3.4] expected exactly one Set-Cookie for {SESSION_COOKIE_NAME}" + ) + assert len(csrf_emits) == 1, ( + f"[3.4] expected exactly one Set-Cookie for {CSRF_COOKIE_NAME}" + ) + + sess_attrs = session_emits[0]["attrs"] + csrf_attrs = csrf_emits[0]["attrs"] + + # Max-Age equals session_ttl_seconds on BOTH cookies (no absolute cap). 
+ assert sess_attrs.get("max-age") == str(session_ttl), ( + f"[3.4] session cookie Max-Age mismatch: expected {session_ttl}, " + f"got {sess_attrs.get('max-age')}" + ) + assert csrf_attrs.get("max-age") == str(session_ttl), ( + f"[3.4] csrf cookie Max-Age mismatch: expected {session_ttl}, " + f"got {csrf_attrs.get('max-age')}" + ) + + # Attribute set observed on today's _reemit_cookies: + assert sess_attrs.get("path") == "/" + assert sess_attrs.get("samesite") == "lax" + assert sess_attrs.get("secure") is True + assert sess_attrs.get("httponly") is True + + assert csrf_attrs.get("path") == "/" + assert csrf_attrs.get("samesite") == "lax" + assert csrf_attrs.get("secure") is True + # CSRF is JS-readable → MUST NOT be HttpOnly. + assert csrf_attrs.get("httponly") is not True + + # Shared (non-HttpOnly) attribute set is identical. + shared = {"max-age", "path", "samesite", "secure"} + assert {k: sess_attrs.get(k) for k in shared} == { + k: csrf_attrs.get(k) for k in shared + } + + # The sealed value on the session cookie is the exact same value the + # browser already held — slide doesn't mint a new seal. + assert session_emits[0]["value"] == sealed, ( + "[3.4] slide must re-emit the same sealed session value, not a new seal" + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.5 — Refresh-storm coalescing preserved (one initiate_auth per +# session per leeway window) +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +async def test_3_5_refresh_storm_coalesces_to_single_initiate_auth() -> None: + """(3.5) Refresh-storm coalescing. + + 10 concurrent same-session requests crossing the refresh-leeway window + must drive exactly ONE `cognito-idp:initiate_auth` call (the existing + per-session lock coalescing contract). The fix MUST preserve this. 
+ """ + now = int(time.time()) + record = _make_record(access_token_exp=now + 5) # within 60s leeway + table = InstrumentedTable(record=record) + repo = _make_repo(table) + codec = _make_codec() + + refresh_call_count = {"n": 0} + + async def _refresh(*, username: str, refresh_token: str) -> RefreshResult: + refresh_call_count["n"] += 1 + return RefreshResult( + access_token=f"access.fresh.{refresh_call_count['n']}", + refresh_token="refresh.original", # no rotation + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) + + refresh_client = MagicMock() + refresh_client.refresh = AsyncMock(side_effect=_refresh) + + # After the first refresh lands, later repo.get calls should observe + # a record that no longer needs refresh (the update_item write is a + # no-op on the fake, so we pre-refresh the in-memory record copy). + fresh = _make_record( + session_id=record.session_id, access_token_exp=now + 3600 + ) + fresh.cognito_access_token = "access.fresh.1" + # Sequential responses: first few see the stale record, then the fresh one. + table._record = record # starts stale + original_get_item = table.get_item + + get_item_counter = {"n": 0} + + def counting_get_item(Key: dict) -> dict: + get_item_counter["n"] += 1 + # After the leader's update_item bumps tokens, followers arriving + # late should see the fresh record. Flip after 2 calls so both + # pre-lock and post-lock rechecks on the leader path see the stale row. 
+ if get_item_counter["n"] > 2: + table._record = fresh + return original_get_item(Key) + + table.get_item = counting_get_item # type: ignore[assignment] + + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + transport = httpx.ASGITransport(app=app) + + async with httpx.AsyncClient( + transport=transport, base_url="http://test" + ) as client: + client.cookies.set(SESSION_COOKIE_NAME, sealed) + responses = await asyncio.gather( + *(client.get("/echo") for _ in range(10)) + ) + + for r in responses: + assert r.status_code == 200 + + assert refresh_call_count["n"] == 1, ( + f"[3.5] 10 concurrent same-session requests drove " + f"{refresh_call_count['n']} Cognito initiate_auth calls — exactly " + "one is required per session per leeway window (existing " + "get_session_lock coalescing)." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.6 — Codec singleton, zero per-request KMS GenerateDataKey +# ═══════════════════════════════════════════════════════════════════════════ + + +def test_3_6_get_default_codec_is_singleton_with_no_per_request_kms() -> None: + """(3.6) Codec singleton. + + `get_default_codec()` returns the same instance across calls. The + underlying `secretsmanager:GetSecretValue` call happens at most once + per process. Hot seal/unseal traffic must not re-fetch. + + (This contract held under the original `kms:GenerateDataKey`-per-process + design and the interim KMS-wrap design too; only the underlying AWS + APIs and KDF changed when the codec was moved to a shared + Secrets-Manager-generated secret for cross-task seal/unseal.) 
+ """ + sm_client = MagicMock() + sm_client.get_secret_value.return_value = { + "SecretString": "secret-3-6-high-entropy-1234567890ABCDEFGHIJ" + } + + codec = CookieCodec( + kms_key_arn="arn:aws:kms:fake-3.6", + data_key_secret_arn="arn:aws:secretsmanager:fake-3.6", + secrets_manager_client=sm_client, + ) + cookie_module._set_default_codec_for_tests(codec) + + first = get_default_codec() + for _ in range(25): + other = get_default_codec() + assert other is first, ( + "[3.6] get_default_codec() must return the same instance each call" + ) + + payload = CookiePayload(session_id="sess-3-6") + for _ in range(20): + sealed = first.seal(payload) + roundtripped = first.unseal(sealed) + assert roundtripped.session_id == "sess-3-6" + + assert sm_client.get_secret_value.call_count <= 1, ( + f"[3.6] Secrets Manager get_secret_value invoked " + f"{sm_client.get_secret_value.call_count} times — must be at most " + "one per process." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.7 — Client-secret cache, one Secrets Manager hit per process +# ═══════════════════════════════════════════════════════════════════════════ + + +def test_3_7_client_secret_cache_one_secrets_manager_hit_per_process() -> None: + """(3.7) Client-secret cache. + + `resolve_bff_client_secret()` must hit Secrets Manager exactly once per + process regardless of how many times it is called. + """ + sm_client = MagicMock() + sm_client.get_secret_value.return_value = {"SecretString": "client-secret-A"} + + first = resolve_bff_client_secret( + secret_arn="arn:secret", + region="us-east-1", + secrets_manager_client=sm_client, + ) + assert first == "client-secret-A" + + # Many subsequent calls (here reusing the same mock SM client) must not drive + # a new GetSecretValue, because the first call populated the cache. 
+ for _ in range(50): + value = resolve_bff_client_secret( + secret_arn="arn:secret", + region="us-east-1", + secrets_manager_client=sm_client, + ) + assert value == "client-secret-A" + + assert sm_client.get_secret_value.call_count == 1, ( + f"[3.7] Secrets Manager get_secret_value called " + f"{sm_client.get_secret_value.call_count} times — must be exactly one." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.8 — CSRFMiddleware accept/reject unchanged, no new I/O +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.parametrize( + "case", + ["matching", "mismatched", "header_only", "cookie_only", "forged_pair", "missing"], +) +def test_3_8_csrf_decision_unchanged_with_zero_new_io(case: str) -> None: + """(3.8) CSRF path unchanged. + + With `SessionRefreshMiddleware` upstream populating `state.bff_session`, + the `CSRFMiddleware` accept/reject decision on unsafe-method requests + matches today's observed behavior across all six CSRF token cases. + No new DDB / Cognito / KMS / Secrets Manager I/O is introduced on the + CSRF path. 
+ """ + record = _make_record() + table = InstrumentedTable(record=record) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + include_csrf=True, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + valid_token = CSRFHelper.derive_token(record.csrf_secret, record.session_id) + forged_token = "0" * 32 + + headers: dict[str, str] = {} + cookies: dict[str, str] = {SESSION_COOKIE_NAME: sealed} + + if case == "matching": + headers[CSRF_HEADER_NAME] = valid_token + cookies[CSRF_COOKIE_NAME] = valid_token + expected_status = 200 + elif case == "mismatched": + headers[CSRF_HEADER_NAME] = valid_token + cookies[CSRF_COOKIE_NAME] = "different-value" + expected_status = 403 + elif case == "header_only": + headers[CSRF_HEADER_NAME] = valid_token + expected_status = 403 + elif case == "cookie_only": + cookies[CSRF_COOKIE_NAME] = valid_token + expected_status = 403 + elif case == "forged_pair": + headers[CSRF_HEADER_NAME] = forged_token + cookies[CSRF_COOKIE_NAME] = forged_token + expected_status = 403 + elif case == "missing": + expected_status = 403 + else: + pytest.fail(f"unknown case: {case}") + + # Snapshot AWS call counters BEFORE the CSRF-exercising request. + # (Session resolve may have happened on-open via middleware init; we + # expect exactly one get_item for the resolve, and zero writes.) + initial_refresh_calls = refresh_client.refresh.call_count + initial_update_calls = table.update_item_calls + + with TestClient(app) as client: + response = client.post("/submit", headers=headers, cookies=cookies) + + assert response.status_code == expected_status, ( + f"[3.8/{case}] unexpected CSRF decision: expected {expected_status}, " + f"got {response.status_code}" + ) + # Zero NEW Cognito / DDB write I/O on the CSRF path itself. 
+ assert refresh_client.refresh.call_count == initial_refresh_calls, ( + f"[3.8/{case}] CSRF path triggered an unexpected Cognito refresh" + ) + # CSRF itself never writes to DDB. + assert table.update_item_calls - initial_update_calls <= 1, ( + f"[3.8/{case}] more than one update_item observed — at most the " + "preceding session-resolve slide is expected." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.9 — Absolute-lifetime cap returns None from _maybe_slide +# ═══════════════════════════════════════════════════════════════════════════ + + +@pytest.mark.asyncio +async def test_3_9_maybe_slide_returns_none_past_absolute_cap() -> None: + """(3.9) Absolute-lifetime cap. + + When `now > created_at + absolute_lifetime_seconds`, `_maybe_slide` + returns `None` (no cookie re-emit, no DDB write). + """ + now = int(time.time()) + # Session was created 200s ago with an absolute lifetime of 100s → cap + # was reached 100s ago. last_seen_at is past the throttle so otherwise + # a slide would be warranted. + record = _make_record( + created_at=now - 200, + last_seen_at=now - 120, + ) + table = InstrumentedTable(record=record) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + config = _enabled_config( + absolute_lifetime_seconds=100, + sliding_renewal_throttle_seconds=60, + ) + + # Build the middleware directly so we can invoke _maybe_slide in + # isolation — the preservation contract is specifically that the + # method returns None past the cap. 
+ middleware = SessionRefreshMiddleware( + app=FastAPI(), + config=config, + repository=repo, + cookie_codec=codec, + refresh_client=refresh_client, + cache=SessionCache(ttl_seconds=60), + ) + middleware._ensure_collaborators() + + result = await middleware._maybe_slide(record) + assert result is None, ( + f"[3.9] _maybe_slide must return None past the absolute cap; " + f"got {result!r}" + ) + assert table.update_item_calls == 0, ( + f"[3.9] _maybe_slide must NOT schedule a DDB write past the cap; " + f"observed {table.update_item_calls} update_item calls." + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.10 — Fail-closed rotation: cache invalidated AND cookies cleared +# ═══════════════════════════════════════════════════════════════════════════ + + +def test_3_10_rotation_persist_exhausts_invalidates_cache_and_clears_cookies() -> None: + """(3.10) Fail-closed rotation. + + When refresh-token rotation kicks in AND `_persist_refresh` exhausts all + retries (update_item fails every time), the middleware MUST: + (a) invalidate the cache entry for this session + (b) clear BOTH BFF cookies on the response + so the user is forced to re-authenticate before their next request + hits a dead refresh token. + """ + now = int(time.time()) + # Access token within leeway → refresh path. + record = _make_record(access_token_exp=now + 5) + table = InstrumentedTable( + record=record, + update_item_side_effect=RuntimeError("DDB throttled"), + ) + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + # Rotation kicks in — refresh_token differs from current. + refresh_client.refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.fresh", + refresh_token="refresh.ROTATED", + id_token="id.fresh", + access_token_exp=now + 3600, + ) + ) + + cache = SessionCache(ttl_seconds=60) + # Pre-seed the cache so we can verify invalidation. 
+ cache.set(record) + assert cache.get(record.session_id) is not None + + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + cache=cache, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + with TestClient(app) as client: + response = client.get("/echo", cookies={SESSION_COOKIE_NAME: sealed}) + + assert response.status_code == 200 + assert response.json()["has_session"] is False, ( + "[3.10] state.bff_session must NOT be set after fail-closed rotation" + ) + + # (a) Cache entry invalidated. + assert cache.get(record.session_id) is None, ( + "[3.10] cache entry must be invalidated after exhausted rotation persist" + ) + + # (b) Both cookies cleared. + session_clears = _find_set_cookies(response.headers, SESSION_COOKIE_NAME) + csrf_clears = _find_set_cookies(response.headers, CSRF_COOKIE_NAME) + assert len(session_clears) == 1 and len(csrf_clears) == 1, ( + f"[3.10] both BFF cookies must be cleared; got " + f"session={len(session_clears)}, csrf={len(csrf_clears)}" + ) + _assert_clear_cookie_attrs(session_clears[0]) + _assert_clear_cookie_attrs(csrf_clears[0]) + + # Sanity: update_tokens was retried 3 times on rotation. Use the + # token_persist sub-counter so we measure persist attempts only, + # not the (also-incrementing) lock_acquire write that precedes them. + assert table.token_persist_calls == 3, ( + f"[3.10] rotation must retry update_tokens 3 times; got " + f"{table.token_persist_calls}" + ) + + +# ═══════════════════════════════════════════════════════════════════════════ +# Requirement 3.11 — Cookie decode uniformity (no new timing/shape oracle) +# ═══════════════════════════════════════════════════════════════════════════ + + +@given( + garbage=st.one_of( + # Arbitrary non-empty ASCII cookie-safe strings — typical "bad seal" + # wire shape. 
Excludes '' because an empty cookie value is treated + # as "no cookie present" by the middleware (requirement 3.2), not + # as a decode failure. + st.text(alphabet=_COOKIE_SAFE_ALPHABET, min_size=1, max_size=64), + # Hex-encoded random bytes: every hex digit is a valid base64url + # character, but the value is not a genuine seal. + st.binary(min_size=1, max_size=48).map(lambda b: b.hex()), + ), +) +@settings( + max_examples=25, + deadline=None, + suppress_health_check=[HealthCheck.function_scoped_fixture], +) +def test_3_11_cookie_decode_failure_produces_uniform_response_shape( + garbage: str, +) -> None: + """(3.11) Cookie decode uniformity. + + Every `CookieDecodeError` branch — bad base64, bad tag, truncated blob, + wrong version, non-JSON body — produces the SAME externally observable + response shape: identical status, identical Set-Cookie clearing pattern + for both BFF cookies, identical handler body (has_session=False). + + The middleware must NOT surface any oracle that lets a caller + distinguish decode failure modes. + """ + table = InstrumentedTable() + repo = _make_repo(table) + codec = _make_codec() + refresh_client = MagicMock() + app = _build_app( + config=_enabled_config(), + repository=repo, + codec=codec, + refresh_client=refresh_client, + ) + + with TestClient(app) as client: + response = client.get( + "/echo", cookies={SESSION_COOKIE_NAME: garbage} + ) + + assert response.status_code == 200, ( + f"[3.11] bad-seal path must return 200 with cleared cookie; " + f"got {response.status_code}" + ) + assert response.json() == { + "has_session": False, + "session_id": None, + "access_token": None, + "csrf_token": None, + }, ( + f"[3.11] handler body diverges for garbage cookie {garbage!r}: " + f"{response.json()}" + ) + + # Both cookies cleared with the same attribute set. 
+ session_clears = _find_set_cookies(response.headers, SESSION_COOKIE_NAME) + csrf_clears = _find_set_cookies(response.headers, CSRF_COOKIE_NAME) + assert len(session_clears) == 1, ( + f"[3.11] expected one session-cookie clear; got {len(session_clears)}" + ) + assert len(csrf_clears) == 1, ( + f"[3.11] expected one csrf-cookie clear; got {len(csrf_clears)}" + ) + _assert_clear_cookie_attrs(session_clears[0]) + _assert_clear_cookie_attrs(csrf_clears[0]) + + # Zero AWS calls — decode failure is caught before any DDB / Cognito I/O. + assert table.get_item_calls == 0, ( + f"[3.11] bad-seal path must NOT reach DDB; observed " + f"{table.get_item_calls} get_item calls." + ) + refresh_client.refresh.assert_not_called() diff --git a/backend/tests/apis/shared/sessions_bff/test_cookie.py b/backend/tests/apis/shared/sessions_bff/test_cookie.py index afeaf61a..49f4d5bc 100644 --- a/backend/tests/apis/shared/sessions_bff/test_cookie.py +++ b/backend/tests/apis/shared/sessions_bff/test_cookie.py @@ -1,21 +1,33 @@ """Tests for the AES-GCM cookie codec. -Uses an injected `AESGCM` cipher to avoid mocking KMS — `CookieCodec` exposes -the `_cipher` attribute which we set directly. (Production callers always go -through `_ensure_cipher`, which is what the KMS-integration test exercises.) +Two layers of coverage: + + 1. Round-trip / decode tests — use an injected `AESGCM` cipher (set on + `_cipher` directly) so we don't need to mock Secrets Manager. + 2. `_ensure_cipher` path — exercises the deploy-time-bootstrapped data + key flow (`secretsmanager:GetSecretValue` -> SHA-256 -> AESGCM cipher) + with mock clients. This is the path that runs in production every + time a task starts. + +The cross-task seal/unseal regression — a cookie sealed by one process +unsealing on a *different* process — is locked in by +`test_two_codecs_with_same_secret_derive_the_same_cipher`. 
""" from __future__ import annotations import base64 +import hashlib import os import secrets +from unittest.mock import MagicMock import pytest from cryptography.hazmat.primitives.ciphers.aead import AESGCM from apis.shared.sessions_bff.cookie import ( CookieCodec, + CookieDataKeyUnavailable, CookieDecodeError, _reset_default_codec_for_tests, _set_default_codec_for_tests, @@ -110,17 +122,24 @@ def test_seal_preserves_extras() -> None: def test_default_codec_is_a_singleton() -> None: """The auth/callback route seals with this codec and the `SessionRefreshMiddleware` unseals with it on the next request — they - must be the *same* instance, since each `CookieCodec` derives its own - random AES key. A second instance would fail every unseal as 'bad seal'. + must be the *same* instance within a process so we don't refetch the + data-key secret on every cookie operation. + + Cross-process consistency (Task A's seal unsealing on Task B) is locked + in by `test_two_codecs_with_same_secret_derive_the_same_cipher`. """ _reset_default_codec_for_tests() try: os.environ["BFF_COOKIE_SIGNING_KEY_ARN"] = "arn:aws:kms:fake" + os.environ["BFF_COOKIE_DATA_KEY_SECRET_ARN"] = ( + "arn:aws:secretsmanager:us-east-1:0:secret:bff-data-key" + ) first = get_default_codec() second = get_default_codec() assert first is second finally: os.environ.pop("BFF_COOKIE_SIGNING_KEY_ARN", None) + os.environ.pop("BFF_COOKIE_DATA_KEY_SECRET_ARN", None) _reset_default_codec_for_tests() @@ -142,14 +161,136 @@ def test_default_codec_round_trip_seals_and_unseals() -> None: _reset_default_codec_for_tests() -def test_unseal_propagates_kms_infrastructure_errors() -> None: - """KMS unavailable is not a decode error — it must surface so the caller - can return 5xx instead of clearing the cookie and forcing re-login.""" - from unittest.mock import MagicMock +# ===================================================================== +# `_ensure_cipher` — Secrets Manager fetch + SHA-256 derivation path. 
+# ===================================================================== + +KMS_KEY_ARN = "arn:aws:kms:us-east-1:0:key/test" +DATA_KEY_SECRET_ARN = "arn:aws:secretsmanager:us-east-1:0:secret:bff-data-key" + + +def _make_sm_mock(secret_string: str) -> MagicMock: + sm = MagicMock() + sm.get_secret_value.return_value = {"SecretString": secret_string} + return sm + + +def test_ensure_cipher_fetches_secret_and_derives_key() -> None: + """Happy path: codec fetches the secret from Secrets Manager, derives + a 32-byte AES-256 key with SHA-256, then seals/unseals successfully.""" + secret_string = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKL012345" # 44 chars + sm = _make_sm_mock(secret_string) + + codec = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm, + ) + sealed = codec.seal(CookiePayload(session_id="sess-bootstrapped")) + assert codec.unseal(sealed).session_id == "sess-bootstrapped" + + sm.get_secret_value.assert_called_once_with(SecretId=DATA_KEY_SECRET_ARN) - fake_kms = MagicMock() - fake_kms.generate_data_key.side_effect = RuntimeError("KMS unreachable") - codec = CookieCodec(kms_key_arn="arn:aws:kms:fake", kms_client=fake_kms) - with pytest.raises(RuntimeError, match="KMS unreachable"): - codec.unseal("doesnt-matter") +def test_ensure_cipher_derived_key_matches_sha256_of_secret() -> None: + """Lock the KDF: a future change must keep the same derivation, or + every cookie sealed by an old task fails to unseal on a new task + after deploy.""" + secret_string = "deterministic-secret-for-kdf-pinning-test-1234" + sm = _make_sm_mock(secret_string) + + codec = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm, + ) + # Force initialization without exposing _cipher's key directly: use a + # parallel cipher with the expected key, encrypt, and decrypt with the + # codec. If the codec didn't derive via SHA-256, decrypt fails. 
+ codec.seal(CookiePayload(session_id="x")) + expected_key = hashlib.sha256(secret_string.encode("utf-8")).digest() + expected_cipher = AESGCM(expected_key) + nonce = secrets.token_bytes(12) + ciphertext = expected_cipher.encrypt(nonce, b'{"sid":"y"}', bytes([1])) + blob = bytes([1]) + nonce + ciphertext + sealed = base64.urlsafe_b64encode(blob).rstrip(b"=").decode("ascii") + decoded = codec.unseal(sealed) + assert decoded.session_id == "y" + + +def test_ensure_cipher_caches_after_first_call() -> None: + """Hot-path requirement: only one Secrets Manager call per process.""" + sm = _make_sm_mock("a" * 44) + codec = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm, + ) + for _ in range(5): + codec.seal(CookiePayload(session_id="x")) + assert sm.get_secret_value.call_count == 1 + + +def test_two_codecs_with_same_secret_derive_the_same_cipher() -> None: + """Regression lock for the dev `bad seal` 401 storm. + + Two independent `CookieCodec` instances simulate two ECS tasks. Both + fetch the SAME secret string from Secrets Manager and derive the same + 32-byte key via SHA-256. A cookie sealed on `task_a` MUST unseal on + `task_b`. Pre-fix, each task generated its own random data key and + this failed. 
+ """ + secret_string = "shared-secret-across-tasks-1234567890ABCDEFGH" + sm_a = _make_sm_mock(secret_string) + sm_b = _make_sm_mock(secret_string) + + task_a = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm_a, + ) + task_b = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm_b, + ) + + sealed_on_a = task_a.seal(CookiePayload(session_id="sess-cross-task")) + decoded_on_b = task_b.unseal(sealed_on_a) + assert decoded_on_b.session_id == "sess-cross-task" + + +def test_ensure_cipher_propagates_secrets_manager_failure() -> None: + """Secrets Manager unreachable must surface as `CookieDataKeyUnavailable` + so the request returns 5xx — never as a decode error that clears the + user's cookie.""" + sm = MagicMock() + sm.get_secret_value.side_effect = RuntimeError("Secrets Manager unreachable") + codec = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm, + ) + with pytest.raises(CookieDataKeyUnavailable): + codec.unseal("anything") + + +def test_ensure_cipher_rejects_empty_secret_string() -> None: + """Bootstrap not yet completed (or secret manually wiped) — fail loud + rather than silently invalidate every active session.""" + sm = _make_sm_mock("") + codec = CookieCodec( + kms_key_arn=KMS_KEY_ARN, + data_key_secret_arn=DATA_KEY_SECRET_ARN, + secrets_manager_client=sm, + ) + with pytest.raises(CookieDataKeyUnavailable, match="bootstrap missing"): + codec.unseal("anything") + + +def test_ensure_cipher_missing_config_surfaces_as_decode_error() -> None: + """No KMS ARN or no secret ARN — same shape as today's "BFF disabled" + path. 
Treated as `bad seal` so the middleware clears the cookie.""" + codec = CookieCodec(kms_key_arn="", data_key_secret_arn="") + with pytest.raises(CookieDecodeError): + codec.unseal("anything") diff --git a/backend/tests/apis/shared/sessions_bff/test_repository.py b/backend/tests/apis/shared/sessions_bff/test_repository.py index ec6c771b..b20c33cf 100644 --- a/backend/tests/apis/shared/sessions_bff/test_repository.py +++ b/backend/tests/apis/shared/sessions_bff/test_repository.py @@ -77,3 +77,302 @@ async def test_disabled_repository_is_inert() -> None: # All ops succeed silently — no exceptions, no AWS calls. assert await repo.get("any") is None await repo.delete("any") + + +# ===================================================================== +# Cross-task refresh lock — try_acquire_refresh_lock / release_refresh_lock +# ===================================================================== + + +@pytest.mark.asyncio +async def test_try_acquire_refresh_lock_succeeds_on_unlocked_row( + repository, sample_record +) -> None: + """The first contender claims the lock when no peer is holding one.""" + record = sample_record() + await repository.put(record) + + acquired = await repository.try_acquire_refresh_lock( + session_id=record.session_id, + owner="task-A", + lock_ttl_seconds=30, + ) + assert acquired is True + + +@pytest.mark.asyncio +async def test_try_acquire_refresh_lock_blocks_concurrent_peer( + repository, sample_record +) -> None: + """While task-A's lock is fresh, task-B's acquisition MUST fail. + + This is the cross-task coalescing primitive — without it, two tasks + would each call cognito-idp:initiate_auth with the same refresh token + under desiredCount > 1. 
+ """ + record = sample_record() + await repository.put(record) + + a = await repository.try_acquire_refresh_lock( + session_id=record.session_id, + owner="task-A", + lock_ttl_seconds=30, + ) + b = await repository.try_acquire_refresh_lock( + session_id=record.session_id, + owner="task-B", + lock_ttl_seconds=30, + ) + assert a is True + assert b is False + + +@pytest.mark.asyncio +async def test_try_acquire_refresh_lock_takes_over_after_ttl_expires( + repository, sample_record +) -> None: + """A leader that crashed mid-refresh strands the lock for at most + `lock_ttl_seconds`. After that, any peer can re-acquire — no manual + cleanup required, no permanent stuck state.""" + record = sample_record() + await repository.put(record) + + # task-A acquires with a 0-second TTL → lock_until = now, so any + # contender at a later second sees `refresh_lock_until < :now`. + a = await repository.try_acquire_refresh_lock( + session_id=record.session_id, + owner="task-A", + lock_ttl_seconds=0, + ) + assert a is True + + # Sleep 1s so the next contender's :now is strictly greater. 
+ time.sleep(1) + + b = await repository.try_acquire_refresh_lock( + session_id=record.session_id, + owner="task-B", + lock_ttl_seconds=30, + ) + assert b is True + + +@pytest.mark.asyncio +async def test_try_acquire_refresh_lock_distinct_sessions_dont_block( + repository, sample_record +) -> None: + rec_a = sample_record(session_id="sess-A") + rec_b = sample_record(session_id="sess-B") + await repository.put(rec_a) + await repository.put(rec_b) + + a = await repository.try_acquire_refresh_lock( + session_id=rec_a.session_id, owner="task-1", lock_ttl_seconds=30 + ) + b = await repository.try_acquire_refresh_lock( + session_id=rec_b.session_id, owner="task-1", lock_ttl_seconds=30 + ) + assert a is True + assert b is True + + +@pytest.mark.asyncio +async def test_release_refresh_lock_clears_attrs_for_owner( + repository, sample_record +) -> None: + record = sample_record() + await repository.put(record) + await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="task-A", lock_ttl_seconds=30 + ) + + await repository.release_refresh_lock(record.session_id, owner="task-A") + + # After release a peer can immediately acquire. + b = await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="task-B", lock_ttl_seconds=30 + ) + assert b is True + + +@pytest.mark.asyncio +async def test_release_refresh_lock_is_no_op_for_non_owner( + repository, sample_record +) -> None: + """Best-effort release: if a peer has already taken over the lock + (because ours TTL'd), the release MUST NOT clear their lock attrs.""" + record = sample_record() + await repository.put(record) + await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="task-A", lock_ttl_seconds=30 + ) + + # task-B (who never held the lock) calls release — must not blow away + # task-A's lock. + await repository.release_refresh_lock(record.session_id, owner="task-B") + + # task-A's lock is still in force; a third contender can't acquire. 
+ c = await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="task-C", lock_ttl_seconds=30 + ) + assert c is False + + +@pytest.mark.asyncio +async def test_update_tokens_with_lock_owner_clears_lock_atomically( + repository, sample_record +) -> None: + """Successful refresh persist clears the lock attributes in the same + write so peers don't have to wait for the TTL to retry.""" + record = sample_record() + await repository.put(record) + await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="task-A", lock_ttl_seconds=30 + ) + + await repository.update_tokens( + session_id=record.session_id, + access_token="access.fresh", + refresh_token="refresh.rotated", + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + last_seen_at=int(time.time()), + expected_lock_owner="task-A", + ) + + # Lock cleared → another contender can acquire immediately. + b = await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="task-B", lock_ttl_seconds=30 + ) + assert b is True + + +@pytest.mark.asyncio +async def test_update_tokens_rejects_persist_when_peer_owns_the_lock( + repository, sample_record +) -> None: + """Stale-leader guard: if our lock TTL'd and a peer took over, we must + NOT overwrite their freshly persisted tokens. ConditionalCheckFailed + propagates so the caller can re-read DDB and adopt the peer's state.""" + from botocore.exceptions import ClientError + + record = sample_record() + await repository.put(record) + # Peer task acquired the lock. 
+ await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="peer-task", lock_ttl_seconds=30 + ) + + with pytest.raises(ClientError) as exc_info: + await repository.update_tokens( + session_id=record.session_id, + access_token="access.stale", + refresh_token="refresh.stale", + id_token="id.stale", + access_token_exp=int(time.time()) + 3600, + last_seen_at=int(time.time()), + expected_lock_owner="our-task", # ≠ peer-task + ) + assert ( + exc_info.value.response.get("Error", {}).get("Code") + == "ConditionalCheckFailedException" + ) + + +@pytest.mark.asyncio +async def test_update_tokens_rejects_persist_when_peer_already_cleared_the_lock( + repository, sample_record +) -> None: + """The other half of the stale-leader guard: our lock TTL'd and a peer + took over, refreshed, and successfully persisted (which atomically + REMOVEs the lock attrs) — the row now has NO lock attributes at all. + A stale leader trying to persist with `expected_lock_owner=our-task` + must still fail closed; otherwise our older Cognito tokens would + silently overwrite the peer's freshly rotated ones, and the next + request would get NotAuthorizedException from Cognito (our refresh + token was revoked when the peer's rotation was issued). + + Sequence: + 1. Task A acquires lock at T0. + 2. Task A's Cognito call hangs. + 3. Task B sees lock TTL'd, acquires, refreshes, persists (clears). + 4. Task A's Cognito finally returns; A tries to persist. + => MUST fail with ConditionalCheckFailedException. + """ + from botocore.exceptions import ClientError + + record = sample_record() + await repository.put(record) + + # Peer acquired the lock and successfully persisted (clearing it).
+ await repository.try_acquire_refresh_lock( + session_id=record.session_id, owner="peer-task", lock_ttl_seconds=30 + ) + await repository.update_tokens( + session_id=record.session_id, + access_token="access.peer-fresh", + refresh_token="refresh.peer-rotated", + id_token="id.peer", + access_token_exp=int(time.time()) + 3600, + last_seen_at=int(time.time()), + expected_lock_owner="peer-task", + ) + + # Stale leader (our-task) — never owned a lock that's still on the + # row, but holds an old `lock_owner` from before the TTL. Must fail. + with pytest.raises(ClientError) as exc_info: + await repository.update_tokens( + session_id=record.session_id, + access_token="access.stale-leader", + refresh_token="refresh.stale-leader", + id_token="id.stale-leader", + access_token_exp=int(time.time()) + 3600, + last_seen_at=int(time.time()), + expected_lock_owner="our-task", + ) + assert ( + exc_info.value.response.get("Error", {}).get("Code") + == "ConditionalCheckFailedException" + ) + + # Peer's tokens are still intact on the row. + fetched = await repository.get(record.session_id) + assert fetched is not None + assert fetched.cognito_access_token == "access.peer-fresh" + assert fetched.cognito_refresh_token == "refresh.peer-rotated" + + +@pytest.mark.asyncio +async def test_try_acquire_refresh_lock_does_not_create_phantom_row( + repository, moto_bff_dynamodb +) -> None: + """Logout-during-refresh guard: if the session row was deleted between + `repository.get()` and `try_acquire_refresh_lock`, UpdateItem would + upsert a phantom row containing only the lock attrs (and crucially no + `ttl`, so DDB TTL would never reap it). The `attribute_exists(PK)` + guard turns that into a clean False return. + + Asserts via raw DDB get_item — `repository.get` would mask a phantom + behind its post-read TTL check (a row with no `ttl` attribute reads + as `int(item.get("ttl", 0)) <= now`, treated as missing), so we + bypass that and look at the raw item. 
+ """ + # Session row never existed (or was just deleted by a logout from + # another task between this request's repository.get() and here). + acquired = await repository.try_acquire_refresh_lock( + session_id="never-existed", + owner="task-A", + lock_ttl_seconds=30, + ) + assert acquired is False + + # No phantom row was created — check the raw table, since + # repository.get() would also return None for a phantom (no `ttl`). + table = moto_bff_dynamodb.Table("test-bff-sessions") + response = table.get_item( + Key={"PK": "SESSION#never-existed", "SK": "META"} + ) + assert "Item" not in response, ( + "try_acquire_refresh_lock created a phantom row with no `ttl` — " + "DDB TTL would never reap it" + ) diff --git a/backend/tests/apis/shared/sessions_bff/test_session_refresh_cross_task.py b/backend/tests/apis/shared/sessions_bff/test_session_refresh_cross_task.py new file mode 100644 index 00000000..946e357c --- /dev/null +++ b/backend/tests/apis/shared/sessions_bff/test_session_refresh_cross_task.py @@ -0,0 +1,480 @@ +"""Cross-task refresh-coalescing tests for SessionRefreshMiddleware. + +Locks in the regression that PR #264 created and the cookie-codec fix would +*expose* once dev started working again: with `desiredCount: 2`, two +`SessionRefreshMiddleware` instances running in two ECS tasks would each +see a cookie crossing the refresh-leeway boundary, each call +`cognito-idp:initiate_auth` with the same refresh token, and one of them +would lose the rotation race — Cognito revokes the original token on the +winner's exchange, the loser gets `NotAuthorizedException`, the loser's +middleware clears the user's cookie. Page-load fan-outs become routine +silent logouts. + +The fix coalesces the refresh exchange across tasks via a DynamoDB +conditional-write lock (`refresh_lock_owner` + `refresh_lock_until` on +the session row). 
These tests instantiate two repository + middleware +pairs against ONE moto-backed DDB table so we can drive the leader and +follower paths deterministically without spinning real ECS tasks. + +What's covered: + - Leader-only Cognito refresh under same-time contention from two tasks + - Follower adoption of the leader's persisted tokens (no Cognito call) + - Leader crash (Cognito error) releases the lock so peers can retry + - Lock TTL recovery: a crashed leader's lock unblocks peers after TTL + - Refresh-token rotation: peer's rotated tokens propagate to follower +""" + +from __future__ import annotations + +import asyncio +import secrets +import time +from typing import Optional +from unittest.mock import AsyncMock, MagicMock + +import boto3 +import pytest +from cryptography.hazmat.primitives.ciphers.aead import AESGCM +from fastapi import FastAPI, Request +from fastapi.testclient import TestClient +from moto import mock_aws + +from apis.shared.middleware.session_refresh import SessionRefreshMiddleware +from apis.shared.sessions_bff import lock as lock_module +from apis.shared.sessions_bff import single_flight as single_flight_module +from apis.shared.sessions_bff.cache import SessionCache +from apis.shared.sessions_bff.config import ( + BFFConfig, + SESSION_COOKIE_NAME, +) +from apis.shared.sessions_bff.cookie import CookieCodec +from apis.shared.sessions_bff.models import CookiePayload, SessionRecord +from apis.shared.sessions_bff.refresh import ( + CognitoRefreshError, + RefreshResult, +) +from apis.shared.sessions_bff.repository import SessionRepository + +# Single shared DDB table — both "tasks" attach to the same backing store, +# matching production where two ECS tasks read/write one BFFSessionsTable. 
+TABLE_NAME = "test-bff-sessions" + + +@pytest.fixture(autouse=True) +def _reset_module_state(): + """Drop process-wide locks + single-flight registries between tests so + a leftover Future or asyncio lock from one test can't influence the + next case's contention behavior.""" + lock_module._reset_for_tests() + single_flight_module._reset_for_tests() + yield + lock_module._reset_for_tests() + single_flight_module._reset_for_tests() + + +@pytest.fixture +def two_task_setup(monkeypatch): + """Spin up two `SessionRefreshMiddleware` instances over one moto DDB + table so each represents a distinct ECS task in the same fleet.""" + monkeypatch.setenv("AWS_DEFAULT_REGION", "us-east-1") + monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing") + monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing") + + with mock_aws(): + dynamodb = boto3.resource("dynamodb", region_name="us-east-1") + dynamodb.create_table( + TableName=TABLE_NAME, + KeySchema=[ + {"AttributeName": "PK", "KeyType": "HASH"}, + {"AttributeName": "SK", "KeyType": "RANGE"}, + ], + AttributeDefinitions=[ + {"AttributeName": "PK", "AttributeType": "S"}, + {"AttributeName": "SK", "AttributeType": "S"}, + ], + BillingMode="PAY_PER_REQUEST", + ) + + # Both tasks share the data-key secret (otherwise the cookie sealed + # by Task A would unseal as `bad seal` on Task B — that's the OTHER + # bug in this branch, exercised by test_cookie). We pre-inject one + # AES key here to keep the test focused on the refresh-lock path. 
+ shared_aes_key = secrets.token_bytes(32) + + def _make_codec() -> CookieCodec: + codec = CookieCodec( + kms_key_arn="arn:aws:kms:fake", + data_key_secret_arn="arn:aws:secretsmanager:fake", + ) + codec._cipher = AESGCM(shared_aes_key) + return codec + + def _make_task(*, refresh_client) -> dict: + repo = SessionRepository(table_name=TABLE_NAME) + codec = _make_codec() + cache = SessionCache(ttl_seconds=60) + config = _enabled_config() + + app = FastAPI() + app.add_middleware( + SessionRefreshMiddleware, + config=config, + repository=repo, + cookie_codec=codec, + refresh_client=refresh_client, + cache=cache, + refresh_lock_ttl_seconds=2, # short for tests + ) + + @app.get("/echo") + async def echo(request: Request): + record = getattr(request.state, "bff_session", None) + return { + "has_session": record is not None, + "access_token": ( + record.cognito_access_token if record else None + ), + "refresh_token": ( + record.cognito_refresh_token if record else None + ), + } + + return { + "app": app, + "repository": repo, + "codec": codec, + "cache": cache, + "refresh_client": refresh_client, + } + + yield { + "make_task": _make_task, + "table_name": TABLE_NAME, + "shared_aes_key": shared_aes_key, + "make_codec": _make_codec, + } + + +def _enabled_config() -> BFFConfig: + return BFFConfig( + sessions_table_name="tbl", + cookie_signing_key_arn="arn:aws:kms:fake", + session_ttl_seconds=28800, + refresh_leeway_seconds=60, + cognito_bff_app_client_id="client-id", + cognito_bff_app_client_secret_arn="arn:secret", + inference_api_url=None, + absolute_lifetime_seconds=30 * 24 * 3600, + sliding_renewal_throttle_seconds=300, + ) + + +def _seed_session_in_refresh_window(repository: SessionRepository) -> SessionRecord: + """Persist a session whose access token is inside the refresh leeway, + so the middleware MUST hit the refresh path.""" + now = int(time.time()) + record = SessionRecord( + session_id="sess-cross-task", + user_id="user-001", + username="alice", + 
cognito_access_token="access.original", + cognito_refresh_token="refresh.original", + id_token="id.original", + access_token_exp=now + 5, # within 60s leeway + csrf_secret="csrf-secret", + created_at=now, + last_seen_at=now, + ttl=now + 28800, + ) + asyncio.run(repository.put(record)) + return record + + +def test_only_the_leader_calls_cognito_under_cross_task_contention( + two_task_setup, +) -> None: + """Two tasks see the same cookie in the refresh window. Exactly one + calls Cognito (the leader). The other adopts the leader's tokens + from DDB without ever calling Cognito. + + Pre-fix: BOTH tasks would call Cognito with the same refresh token, + and the loser would get NotAuthorizedException → clear cookie → 401. + """ + # Refresh client A is the leader's; refresh client B simulates the + # follower's. We assert that B is NEVER called. + leader_refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.fresh-from-leader", + refresh_token="refresh.rotated-by-leader", + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) + ) + follower_refresh = AsyncMock( + side_effect=AssertionError( + "Follower MUST NOT call Cognito — peer holds the refresh lock" + ) + ) + + task_a = two_task_setup["make_task"](refresh_client=MagicMock(refresh=leader_refresh)) + task_b = two_task_setup["make_task"](refresh_client=MagicMock(refresh=follower_refresh)) + + record = _seed_session_in_refresh_window(task_a["repository"]) + sealed = task_a["codec"].seal(CookiePayload(session_id=record.session_id)) + + # Drive task_a first (it'll grab the lock and refresh). Then drive + # task_b — it must observe the lock as held (or just released, with + # tokens already rotated on the row) and adopt rather than refresh. 
+ with TestClient(task_a["app"]) as client_a: + response_a = client_a.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + with TestClient(task_b["app"]) as client_b: + response_b = client_b.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + + assert response_a.status_code == 200 + assert response_b.status_code == 200 + assert response_a.json()["has_session"] is True + assert response_b.json()["has_session"] is True + # Both tasks see the leader's freshly rotated tokens. + assert response_a.json()["access_token"] == "access.fresh-from-leader" + assert response_b.json()["access_token"] == "access.fresh-from-leader" + assert response_b.json()["refresh_token"] == "refresh.rotated-by-leader" + + leader_refresh.assert_called_once() + follower_refresh.assert_not_called() + + +def test_follower_polls_until_leader_persists_then_adopts( + two_task_setup, +) -> None: + """Simulates near-simultaneous arrival: task_a gets the lock just + before task_b runs. Task_b's `_wait_for_peer_refresh` polls DDB + and adopts task_a's tokens once they land. + + To force the follower to actually poll (rather than fall through + a fully-completed leader path), we make the leader's Cognito refresh + take a measurable amount of time and start the follower while the + leader is still in flight. + """ + leader_done = asyncio.Event() + follower_started = asyncio.Event() + + async def slow_leader_refresh(*args, **kwargs) -> RefreshResult: + # Wait for the follower to be inside its poll loop, then complete. 
+ await follower_started.wait() + await asyncio.sleep(0.05) + leader_done.set() + return RefreshResult( + access_token="access.fresh-leader", + refresh_token="refresh.rotated-leader", + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) + + leader_refresh = AsyncMock(side_effect=slow_leader_refresh) + follower_refresh = AsyncMock( + side_effect=AssertionError("Follower must NOT call Cognito") + ) + + task_a = two_task_setup["make_task"](refresh_client=MagicMock(refresh=leader_refresh)) + task_b = two_task_setup["make_task"](refresh_client=MagicMock(refresh=follower_refresh)) + record = _seed_session_in_refresh_window(task_a["repository"]) + sealed = task_a["codec"].seal(CookiePayload(session_id=record.session_id)) + + async def drive_both() -> tuple[dict, dict]: + async def hit(client_app): + from httpx import ASGITransport, AsyncClient + + async with AsyncClient( + transport=ASGITransport(app=client_app), base_url="http://t" + ) as client: + response = await client.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + return response.json() + + async def driven_follower(): + # Start the follower a tick later, so the leader has the lock. + await asyncio.sleep(0.02) + follower_started.set() + return await hit(task_b["app"]) + + a, b = await asyncio.gather(hit(task_a["app"]), driven_follower()) + return a, b + + a_body, b_body = asyncio.run(drive_both()) + + assert a_body["has_session"] is True + assert b_body["has_session"] is True + assert a_body["access_token"] == "access.fresh-leader" + assert b_body["access_token"] == "access.fresh-leader" + leader_refresh.assert_called_once() + follower_refresh.assert_not_called() + + +def test_lock_ttl_lets_a_peer_retry_after_a_dead_leader( + two_task_setup, +) -> None: + """Leader's Cognito call fails → lock is released → peer can refresh + on its next request without waiting for the full TTL. + + This guards against the worst case where a leader crashes mid-refresh + and never persists tokens. 
We don't want every subsequent request to + fail closed for the duration of the lock TTL. + """ + leader_refresh = AsyncMock(side_effect=CognitoRefreshError("Cognito down")) + follower_refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.peer-fresh", + refresh_token="refresh.peer-rotated", + id_token="id.peer", + access_token_exp=int(time.time()) + 3600, + ) + ) + + task_a = two_task_setup["make_task"]( + refresh_client=MagicMock(refresh=leader_refresh) + ) + task_b = two_task_setup["make_task"]( + refresh_client=MagicMock(refresh=follower_refresh) + ) + record = _seed_session_in_refresh_window(task_a["repository"]) + sealed = task_a["codec"].seal(CookiePayload(session_id=record.session_id)) + + # Task A: leader fails. The middleware clears its cookie for THIS + # request but releases the lock (so a peer can retry). + with TestClient(task_a["app"]) as client_a: + response_a = client_a.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + assert response_a.status_code == 200 + assert response_a.json()["has_session"] is False + set_cookies_a = response_a.headers.get_list("set-cookie") + assert any( + "__Host-bff_session=" in c and "Max-Age=0" in c for c in set_cookies_a + ), "Task A must clear cookie after its own refresh failed" + + # Task B (different request): lock is released; peer becomes the new + # leader and refreshes successfully. + with TestClient(task_b["app"]) as client_b: + response_b = client_b.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + assert response_b.status_code == 200 + assert response_b.json()["has_session"] is True + assert response_b.json()["access_token"] == "access.peer-fresh" + leader_refresh.assert_called_once() + follower_refresh.assert_called_once() + + +def test_follower_falls_back_terminal_when_leader_disappears_mid_refresh( + two_task_setup, +) -> None: + """Pathological case: leader holds the lock but never persists tokens + AND never releases (e.g. process killed). 
The follower's poll deadline + is bounded by `refresh_lock_ttl_seconds`; after that, this request + fails closed (clear cookie). The user re-auths. + + The next request after this one will see the lock TTL'd and can + re-acquire — that path is covered by + `test_lock_ttl_lets_a_peer_retry_after_a_dead_leader`. + """ + follower_refresh = AsyncMock( + side_effect=AssertionError("Follower must NOT call Cognito while a peer holds the lock") + ) + task_b = two_task_setup["make_task"]( + refresh_client=MagicMock(refresh=follower_refresh) + ) + record = _seed_session_in_refresh_window(task_b["repository"]) + + # Manually park a lock on the row as if some other task is mid-refresh + # but hasn't persisted yet (and won't, for the duration of this test). + asyncio.run( + task_b["repository"].try_acquire_refresh_lock( + session_id=record.session_id, + owner="ghost-task", + lock_ttl_seconds=2, # matches make_task's middleware TTL + ) + ) + + sealed = task_b["codec"].seal(CookiePayload(session_id=record.session_id)) + with TestClient(task_b["app"]) as client_b: + response = client_b.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + + assert response.status_code == 200 + assert response.json()["has_session"] is False + set_cookies = response.headers.get_list("set-cookie") + assert any( + "__Host-bff_session=" in c and "Max-Age=0" in c for c in set_cookies + ), "Follower must clear cookie after polling timed out on a stuck leader" + follower_refresh.assert_not_called() + + +def test_two_tasks_in_parallel_call_cognito_at_most_once( + two_task_setup, +) -> None: + """Pure asyncio gather of one request per task at the same instant. + Whichever wins the conditional UpdateItem becomes the leader; the + other adopts. Combined Cognito call count must be exactly 1. + + This is the closest analogue to the page-load fan-out behavior we + care about in production — two tasks each receive their share of + the 8-endpoint fan-out at the moment the cookie crosses the leeway + window. 
+ """ + refresh_count = {"calls": 0} + + async def counted_refresh(*args, **kwargs): + refresh_count["calls"] += 1 + await asyncio.sleep(0.05) + return RefreshResult( + access_token="access.fresh", + refresh_token="refresh.rotated", + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) + + refresh_a = AsyncMock(side_effect=counted_refresh) + refresh_b = AsyncMock(side_effect=counted_refresh) + + task_a = two_task_setup["make_task"]( + refresh_client=MagicMock(refresh=refresh_a) + ) + task_b = two_task_setup["make_task"]( + refresh_client=MagicMock(refresh=refresh_b) + ) + record = _seed_session_in_refresh_window(task_a["repository"]) + sealed = task_a["codec"].seal(CookiePayload(session_id=record.session_id)) + + async def drive() -> tuple[dict, dict]: + from httpx import ASGITransport, AsyncClient + + async def hit(app): + async with AsyncClient( + transport=ASGITransport(app=app), base_url="http://t" + ) as client: + response = await client.get( + "/echo", cookies={SESSION_COOKIE_NAME: sealed} + ) + return response.json() + + return await asyncio.gather(hit(task_a["app"]), hit(task_b["app"])) + + a_body, b_body = asyncio.run(drive()) + + # Both succeeded. + assert a_body["has_session"] is True + assert b_body["has_session"] is True + # Both got the same fresh tokens (one set, sourced from the leader). + assert a_body["access_token"] == b_body["access_token"] == "access.fresh" + assert a_body["refresh_token"] == b_body["refresh_token"] == "refresh.rotated" + # CRITICAL: across BOTH tasks, Cognito refresh was called exactly once.
+ assert refresh_count["calls"] == 1, ( + f"Cross-task coalescing violated — Cognito refresh was called " + f"{refresh_count['calls']} times across two tasks" + ) diff --git a/backend/tests/apis/shared/sessions_bff/test_session_refresh_middleware.py b/backend/tests/apis/shared/sessions_bff/test_session_refresh_middleware.py index cdc68f7e..7e64f02d 100644 --- a/backend/tests/apis/shared/sessions_bff/test_session_refresh_middleware.py +++ b/backend/tests/apis/shared/sessions_bff/test_session_refresh_middleware.py @@ -226,12 +226,16 @@ def test_near_expiry_session_triggers_refresh_once() -> None: repo = AsyncMock() repo.get.return_value = record codec = _make_codec() + # `refresh_client.refresh` is now `async` (task 3.2) — use AsyncMock so + # `await self._refresh_client.refresh(...)` in the middleware resolves. refresh = MagicMock() - refresh.refresh.return_value = RefreshResult( - access_token="access.fresh", - refresh_token="refresh.fresh", - id_token="id.fresh", - access_token_exp=int(time.time()) + 3600, + refresh.refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.fresh", + refresh_token="refresh.fresh", + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) ) app = _build_app( config=_enabled_config(), repository=repo, codec=codec, refresh_client=refresh @@ -245,7 +249,7 @@ def test_near_expiry_session_triggers_refresh_once() -> None: assert body["has_session"] is True # The refreshed token should be exposed downstream. 
assert body["access_token"] == "access.fresh" - refresh.refresh.assert_called_once_with( + refresh.refresh.assert_awaited_once_with( username="alice", refresh_token="refresh.original" ) repo.update_tokens.assert_awaited_once() @@ -259,7 +263,7 @@ def test_refresh_failure_clears_cookie() -> None: repo.get.return_value = record codec = _make_codec() refresh = MagicMock() - refresh.refresh.side_effect = CognitoRefreshError("rotated") + refresh.refresh = AsyncMock(side_effect=CognitoRefreshError("rotated")) app = _build_app( config=_enabled_config(), repository=repo, codec=codec, refresh_client=refresh ) @@ -298,7 +302,7 @@ def slow_refresh(*, username: str, refresh_token: str) -> RefreshResult: ) refresh = MagicMock() - refresh.refresh.side_effect = slow_refresh + refresh.refresh = AsyncMock(side_effect=slow_refresh) # After the first refresh, repo.get returns the *fresh* record so other # waiters short-circuit out of the refresh branch. @@ -381,7 +385,13 @@ def test_slide_within_throttle_window_does_not_write_or_reemit() -> None: def test_slide_past_throttle_writes_ddb_and_reemits_cookie() -> None: """Once `last_seen_at` is older than the throttle window, the slide fires: one DDB touch with a fresh ttl, plus a Set-Cookie carrying a - fresh Max-Age = session_ttl_seconds.""" + fresh Max-Age = session_ttl_seconds. + + The slide-write is fire-and-forget (task 3.5) — we poll for the + background task's side effect rather than sample immediately. The + observable external contract (Set-Cookie Max-Age) is unchanged; only + the internal timing of the write moves off the request path. 
+ """ record = _make_record() record.last_seen_at = int(time.time()) - 120 # past the 60s throttle repo = AsyncMock() @@ -393,7 +403,21 @@ def test_slide_past_throttle_writes_ddb_and_reemits_cookie() -> None: ) sealed = codec.seal(CookiePayload(session_id=record.session_id)) - response = TestClient(app).get("/echo", cookies={SESSION_COOKIE_NAME: sealed}) + with TestClient(app) as client: + response = client.get("/echo", cookies={SESSION_COOKIE_NAME: sealed}) + # Poll for the fire-and-forget slide-write (task 3.5) INSIDE the + # `with` block — TestClient tears down its anyio portal (and the + # event loop) on `__exit__`, cancelling any unfinished tasks. + # Drive the loop with a second GET if the first request's + # background task hasn't flushed yet. + deadline = time.monotonic() + 1.0 + while time.monotonic() < deadline and repo.touch_last_seen.await_count == 0: + time.sleep(0.01) + if repo.touch_last_seen.await_count == 0: + client.get("/echo") + deadline = time.monotonic() + 1.0 + while time.monotonic() < deadline and repo.touch_last_seen.await_count == 0: + time.sleep(0.01) assert response.status_code == 200 # Exactly one slide-write, and it carries a ttl bumped by ~session_ttl_seconds. @@ -472,6 +496,44 @@ def test_slide_max_age_capped_by_remaining_absolute_lifetime() -> None: assert 350 <= max_age <= 400 +def test_refresh_path_past_absolute_cap_clears_cookie_without_calling_cognito() -> None: + """The refresh path must mirror the slide path's absolute-cap behavior: + once `created_at + absolute_lifetime` has passed, do NOT mint fresh + tokens. Persisting them would also write a past-dated `ttl` + (`min(now + session_ttl_seconds, absolute_cap)` is `< now` past the + cap), which would instantly TTL-evict the row right after the write + and silently log the user out one request later. 
Failing closed up + front avoids burning a Cognito refresh-token rotation we'd just + throw away.""" + record = _make_record(access_token_exp=int(time.time()) + 5) # within leeway + record.created_at = int(time.time()) - 200 # past 100s absolute cap + repo = AsyncMock() + repo.get.return_value = record + codec = _make_codec() + refresh = MagicMock() + refresh.refresh = AsyncMock( + side_effect=AssertionError( + "Cognito refresh MUST NOT be called past absolute lifetime" + ) + ) + app = _build_app( + config=_enabled_config(absolute_lifetime_seconds=100), + repository=repo, + codec=codec, + refresh_client=refresh, + ) + + sealed = codec.seal(CookiePayload(session_id=record.session_id)) + response = TestClient(app).get("/echo", cookies={SESSION_COOKIE_NAME: sealed}) + + assert response.status_code == 200 + assert response.json()["has_session"] is False + refresh.refresh.assert_not_called() + repo.update_tokens.assert_not_called() + cleared = " ".join(response.headers.get_list("set-cookie")) + assert SESSION_COOKIE_NAME in cleared and "Max-Age=0" in cleared + + def test_refresh_path_bumps_ttl_when_persisting_tokens() -> None: """The token-rotation write must also slide the row's ttl forward — otherwise a session that just refreshed could still expire moments @@ -481,11 +543,13 @@ def test_refresh_path_bumps_ttl_when_persisting_tokens() -> None: repo.get.return_value = record codec = _make_codec() refresh = MagicMock() - refresh.refresh.return_value = RefreshResult( - access_token="access.fresh", - refresh_token="refresh.original", # no rotation - id_token="id.fresh", - access_token_exp=int(time.time()) + 3600, + refresh.refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.fresh", + refresh_token="refresh.original", # no rotation + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) ) app = _build_app( config=_enabled_config(), repository=repo, codec=codec, refresh_client=refresh @@ -517,11 +581,13 @@ def 
test_rotation_persist_failure_invalidates_session() -> None: repo.update_tokens.side_effect = RuntimeError("DDB throttled") codec = _make_codec() refresh = MagicMock() - refresh.refresh.return_value = RefreshResult( - access_token="access.fresh", - refresh_token="refresh.ROTATED", # rotation kicked in - id_token="id.fresh", - access_token_exp=int(time.time()) + 3600, + refresh.refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.fresh", + refresh_token="refresh.ROTATED", # rotation kicked in + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) ) app = _build_app( config=_enabled_config(), repository=repo, codec=codec, refresh_client=refresh @@ -550,11 +616,13 @@ def test_non_rotation_persist_failure_does_not_invalidate() -> None: repo.update_tokens.side_effect = RuntimeError("DDB throttled") codec = _make_codec() refresh = MagicMock() - refresh.refresh.return_value = RefreshResult( - access_token="access.fresh", - refresh_token="refresh.original", # SAME — no rotation - id_token="id.fresh", - access_token_exp=int(time.time()) + 3600, + refresh.refresh = AsyncMock( + return_value=RefreshResult( + access_token="access.fresh", + refresh_token="refresh.original", # SAME — no rotation + id_token="id.fresh", + access_token_exp=int(time.time()) + 3600, + ) ) app = _build_app( config=_enabled_config(), repository=repo, codec=codec, refresh_client=refresh diff --git a/backend/tests/apis/shared/sessions_bff/test_single_flight.py b/backend/tests/apis/shared/sessions_bff/test_single_flight.py new file mode 100644 index 00000000..e2159765 --- /dev/null +++ b/backend/tests/apis/shared/sessions_bff/test_single_flight.py @@ -0,0 +1,211 @@ +"""Unit tests for the per-session single-flight primitive. + +Covers the contract documented in +`backend/src/apis/shared/sessions_bff/single_flight.py`: + +1. Two concurrent `resolve_once` calls for the same `session_id` share one + loader invocation; both receive the same result. +2. 
An exception raised by the loader propagates to every current waiter + (leader + all followers). +3. After a loader exception the registry entry is removed, so a subsequent + call starts a fresh leader. +4. Distinct `session_id`s are independent (two different sessions produce two + loader invocations). +5. Happy path: a single caller's result is returned correctly. +""" + +from __future__ import annotations + +import asyncio +import time +from typing import Optional, Tuple + +import pytest + +from apis.shared.sessions_bff import single_flight +from apis.shared.sessions_bff.models import SessionRecord + + +def _make_record(session_id: str = "sess-sf-001") -> SessionRecord: + now = int(time.time()) + return SessionRecord( + session_id=session_id, + user_id=f"user-for-{session_id}", + username="alice", + cognito_access_token="access.token.value", + cognito_refresh_token="refresh.token.value", + id_token="id.token.value", + access_token_exp=now + 3600, + csrf_secret="csrf-secret-deadbeef", + created_at=now, + last_seen_at=now, + ttl=now + 28800, + ) + + +@pytest.fixture(autouse=True) +def _reset_registry(): + """Drop any residual in-flight Futures between tests.""" + single_flight._reset_for_tests() + yield + single_flight._reset_for_tests() + + +@pytest.mark.asyncio +async def test_happy_path_single_caller_returns_loader_result(): + """A lone caller receives the loader's exact return value.""" + record = _make_record() + call_count = 0 + + async def loader() -> Tuple[Optional[SessionRecord], bool]: + nonlocal call_count + call_count += 1 + return record, False + + result = await single_flight.resolve_once("sess-sf-001", loader) + + assert result == (record, False) + assert call_count == 1 + # Registry is clean after success. 
+ assert "sess-sf-001" not in single_flight._inflight + + +@pytest.mark.asyncio +async def test_concurrent_same_session_share_one_loader_invocation(): + """N concurrent `resolve_once` calls on the same session call loader once.""" + record = _make_record() + call_count = 0 + gate = asyncio.Event() + + async def loader() -> Tuple[Optional[SessionRecord], bool]: + nonlocal call_count + call_count += 1 + # Hold the leader open long enough for followers to attach. + await gate.wait() + return record, False + + async def release_after_followers_attach() -> None: + # Give followers a chance to see the existing Future. + await asyncio.sleep(0.05) + gate.set() + + tasks = [ + asyncio.create_task(single_flight.resolve_once("sess-sf-002", loader)) + for _ in range(8) + ] + releaser = asyncio.create_task(release_after_followers_attach()) + + results = await asyncio.gather(*tasks) + await releaser + + assert call_count == 1, "loader must be invoked exactly once for shared session" + for result in results: + assert result == (record, False) + assert "sess-sf-002" not in single_flight._inflight + + +@pytest.mark.asyncio +async def test_loader_exception_propagates_to_all_waiters(): + """An exception from the loader reaches the leader and every follower.""" + call_count = 0 + gate = asyncio.Event() + + class LoaderBoom(RuntimeError): + pass + + async def loader() -> Tuple[Optional[SessionRecord], bool]: + nonlocal call_count + call_count += 1 + await gate.wait() + raise LoaderBoom("cognito exploded") + + async def release_after_followers_attach() -> None: + await asyncio.sleep(0.05) + gate.set() + + tasks = [ + asyncio.create_task(single_flight.resolve_once("sess-sf-003", loader)) + for _ in range(5) + ] + releaser = asyncio.create_task(release_after_followers_attach()) + + results = await asyncio.gather(*tasks, return_exceptions=True) + await releaser + + assert call_count == 1 + assert len(results) == 5 + for outcome in results: + assert isinstance(outcome, LoaderBoom) + assert 
str(outcome) == "cognito exploded" + + +@pytest.mark.asyncio +async def test_registry_entry_removed_after_exception_so_next_call_is_fresh_leader(): + """After a loader failure, the next call must start a new leader.""" + attempts = 0 + + async def failing_loader() -> Tuple[Optional[SessionRecord], bool]: + nonlocal attempts + attempts += 1 + raise ValueError("transient ddb failure") + + with pytest.raises(ValueError): + await single_flight.resolve_once("sess-sf-004", failing_loader) + + # Registry entry must be gone so the next call is a new leader. + assert "sess-sf-004" not in single_flight._inflight + + record = _make_record("sess-sf-004") + + async def succeeding_loader() -> Tuple[Optional[SessionRecord], bool]: + nonlocal attempts + attempts += 1 + return record, False + + result = await single_flight.resolve_once("sess-sf-004", succeeding_loader) + + assert result == (record, False) + assert attempts == 2, "both loaders ran; the failure did not sticky-cache" + assert "sess-sf-004" not in single_flight._inflight + + +@pytest.mark.asyncio +async def test_distinct_sessions_are_independent(): + """Two different `session_id`s run two independent loader invocations.""" + calls: list[str] = [] + record_a = _make_record("sess-A") + record_b = _make_record("sess-B") + + async def loader_for(session_id: str, record: SessionRecord): + async def _loader() -> Tuple[Optional[SessionRecord], bool]: + calls.append(session_id) + # Small sleep to encourage interleaving. 
+ await asyncio.sleep(0.01) + return record, False + + return _loader + + loader_a = await loader_for("sess-A", record_a) + loader_b = await loader_for("sess-B", record_b) + + result_a, result_b = await asyncio.gather( + single_flight.resolve_once("sess-A", loader_a), + single_flight.resolve_once("sess-B", loader_b), + ) + + assert result_a == (record_a, False) + assert result_b == (record_b, False) + assert sorted(calls) == ["sess-A", "sess-B"], "each session's loader runs exactly once" + assert "sess-A" not in single_flight._inflight + assert "sess-B" not in single_flight._inflight + + +@pytest.mark.asyncio +async def test_clear_cookie_flag_is_preserved(): + """`resolve_once` must faithfully propagate the `clear_cookie` bool.""" + + async def loader_none_clear() -> Tuple[Optional[SessionRecord], bool]: + return None, True + + result = await single_flight.resolve_once("sess-sf-005", loader_none_clear) + assert result == (None, True) diff --git a/backend/tests/auth/test_skip_auth.py b/backend/tests/auth/test_skip_auth.py new file mode 100644 index 00000000..dc539be8 --- /dev/null +++ b/backend/tests/auth/test_skip_auth.py @@ -0,0 +1,263 @@ +"""Tests for the SKIP_AUTH=true local-dev bypass. 
+ +Covers: +- `_skip_auth_user()` returns None when disabled, fake User when enabled +- All three auth dependencies bypass when SKIP_AUTH=true +- `_validate_skip_auth_or_raise()` accepts localhost-only CORS_ORIGINS, + rejects empty CORS_ORIGINS, rejects any non-localhost origin +- The CI-guard regex matches realistic leak strings and skips the + legitimate references in dependencies.py / main.py +""" + +import importlib +import re +from unittest.mock import MagicMock, patch + +import pytest +from fastapi import HTTPException + +from apis.shared.auth.dependencies import ( + _skip_auth_user, + get_current_user, + get_current_user_from_session, + get_current_user_trusted, +) +from apis.shared.auth.models import User + + +# --------------------------------------------------------------------------- +# Env helpers +# --------------------------------------------------------------------------- + +@pytest.fixture +def clean_skip_auth_env(monkeypatch): + """Clear all SKIP_AUTH_* env vars so each test starts from a known state.""" + for key in ( + "SKIP_AUTH", + "SKIP_AUTH_ROLES", + "SKIP_AUTH_USER_ID", + "SKIP_AUTH_EMAIL", + "CORS_ORIGINS", + ): + monkeypatch.delenv(key, raising=False) + + +# --------------------------------------------------------------------------- +# _skip_auth_user +# --------------------------------------------------------------------------- + + +class TestSkipAuthUser: + """Tests for the `_skip_auth_user()` helper.""" + + def test_returns_none_when_unset(self, clean_skip_auth_env): + assert _skip_auth_user() is None + + @pytest.mark.parametrize("value", ["false", "0", "", "no", "FALSE"]) + def test_returns_none_when_falsey(self, clean_skip_auth_env, monkeypatch, value): + monkeypatch.setenv("SKIP_AUTH", value) + assert _skip_auth_user() is None + + @pytest.mark.parametrize("value", ["true", "TRUE", "True"]) + def test_returns_user_when_true(self, clean_skip_auth_env, monkeypatch, value): + monkeypatch.setenv("SKIP_AUTH", value) + user = 
_skip_auth_user() + assert isinstance(user, User) + assert user.user_id == "local-dev" + assert user.email == "dev@local" + assert user.name == "Local Dev" + assert user.roles == ["admin"] + + def test_overrides_via_env(self, clean_skip_auth_env, monkeypatch): + monkeypatch.setenv("SKIP_AUTH", "true") + monkeypatch.setenv("SKIP_AUTH_USER_ID", "phil") + monkeypatch.setenv("SKIP_AUTH_EMAIL", "phil@example.com") + monkeypatch.setenv("SKIP_AUTH_ROLES", "admin,DotNetDevelopers, ,QA ") + + user = _skip_auth_user() + assert user is not None + assert user.user_id == "phil" + assert user.email == "phil@example.com" + # whitespace-only entries filtered, surrounding whitespace stripped + assert user.roles == ["admin", "DotNetDevelopers", "QA"] + + +# --------------------------------------------------------------------------- +# Dependency bypass +# --------------------------------------------------------------------------- + + +class TestDependencyBypass: + """Tests that the bypass short-circuits each auth dependency.""" + + @pytest.mark.asyncio + async def test_get_current_user_bypassed(self, clean_skip_auth_env, monkeypatch): + monkeypatch.setenv("SKIP_AUTH", "true") + # Patch the validator to assert it's never consulted. + with patch( + "apis.shared.auth.dependencies._get_cognito_validator" + ) as mock_get: + user = await get_current_user(credentials=None) + assert mock_get.call_count == 0 + assert user.user_id == "local-dev" + assert user.roles == ["admin"] + + @pytest.mark.asyncio + async def test_get_current_user_from_session_bypassed( + self, clean_skip_auth_env, monkeypatch + ): + monkeypatch.setenv("SKIP_AUTH", "true") + # Build a request whose state.bff_session is unset; without the + # bypass this would 401, with the bypass we get the fake user. 
+ request = MagicMock() + request.state = MagicMock(spec=[]) # no bff_session attr + + user = await get_current_user_from_session(request) + assert user.user_id == "local-dev" + + @pytest.mark.asyncio + async def test_get_current_user_trusted_bypassed( + self, clean_skip_auth_env, monkeypatch + ): + monkeypatch.setenv("SKIP_AUTH", "true") + # No credentials supplied — without the bypass this 401s. + user = await get_current_user_trusted(credentials=None) + assert user.user_id == "local-dev" + + @pytest.mark.asyncio + async def test_get_current_user_still_401_when_disabled( + self, clean_skip_auth_env + ): + """Sanity check: with SKIP_AUTH unset, missing credentials still 401.""" + with pytest.raises(HTTPException) as exc: + await get_current_user(credentials=None) + assert exc.value.status_code == 401 + + +# --------------------------------------------------------------------------- +# Startup guard +# --------------------------------------------------------------------------- + + +def _import_main_module(): + """Import (and reload) apis.app_api.main so it picks up current env. + + The module calls `load_dotenv(..., override=True)` at import time, + which would clobber monkeypatched env vars on reload. Patch the + upstream symbol so the `from dotenv import load_dotenv` re-binding + inside the module reload also picks up the no-op. + """ + with patch("dotenv.load_dotenv", lambda *a, **kw: None): + import apis.app_api.main as m + return importlib.reload(m) + + +class TestStartupGuard: + """Tests for `_validate_skip_auth_or_raise()` in app_api/main.py.""" + + def test_noop_when_skip_auth_off(self, clean_skip_auth_env): + m = _import_main_module() + # Doesn't raise even with no CORS_ORIGINS — guard is a no-op. 
+ m._validate_skip_auth_or_raise() + + @pytest.mark.parametrize( + "origins", + [ + "http://localhost:4200", + "http://localhost:4200,http://127.0.0.1:8000", + "http://[::1]:4200", + "http://0.0.0.0:4200", + "http://localhost:4200, http://127.0.0.1:8000 ", # whitespace tolerated + ], + ) + def test_accepts_localhost_origins( + self, clean_skip_auth_env, monkeypatch, origins + ): + monkeypatch.setenv("SKIP_AUTH", "true") + monkeypatch.setenv("CORS_ORIGINS", origins) + m = _import_main_module() + m._validate_skip_auth_or_raise() # no raise + + def test_rejects_empty_cors_origins(self, clean_skip_auth_env, monkeypatch): + monkeypatch.setenv("SKIP_AUTH", "true") + monkeypatch.setenv("CORS_ORIGINS", "") + m = _import_main_module() + with pytest.raises(RuntimeError, match="localhost"): + m._validate_skip_auth_or_raise() + + def test_rejects_unset_cors_origins(self, clean_skip_auth_env, monkeypatch): + monkeypatch.setenv("SKIP_AUTH", "true") + # CORS_ORIGINS deliberately unset + m = _import_main_module() + with pytest.raises(RuntimeError, match="localhost"): + m._validate_skip_auth_or_raise() + + @pytest.mark.parametrize( + "origins", + [ + "https://app.example.com", + "http://localhost:4200,https://app.example.com", # one bad apple + "https://prod.boisestate.edu", + ], + ) + def test_rejects_non_localhost( + self, clean_skip_auth_env, monkeypatch, origins + ): + monkeypatch.setenv("SKIP_AUTH", "true") + monkeypatch.setenv("CORS_ORIGINS", origins) + m = _import_main_module() + with pytest.raises(RuntimeError, match="localhost"): + m._validate_skip_auth_or_raise() + + +# --------------------------------------------------------------------------- +# CI-guard regex +# --------------------------------------------------------------------------- + + +class TestCIGuardPattern: + """Tests that mirror the grep pattern in skip-auth-guard.yml. 
+ + The CI workflow uses `grep -E` with this pattern; we validate the + same regex against representative leak strings and the legitimate + references in our own source so a future refactor of the workflow + has a behavioral spec to test against. + """ + + # Mirrors the PATTERN in .github/workflows/skip-auth-guard.yml + PATTERN = re.compile(r"""SKIP_AUTH[ \t]*[:=][ \t]*["']*true""") + + @pytest.mark.parametrize( + "leak", + [ + "SKIP_AUTH=true", + 'SKIP_AUTH: "true"', + "SKIP_AUTH: true", + "SKIP_AUTH:true", + "SKIP_AUTH: 'true'", + " SKIP_AUTH = true", + 'SKIP_AUTH="true"', + "ENV SKIP_AUTH=true", # Dockerfile + " SKIP_AUTH: 'true' # in some yaml", + ], + ) + def test_matches_leak_strings(self, leak): + assert self.PATTERN.search(leak) is not None, f"missed leak: {leak!r}" + + @pytest.mark.parametrize( + "benign", + [ + 'SKIP_AUTH = "false"', + "SKIP_AUTH=false", + "# Document SKIP_AUTH behaviour", + 'os.environ.get("SKIP_AUTH", "")', + 'if os.environ.get("SKIP_AUTH", "").lower() == "true":', + ], + ) + def test_skips_benign_strings(self, benign): + # The legitimate dependencies.py / main.py references compare against + # "true" but don't *assign* SKIP_AUTH=true, so they shouldn't match. + # The one exception is the inline comparison string "== \"true\"" — + # which the workflow excludes via path-based filtering, not regex. + # We only assert the pattern itself doesn't trip on these forms. + assert self.PATTERN.search(benign) is None, f"false positive: {benign!r}" diff --git a/backend/tests/conftest.py b/backend/tests/conftest.py index 5be3a462..7d9a5a30 100644 --- a/backend/tests/conftest.py +++ b/backend/tests/conftest.py @@ -4,6 +4,8 @@ import sys from pathlib import Path +import pytest + # Ensure AWS region is set so that module-level boto3 calls don't fail # during import (e.g. 
agents.main_agent.quota -> boto3.resource('dynamodb')) os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1") @@ -16,3 +18,36 @@ if str(SRC_DIR) not in sys.path: sys.path.insert(0, str(SRC_DIR)) + +# Scrub SKIP_AUTH bleed from local .env. Some tests reload +# `apis.app_api.main`, which calls `load_dotenv(override=True)` and +# clobbers process env with whatever `backend/src/.env` has set — +# typically `SKIP_AUTH=true` for local dev. Without this fixture every +# auth-aware test downstream of that reload returns the fake bypass +# user. Tests that need SKIP_AUTH on can still set it via monkeypatch +# (test-local setenv runs after this autouse delenv). +# +# Manages os.environ directly rather than depending on monkeypatch so +# this autouse fixture doesn't perturb fixture-teardown ordering for +# tests that already use monkeypatch + their own autouse fixtures +# (e.g. tests/apis/app_api/test_connectors_routes.py). +_SKIP_AUTH_ENV_KEYS = ( + "SKIP_AUTH", + "SKIP_AUTH_ROLES", + "SKIP_AUTH_USER_ID", + "SKIP_AUTH_EMAIL", +) + + +@pytest.fixture(autouse=True) +def _clear_skip_auth_env(): + saved = {k: os.environ.pop(k, None) for k in _SKIP_AUTH_ENV_KEYS} + try: + yield + finally: + for k, v in saved.items(): + if v is None: + os.environ.pop(k, None) + else: + os.environ[k] = v + diff --git a/backend/tests/costs/test_calculator.py b/backend/tests/costs/test_calculator.py new file mode 100644 index 00000000..0d6e8120 --- /dev/null +++ b/backend/tests/costs/test_calculator.py @@ -0,0 +1,282 @@ +"""Unit tests for CostCalculator — the source-of-truth for all USD math. + +These tests pin the per-bucket pricing formula, the cache-savings derivation, +and the input-validation predicates. The aggregator and storage tests cover +this code transitively, but only through mocks; this module is the only +place the math itself is asserted directly. 
+ +Conventions for cases: + - "Sonnet 4.5 pricing" reflects Bedrock's published rates so a regression + in the formula would be visible in dollar terms a reader can sanity-check. + - Floats are compared with ``pytest.approx`` to avoid 1e-15 drift. +""" + +import pytest + +from apis.shared.costs.calculator import CostCalculator +from apis.shared.costs.models import CostBreakdown + + +# Bedrock rates for Claude Sonnet 4.5 ($/Mtok). Used as the "realistic" +# baseline so dollar amounts in tests can be compared to a published source. +SONNET_45_PRICING = { + "inputPricePerMtok": 3.0, + "outputPricePerMtok": 15.0, + "cacheWritePricePerMtok": 3.75, + "cacheReadPricePerMtok": 0.30, +} + + +class TestCalculateMessageCostBasic: + """Core formula: per-bucket pricing summed into total.""" + + def test_input_only(self): + usage = {"inputTokens": 1_000_000, "outputTokens": 0} + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + assert total == pytest.approx(3.0) + assert breakdown.input_cost == pytest.approx(3.0) + assert breakdown.output_cost == 0.0 + assert breakdown.cache_read_cost == 0.0 + assert breakdown.cache_write_cost == 0.0 + + def test_output_only(self): + usage = {"inputTokens": 0, "outputTokens": 1_000_000} + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + assert total == pytest.approx(15.0) + assert breakdown.output_cost == pytest.approx(15.0) + assert breakdown.input_cost == 0.0 + + def test_input_and_output_no_cache(self): + """Realistic short turn: 1k input + 500 output on Sonnet 4.5.""" + usage = {"inputTokens": 1_000, "outputTokens": 500} + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + # 1000/1M * 3.00 + 500/1M * 15.00 = 0.003 + 0.0075 = 0.0105 + assert total == pytest.approx(0.0105) + assert breakdown.input_cost == pytest.approx(0.003) + assert breakdown.output_cost == pytest.approx(0.0075) + + def test_breakdown_components_sum_to_total(self): + 
"""The total in the breakdown must equal the sum of its parts.""" + usage = { + "inputTokens": 1_234, + "outputTokens": 567, + "cacheReadInputTokens": 8_910, + "cacheWriteInputTokens": 2_345, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + component_sum = ( + breakdown.input_cost + + breakdown.output_cost + + breakdown.cache_read_cost + + breakdown.cache_write_cost + ) + assert breakdown.total_cost == pytest.approx(component_sum) + assert total == pytest.approx(component_sum) + + +class TestCalculateMessageCostWithCache: + """Cache buckets price separately and add to the total.""" + + def test_cache_read_only(self): + """A subsequent turn hitting the prompt cache.""" + usage = { + "inputTokens": 100, # uncached suffix + "outputTokens": 200, + "cacheReadInputTokens": 5_000, # cached prefix + "cacheWriteInputTokens": 0, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + # input: 100/1M * 3 = 0.0003 + # output: 200/1M * 15 = 0.003 + # cache_read: 5000/1M * 0.30 = 0.0015 + assert breakdown.input_cost == pytest.approx(0.0003) + assert breakdown.output_cost == pytest.approx(0.003) + assert breakdown.cache_read_cost == pytest.approx(0.0015) + assert breakdown.cache_write_cost == 0.0 + assert total == pytest.approx(0.0048) + + def test_cache_write_only(self): + """The first turn that establishes the cache pays the write premium.""" + usage = { + "inputTokens": 0, + "outputTokens": 100, + "cacheReadInputTokens": 0, + "cacheWriteInputTokens": 5_000, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + # cache_write: 5000/1M * 3.75 = 0.01875 + # output: 100/1M * 15 = 0.0015 + assert breakdown.cache_write_cost == pytest.approx(0.01875) + assert breakdown.output_cost == pytest.approx(0.0015) + assert breakdown.cache_read_cost == 0.0 + assert total == pytest.approx(0.02025) + + def test_cache_read_and_write_mixed(self): + """A turn that hits part of the cache 
and writes a new section.""" + usage = { + "inputTokens": 200, + "outputTokens": 300, + "cacheReadInputTokens": 10_000, + "cacheWriteInputTokens": 2_000, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + assert breakdown.input_cost == pytest.approx(200 / 1_000_000 * 3.0) + assert breakdown.output_cost == pytest.approx(300 / 1_000_000 * 15.0) + assert breakdown.cache_read_cost == pytest.approx(10_000 / 1_000_000 * 0.30) + assert breakdown.cache_write_cost == pytest.approx(2_000 / 1_000_000 * 3.75) + assert total == pytest.approx( + breakdown.input_cost + + breakdown.output_cost + + breakdown.cache_read_cost + + breakdown.cache_write_cost + ) + + def test_docstring_example_holds(self): + """The docstring example must match the implementation.""" + usage = { + "inputTokens": 1_000, + "outputTokens": 500, + "cacheReadInputTokens": 200, + "cacheWriteInputTokens": 100, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + assert breakdown.input_cost == pytest.approx(0.003) + assert breakdown.output_cost == pytest.approx(0.0075) + assert breakdown.cache_read_cost == pytest.approx(0.00006) + assert breakdown.cache_write_cost == pytest.approx(0.000375) + assert total == pytest.approx(0.010935) + + +class TestCalculateMessageCostDefensive: + """Missing or None fields should degrade to 0, never raise.""" + + def test_missing_pricing_fields_default_to_zero(self): + """Cache prices may be absent for non-Bedrock providers.""" + pricing = {"inputPricePerMtok": 1.0, "outputPricePerMtok": 2.0} + usage = { + "inputTokens": 1_000_000, + "outputTokens": 1_000_000, + "cacheReadInputTokens": 1_000_000, + "cacheWriteInputTokens": 1_000_000, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, pricing) + assert breakdown.input_cost == pytest.approx(1.0) + assert breakdown.output_cost == pytest.approx(2.0) + assert breakdown.cache_read_cost == 0.0 + assert breakdown.cache_write_cost == 0.0 + 
assert total == pytest.approx(3.0) + + def test_none_pricing_values_default_to_zero(self): + """A managed-model row with explicit None for cache prices must not raise.""" + pricing = { + "inputPricePerMtok": 3.0, + "outputPricePerMtok": 15.0, + "cacheReadPricePerMtok": None, + "cacheWritePricePerMtok": None, + } + usage = { + "inputTokens": 1_000, + "outputTokens": 500, + "cacheReadInputTokens": 1_000, + "cacheWriteInputTokens": 1_000, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, pricing) + assert breakdown.cache_read_cost == 0.0 + assert breakdown.cache_write_cost == 0.0 + + def test_none_usage_values_default_to_zero(self): + usage = { + "inputTokens": None, + "outputTokens": None, + "cacheReadInputTokens": None, + "cacheWriteInputTokens": None, + } + total, breakdown = CostCalculator.calculate_message_cost(usage, SONNET_45_PRICING) + assert total == 0.0 + assert breakdown.input_cost == 0.0 + assert breakdown.output_cost == 0.0 + assert breakdown.cache_read_cost == 0.0 + assert breakdown.cache_write_cost == 0.0 + + def test_empty_usage_and_pricing(self): + total, breakdown = CostCalculator.calculate_message_cost({}, {}) + assert total == 0.0 + assert isinstance(breakdown, CostBreakdown) + + +class TestCalculateCacheSavings: + """Cache savings = (input_price - cache_read_price) * read_tokens / 1M.""" + + def test_typical_savings(self): + """200 read tokens at Sonnet 4.5 rates.""" + savings = CostCalculator.calculate_cache_savings(200, 3.0, 0.30) + # standard: 200/1M * 3 = 0.0006; cached: 200/1M * 0.30 = 0.00006 + assert savings == pytest.approx(0.00054) + + def test_zero_reads_returns_zero(self): + assert CostCalculator.calculate_cache_savings(0, 3.0, 0.30) == 0.0 + + def test_none_reads_returns_zero(self): + """``None`` is the realistic shape from a model that didn't hit cache.""" + assert CostCalculator.calculate_cache_savings(None, 3.0, 0.30) == 0.0 + + def test_none_prices_default_to_zero(self): + """None prices must not raise — the 
formula collapses cleanly to 0.""" + assert CostCalculator.calculate_cache_savings(1_000, None, None) == 0.0 + + def test_savings_equals_full_input_cost_when_cache_is_free(self): + """If cache reads are priced at 0, savings is the full input cost.""" + savings = CostCalculator.calculate_cache_savings(1_000_000, 3.0, 0.0) + assert savings == pytest.approx(3.0) + + +class TestValidatePricing: + """validate_pricing requires inputPricePerMtok and outputPricePerMtok with non-None values.""" + + def test_complete_pricing_is_valid(self): + assert CostCalculator.validate_pricing(SONNET_45_PRICING) is True + + def test_minimal_pricing_is_valid(self): + """Cache fields are not required.""" + assert CostCalculator.validate_pricing({ + "inputPricePerMtok": 1.0, + "outputPricePerMtok": 2.0, + }) is True + + def test_missing_input_price_is_invalid(self): + assert CostCalculator.validate_pricing({"outputPricePerMtok": 2.0}) is False + + def test_missing_output_price_is_invalid(self): + assert CostCalculator.validate_pricing({"inputPricePerMtok": 1.0}) is False + + def test_none_value_is_invalid(self): + assert CostCalculator.validate_pricing({ + "inputPricePerMtok": None, + "outputPricePerMtok": 2.0, + }) is False + + +class TestValidateUsage: + """validate_usage requires inputTokens and outputTokens with non-None values.""" + + def test_complete_usage_is_valid(self): + assert CostCalculator.validate_usage({ + "inputTokens": 100, + "outputTokens": 50, + }) is True + + def test_zero_values_are_valid(self): + """Zero is a real measurement, not an absence.""" + assert CostCalculator.validate_usage({ + "inputTokens": 0, + "outputTokens": 0, + }) is True + + def test_missing_input_tokens_is_invalid(self): + assert CostCalculator.validate_usage({"outputTokens": 50}) is False + + def test_none_value_is_invalid(self): + assert CostCalculator.validate_usage({ + "inputTokens": None, + "outputTokens": 50, + }) is False diff --git a/backend/uv.lock b/backend/uv.lock index df65cc44..d03cc130 
100644 --- a/backend/uv.lock +++ b/backend/uv.lock @@ -1,5 +1,5 @@ version = 1 -revision = 2 +revision = 3 requires-python = ">=3.10" resolution-markers = [ "python_full_version >= '3.15'", @@ -12,7 +12,7 @@ resolution-markers = [ [[package]] name = "agentcore-stack" -version = "1.0.0b24" +version = "1.0.0b25" source = { editable = "." } dependencies = [ { name = "aiofiles" }, @@ -25,6 +25,7 @@ dependencies = [ { name = "httpx" }, { name = "pillow" }, { name = "pyjwt", extra = ["crypto"] }, + { name = "pypdfium2" }, { name = "python-dotenv" }, { name = "python-multipart" }, { name = "starlette" }, @@ -98,6 +99,7 @@ requires-dist = [ { name = "openai", marker = "extra == 'agentcore'", specifier = "==2.32.0" }, { name = "pillow", specifier = "==12.2.0" }, { name = "pyjwt", extras = ["crypto"], specifier = "==2.12.1" }, + { name = "pypdfium2", specifier = "==4.30.0" }, { name = "pytest", marker = "extra == 'dev'", specifier = "==9.0.3" }, { name = "pytest-asyncio", marker = "extra == 'dev'", specifier = "==1.3.0" }, { name = "pytest-cov", marker = "extra == 'dev'", specifier = "==7.1.0" }, @@ -105,11 +107,11 @@ requires-dist = [ { name = "python-multipart", specifier = "==0.0.27" }, { name = "ruff", marker = "extra == 'dev'", specifier = "==0.15.12" }, { name = "starlette", specifier = "==1.0.0" }, - { name = "strands-agents", marker = "extra == 'agentcore'", specifier = "==1.37.0" }, - { name = "strands-agents", extras = ["bidi"], marker = "extra == 'bidi'", specifier = "==1.37.0" }, - { name = "strands-agents-tools", marker = "extra == 'agentcore'", specifier = "==0.5.1" }, + { name = "strands-agents", marker = "extra == 'agentcore'", specifier = "==1.39.0" }, + { name = "strands-agents", extras = ["bidi"], marker = "extra == 'bidi'", specifier = "==1.39.0" }, + { name = "strands-agents-tools", marker = "extra == 'agentcore'", specifier = "==0.5.2" }, { name = "tiktoken", marker = "extra == 'dev'", specifier = "==0.12.0" }, - { name = "types-aiofiles", marker = 
"extra == 'dev'", specifier = "==25.1.0.20251011" }, + { name = "types-aiofiles", marker = "extra == 'dev'", specifier = "==25.1.0.20260409" }, { name = "uvicorn", extras = ["standard"], specifier = "==0.46.0" }, ] provides-extras = ["agentcore", "bidi", "dev", "all"] @@ -3668,6 +3670,26 @@ crypto = [ { name = "cryptography" }, ] +[[package]] +name = "pypdfium2" +version = "4.30.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/a1/14/838b3ba247a0ba92e4df5d23f2bea9478edcfd72b78a39d6ca36ccd84ad2/pypdfium2-4.30.0.tar.gz", hash = "sha256:48b5b7e5566665bc1015b9d69c1ebabe21f6aee468b509531c3c8318eeee2e16", size = 140239, upload-time = "2024-05-09T18:33:17.552Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/c7/9a/c8ff5cc352c1b60b0b97642ae734f51edbab6e28b45b4fcdfe5306ee3c83/pypdfium2-4.30.0-py3-none-macosx_10_13_x86_64.whl", hash = "sha256:b33ceded0b6ff5b2b93bc1fe0ad4b71aa6b7e7bd5875f1ca0cdfb6ba6ac01aab", size = 2837254, upload-time = "2024-05-09T18:32:48.653Z" }, + { url = "https://files.pythonhosted.org/packages/21/8b/27d4d5409f3c76b985f4ee4afe147b606594411e15ac4dc1c3363c9a9810/pypdfium2-4.30.0-py3-none-macosx_11_0_arm64.whl", hash = "sha256:4e55689f4b06e2d2406203e771f78789bd4f190731b5d57383d05cf611d829de", size = 2707624, upload-time = "2024-05-09T18:32:51.458Z" }, + { url = "https://files.pythonhosted.org/packages/11/63/28a73ca17c24b41a205d658e177d68e198d7dde65a8c99c821d231b6ee3d/pypdfium2-4.30.0-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4e6e50f5ce7f65a40a33d7c9edc39f23140c57e37144c2d6d9e9262a2a854854", size = 2793126, upload-time = "2024-05-09T18:32:53.581Z" }, + { url = "https://files.pythonhosted.org/packages/d1/96/53b3ebf0955edbd02ac6da16a818ecc65c939e98fdeb4e0958362bd385c8/pypdfium2-4.30.0-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:3d0dd3ecaffd0b6dbda3da663220e705cb563918249bda26058c6036752ba3a2", size = 2591077, 
upload-time = "2024-05-09T18:32:55.99Z" }, + { url = "https://files.pythonhosted.org/packages/ec/ee/0394e56e7cab8b5b21f744d988400948ef71a9a892cbeb0b200d324ab2c7/pypdfium2-4.30.0-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:cc3bf29b0db8c76cdfaac1ec1cde8edf211a7de7390fbf8934ad2aa9b4d6dfad", size = 2864431, upload-time = "2024-05-09T18:32:57.911Z" }, + { url = "https://files.pythonhosted.org/packages/65/cd/3f1edf20a0ef4a212a5e20a5900e64942c5a374473671ac0780eaa08ea80/pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f1f78d2189e0ddf9ac2b7a9b9bd4f0c66f54d1389ff6c17e9fd9dc034d06eb3f", size = 2812008, upload-time = "2024-05-09T18:32:59.886Z" }, + { url = "https://files.pythonhosted.org/packages/c8/91/2d517db61845698f41a2a974de90762e50faeb529201c6b3574935969045/pypdfium2-4.30.0-py3-none-musllinux_1_1_aarch64.whl", hash = "sha256:5eda3641a2da7a7a0b2f4dbd71d706401a656fea521b6b6faa0675b15d31a163", size = 6181543, upload-time = "2024-05-09T18:33:02.597Z" }, + { url = "https://files.pythonhosted.org/packages/ba/c4/ed1315143a7a84b2c7616569dfb472473968d628f17c231c39e29ae9d780/pypdfium2-4.30.0-py3-none-musllinux_1_1_i686.whl", hash = "sha256:0dfa61421b5eb68e1188b0b2231e7ba35735aef2d867d86e48ee6cab6975195e", size = 6175911, upload-time = "2024-05-09T18:33:05.376Z" }, + { url = "https://files.pythonhosted.org/packages/7a/c4/9e62d03f414e0e3051c56d5943c3bf42aa9608ede4e19dc96438364e9e03/pypdfium2-4.30.0-py3-none-musllinux_1_1_x86_64.whl", hash = "sha256:f33bd79e7a09d5f7acca3b0b69ff6c8a488869a7fab48fdf400fec6e20b9c8be", size = 6267430, upload-time = "2024-05-09T18:33:08.067Z" }, + { url = "https://files.pythonhosted.org/packages/90/47/eda4904f715fb98561e34012826e883816945934a851745570521ec89520/pypdfium2-4.30.0-py3-none-win32.whl", hash = "sha256:ee2410f15d576d976c2ab2558c93d392a25fb9f6635e8dd0a8a3a5241b275e0e", size = 2775951, upload-time = "2024-05-09T18:33:10.567Z" }, + { url = 
"https://files.pythonhosted.org/packages/25/bd/56d9ec6b9f0fc4e0d95288759f3179f0fcd34b1a1526b75673d2f6d5196f/pypdfium2-4.30.0-py3-none-win_amd64.whl", hash = "sha256:90dbb2ac07be53219f56be09961eb95cf2473f834d01a42d901d13ccfad64b4c", size = 2892098, upload-time = "2024-05-09T18:33:13.107Z" }, + { url = "https://files.pythonhosted.org/packages/be/7a/097801205b991bc3115e8af1edb850d30aeaf0118520b016354cf5ccd3f6/pypdfium2-4.30.0-py3-none-win_arm64.whl", hash = "sha256:119b2969a6d6b1e8d55e99caaf05290294f2d0fe49c12a3f17102d01c441bd29", size = 2752118, upload-time = "2024-05-09T18:33:15.489Z" }, +] + [[package]] name = "pytest" version = "9.0.3" @@ -4362,7 +4384,7 @@ wheels = [ [[package]] name = "strands-agents" -version = "1.37.0" +version = "1.39.0" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "boto3" }, @@ -4378,9 +4400,9 @@ dependencies = [ { name = "typing-extensions" }, { name = "watchdog" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/03/88/cf23aa713ea68c8a0ad5144341da7ee022e88ce6206512aeafddba257b75/strands_agents-1.37.0.tar.gz", hash = "sha256:3fe6821f730f0468eee91e1ff38eb27a5244046893ffba63e8f5345288096509", size = 824168, upload-time = "2026-04-22T19:18:01.378Z" } +sdist = { url = "https://files.pythonhosted.org/packages/d6/5b/e267a7dab0b4a6d39133c9c0c516f93f33483e29f39e05c03b755f993ef6/strands_agents-1.39.0.tar.gz", hash = "sha256:efff5914323b8b4b472ca3f13c7115a5746935b00bc86dacc40a5d1ab1242817", size = 873258, upload-time = "2026-05-08T13:27:19.661Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/5f/ff/bede1b8d5fe1c776bd5ed33575505681b3b65ab20889fe6b8344b92fc82d/strands_agents-1.37.0-py3-none-any.whl", hash = "sha256:2fa12e22ed1dac228aa93e91c2ea5381d9b3f08416ed8162222b61b255fee0b1", size = 404526, upload-time = "2026-04-22T19:17:59.634Z" }, + { url = "https://files.pythonhosted.org/packages/95/41/d054b5a5f54175eb4e775d1e408e169439eba6be63e9e8f2e77ff44e38fc/strands_agents-1.39.0-py3-none-any.whl", 
hash = "sha256:7369dbfc6be29f59483a6183f5aacf0bdd0e7e5973b4b70f8d0e663880d42f79", size = 430272, upload-time = "2026-05-08T13:27:18.088Z" }, ] [package.optional-dependencies] @@ -4391,7 +4413,7 @@ bidi = [ [[package]] name = "strands-agents-tools" -version = "0.5.1" +version = "0.5.2" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "aiohttp" }, @@ -4412,9 +4434,9 @@ dependencies = [ { name = "tzdata", marker = "sys_platform == 'win32'" }, { name = "watchdog" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/b4/fc/8a9da78b5c4a8802367a8eeec046f98eda742b1ee1b2fff568c81c1b3479/strands_agents_tools-0.5.1.tar.gz", hash = "sha256:616ba88b5849d9fd495da057ccb670108580320b8cb0fc4faac5fc327f2622aa", size = 483123, upload-time = "2026-04-22T20:01:13.305Z" } +sdist = { url = "https://files.pythonhosted.org/packages/63/32/710a49ffd32b0a232ec1731620ee6105c045e9a77ecee1f3ecaa1a80a6cd/strands_agents_tools-0.5.2.tar.gz", hash = "sha256:96763c8ae75933c5dd327cca87561f573aed720c9c0f3d17fd20835910d11381", size = 483164, upload-time = "2026-04-30T17:08:13.151Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/64/59/79360f718683ae15cefeb8b0ca1e6d96608c4581280fb12b0f502375a705/strands_agents_tools-0.5.1-py3-none-any.whl", hash = "sha256:790865d073410e9a16ac44ce3a46c169b98e1f89844ce8670472b869257b7686", size = 316122, upload-time = "2026-04-22T20:01:11.599Z" }, + { url = "https://files.pythonhosted.org/packages/59/ef/fe73b6d25d095784d2e1f6f33419265e796143100fb2f32a6e86f8ae68af/strands_agents_tools-0.5.2-py3-none-any.whl", hash = "sha256:8f85e4cb28d9411e62e1f159aa7e300d3a0f4b1d2b878a7cdfd5d746d9333343", size = 316178, upload-time = "2026-04-30T17:08:11.416Z" }, ] [[package]] @@ -4567,11 +4589,11 @@ wheels = [ [[package]] name = "types-aiofiles" -version = "25.1.0.20251011" +version = "25.1.0.20260409" source = { registry = "https://pypi.org/simple" } -sdist = { url = 
"https://files.pythonhosted.org/packages/84/6c/6d23908a8217e36704aa9c79d99a620f2fdd388b66a4b7f72fbc6b6ff6c6/types_aiofiles-25.1.0.20251011.tar.gz", hash = "sha256:1c2b8ab260cb3cd40c15f9d10efdc05a6e1e6b02899304d80dfa0410e028d3ff", size = 14535, upload-time = "2025-10-11T02:44:51.237Z" } +sdist = { url = "https://files.pythonhosted.org/packages/6c/66/9e62a2692792bc96c0f423f478149f4a7b84720704c546c8960b0a047c89/types_aiofiles-25.1.0.20260409.tar.gz", hash = "sha256:49e67d72bdcf9fe406f5815758a78dc34a1249bb5aa2adba78a80aec0a775435", size = 14812, upload-time = "2026-04-09T04:22:35.308Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/71/0f/76917bab27e270bb6c32addd5968d69e558e5b6f7fb4ac4cbfa282996a96/types_aiofiles-25.1.0.20251011-py3-none-any.whl", hash = "sha256:8ff8de7f9d42739d8f0dadcceeb781ce27cd8d8c4152d4a7c52f6b20edb8149c", size = 14338, upload-time = "2025-10-11T02:44:50.054Z" }, + { url = "https://files.pythonhosted.org/packages/27/d0/28236f869ba4dfb223ecdbc267eb2bdb634b81a561dd992230a4f9ec48fa/types_aiofiles-25.1.0.20260409-py3-none-any.whl", hash = "sha256:923fedb532c772cc0f62e0ce4282725afa82ca5b41cabd9857f06b55e5eee8de", size = 14372, upload-time = "2026-04-09T04:22:34.328Z" }, ] [[package]] diff --git a/docs/kaizen/research/2026-05-10.md b/docs/kaizen/research/2026-05-10.md new file mode 100644 index 00000000..b009ba2e --- /dev/null +++ b/docs/kaizen/research/2026-05-10.md @@ -0,0 +1,265 @@ +# Kaizen Research — Sunday, May 10, 2026 +> Scan window: May 3 – May 10, 2026 (7 days; reference repo + UI/UX scan extended to 30 days for first-run baseline) +> Web budget: 64/50 used (target — UX-lens scan added 10 requests post-initial-run). Frontier-models also went over the sub-budget by ~5 due to two OpenAI WebFetch 403s. +> **Bootstrap run** — first execution of the kaizen-research skill. Subsequent runs cover only the prior 7 days for the reference repo + UX sources too. + +## TL;DR + +Three converging signals this week: +1. 
**MCP Apps is now the de-facto agentic UI standard**, and we don't host it yet. The spec (SEP-1865) is production-ready: tool results can declare a `ui://` resource that the host renders in a sandboxed iframe alongside the chat. Claude Desktop, ChatGPT, VS Code Copilot, Goose, Postman all ship support. Every third-party MCP server we connect could be shipping richer UX than text+JSON; we're leaving that on the table. +2. **Upstream is shrinking our backlog for free**: our open issues #266/#267 were quietly solved in Strands v1.37/v1.38 (now in our 1.39 pin from #265); `bedrock-agentcore` is 3 minor versions behind (1.6.4 → 1.9.0, latest published May 7 — inside the scan window) with likely fixes for two open SDK issues we feel. +3. **CI is broken**: 9 Nightly Build & Test failures + 6+ Deploy failures in 7 days, untriaged. + +**Recommended #1**: scope an MCP Apps host renderer in our chat (multi-PR initiative). It's the highest-leverage agentic-UX investment this week per the scan. **Recommended quick-win**: bump `bedrock-agentcore` 1.6.4 → 1.9.0. + +## External Scan + +### What's moving this week + +The week converged on two themes worth our attention. First, AWS shipped two AgentCore capabilities that map cleanly onto things we already do: **AgentCore Runtime BYO filesystem from S3/EFS** (cross-session filesystem persistence without custom mount code) and **AgentCore Memory metadata** (structured tags on long-term memory records for filtered retrieval). Both are direct value-adds to our `inference-api` and our `TurnBasedSessionManager` layer. Second, Strands has been cleaning up the long tail: v1.37 added a context-window lookup table (closes our open issue #267), v1.38 added large tool result offload (closes our open issue #266), and v1.39 — which we just pinned in #265 — added AWS-profile support for the OpenAI provider. We're caught up to the head, but we haven't yet *used* the v1.37/v1.38 features the upgrade unlocked. 
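The "3 minor versions behind" figure quoted for `bedrock-agentcore` (1.6.4 → 1.9.0) is plain version arithmetic. A minimal sketch of how a version-pin lag check might compute it, assuming simple `MAJOR.MINOR.PATCH` strings (pre-release tags such as `1.0.0b25` are not handled); the helper names `parse_version` and `minor_lag` are hypothetical and not part of the skill:

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def minor_lag(pinned: str, latest: str) -> Optional[int]:
    """Return how many minor versions `pinned` trails `latest`.

    Returns None when the major versions differ, because a minor-version
    count is not meaningful across a major bump.
    """
    pinned_major, pinned_minor, _ = parse_version(pinned)
    latest_major, latest_minor, _ = parse_version(latest)
    if pinned_major != latest_major:
        return None
    # Clamp at zero so a pin that is ahead of `latest` reports no lag.
    return max(latest_minor - pinned_minor, 0)


# The bedrock-agentcore pin from this scan: 1.6.4 against 1.9.0.
print(minor_lag("1.6.4", "1.9.0"))  # 3
```

A real audit would more likely lean on a proper version parser; the point here is only that the lag number in the TL;DR is the difference of the minor components within one major line.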
+ +The reference repo (`aws-samples/sample-strands-agent-with-agentcore`) has diverged from us in one major direction (CDK → Terraform on Apr 19) and converged in several minor ones — most notably moving compaction state and per-message metadata onto Strands' own `agent.state` and `message.metadata` instead of a custom DynamoDB table. They also abandoned the `enabledTools` whitelist pattern that's still embedded in our CLAUDE.md, in favor of a `disabled_skills` blacklist read from DDB per-request. Those are architectural calls, not direct ports. + +The MCP spec is heading toward stateless transport (SEP-2567 sessionless MCP merged May 7), which is a strong fit for our SigV4 Gateway model — but our Python `mcp` library hasn't picked it up yet (current 1.27.1). Watch. + +### Notable items by source + +#### AWS Bedrock / AgentCore +- **AgentCore Runtime BYO file system from S3 and EFS** — Attach S3/EFS to runtimes for cross-session persistence without custom mount code — https://aws.amazon.com/about-aws/whats-new/2026/05/amazon-bedrock-agentcore-runtime/ — *relevance*: directly applicable to `inference-api`; could replace future filesystem-staging glue +- **AgentCore Memory adds metadata for long-term memory** — Long-term memory records now support structured metadata for filtered retrieval — https://aws.amazon.com/about-aws/whats-new/2026/05/agentcore-longterm-memory-metadata — *relevance*: `TurnBasedSessionManager` long-term flush could carry user/RBAC/conversation-type metadata for richer recall +- **Secure AI agents with AgentCore Identity on Amazon ECS** — OAuth federation walkthrough for ECS-hosted agents — https://aws.amazon.com/blogs/machine-learning/secure-ai-agents-with-amazon-bedrock-agentcore-identity-on-amazon-ecs/ — *relevance*: useful reference; pattern mirrors our `apis/shared/oauth/agentcore_identity.py` mint-fallback +- **OS-Level Actions in AgentCore Browser** — OS-level control for native UI agents — 
https://aws.amazon.com/blogs/machine-learning/introducing-os-level-actions-in-amazon-bedrock-agentcore-browser/ — *relevance*: informational; we don't use AgentCore Browser +- **AgentCore Payments preview** — Wallet/auth/governance for transactional agents (Coinbase + Stripe partners) — https://aws.amazon.com/blogs/machine-learning/agents-that-transact-introducing-amazon-bedrock-agentcore-payments-built-with-coinbase-and-stripe/ — *relevance*: informational; no commerce path today + +**Open AgentCore SDK issues affecting us:** +- **#456 — OTEL context detached across asyncio/thread boundaries in memory client + Strands session_manager** — https://github.com/aws/bedrock-agentcore-sdk-python/issues/456 — *applicability*: HIGH — we use Strands 1.39 + AgentCore Memory + `TurnBasedSessionManager`; X-Ray/OTEL traces likely show broken spans on memory writes +- **#452 — AgentCoreMemorySessionManager: add `async_mode` to prevent event-loop blocking** — https://github.com/aws/bedrock-agentcore-sdk-python/issues/452 — *applicability*: HIGH — `inference-api` is FastAPI/async; sync flush on the loop could be hurting concurrency +- **#453 — Auto-populate AgentCard.skills[] from ToolRegistry in serve_a2a** — *applicability*: medium; relevant if/when we expose A2A endpoints + +#### Strands Agents +- **v1.39.0 (current pin)** — AWS profile support for OpenAI, MCP init error messaging, Bedrock token-counting enhancements, A2A task-lifecycle states — https://github.com/strands-agents/sdk-python/releases/tag/v1.39.0 — *informational*: just landed in #265 +- **v1.38.0 — large tool result offload + `CachePoint` TTL for prompt caching** — https://github.com/strands-agents/sdk-python/releases/tag/v1.38.0 — *closes our issue #266* +- **v1.37.0 — context-window limit lookup tables + experimental checkpoint API** — https://github.com/strands-agents/sdk-python/releases/tag/v1.37.0 — *closes our issue #267* +- **#2266 — `BedrockModel.stream` leaks inner task on outer cancellation (May 9, 
open)** — https://github.com/strands-agents/sdk-python/issues/2266 — *applicability*: HIGH — we cancel SSE streams on client disconnect; check for "Task exception was never retrieved" in stream_coordinator logs +- **#2271 — Support dual cache prefixes in Bedrock auto caching strategy (May 10)** — https://github.com/strands-agents/sdk-python/issues/2271 — *applicability*: medium; pairs with issue #269 (prompt caching) if we move to Strands' built-in caching strategy +- **#2243 — Tool-level suspend/resume for external async callbacks** — https://github.com/strands-agents/sdk-python/issues/2243 — *applicability*: medium; could simplify our `oauth_required` SSE handoff +- **PR #2239 — Proactive Context Compression (merged May 8)** — https://github.com/strands-agents/sdk-python/pull/2239 — *applicability*: medium; could complement our SSE compaction surfacing + +#### Reference repo: aws-samples/sample-strands-agent-with-agentcore (last 30 days — bootstrap baseline) +- **CDK → Terraform migration (Apr 19, c422fbf)** — https://github.com/aws-samples/sample-strands-agent-with-agentcore/commit/c422fbf — *applicability*: NOT relevant for porting; we're CDK-native. **Implication**: the reference repo is no longer a usable CDK template going forward. Anything CDK-shaped historically pulled from them is frozen at pre-Apr-19 state. +- **Compaction state + metrics moved from custom DynamoDB to SDK `agent.state` + `message.metadata` (Apr 27, 2b1a13d)** — https://github.com/aws-samples/sample-strands-agent-with-agentcore/commit/2b1a13d — *applicability*: HIGH — our `TurnBasedSessionManager` could shed code by piggybacking on `agent.state` rather than maintaining parallel state; potential subtraction +- **Force re-auth on OAuth 401/403 mid-tool-call (Apr 22, 9fcdb4c)** — https://github.com/aws-samples/sample-strands-agent-with-agentcore/commit/9fcdb4c — *applicability*: HIGH — verify our `oauth_required` SSE flow handles mid-conversation 401/403 from Google etc. 
by re-emitting `oauth_required` rather than streaming an error +- **Supersede stale executions instead of 409-rejecting (May 6, d6c9516)** — https://github.com/aws-samples/sample-strands-agent-with-agentcore/commit/d6c9516 — *applicability*: medium; check how app-api handles concurrent submissions on the same conversation +- **Use SDK `agent.cancel()` for stop-signal handling (May 6, fd9acec)** — https://github.com/aws-samples/sample-strands-agent-with-agentcore/commit/fd9acec — *applicability*: medium; if we have custom cancellation code, may simplify +- **`enabledTools` whitelist replaced with `disabled_skills` blacklist (May 3, 092aa33)** — https://github.com/aws-samples/sample-strands-agent-with-agentcore/commit/092aa33 — *applicability*: monitor; our CLAUDE.md still mentions `enabled_tools` as a debug step. Inversion has UX upside but RBAC implications + + #### MCP ecosystem +- **SEP-2567 Sessionless MCP merged (May 7)** — https://github.com/modelcontextprotocol/modelcontextprotocol/pull/2567 — *implications*: drops `Mcp-Session-Id` and `session/create`; list endpoints become cacheable. Strong fit for our SigV4 Gateway model. Watch the Python `mcp` library for adoption. 
+- **SEP-2575 init-removal track (companion)** — same thread — *implications*: stateless HTTP transport simplifies Lambda-backed Gateway servers +- **Schema rename: `IncompleteResult` → `InputRequiredResult`** — typed-API break on next `mcp` lib bump +- **MCPSafe — security scanner for MCP servers** — https://github.com/orgs/modelcontextprotocol/discussions — could scan our Gateway-hosted servers +- **MCP servers repo (no new servers this week)** — discovery has moved to `registry.modelcontextprotocol.io` + +#### FastMCP (used by our externally hosted MCP servers, behind AgentCore Gateway) +- **Latest release: 3.2.4** — published 2026-04-14 (~26 days ago) — https://pypi.org/project/fastmcp/ — *applicability*: cross-reference against our MCP server repos' pinned FastMCP version; if any are behind 3.x, evaluate the migration path. +- **Bootstrap-run note**: FastMCP source category was added mid-bootstrap based on follow-up feedback. Full release-notes + issues scan (https://github.com/jlowin/fastmcp) deferred to the first regular Friday run (2026-05-15). For this bootstrap, only the PyPI version snapshot is captured. + +#### Agentic UI/UX patterns (30-day baseline scan for bootstrap) + +- **MCP Apps extension is production-ready (SEP-1865)** — https://modelcontextprotocol.io/extensions/apps/overview | https://blog.modelcontextprotocol.io/posts/2026-01-26-mcp-apps/ — *what it is*: spec letting MCP tools return a `_meta.ui.resourceUri` pointing to a `ui://` resource; host fetches the HTML and renders it in a sandboxed iframe alongside the chat with bidirectional `ui/`-prefixed JSON-RPC via `postMessage`. Claude Desktop, ChatGPT, VS Code Copilot, Goose, Postman, MCPJam already ship support. — *fit*: **direct port (high impact, high effort)** — this is the standard our chat is going to be measured against in 2026. 
— *where it'd land*: new SSE event (`ui_resource` carrying `{resourceUri, csp, permissions}`), an Angular sandboxed-iframe component implementing the `ui/` host bridge, branch in tool-result rendering pipeline. +- **MCP Apps host security model — sandboxed iframe + opt-in capabilities** — https://modelcontextprotocol.io/extensions/apps/overview — *what it is*: hosts declare capabilities (`sendOpenLink`, mic, camera) a given app can request; tool-call proxying goes through the host with user consent. — *fit*: **direct port** — maps cleanly onto our existing `oauth_required` consent pattern. — *where it'd land*: extend `oauth_required` SSE event family with `ui_consent_required`; reuse per-provider consent badge UI. +- **MCP Apps example servers** — https://github.com/modelcontextprotocol/ext-apps/tree/main/examples — *what it is*: starter servers for data exploration (cohort heatmap, customer segmentation), forms (scenario modeler, budget allocator), media (PDF, video, sheet music), 3D (Cesium, Three.js). — *fit*: pattern-only — templates are React/Vue/Svelte but the protocol is framework-agnostic. — *informs*: the kinds of internal tools we'd expose as MCP Apps once we host. +- **AI SDK "Render Visual Interface in Chat" recipe** — https://ai-sdk.dev/cookbook — *what it is*: pattern where tool results map to specific UI components on the client, model drives which component renders. — *fit*: pattern-only (React hook). — *Angular equivalent*: a `toolRenderers` registry keyed by tool name, with a signal-driven component doing `@switch (toolName())` over registered renderers. We do a coarse version today; the pattern argues for making it a first-class extension point so per-tool components live next to the tool definition rather than in a god-switch. +- **AI SDK "Call Tools in Multiple Steps" / `streamText` multi-step** — https://ai-sdk.dev/cookbook — *fit*: pattern-only. 
— *Angular equivalent*: keep `signal()`-backed tool-call state mutable across the conversation (don't freeze at `tool_result`), so prior tool-call cards stay interactive as new steps stream in. +- **assistant-ui @0.14.0 (2026-05-07)** — https://github.com/Yonom/assistant-ui/releases — API consolidation (`useAui` replaces deprecated naming). Also: `mcp-app-studio` package updated alongside — assistant-ui is shipping first-party MCP Apps authoring/preview tooling. — *signal*: **MCP Apps is the assumed UI surface** for serious agentic chat shells going forward. +- **"Output isn't design" — Karri Saarinen, Linear (2026-04-17)** — https://linear.app/now/output-isn-t-design — *takeaway*: pointed pushback on generative-UI hype. "Plausible-looking generated interfaces unravel the moment you actually use them" because the work of resolving tensions and edge cases hasn't happened. — *implication for us*: when we add MCP Apps, treat the iframe as a vehicle for *purpose-built* UIs (forms, viewers), not as a "let the model generate a UI" shortcut. +- **"Interact with agent-created visualizations in canvases" — Cursor (2026-04-15)** — https://www.cursor.com/blog/canvas — *takeaway*: agent output that's interactive (charts you can drill into, plots you can re-parameterize) is now table stakes in agentic IDEs. Maps to our PDF/markdown/spreadsheet preview surface — direction is "previews become interactive viewers," not static thumbnails. +- **Linear Agent as named participant** — https://linear.app/now/how-we-use-linear-agent-at-linear (2026-04-10) + https://linear.app/changelog/2026-04-23-linear-agent-mcp-support — *pattern*: Linear's agent reads context via MCP and posts back as a structured agent identity in the issue thread (not as a chat message). **Agents as named participants with distinct affordances**, not just a stream of assistant text. 
— *fit for us*: worth considering for our multi-agent A2A flows — A2A sub-agents could render as distinct attributed turns rather than nested tool calls. +- **"Claude Design by Anthropic Labs" (2026-04-17)** — https://www.anthropic.com/news/claude-design-anthropic-labs — *takeaway*: "collaborate with Claude to produce polished visual work" as a first-class output type. Validates investing in artifact-style rendering surfaces beyond plain markdown. +- **NN/g "Designing AI Agents: 4 Lessons from China's Qwen Agent" (2026-05-08)** — https://www.nngroup.com/articles/designing-ai-agents/ — *evidence-based principles*: support discoverability, reuse familiar patterns, handle personal data carefully, protect user autonomy. — *applicability*: **discoverability** — tool-call rendering should surface available tools *before* the user has to phrase the right prompt (slash menu, suggestions from `enabled_tools`). **Autonomy** — our `oauth_required` consent event is on-pattern; extend the same explicit-consent model to MCP-Apps-initiated tool calls. +- **OpenAI AgentKit / Agent Builder visual canvas** — https://openai.com/index/introducing-agentkit/ — *takeaway*: agent *authoring* is moving to visual node-graphs. Not directly applicable to our runtime chat, but a signal that **agent-state visibility** (which agent is running, which tool, what step) is increasingly expected at runtime too — relevant to how we render A2A and multi-step tool flows. 
+ +#### Frontier model announcements +- **Anthropic — higher Opus rate limits (May 6)** — https://www.anthropic.com/news/higher-limits-spacex — informational; we use Bedrock-hosted, not first-party +- **Anthropic — finance-agents pack (May 5)** — https://www.anthropic.com/news/finance-agents — Moody's MCP server is a concrete public MCP we could register in Gateway if a finance use case emerges +- **OpenAI — GPT-5.5 Instant displaces GPT-5.3 Instant (May 5)** — https://openai.com/index/gpt-5-5-instant/ — *risk*: confirm our model selector doesn't expose a deprecation-path 5.3 ID +- **Google / Gemini** — quiet week (no new model/API deltas) +- **Meta / Llama** — quiet week + +#### Agent harness patterns +- **Claude Code 2.1.136 — skills-under-plugins fix + MCP content-block fix** — https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md — *relevance*: skill loading + MCP tool result rendering +- **Claude Code 2.1.133 — hooks receive `effort.level` + `worktree.baseRef` setting** — same URL — pattern worth mirroring in Strands hook payloads +- **Claude Code 2.1.132 — `CLAUDE_CODE_SESSION_ID` env into Bash subprocess** — same URL — session-id-everywhere pattern we already do loosely +- **CMA_coordinate_specialist_team.ipynb (May 6)** — https://github.com/anthropics/claude-cookbooks/tree/main/managed_agents — coordinator + 3 specialists with per-role tool scoping +- **CMA_verify_with_outcome_grader.ipynb (May 6)** — same repo — writer/grader loop with `user.define_outcome` rubrics; could bolt onto SSE for tool-result fact-checking +- **Agent Development Lifecycle (LangChain blog, May 9)** — https://www.langchain.com/blog/the-agent-development-lifecycle — our kaizen cadence already covers most of this; gap is "online evals" + +#### Pricing / quota +- No detected Bedrock or AgentCore pricing changes this week +- Note: `https://aws.amazon.com/bedrock/whats-new/` returned **404** — page appears retired. Skill source URL needs replacement. 
+ +#### Community + GitHub issues +- HN: 0 hits for stack keywords (`bedrock`, `agentcore`, `strands`, `mcp`, `claude code`) in the 7-day window — quiet +- Reddit: blocked from WebFetch in this environment — gap to address (`.rss` or Reddit MCP) + +#### Cookbook / courses +- 4 new managed-agent cookbooks landed May 5–8 (vulnerability detection, coordinator/specialists, outcome grading, registry category) +- `anthropics/courses` quiet (last commit Nov 2025) — candidate to drop from weekly scan + +#### Seasonal +- Out of window — no re:Invent or NeurIPS items + +### Patterns worth considering + +- **Online evals via grader sub-agent** — sample N% of conversation turns, run a stateless grader, persist outcomes. Fits LangChain's Agent Development Lifecycle framing and the CMA outcome-grader cookbook. **Verdict**: monitor — interesting once we've shipped the core cleanups below. +- **Brain/hands separation** (Anthropic Managed Agents direction) — push session/checkpoint store outside the agent process. We already do this via AgentCore Memory; fully aligned. **Verdict**: aligned, no action. +- **Sessionless MCP** (SEP-2567) — list endpoints cacheable per (deployment, auth). Direct fit for SigV4 Gateway. **Verdict**: monitor; act when python `mcp` library adopts. + +## Internal Audit + +### Activity (last 7 days) +- **Commits on develop**: 8 (all from squash-merged PRs) +- **PRs opened**: 5 (4 dependabot — #237/#239/#241 still open, plus #276 docs) +- **PRs merged**: 8 +- **PRs reverted**: 0 +- **Issues opened**: 4 (#266, #267, #268, #269 on May 9 — Strands-features and prompt caching) +- **CI failures (workflow → count)**: Nightly Build & Test 9, Deploy Inference API 5, Deploy App API 6, Deploy Frontend 1, Version Check 6, Deploy Infrastructure 2 + +### Repeated friction signals +- **Nightly Build & Test failing 9× since May 6** — concentrated cluster; no signal it's been investigated. 
Could be the test flakiness from issue #220 (order-dependent flakiness in `test_cognito_idp_service`, `test_oauth_repositories`, `test_auth_providers*`) compounding, or a different cause. *Hypothesis*: untriaged. *Fix candidate*: triage one failure end-to-end; promote to a blocking issue if not already on the board. +- **Deploy workflows failing 6+ times May 6–9** — Inference API, App API, Frontend deploys all hit failures. *Hypothesis*: BFF migration shipped this week (#272–#277) introduced env-var or stack drift not caught in synth. *Fix candidate*: cross-check most recent failed deploy log against beta.24 ↔ post-beta.24 stack diff. +- **5 of 8 commits this week are BFF/auth fixes** (#270, #271, #273, #274, #275, #277) — the BFF migration shipped in beta.24 is still being patched. Healthy iteration, but the pace says "treat BFF as not-done-yet" before declaring beta.25. + +### Version-pin lag + +| Dep | Pinned | Latest | Lag | Notes | +|---|---|---|---|---| +| `bedrock-agentcore` | 1.6.4 | **1.9.0** | 3 minor / latest 2026-05-07 | Open issues #456 (OTEL detach) and #452 (event-loop blocking) may already be addressed | +| `boto3` | 1.42.96 | 1.43.6 | 1 minor / ~10 patches | Routine bump | +| `aws-cdk-lib` | 2.251.0 | 2.253.1 | 2 patch | Routine | +| `aws-cdk` | 2.1120.0 | 2.1121.0 | 1 patch | Routine | +| `@angular/core` | 21.2.11 | 21.2.12 | 1 patch | Routine | +| `strands-agents` | 1.39.0 | 1.39.0 | current | Just upgraded in #265 | +| `fastapi` | 0.136.1 | 0.136.1 | current | — | +| `mcp` | (transitive) | 1.27.1 | n/a | Watch for SEP-2567 adoption | + +### Retirement candidates + +- **`enabled_tools` whitelist debug guidance in `CLAUDE.md`** — Reference repo abandoned this pattern May 3 (`092aa33`) for `disabled_skills` blacklist. Not urgent retirement, but worth a re-evaluation if we touch tool-enablement code. +- **`anthropics/courses` source in `kaizen-research`** — Last commit Nov 2025; subagent reported "quiet". Drop from weekly scan list. 
+- **`https://aws.amazon.com/bedrock/whats-new/` URL in `kaizen-research`** — 404'd on this run. Replace with the AWS What's New RSS feed only, or a different filtered URL. +- **`https://docs.claude.com/en/docs/claude-code/release-notes` URL in `kaizen-research`** — 301→404. Replace with `https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md`. +- **6 of 9 skills not modified in 60+ days** (angualar-best-practices, tailwind-ui, frontend-design, cdk-infrastructure, versioning, cors-deployment) — modification freshness alone is a weak signal for skills since they encode stable conventions. **No retirement recommended without invocation telemetry.** + +### Risks introduced this week + +- **`bedrock-agentcore` 3 minor versions behind** with a release in the scan window — issues #456 (OTEL trace detach in Memory + Strands session_manager) and #452 (async-mode for AgentCoreMemorySessionManager) may be already-fixed in 1.7-1.9. *What breaks if ignored*: silent observability gaps (broken spans on memory writes); concurrency degradation under load. — https://pypi.org/project/bedrock-agentcore/ +- **OpenAI displaces GPT-5.3 Instant with GPT-5.5 Instant (May 5)** — our model selector exposes per-model IDs. *What breaks if ignored*: customers using a 5.3 default may hit a deprecation window. — https://openai.com/index/gpt-5-5-instant/ +- **Strands #2266 — `BedrockModel.stream` leaks inner task on outer cancellation** — we cancel SSE streams on client disconnect. *What breaks if ignored*: orphaned tasks, "Task exception was never retrieved" log noise, possible memory pressure under churn. — https://github.com/strands-agents/sdk-python/issues/2266 +- **Reddit blocked from WebFetch** in the kaizen-research environment — community-signal scan is half-blind. — *Fix*: switch to `https://www.reddit.com/r//.rss` or a configured Reddit MCP server. 
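A minimal sketch of the cancellation hygiene that the Strands #2266 risk above points at, assuming the inner stream is an `asyncio` task feeding a queue that an outer handler consumes. `produce`, `relay`, and `demo` are hypothetical stand-ins for illustration, not Strands or `BedrockModel` APIs, and the sketch does not reproduce how the leak actually occurs upstream.

```python
import asyncio
import contextlib


async def produce(queue: asyncio.Queue) -> None:
    """Hypothetical inner task: stands in for a model stream feeding a queue."""
    n = 0
    while True:
        await queue.put(f"chunk-{n}")
        n += 1
        await asyncio.sleep(0.01)


async def relay(received: list) -> None:
    """Hypothetical outer handler: consumes chunks until the caller cancels it."""
    queue: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(produce(queue))
    try:
        while True:
            received.append(await queue.get())
    finally:
        # Cancel the inner task, then await it so it finishes before we return
        # and its outcome is retrieved rather than left dangling.
        producer.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await producer


async def demo() -> tuple:
    received: list = []
    handler = asyncio.create_task(relay(received))
    await asyncio.sleep(0.05)  # let a few chunks flow
    handler.cancel()  # simulate a client disconnect
    with contextlib.suppress(asyncio.CancelledError):
        await handler
    # Any task still pending here (other than demo itself) would be an orphan.
    orphans = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    return received, orphans
```

Running `asyncio.run(demo())` returns the chunks seen before the disconnect plus a list of leftover tasks. If the `finally` block were dropped, the producer would still be pending when the handler exits, which is the kind of orphan the audit would look for.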
+ +## Ideas — Top 6 (ranked) + +> Bootstrap exceptionally lists 6 (vs the skill's nominal 5) because the UI/UX lens was added mid-run and surfaced an MCP Apps initiative worth ranking. Regular runs target 5. + +| # | Idea | Surface | Effort | Impact | Subtracts? | +|---|---|---|---|---|---| +| 1 | Scope an MCP Apps host renderer in our chat (multi-PR initiative) | frontend + backend (SSE event + component) | H | H | no — additive, but unlocks every future MCP server shipping a UI | +| 2 | Bump `bedrock-agentcore` 1.6.4 → 1.9.0; verify SDK issues #456/#452 are addressed | backend | L | M | no — pure dep bump (justified: 3 versions of upstream fixes, latest in scan window) | +| 3 | Promote tool-result rendering to a per-tool renderer registry (signal-backed) | frontend | M | M-H | partial — replaces an implicit switch with an explicit registry; bridges toward MCP Apps | +| 4 | Audit `BedrockModel.stream` cancellation path against Strands #2266 | backend | L | M-H | no — defensive; SSE-disconnect path is hot | +| 5 | Close issues #266 and #267 — features already in our Strands 1.39 pin; replace with smaller "wire upstream feature" tasks | cross-cutting | L | M | **yes — retires 2 build-from-scratch tickets (library-native subtraction)** | +| 6 | Triage Nightly Build & Test failure cluster (9× since May 6) | cross-cutting / CI | L-M | M-H | possibly — if root is issue #220, fixing it simplifies suite | + +### 1. 
Scope an MCP Apps host renderer in our chat +- **Source**: research/2026-05-10.md ▸ Agentic UI/UX ▸ MCP Apps (SEP-1865, production-ready); cross-confirmed by assistant-ui's `mcp-app-studio` direction +- **Surface area**: frontend (new `` Angular component, tool-result rendering pipeline branch) + backend (new SSE event `ui_resource`; possibly extend `oauth_required` family with `ui_consent_required`) +- **Change**: implement the host side of the MCP Apps spec — sandboxed iframe rendering `ui://` resources returned by MCP tools, with the `ui/`-prefixed JSON-RPC dialect over `postMessage`. Consent UX reuses the existing `oauth_required` pattern. Treat as a multi-PR initiative: (a) SSE event + plumbing, (b) iframe component + postMessage bridge, (c) consent UI, (d) end-to-end with one example MCP App from `ext-apps/examples`. +- **Subtracts**: no — pure addition. Justified because every major host already ships this; without it, third-party MCP servers we connect can't deliver UI beyond text+JSON, and we become a less capable platform than the rest of the ecosystem. +- **Effort × Impact**: High × High +- **Verdict**: Worth scoping (formal scoping doc before any code). Could comfortably be a 3-4 week initiative spanning multiple sprints. + +### 2. Bump `bedrock-agentcore` 1.6.4 → 1.9.0 +- **Source**: PyPI (https://pypi.org/project/bedrock-agentcore/) + open SDK issues #456, #452 +- **Surface area**: `backend/pyproject.toml`, `backend/uv.lock` +- **Change**: pin update + smoke-test memory + identity flows in dev; check the CHANGELOG between 1.6 and 1.9 for any breaking changes +- **Subtracts**: addition only — justified by 3 versions of upstream fixes including likely-relevant OTEL trace detach (#456) and event-loop blocking (#452) +- **Effort × Impact**: Low × Medium +- **Verdict**: Worth trying + +### 3.
Promote tool-result rendering to a per-tool renderer registry +- **Source**: research/2026-05-10.md ▸ Agentic UI/UX ▸ AI SDK generative-UI recipes + Cursor canvases +- **Surface area**: frontend (`` component or equivalent, plus a new `ToolRendererRegistry` service) +- **Change**: today our tool-result rendering is (implicitly) a switch in one place. Promote to a signal-backed registry keyed by tool name; per-tool renderers live next to the tool definition. Bridges naturally toward MCP Apps (which would just be "another registered renderer that emits an iframe"). Lifts a chunk of switch-like code into a declarative table. +- **Subtracts**: partial — replaces an implicit switch with an explicit registry; the registry's existence is more code, but it absorbs scattered tool-specific UI logic into one place. +- **Effort × Impact**: Medium × Medium-High +- **Verdict**: Worth trying — independently valuable AND pre-work for proposal #1. + +### 4. Audit `BedrockModel.stream` cancellation path +- **Source**: Strands open issue #2266 (filed May 9) +- **Surface area**: `backend/src/agents/main_agent/` stream coordinator + SSE handler +- **Change**: locate where we cancel `BedrockModel.stream`; ensure we `await task` on cancel paths so tasks don't orphan; add a log assertion in dev to detect "Task exception was never retrieved" +- **Subtracts**: addition only — defensive +- **Effort × Impact**: Low × Medium-High +- **Verdict**: Worth trying + +### 5. 
Close issues #266 and #267 — features already in our Strands 1.39 pin +- **Source**: Strands v1.37 (PR #2249, context-window lookup) + v1.38 (large tool result offload) +- **Surface area**: GitHub issues + small wiring in `stream_coordinator` and tool-result handling for spreadsheet/Code Interpreter outputs +- **Change**: close #266 and #267 with comments pointing at upstream PRs; replace with smaller "wire context-window lookup" and "wire large tool result offload" tasks if the wiring isn't automatic +- **Subtracts**: **yes — retires 2 "build from scratch" issues; replaces with at-most 2 "use upstream feature" tasks. Library-native subtraction.** +- **Effort × Impact**: Low × Medium +- **Verdict**: Worth trying + +### 6. Audit `oauth_required` SSE flow against ref-repo's mid-tool-call 401/403 handling +- **Source**: aws-samples/sample-strands-agent-with-agentcore commit `9fcdb4c` +- **Surface area**: `backend/src/apis/shared/oauth/agentcore_identity.py`, SSE event emission in `inference-api`, MCP/A2A tool wrappers +- **Change**: ensure mid-conversation 401/403 from Google/external OAuth providers re-emits `oauth_required` (consent-resume) rather than streaming a tool error to the user +- **Subtracts**: addition only — defensive; closes a real UX gap when upstream tokens revoke mid-stream +- **Effort × Impact**: Medium × High +- **Verdict**: Worth trying + +### 7. 
Triage Nightly Build & Test failure cluster +- **Source**: 9 failures since May 6 in `gh run list --status=failure` +- **Surface area**: `.github/workflows/nightly-*.yml`, possibly `tests/shared/test_cognito_idp_service.py` + adjacent (per issue #220) +- **Change**: pull the most recent failure log; trace to root cause; fix flakiness OR document why it's failing if it's a real regression; promote to blocking issue +- **Subtracts**: possibly — if root is issue #220 (test isolation), fixing it materially simplifies the suite +- **Effort × Impact**: Low-Medium × Medium-High +- **Verdict**: Worth trying + +## Take + +Two big themes this week. **First, agentic UI/UX has shifted under us.** MCP Apps shipped to production with adoption from Claude Desktop, ChatGPT, VS Code Copilot, Goose, and Postman; assistant-ui is building first-party MCP Apps tooling; the design conversation has moved from "what should an agent chat look like" to "how do we host other people's UIs in our chat." Our text+JSON tool-result rendering is now the baseline competitors are extending past. **Second, library-native subtraction is the kaizen loop's clearest win** — Strands 1.37/1.38 quietly closes our open issues #266/#267, and `bedrock-agentcore` 1.6.4 → 1.9.0 likely closes two open SDK issues we already feel. The single change that would matter most this week if scoped is **proposal #1 (MCP Apps host renderer)** — high effort, but the right strategic investment. Quick wins: **#2 (`bedrock-agentcore` bump)** and **#5 (close #266/#267)**. **#3 (renderer registry)** is the natural mid-ground that delivers value standalone AND pre-paves proposal #1. 
+ +--- + +## Sources Scanned + +| # | Source | URL | Accessed | Items | +|---|---|---|---|---| +| 1 | AWS Bedrock + AgentCore (RSS, blog, pricing, SDK issues) | aws.amazon.com / github.com/aws/bedrock-agentcore-* | 2026-05-10 | 5 announcements + 5 issues | +| 2 | Strands Agents (releases, issues, PRs) | github.com/strands-agents/sdk-python | 2026-05-10 | 3 releases + 5 issues | +| 3 | Reference repo | github.com/aws-samples/sample-strands-agent-with-agentcore | 2026-05-10 | 12 commits in 30-day window | +| 4 | MCP ecosystem | modelcontextprotocol.io / github.com/modelcontextprotocol | 2026-05-10 | 4 spec items + 3 discussions | +| 4a | FastMCP (bootstrap: PyPI snapshot only — full scan deferred to 2026-05-15) | pypi.org/project/fastmcp + github.com/jlowin/fastmcp | 2026-05-10 | latest 3.2.4 | +| 4b | Agentic UI/UX (MCP Apps, AI SDK, assistant-ui, Linear/Cursor/Anthropic, NN/g) — 30-day baseline | modelcontextprotocol.io + ai-sdk.dev + assistant-ui.com + linear.app + cursor.com + anthropic.com + nngroup.com | 2026-05-10 | 11 items across MCP Apps spec, AI SDK patterns, assistant-ui, vendor product blogs, NN/g research | +| 5 | Frontier models (Anthropic, OpenAI, Google, Meta) | anthropic.com / openai.com / blog.google / ai.meta.com | 2026-05-10 | 3 Anthropic + 1 OpenAI + 0 others | +| 6 | Agent harness | github.com/anthropics + langchain.com + pydantic.dev | 2026-05-10 | 3 CC releases + 4 cookbook items | +| 7 | Community (HN Algolia + Reddit) | hn.algolia.com + reddit.com | 2026-05-10 | 0 HN hits, Reddit blocked | +| 8 | Version-pin diff | pypi.org / npmjs.com | 2026-05-10 | 8 deps checked, 4 lag | + +## Web Budget + +Used: 64 / 50 requests (target — UX-lens scan added 10 to the original 54). + +Skipped (unreachable / rate-limited): +- Reddit (`r/LocalLLaMA`, `r/MachineLearning`) — WebFetch blocked from this environment. Switch to `.rss` endpoint or configured Reddit MCP next run. +- `https://aws.amazon.com/bedrock/whats-new/` — 404 (page appears retired). 
Drop or replace. +- `https://docs.claude.com/en/docs/claude-code/release-notes` — 301→404. Replace with `github.com/anthropics/claude-code/blob/main/CHANGELOG.md`. +- OpenAI blog returned 403 twice; backfilled via search. + +Skipped (other): Security advisories (external) and Internal security posture (Dependabot + CodeQL) sources were initially included in this bootstrap run but **removed per scope refinement** — security signals are handled by Dependabot and CodeQL directly and don't need a weekly kaizen lens. Future runs won't scan them. + +Notes: +- Frontier-models sub-budget exceeded (11 vs ~6 target) due to two OpenAI WebFetch 403s requiring search backfill. +- This is a **bootstrap run**: reference-repo + UX-lens scope extended to 30 days for baseline; Carried Over and prior-decisions sections in the review-prep doc are necessarily empty. diff --git a/docs/kaizen/review-queue.md b/docs/kaizen/review-queue.md new file mode 100644 index 00000000..3f42f8d8 --- /dev/null +++ b/docs/kaizen/review-queue.md @@ -0,0 +1,92 @@ +# Kaizen Review Queue + +Items added by `kaizen-research`, consumed by `kaizen-review-prep`. 
+ +## Open + +### [2026-05-10] Scope AgentCore Runtime BYO filesystem (S3 Files / EFS) for persistent agent workspaces +- **Source**: research/2026-05-10.md ▸ AWS Bedrock / AgentCore (re-evaluated 2026-05-10 via strategic-lens follow-up — original framing under-weighted the capability-unlock angle) +- **Surface**: backend (`inference-api` invocation handler reads/writes mount) + infrastructure (VPC config, IAM mount permissions, S3 Files or EFS access points, per-user prefix/access-point layout for RBAC); ADR-worthy +- **Effort × Impact**: H × H +- **Subtracts**: no — pure capability addition +- **Unlocks**: + - Code-interpreter / persistent agent workspace (artifacts survive turn and session boundaries) + - Cross-session file uploads — PDFs/spreadsheets persist between conversations instead of re-staging per session + - Shared skill/template/prompt hot-swap without redeploying the runtime container + - A2A multi-agent intermediate-result handoff via shared mount + - Persistent vector indexes / embedding caches — avoids cold-start rebuild +- **Open questions**: GA vs preview status (March 2026 managed session storage was preview; May 2026 BYO needs verification); VPC requirement is a new architectural surface for the runtime; multi-tenancy isolation strategy (per-user S3 prefix vs per-user EFS access point); RBAC mount-path layout; runtime data plane still only proxies `/invocations` + `/ping` so this doesn't unlock new HTTP routes +- **Status**: open + +### [2026-05-10] Scope an MCP Apps host renderer in our chat (multi-PR initiative) +- **Source**: research/2026-05-10.md ▸ Top 6 #1 ▸ Agentic UI/UX +- **Surface**: frontend + backend (new SSE event `ui_resource`; `` Angular component; consent UX) +- **Effort × Impact**: H × H +- **Subtracts**: no — pure addition. 
Justified: every major host (Claude Desktop, ChatGPT, VS Code Copilot, Goose, Postman) ships this; without it, third-party MCP servers we connect can only deliver text+JSON +- **Status**: open + +### [2026-05-10] Bump `bedrock-agentcore` 1.6.4 → 1.9.0 +- **Source**: research/2026-05-10.md ▸ Top 6 #2 +- **Surface**: backend +- **Effort × Impact**: L × M +- **Subtracts**: no — pure dep bump (justified: 3 versions of upstream fixes, latest in scan window) +- **Status**: open + +### [2026-05-10] Promote tool-result rendering to a per-tool renderer registry (signal-backed) +- **Source**: research/2026-05-10.md ▸ Top 6 #3 ▸ Agentic UI/UX (AI SDK + Cursor) +- **Surface**: frontend (`` component + new `ToolRendererRegistry` service) +- **Effort × Impact**: M × M-H +- **Subtracts**: partial — replaces implicit switch with explicit registry; absorbs scattered tool-specific UI logic. Pre-paves MCP Apps proposal #1. +- **Status**: open + +### [2026-05-10] Audit `BedrockModel.stream` cancellation path against Strands #2266 +- **Source**: research/2026-05-10.md ▸ Top 6 #4 +- **Surface**: backend +- **Effort × Impact**: L × M-H +- **Subtracts**: no — defensive (SSE-disconnect path is hot) +- **Status**: open + +### [2026-05-10] Close issues #266 and #267 — features already in our Strands 1.39 pin +- **Source**: research/2026-05-10.md ▸ Top 6 #5 +- **Surface**: cross-cutting +- **Effort × Impact**: L × M +- **Subtracts**: **yes — library-native subtraction; retires 2 build-from-scratch issues** +- **Status**: open + +### [2026-05-10] Triage Nightly Build & Test failure cluster (9× since May 6) +- **Source**: research/2026-05-10.md ▸ Top 6 #6 +- **Surface**: cross-cutting / CI +- **Effort × Impact**: L-M × M-H +- **Subtracts**: possibly — if root is issue #220 (test isolation) +- **Status**: open + +### [2026-05-10] Audit `oauth_required` SSE flow against ref-repo's mid-tool-call 401/403 handling +- **Source**: research/2026-05-10.md ▸ Risks +- **Surface**: backend +- **Effort × 
Impact**: M × H +- **Subtracts**: no — defensive +- **Status**: open (deferred 2 weeks per prior review — surface again on 2026-05-24) + +### [2026-05-10] Named A2A agent participants in the chat UI +- **Source**: research/2026-05-10.md ▸ Agentic UI/UX ▸ Linear Agent pattern +- **Surface**: frontend (extend message model with `agent_identity`, distinct avatar/name/styling) +- **Effort × Impact**: L-M × M +- **Subtracts**: no — additive but pattern-validated across Linear/ChatGPT/Cursor +- **Status**: open + +### [2026-05-10] Replace dead source URLs in `kaizen-research` skill +- **Source**: research/2026-05-10.md ▸ Retirement candidates +- **Surface**: skills (`.claude/skills/kaizen-research/SKILL.md`) +- **Effort × Impact**: L × L +- **Subtracts**: yes — replaces 2 broken URLs (`bedrock/whats-new/` 404, `docs.claude.com/.../release-notes` 404) with working ones; drops `anthropics/courses` (quiet since Nov 2025) +- **Status**: open + +### [2026-05-10] Add Reddit `.rss` or Reddit MCP to `kaizen-research` +- **Source**: research/2026-05-10.md ▸ Risks ▸ "Reddit blocked from WebFetch" +- **Surface**: skills (`.claude/skills/kaizen-research/SKILL.md`) +- **Effort × Impact**: L × L +- **Subtracts**: no — restores a half-blind source +- **Status**: open + +## Resolved + diff --git a/docs/kaizen/reviews/2026-05-10.md b/docs/kaizen/reviews/2026-05-10.md new file mode 100644 index 00000000..e48f395b --- /dev/null +++ b/docs/kaizen/reviews/2026-05-10.md @@ -0,0 +1,207 @@ +# Kaizen Review — Sunday, May 10, 2026 +> Prepared 11:00am MT. Review window: May 3 – May 10, 2026 (7 days). +> Source: research/2026-05-10.md + review-queue.md (8 open items). +> **Bootstrap run** — no prior reviews, no prior-week POC findings, no Carried Over items. Scope evolved mid-bootstrap: added FastMCP, library-native subtraction lens, and Agentic UI/UX lens; removed security posture lens (security is handled by Dependabot/CodeQL and doesn't need a weekly kaizen pass). 
+ +## Week in Review + +Two themes braid this week. **Agentic UI/UX has shifted under us**: MCP Apps shipped to production with adoption from Claude Desktop, ChatGPT, VS Code Copilot, Goose, and Postman; the design conversation has moved from "what should an agent chat look like" to "how do we host other people's UIs in our chat." Our text+JSON tool-result rendering is now the baseline competitors are extending past. **Upstream-shrinks-our-backlog**: Strands v1.37/v1.38 quietly closed our open issues #266 and #267, and `bedrock-agentcore` is 3 minor versions behind with likely fixes for two SDK issues we feel. The BFF migration is still v1 (5 of 8 commits this week are auth follow-ups), and CI is unreliable. Net read: a "scope the big UX investment, harvest the upstream gains, stabilize hygiene" week. + +## Friction — the week's signal + +### Repeated patterns (≥2 occurrences) +- **CI deploy failures (6+ since May 6)** across Inference API, App API, and Frontend deploys. + - *Hypothesis*: BFF migration introduced env-var or stack drift not caught in synth (most failures cluster around the same auth/config landscape that's still being patched). + - *Candidate fix*: cross-check the most recent failed deploy log against the beta.23 → beta.24 stack diff; the most likely suspects are missing/renamed SSM parameters from the public-PKCE-client decommission (`/auth/cognito/app-client-id` removed) or the new `BFF_*` env vars. +- **Nightly Build & Test failures (9× since May 6)** — concentrated, untriaged. + - *Hypothesis*: known flakiness from issue #220 (`test_cognito_idp_service`, `test_oauth_repositories`, `test_auth_providers*` order-dependent) compounding under the BFF-heavy week's churn. + - *Candidate fix*: triage one failure log end-to-end. Either the root is #220 (then #220 needs to land) or it's a real regression masked by the noise. +- **BFF/auth fix churn (5 of 8 commits this week)** — #270, #271, #273, #274, #275, #277.
+ - *Hypothesis*: BFF migration is a v1, not a v1.1; expect 1–2 more weeks of follow-ups before declaring beta.25. + - *Candidate fix*: not a fix per se — adjust release-cut timing for beta.25 to wait for the churn to settle. + +### One-offs worth watching +- **`bedrock-agentcore` 1.6.4 → 1.9.0 lag** with a release published *inside* the scan window (May 7) — see proposal #2. +- **OpenAI displaces GPT-5.3 Instant with GPT-5.5 Instant** — model selector audit needed (proposal #6 — declined-by-default below; check first then decide). +- **Strands #2266 `BedrockModel.stream` cancel leak** — filed May 9; see proposal #4. + +### Silence that matters +- **6 of 9 skills not modified in 60+ days** (angualar-best-practices, tailwind-ui, frontend-design, cdk-infrastructure, versioning, cors-deployment) — modification freshness is a weak signal for skills since they encode stable conventions; **not enough to act**, but worth instrumenting invocation telemetry if we want to make this a reliable retirement signal in the future. +- **HN was quiet on stack keywords this week** (0 hits in Algolia 7-day window) — not a problem; just a confirmation that this is an internal-momentum week, not a community-momentum week. +- **`anthropics/courses` quiet since Nov 2025** — proposal #6 below proposes dropping it from the scan list. + +## Proposals — ranked + +### 1. Scope an MCP Apps host renderer in our chat (multi-PR initiative) +- **Source**: research/2026-05-10.md ▸ Top 6 #1 ▸ Agentic UI/UX | review-queue.md (open) +- **Surface area**: frontend (new `` Angular component, tool-result rendering pipeline) + backend (new SSE event `ui_resource`; likely a `ui_consent_required` cousin of `oauth_required`) +- **Change**: implement the host side of MCP Apps (SEP-1865) — sandboxed iframe rendering `ui://` resources returned by MCP tools, with the `ui/`-prefixed JSON-RPC dialect over `postMessage`.
Treat as a multi-PR initiative: (a) scoping/architecture doc + spec checklist, (b) SSE event + plumbing, (c) iframe component + postMessage bridge, (d) consent UX, (e) end-to-end demo with one example from `ext-apps/examples`. +- **Subtracts**: no — pure addition. Justified: Claude Desktop, ChatGPT, VS Code Copilot, Goose, and Postman ship this. Every third-party MCP server we connect could be shipping UIs richer than text+JSON; without a host, we leave that on the table. +- **Effort**: High (multi-PR; new SSE protocol event; sandboxed-iframe component; consent UX) +- **Impact**: High (strategic — agentic UI standard our chat is being measured against) +- **POC findings**: not POCed. +- **Ship means**: **scope this week, build over 3-4 weeks.** Open a scoping issue + architecture doc this week — not a code PR yet. Code PRs follow in subsequent sprints. +- **Decline means**: stay on text+JSON tool results; revisit when a third-party MCP server we connect ships an MCP App and we can't render it (the forcing function). +- **Recommendation**: **Ship (scope this week).** Highest strategic value of any item. Pre-paves with proposal #3. + +### 2. Bump `bedrock-agentcore` 1.6.4 → 1.9.0 +- **Source**: research/2026-05-10.md ▸ Top 6 #2 | review-queue.md (open) +- **Surface area**: backend (`backend/pyproject.toml`, `backend/uv.lock`) +- **Change**: pin update + smoke-test memory + identity flows in dev; verify upstream CHANGELOG between 1.6 and 1.9 for any breaking changes; close out open SDK issues #456 (OTEL trace detach) and #452 (event-loop blocking) if 1.9 addresses them. +- **Subtracts**: addition only — justified by 3 versions of upstream fixes including likely-relevant ones to OTEL trace detach and AgentCoreMemorySessionManager event-loop blocking. +- **Effort**: Low +- **Impact**: Medium (observability + concurrency) +- **POC findings**: not POCed — bootstrap run. 
+- **Ship means**: open a PR updating `pyproject.toml` and `uv.lock`; smoke-test memory write/read + identity OAuth flows in dev; if smoke passes, merge. +- **Decline means**: keep at 1.6.4 for another week; revisit after 1.10 ships or after we observe a memory/identity issue. +- **Recommendation**: **Ship.** Lowest effort × medium impact; clean upstream-harvest win. + +### 3. Promote tool-result rendering to a per-tool renderer registry (signal-backed) +- **Source**: research/2026-05-10.md ▸ Top 6 #3 ▸ Agentic UI/UX (AI SDK + Cursor canvases) | review-queue.md (open) +- **Surface area**: frontend (`` component + new `ToolRendererRegistry` service) +- **Change**: today our tool-result rendering is (implicitly) a switch in one place. Promote to a signal-backed registry keyed by tool name; per-tool renderers live next to the tool definition. Independently valuable AND the natural extension point for proposal #1 (MCP Apps becomes "just another registered renderer that emits an iframe"). +- **Subtracts**: partial — replaces an implicit switch with an explicit registry; the registry itself is new code, but it absorbs scattered tool-specific UI logic into one declarative table. +- **Effort**: Medium +- **Impact**: Medium-High (improves current tool-result UX AND pre-paves MCP Apps host) +- **POC findings**: not POCed. +- **Ship means**: open a PR that introduces the registry service + migrates 2-3 current tool renderers as a proof point. +- **Decline means**: keep the implicit switch; pay the cost when proposal #1 lands. +- **Recommendation**: **Ship.** Best risk-adjusted UX investment — value standalone AND scaffolding for proposal #1. + +### 4. 
Audit `BedrockModel.stream` cancellation path against Strands #2266 +- **Source**: research/2026-05-10.md ▸ Top 6 #4 | review-queue.md (open) +- **Surface area**: backend (`backend/src/agents/main_agent/` stream coordinator + SSE handler) +- **Change**: locate every `BedrockModel.stream` cancellation path; ensure each `await`s the inner task on cancel so it doesn't orphan; add a dev-only assertion / log filter to detect "Task exception was never retrieved" before it reaches prod. +- **Subtracts**: addition only — defensive. +- **Effort**: Low +- **Impact**: Medium-High (SSE-disconnect path is hot; orphan-task pressure is silent until it isn't) +- **POC findings**: not POCed. +- **Ship means**: open a PR with the audit + fixes + a regression test that triggers cancel + asserts no orphan tasks. +- **Decline means**: log a tech-debt issue; revisit if "Task exception was never retrieved" appears in CloudWatch. +- **Recommendation**: **Ship.** Pairs naturally with proposal #2 (same backend area, same week's Strands signal). Cheap insurance. + +### 5. Close issues #266 and #267 — features already in our Strands 1.39 pin +- **Source**: research/2026-05-10.md ▸ Top 6 #5 | review-queue.md (open) +- **Surface area**: cross-cutting — GitHub issues + small wiring in `stream_coordinator` and Code Interpreter / spreadsheet tool-result handling +- **Change**: + 1. Comment on #266 + #267 pointing at upstream PRs (Strands #2249 for context-window lookup; v1.38.0 release notes for large tool result offload). + 2. Verify the upstream features are *automatically* active under our 1.39 pin — if not, file replacement issues for the wiring work and link them. + 3. Close #266 + #267. +- **Subtracts**: **yes — library-native subtraction. Retires 2 "build from scratch" tickets; replaces with at-most 2 "wire upstream feature" tasks.** +- **Effort**: Low +- **Impact**: Medium (closes phantom tech debt; clears the issue list) +- **POC findings**: not POCed. 
+- **Ship means**: 30-minute issue-grooming pass; comment + close + (if needed) file 2 smaller follow-ups. +- **Decline means**: leave #266 + #267 open; future kaizen runs will re-flag them. +- **Recommendation**: **Ship.** Highest *subtraction* yield this week. The clearest demonstration of the kaizen loop earning its keep. + +### 6. Triage Nightly Build & Test failure cluster (9× since May 6) +- **Source**: research/2026-05-10.md ▸ Top 6 #6 | review-queue.md (open) +- **Surface area**: cross-cutting / CI (`.github/workflows/nightly-*.yml`, possibly `tests/shared/test_cognito_idp_service.py` per issue #220) +- **Change**: pull the most recent failure log; trace to root cause; if root is issue #220 (test isolation), bump #220 to blocking and land it; if it's a different cause, file and resolve. +- **Subtracts**: possibly — if root is #220, fixing it materially simplifies the suite (removes a tech-debt entry). +- **Effort**: Low-Medium (worst case: a real regression hiding under the noise) +- **Impact**: Medium-High (CI signal is currently unreliable; that affects *every* PR review) +- **POC findings**: not POCed. +- **Ship means**: 1-2 hour triage pass; either fix in a small PR or bump #220 to blocking. +- **Decline means**: continue ignoring nightly failures; eventually a real regression will hide here. +- **Recommendation**: **Ship.** This is independent of the kaizen-loop work above; it's hygiene. If the kaizen review is the venue that finally surfaces it, that's a kaizen win. + +### 7. Audit `oauth_required` SSE flow against ref-repo's mid-tool-call 401/403 handling +- **Source**: research/2026-05-10.md ▸ Risks | review-queue.md (open, deferred) +- **Surface area**: backend (`apis/shared/oauth/agentcore_identity.py`, SSE event emission in `inference-api`, MCP/A2A tool wrappers) +- **Change**: walk through the code paths where an external OAuth provider (Google etc.) 
returns 401/403 mid-stream; confirm the response is `oauth_required` SSE re-emission (consent-resume), not a streamed tool error to the user. Add a regression test if missing. +- **Subtracts**: addition only — defensive; closes a real UX gap when upstream tokens revoke. +- **Effort**: Medium (audit + likely 1-2 small fixes) +- **Impact**: High (user-visible UX; OAuth token revocation does happen) +- **POC findings**: not POCed. +- **Ship means**: open a tech-debt issue with the audit findings; fix in a follow-up PR if anything is broken. +- **Decline means**: assume current behavior is correct; revisit if a user reports a stuck-OAuth conversation. +- **Recommendation**: **Defer 2 weeks (revisit 2026-05-24).** Highest-impact backend proposal but BFF auth is still settling — landing this in the middle of the BFF patch parade risks compounding the churn. Wait for BFF to stabilize (likely beta.25), then audit cleanly. + +### 8. Named A2A agent participants in the chat UI +- **Source**: research/2026-05-10.md ▸ Agentic UI/UX ▸ Linear Agent pattern | review-queue.md (open) +- **Surface area**: frontend (extend message model with `agent_identity`; distinct avatar / name / styling for A2A sub-agent turns instead of nesting under generic `tool_use` cards) +- **Change**: when an A2A sub-agent produces output, render it as a distinctly attributed turn (named, avatar'd) rather than as a nested tool result. SSE already carries enough information; this is mostly a rendering change. +- **Subtracts**: no — additive but pattern-validated across Linear, ChatGPT agents, and Cursor multi-agent flows. +- **Effort**: Low-Medium +- **Impact**: Medium (legibility of multi-agent runs; user understanding of "who did what") +- **POC findings**: not POCed. +- **Ship means**: small PR extending the message model + a `` Angular component variant. +- **Decline means**: A2A sub-agent activity continues to be nested under tool cards; users can't easily tell when a different "actor" is responding. 
+- **Recommendation**: **Ship.** Low-effort UX win that future-proofs the chat for the A2A direction we're already heading. + +### 9. Replace dead source URLs in `kaizen-research` skill + drop `anthropics/courses` +- **Source**: research/2026-05-10.md ▸ Retirement candidates | review-queue.md (open) +- **Surface area**: skills (`.claude/skills/kaizen-research/SKILL.md`) +- **Change**: + - Replace `https://aws.amazon.com/bedrock/whats-new/` (404) with the AWS What's New RSS feed (already in the skill — drop the dead URL). + - Replace `https://docs.claude.com/en/docs/claude-code/release-notes` (301→404) with `https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md`. + - Drop `https://github.com/anthropics/courses` from the cookbook source (quiet since Nov 2025). +- **Subtracts**: **yes — replaces 2 broken URLs with working ones; drops 1 stale source.** +- **Effort**: Low +- **Impact**: Low (skill quality) +- **POC findings**: not POCed. +- **Ship means**: 5-minute edit to `kaizen-research/SKILL.md`. +- **Decline means**: leave dead URLs; future runs waste budget on them. +- **Recommendation**: **Ship.** Trivial, pure subtraction. Bundle with proposal #10 in one skill-update PR. + +### 10. Add Reddit `.rss` or Reddit MCP to `kaizen-research` +- **Source**: research/2026-05-10.md ▸ Risks ▸ "Reddit blocked from WebFetch" | review-queue.md (open) +- **Surface area**: skills (`.claude/skills/kaizen-research/SKILL.md`) +- **Change**: switch the community-signal subagent from raw `reddit.com` URLs to `https://www.reddit.com/r//.rss` (try first — `.rss` may be allowed where the HTML page isn't), or wire a Reddit MCP server if available. +- **Subtracts**: no — restores a half-blind source. +- **Effort**: Low (try `.rss` first; fall back to MCP if blocked) +- **Impact**: Low-Medium (community signal is one of 11 sources; useful but not load-bearing) +- **POC findings**: not POCed. +- **Ship means**: edit the skill's source list. 
+- **Decline means**: keep accepting "Reddit blocked" as a known gap. +- **Recommendation**: **Ship.** Bundle with proposal #9 in a single skill-update PR. + +## Carried Over From Prior Reviews + +Bootstrap run — none. + +## Retirement Candidates + +- **`enabled_tools` whitelist debug guidance in `CLAUDE.md`** — Reference repo abandoned this pattern May 3; ours isn't *wrong*, just diverging from the reference. **Recommendation**: monitor; revisit if/when we touch tool-enablement code. Not retire-this-week. +- **Skills not modified in 60+ days (6 of 9)** — modification freshness alone isn't enough signal to retire skills that encode stable conventions (e.g. `tailwind-ui`, `cdk-infrastructure`). **Recommendation**: **don't retire.** If we want this to be a reliable retirement signal, we'd need invocation telemetry — that's a separate proposal worth filing for a future week. + +## Risks Acknowledged But Not Acted On + +- **OpenAI GPT-5.3 → 5.5 Instant displacement** — https://openai.com/index/gpt-5-5-instant/ — *what breaks if ignored*: customers using a 5.3 default may hit a deprecation window. **Recommendation**: **Watch until 2026-06-01.** Quick check next week to confirm whether OpenAI is publishing a deprecation date for 5.3; if yes, file a model-selector audit. +- **MCP Apps adoption window** — every major host shipped support in Q1 2026. The longer we wait, the more we're shaping our tool-result UI in a direction that doesn't compose with where the ecosystem is going. **Recommendation**: scope this week (proposal #1); first code PR by 2026-05-31. 
+ +## What Shipped This Week + +- **#277 — feat(auth): centralize 401 redirect + proactive session detection** (May 10) — *closes a real refresh-edge UX hole* +- **#275 — fix(bff): tighten cross-task refresh-lock release + absolute-lifetime guard** — *prevents zombie refresh attempts after lock release* +- **#274 — fix(bff): replace KMS-wrap data-key bootstrap with Secrets-Manager-generated secret** — *removes a bootstrap race the AES-GCM codec couldn't recover from* +- **#273 — fix(bff): cross-task cookie-codec & refresh-lock correctness** — *cleanup* +- **#272 — feat(auth): add SKIP_AUTH=true local-dev bypass with allowlist guard** — *unblocks local dev when Cognito is offline* +- **#271 — fix(auth): make lava-lamp backdrop dark-mode aware** — *visual polish* +- **#270 — fix(token-accounting): correct per-message cost and context-window semantics** — *fixes the cost badge accuracy* +- **#265 — chore(deps): upgrade strands-agents to 1.39.0** — *the upgrade that quietly closes #266 and #267* + +## Take + +The week's most valuable shipping is the strands-agents 1.39.0 bump (#265) — the team probably doesn't yet know it closed two of our open issues. That's the kaizen loop earning its keep on the upstream-harvest side. The new UI/UX lens — added mid-bootstrap — earned its keep too: it surfaced **MCP Apps** as a production-ready agentic UI standard that every major host already ships, and that our chat doesn't. The most consequential change this week if scoped is **proposal #1 (MCP Apps host renderer)** — high effort but high strategic value. The best risk-adjusted move is **proposal #3 (per-tool renderer registry)** — independently valuable AND pre-paves #1. Quick wins: **#2 (`bedrock-agentcore` bump)** and **#5 (close #266/#267)** demonstrate library-native subtraction. CI (proposal #6) is the loudest non-kaizen problem; surface it here but fix it as hygiene. + +--- + +## Review Protocol (for Phil) + +1. Read Friction (2 min). +2. 
Mark each Proposal ✅ Ship / ❌ Decline / ⏸ Defer (4-6 min). **10 proposals**; my recommendations: 9 Ship, 1 Defer, 0 Decline (proposal #7 — `oauth_required` audit — is the defer until 2026-05-24). +3. Same for Risks Acknowledged. +4. Pick 1-3 to ship this week. Suggested if you only do 3: **#1 (scope MCP Apps host — scoping doc only this week), #2 (bedrock-agentcore bump), #5 (close #266/#267)** — covers strategic, quick-win, and subtraction. If 4: add **#3 (renderer registry)** as the bridge investment. + +Target: 12-17 minutes (slightly more than the nominal 10-15 because the bootstrap is larger than a normal weekly review). + +## Post-review (separate PRs) + +- ✅ Ship items → individual feature PRs over the week. The decision is logged in this doc; the implementation lives elsewhere. +- ❌ Decline items → appended to `docs/kaizen/decisions.md` with reason so future research doesn't re-propose. +- ⏸ Defer items → kept open in `review-queue.md` with a "revisit by" date; surface again in the next review when due. + +This skill produces the agenda. Implementation never happens here. 
diff --git a/frontend/ai.client/package-lock.json b/frontend/ai.client/package-lock.json index fb94ae19..13172270 100644 --- a/frontend/ai.client/package-lock.json +++ b/frontend/ai.client/package-lock.json @@ -1,12 +1,12 @@ { "name": "ai.client", - "version": "1.0.0-beta.24", + "version": "1.0.0-beta.25", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "ai.client", - "version": "1.0.0-beta.24", + "version": "1.0.0-beta.25", "dependencies": { "@angular/cdk": "21.2.9", "@angular/common": "21.2.11", diff --git a/frontend/ai.client/package.json b/frontend/ai.client/package.json index 27283c25..c328fbba 100644 --- a/frontend/ai.client/package.json +++ b/frontend/ai.client/package.json @@ -1,6 +1,6 @@ { "name": "ai.client", - "version": "1.0.0-beta.24", + "version": "1.0.0-beta.25", "scripts": { "ng": "ng", "start": "ng serve", diff --git a/frontend/ai.client/src/app/app.config.ts b/frontend/ai.client/src/app/app.config.ts index 76c528ac..9e18fac7 100644 --- a/frontend/ai.client/src/app/app.config.ts +++ b/frontend/ai.client/src/app/app.config.ts @@ -8,6 +8,7 @@ import { errorInterceptor } from './auth/error.interceptor'; import { withCredentialsInterceptor } from './auth/with-credentials.interceptor'; import { MARKED_OPTIONS, MarkedOptions, MarkedRenderer, provideMarkdown } from 'ngx-markdown'; import { SessionService } from './auth/session.service'; +import { ThemeService } from './components/topnav/components/theme-toggle/theme.service'; function markedOptionsFactory(): MarkedOptions { const renderer = new MarkedRenderer(); @@ -47,5 +48,13 @@ export const appConfig: ApplicationConfig = { // the user can pick a provider. Transport errors leave the SPA in a clean // unauthenticated state without redirecting. provideAppInitializer(() => inject(SessionService).bootstrap()), + + // ThemeService applies the persisted/system theme to in its + // constructor. 
It's providedIn:'root' but only injected by the topnav + // and authed pages, so on a cold load to /auth/login or /auth/first-boot + // it would never run and the dark-mode CSS on those screens would sit + // dormant. Inject it at bootstrap so the lava-lamp backdrop honors the + // user's preference (and prefers-color-scheme) on every route. + provideAppInitializer(() => { inject(ThemeService); }), ] }; diff --git a/frontend/ai.client/src/app/app.ts b/frontend/ai.client/src/app/app.ts index d2183f39..1ad60006 100644 --- a/frontend/ai.client/src/app/app.ts +++ b/frontend/ai.client/src/app/app.ts @@ -1,4 +1,4 @@ -import { Component, inject, signal } from '@angular/core'; +import { Component, DestroyRef, inject, signal } from '@angular/core'; import { Router, RouterOutlet } from '@angular/router'; import { Sidenav } from './components/sidenav/sidenav'; import { ErrorToastComponent } from './components/error-toast/error-toast.component'; @@ -6,14 +6,15 @@ import { ToastComponent } from './components/toast'; import { SidenavService } from './services/sidenav/sidenav.service'; import { HeaderService } from './services/header/header.service'; import { TooltipDirective } from './components/tooltip/tooltip.directive'; +import { SessionService } from './auth/session.service'; @Component({ selector: 'app-root', imports: [ - RouterOutlet, - Sidenav, - ErrorToastComponent, - ToastComponent, + RouterOutlet, + Sidenav, + ErrorToastComponent, + ToastComponent, TooltipDirective ], templateUrl: './app.html', @@ -24,6 +25,24 @@ export class App { protected sidenavService = inject(SidenavService); protected headerService = inject(HeaderService); private router = inject(Router); + private session = inject(SessionService); + + constructor() { + // Re-probe the BFF session whenever the tab regains focus. A session + // that expired while the tab was backgrounded surfaces immediately + // (redirect to /auth/login) instead of waiting for the next user + // action to 401. 
SSR-safe via the document guard. + if (typeof document !== 'undefined') { + const destroyRef = inject(DestroyRef); + const handler = () => { + if (document.visibilityState === 'visible') { + this.session.recheck(); + } + }; + document.addEventListener('visibilitychange', handler); + destroyRef.onDestroy(() => document.removeEventListener('visibilitychange', handler)); + } + } newChat() { this.router.navigate(['']); diff --git a/frontend/ai.client/src/app/auth/error.interceptor.spec.ts b/frontend/ai.client/src/app/auth/error.interceptor.spec.ts index e389a13d..bde70a6c 100644 --- a/frontend/ai.client/src/app/auth/error.interceptor.spec.ts +++ b/frontend/ai.client/src/app/auth/error.interceptor.spec.ts @@ -2,6 +2,7 @@ import { TestBed } from '@angular/core/testing'; import { HttpRequest, HttpResponse, HttpErrorResponse, HttpHandlerFn } from '@angular/common/http'; import { errorInterceptor } from './error.interceptor'; import { ErrorService } from '../services/error/error.service'; +import { SessionService } from './session.service'; import { of, throwError } from 'rxjs'; import { describe, it, expect, beforeEach, afterEach, vi } from 'vitest'; @@ -9,16 +10,23 @@ describe('errorInterceptor', () => { let errorService: { handleHttpError: ReturnType; }; + let sessionService: { + handleUnauthorized: ReturnType; + }; beforeEach(() => { TestBed.resetTestingModule(); errorService = { handleHttpError: vi.fn(), }; + sessionService = { + handleUnauthorized: vi.fn(), + }; TestBed.configureTestingModule({ providers: [ { provide: ErrorService, useValue: errorService }, + { provide: SessionService, useValue: sessionService }, ], }); }); @@ -147,6 +155,26 @@ describe('errorInterceptor', () => { }); }); + it('should call sessionService.handleUnauthorized on 401 and skip the toast', async () => { + const error = new HttpErrorResponse({ status: 401, url: 'http://localhost:8000/api/sessions' }); + const nextFn: HttpHandlerFn = vi.fn().mockReturnValue(throwError(() => error)); + const 
req = new HttpRequest('GET', 'http://localhost:8000/api/sessions'); + + await new Promise((resolve) => { + TestBed.runInInjectionContext(() => { + errorInterceptor(req, nextFn).subscribe({ + error: (err: unknown) => { + expect(sessionService.handleUnauthorized).toHaveBeenCalledTimes(1); + expect(errorService.handleHttpError).not.toHaveBeenCalled(); + // Caller still sees the error so any local cleanup runs. + expect(err).toBe(error); + resolve(); + }, + }); + }); + }); + }); + /** * Validates: Requirements 14.8 (success case) * Successful responses pass through without interception diff --git a/frontend/ai.client/src/app/auth/error.interceptor.ts b/frontend/ai.client/src/app/auth/error.interceptor.ts index 83aad8b9..31cacfd8 100644 --- a/frontend/ai.client/src/app/auth/error.interceptor.ts +++ b/frontend/ai.client/src/app/auth/error.interceptor.ts @@ -2,6 +2,7 @@ import { HttpInterceptorFn, HttpErrorResponse } from '@angular/common/http'; import { inject } from '@angular/core'; import { catchError, throwError } from 'rxjs'; import { ErrorService } from '../services/error/error.service'; +import { SessionService } from './session.service'; /** * HTTP interceptor that handles errors from non-streaming HTTP requests @@ -17,6 +18,7 @@ import { ErrorService } from '../services/error/error.service'; */ export const errorInterceptor: HttpInterceptorFn = (req, next) => { const errorService = inject(ErrorService); + const sessionService = inject(SessionService); // Skip error handling for SSE streaming endpoints // These are handled by fetchEventSource's onerror callback. @@ -43,12 +45,12 @@ export const errorInterceptor: HttpInterceptorFn = (req, next) => { req.url.includes(endpoint) ); - // 401s mean the BFF session is missing or expired. SessionService - // handles that by routing the user to /auth/login — a toast on top - // is just noise and tends to flash before the redirect lands. 
- const isUnauthorized = error.status === 401; - - if (!isSilentEndpoint && !isUnauthorized) { + // 401s mean the BFF session is missing or expired. Route to the + // SPA login page (idempotent across concurrent 401s) and skip the + // toast — it just flashes before the redirect lands. + if (error.status === 401) { + sessionService.handleUnauthorized(); + } else if (!isSilentEndpoint) { // Use ErrorService to display the error errorService.handleHttpError(error); } diff --git a/frontend/ai.client/src/app/auth/first-boot/first-boot.page.css b/frontend/ai.client/src/app/auth/first-boot/first-boot.page.css index 72609321..1659783b 100644 --- a/frontend/ai.client/src/app/auth/first-boot/first-boot.page.css +++ b/frontend/ai.client/src/app/auth/first-boot/first-boot.page.css @@ -1,5 +1,247 @@ -/* First-boot page specific styles */ +/* First-boot page styles — mirrors the login page's lava-lamp parallax + backdrop and frosted-glass card so the two auth screens feel like one + system. Class names are component-scoped (Emulated view encapsulation) + so they don't collide with the login page's identical names. 
*/ @import "tailwindcss"; @custom-variant dark (&:where(.dark, .dark *)); + +/* ---------- Background canvas ---------- */ +.login-shell { + background: + radial-gradient(120% 80% at 0% 0%, color-mix(in oklab, var(--color-primary-50) 70%, white) 0%, transparent 60%), + radial-gradient(120% 80% at 100% 100%, color-mix(in oklab, var(--color-primary-100) 60%, white) 0%, transparent 55%), + var(--color-gray-50); +} + +:host-context(html.dark) .login-shell { + background: + radial-gradient(120% 80% at 0% 0%, color-mix(in oklab, var(--color-primary-900) 50%, black) 0%, transparent 60%), + radial-gradient(120% 80% at 100% 100%, color-mix(in oklab, var(--color-primary-800) 35%, black) 0%, transparent 55%), + var(--color-gray-900); +} + +.login-bg { + position: absolute; + inset: 0; + overflow: hidden; + pointer-events: none; +} + +.login-lava { + position: absolute; + inset: 0; + overflow: hidden; +} + +.login-blob { + position: absolute; + will-change: transform, border-radius; + border-radius: 58% 42% 60% 40% / 50% 55% 45% 50%; +} + +/* ----- Far tier: huge, slow, heavy blur, low opacity ----- */ +.login-blob--a { + width: 70vw; + height: 86vw; + max-width: 880px; + max-height: 1080px; + bottom: -38vw; + left: -18vw; + filter: blur(110px); + opacity: 0.4; + background: radial-gradient(circle at 35% 35%, var(--color-primary-400), var(--color-primary-700) 60%, transparent 78%); + animation: + login-rise-a 52s ease-in-out infinite alternate, + login-morph-a 28s ease-in-out infinite alternate; +} + +.login-blob--b { + width: 62vw; + height: 76vw; + max-width: 800px; + max-height: 960px; + top: -34vw; + right: -20vw; + filter: blur(100px); + opacity: 0.36; + background: radial-gradient(circle at 65% 65%, var(--color-primary-500), var(--color-primary-800) 65%, transparent 82%); + animation: + login-rise-b 60s ease-in-out infinite alternate, + login-morph-b 32s ease-in-out infinite alternate; +} + +/* ----- Mid tier ----- */ +.login-blob--c { + width: 32vw; + height: 40vw; + 
max-width: 420px; + max-height: 520px; + top: 28%; + left: 42%; + filter: blur(60px); + opacity: 0.5; + background: radial-gradient(circle, color-mix(in oklab, var(--color-primary-300) 75%, white), transparent 72%); + animation: + login-rise-c 30s ease-in-out infinite alternate, + login-morph-c 18s ease-in-out infinite alternate; +} + +.login-blob--d { + width: 28vw; + height: 36vw; + max-width: 360px; + max-height: 460px; + bottom: -12vw; + right: 18vw; + filter: blur(55px); + opacity: 0.55; + background: radial-gradient(circle at 50% 50%, var(--color-primary-300), var(--color-primary-500) 60%, transparent 80%); + animation: + login-rise-d 26s ease-in-out infinite alternate, + login-morph-a 16s ease-in-out infinite alternate -3s; +} + +/* ----- Near tier ----- */ +.login-blob--e { + width: 16vw; + height: 22vw; + max-width: 220px; + max-height: 300px; + top: -6vw; + left: 32vw; + filter: blur(32px); + opacity: 0.65; + background: radial-gradient(circle at 50% 50%, var(--color-primary-400), var(--color-primary-700) 65%, transparent 82%); + animation: + login-rise-e 14s ease-in-out infinite alternate, + login-morph-b 11s ease-in-out infinite alternate -5s; +} + +.login-blob--f { + width: 12vw; + height: 16vw; + max-width: 160px; + max-height: 220px; + bottom: -4vw; + left: 14vw; + filter: blur(26px); + opacity: 0.7; + background: radial-gradient(circle at 45% 45%, var(--color-primary-300), var(--color-primary-600) 65%, transparent 84%); + animation: + login-rise-f 11s ease-in-out infinite alternate, + login-morph-c 9s ease-in-out infinite alternate -2s; +} + +:host-context(html.dark) .login-blob--a { opacity: 0.32; } +:host-context(html.dark) .login-blob--b { opacity: 0.28; } +:host-context(html.dark) .login-blob--c { opacity: 0.38; } +:host-context(html.dark) .login-blob--d { opacity: 0.42; } +:host-context(html.dark) .login-blob--e { opacity: 0.5; } +:host-context(html.dark) .login-blob--f { opacity: 0.55; } + +.login-grid { + position: absolute; + inset: 0; + 
background-image: + linear-gradient(to right, color-mix(in oklab, var(--color-primary-500) 8%, transparent) 1px, transparent 1px), + linear-gradient(to bottom, color-mix(in oklab, var(--color-primary-500) 8%, transparent) 1px, transparent 1px); + background-size: 64px 64px; + mask-image: radial-gradient(ellipse 70% 60% at 50% 45%, black 30%, transparent 75%); + -webkit-mask-image: radial-gradient(ellipse 70% 60% at 50% 45%, black 30%, transparent 75%); + opacity: 0.6; +} + +:host-context(html.dark) .login-grid { + background-image: + linear-gradient(to right, color-mix(in oklab, var(--color-primary-300) 6%, transparent) 1px, transparent 1px), + linear-gradient(to bottom, color-mix(in oklab, var(--color-primary-300) 6%, transparent) 1px, transparent 1px); + opacity: 0.5; +} + +/* Far: minimal travel, lazy sway */ +@keyframes login-rise-a { + 0% { transform: translate3d(0, 0, 0) scale(1, 1) rotate(0deg); } + 50% { transform: translate3d(2vw, -12vh, 0) scale(1.04, 0.96) rotate(4deg); } + 100% { transform: translate3d(-1vw, -22vh, 0) scale(0.97, 1.05) rotate(-3deg); } +} +@keyframes login-rise-b { + 0% { transform: translate3d(0, 0, 0) scale(1, 1) rotate(0deg); } + 50% { transform: translate3d(-2vw, 10vh, 0) scale(0.96, 1.05) rotate(-4deg); } + 100% { transform: translate3d(1vw, 20vh, 0) scale(1.05, 0.96) rotate(3deg); } +} + +/* Mid: moderate travel */ +@keyframes login-rise-c { + 0% { transform: translate3d(0, 0, 0) scale(1, 1) rotate(0deg); } + 50% { transform: translate3d(-5vw, -25vh, 0) scale(1.1, 0.94) rotate(-10deg); } + 100% { transform: translate3d(4vw, -50vh, 0) scale(0.92, 1.1) rotate(8deg); } +} +@keyframes login-rise-d { + 0% { transform: translate3d(0, 0, 0) scale(1, 1) rotate(0deg); } + 50% { transform: translate3d(6vw, -35vh, 0) scale(1.05, 0.95) rotate(12deg); } + 100% { transform: translate3d(-3vw, -68vh, 0) scale(0.92, 1.08) rotate(-7deg); } +} + +/* Near: dramatic travel */ +@keyframes login-rise-e { + 0% { transform: translate3d(0, 0, 0) scale(1, 
1) rotate(0deg); } + 50% { transform: translate3d(8vw, 55vh, 0) scale(0.88, 1.14) rotate(-18deg); } + 100% { transform: translate3d(-6vw, 100vh, 0) scale(1.16, 0.86) rotate(14deg); } +} +@keyframes login-rise-f { + 0% { transform: translate3d(0, 0, 0) scale(1, 1) rotate(0deg); } + 50% { transform: translate3d(-9vw, -55vh, 0) scale(1.18, 0.84) rotate(20deg); } + 100% { transform: translate3d(7vw, -105vh, 0) scale(0.85, 1.18) rotate(-16deg); } +} + +/* Surface morph */ +@keyframes login-morph-a { + 0% { border-radius: 58% 42% 60% 40% / 50% 55% 45% 50%; } + 50% { border-radius: 42% 58% 38% 62% / 60% 40% 60% 40%; } + 100% { border-radius: 50% 50% 65% 35% / 45% 55% 50% 50%; } +} +@keyframes login-morph-b { + 0% { border-radius: 50% 50% 40% 60% / 55% 45% 55% 45%; } + 50% { border-radius: 65% 35% 55% 45% / 40% 60% 40% 60%; } + 100% { border-radius: 38% 62% 50% 50% / 60% 50% 50% 40%; } +} +@keyframes login-morph-c { + 0% { border-radius: 60% 40% 50% 50% / 45% 60% 40% 55%; } + 50% { border-radius: 40% 60% 65% 35% / 55% 40% 60% 45%; } + 100% { border-radius: 55% 45% 38% 62% / 50% 55% 45% 50%; } +} + +@media (prefers-reduced-motion: reduce) { + .login-blob, + .login-blob--a, + .login-blob--b, + .login-blob--c, + .login-blob--d, + .login-blob--e, + .login-blob--f { + animation: none; + } +} + +/* ---------- Frosted glass card ---------- */ +.login-card { + background: color-mix(in oklab, white 65%, transparent); + backdrop-filter: blur(24px) saturate(160%); + -webkit-backdrop-filter: blur(24px) saturate(160%); + border: 1px solid color-mix(in oklab, white 70%, transparent); + box-shadow: + 0 1px 0 0 rgba(255, 255, 255, 0.6) inset, + 0 20px 50px -20px color-mix(in oklab, var(--color-primary-900) 35%, transparent), + 0 8px 24px -12px rgba(0, 0, 0, 0.15); +} + +:host-context(html.dark) .login-card { + background: color-mix(in oklab, var(--color-gray-900) 55%, transparent); + border-color: color-mix(in oklab, white 12%, transparent); + box-shadow: + 0 1px 0 0 rgba(255, 255, 255, 
0.06) inset, + 0 20px 50px -20px rgba(0, 0, 0, 0.6), + 0 8px 24px -12px rgba(0, 0, 0, 0.5); +} diff --git a/frontend/ai.client/src/app/auth/first-boot/first-boot.page.ts b/frontend/ai.client/src/app/auth/first-boot/first-boot.page.ts index 4c8f81e7..494ed489 100644 --- a/frontend/ai.client/src/app/auth/first-boot/first-boot.page.ts +++ b/frontend/ai.client/src/app/auth/first-boot/first-boot.page.ts @@ -11,8 +11,23 @@ import { SystemService, FirstBootError } from '../../services/system.service'; styleUrl: './first-boot.page.css', changeDetection: ChangeDetectionStrategy.OnPush, template: ` -
-
+