[ts sdk / exploration] New example apps by ardaerzin · Pull Request #4353 · Agenta-AI/agenta

ardaerzin · 2026-05-18T15:12:44Z

Mirrors @agenta/api-client: type:module, main/types/exports point at ./dist, prepare hook compiles via tsc on install. Required for Node-side consumers (tsx, ts-node, plain node + tsc) — workspace bundler magic doesn't extend to them, so the SDK has to ship a real built artifact. Existing dependents (@agenta/entities, web/_reference/*) still type-check against the new dist-based main. dist/ added to .gitignore (matches the api-client pattern; install regenerates it via the prepare hook).

Adds a top-of-file note that newer spike apps targeting AI SDK v6 (with multiple frameworks and the @vercel/otel variant) live under web/examples/. The v4 example itself stays unchanged — coexistence is intentional: it validates Agenta's adapter against v4 spans while the spike apps probe v6.

…ing design

Adds a new examples/* workspace tree under web/ for research apps that inform Agenta's first-party TypeScript SDK with tracing (ts-sdk-tracing). The spike measures the actual friction TS users hit today wiring AI SDK + raw OpenTelemetry against the Agenta backend, so the SDK can be designed around real pain instead of guesses.

Phase 1 contents:

- @agenta/spike-verify (web/examples/.shared/agenta-verify/): workspace package wrapping the official @agenta/sdk's traces.querySpans() with a polling/matching layer + 3 typed errors. 13 vitest unit tests cover polling/timeout/retry/error semantics; verifyTrace itself talks to a real Agenta over HTTP at integration time.
- @agenta-spike/node-vercel-ai-v6 (web/examples/node-vercel-ai-v6/): Node + AI SDK v6 + raw OTel spike. Three demos (generateText, streamText, tool-call) and four canonical assertion scripts that run via pnpm test against a live Agenta endpoint:
  1. cold-start trace completeness
  2. mid-stream client-abort flush
  3. metadata round-trip
  4. instrumentation runs before first handler

  All 4 currently green against a local Agenta instance.

Wires examples/* and examples/.shared/* into web/pnpm-workspace.yaml. Each spike app is fully self-contained (own package.json, .env.example, tsconfig).

Findings feed docs/design/ts-sdk-tracing/; see that directory for the pain log + executive summary.

NOT a stable starter template. Apps are research instruments, expected to be either refactored to use ts-sdk-tracing or retired when the SDK ships (per the post-SDK lifecycle TODO captured separately).
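The polling/matching layer described above could look roughly like the sketch below. Everything in it is hypothetical: the `SpanRecord` shape, the three error class names, the option names, and the `pollForSpan` function are illustrations, not the real @agenta/spike-verify or @agenta/sdk API. The query function, clock, and sleep are injected so the logic can be exercised without a live Agenta endpoint.

```typescript
// Hypothetical shapes for illustration; not the real @agenta/sdk or @agenta/spike-verify API.
interface SpanRecord {
  name: string;
  attributes: Record<string, unknown>;
}

class SpanTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SpanTimeoutError";
  }
}

class SpanQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SpanQueryError";
  }
}

class SpanMismatchError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SpanMismatchError";
  }
}

interface PollOptions {
  spanName: string;
  expected: Record<string, unknown>;
  timeoutMs: number;
  intervalMs: number;
  maxConsecutiveFailures: number;
  sleep: (ms: number) => Promise<void>; // injected so tests can skip real waiting
  now: () => number;                    // injected clock
}

async function pollForSpan(
  query: () => Promise<SpanRecord[]>,
  opts: PollOptions,
): Promise<SpanRecord> {
  const deadline = opts.now() + opts.timeoutMs;
  let failures = 0;

  while (opts.now() <= deadline) {
    let spans: SpanRecord[] = [];
    try {
      spans = await query();
      failures = 0;
    } catch (err) {
      // Transient query errors are retried until the failure budget is spent.
      failures += 1;
      if (failures > opts.maxConsecutiveFailures) {
        throw new SpanQueryError(`query failed ${failures} times in a row: ${String(err)}`);
      }
    }

    const span = spans.find((s) => s.name === opts.spanName);
    if (span !== undefined) {
      for (const key of Object.keys(opts.expected)) {
        if (span.attributes[key] !== opts.expected[key]) {
          throw new SpanMismatchError(`attribute ${key} did not match`);
        }
      }
      return span;
    }
    await opts.sleep(opts.intervalMs);
  }
  throw new SpanTimeoutError(`no span named ${opts.spanName} within ${opts.timeoutMs}ms`);
}
```

In a real script the `sleep` and `now` options would presumably be backed by `setTimeout` and `Date.now()`; injecting them keeps the timeout and retry semantics unit-testable.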

Adds the design space for Agenta's first-party TypeScript SDK with tracing. The spike apps under web/examples/ feed entries into pain-log.md during build; the SDK is then designed against the captured friction rather than guesses.

Includes:

- README.md: orientation and document index
- summary.md: living one-page executive summary, updated as each spike app phase lands. Phase 1 (Node + AI SDK v6 + raw OTel) section reflects 4/4 canonical assertions green and the first 3 pain entries.
- pain-log.md: structured friction entries with per-framework prefix numbering scheme (P-NODE-*, P-APP-RAW-*, P-APP-VERCEL-*, etc.). Each entry: framework, 3-axis severity, code-that-exists-today, ideal-API sketch, notes. Three Phase 1 entries so far, two silent-failure-shaped.
- status.md: progress tracker, locked design decisions, plus a separate "SDK Requirements" section for items initially logged as pain entries that, on review, are obvious features ts-sdk-tracing must ship (built dist, host normalization, projectId propagation).
- scripts/validate-pain-log.mjs: schema validator (Node ESM, no deps) wired into .pre-commit-config.yaml as a pre-commit hook on the pain-log.md path. Validates that the ID prefix matches a known framework set, the 3-axis severity is present, the code excerpt + ideal sketch are both present, and there are no duplicate IDs.

Plus TODOS.md at repo root: post-spike lifecycle decision per spike app (refactor to use ts-sdk-tracing as docs companion / starter template, or retire with pain log preserved).
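The four validator checks listed above can be sketched as follows. This is an illustration only: the real script is Node ESM and parses pain-log.md, whose schema is not shown here, so the `PainEntry` shape, field names, and the numeric-suffix rule are assumptions.

```typescript
// Hypothetical entry shape; the real pain-log.md schema is not shown in this PR text.
interface PainEntry {
  id: string;          // e.g. "P-NODE-02"
  severity: number[];  // one value per axis; axis names are not given in the source
  codeToday: string;   // "code-that-exists-today" excerpt
  idealSketch: string; // "ideal-API sketch"
}

function validatePainLog(entries: PainEntry[], knownPrefixes: string[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();

  for (const entry of entries) {
    // Check 1: the ID starts with a known framework prefix, followed by digits (assumed format).
    const prefix = knownPrefixes.find((p) => entry.id.startsWith(p));
    const suffix = prefix === undefined ? "" : entry.id.slice(prefix.length);
    if (prefix === undefined || !/^\d+$/.test(suffix)) {
      errors.push(`${entry.id}: unknown framework prefix or malformed number`);
    }

    // Check 2: no duplicate IDs.
    if (seen.has(entry.id)) {
      errors.push(`${entry.id}: duplicate ID`);
    }
    seen.add(entry.id);

    // Check 3: all three severity axes are present.
    if (entry.severity.length !== 3) {
      errors.push(`${entry.id}: expected 3 severity axes, got ${entry.severity.length}`);
    }

    // Check 4: both the code excerpt and the ideal sketch are present.
    if (entry.codeToday.trim() === "") {
      errors.push(`${entry.id}: missing code excerpt`);
    }
    if (entry.idealSketch.trim() === "") {
      errors.push(`${entry.id}: missing ideal-API sketch`);
    }
  }
  return errors;
}
```

Returning a list of messages instead of throwing on the first problem suits a pre-commit hook, where the author wants to see every broken entry in one run.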

Adds the second spike app for ts-sdk-tracing design: Next.js 15 App Router + AI SDK v6 + raw OpenTelemetry. Probes the four canonical assertions plus three new framework-specific surfaces: Server Action direct invocation, useChat streaming via /api/chat, and an edge-runtime route at /api/edge-chat.

Result: 4/4 nodejs-runtime assertions PASS. The edge-runtime route silently emits zero spans despite the documented setup (fetch-based exporter, SimpleSpanProcessor, waitUntil(forceFlush())). Captured as P-APP-RAW-01 in the pain log, the highest-severity Phase 2a finding. Phase 2b's @vercel/otel app will A/B test whether that's a raw-OTel-on-edge issue or something else.

Stack notes worth committing context on:

- Uses SimpleSpanProcessor (not BatchSpanProcessor) per P-NODE-02. BatchSpanProcessor + AI SDK v6 streamText silently loses spans.
- instrumentation.ts dispatches by NEXT_RUNTIME: Node-runtime setup in instrumentation.node.ts, edge-runtime setup module-scoped inside the edge route file (Node-only OTel libs can't load on edge).
- type:module in package.json is required for tsx test scripts to resolve the ESM-only @agenta/sdk via @agenta/spike-verify.
- Each spike app on its own port (3101 here) so multiple can run in parallel without colliding with the OSS app on 3000.

Updates summary.md (now living, covers both Phase 1 + 2a) and status.md to reflect the current 4/4 + edge-pending state.

Adds the @vercel/otel variant of the App Router spike. Same app shape as nextjs-app-router-raw: same routes, same canonical assertions, same chat UI, same Server Action probe. The only delta is the instrumentation wiring: a single registerOTel() call from @vercel/otel replaces the multi-file raw OTel scaffold (instrumentation.ts → instrumentation.node.ts + inline edge-route provider setup).

Result: 3/4 nodejs assertions PASS. assertion-2 FAILS, captured as P-APP-VERCEL-01 in the pain log. Same root cause as P-NODE-02: @vercel/otel defaults to BatchSpanProcessor, which loses streamText spans on mid-stream client abort within the 5s assertion window.

Edge route A/B verdict: where raw OTel + manual edge setup emits zero spans EVER (P-APP-RAW-01), @vercel/otel emits spans WITH a 10-15s delay (P-APP-VERCEL-02). Not silent loss, but too slow for interactive abort-flush scenarios. The wrapper rescues the case the raw setup totally fails, but doesn't make it production-ready for streaming.

Cross-cutting takeaway captured in summary.md: BatchSpanProcessor + AI SDK v6 streamText is the universal flush failure regardless of which OTel wrapper you use. The SDK has to own the processor choice.

App runs on its own port (3102) so both 2a (3101) and 2b (3102) can run concurrently for direct A/B testing.
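The timing side of the assertion-2 failure can be illustrated with a toy model. It deliberately does not use the OpenTelemetry SDK: the function names and the `exportLatencyMs` value are assumptions, and the model only captures "export immediately" versus "wait for the scheduled flush", not every way a span can be lost. It assumes the batch timer starts when the span is queued, so the worst case waits one full scheduled delay.

```typescript
// Toy timing model only; it does not use or reproduce the OpenTelemetry SDK.
interface TimingInput {
  spanEndedAtMs: number;    // when the aborted streamText span ends
  scheduledDelayMs: number; // batch flush delay
  exportLatencyMs: number;  // network + ingestion time (assumed value)
  windowMs: number;         // how long the assertion waits after the span ends
}

// "Simple" style: the export starts as soon as the span ends.
function simpleArrivalMs(input: TimingInput): number {
  return input.spanEndedAtMs + input.exportLatencyMs;
}

// "Batch" style, worst case: the span waits one full scheduled delay before export.
function batchArrivalMs(input: TimingInput): number {
  return input.spanEndedAtMs + input.scheduledDelayMs + input.exportLatencyMs;
}

function arrivesInWindow(arrivalMs: number, input: TimingInput): boolean {
  return arrivalMs - input.spanEndedAtMs <= input.windowMs;
}
```

With a 5000 ms scheduled delay and a 5000 ms assertion window, any non-zero export latency pushes the batch arrival past the window while the simple arrival lands well inside it, which matches the shape of the result above.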

Adds the Pages Router raw-OTel spike to inform ts-sdk-tracing design. Same wiring approach as nextjs-app-router-raw (`instrumentation.ts` register hook, which works identically on Pages Router since Next 15, + SimpleSpanProcessor + AI SDK v6) but adapted to Pages Router's NodeApi handler shape.

Pages-vs-App differences worth committing context on:

- Streaming: Pages Router API routes return a Node ServerResponse, not a fetch Response. So `result.toUIMessageStreamResponse()` (App Router) doesn't apply; we use `pipeUIMessageStreamToResponse({response: res, stream: result.toUIMessageStream()})` instead.
- No Server Action: Pages Router doesn't support React Server Actions, so the Server-Action probe from Phase 2a is omitted.
- No edge route: dropped at build time (P-PAGES-RAW-01). Pages Router's edge runtime applies stricter dynamic-code-eval static analysis than App Router's, and rejects `@opentelemetry/exporter-trace-otlp-http`'s imports. The same imports compile fine in Phase 2a's App Router edge route. Pages Router users on raw OTel can't ship edge tracing today.
- useChat client side: identical to App Router; the same DefaultChatTransport pattern works against both router types.

Result: 4/4 nodejs-runtime canonical assertions PASS. The P-PAGES-RAW-01 edge-build failure is the new entry; Phase 3b (vercel-otel) will A/B test whether @vercel/otel's edge bundle passes the strict check.

App runs on its own port (3103).

Adds the Pages Router @vercel/otel spike to A/B test against Phase 3a (nextjs-pages-router-raw). Same Pages Router app shape, same Node res + pipeUIMessageStreamToResponse streaming pattern; a single-line registerOTel() replaces multi-file raw OTel scaffolding.

Two critical Phase 3b verdicts:

1. Edge route BUILDS + RUNS on @vercel/otel where raw OTel failed (P-PAGES-RAW-01). @vercel/otel ships an edge-safe bundle that passes Pages Router's strict dynamic-code-eval static check. Spans arrive on edge with the same ~10-15s BatchSpanProcessor delay as the App Router edge story (P-APP-VERCEL-02 reproduces on Pages-edge).
2. NEW silent failure: ag.metrics.tokens = {} on the parent streamText span, but ONLY in this 4-way combination of Pages Router + @vercel/otel + pipeUIMessageStreamToResponse + AI SDK v6 streamText. Each isolated piece works alone (verified across Node raw OTel, App Router vercel-otel, Pages Router raw OTel). Captured as P-PAGES-VERCEL-01. Token counts are the #1 metric users instrument LLM calls for: cost tracking silently disappears when wiring documented best-practice from both ecosystems.

Result: 4/4 nodejs-runtime canonical assertions PASS, with assertion-1 loosened to drop the now-empty token-metrics check (documented in test file comments). Edge route + assertion-4 sentinels validated by manual probe. App runs on its own port (3104).

Implications for ts-sdk-tracing's design escalate cross-cutting takeaway #1: the SDK must own streamText's span lifecycle (end + flush + attribute population), which solves P-NODE-02 + P-APP-VERCEL-01 + P-PAGES-VERCEL-01 in one shot.

Closes the 4-framework matrix for ts-sdk-tracing's design input: Node + Next.js App Router (raw + vercel-otel) + Next.js Pages Router (raw + vercel-otel) + TanStack Start. 6 spike apps total covering the modern TS framework surface for AI SDK v6 + raw OTel + Agenta.

TanStack Start specifics (vs Next.js):

- No `instrumentation.ts` register hook. Instrumentation fires via being the FIRST import in `src/server.ts`, unenforced by the framework. A single auto-formatter import-sort silently disables tracing with no warning. Captured as P-TANSTACK-01.
- No per-route edge runtime opt-in (`export const runtime = "edge"`). Runtime is selected at the Nitro preset level for the entire server. Edge probe deferred; captured as P-TANSTACK-02 (coverage gap, not a silent failure).
- `createStartHandler()` return shape mismatch: official docs show `export default createStartHandler(...)` but the dev plugin needs `{fetch: ...}`. Lost ~30min debugging; required reading the framework's own default-entry source to find the working shape. Captured as P-TANSTACK-03.

Stream sink IS identical to App Router (fetch Response via `result.toUIMessageStreamResponse()`), so the per-call tracing path is identical: 4/4 nodejs canonical assertions PASS unchanged from Phase 2a.

Result: 6 spike apps, 11 ecosystem pain entries (6 silent-failure shaped), 5 cross-cutting takeaways. The fifth takeaway is new from Phase 4: the SDK should ship per-framework adapter wrappers (`withAgentaInstrumentation(handler, opts)`) so each framework's instrumentation seam is invariant-by-construction. Three TanStack pain entries collapse to one wrapper.

Phase 4 closes spike scope. SDK design phase can now start with confidence: pain log saturated across framework variations, no new pain categories expected from further frameworks (Hono, Remix, SvelteKit, etc. are out of spike scope per Decision 4).

App runs on its own port (3105).
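One way the "invariant-by-construction" idea could work is sketched below. Only the name `withAgentaInstrumentation(handler, opts)` comes from the commit message; the `Handler` type, the `init` option, and the once-only guard are assumptions made for illustration. The point is that setup runs when the handler is wrapped, so it no longer depends on an import being first in `src/server.ts`.

```typescript
// Hypothetical sketch; only the wrapper name comes from the commit message.
type Handler<Req, Res> = (req: Req) => Res;

interface InstrumentationOptions {
  init: () => void; // stand-in for tracer provider / exporter setup
}

// Module-level guard so setup runs at most once, however many handlers are wrapped.
let instrumentationReady = false;

function withAgentaInstrumentation<Req, Res>(
  handler: Handler<Req, Res>,
  opts: InstrumentationOptions,
): Handler<Req, Res> {
  // Setup happens at wrap time, before the wrapped handler can receive any request.
  if (!instrumentationReady) {
    opts.init();
    instrumentationReady = true;
  }
  return (req: Req): Res => handler(req);
}
```

Because the only way to obtain the exported handler is through the wrapper, an import-sort that reorders the file cannot move the first request ahead of instrumentation setup.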

…auses

Source-dived @vercel/otel@2.1.2 and ai@6.0.177 to pin down both deferred investigations. Updates the pain log Notes sections with empirical mechanisms and the summary's "What didn't, and why" table rows.

P-APP-RAW-01 (raw OTel + App Router edge: zero spans ever): @vercel/otel's CompositeSpanProcessor.onStart enrolls forceFlush into globalThis[Symbol.for("@vercel/request-context")].get().waitUntil(...) at root-span open. That's the Vercel-runtime primitive backing unstable_after: it defers isolate freeze until the registered promise resolves. Our raw setup uses `after(() => forceFlush())`, which runs the callback but does NOT enroll the promise into the runtime tracker, so the edge isolate freezes the moment Response returns and the OTLP fetch is killed mid-flight. None of the 3 originally posed hypotheses was correct as written; hypothesis 2 (after() runs too late) was closest but mechanism-wrong. keepalive: true is a red herring (vercel/otel doesn't set it either).

P-PAGES-VERCEL-01 (Pages + vercel-otel + pipeUIMessageStreamToResponse + streamText: empty ag.metrics.tokens): @vercel/otel's CompositeSpanProcessor.onEnd force-ends every still-open child span when the Next.js root SERVER span ends. AI SDK v6 streamText writes ai.usage.* attrs INSIDE flush() right before rootSpan.end(). pipeUIMessageStreamToResponse returns synchronously while the stream is still draining (writeToServerResponse fires read() without awaiting), so the SERVER span ends BEFORE flush() runs. The force-end then kills ai.streamText, and AI SDK's subsequent setAttributes({ai.usage.*}) no-ops on the ended span per OTel spec. App Router's toUIMessageStreamResponse keeps the SERVER span alive until the response body drains (Next awaits the stream), so flush() lands before the force-end. Raw OTel has no CompositeSpanProcessor and no force-end logic. The 4-way collision is structural.

Both findings stay in observation space. Solution-space updates deferred per current direction.

Open questions logged in pain-log.md:

- whether requestContext symbol is populated in `next dev` vs only deployed Vercel infra (our 10-15s arrival could be incidental BatchProcessor timing rather than waitUntil-enrolled flush)
- mechanism of P-PAGES-VERCEL-01 traced from source, not instrumented at runtime; a 1-line probe patching CompositeSpanProcessor.onEnd would empirically confirm
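The ordering problem behind P-PAGES-VERCEL-01 can be reproduced with a toy span model. This is not OpenTelemetry or AI SDK code: `ToySpan`, `runRequest`, and the attribute name `ai.usage.outputTokens` are stand-ins chosen for illustration. The only behaviour borrowed from the description above is that attribute writes after `end()` are ignored.

```typescript
// Toy model; ToySpan is not the OpenTelemetry Span class.
class ToySpan {
  readonly name: string;
  readonly attributes: Record<string, unknown> = {};
  private ended = false;

  constructor(name: string) {
    this.name = name;
  }

  setAttribute(key: string, value: unknown): void {
    if (this.ended) return; // writes after end() are silently ignored
    this.attributes[key] = value;
  }

  end(): void {
    this.ended = true;
  }
}

type Step = "flush" | "serverSpanEnd";

// Runs the two events in the given order and returns what the stream span recorded.
function runRequest(order: Step[]): Record<string, unknown> {
  const server = new ToySpan("POST /api/chat");
  const stream = new ToySpan("ai.streamText");

  for (const step of order) {
    if (step === "flush") {
      // Stand-in for usage attributes being written when the stream finishes.
      stream.setAttribute("ai.usage.outputTokens", 42); // placeholder key and value
      stream.end();
    } else {
      // Stand-in for a processor that force-ends open children when the root ends.
      server.end();
      stream.end();
    }
  }
  return stream.attributes;
}
```

Calling `runRequest(["flush", "serverSpanEnd"])` keeps the usage attribute, which corresponds to the App Router ordering; `runRequest(["serverSpanEnd", "flush"])` returns an empty object, which corresponds to the empty token metrics seen on Pages Router.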

Researches both vendors across 8 dimensions (package layout, init, tracing API, AI provider integrations, evals/datasets/prompts, export model + edge runtime, type safety, design opinions). Maps findings against the active ts-sdk-tracing spike's 11 pain entries and distills 10 explicit RFC decision points plus 5 ranked differentiation opportunities.

Headline findings:

- AI SDK v6 streamText + abort flush is an open gap for both competitors (Langfuse issue #12643, Braintrust steers to OTel mode where same problem reappears): strongest differentiator
- Edge runtime tracing: Langfuse unsupported, Braintrust ships per-runtime conditional exports; pattern worth lifting
- Both ship eval orchestration in-SDK; Braintrust adds a CLI runner
- Langfuse v4/v5 demonstrates OTel-native end-to-end is viable; paid for in three breaking releases (cautionary tale on migration)

Inputs to the upcoming agenta ts-sdk RFC.

v1 of the doc was synthesis of web research and contained ~18 incorrect or imprecise claims. v2 is a deep source-code audit of both repos cloned locally with file:line citations on every load-bearing claim.

Key corrections (Braintrust):

- Span types: 11 not 6 (adds automation/facet/preprocessor/classifier/review)
- Wire endpoint: `logs3` not `/logs`; log upload batch is 100 not 1000
- `wrapTraced` generator handling is `function*`/`async function*` only, NOT arbitrary AsyncIterable; does not solve AI SDK v6 case
- AI SDK stream handling routes through `diagnostics_channel`, not generators
- Zero AbortSignal handling anywhere in `js/src/wrappers/ai-sdk/`
- `dotenv` is CLI-only, never auto-loaded at runtime
- `filterAISpans` is opt-in, off by default
- CLI binary is `braintrust`, not `bt`
- `BraintrustExporter` wraps `BraintrustSpanProcessor`; HTTP layer is upstream OTLPTraceExporter
- `engines` and `sideEffects` not declared anywhere

Key corrections (Langfuse):

- `@langfuse/tracing` is Node ≥ 20, not "Universal"
- `LANGFUSE_BASEURL` still read as legacy fallback
- 10 observation types but only 2 attribute shapes; remaining 8 are type aliases of `LangfuseSpanAttributes`
- 16 deprecated method-name aliases preserved in `LangfuseClient`; migration tax softer than commonly framed
- Default batching uses upstream OTel defaults (512 / 5000ms) unless env vars set
- `LANGFUSE_FLUSH_AT`/`_INTERVAL` control both span processor AND score queue (env var collision)

Smoking gun: source-level confirmation neither codebase handles AI SDK v6 streaming abort. Langfuse's `wrapAsyncIterable` ends generation only on for-await loop completion; e2e tests pass via manual forceFlush(). Braintrust's wrapAISDK has no abort-aware logic. Confirms strongest differentiation opportunity for agenta.

New findings worth lifting into RFC:

- Braintrust `Symbol.for("braintrust-state")` globalThis state pattern (kills multi-copy / Next.js dev mode footguns)
- Langfuse stale-while-revalidate prompt cache
- Langfuse `propagateAttributes` observation-centric attribute propagation
- Langfuse `setLangfuseTracerProvider` escape hatch for `@vercel/otel`
- Langfuse mirroring of `user.id`/`session.id` to unprefixed OTel-standard keys
- Per-attribute async-aware mask function applied before media extraction

17 explicit RFC decision points + 6 ranked differentiation opportunities.
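The globalThis state pattern in the first finding can be sketched generically. The key `"example-sdk-state"` and the `SharedState` shape are hypothetical, and this is not Braintrust's code; it only shows why a `Symbol.for` key survives duplicate copies of a module, since every copy resolves the same registry symbol and therefore the same object.

```typescript
// Generic sketch of the pattern; the key and state shape are hypothetical.
interface SharedState {
  initialized: boolean;
  initCount: number;
}

function getSharedState(): SharedState {
  // Symbol.for reads the global symbol registry, so duplicate module copies get the same key.
  const key = Symbol.for("example-sdk-state");
  const holder = globalThis as unknown as Record<symbol, SharedState | undefined>;
  let state = holder[key];
  if (state === undefined) {
    state = { initialized: false, initCount: 0 };
    holder[key] = state;
  }
  return state;
}

// Returns true only for the call that actually performed initialization.
function initOnce(): boolean {
  const state = getSharedState();
  if (state.initialized) return false;
  state.initialized = true;
  state.initCount += 1;
  return true;
}
```

A module-level `let initialized = false` would be re-created for each bundled copy or dev-mode reload of the module, whereas the state stored on `globalThis` is found again by whichever copy runs next.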

…ion buries AI SDK spans in UI)

Discovered during a per-phase re-run with isolated API keys (each phase's traces went to its own Agenta key for clean comparison). All four Next.js spike screenshots showed `POST /api/chat...`, `executing api r...`, `GET /api/sentin...` as the trace-list rows with empty Inputs/Outputs columns, where Phase 1 (Node) and the v4 published example showed `ai.streamText` / `ai.generateText` with the prompt + response fully visible.

Verified the cause via direct `POST /api/spans/query` calls for one assertion-3 trace per phase. Confirmed structure:

- Phase 2a/2b (App Router): 7 spans, ai.streamText at depth 2 under `POST /api/chat/route` → `executing api route (app) /api/chat/route`
- Phase 3a/3b (Pages Router): 4-5 spans, ai.streamText at depth 2 under `POST /api/chat` → `executing api route (pages) /api/chat`
- Phase 4 TanStack Start: 2 spans, ai.streamText IS the root (L0)
- Phase 1 Node: ai.streamText IS the root

Identical tree shape in raw OTel AND @vercel/otel proves the HTTP + handler wrapper spans come from Next.js 15's built-in OTel auto-instrumentation, not from the user's OTel library choice. TanStack Start's Vite/Nitro stack does NOT emit auto-instrumented HTTP spans, so its traces display correctly in Agenta's "Root" view.

Inputs/outputs/tokens/metadata are all present in Agenta, on the ai.streamText span at depth 2. The UI's default "Root" filter just doesn't surface them: the HTTP root span carries only ag.type.trace='invocation' + ag.metrics.duration.cumulative, no payload.

Why we missed it earlier: our verifyTrace harness queries by attribute (ag.user.id) and matches by span name, finding ai.streamText regardless of hierarchy. Programmatic assertions pass; the UI experience is degraded. Pre-existing pain entries focused on data loss; this is the inverse: data is preserved, but the default lens hides it.

Added "common" as a new framework prefix (P-COMMON-NN) for cross-framework / backend-side findings.

Updated:

- pain-log.md: schema docstring, numbering scheme, P-COMMON-01 entry
- validate-pain-log.mjs: accept "common" framework value
- summary.md: current status paragraph + What didn't table row + pain entry count (11 → 12)
- status.md: Last Updated note
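The gap between "assertions pass" and "the UI looks empty" comes down to matching by name versus filtering by root. A small sketch over a flat span list, using a hypothetical `FlatSpan` shape (not the Agenta query response format), makes the difference concrete:

```typescript
// Hypothetical flat span shape; the real /api/spans/query response may differ.
interface FlatSpan {
  id: string;
  parentId: string | null;
  name: string;
}

// Counts how many parent hops separate a span from its root.
function depthOf(spans: FlatSpan[], id: string): number {
  const byId = new Map<string, FlatSpan>();
  for (const span of spans) byId.set(span.id, span);

  let current = byId.get(id);
  if (current === undefined) throw new Error(`unknown span id: ${id}`);

  let depth = 0;
  // The depth bound guards against cycles in malformed data.
  while (current !== undefined && current.parentId !== null && depth < spans.length) {
    current = byId.get(current.parentId);
    depth += 1;
  }
  return depth;
}

// What a "Root" view shows: only spans without a parent.
function rootView(spans: FlatSpan[]): FlatSpan[] {
  return spans.filter((span) => span.parentId === null);
}

// What a name-matching harness does: finds the span wherever it sits in the tree.
function findByName(spans: FlatSpan[], name: string): FlatSpan | undefined {
  return spans.find((span) => span.name === name);
}
```

For an App Router style tree, `findByName(spans, "ai.streamText")` succeeds while `rootView(spans)` returns only the HTTP span, so a harness built on the former cannot notice what the latter hides.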

…te Langfuse precedent

Pre-existing P-COMMON-01 entry claimed the root cause was Next.js's built-in OTel auto-instrumentation but didn't validate the claim or compare against how other LLM observability platforms handle the same symptom. Filling in both via web research:

Not-a-wiring-mistake (confirmed):

- Our instrumentation.ts in both raw OTel and @vercel/otel variants matches Vercel's own ai-chatbot template + the Next.js OTel docs "Manual OpenTelemetry configuration" sample line-for-line. No canonical example ships span filtering.
- The wrapper span names (BaseServer.handleRequest, AppRouteRouteHandlers.runHandler, NextNodeServer.findPageComponents, .startResponse) come from packages/next/src/server/lib/trace/constants.ts: Next.js itself, not @opentelemetry/instrumentation-http and not @vercel/otel.
- Next.js docs explicitly state: "the root server span labeled as [http.method] [next.route]. All other spans from that particular trace will be nested under it." It's by design.
- NEXT_OTEL_VERBOSE=0 (default) keeps the 5 wrapper spans we observed; setting =1 adds MORE. NEXT_OTEL_FETCH_DISABLED=1 suppresses only the outbound fetch span. No documented knob removes the root.

How others handle the same symptom:

- Langfuse @langfuse/otel v5+ ships a LangfuseSpanProcessor that drops every span whose instrumentationScope.name doesn't match a known LLM-library prefix (`ai`, `openinference`, anthropic, openai, langsmith, litellm, etc.) and lacks gen_ai.* attrs. Next.js wrapper spans get silently dropped at the processor BEFORE export. Their FAQ (`/faq/all/unwanted-http-database-spans`) addresses this exact pain and notes pre-v5 SDKs exported them and counted toward billing, i.e. they evolved to filter after running into the same problem we are.
- Braintrust uses wrapper-based observability (wrapAISDK, traced()) rather than instrumentation.ts auto-instrumentation. The AI call is wrapped explicitly so the wrapper span IS the trace root by construction. Different paradigm; no filtering needed.
- Neither documents a user-side instrumentation.ts modification; the filtering (Langfuse) or wrapping (Braintrust) is SDK-side, applied uniformly across customer apps.

Updates:

- pain-log.md P-COMMON-01: new "Confirmed not-a-wiring-mistake" block with reference URLs; rewrote the "What would be ideal" sketch to show three observed ecosystem approaches (Langfuse filter, Braintrust wrappers, backend-side UI), all in observation space, no recommendation
- summary.md "What didn't" table P-COMMON-01 row: tightened root-cause column to cite the 5 supporting evidence points, mentions Langfuse precedent

…flag inference

Previous commit (626172c) made the claim "Langfuse explicitly addressed it on their SDK side after running into it themselves" without distinguishing what's directly quoted vs inferred. Replacing the loose paraphrase with verifiable references.

Verified two sources (web-fetched 2026-05-11):

1. Langfuse FAQ at langfuse.com/faq/all/unwanted-http-database-spans, which contains two verbatim quotes now inlined in the pain entry:
   - Pre-v5 behavior: "no automatic filtering — Langfuse exports all spans it receives, including HTTP requests, database queries, and framework internals."
   - v5+ behavior: "apply a default span filter that automatically keeps only LLM-related spans and drops HTTP, database, and framework spans — no configuration needed."
2. @langfuse/otel source at unpkg.com/@langfuse/otel/dist/index.mjs, which confirms LangfuseSpanProcessor class + isDefaultExportSpan() filter logic (OR of isLangfuseSpan/isGenAISpan/isKnownLLMInstrumentor) + the 10-entry KNOWN_LLM_INSTRUMENTATION_SCOPE_PREFIXES allowlist (`ai`, `openinference`, `langsmith`, etc.). Code excerpts now in the pain entry.

What was actually NOT verbatim: the "after running into it themselves" phrasing. The FAQ does NOT contain a direct quote saying Langfuse built the filter because they hit the problem internally. That inference rests on (a) the documented pre-v5 vs v5 behavioral change and (b) the existence of a dedicated FAQ page titled "unwanted-http-database-spans": strong indirect evidence but not a quotable claim. Pain entry now flags this distinction explicitly with an "Interpretive note (not a verbatim claim)" subsection.

Braintrust comparison also flagged as agent-research-summarized rather than independently web-verified.
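A filter of the kind described above might be shaped like the sketch below. It is not the @langfuse/otel implementation: the `SpanLike` shape, the function names, the three-entry prefix list (only the prefixes quoted in this commit message), and the exact-or-dotted prefix matching rule are all assumptions, and the `isLangfuseSpan` branch is omitted because its criterion is not described here.

```typescript
// Illustrative only; not the @langfuse/otel implementation.
interface SpanLike {
  scopeName: string;                   // instrumentation scope name
  attributes: Record<string, unknown>;
}

// Only the prefixes quoted in the commit message; the real allowlist is said to have 10 entries.
const LLM_SCOPE_PREFIXES = ["ai", "openinference", "langsmith"];

function isKnownLLMScope(scopeName: string): boolean {
  // Assumed matching rule: exact match, or the prefix followed by a dot.
  return LLM_SCOPE_PREFIXES.some(
    (prefix) => scopeName === prefix || scopeName.startsWith(prefix + "."),
  );
}

function hasGenAIAttributes(span: SpanLike): boolean {
  return Object.keys(span.attributes).some((key) => key.startsWith("gen_ai."));
}

// A span is exported when either predicate holds; everything else is dropped before export.
function shouldExportSpan(span: SpanLike): boolean {
  return isKnownLLMScope(span.scopeName) || hasGenAIAttributes(span);
}
```

Under these assumptions a framework wrapper span with a non-LLM scope name and no gen_ai.* attributes is dropped, while a span from the `ai` scope is kept, leaving it as the top of the exported trace.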

Adds the Vue ecosystem to the spike: Nuxt 4 + Nitro + AI SDK v6 + raw OTel. Mirrors the Next.js spike pattern (4 canonical assertions) so the trace hierarchy / abort handling / instrumentation seam can be compared apples to apples.

Scope decision (single app, not A/B):

- Research showed @vercel/otel is Next.js-only, with no first-party Nuxt path. An A/B "nuxt-vercel" companion would just be testing community packaging.
- Nuxt has no instrumentation.ts hook. Wiring is via Nitro plugin (server/plugins/otel.ts), which runs at Nitro init.

What worked:

- 4/4 canonical assertions PASS (assertion-2's flush window had to be raised from 5s to 30s; see P-NUXT-01 below).
- Trace hierarchy is CLEAN: ai.streamText IS the root, inputs/outputs visible at the top level (2 spans total). P-COMMON-01 (Next.js HTTP auto-instrumentation buries AI SDK spans) does NOT apply to Nuxt, since bare Nitro doesn't emit HTTP server spans. Same shape as TanStack Start.

What didn't:

P-NUXT-01: H3 v2 RC has no working abort-signal propagation in the Nitro Node-runtime preset. Verified empirically:

- event.req.signal: types claim it exists; runtime says undefined
- event.runtime.node.req: typed; runtime says undefined
- event.node.req 'close' event: fires only AFTER stream drains naturally, not when client disconnects mid-stream

streamText receives no abortSignal, model keeps generating after client abort, parent span ends ~7-15s late (instead of <1s like other phases). Trace DOES arrive, just outside the interactive window. Production cost-control implication: today's Nuxt users keep paying for tokens generated after the user already closed the tab.

Updates:

- web/examples/nuxt-raw/: full app (package.json, nuxt.config.ts, server/{plugins/otel.ts, api/chat.post.ts, api/sentinels.get.ts, lib/ai.ts}, app.vue, 4 assertion scripts, tsconfig.json, .env.example, .gitignore)
- pain-log.md: new "Nuxt 3/4 (Vue + Nitro)" section + P-NUXT-01 entry with verbatim runtime probe output
- summary.md: Phase 5 section, P-NUXT-01 row in "What didn't" table, assertion count 23/24 → 27/28, pain count 12 → 13
- status.md: Phase 5 progress tracker row, Last Updated note
- scripts/validate-pain-log.mjs: accept "nuxt" as framework value

Port: 3106 (avoids 3101-3105 used by prior spike apps).

… evals, scoring, media, annotations, sessions, functions, CLI, auth, config, cost, query)

v2 was tracing-focused. v3 adds peer sections for every non-tracing surface both SDKs ship, with file:line citations from a fresh source audit of both repos.

New sections (13 surfaces):

- §6 Prompts: Braintrust loadPrompt + xact_id versioning + two-tier cache (memory + gzipped disk); Langfuse PromptManager + SWR cache with concurrent-refresh dedup
- §7 Datasets: Braintrust ObjectFetcher + full CRUD + snapshots; Langfuse manager-only-get + CRUD on raw api.datasets.*
- §8 Evals: Braintrust ~25 Evaluator fields + rolling concurrency + byte-backpressure; Langfuse batched (allSettled) + silent eval failure drop
- §9 Scoring: Braintrust span-attached, no separate queue; Langfuse fire-and-forget queue + 5 ScoreDataTypes + ScoreConfig schemas
- §10 Media: Braintrust Attachment refs only; Langfuse auto-extracts base64 from 6 attribute slots + presigned URL + sha256 dedup
- §11 Annotations: Braintrust NONE (gap); Langfuse raw API only, annotations = scores with queueId
- §12 Sessions/users: Braintrust metadata-only (gap); Langfuse first-class via propagateAttributes → unprefixed OTel user.id/session.id
- §13 Functions: Braintrust first-class Project DSL with tools/prompts/parameters/scorers builders + server-side invoke(); Langfuse NONE
- §14 CLI: Braintrust eval/push/pull + --dev mode server; Langfuse no CLI
- §15 Auth, multi-project, orgs: Braintrust per-call state for multi-tenant; Langfuse one-client-per-project + SCIM
- §16 Configuration, secrets: NEITHER ships managers; agenta Python SDK is the outlier
- §17 Cost: both server-side; Braintrust normalizes token metrics, Langfuse untyped USD bag
- §18 Query/read-back: Braintrust ObjectFetcher AsyncIterable + BTQL; Langfuse typed api.*.* + Cube-style metric DSL

Headline non-tracing findings:

- Braintrust has surfaces Langfuse lacks (functions/tools/invoke, CLI, --dev mode, disk-cached prompts) but ALSO surfaces Braintrust lacks that Langfuse has (annotations, sessions/users, manager pattern)
- Both punt on secrets/config; agenta Python SDK is the outlier here. RFC recommendation is to defer ConfigManager/SecretsManager port to TS SDK v2
- Both ship eval orchestration in-SDK; Braintrust's Evaluator field set is much richer (~25 fields with rolling concurrency + byte backpressure) than Langfuse's batched allSettled approach
- Langfuse silently drops failed evaluator results; agenta should surface as dataType:"ERROR" scores instead

RFC decisions expanded from 17 → 75 across all surfaces. 12 ranked differentiation opportunities (up from 6).

… backend-fixable analysis

Three artifacts shipped:

1. Broken baseline at examples/node/observability-mastra/: same wiring shape as the published v4 quickstart, but emits ZERO traces for Mastra agents. README walks through both broken paths (vendored AI SDK noopTracer + non-OTel ObservabilityBus) and points to the fix.
2. Working PoC at web/examples/mastra-node/: AgentaMastraExporter (~150 lines) subscribes to Mastra's ObservabilityBus and re-emits spans through globally-registered OTel. 4/4 canonical assertions PASS. Clean 4-level Mastra tree lands in Agenta with inputs/outputs/userId/sessionId propagated.
3. Backend-fixable analysis (Option C complete): full matrix in summary.md "Backend-fixable subset (AI SDK)" section, plus per-entry annotation on all 16 pain entries (3 yes, 1 partial, 12 no). Lets future analysis filter the wedge by axis.

Adds:

- Pain entries P-MASTRA-01/02/03 + new `mastra` framework prefix
- "Strategic alternative: backend-led integration" section in summary (Mahmoud-aligned framing)
- "Backend-fixable subset (AI SDK)" section in summary
- Cross-cutting takeaway #6 (Mastra requires different integration shape)
- Phase 6 status tracker row
- Mastra framework added to pain-log schema + validator

Recommended implementation order from the analysis:

1. Backend fixes first (P-NODE-01 + P-NODE-03 + P-COMMON-01): 3 wins, broadest benefit, sharpens JS SDK scope from 13 things to 10 in 2 categories.
2. AI SDK lifecycle wrapper as JS SDK v0: single mechanism solves the 3 highest-severity JS-side silent failures.
3. Edge runtime helper as JS SDK v0.1
4. Framework adapter shims as JS SDK v0.2+

Outstanding scope decision deferred: remaining 6 Mastra-in-framework spike apps (Mastra inside Next.js, Pages, TanStack, Nuxt) not pursued. The strategic answer doesn't depend on framework-specific Mastra data since Mastra's ObservabilityBus operates above the framework HTTP layer.

…K-direct apps

Wires Braintrust as a second OTLP destination alongside Agenta so the SAME source span data fans out to both backends. Lets us directly compare how each LLM observability platform displays IDENTICAL trace input: no application code changes required, just an additional SpanProcessor in each instrumentation file.

Apps wired (8):
- examples/node/observability-vercel-ai/ (root v4)
- web/examples/node-vercel-ai-v6/ (Phase 1)
- web/examples/nextjs-app-router-raw/ (Phase 2a)
- web/examples/nextjs-app-router-vercel/ (Phase 2b)
- web/examples/nextjs-pages-router-raw/ (Phase 3a)
- web/examples/nextjs-pages-router-vercel/ (Phase 3b)
- web/examples/react-tanstack-start/ (Phase 4)
- web/examples/nuxt-raw/ (Phase 5)

Pattern: each app's instrumentation file conditionally appends a second SpanProcessor wrapping an OTLPTraceExporter pointed at Braintrust's endpoint (https://api.braintrust.dev/otel/v1/traces) with Bearer auth + x-bt-parent header naming the destination project. Reads BRAINTRUST_API_KEY + BRAINTRUST_OTLP_URL from env. When the key is unset, behaviour matches the original baseline.

For @vercel/otel apps (Phase 2b + 3b): wrap both exporters in BatchSpanProcessor (matching @vercel/otel's default traceExporter behaviour); see the accidental-finding section below.

Assertion results after wiring (29/32 PASS):
- Phase 1 / 2a / 3a / 3b / 4 / 5: 4/4 PASS each
- Phase 2b: 3/4 (a2 expected FAIL, P-APP-VERCEL-01 preserved)
- Root v4: 1 trace exported (no canonical assertions)

Accidental finding: switching @vercel/otel from `traceExporter: x` (default BatchSpanProcessor) to `spanProcessors: [SimpleSpanProcessor(x)]` sidesteps P-APP-VERCEL-01 entirely. Assertion-2 went from FAIL to PASS when I first wired Phase 2b with SimpleSpanProcessor. SimpleSpanProcessor exports synchronously per span, before BatchProcessor's flush window would have closed. This is a legitimate user-visible workaround for @vercel/otel users hitting P-APP-VERCEL-01 today (trade-off: per-span HTTP round-trip latency, ~50-200ms per call). Reverted to BatchSpanProcessor in both vercel-otel variants to preserve the baseline pain reproduction. Documented in the P-APP-VERCEL-01 entry's notes as an interim workaround.

Strategic implication: customers on Agenta who want side-by-side comparison with Braintrust can wire it in 8-10 lines of additional instrumentation code with no application-level changes. The same pattern extends to any OTLP-compatible LLM observability backend (Langfuse, Honeycomb, Datadog OTel).

Not covered: Mastra (`web/examples/mastra-node/`). No Braintrust key provided and the integration shape differs (would need a Braintrust-flavoured Mastra BaseExporter, or wrapping Mastra's vendored AI SDK with `wrapAISDK`). Tracked as open work.

Adds:
- Phase 7 section in summary.md
- Interim-workaround note on P-APP-VERCEL-01 (SimpleSpanProcessor sidestep)
- Status tracker row for Phase 7
- BRAINTRUST_API_KEY + BRAINTRUST_OTLP_URL placeholders in every .env.example
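The conditional second destination described in this commit can be sketched roughly as follows. This is a dependency-free illustration, not the spike apps' instrumentation code: `OtlpDestination`, `braintrustDestination`, `buildDestinations`, and the `project` argument are hypothetical names, and the `project_name:` value format for `x-bt-parent` is an assumption that should be checked against Braintrust's OTel documentation. Only the two `BRAINTRUST_*` variables and the endpoint URL come from the commit message.

```typescript
// Hypothetical sketch: builds the list of OTLP destinations from env values.
type Env = Record<string, string | undefined>;

interface OtlpDestination {
    name: string;
    url: string;
    headers: Record<string, string>;
}

const DEFAULT_BRAINTRUST_URL = "https://api.braintrust.dev/otel/v1/traces";

function braintrustDestination(env: Env, project: string): OtlpDestination | undefined {
    const apiKey = env.BRAINTRUST_API_KEY;
    // Key unset: add nothing, so the pipeline matches the original baseline.
    if (!apiKey) return undefined;
    return {
        name: "braintrust",
        url: env.BRAINTRUST_OTLP_URL || DEFAULT_BRAINTRUST_URL,
        headers: {
            Authorization: `Bearer ${apiKey}`,
            // Assumed value format for the destination-project header.
            "x-bt-parent": `project_name:${project}`,
        },
    };
}

function buildDestinations(env: Env, agenta: OtlpDestination, project: string): OtlpDestination[] {
    const destinations: OtlpDestination[] = [agenta];
    const braintrust = braintrustDestination(env, project);
    if (braintrust) destinations.push(braintrust);
    return destinations;
}
```

In a real instrumentation file, each entry would presumably be turned into one span processor wrapping an `OTLPTraceExporter` constructed with that `url` and `headers`; the Agenta entry is passed in because its URL and auth header are not spelled out here.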

…anProcessor) + clarify P-APP-VERCEL-01 vs P-PAGES-VERCEL-01 nuance

User push-back surfaced two issues with my Phase 7 commit (a1799db):

1. The Agenta docs at docs/docs/integrations/frameworks/vercel-ai-sdk/observability.mdx already use SimpleSpanProcessor in the canonical example (line 74). I had been WebFetching live agenta.ai docs to answer questions about doc coverage instead of reading the repo source: inexcusable methodology when the source is in-tree.
2. Phase 7's Braintrust wiring for @vercel/otel apps (2b + 3b) used BatchSpanProcessor "to preserve the failing baseline", but the baseline I should be preserving is what the docs recommend, not what @vercel/otel defaults to.

Fixes:
- Revert Phase 2b + 3b to SimpleSpanProcessor in the @vercel/otel spanProcessors arrays (matches docs canonical example)
- Both apps now 4/4 PASS; P-APP-VERCEL-01 stops reproducing because Simple sidesteps the BatchSpanProcessor flush race
- Phase 3b token check stays loosened: verified empirically that P-PAGES-VERCEL-01 reproduces under SimpleSpanProcessor too (the CompositeSpanProcessor.onEnd force-end is independent of processor choice)

Two cleanly-separated findings drop out:
- **P-APP-VERCEL-01 is processor-dependent**: Batch reproduces it, Simple sidesteps. Primarily a docs gap (Agenta docs don't have a @vercel/otel-specific section explaining the override).
- **P-PAGES-VERCEL-01 is processor-independent**: reproduces under both. Genuine JS-side wedge or backend trace-enrichment.

Pain log updates:
- P-NODE-02: docs-coverage clarification. Agenta docs already use Simple; the pain is for users who reach for "production-grade" Batch without reading Agenta docs. Strategic framing: SDK should ship a streaming-aware Batch processor (perf optimisation on top of the docs workaround), not a bug fix.
- P-APP-VERCEL-01: clarified who hits it (users following @vercel/otel defaults without reading Agenta docs) + minimum docs fix listed.
- P-PAGES-VERCEL-01: added "processor-choice independence verified" note; empirically reproduces under Simple too.

Summary doc:
- Phase 7 section rewritten with corrected 32/32 PASS results
- Methodology correction explicitly called out
- New "Doc-coverage findings to action" subsection lists 4 specific improvements to docs/docs/integrations/frameworks/vercel-ai-sdk/observability.mdx (line 74 SimpleSpanProcessor not explained, no @vercel/otel section, no pipeUIMessageStreamToResponse warning, no metadata-to-toolCall gap mention)

What I should have done from the start: read docs/docs/integrations/frameworks/vercel-ai-sdk/observability.mdx and the docs/design/vercel-ai-adapter/ directory BEFORE designing the spike. The spike's findings are still valid (they reveal real bugs the docs don't address), but the framing for each pain entry is sharper when cross-referenced against existing docs.
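The processor-dependence of P-APP-VERCEL-01 can be pictured with a toy model. The two classes below only imitate the timing described in this commit (export on span end versus buffer until flush) under the assumption that the runtime stops before the batch flush runs. They are not the `@opentelemetry/sdk-trace-base` implementations, and every name in the block is made up for illustration.

```typescript
// Toy model only: NOT the real OpenTelemetry span processors.
class ToySimpleProcessor {
    readonly exported: string[] = [];

    onEnd(spanName: string): void {
        // Hand the span to the exporter as soon as it ends.
        this.exported.push(spanName);
    }
}

class ToyBatchProcessor {
    readonly exported: string[] = [];
    private queue: string[] = [];

    onEnd(spanName: string): void {
        // Buffer the span until the next flush.
        this.queue.push(spanName);
    }

    flush(): void {
        this.exported.push(...this.queue);
        this.queue = [];
    }
}

// Simulate a request whose runtime is torn down before any batch flush runs.
function simulateAbortedRequest(processor: {onEnd(spanName: string): void}): void {
    processor.onEnd("ai.streamText.doStream");
    processor.onEnd("ai.streamText");
    // No flush() here: the toy runtime stops first.
}

const simple = new ToySimpleProcessor();
const batch = new ToyBatchProcessor();
simulateAbortedRequest(simple);
simulateAbortedRequest(batch);

console.log(simple.exported.length); // 2
console.log(batch.exported.length); // 0
```

The model says nothing about P-PAGES-VERCEL-01, which the commit reports as reproducing under both processors.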

v3 was source-audit-only. v4 layers empirical findings from the ts-sdk-chore/example-apps branch, where 8 spike apps fan IDENTICAL OTel span data out to Agenta + Braintrust + Langfuse via parallel SimpleSpanProcessors on one NodeTracerProvider.

Key changes:

New §0.5 "Empirical evidence: tri-export across 8 spike apps":
- Wiring cost: ~8 LoC for +Braintrust, ~12 LoC for +Langfuse on top of agenta baseline. Tri-export pattern works mechanically
- Verified trace counts via REST API for all 8 apps (2-33 events each on Braintrust, 1-12 traces each on Langfuse)
- Side-by-side: same ai.streamText span, three backends. Same data, three rendering choices
- P-BRAINTRUST-01: silent data-plane mismatch. US-default OTLP endpoint silently swallows spans for EU-plane orgs; OTel returns HTTP 200 so SimpleSpanProcessor never surfaces failure. Cost ~3 hours of empirical investigation to find
- P-LANGFUSE-01: no server-side scope filter. Earlier v3 claim ("Langfuse drops non-LLM scope") was wrong; that filter is @langfuse/otel JS-SDK-only. Raw OTLP to Langfuse stores everything, same as Agenta

§5 updates: call out the data-plane gotcha (P-BRAINTRUST-01) inline with the Braintrust wire description. Clarify that Langfuse's isDefaultExportSpan filter is JS-SDK-only, with P-LANGFUSE-01 citation.

§17 update: empirical "Agenta returns ag.metrics.costs = {}" gap with real numbers. Token data lands; just needs server-side rollup.

§22 (Differentiation opportunities) expanded from 12 → 15 ranked:
- Added "Multi-backend delivery health verification" (P-BRAINTRUST-01)
- Added "Server-side scope filter" (P-LANGFUSE-01 → portable to agenta backend, no JS SDK required)
- Added "Cost computation server-side from gen_ai.usage.*"

§21 (RFC decisions) expanded with empirical-driven D76-D80:
- D76: surface per-destination delivery health (P-BRAINTRUST-01 lesson generalizes to any future agenta multi-backend pipeline)
- D77: document data-plane / multi-region selection loudly
- D78: scope-filter logic belongs at ingest/render, NOT as a JS SpanProcessor (cleaner architecture than Langfuse SDK approach)
- D79: case for @agenta/sdk-tracing pivots from "wraps OTel ergonomically" (insufficient differentiation) to "hides config gotchas that silently lose data + solves what raw OTLP can't"
- D80: cost computation server-side, populated from gen_ai.usage.*

Methodology note clarifies the vendor-SDK vs raw-OTLP distinction: many "features" in §§3-13 only fire when the user installs the vendor SDK. The raw OTLP path is what agenta will produce.

Appendix A updated with v3→v4 corrections table.

…e-analysis

Brings the spike's Phase 5-8 work and design-doc updates into the competitive-analysis branch so the v4 doc cross-references resolve locally on a single branch.

Incoming from example-apps:
- Phase 5: web/examples/nuxt-raw/ (Nuxt 4 + Nitro plugin spike app)
- Phase 6: web/examples/mastra-node/ + AgentaMastraExporter PoC at src/agenta-exporter.ts (~150 lines bridging Mastra's ObservabilityBus to OTel)
- Phase 7: Braintrust dual-export wiring across the 8 AI-SDK-direct apps (instrumentation.ts files updated)
- Phase 8: Langfuse tri-export wiring across the same 8 apps
- @vercel/otel apps aligned with Agenta docs (SimpleSpanProcessor)
- examples/node/observability-mastra/: broken-baseline reproducer at the published v4 quickstart layout
- docs(ts-sdk-tracing): P-COMMON-01 (Next.js HTTP auto-instrumentation buries AI SDK spans), Langfuse claim sourcing, P-PAGES-VERCEL-01 nuance clarification

Not included in this merge (still uncommitted in the example-apps worktree at .claude/worktrees/youthful-lumiere-5a93ed):
- docs/design/ts-sdk-tracing/sdk-comparison.md (the canonical empirical comparison doc referenced from v4 of competitive-analysis.md)
- Working-tree edits to pain-log.md (adds P-BRAINTRUST-01 and P-LANGFUSE-01)
- Working-tree edits to summary.md (Phase 8 empirical correction + P-COMMON-01 inline correction)
- Working-tree edits to status.md

Once those are committed on example-apps, a follow-up merge or cherry pick will pull them into this branch.

Conflicts: none. The competitive-analysis branch only touched docs/design/ts-sdk/ (sibling directory to example-apps' work in docs/design/ts-sdk-tracing/ and web/examples/). docs/design/ts-sdk/competitive-analysis.md (v4, this branch's only contribution) is preserved unchanged through the merge.

…rison + empirical pain entries

Phase 8 lands the Langfuse tri-export wiring across all 8 spike apps and surfaces two new pain entries from the empirical verification that motivated this work.

Wiring (8 of 8 apps):
- Add LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_BASE_URL env vars to every .env.example
- Append a conditional third SpanProcessor(OTLPTraceExporter(langfuse)) to each instrumentation file (Node, App Router raw + @vercel/otel, Pages Router raw + @vercel/otel, TanStack Start, Nuxt Nitro plugin). Auth is Basic base64(public:secret); optional x-langfuse-ingestion-version: 4 header
- Refactor the if/else single-vs-dual processor blocks in @vercel/otel apps to a single conditionally-appended spanProcessors array; the same shape scales linearly with backend count
- Root-level examples/node/observability-vercel-ai/ baseline gets the same tri-export wiring so docs can match the spike apps

Empirical verification doc:
- docs/design/ts-sdk-tracing/sdk-comparison.md: full side-by-side comparison of how Agenta, Braintrust, Langfuse handle IDENTICAL OTel span data. Trace counts verified via REST API for all 8 apps. Project<>app mapping table. Wiring cost table (~8 LoC Braintrust, ~12 LoC Langfuse on top of Agenta baseline). Side-by-side ai.streamText field comparison from real API responses

New pain entries (in pain-log.md):
- P-BRAINTRUST-01: silent data-plane mismatch. Braintrust's US-default OTLP endpoint auto-creates projects but silently rejects span payloads when the user's org is on EU plane. Cost ~3 hours of empirical investigation. OTel exporters return HTTP 200; no signal to SimpleSpanProcessor. Generalizable lesson: multi-backend OTel pipelines need REST-API delivery verification, not just span-export success
- P-LANGFUSE-01: no server-side scope filter. Earlier docs (incl P-COMMON-01 inline citation) claimed Langfuse "drops non-LLM scope spans server-side". Empirically wrong. On raw OTLP, Langfuse stores every span including Next.js HTTP wrappers with null input/output. The isDefaultExportSpan filter lives inside @langfuse/otel JS-SDK, not at ingest. Strengthens the backend-fix case for P-COMMON-01

Summary doc updates:
- Phase 8 section gains an empirical-correction callout flagging the pre-correction trace-count claim and pointing at the verified data in sdk-comparison.md
- P-COMMON-01 ecosystem claim inline-corrected to reflect P-LANGFUSE-01 findings

Status doc: Phase 8 marked complete with the corrected trace-count context.

Lockfile update: web/pnpm-lock.yaml regenerated for the workspace package additions across spike apps.

Not included in this commit:
- web/examples/sdk-native-spike/ (untracked): Phase 9 in-progress vendor-native SDK comparison (agenta-raw-otel / braintrust-sdk / langfuse-sdk side-by-side scripts). Commit separately once node_modules is removed and the scripts stabilize.
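The Langfuse headers and the "single conditionally-appended array" shape from the wiring list above might look roughly like this. It is a sketch under stated assumptions: `langfuseDestination` and `appendDefined` are hypothetical names, the `/api/public/otel/v1/traces` path under `LANGFUSE_BASE_URL` is an assumption to confirm against Langfuse's documentation, and `btoa` is used only to keep the block free of Node typings. The Basic `base64(public:secret)` scheme, the env variable names, and the optional ingestion-version header come from the commit message.

```typescript
// Hypothetical sketch of the third (Langfuse) destination.
type Env = Record<string, string | undefined>;

interface OtlpDestination {
    name: string;
    url: string;
    headers: Record<string, string>;
}

function langfuseDestination(env: Env): OtlpDestination | undefined {
    const publicKey = env.LANGFUSE_PUBLIC_KEY;
    const secretKey = env.LANGFUSE_SECRET_KEY;
    const baseUrl = env.LANGFUSE_BASE_URL;
    // Any value missing: skip Langfuse and leave the other destinations untouched.
    if (!publicKey || !secretKey || !baseUrl) return undefined;
    return {
        name: "langfuse",
        // Assumed OTLP traces path; strip trailing slashes from the base URL first.
        url: `${baseUrl.replace(/\/+$/, "")}/api/public/otel/v1/traces`,
        headers: {
            // Basic auth over "public:secret"; keys are assumed to be ASCII.
            Authorization: `Basic ${btoa(`${publicKey}:${secretKey}`)}`,
            // Optional header mentioned in the commit message.
            "x-langfuse-ingestion-version": "4",
        },
    };
}

// One array, appended to conditionally: adding a backend adds one element.
function appendDefined<T>(base: T[], ...candidates: Array<T | undefined>): T[] {
    const result = [...base];
    for (const candidate of candidates) {
        if (candidate !== undefined) result.push(candidate);
    }
    return result;
}
```

A call such as `appendDefined([agenta], braintrust, langfuseDestination(env))` would then replace the if/else single-vs-dual branches with one list, which is the linear scaling the refactor bullet refers to.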

Follow-up to the earlier merge of example-apps. Brings in the just-landed Phase 8 commit (bc3010f) which adds:
- Langfuse tri-export wiring across all 8 spike apps (env.example + instrumentation files)
- docs/design/ts-sdk-tracing/sdk-comparison.md (canonical empirical comparison referenced by v4 of competitive-analysis.md)
- P-BRAINTRUST-01 (silent data-plane mismatch) and P-LANGFUSE-01 (no server-side scope filter) pain entries
- Phase 8 empirical correction in summary.md
- Root-level examples/node/observability-vercel-ai/ tri-export wiring

With this merge, every cross-reference in competitive-analysis.md v4 resolves locally and the empirical evidence is reproducible from a single branch.

Conflicts: none expected (competitive-analysis branch only touched docs/design/ts-sdk/ which is a sibling directory).

Catches the branch up to main (220 commits behind at merge time, last main commit: 75dc5e2 release/v0.99.9). Brings in the Fern-generated TS client work + everything else that landed since the ts-sdk-tracing spike branched off in early May.

Conflicts resolved:
- web/pnpm-workspace.yaml: both sides modified. Ours added the `examples/*` + `examples/.shared/*` workspace globs for the spike apps; main added an `allowBuilds:` config block listing per-package build allowances. Union of both kept.
- web/pnpm-lock.yaml: 219 conflict markers, mass conflicts throughout. Took main's version. The spike apps' workspace packages are not reflected in the lockfile; running `pnpm install` from web/ will regenerate the lockfile to include them. The spike apps don't need to run to feed the RFC (the docs + sdk-comparison empirical data carry the findings), so this is an acceptable trade-off for the RFC office-hours session.

Branch is now self-contained with:
- Latest main (Fern-generated TS SDK client + all recent platform work)
- Full ts-sdk-tracing spike (8 apps with tri-export wiring)
- 19 pain entries including the v4 empirical additions (P-BRAINTRUST-01, P-LANGFUSE-01)
- Canonical sdk-comparison.md (empirical Agenta vs Braintrust vs Langfuse)
- v4 competitive-analysis.md (1709 lines, 80+ RFC decisions)

If running the spike apps locally is needed, `cd web && pnpm install` to regenerate the lockfile with the spike apps' workspace entries.

The previous commit (bc3010f) introduced 3-line wraps on Buffer.from(...).toString("base64") in 4 instrumentation files. Project prettier config wants 1-line. Pre-commit's prettier --write keeps re-flattening them on every commit attempt, which then blocks subsequent commits via pre-commit's stash/restore conflict path. No behavior change — pure formatting.

…hared/

The verification harness at web/examples/.shared/agenta-verify/ existed only as working-directory state in this worktree since Phase 0 of the ts-sdk-tracing spike. 8 spike-app package.json files declare "@agenta/spike-verify": "workspace:*" and 32 test files import verifyTrace from it, but the package was never on any git ref. git status never surfaced it because the root .gitignore's `.*` rule silently swallowed the dot-prefixed directory. The dot-prefix is intentional (it carves the workspace out of the default `examples/*` glob per the comment in web/pnpm-workspace.yaml), but no exception was ever added to .gitignore to match.

This commit:
- adds an un-ignore block in .gitignore for web/examples/.shared/ while keeping node_modules/, .turbo/, dist/, .vite/ ignored
- commits the harness sources (1,157 LoC across 8 files):
  - package.json, tsconfig.json, vitest.config.ts
  - src/api.ts, src/errors.ts, src/index.ts, src/verify.ts
  - test/verify.test.ts

Discovered when a sibling worktree (ts-sdk-chore/rfc, forked off ts-sdk-chore/competitive-analysis) failed pnpm install with the workspace:* reference unresolved, confirming the package never reached any branch.

Three scripts that exercise the same streamText("Reply with: ok.") call against gpt-4o-mini using each backend's recommended SDK shape (not just raw OTLP). Empirical companion to docs/design/ts-sdk-tracing/sdk-comparison.md § "Ergonomic-by-ergonomic, six implementations side-by-side".

- scripts/agenta-raw-otel.ts: baseline, raw @opentelemetry/* + AI SDK's experimental_telemetry flag. 9 functional setup statements. Hand-rolled trace URL.
- scripts/langfuse-sdk.ts: langfuse-node SDK. 1-statement Langfuse client, observeOpenAI() wrapping the OpenAI client (Langfuse generation observation auto-emitted, NO ai.streamText interception; that requires @langfuse/vercel), trace.span() / trace.end() imperative custom-span pattern, first-class userId/sessionId/tags/metadata on trace constructor, trace.getTraceUrl().
- scripts/braintrust-sdk.ts: braintrust SDK. initLogger() + wrapAISDK(ai) drop-in (no experimental_telemetry flag needed), traced(fn, {name}) functional wrapper, currentSpan().log({metadata, tags}) flat-KV semantic context, currentSpan().link() trace URL.

Empirical findings driving doc updates:
- Langfuse computes cost server-side at ingest: resolves "gpt-4o-mini" to date-versioned "gpt-4o-mini-2024-07-18", looks up pricing, populates calculatedTotalCost: 3e-06 + per-token inputPrice/outputPrice
- Braintrust auto-emits time_to_first_token: 0.057s without being asked
- Langfuse observeOpenAI ONLY wraps the OpenAI client; to trace AI SDK's streamText calls in Langfuse, the separate @langfuse/vercel package is needed
- Setup LoC empirically counted: Langfuse 1, Braintrust 2, Agenta-raw 9

.env stays ignored (real OpenAI/Agenta/Langfuse/Braintrust keys for the re-run); .env.example is the redacted placeholder.

Brings in:
- @agenta/spike-verify harness at web/examples/.shared/agenta-verify/ (unblocks workspace install for all 8 spike apps)
- web/examples/sdk-native-spike: vendor-native SDK comparison scripts
- prettier autofix from pre-commit hook

Adds two design docs for the Agenta TypeScript SDK v1:
- docs/design/ts-sdk/rfc.md: long-form RFC (audit trail: reasoning chain, rejected alternatives, locked decisions inherited from prior spike work).
- docs/design/ts-sdk/proposal.md: short-form proposal for review. Uses descriptive English issue names instead of spike pain-log codes; pieces of it are shareable on their own without cross-referencing the pain log.

Re-runs the four Next.js spike apps on Next.js 16.2.6 (Turbopack default builder) and updates artifacts accordingly:
- web/examples/nextjs-{app,pages}-router-{raw,vercel}: bump next to 16.2.6, add @opentelemetry/sdk-trace-base as an explicit dep on the two @vercel/otel apps (Turbopack's stricter module resolution exposed the transitive miss), add Next-generated next-env.d.ts, refresh assertion-1 tests against the consumer-facing ag.metrics.tokens.cumulative.* path.
- web/examples/nextjs-pages-router-raw/pages/api/edge-chat.ts: new probe to verify P-PAGES-RAW-01 on Next 16. The build-time barrier is gone with Turbopack; the runtime tracing problem (P-APP-RAW-01) persists.

Pain log updates (docs/design/ts-sdk-tracing/pain-log.md):
- P-PAGES-RAW-01: appended Next.js 16 RESOLVED-at-build-time note; runtime emits 0 spans still.
- P-PAGES-VERCEL-01: appended Next.js 16 re-verification. Both parent AND child .doStream have empty tokens (originally we believed the child retained them). Fix recommendation flips: backend rollup can't recover what no span carries; promote the JS-side agentaPipeUIMessageStreamToResponse helper from contingency to v1 deliverable.
- P-COMMON-02: new entry. Turbopack's stricter module resolution exposes missing transitive @opentelemetry/sdk-trace-base declarations on @vercel/otel apps. v1 setup-doc update.
- P-BRAINTRUST-01 / P-LANGFUSE-01: backfill **Framework:** lines, severity axes, and the friction + ideal-sketch code blocks to satisfy the pain-log schema validator (previously failing on main).
- validate-pain-log.mjs: teach the validator about `braintrust` and `langfuse` frameworks + their code prefixes.

Spike harness touch-ups (web/examples/.shared/agenta-verify/, plus minor per-app fixes in node-vercel-ai-v6, react-tanstack-start, mastra-node) are incidental cleanup picked up during the re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These three drafts (proposal.md, rfc.md, competitive-analysis.md) were review material, not durable artifacts intended to live in the repo. Removing them from the branch tip; they're saved locally outside the repo. Earlier commits on this branch (e.g. ad7064c for proposal/rfc, 911778c and earlier for competitive-analysis) still contain them if the history ever needs to be consulted. The spike artifacts that the proposal was built from (pain-log, status, summary, sdk-comparison, the comparison scripts, and all the spike apps under web/examples/) remain on the branch unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-18T15:12:49Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	May 19, 2026 1:04pm

coderabbitai · 2026-05-18T15:12:53Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9669eb71-f8ae-447e-83b9-6c11b2a8e3ea

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

+        })
+    } catch (err) {
+        return new Response(
+            JSON.stringify({error: err instanceof Error ? err.message : String(err)}),




…acing

New spike app under web/examples/nextjs-workflow/ verifying how OpenTelemetry tracing behaves with Vercel Workflow DevKit + AI SDK. Two workflow variants, both empirically verified against Agenta:

1. **Plain pattern**: a 'use workflow' function calling generateText from inside a 'use step'. Produces a 36-span single-trace run. AI SDK semantic spans (ai.generateText, ai.generateText.doGenerate, fetch POST openai.com) land alongside Workflow DevKit's own OTel instrumentation (workflow.start, WORKFLOW, STEP, step.execute, world.events.create *, queue.publish, workflow.replay, etc.). All share one trace_id, including across the /.well-known/workflow/v1/step HTTP boundary that Workflow DevKit uses internally to dispatch step execution.
2. **DurableAgent pattern**: a multi-LLM agent loop with one tool call. Produces a 135-span single-trace run. Same automatic trace-context propagation. Caveat: DurableAgent calls AI SDK at the LanguageModelV3.doStream provider interface directly, bypassing streamText/generateText, so the standard ai.* / gen_ai.* semantic spans DO NOT appear. Users see Workflow's STEP doStreamStep span and the raw HTTP fetch to OpenAI, but not the AI-SDK-shaped trace tree. Flagged as a v2 follow-up if usage grows.

Two setup traps logged in the spike code for any future Workflow user:
- 'workflow' and '@workflow/ai' must NOT be in Next's serverExternalPackages. Externalizing them hides their 'use step' directives from the Workflow SWC plugin and produces StepNotRegisteredError at runtime (specifically: step//@workflow/ai@.../doStreamStep, step//@workflow/ai/agent@.../closeStream). The comment in next.config.ts documents this.
- DurableAgent's model option needs an actual model object from @workflow/ai/<provider> (e.g. openai('gpt-4o-mini') from @workflow/ai/openai), not the Gateway-style 'openai/gpt-4o-mini' string (which requires GATEWAY_API_KEY). And 'instructions:' not 'system:'.

Strategic finding: the §2.4 HTTP-wrapper-masquerades-as-the-trace-root problem gets proportionally worse with Workflow DevKit. A single run produces 35-135+ framework spans before the LLM call, so the entry HTTP span (which spans[0] picks) is even less informative as a trace root. This raises the value of the backend extract_root_span upgrade (§5.1 in the proposal).

Also touches:
- web/examples/nextjs-pages-router-raw/next-env.d.ts: regenerated by Next.js 16 (path moved from .next/dev/types to .next/types).
- web/pnpm-lock.yaml: workflow@4.2.4 + @workflow/ai@4.1.2 + @workflow/next@4.0.5 deps added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
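To make the trace-root problem concrete, here is one possible selection heuristic, written as a small sketch. It is not the backend's `extract_root_span` and not the proposal's §5.1 design, neither of which is shown on this page: `SpanSummary` and `pickDisplayRoot` are hypothetical names, and the rule "prefer the first `ai.*` or `gen_ai.*` span, otherwise fall back to `spans[0]`" is an assumption chosen only for illustration.

```typescript
// Hypothetical heuristic for illustration only.
interface SpanSummary {
    name: string;
}

function pickDisplayRoot(spans: SpanSummary[]): SpanSummary | undefined {
    // Prefer the first span that looks like an AI SDK semantic span.
    const llmSpan = spans.find(
        (span) => span.name.startsWith("ai.") || span.name.startsWith("gen_ai."),
    );
    // Otherwise keep today's behaviour described above: the first span wins.
    return llmSpan ?? spans[0];
}

// Span names taken from the plain-pattern run described above (order is illustrative).
const workflowRun: SpanSummary[] = [
    {name: "fetch POST openai.com"},
    {name: "workflow.start"},
    {name: "step.execute"},
    {name: "ai.generateText"},
    {name: "ai.generateText.doGenerate"},
];

const displayRoot = pickDisplayRoot(workflowRun);
```

A rule like this would not help the DurableAgent variant, where the commit notes that no `ai.*` / `gen_ai.*` spans are emitted at all.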

… to Agenta

New spike app under web/examples/n8n-self-hosted/ verifying how n8n's first-party OpenTelemetry support behaves when pointed at Agenta. n8n 2.20.11 bundles @opentelemetry/sdk-node 0.213 and registers an OtelService on startup when N8N_OTEL_ENABLED=true.

What lands in Agenta (verified empirically against a webhook-triggered workflow with two nodes, Webhook + HTTP Request):
- workflow.execute span (one per run) with attributes:
  - n8n.workflow.{id, name, node_count, version_id}
  - n8n.execution.{id, mode, status, is_retry, error_type}
- node.execute span (one per node, nested under workflow.execute) with:
  - n8n.node.{id, name, type, type_version, items.input, items.output}
- service.name configurable via N8N_OTEL_EXPORTER_SERVICE_NAME.

What n8n does NOT emit:
- No gen_ai.* / ai.* semantic attributes anywhere.
- No http.* on HTTP Request nodes despite literal HTTP calls.
- No prompt/model/completion/token capture; only items.input/output integer counts.

n8n's OtelService only wraps workflow + per-node lifecycle; it doesn't instrument code inside node implementations. So an HTTP Request to OpenAI, or a LangChain-backed AI node, produces one node.execute span with framework metadata and zero LLM-call detail. Same shape as the Mastra finding (§1.10 of the proposal): the framework emits OTel, and LLM-level visibility needs a separate adapter.

Setup gotchas documented in EMPIRICAL_FINDINGS.md (for any future user trying the same setup):
- n8n reads its OWN N8N_OTEL_* env vars; standard OTEL_* vars are silently ignored.
- buildOtlpTracesUrl is a literal endpoint + path concatenation, so Agenta's ?project_id= query parameter must be stuffed into N8N_OTEL_EXPORTER_OTLP_TRACING_PATH:
    N8N_OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal
    N8N_OTEL_EXPORTER_OTLP_TRACING_PATH=/api/otlp/v1/traces?project_id=<uuid>
- Auth via comma-separated key=value pairs in N8N_OTEL_EXPORTER_OTLP_HEADERS.
- From inside Docker, AGENTA_HOST must resolve to the host's loopback via host.docker.internal (Docker Desktop) or extra_hosts: host-gateway (Linux Docker).

Implication for the v1 TypeScript SDK proposal: n8n users are covered 'for free' at the workflow level, with no Agenta-specific code, just the env vars above. LLM-call detail is the gap, addressable by a v2+ docs recipe or an OpenInference-style LangChain JS instrumentation deployed as a custom n8n node. Not a v1 SDK deliverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
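The two env-var gotchas above (literal URL concatenation and comma-separated header pairs) can be mimicked in a few lines. This is a sketch of the behaviour as the commit describes it, not n8n's source: `concatOtlpTracesUrl` and `parseHeaderPairs` are hypothetical stand-ins, and the header value used in the usage lines is a placeholder rather than Agenta's actual auth format.

```typescript
// Mimics "endpoint + path" as plain string concatenation, as described above.
function concatOtlpTracesUrl(endpoint: string, tracingPath: string): string {
    // No URL parsing happens, so a query string placed in the path survives as-is.
    return endpoint + tracingPath;
}

// Parses "key=value,key2=value2" into a header map.
function parseHeaderPairs(raw: string): Record<string, string> {
    const headers: Record<string, string> = {};
    for (const pair of raw.split(",")) {
        // Split on the first "=" only, so values may themselves contain "=".
        const separator = pair.indexOf("=");
        if (separator <= 0) continue;
        const key = pair.slice(0, separator).trim();
        const value = pair.slice(separator + 1).trim();
        headers[key] = value;
    }
    return headers;
}

const tracesUrl = concatOtlpTracesUrl(
    "http://host.docker.internal",
    "/api/otlp/v1/traces?project_id=<uuid>",
);
const otlpHeaders = parseHeaderPairs("Authorization=placeholder-token==,x-extra=1");
```

Because the join is a plain concatenation in this model, the only place the `project_id` query parameter can go is the path variable, which is the workaround the commit records.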

Adds source-level evidence to EMPIRICAL_FINDINGS.md that the missing gen_ai.* / ai.* / http.* attributes on n8n's OTel output are n8n's design, not a setup mistake on our side.

Four source-level checks against the installed n8n image (v2.20.11):

1. OtelService.init() in dist/modules/otel/otel.service.js instantiates NodeSDK with no 'instrumentations' array, so zero OTel auto-instrumentation is registered.
2. grep for 'registerInstrumentations' / 'instrumentations:' across all of /usr/lib/node_modules/n8n/dist returns zero matches.
3. @opentelemetry/instrumentation-http and related packages ARE bundled (transitive peer deps from LangChain), but n8n never imports them.
4. OtelLifecycleHandler.onNodeEnd exposes a customAttributes hook via ctx.taskData.metadata.tracing, a per-node opt-in surface. But @n8n/n8n-nodes-langchain (the AI nodes package) never writes to metadata.tracing, so even the purpose-built OpenAI Chat Model node produces the same shape as our HTTP Request: n8n.node.type + items.input/output counts, nothing else.

Three paths forward, none of them v1 SDK work:
- Upstream PR adding metadata.tracing writes inside the LangChain AI nodes
- A custom n8n node that wraps AI calls and sets metadata.tracing.gen_ai_*
- A process-level OTel auto-instrumentation for LangChain JS layered on top

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ardaerzin and others added 30 commits May 10, 2026 21:01

ardaerzin and others added 2 commits May 18, 2026 16:59

Merge remote-tracking branch 'origin/main' into ts-sdk-chore/rfc

5ebfab9

github-advanced-security AI found potential problems May 18, 2026

View reviewed changes

ardaerzin and others added 2 commits May 19, 2026 12:27

Merge branch 'spike/vercel-workflow-tracing' into ts-sdk-chore/rfc

1a826b9

vercel Bot deployed to Preview May 19, 2026 10:29 View deployment

ardaerzin and others added 2 commits May 19, 2026 14:44

Merge branch 'spike/n8n-tracing' into ts-sdk-chore/rfc

cd7a9c2

vercel Bot deployed to Preview May 19, 2026 12:47 View deployment

vercel Bot deployed to Preview May 19, 2026 13:04 View deployment

ardaerzin wants to merge 37 commits into main from ts-sdk-chore/rfc
