
[ts sdk / exploration] New example apps #4353

Draft
ardaerzin wants to merge 37 commits into main from ts-sdk-chore/rfc


@ardaerzin
Contributor

Summary

Testing

Verified locally

Added or updated tests

QA follow-up

Demo

Checklist

  • I have included a video or screen recording for UI changes, or marked Demo as N/A
  • Relevant tests pass locally
  • Relevant linting and formatting pass locally
  • I have signed the CLA, or I will sign it when the bot prompts me

Contributor Resources

ardaerzin and others added 30 commits May 10, 2026 21:01
Mirrors @agenta/api-client: type:module, main/types/exports point at
./dist, prepare hook compiles via tsc on install. Required for Node-side
consumers (tsx, ts-node, plain node + tsc) — workspace bundler magic
doesn't extend to them, so the SDK has to ship a real built artifact.

Existing dependents (@agenta/entities, web/_reference/*) still type-check
against the new dist-based main. dist/ added to .gitignore (matches the
api-client pattern; install regenerates it via the prepare hook).
Adds a top-of-file note that newer spike apps targeting AI SDK v6 (with
multiple frameworks and the @vercel/otel variant) live under
web/examples/. The v4 example itself stays unchanged — coexistence is
intentional: it validates Agenta's adapter against v4 spans while the
spike apps probe v6.
…ing design

Adds a new examples/* workspace tree under web/ for research apps that
inform Agenta's first-party TypeScript SDK with tracing
(ts-sdk-tracing). The spike measures the actual friction TS users hit
today wiring AI SDK + raw OpenTelemetry against the Agenta backend, so
the SDK can be designed around real pain instead of guesses.

Phase 1 contents:

- @agenta/spike-verify (web/examples/.shared/agenta-verify/) — workspace
  package wrapping the official @agenta/sdk's traces.querySpans() with a
  polling/matching layer + 3 typed errors. 13 vitest unit tests cover
  polling/timeout/retry/error semantics; verifyTrace itself talks to a
  real Agenta over HTTP at integration time.

- @agenta-spike/node-vercel-ai-v6 (web/examples/node-vercel-ai-v6/) —
  Node + AI SDK v6 + raw OTel spike. Three demos (generateText,
  streamText, tool-call) and four canonical assertion scripts that run
  via pnpm test against a live Agenta endpoint:
    1. cold-start trace completeness
    2. mid-stream client-abort flush
    3. metadata round-trip
    4. instrumentation runs before first handler

  All 4 currently green against a local Agenta instance.
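
The polling/matching layer described above could look roughly like the sketch below. It is not the @agenta/spike-verify source: `pollForSpan`, `VerifyTimeoutError`, `SpanSummary`, and the option names are hypothetical, and the `query` callback stands in for a call such as traces.querySpans().

```typescript
// Hypothetical sketch of a poll-until-match helper with a typed timeout error.
export class VerifyTimeoutError extends Error {
  readonly attempts: number;
  constructor(attempts: number, timeoutMs: number) {
    super(`no matching span after ${attempts} attempts (${timeoutMs} ms)`);
    this.name = "VerifyTimeoutError";
    this.attempts = attempts;
  }
}

export interface SpanSummary {
  name: string;
  attributes: Record<string, unknown>;
}

export interface PollOptions {
  timeoutMs: number;
  intervalMs: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `query` repeatedly until one returned span satisfies `matches`,
// or throws VerifyTimeoutError once another wait would pass the deadline.
export async function pollForSpan(
  query: () => Promise<SpanSummary[]>,
  matches: (span: SpanSummary) => boolean,
  options: PollOptions,
): Promise<SpanSummary> {
  const deadline = Date.now() + options.timeoutMs;
  let attempts = 0;
  for (;;) {
    attempts += 1;
    const spans = await query();
    const found = spans.find(matches);
    if (found) return found;
    if (Date.now() + options.intervalMs > deadline) {
      throw new VerifyTimeoutError(attempts, options.timeoutMs);
    }
    await sleep(options.intervalMs);
  }
}
```

An assertion script would pass a query bound to its own user or session attribute and a matcher on the span name, then treat a thrown timeout as the assertion failure.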

Wires examples/* and examples/.shared/* into web/pnpm-workspace.yaml.
Each spike app is fully self-contained (own package.json, .env.example,
tsconfig). Findings feed docs/design/ts-sdk-tracing/ — see that
directory for the pain log + executive summary.

NOT a stable starter template. Apps are research instruments, expected
to be either refactored to use ts-sdk-tracing or retired when the SDK
ships (per the post-SDK lifecycle TODO captured separately).
Adds the design space for Agenta's first-party TypeScript SDK with
tracing. The spike apps under web/examples/ feed entries into pain-log.md
during build; the SDK is then designed against the captured friction
rather than guesses.

Includes:

- README.md — orientation and document index
- summary.md — living one-page executive summary, updated as each spike
  app phase lands. Phase 1 (Node + AI SDK v6 + raw OTel) section
  reflects 4/4 canonical assertions green and the first 3 pain entries.
- pain-log.md — structured friction entries with per-framework prefix
  numbering scheme (P-NODE-*, P-APP-RAW-*, P-APP-VERCEL-*, etc.). Each
  entry: framework, 3-axis severity, code-that-exists-today, ideal-API
  sketch, notes. Three Phase 1 entries so far, two silent-failure-shaped.
- status.md — progress tracker, locked design decisions, plus a separate
  "SDK Requirements" section for items that were initially logged as pain
  entries but, on review, are plainly features ts-sdk-tracing must ship
  (built dist, host normalization, projectId propagation).
- scripts/validate-pain-log.mjs — schema validator (Node ESM, no deps)
  wired into .pre-commit-config.yaml as a pre-commit hook on the
  pain-log.md path. Validates ID prefix matches a known framework set,
  3-axis severity present, code excerpt + ideal sketch both present, no
  duplicate IDs.
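
As a rough illustration of the four checks listed for the validator, here is a sketch over already-parsed entries. It is not scripts/validate-pain-log.mjs: the `PainEntry` shape, the field names, and the prefix subset are assumptions, and the real script works on the markdown file directly.

```typescript
// Hypothetical entry shape; the real pain-log schema is defined in pain-log.md.
export interface PainEntry {
  id: string; // e.g. "P-NODE-01"
  severity: string[]; // one value per axis (axis names are not given here)
  codeToday: string;
  idealSketch: string;
}

// Illustrative subset of the known framework prefixes.
const KNOWN_PREFIXES = new Set<string>(["P-NODE", "P-APP-RAW", "P-APP-VERCEL"]);

// Returns a list of human-readable problems; an empty list means the log is valid.
export function validatePainLog(entries: PainEntry[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const entry of entries) {
    const match = /^(P-[A-Z]+(?:-[A-Z]+)*)-(\d{2})$/.exec(entry.id);
    const prefix = match ? match[1] ?? "" : "";
    if (!KNOWN_PREFIXES.has(prefix)) errors.push(`${entry.id}: unknown ID prefix`);
    if (seen.has(entry.id)) errors.push(`${entry.id}: duplicate ID`);
    seen.add(entry.id);
    if (entry.severity.length !== 3) errors.push(`${entry.id}: severity needs 3 axes`);
    if (entry.codeToday.trim() === "") errors.push(`${entry.id}: missing code excerpt`);
    if (entry.idealSketch.trim() === "") errors.push(`${entry.id}: missing ideal sketch`);
  }
  return errors;
}
```

A pre-commit hook would exit non-zero whenever the returned list is non-empty.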

Plus TODOS.md at repo root: post-spike lifecycle decision per spike app
(refactor to use ts-sdk-tracing as docs companion / starter template, or
retire with pain log preserved).
Adds the second spike app for ts-sdk-tracing design: Next.js 15 App
Router + AI SDK v6 + raw OpenTelemetry. Probes the four canonical
assertions plus three new framework-specific surfaces — Server Action
direct invocation, useChat streaming via /api/chat, and an edge-runtime
route at /api/edge-chat.

Result: 4/4 nodejs-runtime assertions PASS. The edge-runtime route
silently emits zero spans despite the documented setup (fetch-based
exporter, SimpleSpanProcessor, waitUntil(forceFlush())) — captured as
P-APP-RAW-01 in the pain log, the highest-severity Phase 2a finding.
Phase 2b's @vercel/otel app will A/B test whether that's a raw-OTel-on-
edge issue or something else.

Stack notes worth committing context on:

- Uses SimpleSpanProcessor (not BatchSpanProcessor) per P-NODE-02.
  BatchSpanProcessor + AI SDK v6 streamText silently loses spans.
- instrumentation.ts dispatches by NEXT_RUNTIME — Node-runtime setup
  in instrumentation.node.ts, edge-runtime setup module-scoped inside
  the edge route file (Node-only OTel libs can't load on edge).
- type:module in package.json is required for tsx test scripts to
  resolve the ESM-only @agenta/sdk via @agenta/spike-verify.
- Each spike app on its own port (3101 here) so multiple can run in
  parallel without colliding with the OSS app on 3000.

Updates summary.md (now living, covers both Phase 1 + 2a) and status.md
to reflect the current 4/4 + edge-pending state.
Adds the @vercel/otel variant of the App Router spike. Same app shape as
nextjs-app-router-raw — same routes, same canonical assertions, same
chat UI, same Server Action probe. The only delta is the instrumentation
wiring: a single registerOTel() call from @vercel/otel replaces the
multi-file raw OTel scaffold (instrumentation.ts → instrumentation.node.ts
+ inline edge-route provider setup).

Result: 3/4 nodejs assertions PASS. assertion-2 FAILS — captured as
P-APP-VERCEL-01 in the pain log. Same root cause as P-NODE-02:
@vercel/otel defaults to BatchSpanProcessor, which loses streamText
spans on mid-stream client abort within the 5s assertion window.

Edge route A/B verdict: where raw OTel + manual edge setup emits zero
spans EVER (P-APP-RAW-01), @vercel/otel emits spans WITH a 10-15s delay
(P-APP-VERCEL-02). Not silent loss, but too slow for interactive
abort-flush scenarios. The wrapper rescues the case where the raw setup
fails entirely, but doesn't make it production-ready for streaming.

Cross-cutting takeaway captured in summary.md: BatchSpanProcessor + AI
SDK v6 streamText is the universal flush failure regardless of which
OTel wrapper you use. The SDK has to own the processor choice.
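
To make the processor difference concrete, here is a toy model rather than the OpenTelemetry classes themselves. All names are hypothetical, and the real BatchSpanProcessor also flushes on a timer and on batch size, which this sketch leaves out.

```typescript
// Toy model: an exporter receives batches of span names.
export type Exporter = (spanNames: string[]) => void;

export interface ToyProcessor {
  onEnd(spanName: string): void;
  forceFlush(): void;
}

// Exports every span as soon as it ends.
export function simpleProcessor(exportSpans: Exporter): ToyProcessor {
  return {
    onEnd: (spanName) => exportSpans([spanName]),
    forceFlush: () => {},
  };
}

// Queues ended spans; nothing leaves the process until a flush happens.
export function batchProcessor(exportSpans: Exporter): ToyProcessor {
  const queue: string[] = [];
  return {
    onEnd: (spanName) => {
      queue.push(spanName);
    },
    forceFlush: () => {
      if (queue.length > 0) exportSpans(queue.splice(0));
    },
  };
}
```

In this model, a request that is aborted (or a runtime that freezes) between `onEnd` and the next flush leaves the batch variant with an unexported queue, while the simple variant has already handed the span to the exporter.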

App runs on its own port (3102) so both 2a (3101) and 2b (3102) can run
concurrently for direct A/B testing.
Adds the Pages Router raw-OTel spike to inform ts-sdk-tracing design.
Same wiring approach as nextjs-app-router-raw — `instrumentation.ts`
register hook (works identically on Pages Router since Next 15) +
SimpleSpanProcessor + AI SDK v6 — but adapted to Pages Router's
NodeApi handler shape.

Pages-vs-App differences worth committing context on:

- Streaming: Pages Router API routes return a Node ServerResponse, not
  a fetch Response. So `result.toUIMessageStreamResponse()` (App Router)
  doesn't apply — we use `pipeUIMessageStreamToResponse({response: res,
  stream: result.toUIMessageStream()})` instead.
- No Server Action: Pages Router doesn't support React Server Actions,
  so the Server-Action probe from Phase 2a is omitted.
- No edge route: dropped at build time (P-PAGES-RAW-01). Pages Router's
  edge runtime applies stricter dynamic-code-eval static analysis than
  App Router's, and rejects `@opentelemetry/exporter-trace-otlp-http`'s
  imports. The same imports compile fine in Phase 2a's App Router edge
  route. Pages Router users on raw OTel can't ship edge tracing today.
- useChat client side: identical to App Router — same DefaultChatTransport
  pattern works against both router types.

Result: 4/4 nodejs-runtime canonical assertions PASS. The
P-PAGES-RAW-01 edge-build failure is the new entry; Phase 3b (vercel-otel)
will A/B test whether @vercel/otel's edge bundle passes the strict check.

App runs on its own port (3103).
Adds the Pages Router @vercel/otel spike to A/B test against Phase 3a
(nextjs-pages-router-raw). Same Pages Router app shape, same Node res +
pipeUIMessageStreamToResponse streaming pattern — single-line registerOTel()
replaces multi-file raw OTel scaffolding.

Two critical Phase 3b verdicts:

1. Edge route BUILDS + RUNS on @vercel/otel where raw OTel failed (P-PAGES-RAW-01).
   @vercel/otel ships an edge-safe bundle that passes Pages Router's strict
   dynamic-code-eval static check. Spans arrive on edge with the same
   ~10-15s BatchSpanProcessor delay as the App Router edge story
   (P-APP-VERCEL-02 reproduces on Pages-edge).

2. NEW silent failure: ag.metrics.tokens = {} on the parent streamText span,
   but ONLY in this 4-way combination of Pages Router + @vercel/otel +
   pipeUIMessageStreamToResponse + AI SDK v6 streamText. Each isolated piece
   works alone (verified across Node raw OTel, App Router vercel-otel, Pages
   Router raw OTel). Captured as P-PAGES-VERCEL-01. Token counts are the #1
   metric users instrument LLM calls for — cost tracking silently disappears
   when wiring documented best-practice from both ecosystems.

Result: 4/4 nodejs-runtime canonical assertions PASS, with assertion-1
loosened to drop the now-empty token-metrics check (documented in test
file comments). Edge route + assertion-4 sentinels validated by manual probe.

App runs on its own port (3104). Implications for ts-sdk-tracing's design
escalate cross-cutting takeaway #1: the SDK must own streamText's span
lifecycle (end + flush + attribute population) — solves P-NODE-02 +
P-APP-VERCEL-01 + P-PAGES-VERCEL-01 in one shot.
Closes the 4-framework matrix for ts-sdk-tracing's design input:
Node + Next.js App Router (raw + vercel-otel) + Next.js Pages Router
(raw + vercel-otel) + TanStack Start. 6 spike apps total covering
the modern TS framework surface for AI SDK v6 + raw OTel + Agenta.

TanStack Start specifics (vs Next.js):

- No `instrumentation.ts` register hook. Instrumentation fires via
  being the FIRST import in `src/server.ts` — unenforced by the
  framework. A single auto-formatter import-sort silently disables
  tracing with no warning. Captured as P-TANSTACK-01.
- No per-route edge runtime opt-in (`export const runtime = "edge"`).
  Runtime is selected at the Nitro preset level for the entire server.
  Edge probe deferred — captured as P-TANSTACK-02 (coverage gap, not
  a silent failure).
- `createStartHandler()` return shape mismatch: official docs show
  `export default createStartHandler(...)` but the dev plugin needs
  `{fetch: ...}`. Lost ~30min debugging; required reading the
  framework's own default-entry source to find the working shape.
  Captured as P-TANSTACK-03.

Stream sink IS identical to App Router (fetch Response via
`result.toUIMessageStreamResponse()`), so the per-call tracing path
is identical: 4/4 nodejs canonical assertions PASS unchanged from
Phase 2a.

Result: 6 spike apps, 11 ecosystem pain entries (6 silent-failure
shaped), 5 cross-cutting takeaways. The fifth takeaway is new from
Phase 4: the SDK should ship per-framework adapter wrappers
(`withAgentaInstrumentation(handler, opts)`) so each framework's
instrumentation seam is invariant-by-construction. Three TanStack
pain entries collapse to one wrapper.
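
One way to read "invariant-by-construction" is sketched below. Only the name `withAgentaInstrumentation(handler, opts)` comes from the text above; the option shape and the generic handler signature are assumptions, since no such wrapper exists yet.

```typescript
// Hypothetical options: `init` registers the tracer provider and exporters.
export interface InstrumentationOptions {
  init: () => void;
}

// Wraps a request handler so that instrumentation is initialized exactly once,
// and always before the first handler invocation, regardless of import order.
export function withAgentaInstrumentation<Req, Res>(
  handler: (request: Req) => Promise<Res>,
  opts: InstrumentationOptions,
): (request: Req) => Promise<Res> {
  let initialized = false;
  return async (request: Req): Promise<Res> => {
    if (!initialized) {
      opts.init();
      initialized = true;
    }
    return handler(request);
  };
}
```

Because the ordering lives inside the wrapper, an import-sort in the server entry file can no longer move the handler ahead of instrumentation setup.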

Phase 4 closes spike scope. SDK design phase can now start with
confidence — pain log saturated across framework variations, no
new pain categories expected from further frameworks (Hono, Remix,
SvelteKit, etc. are out of spike scope per Decision 4).

App runs on its own port (3105).
…auses

Source-dived @vercel/otel@2.1.2 and ai@6.0.177 to pin down both deferred
investigations. Updates the pain log Notes sections with empirical
mechanisms and the summary's "What didn't, and why" table rows.

P-APP-RAW-01 (raw OTel + App Router edge: zero spans ever):

@vercel/otel's CompositeSpanProcessor.onStart enrolls forceFlush into
globalThis[Symbol.for("@vercel/request-context")].get().waitUntil(...)
at root-span open. That's the Vercel-runtime primitive backing
unstable_after — it defers isolate freeze until the registered promise
resolves. Our raw setup uses `after(() => forceFlush())` which runs
the callback but does NOT enroll the promise into the runtime tracker,
so the edge isolate freezes the moment Response returns and the OTLP
fetch is killed mid-flight. None of the 3 originally posed hypotheses
was correct as written — hypothesis 2 (after() runs too late) was
closest but wrong on the mechanism. keepalive: true is a red herring
(@vercel/otel doesn't set it either).
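
The enrollment step described above can be sketched as follows. The interface shapes are inferred only from the expression quoted in this message; whether `get()` can return undefined is an assumption, and `enrollFlush` is a hypothetical helper, not @vercel/otel code.

```typescript
// Shapes inferred from globalThis[Symbol.for("@vercel/request-context")].get().waitUntil(...).
export interface RequestContext {
  waitUntil(promise: Promise<unknown>): void;
}

export interface RequestContextStore {
  get(): RequestContext | undefined;
}

export const CONTEXT_KEY = Symbol.for("@vercel/request-context");

// Registers the flush promise with the runtime tracker when one is present.
// Returns false when no request context is exposed, so the caller can fall back.
export function enrollFlush(forceFlush: () => Promise<void>): boolean {
  const globals = globalThis as unknown as Record<symbol, RequestContextStore | undefined>;
  const context = globals[CONTEXT_KEY]?.get();
  if (!context) return false;
  context.waitUntil(forceFlush());
  return true;
}
```

The point of the sketch is the difference in kind: running a callback after the response is not the same as handing the runtime a promise it must wait for before freezing the isolate.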

P-PAGES-VERCEL-01 (Pages + vercel-otel + pipeUIMessageStreamToResponse
+ streamText: empty ag.metrics.tokens):

@vercel/otel's CompositeSpanProcessor.onEnd force-ends every still-open
child span when the Next.js root SERVER span ends. AI SDK v6 streamText
writes ai.usage.* attrs INSIDE flush() right before rootSpan.end().
pipeUIMessageStreamToResponse returns synchronously while the stream
is still draining (writeToServerResponse fires read() without
awaiting), so the SERVER span ends BEFORE flush() runs — the force-end
then kills ai.streamText, and AI SDK's subsequent setAttributes(
{ai.usage.*}) no-ops on the ended span per OTel spec. App Router's
toUIMessageStreamResponse keeps the SERVER span alive until the
response body drains (Next awaits the stream), so flush() lands
before the force-end. Raw OTel has no CompositeSpanProcessor and no
force-end logic. The 4-way collision is structural.
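
The ordering problem can be reproduced with a toy span that ignores writes after end(), which is the behaviour attributed to the OTel spec above. `ToySpan` and the attribute key used here are illustrative only.

```typescript
// Toy span: attribute writes after end() are silently ignored.
export class ToySpan {
  readonly attributes: Record<string, unknown> = {};
  private ended = false;

  setAttributes(attrs: Record<string, unknown>): void {
    if (this.ended) return; // no-op on an ended span
    for (const key of Object.keys(attrs)) {
      this.attributes[key] = attrs[key];
    }
  }

  end(): void {
    this.ended = true;
  }
}

// Pages Router + @vercel/otel ordering: force-end first, usage written afterwards.
export function forceEndThenFlush(): ToySpan {
  const span = new ToySpan();
  span.end(); // parent SERVER span ended, child force-ended
  span.setAttributes({ "ai.usage.totalTokens": 42 }); // dropped
  return span;
}

// App Router ordering: usage written before the span ends.
export function flushThenEnd(): ToySpan {
  const span = new ToySpan();
  span.setAttributes({ "ai.usage.totalTokens": 42 }); // kept
  span.end();
  return span;
}
```

The first ordering yields an empty attribute bag, which is what an empty ag.metrics.tokens would look like downstream.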

Both findings stay in observation space. Solution-space updates
deferred per current direction.

Open questions logged in pain-log.md:
- whether requestContext symbol is populated in `next dev` vs only
  deployed Vercel infra (our 10-15s arrival could be incidental
  BatchProcessor timing rather than waitUntil-enrolled flush)
- mechanism of P-PAGES-VERCEL-01 traced from source, not instrumented
  at runtime — a 1-line probe patching CompositeSpanProcessor.onEnd
  would empirically confirm
Researches both vendors across 8 dimensions (package layout, init,
tracing API, AI provider integrations, evals/datasets/prompts, export
model + edge runtime, type safety, design opinions). Maps findings
against the active ts-sdk-tracing spike's 11 pain entries and
distills 10 explicit RFC decision points plus 5 ranked
differentiation opportunities.

Headline findings:
- AI SDK v6 streamText + abort flush is an open gap for both
  competitors (Langfuse issue #12643, Braintrust steers to OTel
  mode where same problem reappears) — strongest differentiator
- Edge runtime tracing: Langfuse unsupported, Braintrust ships
  per-runtime conditional exports — pattern worth lifting
- Both ship eval orchestration in-SDK; Braintrust adds a CLI runner
- Langfuse v4/v5 demonstrates OTel-native end-to-end is viable;
  paid for in three breaking releases (cautionary tale on migration)

Inputs to the upcoming agenta ts-sdk RFC.
v1 of the doc was a synthesis of web research and contained ~18 incorrect
or imprecise claims. v2 is a deep source-code audit of both repos, cloned
locally, with file:line citations on every load-bearing claim.

Key corrections (Braintrust):
- Span types: 11 not 6 (adds automation/facet/preprocessor/classifier/review)
- Wire endpoint: `logs3` not `/logs`; log upload batch is 100 not 1000
- `wrapTraced` generator handling is `function*`/`async function*` only,
  NOT arbitrary AsyncIterable — does not solve AI SDK v6 case
- AI SDK stream handling routes through `diagnostics_channel`, not generators
- Zero AbortSignal handling anywhere in `js/src/wrappers/ai-sdk/`
- `dotenv` is CLI-only, never auto-loaded at runtime
- `filterAISpans` is opt-in, off by default
- CLI binary is `braintrust`, not `bt`
- `BraintrustExporter` wraps `BraintrustSpanProcessor`; HTTP layer is
  upstream OTLPTraceExporter
- `engines` and `sideEffects` not declared anywhere

Key corrections (Langfuse):
- `@langfuse/tracing` is Node ≥ 20, not "Universal"
- `LANGFUSE_BASEURL` still read as legacy fallback
- 10 observation types but only 2 attribute shapes; remaining 8 are
  type aliases of `LangfuseSpanAttributes`
- 16 deprecated method-name aliases preserved in `LangfuseClient` —
  migration tax softer than commonly framed
- Default batching uses upstream OTel defaults (512 / 5000ms) unless
  env vars set
- `LANGFUSE_FLUSH_AT`/`_INTERVAL` control both span processor AND score
  queue — env var collision

Smoking gun: source-level confirmation neither codebase handles AI SDK
v6 streaming abort. Langfuse's `wrapAsyncIterable` ends generation only
on for-await loop completion; e2e tests pass via manual forceFlush().
Braintrust's wrapAISDK has no abort-aware logic. Confirms strongest
differentiation opportunity for agenta.

New findings worth lifting into RFC:
- Braintrust `Symbol.for("braintrust-state")` globalThis state pattern
  (kills multi-copy / Next.js dev mode footguns)
- Langfuse stale-while-revalidate prompt cache
- Langfuse `propagateAttributes` observation-centric attribute propagation
- Langfuse `setLangfuseTracerProvider` escape hatch for `@vercel/otel`
- Langfuse mirroring of `user.id`/`session.id` to unprefixed OTel-standard
  keys
- Per-attribute async-aware mask function applied before media extraction

17 explicit RFC decision points + 6 ranked differentiation opportunities.
…ion buries AI SDK spans in UI)

Discovered during a per-phase re-run with isolated API keys (each phase's
traces went to its own Agenta key for clean comparison). All four Next.js
spike screenshots showed `POST /api/chat...`, `executing api r...`, `GET
/api/sentin...` as the trace-list rows with empty Inputs/Outputs columns —
where Phase 1 (Node) and the v4 published example showed `ai.streamText`
/ `ai.generateText` with the prompt + response fully visible.

Verified the cause via direct `POST /api/spans/query` calls for one
assertion-3 trace per phase. Confirmed structure:

  Phase 2a/2b (App Router): 7 spans, ai.streamText at depth 2 under
    `POST /api/chat/route` → `executing api route (app) /api/chat/route`
  Phase 3a/3b (Pages Router): 4-5 spans, ai.streamText at depth 2 under
    `POST /api/chat` → `executing api route (pages) /api/chat`
  Phase 4 TanStack Start: 2 spans, ai.streamText IS the root (L0)
  Phase 1 Node: ai.streamText IS the root

Identical tree shape in raw OTel AND @vercel/otel — proves the HTTP +
handler wrapper spans come from Next.js 15's built-in OTel auto-
instrumentation, not from the user's OTel library choice. TanStack
Start's Vite/Nitro stack does NOT emit auto-instrumented HTTP spans, so
its traces display correctly in Agenta's "Root" view.

Inputs/outputs/tokens/metadata are all present in Agenta — on the
ai.streamText span at depth 2. The UI's default "Root" filter just
doesn't surface them: the HTTP root span carries only
ag.type.trace='invocation' + ag.metrics.duration.cumulative, no payload.

Why we missed it earlier: our verifyTrace harness queries by attribute
(ag.user.id) and matches by span name, finding ai.streamText regardless
of hierarchy. Programmatic assertions pass; the UI experience is
degraded. Pre-existing pain entries focused on data loss; this is the
inverse — data is preserved, but the default lens hides it.
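
The gap between the two lenses can be shown with a small sketch. The `SpanRow` shape and helper names are hypothetical; the span names are the App Router ones listed above.

```typescript
export interface SpanRow {
  id: string;
  parentId?: string;
  name: string;
}

// Number of ancestors between a span and its trace root.
export function depthOf(span: SpanRow, byId: Map<string, SpanRow>): number {
  let depth = 0;
  let parent = span.parentId !== undefined ? byId.get(span.parentId) : undefined;
  while (parent) {
    depth += 1;
    parent = parent.parentId !== undefined ? byId.get(parent.parentId) : undefined;
  }
  return depth;
}

// A "Root" lens only ever sees parentless spans.
export function rootView(spans: SpanRow[]): SpanRow[] {
  return spans.filter((span) => span.parentId === undefined);
}

// A name-based match finds the span wherever it sits in the tree.
export function findByName(spans: SpanRow[], name: string): SpanRow | undefined {
  return spans.find((span) => span.name === name);
}

export const appRouterTrace: SpanRow[] = [
  { id: "a", name: "POST /api/chat/route" },
  { id: "b", parentId: "a", name: "executing api route (app) /api/chat/route" },
  { id: "c", parentId: "b", name: "ai.streamText" },
];
```

With this trace, `findByName` returns the ai.streamText span at depth 2, while `rootView` returns only the HTTP span, which carries no payload.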

Added "common" as a new framework prefix (P-COMMON-NN) for cross-
framework / backend-side findings. Updated:
- pain-log.md: schema docstring, numbering scheme, P-COMMON-01 entry
- validate-pain-log.mjs: accept "common" framework value
- summary.md: current status paragraph + What didn't table row +
  pain entry count (11 → 12)
- status.md: Last Updated note
…te Langfuse precedent

Pre-existing P-COMMON-01 entry claimed the root cause was Next.js's
built-in OTel auto-instrumentation but didn't validate the claim or
compare against how other LLM observability platforms handle the same
symptom. Filling in both via web research:

Not-a-wiring-mistake (confirmed):
- Our instrumentation.ts in both raw OTel and @vercel/otel variants
  matches Vercel's own ai-chatbot template + the Next.js OTel docs
  "Manual OpenTelemetry configuration" sample line-for-line. No
  canonical example ships span filtering.
- The wrapper span names (BaseServer.handleRequest, AppRouteRouteHandlers
  .runHandler, NextNodeServer.findPageComponents, .startResponse)
  come from packages/next/src/server/lib/trace/constants.ts — Next.js
  itself, not @opentelemetry/instrumentation-http and not @vercel/otel.
- Next.js docs explicitly state: "the root server span labeled as
  [http.method] [next.route]. All other spans from that particular
  trace will be nested under it." It's by design.
- NEXT_OTEL_VERBOSE=0 (default) keeps the 5 wrapper spans we observed;
  setting =1 adds MORE. NEXT_OTEL_FETCH_DISABLED=1 suppresses only the
  outbound fetch span. No documented knob removes the root.

How others handle the same symptom:
- Langfuse @langfuse/otel v5+ ships a LangfuseSpanProcessor that drops
  every span whose instrumentationScope.name doesn't match a known LLM-
  library prefix (`ai`, `openinference`, anthropic, openai, langsmith,
  litellm, etc.) and lacks gen_ai.* attrs. Next.js wrapper spans get
  silently dropped at the processor BEFORE export. Their FAQ
  (`/faq/all/unwanted-http-database-spans`) addresses this exact pain
  and notes pre-v5 SDKs exported them and counted toward billing — i.e.
  they evolved to filter after running into the same problem we are.
- Braintrust uses wrapper-based observability (wrapAISDK, traced())
  rather than instrumentation.ts auto-instrumentation. The AI call is
  wrapped explicitly so the wrapper span IS the trace root by
  construction. Different paradigm; no filtering needed.
- Neither documents a user-side instrumentation.ts modification; the
  filtering (Langfuse) or wrapping (Braintrust) is SDK-side, applied
  uniformly across customer apps.

Updates:
- pain-log.md P-COMMON-01: new "Confirmed not-a-wiring-mistake" block
  with reference URLs; rewrote the "What would be ideal" sketch to show
  three observed ecosystem approaches (Langfuse filter, Braintrust
  wrappers, backend-side UI) — all in observation space, no
  recommendation
- summary.md "What didn't" table P-COMMON-01 row: tightened root-cause
  column to cite the 5 supporting evidence points, mentions Langfuse
  precedent
…flag inference

Previous commit (626172c) made the claim "Langfuse explicitly
addressed it on their SDK side after running into it themselves"
without distinguishing what's directly quoted vs inferred. Replacing
the loose paraphrase with verifiable references.

Verified two sources (web-fetched 2026-05-11):

1. Langfuse FAQ at langfuse.com/faq/all/unwanted-http-database-spans —
   contains two verbatim quotes now inlined in the pain entry:
   - Pre-v5 behavior: "no automatic filtering — Langfuse exports all
     spans it receives, including HTTP requests, database queries, and
     framework internals."
   - v5+ behavior: "apply a default span filter that automatically
     keeps only LLM-related spans and drops HTTP, database, and
     framework spans — no configuration needed."

2. @langfuse/otel source at unpkg.com/@langfuse/otel/dist/index.mjs —
   confirms LangfuseSpanProcessor class + isDefaultExportSpan() filter
   logic (OR of isLangfuseSpan/isGenAISpan/isKnownLLMInstrumentor) +
   the 10-entry KNOWN_LLM_INSTRUMENTATION_SCOPE_PREFIXES allowlist
   (`ai`, `openinference`, `langsmith`, etc.). Code excerpts now in
   the pain entry.
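
For orientation, an allowlist filter of this general shape can be sketched as below. This is not the @langfuse/otel code: it models only two of the three predicates, uses a three-entry subset of the prefixes named above, and the matching rule (exact name or prefix followed by a dot) and the `next.js` scope name are assumptions.

```typescript
export interface ExportCandidate {
  scopeName: string; // instrumentation scope that produced the span
  attributes: Record<string, unknown>;
}

// Illustrative subset of an LLM instrumentation-scope allowlist.
const LLM_SCOPE_PREFIXES = ["ai", "openinference", "langsmith"];

export function isKnownLlmScope(scopeName: string): boolean {
  return LLM_SCOPE_PREFIXES.some(
    (prefix) => scopeName === prefix || scopeName.startsWith(`${prefix}.`),
  );
}

export function hasGenAiAttribute(attributes: Record<string, unknown>): boolean {
  return Object.keys(attributes).some((key) => key.startsWith("gen_ai."));
}

// A span is exported when either predicate holds; everything else is dropped.
export function shouldExport(span: ExportCandidate): boolean {
  return isKnownLlmScope(span.scopeName) || hasGenAiAttribute(span.attributes);
}
```

Under these assumptions a framework wrapper span with only HTTP attributes is dropped before export, while an `ai` scope span or any span carrying a gen_ai.* attribute is kept.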

What was actually NOT verbatim: the "after running into it themselves"
phrasing. The FAQ does NOT contain a direct quote saying Langfuse
built the filter because they hit the problem internally. That
inference rests on (a) the documented pre-v5 vs v5 behavioral change
and (b) the existence of a dedicated FAQ page titled
"unwanted-http-database-spans" — strong indirect evidence but not a
quotable claim. Pain entry now flags this distinction explicitly with
an "Interpretive note (not a verbatim claim)" subsection.

Braintrust comparison also flagged as agent-research-summarized rather
than independently web-verified.
Adds the Vue ecosystem to the spike: Nuxt 4 + Nitro + AI SDK v6 + raw OTel.
Mirrors the Next.js spike pattern (4 canonical assertions) so the trace
hierarchy / abort handling / instrumentation seam can be compared apples
to apples.

Scope decision (single app, not A/B):
- Research showed @vercel/otel is Next.js-only — no first-party Nuxt path.
  An A/B "nuxt-vercel" companion would just be testing community packaging.
- Nuxt has no instrumentation.ts hook. Wiring is via Nitro plugin
  (server/plugins/otel.ts), which runs at Nitro init.

What worked:
- 4/4 canonical assertions PASS (assertion-2's flush window had to be
  raised from 5s to 30s — see P-NUXT-01 below).
- Trace hierarchy is CLEAN: ai.streamText IS the root, inputs/outputs
  visible at the top level (2 spans total). P-COMMON-01 (Next.js HTTP
  auto-instrumentation buries AI SDK spans) does NOT apply to Nuxt —
  bare Nitro doesn't emit HTTP server spans. Same shape as TanStack
  Start.

What didn't:
P-NUXT-01: H3 v2 RC has no working abort-signal propagation in the
Nitro Node-runtime preset. Verified empirically:
- event.req.signal: types claim it exists; runtime says undefined
- event.runtime.node.req: typed; runtime says undefined
- event.node.req 'close' event: fires only AFTER stream drains naturally,
  not when client disconnects mid-stream
streamText receives no abortSignal, model keeps generating after client
abort, parent span ends ~7-15s late (instead of <1s like other phases).
Trace DOES arrive, just outside the interactive window. Production
cost-control implication: today's Nuxt users keep paying for tokens
generated after the user already closed the tab.

Updates:
- web/examples/nuxt-raw/: full app (package.json, nuxt.config.ts,
  server/{plugins/otel.ts, api/chat.post.ts, api/sentinels.get.ts,
  lib/ai.ts}, app.vue, 4 assertion scripts, tsconfig.json, .env.example,
  .gitignore)
- pain-log.md: new "Nuxt 3/4 (Vue + Nitro)" section + P-NUXT-01 entry
  with verbatim runtime probe output
- summary.md: Phase 5 section, P-NUXT-01 row in "What didn't" table,
  assertion count 23/24 → 27/28, pain count 12 → 13
- status.md: Phase 5 progress tracker row, Last Updated note
- scripts/validate-pain-log.mjs: accept "nuxt" as framework value

Port: 3106 (avoids 3101-3105 used by prior spike apps).
… evals, scoring, media, annotations, sessions, functions, CLI, auth, config, cost, query)

v2 was tracing-focused. v3 adds peer sections for every non-tracing
surface both SDKs ship, with file:line citations from a fresh source
audit of both repos.

New sections (13 surfaces):
- §6 Prompts — Braintrust loadPrompt + xact_id versioning + two-tier
  cache (memory + gzipped disk); Langfuse PromptManager + SWR cache
  with concurrent-refresh dedup
- §7 Datasets — Braintrust ObjectFetcher + full CRUD + snapshots;
  Langfuse manager-only-get + CRUD on raw api.datasets.*
- §8 Evals — Braintrust ~25 Evaluator fields + rolling concurrency +
  byte-backpressure; Langfuse batched (allSettled) + silent eval
  failure drop
- §9 Scoring — Braintrust span-attached, no separate queue;
  Langfuse fire-and-forget queue + 5 ScoreDataTypes + ScoreConfig
  schemas
- §10 Media — Braintrust Attachment refs only; Langfuse auto-extracts
  base64 from 6 attribute slots + presigned URL + sha256 dedup
- §11 Annotations — Braintrust NONE (gap); Langfuse raw API only,
  annotations = scores with queueId
- §12 Sessions/users — Braintrust metadata-only (gap); Langfuse
  first-class via propagateAttributes → unprefixed OTel user.id/session.id
- §13 Functions — Braintrust first-class Project DSL with tools/
  prompts/parameters/scorers builders + server-side invoke();
  Langfuse NONE
- §14 CLI — Braintrust eval/push/pull + --dev mode server;
  Langfuse no CLI
- §15 Auth, multi-project, orgs — Braintrust per-call state for
  multi-tenant; Langfuse one-client-per-project + SCIM
- §16 Configuration, secrets — NEITHER ships managers; agenta Python
  SDK is the outlier
- §17 Cost — both server-side; Braintrust normalizes token metrics,
  Langfuse untyped USD bag
- §18 Query/read-back — Braintrust ObjectFetcher AsyncIterable + BTQL;
  Langfuse typed api.*.* + Cube-style metric DSL

Headline non-tracing findings:
- Braintrust has surfaces Langfuse lacks (functions/tools/invoke, CLI,
  --dev mode, disk-cached prompts), and Langfuse has surfaces Braintrust
  lacks (annotations, sessions/users, manager pattern)
- Both punt on secrets/config — agenta Python SDK is the outlier here;
  RFC recommendation is to defer ConfigManager/SecretsManager port to
  TS SDK v2
- Both ship eval orchestration in-SDK; Braintrust's Evaluator field
  set is much richer (~25 fields with rolling concurrency + byte
  backpressure) than Langfuse's batched allSettled approach
- Langfuse silently drops failed evaluator results — agenta should
  surface as dataType:"ERROR" scores instead

RFC decisions expanded from 17 → 75 across all surfaces. 12 ranked
differentiation opportunities (up from 6).
… backend-fixable analysis

Three artifacts shipped:

1. Broken baseline at examples/node/observability-mastra/ — same wiring
   shape as the published v4 quickstart, but emits ZERO traces for
   Mastra agents. README walks through both broken paths (vendored
   AI SDK noopTracer + non-OTel ObservabilityBus) and points to the
   fix.

2. Working PoC at web/examples/mastra-node/ — AgentaMastraExporter
   (~150 lines) subscribes to Mastra's ObservabilityBus and re-emits
   spans through globally-registered OTel. 4/4 canonical assertions
   PASS. Clean 4-level Mastra tree lands in Agenta with
   inputs/outputs/userId/sessionId propagated.

3. Backend-fixable analysis (Option C complete) — full matrix in
   summary.md "Backend-fixable subset (AI SDK)" section, plus
   per-entry annotation on all 16 pain entries (3 yes, 1 partial,
   12 no). Lets future analysis filter the wedge by axis.

Adds:

- Pain entries P-MASTRA-01/02/03 + new `mastra` framework prefix
- "Strategic alternative: backend-led integration" section in summary
  (Mahmoud-aligned framing)
- "Backend-fixable subset (AI SDK)" section in summary
- Cross-cutting takeaway #6 (Mastra requires different integration shape)
- Phase 6 status tracker row
- Mastra framework added to pain-log schema + validator

Recommended implementation order from the analysis:

1. Backend fixes first (P-NODE-01 + P-NODE-03 + P-COMMON-01) — 3 wins,
   broadest benefit, sharpens JS SDK scope from 13 things to 10 in 2
   categories.
2. AI SDK lifecycle wrapper as JS SDK v0 — single mechanism solves the
   3 highest-severity JS-side silent failures.
3. Edge runtime helper as JS SDK v0.1
4. Framework adapter shims as JS SDK v0.2+

Outstanding scope decision deferred: remaining 6 Mastra-in-framework
spike apps (Mastra inside Next.js, Pages, TanStack, Nuxt) not pursued
— strategic answer doesn't depend on framework-specific Mastra data
since Mastra's ObservabilityBus operates above the framework HTTP
layer.
…K-direct apps

Wires Braintrust as a second OTLP destination alongside Agenta so the
SAME source span data fans out to both backends. Lets us directly compare
how each LLM observability platform displays IDENTICAL trace input — no
application code changes required, just an additional SpanProcessor in
each instrumentation file.

Apps wired (8):

  examples/node/observability-vercel-ai/   (root v4)
  web/examples/node-vercel-ai-v6/          (Phase 1)
  web/examples/nextjs-app-router-raw/      (Phase 2a)
  web/examples/nextjs-app-router-vercel/   (Phase 2b)
  web/examples/nextjs-pages-router-raw/    (Phase 3a)
  web/examples/nextjs-pages-router-vercel/ (Phase 3b)
  web/examples/react-tanstack-start/       (Phase 4)
  web/examples/nuxt-raw/                   (Phase 5)

Pattern: each app's instrumentation file conditionally appends a second
SpanProcessor wrapping an OTLPTraceExporter pointed at Braintrust's
endpoint (https://api.braintrust.dev/otel/v1/traces) with Bearer auth
+ x-bt-parent header naming the destination project. Reads
BRAINTRUST_API_KEY + BRAINTRUST_OTLP_URL from env. When the key is
unset, behaviour matches the original baseline.
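
The conditional wiring described above can be sketched as a pure helper
that decides whether a second destination exists at all. This is only an
illustration: BRAINTRUST_PROJECT, the "project_name:" prefix, the
fallback project name, and the helper name are assumptions, and the
spike apps' instrumentation files may be structured differently.

```typescript
// Sketch only: derives settings for an optional second OTLP destination.
// BRAINTRUST_API_KEY and BRAINTRUST_OTLP_URL come from the text above;
// BRAINTRUST_PROJECT and the "project_name:" prefix are assumptions.
type Env = Record<string, string | undefined>;

interface OtlpDestination {
    url: string;
    headers: Record<string, string>;
}

const DEFAULT_BRAINTRUST_URL = "https://api.braintrust.dev/otel/v1/traces";

function braintrustDestination(env: Env): OtlpDestination | undefined {
    const apiKey = env.BRAINTRUST_API_KEY;
    // Key unset: return nothing so the caller keeps the Agenta-only baseline.
    if (!apiKey) return undefined;

    return {
        url: env.BRAINTRUST_OTLP_URL ?? DEFAULT_BRAINTRUST_URL,
        headers: {
            Authorization: `Bearer ${apiKey}`,
            // Hypothetical project name used when none is configured.
            "x-bt-parent": `project_name:${env.BRAINTRUST_PROJECT ?? "agenta-spike"}`,
        },
    };
}
```

A caller would pass the returned url and headers to an OTLP trace
exporter and append the wrapping SpanProcessor only when the result is
defined.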

For @vercel/otel apps (Phase 2b + 3b): wrap both exporters in
BatchSpanProcessor (matching @vercel/otel's default traceExporter
behaviour) — see the accidental-finding section below.

Assertion results after wiring (29/32 PASS):
- Phase 1 / 2a / 3a / 3b / 4 / 5: 4/4 PASS each
- Phase 2b: 3/4 (a2 expected FAIL — P-APP-VERCEL-01 preserved)
- Root v4: 1 trace exported (no canonical assertions)

Accidental finding: switching @vercel/otel from `traceExporter: x`
(default BatchSpanProcessor) to `spanProcessors: [SimpleSpanProcessor(x)]`
sidesteps P-APP-VERCEL-01 entirely — assertion-2 went from FAIL to PASS
when I first wired Phase 2b with SimpleSpanProcessor. SimpleSpanProcessor
exports each span immediately when it ends, before BatchSpanProcessor's flush window
would have closed. This is a legitimate user-visible workaround for
@vercel/otel users hitting P-APP-VERCEL-01 today (trade-off: per-span
HTTP round-trip latency, ~50-200ms per call). Reverted to
BatchSpanProcessor in both vercel-otel variants to preserve the baseline
pain reproduction. Documented in the P-APP-VERCEL-01 entry's notes as
an interim workaround.
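
A toy model of why the processor choice matters, assuming the failure
mode is that buffered spans are never flushed before the request's
runtime stops. These classes are illustrative stand-ins, not the
OpenTelemetry implementations, and the span names are only examples.

```typescript
// Toy model, not the OpenTelemetry implementation: it only contrasts
// "export on span end" with "buffer until a later flush".
type Span = {name: string};

class ToySimpleProcessor {
    private exported: Span[];
    constructor(exported: Span[]) {
        this.exported = exported;
    }
    onEnd(span: Span): void {
        // Hand the span to the exporter as soon as it ends.
        this.exported.push(span);
    }
}

class ToyBatchProcessor {
    private exported: Span[];
    private buffer: Span[] = [];
    constructor(exported: Span[]) {
        this.exported = exported;
    }
    onEnd(span: Span): void {
        // Hold the span until flush() runs (timer-driven in a real batcher).
        this.buffer.push(span);
    }
    flush(): void {
        this.exported.push(...this.buffer);
        this.buffer = [];
    }
}

// Simulate a request whose runtime stops before any flush happens.
function runWithoutFlush(processor: {onEnd(span: Span): void}): void {
    processor.onEnd({name: "ai.streamText.doStream"});
    processor.onEnd({name: "ai.streamText"});
}

const simpleOut: Span[] = [];
const batchOut: Span[] = [];
runWithoutFlush(new ToySimpleProcessor(simpleOut));
runWithoutFlush(new ToyBatchProcessor(batchOut));
```

In this model the simple variant has handed over both spans while the
batch variant has handed over none, which is the shape of the trade-off
noted above: fewer lost spans in exchange for one export per span.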

Strategic implication: customers on Agenta who want side-by-side
comparison with Braintrust can wire it in 8-10 lines of additional
instrumentation code with no application-level changes. The same pattern
extends to any OTLP-compatible LLM observability backend (Langfuse,
Honeycomb, Datadog OTel).

Not covered: Mastra (`web/examples/mastra-node/`) — no Braintrust key
provided and the integration shape differs (would need a
Braintrust-flavoured Mastra BaseExporter, or wrapping Mastra's vendored
AI SDK with `wrapAISDK`). Tracked as open work.

Adds:
- Phase 7 section in summary.md
- Interim-workaround note on P-APP-VERCEL-01 (SimpleSpanProcessor sidestep)
- Status tracker row for Phase 7
- BRAINTRUST_API_KEY + BRAINTRUST_OTLP_URL placeholders in every .env.example
…anProcessor) + clarify P-APP-VERCEL-01 vs P-PAGES-VERCEL-01 nuance

User push-back surfaced two issues with my Phase 7 commit (a1799db):

1. The Agenta docs at docs/docs/integrations/frameworks/vercel-ai-sdk/
   observability.mdx already use SimpleSpanProcessor in the canonical
   example (line 74). I had been WebFetching live agenta.ai docs to
   answer questions about doc coverage instead of reading the repo
   source — inexcusable methodology when the source is in-tree.
2. Phase 7's Braintrust wiring for @vercel/otel apps (2b + 3b) used
   BatchSpanProcessor "to preserve the failing baseline" — but the
   baseline I should be preserving is what the docs recommend, not
   what @vercel/otel defaults to.

Fixes:

- Revert Phase 2b + 3b to SimpleSpanProcessor in the @vercel/otel
  spanProcessors arrays (matches docs canonical example)
- Both apps now 4/4 PASS — P-APP-VERCEL-01 stops reproducing because
  Simple sidesteps the BatchSpanProcessor flush race
- Phase 3b token check stays loosened — verified empirically that
  P-PAGES-VERCEL-01 reproduces under SimpleSpanProcessor too (the
  CompositeSpanProcessor.onEnd force-end is independent of processor
  choice)

Two cleanly-separated findings drop out:

- **P-APP-VERCEL-01 is processor-dependent**: Batch reproduces it,
  Simple sidesteps. Primarily a docs gap (Agenta docs don't have a
  @vercel/otel-specific section explaining the override).
- **P-PAGES-VERCEL-01 is processor-independent**: reproduces under
  both. Genuine JS-side wedge or backend trace-enrichment.

Pain log updates:

- P-NODE-02: docs-coverage clarification — Agenta docs already use
  Simple, the pain is for users who reach for "production-grade" Batch
  without reading Agenta docs. Strategic framing: SDK should ship a
  streaming-aware Batch processor (perf optimisation on top of the
  docs workaround), not a bug fix.
- P-APP-VERCEL-01: clarified who hits it (users following @vercel/otel
  defaults without reading Agenta docs) + minimum docs fix listed.
- P-PAGES-VERCEL-01: added "processor-choice independence verified"
  note — empirically reproduces under Simple too.

Summary doc:

- Phase 7 section rewritten with corrected 32/32 PASS results
- Methodology correction explicitly called out
- New "Doc-coverage findings to action" subsection lists 4 specific
  improvements to docs/docs/integrations/frameworks/vercel-ai-sdk/
  observability.mdx (line 74 SimpleSpanProcessor not explained, no
  @vercel/otel section, no pipeUIMessageStreamToResponse warning, no
  metadata-to-toolCall gap mention)

What I should have done from the start: read docs/docs/integrations/
frameworks/vercel-ai-sdk/observability.mdx and the docs/design/
vercel-ai-adapter/ directory BEFORE designing the spike. The spike's
findings are still valid — they reveal real bugs the docs don't
address — but the framing for each pain entry is sharper when
cross-referenced against existing docs.
v3 was source-audit-only. v4 layers empirical findings from the
ts-sdk-chore/example-apps branch, where 8 spike apps fan IDENTICAL OTel
span data out to Agenta + Braintrust + Langfuse via parallel
SimpleSpanProcessors on one NodeTracerProvider.

Key changes:

New §0.5 "Empirical evidence: tri-export across 8 spike apps":
- Wiring cost: ~8 LoC for +Braintrust, ~12 LoC for +Langfuse on top of
  agenta baseline. Tri-export pattern works mechanically
- Verified trace counts via REST API for all 8 apps (2-33 events each
  on Braintrust, 1-12 traces each on Langfuse)
- Side-by-side: same ai.streamText span, three backends. Same data,
  three rendering choices
- P-BRAINTRUST-01: silent data-plane mismatch. US-default OTLP endpoint
  silently swallows spans for EU-plane orgs; the endpoint still answers
  HTTP 200, so SimpleSpanProcessor never surfaces a failure. Took ~3 hours
  of empirical investigation to find
- P-LANGFUSE-01: no server-side scope filter. Earlier v3 claim
  ("Langfuse drops non-LLM scope") was wrong — that filter is
  @langfuse/otel JS-SDK-only. Raw OTLP to Langfuse stores everything,
  same as Agenta

§5 updates: call out the data-plane gotcha (P-BRAINTRUST-01) inline
with Braintrust wire description. Clarify Langfuse's isDefaultExportSpan
filter is JS-SDK-only with P-LANGFUSE-01 citation.

§17 update: empirical "Agenta returns ag.metrics.costs = {}" gap with
real numbers. Token data lands; just needs server-side rollup.

§22 (Differentiation opportunities) expanded from 12 → 15 ranked:
- Added "Multi-backend delivery health verification" (P-BRAINTRUST-01)
- Added "Server-side scope filter" (P-LANGFUSE-01 → portable to agenta
  backend, no JS SDK required)
- Added "Cost computation server-side from gen_ai.usage.*"

§21 (RFC decisions) expanded with empirical-driven D76-D80:
- D76: surface per-destination delivery health (P-BRAINTRUST-01 lesson
  generalizes to any future agenta multi-backend pipeline)
- D77: document data-plane / multi-region selection loudly
- D78: scope-filter logic belongs at ingest/render, NOT as a JS
  SpanProcessor (cleaner architecture than Langfuse SDK approach)
- D79: case for @agenta/sdk-tracing pivots from "wraps OTel
  ergonomically" (insufficient differentiation) to "hides config
  gotchas that silently lose data + solves what raw OTLP can't"
- D80: cost computation server-side, populated from gen_ai.usage.*
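
To make D78 concrete, here is a minimal sketch of a scope filter applied
at render time rather than in a JS SpanProcessor. The span shape, the
"ai" scope name, and the rule for what counts as an LLM span are all
assumptions for illustration; this is not Agenta's or Langfuse's logic.

```typescript
// Hypothetical render-side filter. Nothing is dropped at ingest; the
// filter only decides which stored spans a view shows by default.
interface IngestedSpan {
    name: string;
    scopeName: string; // instrumentation scope, e.g. "ai" or "next.js"
    attributes: Record<string, unknown>;
}

// Assumed scope name for AI SDK spans.
const LLM_SCOPES = new Set(["ai"]);

function isLlmSpan(span: IngestedSpan): boolean {
    if (LLM_SCOPES.has(span.scopeName)) return true;
    // Fall back to semantic attributes so other emitters still match.
    return Object.keys(span.attributes).some((key) => key.startsWith("gen_ai."));
}

function spansToRender(spans: IngestedSpan[]): IngestedSpan[] {
    return spans.filter(isLlmSpan);
}
```

Keeping the filter on the read path means a wrong rule hides spans but
never loses them, which is the architectural point D78 makes.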

Methodology note clarifies the vendor-SDK vs raw-OTLP distinction —
many "features" in §§3-13 only fire when the user installs the vendor
SDK. The raw OTLP path is what agenta will produce.

Appendix A updated with v3→v4 corrections table.
…e-analysis

Brings the spike's Phase 5-8 work and design-doc updates into the
competitive-analysis branch so the v4 doc cross-references resolve
locally on a single branch.

Incoming from example-apps:
- Phase 5: web/examples/nuxt-raw/ — Nuxt 4 + Nitro plugin spike app
- Phase 6: web/examples/mastra-node/ + AgentaMastraExporter PoC at
  src/agenta-exporter.ts (~150 lines bridging Mastra's ObservabilityBus
  to OTel)
- Phase 7: Braintrust dual-export wiring across the 8 AI-SDK-direct
  apps (instrumentation.ts files updated)
- Phase 8: Langfuse tri-export wiring across the same 8 apps
- @vercel/otel apps aligned with Agenta docs (SimpleSpanProcessor)
- examples/node/observability-mastra/ — broken-baseline reproducer at
  the published v4 quickstart layout
- docs(ts-sdk-tracing): P-COMMON-01 (Next.js HTTP auto-instrumentation
  buries AI SDK spans), Langfuse claim sourcing, P-PAGES-VERCEL-01
  nuance clarification

Not included in this merge (still uncommitted in the example-apps
worktree at .claude/worktrees/youthful-lumiere-5a93ed):
- docs/design/ts-sdk-tracing/sdk-comparison.md (the canonical empirical
  comparison doc referenced from v4 of competitive-analysis.md)
- Working-tree edits to pain-log.md (adds P-BRAINTRUST-01 and
  P-LANGFUSE-01)
- Working-tree edits to summary.md (Phase 8 empirical correction +
  P-COMMON-01 inline correction)
- Working-tree edits to status.md

Once those are committed on example-apps, a follow-up merge or
cherry-pick will pull them into this branch.

Conflicts: none. The competitive-analysis branch only touched
docs/design/ts-sdk/ (sibling directory to example-apps' work in
docs/design/ts-sdk-tracing/ and web/examples/).

docs/design/ts-sdk/competitive-analysis.md (v4, this branch's only
contribution) is preserved unchanged through the merge.
…rison + empirical pain entries

Phase 8 lands the Langfuse tri-export wiring across all 8 spike apps
and surfaces two new pain entries from the empirical verification
that motivated this work.

Wiring (8 of 8 apps):
- Add LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_BASE_URL env
  vars to every .env.example
- Append a conditional third SpanProcessor(OTLPTraceExporter(langfuse))
  to each instrumentation file (Node, App Router raw + @vercel/otel,
  Pages Router raw + @vercel/otel, TanStack Start, Nuxt Nitro plugin).
  Auth is Basic base64(public:secret); optional
  x-langfuse-ingestion-version: 4 header
- Refactor the if/else single-vs-dual processor blocks in @vercel/otel
  apps to a single conditionally-appended spanProcessors array — same
  shape scales linearly with backend count
- Root-level examples/node/observability-vercel-ai/ baseline gets the
  same tri-export wiring so docs can match the spike apps
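
The Langfuse header construction and the conditionally-appended array
described above could look roughly like this. It is a sketch under
assumptions: the OTLP path appended to LANGFUSE_BASE_URL and the helper
names are not taken from the spike code.

```typescript
// Sketch only. Env names match the text; the OTLP path and helper names
// are assumptions for illustration.
type Env = Record<string, string | undefined>;

interface OtlpDestination {
    name: string;
    url: string;
    headers: Record<string, string>;
}

function langfuseDestination(env: Env): OtlpDestination | undefined {
    const publicKey = env.LANGFUSE_PUBLIC_KEY;
    const secretKey = env.LANGFUSE_SECRET_KEY;
    const baseUrl = env.LANGFUSE_BASE_URL;
    if (!publicKey || !secretKey || !baseUrl) return undefined;

    // Basic auth over "public:secret". btoa is enough for ASCII keys; the
    // spike apps use Buffer.from(...).toString("base64") instead.
    const token = btoa(`${publicKey}:${secretKey}`);
    return {
        name: "langfuse",
        // Assumed path; a trailing slash on the base URL is trimmed first.
        url: `${baseUrl.replace(/\/$/, "")}/api/public/otel/v1/traces`,
        headers: {
            Authorization: `Basic ${token}`,
            "x-langfuse-ingestion-version": "4",
        },
    };
}

// One array, conditionally appended: each extra backend is one more push.
function destinations(agenta: OtlpDestination, env: Env): OtlpDestination[] {
    const list = [agenta];
    const langfuse = langfuseDestination(env);
    if (langfuse) list.push(langfuse);
    return list;
}
```

Each entry would then be wrapped in its own exporter and SpanProcessor,
so adding a backend never touches the entries already in the array.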

Empirical verification doc:
- docs/design/ts-sdk-tracing/sdk-comparison.md — full side-by-side
  comparison of how Agenta, Braintrust, Langfuse handle IDENTICAL OTel
  span data. Trace counts verified via REST API for all 8 apps.
  Project<>app mapping table. Wiring cost table (~8 LoC Braintrust,
  ~12 LoC Langfuse on top of Agenta baseline). Side-by-side
  ai.streamText field comparison from real API responses

New pain entries (in pain-log.md):
- P-BRAINTRUST-01: silent data-plane mismatch. Braintrust's US-default
  OTLP endpoint auto-creates projects but silently rejects span
  payloads when the user's org is on EU plane. Cost ~3 hours of
  empirical investigation. OTel exporters receive HTTP 200, so
  SimpleSpanProcessor gets no failure signal. Generalizable lesson: multi-backend OTel
  pipelines need REST-API delivery verification, not just span-export
  success
- P-LANGFUSE-01: no server-side scope filter. Earlier docs (including the
  P-COMMON-01 inline citation) claimed Langfuse "drops non-LLM scope
  spans server-side". Empirically wrong. On raw OTLP, Langfuse stores
  every span including Next.js HTTP wrappers with null input/output.
  The isDefaultExportSpan filter lives inside @langfuse/otel JS-SDK,
  not at ingest. Strengthens the backend-fix case for P-COMMON-01
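
The delivery-verification lesson from P-BRAINTRUST-01 can be sketched as
a small polling helper. countSpans stands in for a backend-specific REST
query and is hypothetical, as are the other names; no real Braintrust or
Langfuse endpoint is implied.

```typescript
// Sketch only: confirm what a backend stored, not what the exporter sent.
type CountSpans = (traceId: string) => Promise<number>;

interface DeliveryResult {
    delivered: boolean;
    seen: number;
    attempts: number;
}

async function verifyDelivery(
    traceId: string,
    expected: number,
    countSpans: CountSpans,
    maxAttempts = 5,
    delayMs = 1000,
): Promise<DeliveryResult> {
    let seen = 0;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        // Ingest is asynchronous, so a first miss is not yet a failure.
        seen = await countSpans(traceId);
        if (seen >= expected) return {delivered: true, seen, attempts: attempt};
        if (attempt < maxAttempts) {
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    return {delivered: false, seen, attempts: maxAttempts};
}
```

Running one such check per destination would have flagged the EU-plane
case quickly, because the span count stays at zero however many exports
report success.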

Summary doc updates:
- Phase 8 section gains an empirical-correction callout flagging the
  pre-correction trace-count claim and pointing at the verified data
  in sdk-comparison.md
- P-COMMON-01 ecosystem claim inline-corrected to reflect P-LANGFUSE-01
  findings

Status doc: Phase 8 marked complete with the corrected trace-count
context.

Lockfile update: web/pnpm-lock.yaml regenerated for the workspace
package additions across spike apps.

Not included in this commit:
- web/examples/sdk-native-spike/ (untracked) — Phase 9 in-progress
  vendor-native SDK comparison (agenta-raw-otel / braintrust-sdk /
  langfuse-sdk side-by-side scripts). Commit separately once node_modules
  is removed and the scripts stabilize.
Follow-up to the earlier merge of example-apps. Brings in the
just-landed Phase 8 commit (bc3010f) which adds:

- Langfuse tri-export wiring across all 8 spike apps (env.example +
  instrumentation files)
- docs/design/ts-sdk-tracing/sdk-comparison.md (canonical empirical
  comparison referenced by v4 of competitive-analysis.md)
- P-BRAINTRUST-01 (silent data-plane mismatch) and P-LANGFUSE-01
  (no server-side scope filter) pain entries
- Phase 8 empirical correction in summary.md
- Root-level examples/node/observability-vercel-ai/ tri-export wiring

With this merge, every cross-reference in competitive-analysis.md v4
resolves locally and the empirical evidence is reproducible from a
single branch.

Conflicts: none expected (competitive-analysis branch only touched
docs/design/ts-sdk/ which is a sibling directory).
Catches the branch up to main (220 commits behind at merge time, last
main commit: 75dc5e2 release/v0.99.9). Brings in the Fern-generated
TS client work + everything else that landed since the ts-sdk-tracing
spike branched off in early May.

Conflicts resolved:

- web/pnpm-workspace.yaml — both sides modified. Ours added the
  `examples/*` + `examples/.shared/*` workspace globs for the spike
  apps; main added an `allowBuilds:` config block listing per-package
  build allowances. Union of both kept.

- web/pnpm-lock.yaml — 219 conflict markers, mass conflicts throughout.
  Took main's version. The spike apps' workspace packages are not
  reflected in the lockfile; running `pnpm install` from web/ will
  regenerate the lockfile to include them. The spike apps don't need
  to run to feed the RFC (the docs + sdk-comparison empirical data
  carry the findings), so this is an acceptable trade-off for the
  RFC office-hours session.

Branch is now self-contained with:
- Latest main (Fern-generated TS SDK client + all recent platform work)
- Full ts-sdk-tracing spike (8 apps with tri-export wiring)
- 19 pain entries including the v4 empirical additions (P-BRAINTRUST-01,
  P-LANGFUSE-01)
- Canonical sdk-comparison.md (empirical Agenta vs Braintrust vs Langfuse)
- v4 competitive-analysis.md (1709 lines, 80+ RFC decisions)

If running the spike apps locally is needed, `cd web && pnpm install`
to regenerate the lockfile with the spike apps' workspace entries.
The previous commit (bc3010f) introduced 3-line wraps on Buffer.from(...).toString("base64")
in 4 instrumentation files. Project prettier config wants 1-line. Pre-commit's
prettier --write keeps re-flattening them on every commit attempt, which then
blocks subsequent commits via pre-commit's stash/restore conflict path.

No behavior change — pure formatting.
…hared/

The verification harness at web/examples/.shared/agenta-verify/ existed only
as working-directory state in this worktree since Phase 0 of the ts-sdk-tracing
spike. 8 spike-app package.json files declare "@agenta/spike-verify": "workspace:*"
and 32 test files import verifyTrace from it, but the package was never on any
git ref — git status never surfaced it because the root .gitignore's `.*` rule
silently swallowed the dot-prefixed directory.

The dot-prefix is intentional (it carves the workspace out of the default
`examples/*` glob per the comment in web/pnpm-workspace.yaml), but no exception
was ever added to .gitignore to match. This commit:

  - adds an un-ignore block in .gitignore for web/examples/.shared/
    while keeping node_modules/, .turbo/, dist/, .vite/ ignored
  - commits the harness sources (1,157 LoC across 8 files):
      package.json, tsconfig.json, vitest.config.ts
      src/api.ts, src/errors.ts, src/index.ts, src/verify.ts
      test/verify.test.ts

Discovered when a sibling worktree (ts-sdk-chore/rfc, forked off
ts-sdk-chore/competitive-analysis) failed pnpm install with the workspace:*
reference unresolved — confirming the package never reached any branch.
Three scripts that exercise the same streamText("Reply with: ok.") call
against gpt-4o-mini using each backend's recommended SDK shape (not just
raw OTLP). Empirical companion to docs/design/ts-sdk-tracing/sdk-comparison.md
§ "Ergonomic-by-ergonomic, six implementations side-by-side".

  scripts/agenta-raw-otel.ts — baseline: raw @opentelemetry/* + AI SDK's
    experimental_telemetry flag. 9 functional setup statements. Hand-rolled
    trace URL.
  scripts/langfuse-sdk.ts    — langfuse-node SDK: 1-statement Langfuse client,
    observeOpenAI() wrapping the OpenAI client (Langfuse generation observation
    auto-emitted, NO ai.streamText interception — that requires @langfuse/vercel),
    trace.span() / trace.end() imperative custom-span pattern, first-class
    userId/sessionId/tags/metadata on trace constructor, trace.getTraceUrl().
  scripts/braintrust-sdk.ts  — braintrust SDK: initLogger() + wrapAISDK(ai)
    drop-in (no experimental_telemetry flag needed), traced(fn, {name})
    functional wrapper, currentSpan().log({metadata, tags}) flat-KV semantic
    context, currentSpan().link() trace URL.

Empirical findings driving doc updates:

  - Langfuse computes cost server-side at ingest: resolves "gpt-4o-mini"
    to date-versioned "gpt-4o-mini-2024-07-18", looks up pricing, populates
    calculatedTotalCost: 3e-06 + per-token inputPrice/outputPrice
  - Braintrust auto-emits time_to_first_token: 0.057s without being asked
  - Langfuse observeOpenAI ONLY wraps the OpenAI client — to trace AI SDK's
    streamText calls in Langfuse, separate @langfuse/vercel package needed
  - Setup LoC empirically counted: Langfuse 1, Braintrust 2, Agenta-raw 9
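
The server-side cost computation observed in Langfuse can be
approximated as follows. This is a hedged sketch: the prices are
placeholder numbers rather than vendor pricing, the attribute names
assume the gen_ai.* semantic conventions, and only the alias entry
mirrors the resolution step reported above.

```typescript
// Sketch only. Prices are placeholders for illustration, not real pricing.
interface Pricing {
    inputPerToken: number;
    outputPerToken: number;
}

// Alias step: short model name to date-versioned name.
const MODEL_ALIASES: Record<string, string> = {
    "gpt-4o-mini": "gpt-4o-mini-2024-07-18",
};

const PRICES: Record<string, Pricing> = {
    // Placeholder numbers only.
    "gpt-4o-mini-2024-07-18": {inputPerToken: 1e-7, outputPerToken: 5e-7},
};

function computeCost(attributes: Record<string, unknown>): number | undefined {
    const model = String(attributes["gen_ai.request.model"] ?? "");
    const pricing = PRICES[MODEL_ALIASES[model] ?? model];
    const input = attributes["gen_ai.usage.input_tokens"];
    const output = attributes["gen_ai.usage.output_tokens"];
    // Without a known model and numeric usage there is nothing to compute.
    if (!pricing || typeof input !== "number" || typeof output !== "number") {
        return undefined;
    }
    return input * pricing.inputPerToken + output * pricing.outputPerToken;
}
```

Because the inputs are span attributes that already arrive over raw
OTLP, a computation of this kind needs no JS SDK on the client side.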

.env stays ignored (real OpenAI/Agenta/Langfuse/Braintrust keys for the
re-run); .env.example is the redacted placeholder.
Brings in:
- @agenta/spike-verify harness at web/examples/.shared/agenta-verify/
  (unblocks workspace install for all 8 spike apps)
- web/examples/sdk-native-spike — vendor-native SDK comparison scripts
- prettier autofix from pre-commit hook
Adds two design docs for the Agenta TypeScript SDK v1:

- docs/design/ts-sdk/rfc.md — long-form RFC (audit trail: reasoning chain,
  rejected alternatives, locked decisions inherited from prior spike work).
- docs/design/ts-sdk/proposal.md — short-form proposal for review. Uses
  descriptive English issue names instead of spike pain-log codes; pieces
  of it are shareable on their own without cross-referencing the pain log.

Re-runs the four Next.js spike apps on Next.js 16.2.6 (Turbopack default
builder) and updates artifacts accordingly:

- web/examples/nextjs-{app,pages}-router-{raw,vercel}: bump next to 16.2.6,
  add @opentelemetry/sdk-trace-base as an explicit dep on the two @vercel/otel
  apps (Turbopack's stricter module resolution exposed the transitive miss),
  add Next-generated next-env.d.ts, refresh assertion-1 tests against the
  consumer-facing ag.metrics.tokens.cumulative.* path.
- web/examples/nextjs-pages-router-raw/pages/api/edge-chat.ts: new probe to
  verify P-PAGES-RAW-01 on Next 16 — the build-time barrier is gone with
  Turbopack; runtime tracing problem (P-APP-RAW-01) persists.

Pain log updates (docs/design/ts-sdk-tracing/pain-log.md):

- P-PAGES-RAW-01: appended Next.js 16 RESOLVED-at-build-time note; the runtime
  still emits 0 spans.
- P-PAGES-VERCEL-01: appended Next.js 16 re-verification — both parent AND
  child .doStream have empty tokens (originally we believed the child
  retained them). Fix recommendation flips: backend rollup can't recover
  what no span carries; promote the JS-side agentaPipeUIMessageStreamToResponse
  helper from contingency to v1 deliverable.
- P-COMMON-02: new entry — Turbopack's stricter module resolution exposes
  missing transitive @opentelemetry/sdk-trace-base declarations on
  @vercel/otel apps. v1 setup-doc update.
- P-BRAINTRUST-01 / P-LANGFUSE-01: backfill **Framework:** lines, severity
  axes, and the friction + ideal-sketch code blocks to satisfy the
  pain-log schema validator (previously failing on main).
- validate-pain-log.mjs: teach the validator about `braintrust` and
  `langfuse` frameworks + their code prefixes.

Spike harness touch-ups (web/examples/.shared/agenta-verify/, plus minor
per-app fixes in node-vercel-ai-v6, react-tanstack-start, mastra-node) are
incidental cleanup picked up during the re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ardaerzin and others added 2 commits May 18, 2026 16:59
These three drafts (proposal.md, rfc.md, competitive-analysis.md) were
review material, not durable artifacts intended to live in the repo.
Removing them from the branch tip; they're saved locally outside the
repo. Earlier commits on this branch (e.g. ad7064c for proposal/rfc,
911778c and earlier for competitive-analysis) still contain them if
the history ever needs to be consulted.

The spike artifacts that the proposal was built from (pain-log,
status, summary, sdk-comparison, the comparison scripts, and all the
spike apps under web/examples/) remain on the branch unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment May 19, 2026 1:04pm

@coderabbitai

coderabbitai Bot commented May 18, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

ardaerzin and others added 2 commits May 19, 2026 12:27
…acing

New spike app under web/examples/nextjs-workflow/ verifying how
OpenTelemetry tracing behaves with Vercel Workflow DevKit + AI SDK.

Two workflow variants, both empirically verified against Agenta:

1. **Plain pattern**: a 'use workflow' function calling generateText
   from inside a 'use step'. Produces a 36-span single-trace run.
   AI SDK semantic spans (ai.generateText, ai.generateText.doGenerate,
   fetch POST openai.com) land alongside Workflow DevKit's own OTel
   instrumentation (workflow.start, WORKFLOW, STEP, step.execute,
   world.events.create *, queue.publish, workflow.replay, etc.). All
   share one trace_id, including across the /.well-known/workflow/v1/step
   HTTP boundary that Workflow DevKit uses internally to dispatch step
   execution.

2. **DurableAgent pattern**: a multi-LLM agent loop with one tool call.
   Produces a 135-span single-trace run. Same automatic trace-context
   propagation. Caveat: DurableAgent calls AI SDK at the
   LanguageModelV3.doStream provider interface directly, bypassing
   streamText/generateText — so the standard ai.* / gen_ai.* semantic
   spans DO NOT appear. Users see Workflow's STEP doStreamStep span and
   the raw HTTP fetch to OpenAI, but not the AI-SDK-shaped trace tree.
   Flagged as a v2 follow-up if usage grows.

Two setup traps logged in the spike code for any future Workflow user:

- 'workflow' and '@workflow/ai' must NOT be in Next's
  serverExternalPackages. Externalizing them hides their 'use step'
  directives from the Workflow SWC plugin and produces
  StepNotRegisteredError at runtime (specifically:
  step//@workflow/ai@.../doStreamStep, step//@workflow/ai/agent@.../closeStream).
  The comment in next.config.ts documents this.

- DurableAgent's model option needs an actual model object from
  @workflow/ai/<provider> (e.g. openai('gpt-4o-mini') from
  @workflow/ai/openai), not the Gateway-style 'openai/gpt-4o-mini'
  string (which requires GATEWAY_API_KEY). Likewise, the option is
  'instructions:', not 'system:'.
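
The first trap lends itself to a cheap guard. The sketch below checks a
Next.js-style config object for the two packages that must stay bundled;
the helper name and the reduced config shape are assumptions, and the
spike itself documents the rule as a comment in next.config.ts rather
than as code.

```typescript
// Sketch only: a guard a test could run against the Next.js config.
interface NextConfigLike {
    serverExternalPackages?: string[];
}

const MUST_STAY_BUNDLED = ["workflow", "@workflow/ai"];

function findExternalizedWorkflowPackages(config: NextConfigLike): string[] {
    const external = config.serverExternalPackages ?? [];
    // Anything returned here would hide 'use step' directives from the plugin.
    return MUST_STAY_BUNDLED.filter((pkg) => external.includes(pkg));
}
```

An empty result means the config avoids the StepNotRegisteredError trap
described above; a non-empty result names the offending entries.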

Strategic finding: the §2.4 HTTP-wrapper-masquerades-as-the-trace-root
problem gets proportionally worse with Workflow DevKit — a single run
produces 35-135+ framework spans before the LLM call, so the entry
HTTP span (which spans[0] picks) is even less informative as a trace
root. This raises the value of the backend extract_root_span upgrade
(§5.1 in the proposal).
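
One possible shape for a more informative trace root, offered purely as
a hypothetical heuristic. It is not the backend's extract_root_span
implementation, and the "ai." name prefix rule is an assumption.

```typescript
// Hypothetical heuristic for choosing which span to display as the root.
interface SpanSummary {
    name: string;
}

function pickDisplayRoot(spans: SpanSummary[]): SpanSummary | undefined {
    // Prefer the first AI SDK span; otherwise keep the spans[0] behaviour.
    const llmSpan = spans.find((span) => span.name.startsWith("ai."));
    return llmSpan ?? spans[0];
}
```

For the plain pattern this would surface ai.generateText instead of the
entry HTTP span. For the DurableAgent pattern no ai.* span exists, so
the fallback applies and a different rule would be needed.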

Also touches:
- web/examples/nextjs-pages-router-raw/next-env.d.ts: regenerated by
  Next.js 16 (path moved from .next/dev/types to .next/types).
- web/pnpm-lock.yaml: workflow@4.2.4 + @workflow/ai@4.1.2 +
  @workflow/next@4.0.5 deps added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ardaerzin and others added 2 commits May 19, 2026 14:44
… to Agenta

New spike app under web/examples/n8n-self-hosted/ verifying how n8n's
first-party OpenTelemetry support behaves when pointed at Agenta. n8n
2.20.11 bundles @opentelemetry/sdk-node 0.213 and registers an
OtelService on startup when N8N_OTEL_ENABLED=true.

What lands in Agenta (verified empirically against a webhook-triggered
workflow with two nodes — Webhook + HTTP Request):

- workflow.execute span (one per run) with attributes:
  n8n.workflow.{id, name, node_count, version_id}
  n8n.execution.{id, mode, status, is_retry, error_type}

- node.execute span (one per node, nested under workflow.execute) with:
  n8n.node.{id, name, type, type_version, items.input, items.output}

- service.name configurable via N8N_OTEL_EXPORTER_SERVICE_NAME.

What n8n does NOT emit:

- No gen_ai.* / ai.* semantic attributes anywhere.
- No http.* on HTTP Request nodes despite literal HTTP calls.
- No prompt/model/completion/token capture — only items.input/output
  integer counts.

n8n's OtelService only wraps workflow + per-node lifecycle; it doesn't
instrument code inside node implementations. So an HTTP Request to
OpenAI, or a LangChain-backed AI node, produces one node.execute span
with framework metadata and zero LLM-call detail. Same shape as the
Mastra finding (§1.10 of the proposal): the framework emits OTel, but
LLM-level visibility needs a separate adapter.

Setup gotchas documented in EMPIRICAL_FINDINGS.md (for any future user
trying the same setup):

- n8n reads its OWN N8N_OTEL_* env vars; standard OTEL_* vars are
  silently ignored.

- buildOtlpTracesUrl is a literal endpoint + path concatenation, so
  Agenta's ?project_id= query parameter must be stuffed into
  N8N_OTEL_EXPORTER_OTLP_TRACING_PATH:
    N8N_OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal
    N8N_OTEL_EXPORTER_OTLP_TRACING_PATH=/api/otlp/v1/traces?project_id=<uuid>

- Auth via comma-separated key=value pairs in
  N8N_OTEL_EXPORTER_OTLP_HEADERS.

- From inside Docker, AGENTA_HOST must resolve to the host's loopback
  via host.docker.internal (Docker Desktop) or extra_hosts:
  host-gateway (Linux Docker).
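
The concatenation gotcha is easier to see in code. The sketch below
mirrors the behaviour described above and is not n8n's source; the
"ApiKey" authorization scheme and the helper names are assumptions.

```typescript
// Sketch of the described behaviour, not n8n's implementation.

// Literal concatenation: no URL parsing, so a query string placed in the
// path survives unchanged.
function buildTracesUrl(endpoint: string, path: string): string {
    return endpoint + path;
}

// Builds the env vars named in the text; the "ApiKey" scheme is assumed.
function n8nOtelEnv(agentaHost: string, projectId: string, apiKey: string) {
    return {
        N8N_OTEL_ENABLED: "true",
        N8N_OTEL_EXPORTER_OTLP_ENDPOINT: agentaHost,
        N8N_OTEL_EXPORTER_OTLP_TRACING_PATH: `/api/otlp/v1/traces?project_id=${projectId}`,
        // Comma-separated key=value pairs; a single pair here.
        N8N_OTEL_EXPORTER_OTLP_HEADERS: `Authorization=ApiKey ${apiKey}`,
    };
}
```

Joining the endpoint and tracing path from this object reproduces the
full Agenta OTLP URL, including the project_id query parameter.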

Implication for the v1 TypeScript SDK proposal: n8n users are covered
'for free' at the workflow level — no Agenta-specific code, just the
env vars above. LLM-call detail is the gap, addressable by a v2+ docs
recipe or an OpenInference-style LangChain JS instrumentation deployed
as a custom n8n node. Not a v1 SDK deliverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds source-level evidence to EMPIRICAL_FINDINGS.md that the absence of
gen_ai.* / ai.* / http.* attributes in n8n's OTel output is by design in
n8n, not a setup mistake on our side.

Four source-level checks against the installed n8n image (v2.20.11):

1. OtelService.init() in dist/modules/otel/otel.service.js instantiates
   NodeSDK with no 'instrumentations' array — so zero OTel
   auto-instrumentation is registered.

2. grep for 'registerInstrumentations' / 'instrumentations:' across
   all of /usr/lib/node_modules/n8n/dist returns zero matches.

3. @opentelemetry/instrumentation-http and related packages ARE
   bundled (transitive peer deps from LangChain), but n8n never
   imports them.

4. OtelLifecycleHandler.onNodeEnd exposes a customAttributes hook via
   ctx.taskData.metadata.tracing — a per-node opt-in surface. But
   @n8n/n8n-nodes-langchain (the AI nodes package) never writes to
   metadata.tracing, so even the purpose-built OpenAI Chat Model node
   produces the same shape as our HTTP Request: n8n.node.type +
   items.input/output counts, nothing else.

Three paths forward, none of them v1 SDK work:
- Upstream PR adding metadata.tracing writes inside the LangChain AI
  nodes
- A custom n8n node that wraps AI calls and sets metadata.tracing.gen_ai_*
- A process-level OTel auto-instrumentation for LangChain JS layered
  on top

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>