feat(boot): self-test OpenRouter at startup — fail fast on env typos #308

Merged
rz1989s merged 1 commit into main from fix/issue-293-boot-self-test-openrouter on Jun 2, 2026

rz1989s commented Jun 2, 2026

Summary

Closes #293. Adds a 2-token ping to OpenRouter as the third boot step (after loadNetworkConfig, before getDb). Catches the two silent-outage modes from frontier_sip_17 at boot instead of letting them manifest as empty assistant responses on every chat turn:

  1. SIPHER_MODEL typo: getSipherModel() throws synchronously when pi-ai's registry doesn't know the model (e.g. the hyphen-form claude-sonnet-4-6 instead of the dot-form).
  2. Bad OPENROUTER_API_KEY: the ping returns 401 / 403 and we surface that here.

Both now produce a Docker restart-loop with a loud error in the container logs instead of accepting traffic and degrading silently.
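
The first failure mode can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the `KNOWN_MODELS` set, the `lookupModel` helper, and the `example/model-1.0` ID are invented for illustration, and pi-ai's real registry API is not shown in this PR.

```typescript
// Hypothetical stand-in for a model registry; the real lookup lives in pi-ai.
const KNOWN_MODELS = new Set<string>(["example/model-1.0"]);

interface ModelRef {
  id: string;
}

// Throws synchronously when the configured ID is not registered,
// so a typo surfaces before any network call is attempted.
function lookupModel(configuredId: string | undefined): ModelRef {
  const id = (configuredId ?? "").trim();
  if (!KNOWN_MODELS.has(id)) {
    throw new Error(
      `Unknown model "${id}" (check SIPHER_MODEL for hyphen-vs-dot typos)`,
    );
  }
  return { id };
}
```

Because the lookup is synchronous, calling it during boot turns a typo into an immediate crash rather than a per-request failure.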

New module

packages/agent/src/boot/self-test.ts:

selfTestOpenRouter({ timeoutMs }?: SelfTestOptions): Promise<void>
  • Returns early if SIPHER_SKIP_BOOT_SELF_TEST=true.
  • getSipherModel() throws on invalid SIPHER_MODEL before any fetch.
  • Validates OPENROUTER_API_KEY is set.
  • POSTs https://openrouter.ai/api/v1/chat/completions with max_tokens: 2. AbortController caps budget at 5000ms.
  • On non-2xx, throws OpenRouter self-test failed: HTTP <status> - <body slice>. Body slice capped at 300 chars (no secret leakage; OpenRouter doesn't echo keys in error envelopes).
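
The steps above could be sketched roughly as follows. This is not the code in self-test.ts: the `SelfTestDeps` shape, the injected `fetchImpl` and `resolveModelId` (standing in for getSipherModel().id), the "ping" message, and the missing-key error text are assumptions made so the sketch is self-contained and testable.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a stub.
type FetchLike = (
  url: string,
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
    signal: AbortSignal;
  },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

interface SelfTestDeps {
  env: Record<string, string | undefined>;
  fetchImpl: FetchLike;
  resolveModelId: () => string; // assumed stand-in for getSipherModel().id; may throw
  timeoutMs?: number;
}

const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

async function selfTestOpenRouterSketch(deps: SelfTestDeps): Promise<void> {
  const { env, fetchImpl, resolveModelId, timeoutMs = 5000 } = deps;

  // 1. Explicit opt-out: only the exact string "true" skips.
  if (env.SIPHER_SKIP_BOOT_SELF_TEST === "true") return;

  // 2. Model lookup throws before any network I/O happens.
  const modelId = resolveModelId();

  // 3. The key must be present (error wording is an assumption).
  const apiKey = env.OPENROUTER_API_KEY;
  if (!apiKey) {
    throw new Error("OpenRouter self-test failed: OPENROUTER_API_KEY is not set");
  }

  // 4. Tiny completion request, bounded by an abort timer.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(OPENROUTER_URL, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: modelId,
        max_tokens: 2,
        messages: [{ role: "user", content: "ping" }],
      }),
      signal: controller.signal,
    });

    // 5. Non-2xx: surface the status and at most 300 chars of the body.
    if (!res.ok) {
      const body = (await res.text()).slice(0, 300);
      throw new Error(`OpenRouter self-test failed: HTTP ${res.status} - ${body}`);
    }
  } finally {
    clearTimeout(timer); // never leave the timer pending after the call settles
  }
}
```

Injecting `env` and `fetchImpl` is a design choice of this sketch only; it lets each branch be checked with a stub instead of a real network call.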

Wiring

packages/agent/src/index.ts adds await selfTestOpenRouter() between loadNetworkConfig() and getDb(). Top-level await is already in use elsewhere in the file (e.g. restorePendingActions).
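
The ordering can be pictured with a small sketch. The `bootSketch` function, its `steps` object, and the injected `log` callback are hypothetical; only the position of the self-test relative to loadNetworkConfig() and getDb(), and the shape of the success log line, come from this PR.

```typescript
type Step = () => void | Promise<void>;

// Runs the boot steps in order; a throw from selfTest aborts boot
// before the database step is ever reached.
async function bootSketch(
  steps: { loadNetworkConfig: Step; selfTest: Step; getDb: Step },
  log: (line: string) => void,
): Promise<void> {
  await steps.loadNetworkConfig();

  const startedAt = Date.now();
  await steps.selfTest();
  log(`OpenRouter: self-test pass (${Date.now() - startedAt}ms)`);

  await steps.getDb();
}
```

If the rejection is left unhandled at the top level, the process exits non-zero, which is what lets the container's restart policy take over.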

Skipped automatically in

  • vitest (vitest.config.ts env block) — tests don't currently load index.ts, but defensive against future integration tests.
  • playwright (playwright.config.ts webServer env block) — e2e has no real OPENROUTER_API_KEY and chat is mocked at the network layer.
  • Any environment that explicitly sets SIPHER_SKIP_BOOT_SELF_TEST=true (documented in .env.example).
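
A minimal sketch of the opt-out check, assuming the comparison is an exact, case-sensitive match on the string "true" (the helper name `shouldSkipSelfTest` is hypothetical):

```typescript
// Only the literal string "true" disables the self-test. Any other value,
// including "false", "1", "TRUE", or an unset variable, leaves it enabled.
function shouldSkipSelfTest(env: Record<string, string | undefined>): boolean {
  return env.SIPHER_SKIP_BOOT_SELF_TEST === "true";
}
```

A strict match keeps a mistyped flag from silently disabling the check in production.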

Test plan

  • pnpm vitest run tests/boot/self-test.test.ts — 8 passed.
  • pnpm test -- --run (root) — 1662 passed (was 1652 baseline; +10 net = 8 new self-test + 2 from the 2 skipped torque integrations being recounted).
  • pnpm typecheck clean across root + sdk + app + agent.
  • Post-merge: the VPS boot log should show an "OpenRouter: self-test pass (~Nms)" line roughly 200-500ms after the "Network: devnet ..." line. If OPENROUTER_API_KEY is wrong, the container should restart-loop with a clear "OpenRouter self-test failed: HTTP 401 ..." error.

New tests

tests/boot/self-test.test.ts (8 cases):

  • returns successfully when OpenRouter responds with 200
  • throws with HTTP status + body when OpenRouter returns 401
  • throws when OpenRouter returns 5xx
  • throws when OPENROUTER_API_KEY is unset (no fetch attempted)
  • propagates getSipherModel error when SIPHER_MODEL is invalid (no fetch attempted)
  • aborts the fetch after the configured timeout
  • skips entirely when SIPHER_SKIP_BOOT_SELF_TEST=true
  • does NOT skip when the value is "false" or other truthy strings
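
The abort-on-timeout case can be reproduced without a network. The sketch below is not the test file from this PR: `hangingFetch` and `callWithTimeout` are hypothetical helpers, and the fake rejects with a plain Error where a real fetch would reject with an abort-specific error.

```typescript
// A fake fetch that never resolves on its own and rejects once the signal
// aborts, roughly mimicking how a real fetch reacts to AbortController.abort().
function hangingFetch(signal: AbortSignal): Promise<never> {
  return new Promise<never>((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")));
  });
}

// Runs an abortable operation and aborts it if it exceeds timeoutMs.
async function callWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // also runs on the rejection path
  }
}
```

Passing a short timeout such as 20ms keeps the case fast; a fake-timer facility from the test runner would be an alternative.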

Follow-ups (not in this PR)

  • Could add SIPHER_SELF_TEST_TIMEOUT_MS env var if 5s default ever turns out to be too tight for some operator network. Punt until needed.
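
If that follow-up is ever picked up, the parsing might look like the sketch below. Nothing here exists in this PR: the `resolveSelfTestTimeoutMs` helper and its validation rules are assumptions, and only the variable name and the 5000ms default come from the text above.

```typescript
const DEFAULT_SELF_TEST_TIMEOUT_MS = 5000;

// Reads the proposed override, falling back to the default when unset or blank.
function resolveSelfTestTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env.SIPHER_SELF_TEST_TIMEOUT_MS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_SELF_TEST_TIMEOUT_MS;

  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, negatives, and fractional values
  // instead of guessing what the operator meant.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`Invalid SIPHER_SELF_TEST_TIMEOUT_MS: "${raw}"`);
  }
  return parsed;
}
```

Throwing on a malformed value matches the fail-fast intent of the self-test itself: a typo in the override would stop boot rather than silently fall back.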

Frontier_sip_17 caught two stacked env-var bugs that took 100% of prod chat
traffic offline silently for an unknown window:

  1. SIPHER_MODEL set to hyphen-form (claude-sonnet-4-6) instead of
     dot-form. getSipherModel() throws on bad config, but only fires when
     a chat actually happens — meaning the outage manifested as empty
     assistant responses with no error log, not a boot failure.
  2. OPENROUTER_API_KEY pointed at a stale/revoked key. OpenRouter
     returned 401; pi-ai swallowed it and emitted empty content.

This PR adds a 2-token ping to OpenRouter as the third boot step (after
loadNetworkConfig + before getDb). Both failure modes now throw at boot
and Docker restarts the container with the same loud error until env is
fixed — a recoverable container restart loop instead of a silent
production outage.

New module: packages/agent/src/boot/self-test.ts

  selfTestOpenRouter({ timeoutMs }?):
    - returns early if SIPHER_SKIP_BOOT_SELF_TEST=true
    - calls getSipherModel() which throws synchronously on a bad
      SIPHER_MODEL — the existing dot-vs-hyphen guard. No fetch yet.
    - validates OPENROUTER_API_KEY is set (throws if empty).
    - POST https://openrouter.ai/api/v1/chat/completions with
      model.id + max_tokens:2 + a 1-message body. AbortController
      caps the budget at 5000ms (overridable in tests).
    - on non-2xx, throws with HTTP status + body slice (no secret
      leakage — OpenRouter doesn't echo keys in error bodies, and
      we cap the body slice at 300 chars).

Wiring: packages/agent/src/index.ts adds `await selfTestOpenRouter()`
between loadNetworkConfig() and getDb(), with a one-line "self-test
pass (Nms)" success log. Top-level await is already used elsewhere in
the file (e.g. restorePendingActions), so no module-system change.

Skipped automatically in:
  - vitest (vitest.config.ts env block — tests don't load index.ts
    directly today, but defensive against future integration tests).
  - playwright (playwright.config.ts webServer env block — e2e has
    no real OPENROUTER key, chat is mocked at the network layer).
  - any environment that opts in via SIPHER_SKIP_BOOT_SELF_TEST=true.
    .env.example documents this for operators.

Tests: 8 cases in tests/boot/self-test.test.ts — 200 success, 401
HTTP error, 5xx HTTP error, missing API key, invalid SIPHER_MODEL,
abort-on-timeout, skip flag respected, skip flag NOT respected on
non-"true" values. Suite total: 1652 → 1662 (+10) after merge of
prior #295/#294.

Closes #293.
vercel Bot commented Jun 2, 2026

Project sipher: deployment Ready (updated Jun 2, 2026 1:46am UTC)

rz1989s merged commit 3726042 into main on Jun 2, 2026
8 checks passed
rz1989s deleted the fix/issue-293-boot-self-test-openrouter branch on June 2, 2026 01:47

Development

Successfully merging this pull request may close these issues.

feat(boot): self-test OpenRouter + model lookup at startup — surface env typos in <1s
