Skip to content

chore(model): remove inert --name flag from obol model setup custom#509

Open
bussyjd wants to merge 1 commit into
mainfrom
chore/remove-unused-model-name-flag
Open

chore(model): remove inert --name flag from obol model setup custom#509
bussyjd wants to merge 1 commit into
mainfrom
chore/remove-unused-model-name-flag

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 21, 2026

Summary

What changed:

  • Removed --name flag from obol model setup custom.
  • Dropped name parameter from model.AddCustomEndpoint(cfg, u, endpoint, modelName, apiKey) signature.
  • Removed OBOL_LLM_NAME env var from flows/lib.sh::route_llm_via_obol_cli and flows/buy-external.sh.
  • Updated CLAUDE.md, internal/embed/skills/monetize-guide/SKILL.md, and .agents/skills/obol-stack-dev/references/llm-routing.md example commands and env-var lists.

Why it matters:
The --name flag was 100% inert — it never reached ModelEntry, never persisted to the LiteLLM ConfigMap, and never influenced obol model list/prefer/status/sync/remove or any routing. The string was echoed in two log lines and passed as a UI label to RestartLiteLLM on the fallback path. Nothing else. Its help text said "informational only — LiteLLM keys the route by --model, not --name".

The trap: operators run obol model setup custom --name foo --model my/model …, then call the route as foo, and get:

litellm.BadRequestError: You passed in model=foo.
There are no healthy deployments for this model.

This was the same error string in the v0.10.0-rc1 upgrade report attributed to a "cache survives obol stack up" bug. After five fresh-cluster probes on rc3 could not reproduce the cache bug, the reliable reproducer turned out to be calling the route by the user-given --name rather than the --model value LiteLLM actually keys on. Removing the flag eliminates that UX trap.

Risk level: low

Commit under test: b9ff172 (this PR), parent f8df92e (tag v0.10.0-rc3)

Base branch: main

Scope

  • Code
  • Charts / manifests
  • Flows / QA scripts
  • Docs / skills
  • Images / dependencies
  • Other:

Validation

CI checks:

Check Status Link
Unit tests (touched packages) ✅ pass local — go test ./cmd/obol/ ./internal/model/ -count=1
Full unit suite ⚠️ 1 pre-existing fail local — see below
Shell syntax ✅ pass local — bash -n flows/*.sh
Release-smoke (flows 01-12) ⚠️ 5 fails, all env-related local — see report below
Live cluster smoke (obol model setup custom end-to-end) ✅ pass local

Unit tests:

$ go test ./... -count=1
ok      github.com/ObolNetwork/obol-stack/cmd/obol      1.476s
ok      github.com/ObolNetwork/obol-stack/internal/model      0.731s
ok      ... (39 packages total)
FAIL    github.com/ObolNetwork/obol-stack/internal/stack      8.202s
  └── TestWarnIfNoChatModel_EmitsWarnWhenNoModels

PRE-EXISTING: Reproduces on clean `main` HEAD (f8df92e / v0.10.0-rc3 tag)
with this PR's changes stashed. Test asserts the warn is on stderr but
the message arrives on stdout. Unrelated to this PR.

Integration tests:

SKIPPED — internal/openclaw integration tests expect host Ollama on
:11434; QA host runs Unsloth Studio on :8888 instead. Not exercised
by this change either way (--name was never used by Ollama setup path).

Flow tests:

Flow Network QA machine label Worktree Result Artifacts
flow-02-stack-init-up n/a macOS Docker Desktop none (in-tree) PASS .tmp/release-smoke-20260521-134647/flow-02-stack-init-up.log
flow-05-network n/a macOS Docker Desktop none (in-tree) PASS .tmp/release-smoke-20260521-134647/flow-05-network.log
flow-07-sell-verify n/a macOS Docker Desktop none (in-tree) PASS .tmp/release-smoke-20260521-134647/flow-07-sell-verify.log
flow-08-buy base-sepolia macOS Docker Desktop none (in-tree) PASS .tmp/release-smoke-20260521-134647/flow-08-buy.log
flow-09-lifecycle n/a macOS Docker Desktop none (in-tree) PASS .tmp/release-smoke-20260521-134647/flow-09-lifecycle.log
flow-10-anvil-facilitator local macOS Docker Desktop none (in-tree) PASS .tmp/release-smoke-20260521-134647/flow-10-anvil-facilitator.log
flow-01-prerequisites n/a macOS Docker Desktop none (in-tree) FAIL (env) Unsloth /v1/models returns 401 to unauthenticated probe
flow-03-inference n/a macOS Docker Desktop none (in-tree) FAIL (env) endpoint validation failed: ... context deadline exceeded — Unsloth Studio 27B cold-start exceeds ValidateCustomEndpoint's 60s timeout. CLI parsed args correctly — no --name regression.
flow-04-agent n/a macOS Docker Desktop none (in-tree) FAIL (env) cascades from flow-03 (no model registered)
flow-06-sell-setup n/a macOS Docker Desktop none (in-tree) FAIL (env) hard-coded preflight for Ollama at :11434; QA host runs Unsloth
flow-11-dual-stack base-sepolia macOS Docker Desktop none (in-tree) FAIL (env) same Unsloth cold-start timeout during alice's obol model setup custom validation. Preflight (Alice ETH, Bob USDC, facilitator) all PASS.

Release smoke:

$ OBOL_LLM_ENDPOINT=http://host.k3d.internal:8888/v1 \
  OBOL_LLM_MODEL=unsloth/Qwen3.6-27B-MTP-GGUF \
  OBOL_LLM_API_KEY=<unsloth-studio-jwt> \
  bash flows/release-smoke.sh

| Flow                       | Result | FAIL lines | SKIP lines | Exit code |
| -------------------------- | ------ | ---------: | ---------: | --------: |
| flow-01-prerequisites      | FAIL   |          1 |          0 |         1 |
| flow-02-stack-init-up      | PASS   |          0 |          0 |         0 |
| flow-03-inference          | FAIL   |          6 |          0 |         1 |
| flow-04-agent              | FAIL   |          1 |          0 |         1 |
| flow-05-network            | PASS   |          0 |          0 |         0 |
| flow-06-sell-setup         | FAIL   |          1 |          0 |         1 |
| flow-07-sell-verify        | PASS   |          0 |          0 |         0 |
| flow-10-anvil-facilitator  | PASS   |          0 |          0 |         0 |
| flow-08-buy                | PASS   |          0 |          0 |         0 |
| flow-09-lifecycle          | PASS   |          0 |          0 |         0 |
| flow-11-dual-stack         | FAIL   |          0 |          0 |         1 |

Release smoke failed: 5 flow(s)

Failure attribution: Zero failures involve --name parsing. grep "unknown flag\|flag provided but not defined" .tmp/release-smoke-*/*.log returns nothing. Every obol model setup custom invocation parsed arguments and reached endpoint validation correctly. All 5 failures are upstream of the CLI:

  1. Unsloth Studio auth/v1/models requires bearer token; flow-01's simple unauthenticated probe gets 401.
  2. Unsloth Studio cold-start — first inference call on the 27B GGUF triggers model load, exceeding ValidateCustomEndpoint's 60s timeout. Surfaces in flow-03 and flow-11.
  3. flow-06 Ollama hardcodeflow-06-sell-setup.sh preflight checks localhost:11434; QA host runs Unsloth, not Ollama.
  4. flow-04 — cascades from flow-03 leaving LiteLLM without the routed model.

A vLLM/llama.cpp QA host without auth would not hit any of these. Six flows pass cleanly including all on-chain payment flows (flow-08 buy, flow-09 lifecycle).

Live Chain Evidence

Do not include private keys, seed phrases, passwords, hostnames, personal paths, or raw bearer tokens.

Network: base-sepolia (flow-08, flow-11)

RPC/provider: default free-tier fallback (no paid RPC set this run)

Facilitator: https://x402.gcp.obol.tech (reachable, supports Base Sepolia exact)

Contracts and tokens:

Name Address Version / notes
USDC (Base Sepolia) 0x036CbD53842c5426634e7929541eC2318f3dCF7e facilitator-default

Wallet roles:

Role Address Source
Alice / seller / register 0xC0De030F6C37f490594F93fB99e2756703c4297E flow-11 derived from REMOTE_SIGNER_PRIVATE_KEY
Bob / buyer / payer 0x57b0eF875DeB5A37301F1640E469a2129Da9490E flow-11 derived from REMOTE_SIGNER_PRIVATE_KEY (2nd derive)
Facilitator / receiver n/a hosted x402-rs

Balances:

Token Address Before After Expected delta Actual delta
USDC 0x036CbD…CF7e Bob 4.95 USDC Bob 4.95 USDC 0 (no purchase fired — flow-11 failed at LLM setup) 0

Transaction receipts: none on-chain this run (PR is CLI-only, no settlement path touched).

Runtime Evidence

QA environment:

Item Value
OS / arch macOS Darwin 25.5.0 / arm64
Backend Docker Desktop + k3d v5.8.3
Tool versions go1.25.5, k3d 5.8.3, helm 3.x, helmfile 1.4.x, kubectl 1.35.x
QA agent/model Hermes (nousresearch/hermes-agent:v2026.5.7) + Unsloth Studio serving unsloth/Qwen3.6-27B-MTP-GGUF

Images:

Component Image Tag / digest Source
obol-agent (Hermes) nousresearch/hermes-agent v2026.5.7 docker.io
LiteLLM ghcr.io/berriai/litellm:main-stable upstream ghcr
x402-verifier / serviceoffer-controller / x402-buyer / demo-server / public-storefront ghcr.io/obolnetwork/<name> :latest (locally built, OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES=true) in-tree Dockerfiles

Kubernetes / stack:

Item Value
Stack IDs smart-dinosaur (validation phase), wondrous-crane (release-smoke) — both torn down
Namespaces standard set (llm, x402, hermes-obol-agent, erpc, monitoring, traefik, obol-frontend)
Pod readiness all default infra pods Running 2/2 or 1/1 during validation
Cleanup result release-smoke's cleanup_stacks trap removed test workspaces on exit

Model and routing:

Item Value
Agent/model used Hermes → LiteLLM → claude-sonnet-4-6 (Anthropic via ANTHROPIC_API_KEY) and unsloth/Qwen3.6-27B-MTP-GGUF (Unsloth Studio on host)
LiteLLM route claude-sonnet-4-6 + paid/* + anthropic/* + unsloth/Qwen3.6-27B-MTP-GGUF
Paid endpoint status not exercised this PR
Auth token source LITELLM_MASTER_KEY from kubectl get secret litellm-secrets -n llm; Unsloth JWT from POST /api/auth/login

Artifacts and logs:

Artifact Location / link Notes
Release-smoke run .tmp/release-smoke-20260521-134647/ 11 per-flow logs + RELEASE_REPORT.md
Pre-merge live smoke tmux pane obol-0:qa.0 4 chat probes, all HTTP 200

Demo readiness:

Item Status Notes
Seller visible / registered n/a not in scope of this PR
Buyer discovery works flow-08 buy and flow-09 lifecycle both PASS in release-smoke
Paid route works flow-08 PASS
Settlement visible on-chain n/a no settlement triggered this PR

Review Notes

Known gaps:

  • Unsloth Studio is not natively supported by obol model setup custom — it requires a bearer JWT and has slow cold-start for large GGUFs that exceeds the 60s validation timeout. Adding first-class Unsloth support (alongside Ollama) would let release-smoke run cleanly on hosts without vLLM. Out of scope for this PR.
  • A separate bug surfaced during validation: obol model setup custom hot-add path fails with [Errno 30] Read-only file system: '/etc/litellm/config.yaml' because the ConfigMap is mounted RO. The CLI correctly falls back to a deployment restart, so users aren't blocked, but every custom-endpoint setup pays the full ~90s rollout cost. Worth a follow-up.
  • Pre-existing unit-test failure: TestWarnIfNoChatModel_EmitsWarnWhenNoModels in internal/stack is broken on main HEAD as of this PR (asserts warn-on-stderr but the message lands on stdout). Predates this branch.

Follow-ups:

  • Native Unsloth support in obol model setup (auth handling, longer first-call timeout).
  • Fix hot-add by writing the merged config to an emptyDir or making the mount RW.
  • Fix TestWarnIfNoChatModel_EmitsWarnWhenNoModels (separate issue).
  • Re-validate Issue Add helm to obolup #2 (per-agent Hermes crashloop) on a Linux k3d host — cannot reproduce on macOS Docker Desktop due to virtiofs ownership translation, but source inspection on rc3 confirms agent_render.go::agentPodSpec still lacks an init container.

Reviewer focus:

  • Confirm the signature change AddCustomEndpoint(cfg, u, endpoint, modelName, apiKey) (no name) is acceptable — the function is only called from cmd/obol/model.go and has no test mocks. Repo-wide grep returns zero stale callers.
  • Confirm the RestartLiteLLM(cfg, u, modelName) fallback label change is OK — the third arg is a UI-only string.
  • Confirm OBOL_LLM_NAME removal from flows/buy-external.sh doesn't break any external automation referencing it (none found in this repo or in ~/.config).

The `--name` flag on `obol model setup custom` was documented as
informational only and never participated in any routing or persistence:

- ModelEntry has only `model_name` (route key) + `litellm_params`; the
  CLI `--name` value was never written to either.
- `detectProvider` (used by `obol model list/status`) inspects
  `entry.ModelName` + `entry.LiteLLMParams.Model` prefixes; the `--name`
  string never reached it.
- It was only echoed back in two log lines and passed as a UI label to
  `RestartLiteLLM` on the hot-add fallback path.

This caused confusion in QA: an operator running

    obol model setup custom --name foo --model my/model ...

would later call the route as `foo` and get LiteLLM's

    BadRequestError: ... There are no healthy deployments for this model.

(The same error message the operator at #v0.10.0-rc1-upgrade-report
attributed to a cache-survives-stack-up bug. Five fresh-cluster probes
on rc3 could not reproduce the cache bug — the consistent reproducer
turned out to be calling the route by the user-given `--name` rather
than the actual registered `--model` value.)

Changes:
- cmd/obol/model.go: drop --name flag from modelSetupCustomCommand
- internal/model/model.go: drop name parameter from AddCustomEndpoint;
  fallback RestartLiteLLM label now uses modelName
- flows/lib.sh: route_llm_via_obol_cli no longer reads OBOL_LLM_NAME or
  passes --name
- flows/buy-external.sh: OBOL_LLM_NAME env var removed (orphan)
- CLAUDE.md / monetize-guide SKILL.md / llm-routing.md: example commands
  and env-var lists drop --name / OBOL_LLM_NAME
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant