An OpenAI-compatible HTTP proxy that aggregates multiple LLM providers behind a single endpoint. Clients that speak the OpenAI API (LangChain, LiteLLM, Open WebUI, Cursor, etc.) connect to llmproxy without modification; llmproxy routes each request to the correct upstream based on a provider-prefix embedded in the model name.
llmproxy/
├── run.py ← start the server (no install needed)
├── llmproxy_test_client.py ← live integration test client (talks to a running proxy)
├── test_tui.py ← interactive chat TUI (despite the name — not a test suite)
├── llmproxy/ ← the package
│ ├── __main__.py
│ ├── config.py
│ ├── server.py
│ ├── setup_wizard.py
│ ├── providers.py ← loader for the JSON sidecar
│ └── providers.json ← single source of truth for ALL provider templates
│ (+ believed_free / model_reasoning / model_capabilities / free_limits)
├── scripts/
│ ├── update_free_models.py ← scraper that keeps providers.json's free-tier fields current
│ └── sources/ ← per-source plugins (openrouter, community, /models, docs)
├── tests/ ← pytest unit/integration suite
├── requirements.txt
├── requirements-dev.txt ← pytest, ruff, responses (test-only deps)
├── pyproject.toml ← pytest + ruff config
├── setup.py ← only needed for pip install
├── Dockerfile
├── docker-compose.yml
├── config.example.json ← auto-generated from llmproxy/providers.json
└── .github/workflows/
    ├── ci.yml ← pytest, ruff, config-example-up-to-date guard
    └── docker-publish.yml ← GHCR image publish
All models exposed by llmproxy follow this pattern:
<provider_name>/<upstream_model_id>
The upstream_model_id may itself contain slashes. Examples:
| Proxy model string | Provider | Upstream model |
|---|---|---|
| openrouter/openrouter/free | openrouter | openrouter/free |
| openrouter/anthropic/claude-3.5-sonnet | openrouter | anthropic/claude-3.5-sonnet |
| openai/gpt-4o | openai | gpt-4o |
| deepseek/deepseek-chat | deepseek | deepseek-chat |
| ollama/llama3 | ollama | llama3 |
The proxy strips the leading <provider_name>/ before forwarding the request to
the upstream provider's base URL.
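The prefix-stripping rule can be sketched as follows. This is a minimal illustration, not llmproxy's actual code; `split_proxy_model` is a hypothetical helper name. The key point is that only the first slash separates the provider from the upstream id, so upstream ids that contain slashes survive intact.

```python
def split_proxy_model(model: str) -> tuple[str, str]:
    """Split '<provider_name>/<upstream_model_id>' at the first slash only.

    Hypothetical helper for illustration; the real proxy may differ.
    """
    provider, sep, upstream = model.partition("/")
    if not sep or not provider or not upstream:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    return provider, upstream


provider, upstream = split_proxy_model("openrouter/anthropic/claude-3.5-sonnet")
# provider is forwarded to nowhere; upstream is what the provider receives.
```

Here `provider` selects the configured base URL and `upstream` is the model id sent on to that provider.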
While the slash form above is the canonical input form (accepted by every
endpoint), GET /v1/models advertises ids in a different display form:
<provider_name>__<upstream_model_id>
For example, an Ollama model with upstream id qwen2.5vl:3b is listed as
ollama__qwen2.5vl:3b. The double-underscore separator avoids two real-world
client bugs:
- Spaces and parentheses break strict client validators (e.g. Hermes rejects any model name containing whitespace).
- A / separator causes some clients to silently truncate the id at the first /, hiding the provider suffix in their menus.
Putting the provider on the left mirrors the canonical provider/model slash
form, so the two ids read consistently across logs, configs, and menus.
Clients may submit any of these four forms in "model" on chat/completions
requests:
- provider/model: canonical slash form
- provider__model: current display form
- model__provider: legacy display form from PR #27
- model (provider): pre-PR #27 legacy display form
All four resolve identically.
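One way such normalization could work is sketched below. It assumes the set of configured provider names is known, which is what lets the two double-underscore forms be told apart; `to_canonical` and the order of the checks are assumptions for illustration, not the proxy's documented algorithm.

```python
import re


def to_canonical(model: str, providers: set[str]) -> str:
    """Map any accepted input form to 'provider/model' (hypothetical helper)."""
    # "model (provider)": pre-PR #27 legacy display form
    match = re.fullmatch(r"(.+) \(([^()]+)\)", model)
    if match and match.group(2) in providers:
        return f"{match.group(2)}/{match.group(1)}"
    # "provider__model": current display form
    head, sep, tail = model.partition("__")
    if sep and head in providers:
        return f"{head}/{tail}"
    # "model__provider": legacy display form
    head, sep, tail = model.rpartition("__")
    if sep and tail in providers:
        return f"{tail}/{head}"
    # Anything else is assumed to already be the canonical slash form.
    return model
```

With `providers = {"ollama"}`, the inputs `ollama__qwen2.5vl:3b`, `qwen2.5vl:3b__ollama`, `qwen2.5vl:3b (ollama)`, and `ollama/qwen2.5vl:3b` all map to `ollama/qwen2.5vl:3b`.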
llmproxy advertises a special synthetic model named llmproxy__free. When a request
arrives with "model": "llmproxy__free" (or the legacy "llmproxy/free"), the proxy:
- Collects every model across all providers whose upstream ID contains the word free (case-insensitive) or whose upstream ID (or full provider/upstream ID) appears in the top-level believed_free config list (see Configuration).
- Picks a random starting position in that list, then tries each candidate in order, wrapping around.
- Returns the first response with an HTTP status below 400. If a candidate is rate-limited, overloaded, or otherwise unhealthy, it is skipped silently and the next one is tried.
This spreads load across free-tier endpoints and provides automatic failover — useful when any individual free model is rate-limited.
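The selection loop described above can be sketched roughly like this. It is an illustration under stated assumptions: `first_healthy` is a hypothetical name, and `send` stands in for whatever forwards the request to one candidate and returns its HTTP status.

```python
import random
from typing import Callable, Optional, Sequence


def first_healthy(
    candidates: Sequence[str],
    send: Callable[[str], int],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try each candidate once, starting at a random offset and wrapping around."""
    if not candidates:
        return None
    rng = rng or random.Random()
    start = rng.randrange(len(candidates))
    for step in range(len(candidates)):
        model = candidates[(start + step) % len(candidates)]
        if send(model) < 400:  # first status below 400 wins
            return model
    return None  # every candidate was unhealthy
```

The random start is what spreads load: successive requests begin at different candidates, while the wrap-around guarantees each candidate is tried at most once per request.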
# Use the free virtual model
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llmproxy__free", "messages": [{"role": "user", "content": "Hello!"}]}'
# Inspect which backends are currently eligible
curl http://localhost:8080/v1/models/llmproxy__free | jq '._candidates'
The llmproxy__free model appears at the top of GET /v1/models whenever at least one
eligible backend is available.
llmproxy also advertises a synthetic model named llmproxy__local. When a request
arrives with "model": "llmproxy__local" (or the legacy "llmproxy/local"), the proxy:
- Collects every model across all providers whose base_url hostname matches a loopback address (localhost, 127.x.x.x, ::1, 0.0.0.0), an mDNS name (*.local), or a Docker host-gateway alias (host.docker.internal, gateway.docker.internal).
- Picks a random starting position in that list, then tries each candidate in order, wrapping around.
- Returns the first response with an HTTP status below 400.
This is useful for clients that want to use whichever local model (Ollama, LM Studio, llama.cpp, etc.) happens to be running without hard-coding a specific model name.
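The hostname test could look something like the sketch below, which uses only the standard library. `is_local_base_url` and `DOCKER_ALIASES` are hypothetical names, and the real matcher may handle more edge cases.

```python
import ipaddress
from urllib.parse import urlsplit

# Docker host-gateway aliases named in the text above.
DOCKER_ALIASES = {"host.docker.internal", "gateway.docker.internal"}


def is_local_base_url(base_url: str) -> bool:
    """Return True when the URL's hostname looks like a local provider."""
    host = (urlsplit(base_url).hostname or "").lower()
    if host == "localhost" or host.endswith(".local") or host in DOCKER_ALIASES:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # an ordinary DNS name
    # 127.0.0.0/8 and ::1 are loopback; 0.0.0.0 is listed explicitly in the docs.
    return addr.is_loopback or host == "0.0.0.0"
```

For example, `http://host.docker.internal:11434/v1` and `http://[::1]:8000/v1` both count as local, while `https://api.openai.com/v1` does not.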
# Use the local virtual model
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llmproxy__local", "messages": [{"role": "user", "content": "Hello!"}]}'
# Inspect which backends are currently eligible
curl http://localhost:8080/v1/models/llmproxy__local | jq '._candidates'
The llmproxy__local model appears in GET /v1/models only when at least one model
from a localhost-backed provider is present in the route cache — meaning the
provider must be reachable and its /models listing must have been fetched
successfully.
Local models are not added to believed_free. Local-provider models (Ollama, LM Studio, OpenWebUI, etc.) live entirely under the __local family: llmproxy__local, llmproxy__standard/local, and so on. When the setup wizard auto-registers a local provider, it tags each discovered model in model_reasoning only; believed_free is reserved for cloud free-tier offerings. If you want a local model to also appear under llmproxy__free, add it to believed_free by hand.
You can optionally tag individual models in the config with a reasoning
level — exploratory, standard, or deep — to group them by how much
thinking effort they are expected to apply. When at least one model is tagged
with a given level, llmproxy exposes corresponding virtual endpoints:
| Virtual model name | Selects |
|---|---|
| llmproxy__exploratory | All models tagged exploratory |
| llmproxy__standard | All models tagged standard |
| llmproxy__deep | All models tagged deep |
| llmproxy__exploratory/free | Models tagged exploratory and qualifying as free-tier |
| llmproxy__exploratory/local | Models tagged exploratory and served on localhost |
| llmproxy__standard/free | Models tagged standard and qualifying as free-tier |
| llmproxy__standard/local | Models tagged standard and served on localhost |
| llmproxy__deep/free | Models tagged deep and qualifying as free-tier |
| llmproxy__deep/local | Models tagged deep and served on localhost |
Each virtual endpoint uses the same random-start round-robin with automatic
failover as llmproxy__free and llmproxy__local. The legacy llmproxy/...
form (e.g. llmproxy/deep/free) is still accepted on input for backward
compatibility with pinned client configs.
# Use the deep reasoning virtual model
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llmproxy__deep", "messages": [{"role": "user", "content": "Prove P≠NP"}]}'
# Inspect which backends are eligible for llmproxy__standard/free
curl http://localhost:8080/v1/models/llmproxy__standard/free | jq '._candidates'
Tags are configured via the model_reasoning field; see
Configuration → model_reasoning below.
Free models vary wildly in what they support: some handle tool/function calls,
some accept images, some emit reasoning, some honor JSON-mode. llmproxy can tag
each model with the capabilities it supports (via model_capabilities) and use
that to route requests on any virtual model:
- Proactive ordering: when a request needs a capability, candidates that support it are tried first. This is a stable reordering: models with unknown capability are kept as fallbacks, so incomplete metadata never turns a request into a hard failure.
- Reactive failover: when a capability was mandatory but the upstream returned a 200 that didn't deliver it, llmproxy fails over to the next candidate, exactly as it does on an HTTP error. Today this covers:
  - tools: tool_choice forced a call ("required" or a specific function) but the response contained no tool_calls.
  - json: response_format requested JSON but the body wasn't valid JSON.
  (Reactive 200-body detection runs on non-streaming requests only; streaming responses still benefit from proactive ordering. Capabilities without a reliable 200 signal, such as vision and reasoning, rely on the upstream returning an HTTP error, which already triggers failover.)
The tool_choice: "auto" case is never treated as a failure — a model may
legitimately answer without calling a tool.
Detected capabilities: tools, vision, reasoning, json.
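Both halves of this behaviour can be sketched in a few lines. The helper names below are hypothetical, the response shape follows the OpenAI chat-completions format, and treating `json_object` / `json_schema` as "requested JSON" is an assumption about how the check is scoped.

```python
import json


def order_by_capability(candidates, needed, capabilities):
    """Stable reorder: models tagged with `needed` first, the rest kept as fallbacks."""
    # sorted() is stable, so ties keep their original relative order.
    return sorted(candidates, key=lambda m: needed not in capabilities.get(m, ()))


def forced_tool_call(request: dict) -> bool:
    """True for tool_choice "required" or a specific function; never for "auto"."""
    choice = request.get("tool_choice")
    return choice == "required" or isinstance(choice, dict)


def delivered(request: dict, response: dict) -> bool:
    """False when a mandatory capability is missing from a 200 response body."""
    message = response["choices"][0]["message"]
    if forced_tool_call(request) and not message.get("tool_calls"):
        return False
    wanted = (request.get("response_format") or {}).get("type")
    if wanted in ("json_object", "json_schema"):
        try:
            json.loads(message.get("content") or "")
        except ValueError:
            return False
    return True
```

A caller would walk the reordered candidates and move on to the next one whenever the upstream returns an HTTP error or `delivered(...)` is false.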
When at least one model is tagged, dedicated capability virtual endpoints appear:
| Virtual model name | Selects |
|---|---|
| llmproxy__tools | All models tagged tools |
| llmproxy__tools/free | Models tagged tools and qualifying as free-tier |
| llmproxy__vision | All models tagged vision |
| llmproxy__vision/free | Models tagged vision and qualifying as free-tier |
# Route a tool-calling request only to tool-capable free models, failing
# over automatically if one returns no tool call:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llmproxy__tools/free",
"tool_choice": "required",
"tools": [{"type": "function", "function": {"name": "get_weather"}}],
"messages": [{"role": "user", "content": "Weather in Paris?"}]}'
# llmproxy__free also benefits — it now orders/fails over by capability when
# the request carries tools or images.
Tags are configured via the model_capabilities field, which auto-populates
from the scraper (OpenRouter's supported_parameters / image modality) and the
setup wizard's Manage model tags → Tag model capabilities menu — see
Configuration → model_capabilities below. The legacy
llmproxy/... input form (e.g. llmproxy/tools/free) is also accepted.
Config is stored at ~/.config/llmproxy/config.json (or the path in
$LLMPROXY_CONFIG, or the --config flag).
{
"providers": {
"<name>": {
"base_url": "https://...",
"api_key": "sk-...",
"model_filter": ["model-a", "model-b"]
}
},
"believed_free": [
"openrouter/qwen/qwen3-coder:free",
"gpt-oss-20b",
"nvidia/meta/llama-3.1-70b-instruct"
],
"model_reasoning": {
"anthropic/claude-3.5-haiku": "exploratory",
"anthropic/claude-sonnet-4-5": "standard",
"anthropic/claude-opus-4": "deep",
"openrouter/deepseek/deepseek-r1": "deep",
"nvidia/meta/llama-3.1-70b-instruct": "standard"
},
"model_capabilities": {
"openrouter/qwen/qwen3-coder:free": ["tools", "reasoning"],
"google/gemini-2.5-flash": ["tools", "vision", "json"]
},
"server": {
"host": "0.0.0.0",
"port": 8080,
"log_level": "INFO",
"request_timeout": 120,
"stream_timeout": 300,
"response_cache_ttl": 120
}
}
model_filter is an optional list of upstream model IDs to allow (without the
provider prefix). It is not set by default in config.example.json. Set it to
null or omit it to permit all models from that provider. It can be used as a
manual allowlist, or as a fallback model list for providers whose /v1/models
endpoint does not work (e.g. Cloudflare Workers AI, Cloudflare AI Gateway).
believed_free is an optional top-level array of model names that the
free virtual model should include even when their ID doesn't contain the
word free. Omit the field entirely (or set it to []) to keep the
default behaviour — only IDs that literally contain free are pulled in.
Each entry is matched (case-insensitively) against either the upstream
model ID (e.g. gpt-oss-20b) or the full proxy ID (e.g.
openrouter/qwen/qwen3-coder:free). The setup wizard manages this field
via its "Manage model tags" menu and via the per-provider auto-populate step
when you add a templated provider; the merged defaults come from
llmproxy/providers.json.
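The matching rule for free-tier eligibility can be sketched as below. `is_free` is a hypothetical helper; it reflects only the rules stated in this document (substring match on the upstream ID, then a case-insensitive lookup of either ID form in believed_free).

```python
def is_free(provider: str, upstream_id: str, believed_free=()) -> bool:
    """Decide whether a model qualifies for the free virtual model (sketch)."""
    upstream = upstream_id.lower()
    if "free" in upstream:  # default rule: the ID literally contains "free"
        return True
    listed = {entry.lower() for entry in believed_free}
    full_id = f"{provider}/{upstream_id}".lower()
    return upstream in listed or full_id in listed
```

With `believed_free = ["gpt-oss-20b", "nvidia/meta/llama-3.1-70b-instruct"]`, the first entry matches any provider's `gpt-oss-20b`, while the second matches through the full proxy ID of the nvidia provider.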
Free-tier accuracy: The believed_free entries in config.example.json and in llmproxy/providers.json are best-effort estimates based on publicly-stated provider free tiers. Provider offerings change without notice, and no guarantee is made as to accuracy. Verify directly with each provider before relying on free availability in production. The scripts/update_free_models.py scraper exists to keep these entries current.
model_reasoning is an optional top-level object that tags individual
models with a reasoning level. Valid levels are exploratory, standard,
and deep. Each key is matched (case-insensitively) against either the
upstream model ID (e.g. anthropic/claude-opus-4) or the full
provider/upstream_model proxy ID (e.g.
openrouter/anthropic/claude-opus-4). When a level has at least one tagged
model in the route cache, the corresponding virtual endpoint is advertised in
GET /v1/models. Omit the field entirely (or set it to {}) to disable
reasoning-level routing. The setup wizard manages this field via its
"Manage model tags" menu (merged defaults come from llmproxy/providers.json).
model_capabilities is an optional top-level object that tags individual
models with the capabilities they support. Valid values are tools, vision,
reasoning, and json (a list per model). Keys are matched
(case-insensitively) against either the upstream model ID or the full
provider/upstream_model proxy ID, like model_reasoning. It drives
capability-aware routing & failover on
all virtual models and powers the llmproxy__tools / llmproxy__vision
endpoints (advertised when at least one model carries the tag). Omit it (or set
it to {}) to disable capability-aware behavior — the proxy then behaves exactly
as before. The field auto-populates from the scraper (OpenRouter's
supported_parameters and image input modality) and from the setup wizard's
"Manage model tags → Tag model capabilities" menu.
See config.example.json for a complete annotated example.
Provider templates and free-tier metadata both live in
llmproxy/providers.json — the single source of
truth. The setup wizard reads from this file at startup; config.example.json
is regenerated from the same file. To add or update a provider, edit
providers.json directly (or run the scraper — see
Keeping the free-models list current).
The wizard currently offers ready-made templates for these providers:
| Provider | Default key | Base URL |
|---|---|---|
| Nous Research (Hermes) | nous | https://inference-api.nousresearch.com/v1 |
| Nvidia NIM | nvidia | https://integrate.api.nvidia.com/v1 |
| Google Gemini (OpenAI-compat) | google | https://generativelanguage.googleapis.com/v1beta/openai |
| Cerebras | cerebras | https://api.cerebras.ai/v1 |
| GitHub Models | github | https://models.github.ai/inference |
| SambaNova Cloud | sambanova | https://api.sambanova.ai/v1 |
| Mistral AI | mistral | https://api.mistral.ai/v1 |
| Groq | groq | https://api.groq.com/openai/v1 |
| Together AI | together | https://api.together.xyz/v1 |
| Fireworks AI | fireworks | https://api.fireworks.ai/inference/v1 |
| Cloudflare Workers AI | cloudflare-workers | https://api.cloudflare.com/client/v4/accounts/.../ai/v1 |
| Zhipu AI (BigModel) | zhipu | https://open.bigmodel.cn/api/paas/v4 |
| Z.AI | z-ai | https://api.z.ai/api/paas/v4 |
| Cohere | cohere | https://api.cohere.com/compatibility/v1 |
| DeepSeek | deepseek | https://api.deepseek.com/v1 |
| OpenRouter | openrouter | https://openrouter.ai/api/v1 |
| Ollama Cloud | ollama-cloud | https://ollama.com/v1 |
| Moonshot AI (Kimi) | moonshot | https://api.moonshot.ai/v1 |
| MiniMax | minimax | https://api.minimax.io/v1 |
| Hugging Face Inference | huggingface | https://router.huggingface.co/v1 |
| xAI (Grok) | xai | https://api.x.ai/v1 |
| Cloudflare AI Gateway | cloudflare-ai-gateway | https://gateway.ai.cloudflare.com/v1/{account}/{gw}/workers-ai/v1 |
| Vercel AI Gateway | vercel | https://ai-gateway.vercel.sh/v1 |
| Venice AI | venice | https://api.venice.ai/api/v1 |
| OpenCode Zen (free gateway) | opencode-zen | https://opencode.ai/zen/v1 |
API key required. Every provider in this table requires an API key. The setup wizard displays a hint showing where to obtain each key. For keyless local access (e.g. a local Ollama instance), use the manual "Add / edit a provider" option in the wizard.
Any OpenAI-compatible provider can also be added manually via the "Add / edit a provider (manual)" menu option.
Providers that do not support GET /v1/models (as of May 2026): Some providers return an error or non-JSON response for the /models endpoint. For these, llmproxy synthesizes model entries from the provider's model_filter when the fetch fails. Set model_filter manually in your config for these providers to enumerate the models you want available.
| Provider | Reason |
|---|---|
| Cloudflare Workers AI | Returns HTTP 405: method not supported |
| Cloudflare AI Gateway | Returns HTTP 401: no anonymous model enumeration |
| Hugging Face Inference | Returns HTML rather than JSON for /v1/models |
Provider free tiers change without notice. The free-tier fields in
llmproxy/providers.json hold the
project's best-effort view of which models are currently free and what
their rate limits are — used by the llmproxy__free virtual endpoint and by the
setup wizard's "auto-populate" step.
A scraper at scripts/update_free_models.py polls multiple sources, diffs the
result against the sidecar, and prints proposed adds / removes / limit changes
for human review.
| Source | Confidence | What it does |
|---|---|---|
| openrouter | high | Hits https://openrouter.ai/api/v1/models and flags any model with pricing.prompt == 0 as free. |
| docs | high | Per-provider HTML scrapers for published rate-limit / free-tier pages (Google, Groq, Cerebras, Mistral, Cohere). Add more under scripts/sources/docs/. |
| api | medium | Calls each provider's OpenAI-compatible /v1/models endpoint when <PROVIDER>_API_KEY is set in your environment. Used to detect removals (a believed-free model that's no longer listed). |
| community | low | Pulls the tashfeenahmed/freellmapi community list as a sanity signal. |
# Preview proposed changes (no files written)
python scripts/update_free_models.py --dry-run
# Apply the changes to llmproxy/providers.json and regenerate config.example.json
python scripts/update_free_models.py
# Restrict to one provider
python scripts/update_free_models.py --provider google --dry-run
# Restrict to specific sources
python scripts/update_free_models.py --source openrouter,docs --dry-run
# Just regenerate config.example.json from the current sidecar (no scraping)
python scripts/update_free_models.py --regen-config-only
# Also sync your live config.json's free-tier sections from the sidecar
python scripts/update_free_models.py --config ~/.config/llmproxy/config.json --dry-run
python scripts/update_free_models.py --config ~/.config/llmproxy/config.json
# Sync the config from the current sidecar without scraping
python scripts/update_free_models.py --regen-config-only --config ~/.config/llmproxy/config.json
The proxy reads believed_free / model_reasoning / model_capabilities /
free_limits at runtime from your config.json, not from the sidecar. Pass
--config PATH to also reconcile a live config in the same run (honors
--dry-run):
- Scope is limited to providers configured in that file. Entries for custom providers, or sidecar providers you haven't configured, are left untouched, as are non-model keys like the _note in free_limits.
- believed_free and free_limits are synced: newly-free models are added and models that are no longer free are removed.
- model_reasoning and model_capabilities are add-only. Existing tags are never pruned or overwritten, so a model keeps its reasoning level / capability tags (including any you set by hand) even after it leaves the free tier.
- Your providers, server, and any other config sections are preserved; only the free-tier sections change.
- A failed source never causes a removal. Sources run independently; any source that errors out (network failure, parse error, 5xx) emits no evidence rather than "every model is absent". The scraper prints which sources succeeded so you can judge how much to trust the diff.
- /v1/models presence ≠ free. The api source only contributes existence evidence; it can flag removals but cannot decide that a model is free.
- Reasoning levels are preserved. Existing model_reasoning entries are never overwritten. New models are tagged via infer_reasoning_level() (deep keywords → deep; size in B → standard / exploratory) so you can hand-tune later.
When set, each <PROVIDER>_API_KEY enables the api source for that provider:
GROQ_API_KEY=gsk-... GOOGLE_API_KEY=AIza-...
CEREBRAS_API_KEY=csk-... MISTRAL_API_KEY=...
COHERE_API_KEY=... SAMBANOVA_API_KEY=...
(and so on — uppercase the provider key, replace - with _, append _API_KEY).
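The naming rule is mechanical enough to express in one line. The helper below is a hypothetical illustration of that rule, not a function taken from the scraper.

```python
def api_key_env_var(provider_key: str) -> str:
    """Uppercase the provider key, replace '-' with '_', append '_API_KEY'."""
    return provider_key.upper().replace("-", "_") + "_API_KEY"
```

So the `groq` provider reads `GROQ_API_KEY`, and a hyphenated key such as `cloudflare-workers` maps to `CLOUDFLARE_WORKERS_API_KEY`.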
This is the recommended path for local use. You only need flask and
requests; no pip install . or pip install -e . is required.
pip install flask requests
gunicorn is optional. If installed, the server uses it automatically for
better concurrency; otherwise it falls back to the Flask development server,
which is fine for local use.
pip install gunicorn # optional
Run the interactive setup wizard. It creates ~/.config/llmproxy/config.json
and prompts you for each provider's name, base URL, API key, and optional model
filter.
python run.py --setup
You can re-run --setup at any time to add, edit, or remove providers.
python run.py
The server binds to 0.0.0.0:8080 by default. Override host or port without
editing the config:
python run.py --port 9000 --log-level DEBUG
run.py resolves its own location via os.path.abspath(__file__), so it works
correctly regardless of which directory you invoke it from:
python /path/to/llmproxy/run.py --setup
python /path/to/llmproxy/run.py
python run.py --setup
The server hot-reloads config on each request (modification-time cache), so
provider changes take effect immediately without a restart. Only host or
port changes require a restart.
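A modification-time cache of this kind can be sketched as follows. `ConfigCache` is a hypothetical class written for illustration; the demo at the bottom bumps the file's mtime by hand only so the example does not depend on filesystem timestamp resolution.

```python
import json
import os
import tempfile


class ConfigCache:
    """Re-read the JSON config only when the file's mtime changes (sketch)."""

    def __init__(self, path: str):
        self.path = path
        self._mtime = None
        self._config = {}

    def get(self) -> dict:
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:  # file changed since the last read
            with open(self.path, encoding="utf-8") as fh:
                self._config = json.load(fh)
            self._mtime = mtime
        return self._config


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"server": {"port": 8080}}, fh)
    cache = ConfigCache(path)
    first = cache.get()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"server": {"port": 9000}}, fh)
    st = os.stat(path)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    second = cache.get()  # picks up the edit without a restart
```

Calling `get()` on every request costs one `stat` call in the common case, which is why edits to providers can take effect immediately.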
The repo has three distinct things named "test"-ish — each does something different:
| File | What it is |
|---|---|
| tests/ | The pytest unit/integration suite (run with pytest). New as of this release. |
| llmproxy_test_client.py | Live integration test client. Talks to a running llmproxy over HTTP. |
| test_tui.py | Interactive chat TUI for hand-driving the proxy (despite the misleading name). |
pip install -r requirements-dev.txt
pytest # run everything
pytest --cov=llmproxy --cov=scripts # with coverage
pytest tests/test_scraper # just the scraper tests
ruff check llmproxy scripts tests # lint
CI runs the same checks on every push and pull request; see
.github/workflows/ci.yml. It runs:
- pytest across Python 3.11 and 3.12,
- ruff lint,
- a guard that fails the build if config.example.json has drifted from llmproxy/providers.json (regenerate locally with python scripts/update_free_models.py --regen-config-only).
llmproxy_test_client.py is a standalone script with no dependencies beyond
requests. It connects to a running llmproxy instance and exercises all
endpoints, printing a pass/fail/skip report.
# Run all test suites against the default localhost:8080
python llmproxy_test_client.py
# Target a different host or port
python llmproxy_test_client.py --base-url http://localhost:9000/v1
# Force a specific model for chat/embedding/streaming tests
python llmproxy_test_client.py --model openrouter/openrouter/free
# Run only the structural tests (no live LLM calls required)
python llmproxy_test_client.py --suite health --suite errors
# Skip streaming (useful in environments that buffer SSE)
python llmproxy_test_client.py --no-stream
# Include OpenAI SDK compatibility test (requires: pip install openai)
python llmproxy_test_client.py --use-sdk
| Suite | What it checks | Needs provider? |
|---|---|---|
| health | GET /health returns 200 and lists active providers | No |
| errors | Missing model field, bad prefix, unknown provider, non-JSON body | No |
| models | GET /v1/models aggregates all providers; naming convention | Yes |
| free | Sends several prompts to model="llmproxy__free"; tests cycling + streaming | Yes (free tier) |
| local | Sends several prompts to model="llmproxy__local"; skipped if none configured | Yes (localhost) |
| chat | Non-streaming chat completion; checks response content | Yes |
| streaming | Streaming SSE chat; prints tokens live as they arrive | Yes |
| embeddings | Embedding request; accepts graceful 400/404 if unsupported | Yes |
| sdk | Same chat + stream tests via the openai Python package | Yes |
When no --model flag is given, the client auto-selects a model from the
proxy's /v1/models list, preferring names that suggest a free or small model
(free, mini, flash, haiku, small, 8b, etc.).
llmproxy test client
Target: http://localhost:8080/v1
───────────────────────────────────────────────────────
══ Health Check ══
✓ GET /health returns 200 providers=[]
No providers configured yet. Run: python run.py --setup
══ Error Handling ══
✓ Missing 'model' field → 400
✓ Non-prefixed model string → 400
✓ Unknown provider → 404
✓ Non-JSON body → 400
✓ GET /health JSON schema contains 'status'
───────────────────────────────────────────────────────
Results: 6 passed 0 failed 1 skipped / 7 total
If you prefer a system-wide llmproxy command, install the package:
pip install -e . # editable install (recommended for development)
# or
pip install .
After installation, run.py is no longer needed; use the llmproxy command
directly:
llmproxy --setup
llmproxy
llmproxy --port 9000 --log-level DEBUG
llmproxy --list-providers
llmproxy --version
docker build -t llmproxy .
Or pull from GHCR (see GHCR — hosting and pulling):
docker pull ghcr.io/billjr99/llmproxy:latest
Config is bind-mounted from ~/.config/llmproxy on the host. The image runs
as a non-root user by default (no --user required); passing
--user $(id -u):$(id -g) makes files created inside the container owned by
you on the host.
mkdir -p ~/.config/llmproxy
docker run -it --rm \
--user $(id -u):$(id -g) \
-v ~/.config/llmproxy:/config \
-e LLMPROXY_CONFIG=/config/config.json \
llmproxy --setup
docker run -d \
-p 8080:8080 \
--user $(id -u):$(id -g) \
-v ~/.config/llmproxy:/config \
-e LLMPROXY_CONFIG=/config/config.json \
--name llmproxy \
llmproxy
docker run -it --rm \
--user $(id -u):$(id -g) \
-v ~/.config/llmproxy:/config \
-e LLMPROXY_CONFIG=/config/config.json \
llmproxy --setup
# Restart only if host or port changed; hot-reload handles everything else
docker restart llmproxy
When llmproxy runs in Docker but you want it to talk to a local provider like
Ollama running on the host (or in a sibling container), localhost inside the
container points to the container itself — not to your host. You have three
options; pick whichever fits your setup.
Option A — host.docker.internal (recommended for Docker Desktop)
Change the provider's base_url from http://localhost:11434/v1 to
http://host.docker.internal:11434/v1. llmproxy already treats
host.docker.internal and gateway.docker.internal as local for the purposes
of llmproxy__local routing, so the __local virtual model picks it up
automatically.
On plain Linux (no Docker Desktop), host.docker.internal doesn't resolve by
default — add it explicitly:
docker run --add-host=host.docker.internal:host-gateway ... llmproxy
…or in docker-compose.yml:
services:
llmproxy:
# ...
extra_hosts:
- "host.docker.internal:host-gateway"
Option B — host networking
Start llmproxy with --network=host and keep the original
http://localhost:11434/v1 config. Simplest on Linux; not available on
Docker Desktop.
If you prefer to keep the config entirely inside Docker (useful for CI or
rootless environments where a host-path mount is inconvenient), mount the
named volume over the default config location under the non-root user's home
(/home/llmproxy/.config/llmproxy):
# Setup
docker run -it --rm \
-v llmproxy_config:/home/llmproxy/.config/llmproxy \
llmproxy --setup
# Server
docker run -d \
-p 8080:8080 \
-v llmproxy_config:/home/llmproxy/.config/llmproxy \
--name llmproxy \
llmproxy
The docker-compose.yml uses a bind mount from ~/.config/llmproxy on the
host and runs containers as the current user. Create a .env file first so
Compose picks up your UID/GID:
printf "UID=%s\nGID=%s\n" "$(id -u)" "$(id -g)" > .env
mkdir -p ~/.config/llmproxy
# Build and start the server (detached)
docker-compose up -d
# First-time setup or reconfigure (interactive)
docker-compose run --rm setup
# Restart to apply host/port changes
docker-compose restart llmproxy
# View logs
docker-compose logs -f llmproxy
# Stop and remove containers (host config directory is preserved)
docker-compose down
The included GitHub Actions workflow (.github/workflows/docker-publish.yml)
automatically builds and pushes the image to
GitHub Container Registry (GHCR) on every push to main
and on every version tag (v*). It uses GITHUB_TOKEN, so no extra secrets
or personal access tokens are needed.
To enable it, fork or push the repo to GitHub — the workflow runs automatically. Images are published to:
ghcr.io/<your-github-username>/llmproxy
For this repository: ghcr.io/billjr99/llmproxy.
Tags produced:
| Event | Tags |
|---|---|
| Push to main | main, latest |
| Push tag v1.2.3 | 1.2.3, 1.2, latest |
docker pull ghcr.io/billjr99/llmproxy:latest
mkdir -p ~/.config/llmproxy
# First-time setup
docker run -it --rm \
--user $(id -u):$(id -g) \
-v ~/.config/llmproxy:/config \
-e LLMPROXY_CONFIG=/config/config.json \
ghcr.io/billjr99/llmproxy:latest --setup
# Start the server
docker run -d \
-p 8080:8080 \
--user $(id -u):$(id -g) \
-v ~/.config/llmproxy:/config \
-e LLMPROXY_CONFIG=/config/config.json \
--name llmproxy \
ghcr.io/billjr99/llmproxy:latest
To use the GHCR image instead of building locally, replace build: . in
docker-compose.yml with:
image: ghcr.io/billjr99/llmproxy:latest
All endpoints mirror the OpenAI API.
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check; returns provider list |
| GET | /version | Returns the running llmproxy version |
| GET | /v1/models | Aggregate model list from all providers |
| GET | /v1/models/<model_id> | Single model lookup |
| POST | /v1/chat/completions | Chat completions (streaming supported) |
| POST | /v1/completions | Legacy text completions |
| POST | /v1/embeddings | Embeddings |
| * | /v1/<anything> | Pass-through to upstream (see note below) |
For pass-through endpoints not listed above (e.g., /v1/audio/transcriptions),
the proxy routes based on the model field in the request body. For
GET/DELETE requests without a model field, append ?provider=<name> to the URL.
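The routing decision for pass-through endpoints can be pictured with the small sketch below. `resolve_provider` is a hypothetical helper; what the proxy does when neither source names a provider is not specified here, so the sketch simply returns None.

```python
from typing import Optional


def resolve_provider(body: Optional[dict], query: dict) -> Optional[str]:
    """Pick the provider from the body's model field, else from ?provider=<name>."""
    model = (body or {}).get("model")
    if isinstance(model, str) and "/" in model:
        return model.split("/", 1)[0]  # provider prefix of the canonical form
    return query.get("provider")
```

A POST whose body carries `"model": "openai/whisper-1"` would route to the openai provider, while a body-less GET needs `?provider=openai` on the URL.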
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-used", # llmproxy uses the upstream key from config
)
response = client.chat.completions.create(
model="openrouter/anthropic/claude-3.5-sonnet",
messages=[{"role": "user", "content": "Hello!"}],
)
Add the following to ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"plugin": [
"opencode-lmstudio"
],
"provider": {
"lmstudio": {
"npm": "@ai-sdk/openai-compatible",
"name": "llmproxy",
"options": {
"baseURL": "http://localhost:8080/v1",
"apiKey": "sk-local"
}
}
}
}
The opencode-lmstudio plugin provides the @ai-sdk/openai-compatible adapter.
The apiKey value is not used by llmproxy but is required by the adapter; any
non-empty string works.
Install the pi-openai-compat
plugin and point it at http://localhost:8080. No API key is required.
# List all available models
curl http://localhost:8080/v1/models | jq '.data[].id'
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openrouter/openrouter/free",
"messages": [{"role": "user", "content": "Hello!"}]
}'
All flags apply equally to python run.py and the installed llmproxy command.
usage: run.py [--setup] [--config PATH] [--host HOST] [--port PORT]
[--log-level LEVEL] [--list-providers] [--version]
(no flags) Start the proxy server.
--setup Interactive configuration wizard.
--config PATH Override config file location.
--host HOST Bind host (overrides config).
--port PORT Bind port (overrides config).
--log-level LEVEL DEBUG | INFO | WARNING | ERROR.
--list-providers Print configured providers and exit.
--version Print version and exit.
| Variable | Purpose |
|---|---|
| LLMPROXY_CONFIG | Override the default config file path. |
- The server is a thin Flask application backed by gunicorn (gthread workers) when gunicorn is installed, falling back to the Flask development server.
- /v1/models queries all providers concurrently via ThreadPoolExecutor. A single unreachable provider is logged as a warning and omitted from the aggregate response rather than causing an overall failure.
- Config is hot-reloaded on each request via an mtime cache; provider changes take effect without a server restart. Only host and port changes require one.
- Streaming responses are relayed as raw SSE byte streams via stream_with_context, preserving upstream chunk boundaries.
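The concurrent aggregation described in these notes can be sketched with the standard library alone. `aggregate_models` and the shape of `fetchers` are assumptions for illustration; each fetcher stands in for one provider's /models call.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

log = logging.getLogger("llmproxy.sketch")


def aggregate_models(fetchers: dict) -> list:
    """fetchers maps provider name -> zero-argument callable returning upstream ids."""
    if not fetchers:
        return []
    models = []
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = {pool.submit(fn): name for name, fn in fetchers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                models += [f"{name}__{model}" for model in future.result()]
            except Exception as exc:
                # One unreachable provider is logged and skipped, not fatal.
                log.warning("provider %s unreachable: %s", name, exc)
    return sorted(models)
```

Because each provider's result is collected independently, a slow or failing upstream only removes its own models from the listing.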