zLLM is a lightweight inference server written in Zig for managing and running LLMs. It supports downloading models from Hugging Face, converting them to the GGUF format, and executing inference using llama.cpp.
- Written entirely in Zig
- Integrated model registry (
registry_manifest.json) - Supports model download, GGUF conversion, and inference via CLI and REST API.
- MacOS (Metal) and Intel x86 support via llama.cpp
- Currently supports OpenAI's chat completions, enough to use Python's OpenAI client library.
- Zig 0.14 or newer
- Python 3 virtual environment with dependencies to run
llama.cppconversion scripts- Required only for
convertstep
- Required only for
git clone https://github.com/AbeRodz/zLLM.git
cd zLLM
zig buildThe typical usage flow is:
- Download a model
- Convert it to GGUF
- Run inference
Downloads a model from Hugging Face.
zig build run -- get gemma-3-1b 88 is the number of threads used for downloading.
Various huggingface models required a token and authentication before downloading them e.g. gemma-3. You need to setup the HF_TOKEN env variable.
Converts the downloaded model using a Python virtual environment.
zig build run -- convert gemma-3-1bRuns inference on the model:
zig build run -- run gemma-3-1bRuns inference API:
zig build run -- serveYou only need to run zig build once. You can then use the built binary directly:
./zig-out/bin/zLLM run gemma-3-1b 2>/dev/null2>/dev/null suppresses standard error output from llama.cpp.
Currently only:
- MacOS Apple Silicon.
- Linux x86 (tested on Ubuntu)
- Other distros haven't been tested yet.
Platform support is bound by build.zig and the capabilities of llama.cpp.
The list of supported models is defined in registry_manifest.json. Example entries include:
- gemma-3-1b
- gpt-2
Currently supports the chat completions endpoint example cURL:
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "gemma-3-1b",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant that only knows about math and nothing more, if what you are asked about is not about math respond just with i dont know."
},
{
"role": "user",
"content": "i would like about the history of america"
}
],
"frequency_penalty": 0.5,
"max_tokens": 100,
"presence_penalty": 0.3,
"stop": ["stop"],
"stream": false,
"temperature": 0.8,
"top_p": 1.0
}
'Not all fields from the chat completions are currently supported or do anything per se, but it's enough to use Python's OpenAI client:
import openai
client = openai.OpenAI(
base_url="http://localhost:8080/v1/", # local zLLM server
api_key= "apiKey" # dummy apiKey
)
completion = client.chat.completions.create(
stream=False,
model="gemma-3-1b", # model saved on registry
messages= [
{
'role': 'user',
'content': {
"text": "I'm doing good"
},
}
],
max_tokens=256,
temperature=0.8,
)
print(completion.choices[0].content)
# i dont know -> expected response given the example above.The following is a simple script to test streaming and estimate the token generation rate.
import openai
client = openai.OpenAI(
base_url="http://localhost:8080/v1/", # local zLLM server
api_key= "apiKey" # dummy apiKey
)
start_time = time.time()
first_token_time = None
token_count = 0
response = client.chat.completions.create(
model = "gemma-3-1b",
messages=[{"role": "user", "content": "Tell me a story about a fox"}],
stream=True,
)
try:
for chunk in response:
now = time.time()
if first_token_time is None:
first_token_time = now
print(f"⏱ First token delay: {first_token_time - start_time:.3f}s")
token_count += 1
print(chunk.choices[0].delta.content, end='', flush=True)
except Exception as err:
print(err)
pass
end_time = time.time()
duration = end_time - first_token_time
print(f"\n\n📊 Tokens streamed: {token_count}")
print(f"⚡ Throughput: {token_count / duration:.2f} tokens/sec"),