zLLM

zLLM is a lightweight inference server written in Zig for managing and running LLMs. It supports downloading models from Hugging Face, converting them to the GGUF format, and executing inference using llama.cpp.

✨ Features

Written entirely in Zig
Integrated model registry (registry_manifest.json)
Supports model download, GGUF conversion, and inference via CLI and REST API.
macOS (Metal) and Intel x86 support via llama.cpp
Supports OpenAI's chat completions endpoint, which is enough to use Python's OpenAI client library.

📦 Requirements

Zig 0.14 or newer
Python 3 virtual environment with the dependencies needed to run the llama.cpp conversion scripts
- Required only for the convert step

🛠️ Build Locally

git clone https://github.com/AbeRodz/zLLM.git
cd zLLM
zig build

Usage

The typical usage flow is:

Download a model
Convert it to GGUF
Run inference
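
The three steps above can be chained from a script. The following is a minimal sketch, not part of zLLM: `build_steps` and `run_pipeline` are hypothetical helper names, and the commands are the ones documented in the sections that follow, run from the repository root.

```python
import subprocess

def build_steps(model: str, threads: int = 8) -> list[list[str]]:
    """Return the get -> convert -> run commands for one model, in order."""
    base = ["zig", "build", "run", "--"]
    return [
        base + ["get", model, str(threads)],  # download with N threads
        base + ["convert", model],            # convert to GGUF
        base + ["run", model],                # start inference
    ]

def run_pipeline(model: str, threads: int = 8) -> None:
    """Run each step in turn, stopping at the first one that fails."""
    for cmd in build_steps(model, threads):
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Print the commands instead of executing them; call run_pipeline() to run.
    for cmd in build_steps("gemma-3-1b"):
        print(" ".join(cmd))
```

Using `check=True` means a failed download or conversion aborts the sequence rather than running inference on a missing model.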

📥 Get (Download a model)

Downloads a model from Hugging Face.

zig build run -- get gemma-3-1b 8

8 is the number of threads used for downloading.

NOTE

Some Hugging Face models, such as gemma-3, are gated and require authentication with an access token before they can be downloaded. Set the HF_TOKEN environment variable to your token before running get.
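
A missing token otherwise only shows up as a failed download, so it can help to check for it first. This is a minimal sketch, assuming the token is read from the `HF_TOKEN` environment variable as described above. `GATED_MODELS` and `require_hf_token` are hypothetical names for illustration, and which models are actually gated is decided by Hugging Face, not by this list.

```python
import os

# Hypothetical list for illustration; gemma-3 is the gated example named above.
GATED_MODELS = {"gemma-3-1b"}

def require_hf_token(model: str, env=os.environ) -> None:
    """Raise early if a gated model is requested without HF_TOKEN set."""
    if model in GATED_MODELS and not env.get("HF_TOKEN"):
        raise RuntimeError(
            f"{model} requires authentication: set HF_TOKEN before running 'get'"
        )

if __name__ == "__main__":
    try:
        require_hf_token("gemma-3-1b")
        print("HF_TOKEN is set")
    except RuntimeError as err:
        print(err)
```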

🔁 Convert (to GGUF)

Converts the downloaded model using a Python virtual environment.

zig build run -- convert gemma-3-1b

🧠 Run (Inference)

Runs inference on the model:

zig build run -- run gemma-3-1b

🌐 Run (API)

Starts the inference API server:

zig build run -- serve
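
A client script may want to wait until the server accepts connections before sending requests. The sketch below uses only the standard library. It assumes the server listens on `localhost:8080`, the address used in the cURL example in this README, and `wait_for_port` is a hypothetical helper, not part of zLLM.

```python
import socket
import time

def wait_for_port(host: str = "localhost", port: int = 8080,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

if __name__ == "__main__":
    ready = wait_for_port(timeout=1.0, interval=0.2)
    print("server ready" if ready else "server not reachable")
```

A successful TCP connection only shows that the port is open; it does not confirm that a model has finished loading.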

⚙️ Direct Binary Execution

You only need to run zig build once. You can then use the built binary directly:

./zig-out/bin/zLLM run gemma-3-1b 2>/dev/null

2>/dev/null suppresses standard error output from llama.cpp.

🖥️ Supported Platforms

Currently only:

MacOS Apple Silicon.
Linux x86 (tested on Ubuntu)
- Other distros haven't been tested yet.

Platform support is bound by build.zig and the capabilities of llama.cpp.

📚 Supported Models

The list of supported models is defined in registry_manifest.json. Example entries include:

gemma-3-1b
gpt-2

OpenAI API Support

Currently, only the chat completions endpoint is supported. Example cURL request:

curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "gemma-3-1b",
    "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that only knows about math and nothing more, if what you are asked about is not about math respond just with i dont know."
            },
            {
                "role": "user",
                "content": "i would like to know about the history of america"
            }
    ],
    "frequency_penalty": 0.5,
    "max_tokens": 100,
    "presence_penalty": 0.3,
    "stop": ["stop"],
    "stream": false,
    "temperature": 0.8,
    "top_p": 1.0
}
'
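
The same request can be sent from Python's standard library, which is handy where neither cURL nor the OpenAI client is available. This is a minimal sketch, assuming the server is running at `http://localhost:8080` and returns an OpenAI-style JSON body. `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

def build_payload(model: str, user_text: str, system_text: str = "",
                  max_tokens: int = 100, temperature: float = 0.8) -> dict:
    """Assemble a chat completions request body."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }

def chat(payload: dict, url: str = URL) -> str:
    """POST the payload and return the first choice's message text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumes the OpenAI response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    payload = build_payload("gemma-3-1b", "What is 2 + 2?",
                            system_text="You only answer math questions.")
    # Inspect the body here; call chat(payload) once the server is running.
    print(json.dumps(payload, indent=2))
```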

Not every field of the chat completions request is supported yet, and some are accepted but have no effect. The supported subset is still enough to use Python's OpenAI client:

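import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1/",  # local zLLM server
    api_key="apiKey",  # dummy API key
)

completion = client.chat.completions.create(
    stream=False,
    model="gemma-3-1b",  # model saved in the registry
    messages=[
        {
            # Same system prompt as in the cURL example above.
            "role": "system",
            "content": "You are a helpful assistant that only knows about math and nothing more, if what you are asked about is not about math respond just with i dont know.",
        },
        {
            "role": "user",
            "content": "I'm doing good",
        },
    ],
    max_tokens=256,
    temperature=0.8,
)

# The reply text is on the message of the first choice.
print(completion.choices[0].message.content)
# i dont know -> expected response given the system prompt above.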

Stream support via SSE

The following is a simple script to test streaming and estimate the token generation rate.

import time

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1/",  # local zLLM server
    api_key="apiKey",  # dummy API key
)

start_time = time.time()
first_token_time = None
token_count = 0

response = client.chat.completions.create(
    model="gemma-3-1b",
    messages=[{"role": "user", "content": "Tell me a story about a fox"}],
    stream=True,
)
try:
    for chunk in response:
        # Some chunks (for example the final one) carry no text; skip them.
        if not chunk.choices or chunk.choices[0].delta.content is None:
            continue
        now = time.time()
        if first_token_time is None:
            first_token_time = now
            print(f"⏱ First token delay: {first_token_time - start_time:.3f}s")
        token_count += 1  # each chunk is counted as one token (an estimate)
        print(chunk.choices[0].delta.content, end="", flush=True)
except Exception as err:
    print(err)
end_time = time.time()

print(f"\n\n📊 Tokens streamed: {token_count}")
# Only report throughput if at least one token arrived.
if first_token_time is not None and end_time > first_token_time:
    duration = end_time - first_token_time
    print(f"⚡ Throughput: {token_count / duration:.2f} tokens/sec")
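
The OpenAI client reads the stream as Server-Sent Events. The sketch below parses that wire format by hand to show what each chunk carries. It assumes zLLM follows the OpenAI streaming convention (`data:` lines holding JSON chunks, ended by `data: [DONE]`), which this README does not spell out. `iter_sse_content` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator

def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text delta from each OpenAI-style SSE data line."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel (assumed, per the OpenAI convention)
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:  # role-only and final chunks carry no content
            yield text

if __name__ == "__main__":
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Once"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": " upon"}}]}',
        "",
        "data: [DONE]",
    ]
    print("".join(iter_sse_content(sample)))  # prints "Once upon"
```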
