Building blocks for local agents in C++.
Note
This library is designed for running small language models locally using llama.cpp. If you want to call external LLM APIs, this is not the right fit.
- Context Engineering - Use callbacks to manipulate the context between iterations of the agent loop.
- Memory - Use tools that allow an agent to store and retrieve relevant information across conversations.
- Multi-Agent - Build a multi-agent system with weight sharing where a main agent delegates to specialized sub-agents.
- Shell - Allow an agent to write shell scripts to perform multiple actions at once. Demonstrates human-in-the-loop interactions via callbacks.
- Tracing - Use callbacks to collect a record of the steps of the agent loop with OpenTelemetry.
You need to download a GGUF model to run the examples. The default model configuration is set for granite-4.0-micro:
wget https://huggingface.co/ibm-granite/granite-4.0-micro-GGUF/resolve/main/granite-4.0-micro-Q8_0.gguf
Important
The examples use default ModelConfig values optimized for granite-4.0-micro. If you use a different model, you should adapt these values (context size, temperature, sampling parameters, etc.) to your specific use case.
We define an agent with the following building blocks:
In the current LLM (Large Language Model) world, an agent is usually a simple loop that alternates between Model Calls and Tool Executions until a stop condition is met.
Important
There are different ways to implement stop conditions. By default, we let the agent decide when to end the loop, by generating an output without tool executions. You can implement additional stop conditions via callbacks.
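To make the loop and its default stop condition concrete, here is a minimal, self-contained sketch. It does not use this library's API: `Message`, `ToolCall`, `ModelOutput`, `run_agent_loop`, and the `max_iterations` limit are hypothetical names chosen for illustration, and the model is replaced by a scripted stand-in.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the library's real API.
struct Message { std::string role; std::string content; };
struct ToolCall { std::string name; std::string arguments; };
struct ModelOutput { std::string text; std::vector<ToolCall> tool_calls; };

using ModelFn = std::function<ModelOutput(const std::vector<Message>&)>;
using ToolFn = std::function<std::string(const std::string&)>;

// Alternates model calls and tool executions. Stops when the model answers
// without requesting a tool (the default stop condition described above) or
// when max_iterations is reached (an assumed extra safety limit).
std::string run_agent_loop(const ModelFn& model,
                           const std::map<std::string, ToolFn>& tools,
                           std::vector<Message> messages,
                           int max_iterations = 8) {
    for (int i = 0; i < max_iterations; ++i) {
        ModelOutput out = model(messages);
        messages.push_back({"assistant", out.text});
        if (out.tool_calls.empty()) {
            return out.text;  // no tool requested: the agent is done
        }
        for (const ToolCall& call : out.tool_calls) {
            auto it = tools.find(call.name);
            std::string result = (it != tools.end())
                ? it->second(call.arguments)
                : "error: unknown tool '" + call.name + "'";
            messages.push_back({"tool", result});  // feed the result back
        }
    }
    return "error: iteration limit reached";
}

// Scripted stand-in for a model: requests one tool call, then answers.
std::string demo_agent_loop() {
    ModelFn fake_model = [](const std::vector<Message>& msgs) -> ModelOutput {
        ModelOutput out;
        if (msgs.back().role == "tool") {
            out.text = "The sum is " + msgs.back().content;
        } else {
            out.tool_calls.push_back(ToolCall{"add", "2 3"});
        }
        return out;
    };
    std::map<std::string, ToolFn> tools;
    tools["add"] = [](const std::string&) { return std::string("5"); };

    std::vector<Message> messages;
    messages.push_back(Message{"user", "What is 2 + 3?"});
    return run_agent_loop(fake_model, tools, messages);
}
```

Calling `demo_agent_loop()` runs two iterations: the first ends in a tool execution, the second produces a plain answer and ends the loop. The iteration cap is one example of an additional stop condition; in this library such conditions would be expressed through callbacks instead.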
Callbacks allow you to hook into the agent lifecycle at specific points:
- before_agent_loop / after_agent_loop - Run logic at the start/end of the agent loop
- before_llm_call / after_llm_call - Intercept or modify messages before/after model inference
- before_tool_execution / after_tool_execution - Validate, skip, or handle tool calls and their results
Use callbacks for logging, context manipulation, human-in-the-loop approval, or error recovery.
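As an illustration of what such hooks can look like, the sketch below models two of them with `std::function`. The `Callbacks` struct, its member signatures, and the helpers `make_trimmer` and `deny_tool` are assumptions made for this example; the library's actual callback interface may differ. It also assumes the first message in the context is the system prompt.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the library's real API.
struct Message { std::string role; std::string content; };
struct ToolCall { std::string name; std::string arguments; };

struct Callbacks {
    // May rewrite the context before it is sent to the model.
    std::function<void(std::vector<Message>&)> before_llm_call;
    // Returns false to skip a tool call (e.g. after asking a human).
    std::function<bool(const ToolCall&)> before_tool_execution;
};

// Context manipulation: keep the first message (assumed to be the system
// prompt) plus the most recent `keep` messages, dropping everything between.
std::function<void(std::vector<Message>&)> make_trimmer(std::size_t keep) {
    return [keep](std::vector<Message>& messages) {
        if (messages.size() <= keep + 1) return;
        messages.erase(messages.begin() + 1,
                       messages.end() - static_cast<std::ptrdiff_t>(keep));
    };
}

// Validation: reject every call to the tool with the given name.
std::function<bool(const ToolCall&)> deny_tool(std::string name) {
    return [name](const ToolCall& call) { return call.name != name; };
}

// Applies the trimming hook to a five-message context and returns its new size.
std::size_t demo_callbacks() {
    Callbacks cb;
    cb.before_llm_call = make_trimmer(2);
    cb.before_tool_execution = deny_tool("shell");

    std::vector<Message> context = {
        {"system", "You are a helpful agent."},
        {"user", "first question"},
        {"assistant", "first answer"},
        {"user", "second question"},
        {"assistant", "second answer"},
    };
    cb.before_llm_call(context);
    return context.size();
}
```

Keeping each hook as an independent `std::function` means an agent loop can check whether a hook is set and skip it otherwise, so unused lifecycle points cost nothing.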
The system prompt defines the agent's behavior and capabilities. It is passed to the Agent constructor and automatically prepended to conversations.
The model component encapsulates local LLM initialization and inference using llama.cpp. It is tightly coupled to llama.cpp and requires models in GGUF format.
It handles:
- Loading GGUF model files (quantized models recommended for efficiency)
- Chat template application and tokenization
- Text generation with configurable sampling (temperature, top_p, top_k, etc.)
- KV cache management for efficient prompt caching
Tools extend the agent's capabilities beyond text generation. Each tool defines:
- Name and description - Helps the model understand when to use it
- Parameters schema - JSON Schema defining expected arguments
- Execute function - The actual implementation
When the model decides to use a tool, the agent parses the tool call, executes it, and feeds the result back into the conversation.
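The three parts of a tool and the dispatch step can be sketched as follows. `Tool`, `ToolRegistry`, and their members are hypothetical names for this example rather than the library's API, the JSON Schema is kept as raw text instead of being validated, and returning failures as text is a design choice of the sketch.

```cpp
#include <exception>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical tool description mirroring the three parts listed above.
struct Tool {
    std::string name;
    std::string description;        // helps the model decide when to use it
    std::string parameters_schema;  // JSON Schema, stored as raw text here
    std::function<std::string(const std::string&)> execute;
};

class ToolRegistry {
public:
    void add(Tool tool) {
        std::string key = tool.name;
        tools_[key] = std::move(tool);
    }

    // Executes the named tool. Failures are returned as text so that the
    // result can still be fed back into the conversation.
    std::string call(const std::string& name,
                     const std::string& arguments_json) const {
        auto it = tools_.find(name);
        if (it == tools_.end()) {
            return "error: unknown tool '" + name + "'";
        }
        try {
            return it->second.execute(arguments_json);
        } catch (const std::exception& e) {
            return std::string("error: ") + e.what();
        }
    }

private:
    std::map<std::string, Tool> tools_;
};

// Registers a trivial tool that returns its raw arguments unchanged.
ToolRegistry make_demo_registry() {
    Tool echo;
    echo.name = "echo";
    echo.description = "Returns its arguments unchanged.";
    echo.parameters_schema =
        R"({"type":"object","properties":{"text":{"type":"string"}}})";
    echo.execute = [](const std::string& arguments_json) {
        return arguments_json;
    };

    ToolRegistry registry;
    registry.add(std::move(echo));
    return registry;
}
```

With this registry, `make_demo_registry().call("echo", R"({"text":"hi"})")` returns the argument string unchanged, and a call to an unregistered name returns an error message instead of throwing, which gives the model a chance to recover on the next iteration.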