Commit 80faf8f

docs: add Granite Guardian page, update benchmarks and ztoken, fix broken link

- Add comprehensive Granite Guardian reference page covering Go API, CLI, REST API, guardrails middleware, and all 13 risk categories
- Add SentencePiece unigram support note to ztoken ecosystem page
- Add Mistral output quality note to benchmarks (pending tokenizer fix)
- Fix broken /docs/getting-started/gpu-setup link to /docs/architecture/gpu-setup

1 parent e1a539a commit 80faf8f

File tree

4 files changed: +390 additions, -3 deletions

content/docs/ecosystem/ztoken.md

Lines changed: 4 additions & 0 deletions

````diff
@@ -90,6 +90,10 @@ tok := ztoken.NewBPETokenizer(vocab, merges, special, false)
 tok.SetSentencePiece(true)
 ```
 
+## SentencePiece Unigram Support
+
+As of v0.3.0, ztoken supports SentencePiece unigram tokenization in addition to BPE. Unigram models (used by T5, mBART, and some multilingual models) are detected automatically when loading from HuggingFace JSON or GGUF files with `tokenizer.ggml.model = "llama"` and a unigram vocabulary.
+
 ## Supported Models
 
 ztoken is compatible with tokenizers from:
````
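The added docs say unigram models are detected from GGUF metadata (`tokenizer.ggml.model = "llama"` plus a unigram vocabulary). As a rough sketch of that detection rule, not ztoken's actual API, one could branch on the metadata key and on whether the vocabulary carries per-token scores (a unigram model stores a log-probability per piece); the function name `detectTokenizer` and the `hasScores` flag here are hypothetical:

```go
package main

import "fmt"

// detectTokenizer is a hypothetical illustration of the detection rule the
// docs describe, not ztoken's real API. GGUF files mark SentencePiece-family
// tokenizers with tokenizer.ggml.model = "llama"; a unigram vocabulary
// additionally carries a score (log-probability) per token.
func detectTokenizer(ggmlModel string, hasScores bool) string {
	if ggmlModel == "llama" && hasScores {
		return "sentencepiece-unigram"
	}
	if ggmlModel == "gpt2" {
		return "bpe" // GPT-2-style byte-level BPE
	}
	return "unknown"
}

func main() {
	fmt.Println(detectTokenizer("llama", true))
	fmt.Println(detectTokenizer("gpt2", false))
}
```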

content/docs/getting-started/quickstart.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -301,7 +301,7 @@ fmt.Printf("Tokens: %d, Duration: %s\n", result.TokenCount, result.Duration)
 ## Next Steps
 
 - [Installation](/docs/getting-started/installation) -- detailed installation and platform support
-- [GPU Setup](/docs/getting-started/gpu-setup) -- configure CUDA, ROCm, or OpenCL for hardware-accelerated inference
+- [GPU Setup](/docs/architecture/gpu-setup) -- configure CUDA, ROCm, or OpenCL for hardware-accelerated inference
 - [API Server](/docs/deployment) -- serve models behind an OpenAI-compatible HTTP API
 - [API Reference](/docs/api) -- full API documentation
 - [Tutorials](/docs/tutorials) -- step-by-step guides for common tasks
```

content/docs/reference/benchmarks.md

Lines changed: 8 additions & 2 deletions

```diff
@@ -48,8 +48,14 @@ Ollama v0.17.7.
 | Mistral 7B Q4_K_M | mistral | 7B | **44** | 46.77 | **0.94x** | ~Even |
 
 Zerfoo wins on small models (1B-1.5B). Llama 3.2 3B is at parity. Mistral 7B
-was previously at 11 tok/s due to a performance regression; after the fix it
-runs at 44 tok/s (0.94x Ollama -- near parity).
+was previously at 11 tok/s due to a performance regression; after the shared
+memory fix it runs at 44 tok/s (0.94x Ollama -- near parity).
+
+> **Note on Mistral output quality:** Mistral 7B throughput is correct at 44
+> tok/s, but output quality is pending a tokenizer fix. The Mistral tokenizer
+> requires SentencePiece byte-fallback handling that is not yet fully
+> implemented. Throughput numbers are valid; text coherence will improve once
+> the tokenizer fix lands.
 
 Additional architectures (Qwen, Phi, Mixtral, Command-R, Falcon, Mamba, RWKV)
 will be added as GGUF files are acquired and parser compatibility is resolved.
```
