DAT-Bench instruments divergent word choice in LLMs.
It starts with the Divergent Association Task (DAT): generate 10 nouns that are as semantically distant from each other as possible. Responses are scored with GloVe 840B cosine distances and exposed as verifiers-compatible reward signals for evaluation and RL training.
The first track is implemented. Decomposition tracks are planned.
Most model benchmarks reward correctness, instruction following, or reasoning over fixed targets. DAT-Bench asks a narrower, complementary question:
Can a rewardable measure of divergent thinking improve how models explore, decompose, and cover open-ended tasks?
DAT is the first instrument. It gives a cheap, repeatable signal for semantic divergence: not whether the model knows an answer, but whether it can move across distant regions of word space while staying within task constraints.
The planned decomposition tracks test whether that diversity objective carries over from word choice to task structure.
Implemented:
- DAT scorer using GloVe 840B cosine distances
- Composite `verifiers` rubric for evaluation and RL training
- Training environment with XML output and format reward
- Eval environment with Pydantic structured output
- Multi-provider experiment runner: OpenAI, Anthropic, Ollama, OpenRouter
- Prompting strategies for DAT trials
- Statistical visualisations: ridge plots, significance matrices, effect sizes
- Hand-curated fixtures for pipeline validation
Planned:
- QD-DP: query decomposition with an explicit diversity objective
- TD-DP: task decomposition with an explicit diversity objective
- Word-chain trajectory benchmark for graph-navigation strategy analysis
- Python 3.10+
- `uv`
- GloVe 840B embeddings
- A word list compatible with the scorer
Set the embedding and word-list paths with environment variables:
```bash
export GLOVE_PATH=/path/to/glove.840B.300d.txt
export WORDS_PATH=/path/to/words.txt
```

The scorer also checks:

- `data/embeddings/glove.840B.300d.txt`
- `data/words.txt`
```bash
git clone https://github.com/NasonZ/DAT-Bench.git
cd DAT-Bench
uv venv .venv
source .venv/bin/activate
uv pip install -e .
```

For development:

```bash
uv pip install -e ".[dev]"
pytest
```

List available prompting strategies:

```bash
uv run divergent-bench strategies
```

OpenAI:

```bash
uv run divergent-bench run \
  --provider openai \
  --model gpt-5-mini \
  --strategy competitive \
  --samples 20
```

Ollama:

```bash
uv run divergent-bench run \
  --provider ollama \
  --model llama3.2:3b \
  --strategy random \
  --samples 15
```

OpenRouter:

```bash
export OPENROUTER_API_KEY="***"
uv run divergent-bench run \
  --provider openrouter \
  --model meta-llama/llama-3.1-8b-instruct \
  --samples 10
```

DAT-Bench exposes two environment loaders.
```python
from environments.dat_bench.dat_bench import load_environment, load_eval_environment

# Training: XMLParser + format reward signal
train_env = load_environment(strategy="competitive", num_examples=50)

# Eval: Pydantic structured output, deterministic parsing
eval_env = load_eval_environment(strategy="DAT_instructions", num_examples=30)
```

Training uses XML because format compliance is part of the reward signal. Eval uses structured output so parsing is handled by the provider instead of the model.
The rubric exposes these reward signals:
| Signal | Weight | Purpose |
|---|---|---|
| `creativity_reward` | 1.0 | Normalised DAT score; primary reward |
| `validity_reward` | 0.2 | Fraction of generated words in the GloVe vocabulary |
| `format_reward` | 0.1 | XML format compliance; training only |
| `raw_dat_score` | 0.0 | Raw DAT score for analysis |
| `valid_word_count` | 0.0 | Count of valid words for analysis |
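As a rough illustration of how the weighted signals combine into a single scalar reward (the rollout values below are hypothetical; only the weights come from the table, and the actual composition is handled by the `verifiers` rubric):

```python
# Hypothetical values for one rollout; weights are the ones from the table above.
signals = {
    "creativity_reward": 0.82,  # normalised DAT score
    "validity_reward": 1.0,     # every generated word found in the GloVe vocabulary
    "format_reward": 1.0,       # XML parsed cleanly (training only)
}
weights = {"creativity_reward": 1.0, "validity_reward": 0.2, "format_reward": 0.1}

# raw_dat_score and valid_word_count carry weight 0.0: they are logged for
# analysis but excluded from the scalar reward.
reward = sum(weights[name] * value for name, value in signals.items())
print(f"composite reward: {reward:.2f}")  # 1.12
```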
```python
from divergent_bench import DATScorer

scorer = DATScorer()
score = scorer.dat([
    "whale", "hammer", "symphony", "cactus", "glacier",
    "umbrella", "passport", "volcano", "whistle", "tapestry",
])
print(f"DAT score: {score:.1f}")
```

The DAT score is the average cosine distance between all pairs of the first 7 valid unique words, multiplied by 100:

```
score = mean(pairwise_cosine_distance(first_7_valid_words)) * 100
```
Higher scores indicate greater semantic distance between the generated words. The reward normalises raw DAT scores to [0, 1] with a linear map over the empirical range [40, 100].
This is not a general creativity score. It is a specific measure of semantic divergence under a constrained word-generation task.
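For readers who want the formula in code, here is a minimal sketch assuming word vectors are already loaded into a dict; the packaged `DATScorer` handles vocabulary filtering and the GloVe lookup itself, so this is illustrative only:

```python
import itertools

import numpy as np

def dat_score(words: list[str], vectors: dict[str, np.ndarray]) -> float:
    # Keep the first 7 unique words that have an embedding.
    valid, seen = [], set()
    for word in (w.lower().strip() for w in words):
        if word in vectors and word not in seen:
            valid.append(vectors[word])
            seen.add(word)
        if len(valid) == 7:
            break
    # Mean pairwise cosine distance, scaled to the conventional 0-100 range.
    distances = [
        1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in itertools.combinations(valid, 2)
    ]
    return float(np.mean(distances)) * 100

def creativity_reward(raw: float, lo: float = 40.0, hi: float = 100.0) -> float:
    # Linear map of the empirical range [40, 100] onto [0, 1], clipped at the ends.
    return min(max((raw - lo) / (hi - lo), 0.0), 1.0)
```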
| Strategy | Temperature | Description |
|---|---|---|
| `none` | 0.7 | Minimal instructions; baseline |
| `competitive` | 0.7 | Prize framing with tips for maximising distance |
| `DAT_instructions` | 0.7 | Full task context from the original DAT framing |
| `random` | 1.0 | Explicit randomness instruction |
DAT is the first track because it provides a concrete reward for divergence. The planned decomposition tracks ask whether the same pressure helps models cover more of an open problem.
QD-DP: given a complex question, generate orthogonal sub-questions, answer each, and synthesise the result.
Candidate metrics:
- orthogonality between sub-questions (sketched after this list)
- coverage against a topic map
- redundancy penalty
- answer quality after synthesis
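Nothing in this track is implemented yet, but the orthogonality metric could plausibly mirror the DAT computation at the sentence level. A sketch, where `embed` stands in for any sentence-embedding function:

```python
import itertools

import numpy as np

def orthogonality(sub_questions: list[str], embed) -> float:
    # Mean pairwise cosine distance between sub-question embeddings:
    # near 0.0 means near-duplicate questions, higher means broader coverage.
    vecs = [np.asarray(embed(q), dtype=float) for q in sub_questions]
    distances = [
        1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in itertools.combinations(vecs, 2)
    ]
    return float(np.mean(distances))
```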
TD-DP: given a complex task, generate distinct top-level approaches, select one, and build a plan tree.
Candidate metrics:
- approach diversity
- actionability
- risk coverage
- plan quality via judge rubric
Both tracks should include standard non-diversity-primed baselines and report effect sizes with multiple-comparison correction.
A word-chain benchmark is planned as a separate trajectory task.
DAT scores a final set of words. Word-chain would score a path through a fully observable word graph: each step changes one word by insertion, deletion, or substitution. This makes model strategy visible, not just final performance.
Candidate signals:
- chain length
- valid transition rate (sketched below)
- bridge pattern frequency
- turning-angle distribution
- word-length oscillation
- local graph density along the path
The design note is in `docs/research/word-chain-trajectory-benchmark-design.md`.
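To make the trajectory idea concrete, here is a sketch of the valid-transition check; the dictionary is a placeholder, since only the design document exists so far:

```python
def one_edit_apart(a: str, b: str) -> bool:
    # True if b differs from a by exactly one insertion, deletion, or substitution.
    if abs(len(a) - len(b)) > 1 or a == b:
        return False
    if len(a) == len(b):  # substitution: exactly one position differs
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    i = 0
    for ch in long_:  # insertion/deletion: short must be long_ minus one character
        if i < len(short) and short[i] == ch:
            i += 1
    return i == len(short)

def valid_transition_rate(chain: list[str], dictionary: set[str]) -> float:
    # Fraction of steps that are one-edit moves landing on a real word.
    steps = list(zip(chain, chain[1:]))
    if not steps:
        return 0.0
    ok = sum(one_edit_apart(a, b) and b in dictionary for a, b in steps)
    return ok / len(steps)
```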
```
DAT-Bench/
├── divergent_bench/
│   ├── cli.py             # packaged command-line interface
│   ├── dat/               # DAT scorer
│   ├── rubrics/           # verifiers-compatible reward rubrics
│   ├── config/            # prompting strategies and model configs
│   ├── data/              # fixtures for validation
│   ├── experiments/       # experiment runner
│   ├── llm/               # provider clients
│   ├── metrics/           # DSI and Lempel-Ziv metrics
│   ├── visualization/     # plots, styles, loaders
│   ├── decomposition/     # planned QD/TD implementations
│   └── utils/
├── environments/
│   └── dat_bench/         # verifiers environment definition
├── configs/
│   └── eval/              # eval configuration
├── docs/
│   ├── research/          # alignment plans and benchmark designs
│   ├── api/               # structured-output notes
│   └── development/       # roadmap and technical notes
├── tests/
│   ├── unit/
│   └── integration/
├── scripts/
├── pyproject.toml
└── LICENSE
```
- Multiple-comparison control: Holm by default, plus Bonferroni and Benjamini-Hochberg
- Effect sizes: Cohen's d with standard thresholds (Holm and Cohen's d are sketched after this list)
- Small-sample handling: adaptive rendering, CI capping, low-n warnings
- Colourblind-safe palette validated with Coblis
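For reference, minimal sketches of the two core statistics, assuming independent samples; the packaged visualisation pipeline applies them automatically:

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    # Pooled-SD Cohen's d; conventional thresholds: 0.2 small, 0.5 medium, 0.8 large.
    na, nb = len(a), len(b)
    pooled = np.sqrt(
        ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    )
    return float((a.mean() - b.mean()) / pooled)

def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    # Holm step-down: test the i-th smallest p-value against alpha / (m - i)
    # and stop at the first failure.
    m = len(p_values)
    reject = [False] * m
    for i, idx in enumerate(np.argsort(p_values)):
        if p_values[idx] <= alpha / (m - i):
            reject[idx] = True
        else:
            break
    return reject
```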
- The DAT scorer requires local GloVe 840B embeddings and a word list.
- Decomposition tracks are not implemented yet.
- Word-chain is a design document, not an implemented environment.
Run tests:

```bash
pytest
```

Run a focused test:

```bash
pytest tests/unit/test_dat_scorer.py
```

Run integration tests:

```bash
pytest tests/integration
```

Useful docs:

- `docs/research/verifiers-alignment-plan.md`
- `docs/research/word-chain-trajectory-benchmark-design.md`
- `docs/development/ROADMAP.md`
- `docs/api/structured-output.md`
- Olson et al. (2021). Naming unrelated words predicts creativity. PNAS.
- Pennington et al. (2014). GloVe: Global Vectors for Word Representation.
- Cohen (1988). Statistical Power Analysis for the Behavioral Sciences.
```bibtex
@software{divergent_bench,
  title  = {DAT-Bench: Divergent Thinking Benchmarks for LLMs},
  author = {Nason Zikayo},
  year   = {2026},
  url    = {https://github.com/NasonZ/DAT-Bench}
}
```

MIT. See LICENSE.