VoxelBench

VoxelBench is a prompt-based benchmark for testing how well AI coding models can plan, architect, implement, and honestly evaluate a non-trivial software project.

The benchmark asks models to build a browser-based WebGL voxel sandbox game, then build a marketing/info website for that game. The point is not just whether the model can output code. The point is whether it can make good engineering decisions, produce something runnable, avoid fake compliance, and explain trade-offs clearly.

What this tests

VoxelBench is designed to test model capability across:

TypeScript implementation quality
Project setup and build correctness
WebGL/Three.js usage
Game architecture
SOLID-style system separation
Procedural world generation
Chunk loading and meshing
Player movement, gravity, collision, and flying
UI/menu/settings implementation
Automated screenshot capture
Product/design judgement
Framework/tooling decision-making
Honest self-critique

Benchmark prompts

This repo contains two main prompts.

1. VoxelBench Game Prototype

The first prompt asks the model to build a browser-based voxel sandbox game inside a blank Bun TypeScript project.

The game should include:

Chunk-based voxel terrain
Minetest/Luanti-inspired world generation
Multiple biome-like areas
Plains, hills, mountains, beaches, water, trees, and caves
Gravity-based movement
Creative flying mode
Double-tap Space to toggle flight
Block breaking and placement
Main menu and settings menu
View distance slider
Procedural moving clouds
Screenshot capture script
TypeScript build and typecheck commands
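The double-tap flight toggle is one of the easier requirements to get subtly wrong, so here is a minimal sketch of one way to detect it. The class name, method name, and the 300 ms window are assumptions made for illustration; the prompt does not specify a timing value.

```typescript
// Hypothetical helper: none of these names come from the VoxelBench prompts.
// The 300 ms default window is an assumed value, not a benchmark requirement.
class DoubleTapDetector {
  private lastTapMs: number | null = null;
  private readonly windowMs: number;

  constructor(windowMs: number = 300) {
    this.windowMs = windowMs;
  }

  // Call once per fresh key press (auto-repeat events should be ignored).
  // Returns true when this press completes a double tap.
  press(nowMs: number): boolean {
    if (this.lastTapMs !== null && nowMs - this.lastTapMs <= this.windowMs) {
      this.lastTapMs = null; // reset so a third quick tap does not toggle again
      return true;
    }
    this.lastTapMs = nowMs;
    return false;
  }
}

const detector = new DoubleTapDetector();
let flying = false;

// Simulated Space presses at these timestamps (milliseconds).
for (const t of [0, 200, 1000, 1600]) {
  if (detector.press(t)) flying = !flying;
}
console.log(flying); // true: only the 0/200 pair was close enough to toggle
```

In a browser, the timestamps would typically come from `performance.now()` inside a `keydown` handler that skips events where `event.repeat` is true, so that holding Space does not count as repeated taps.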

The model must not stop to ask clarification questions. If something is ambiguous, it should make the best engineering decision it can, state the assumption, and continue.

2. VoxelBench Marketing Site

The second prompt asks the model to build a marketing and information website for the game.

Unlike the first prompt, the framework is intentionally not specified. The model must choose the tooling itself and justify the choice.

The site should include:

Hero section
Game pitch
Screenshot/showcase section
Features section
Controls section
Technical highlights
Development/status section
Call to action
Footer
Responsive layout
Good accessibility
Sensible performance choices
Honest handling of missing screenshots
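As an illustration of what honest handling of missing screenshots could mean in practice, the sketch below renders a clearly labelled placeholder for any expected screenshot that does not exist, rather than substituting a fake image. The types, function name, and file paths are hypothetical; the prompt leaves the actual mechanism to the model.

```typescript
// Hypothetical types and names; the prompts do not define this shape.
type ExpectedShot = { file: string; alt: string };

type ShowcaseItem =
  | { kind: "image"; src: string; alt: string }
  | { kind: "placeholder"; note: string };

// Maps each expected screenshot to a real image entry if the file is present,
// or to an explicit placeholder that says the screenshot is missing.
function buildShowcase(expected: ExpectedShot[], available: Set<string>): ShowcaseItem[] {
  return expected.map((shot): ShowcaseItem =>
    available.has(shot.file)
      ? { kind: "image", src: shot.file, alt: shot.alt }
      : { kind: "placeholder", note: "Screenshot not captured yet: " + shot.file }
  );
}

// Example file names are made up for this sketch.
const showcase = buildShowcase(
  [
    { file: "screenshots/terrain.png", alt: "Hills and a beach at sunrise" },
    { file: "screenshots/caves.png", alt: "A cave entrance in a mountainside" },
  ],
  new Set(["screenshots/terrain.png"])
);
console.log(showcase.map((item) => item.kind));
```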

This prompt tests product judgement, visual design, framework selection, and whether the model overengineers or underengineers the solution.

Why this exists

Many coding benchmarks are too narrow. They test isolated algorithmic tasks or small code snippets, but real software work involves more than that.

VoxelBench is meant to test whether a model can:

Understand a broad product request
Research and apply relevant ideas
Make sensible architecture choices
Implement multiple interacting systems
Produce code that actually runs
Avoid pretending incomplete features are done
Explain limitations honestly

The game prompt is intentionally demanding because it touches rendering, world generation, input, physics, UI, build tooling, and testing.

The website prompt is intentionally looser because good models should be able to make appropriate tooling and design decisions without being spoon-fed every choice.

Suggested workflow

Create a fresh folder for each model run.

For the game benchmark:

mkdir voxelbench-game-model-name
cd voxelbench-game-model-name
bun init

Then give the model the full game prompt.

After it finishes, run the commands it provides, usually something like:

bun install
bun run typecheck
bun run build
bun run screenshots
bun run dev

For the marketing site benchmark, either use the game output as context or provide the screenshot files generated by the first benchmark.

Scoring

There is no official numeric scoring system yet. I score runs manually based on the quality of the result.

Important things to check:

Does it run without manual fixes?
Does TypeScript typecheck pass?
Does the production build pass?
Does the screenshot script create real screenshots?
Is the game playable?
Does movement feel correct?
Does gravity/collision work?
Does double-tap Space toggle flight?
Does Escape, and only Escape, pause the game?
Is the world generation actually interesting?
Are chunks generated/unloaded properly?
Are block breaking and placement implemented?
Is the code clean and modular?
Does the model avoid large god classes?
Does it honestly report incomplete features?
Does the marketing site use sensible tooling?
Does the marketing site look specific to the game?
Is the site responsive and accessible?
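When checking whether chunks are generated and unloaded properly, it helps to have a picture of the bookkeeping involved. The sketch below is one simple way to do it, assuming 16-block chunks, a square view distance measured in chunks, and string keys; all of these, along with the function names, are assumptions rather than anything the prompt mandates.

```typescript
// Hypothetical sketch: chunk size, key format, and names are assumptions.
const CHUNK_SIZE = 16; // blocks per chunk edge (assumed)

type ChunkKey = string; // formatted as "cx,cz"

function chunkKey(cx: number, cz: number): ChunkKey {
  return cx + "," + cz;
}

// All chunk keys within a square radius (in chunks) of the player's chunk.
function wantedChunks(playerX: number, playerZ: number, viewDistance: number): Set<ChunkKey> {
  // Math.floor keeps negative world coordinates in the correct chunk.
  const pcx = Math.floor(playerX / CHUNK_SIZE);
  const pcz = Math.floor(playerZ / CHUNK_SIZE);
  const wanted = new Set<ChunkKey>();
  for (let dx = -viewDistance; dx <= viewDistance; dx++) {
    for (let dz = -viewDistance; dz <= viewDistance; dz++) {
      wanted.add(chunkKey(pcx + dx, pcz + dz));
    }
  }
  return wanted;
}

// Compares what is loaded with what is wanted and reports the difference.
function diffChunks(loaded: Set<ChunkKey>, wanted: Set<ChunkKey>) {
  const toLoad = Array.from(wanted).filter((key) => !loaded.has(key));
  const toUnload = Array.from(loaded).filter((key) => !wanted.has(key));
  return { toLoad, toUnload };
}

// The player starts at the origin, then walks one chunk along +x.
const loadedChunks = wantedChunks(0, 0, 1);
const chunkDiff = diffChunks(loadedChunks, wantedChunks(16, 0, 1));
console.log(chunkDiff.toLoad);   // the three chunks with cx = 2
console.log(chunkDiff.toUnload); // the three chunks with cx = -1
```

A run that never produces a non-empty `toUnload` equivalent as the player moves is usually leaking chunks, which tends to show up as steadily growing memory use during a play test.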

What scores well

A strong model run should produce:

A working local setup
Clean TypeScript
Clear architecture
Sensible system boundaries
Playable movement
Working collision and gravity
Working creative flight
Interesting terrain generation
Efficient chunk meshing
Working block interactions
Real screenshot automation
Clear setup instructions
Honest limitations
Good self-critique

For the marketing site, strong runs should also show:

Good framework/tooling judgement
Clear design direction
Good responsive layout
Specific, non-generic copy
Screenshot integration
Accessibility awareness
Lightweight implementation
Good explanation of trade-offs

What scores poorly

Lower-quality runs usually include:

Code that does not compile
Missing setup instructions
Single-file demos
Static cube scenes
Fake features hidden behind comments
Placeholder systems marked as complete
Broken movement
No gravity or collision
No real chunking
Flat boring terrain
Rendering every block as a separate mesh
Overuse of the any type
Large god classes
Fake screenshot files
Generic SaaS-looking landing pages
Overengineered frameworks without justification
Claims that do not match the implementation
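To show why rendering every block as a separate mesh scores poorly, the sketch below counts how many faces survive simple face culling, where a face is emitted only if it touches air. This is plain neighbor culling, not greedy meshing, and the chunk size, block IDs, and function names are assumptions made for illustration.

```typescript
// Hypothetical sketch: chunk size, block IDs, and function names are assumptions
// made for illustration, not part of the VoxelBench prompts.
const SIZE = 4; // deliberately tiny chunk edge length
const AIR = 0;
const STONE = 1;

// Flat array layout: x varies fastest, then y, then z.
function blockIndex(x: number, y: number, z: number): number {
  return x + SIZE * (y + SIZE * z);
}

// Anything outside the chunk is treated as air, so border faces stay visible.
function blockAt(blocks: Uint8Array, x: number, y: number, z: number): number {
  if (x < 0 || y < 0 || z < 0 || x >= SIZE || y >= SIZE || z >= SIZE) return AIR;
  return blocks[blockIndex(x, y, z)] ?? AIR;
}

const NEIGHBOR_OFFSETS: ReadonlyArray<readonly [number, number, number]> = [
  [1, 0, 0], [-1, 0, 0],
  [0, 1, 0], [0, -1, 0],
  [0, 0, 1], [0, 0, -1],
];

// Counts the faces a culling mesher would emit: only faces that touch air.
function countVisibleFaces(blocks: Uint8Array): number {
  let faces = 0;
  for (let z = 0; z < SIZE; z++) {
    for (let y = 0; y < SIZE; y++) {
      for (let x = 0; x < SIZE; x++) {
        if (blockAt(blocks, x, y, z) === AIR) continue;
        for (const [dx, dy, dz] of NEIGHBOR_OFFSETS) {
          if (blockAt(blocks, x + dx, y + dy, z + dz) === AIR) faces++;
        }
      }
    }
  }
  return faces;
}

// Fill a solid 2x2x2 cube in one corner of an otherwise empty chunk.
const cubeBlocks = new Uint8Array(SIZE * SIZE * SIZE); // zero-filled, so all air
for (let z = 0; z < 2; z++) {
  for (let y = 0; y < 2; y++) {
    for (let x = 0; x < 2; x++) {
      cubeBlocks[blockIndex(x, y, z)] = STONE;
    }
  }
}

console.log("naive faces:", 8 * 6);                          // 48
console.log("culled faces:", countVisibleFaces(cubeBlocks)); // 24
```

Even in this tiny case, culling halves the face count, and the saving grows with larger solid volumes. In a real implementation the surviving faces would be written into a single geometry per chunk instead of being counted.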

Important benchmark rule

The prompts explicitly tell models not to ask clarification questions.

If something is ambiguous, the model should make the best reasonable decision, state the assumption, and keep going.

This is intentional. VoxelBench tests initiative, judgement, and execution, not just the ability to ask follow-up questions.

Legal/assets note

VoxelBench is inspired by voxel sandbox games and open-source projects such as Minetest/Luanti, but implementations should not copy source code or copyrighted assets.

Generated games and sites should not use Minecraft branding, Minecraft assets, or imply affiliation with Mojang or Microsoft.

Repository structure suggestion

A simple repo layout could be:

.
├── prompts
│   ├── 01-game-prototype.md
│   └── 02-marketing-site.md
├── runs
│   └── README.md
├── screenshots
│   └── README.md
└── README.md

The runs/ folder can be used to store notes, scores, generated screenshots, or links to separate repos for each model attempt.

Status

VoxelBench is experimental.

The prompts are expected to evolve as more model runs expose weaknesses, loopholes, or unfair requirements.
