VoxelBench is a prompt-based benchmark for testing how well AI coding models can plan, architect, implement, and honestly evaluate a non-trivial software project.
The benchmark asks models to build a browser-based WebGL voxel sandbox game, then build a marketing/info website for that game. The point is not just whether the model can output code. The point is whether it can make good engineering decisions, produce something runnable, avoid fake compliance, and explain trade-offs clearly.
VoxelBench is designed to test model capability across:
- TypeScript implementation quality
- Project setup and build correctness
- WebGL/Three.js usage
- Game architecture
- SOLID-style system separation
- Procedural world generation
- Chunk loading and meshing
- Player movement, gravity, collision, and flying
- UI/menu/settings implementation
- Automated screenshot capture
- Product/design judgement
- Framework/tooling decision-making
- Honest self-critique
This repo contains two main prompts.
The first prompt asks the model to build a browser-based voxel sandbox game inside a blank Bun TypeScript project.
The game should include:
- Chunk-based voxel terrain
- Minetest/Luanti-inspired world generation
- Multiple biome-like areas
- Plains, hills, mountains, beaches, water, trees, and caves
- Gravity-based movement
- Creative flying mode
- Double-tap Space to toggle flight
- Block breaking and placement
- Main menu and settings menu
- View distance slider
- Procedural moving clouds
- Screenshot capture script
- TypeScript build and typecheck commands
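
To make the terrain items above concrete, here is a minimal sketch of a deterministic heightmap that sorts columns into water, beach, grassland, and snowy peaks. It illustrates the kind of approach a run might take and is not a required implementation. The block names, `SEA_LEVEL`, the noise scales, and the height thresholds are all assumptions; caves and trees are left out.

```typescript
// Hypothetical block names; the prompt does not prescribe these identifiers.
type Block = "air" | "water" | "sand" | "grass" | "dirt" | "stone" | "snow";

const SEA_LEVEL = 32; // assumed value, not taken from the prompt

// Deterministic integer hash mapped to [0, 1): same inputs, same output.
function hash2(x: number, z: number, seed: number): number {
  let h = Math.imul(x, 374761393) ^ Math.imul(z, 668265263) ^ Math.imul(seed, 1274126177);
  h = Math.imul(h ^ (h >>> 13), 1274126177);
  h ^= h >>> 16;
  return (h >>> 0) / 4294967296;
}

function smooth(t: number): number {
  return t * t * (3 - 2 * t); // smoothstep easing between lattice points
}

// Value noise: hash the four surrounding lattice points and interpolate.
function valueNoise(x: number, z: number, seed: number): number {
  const x0 = Math.floor(x);
  const z0 = Math.floor(z);
  const tx = smooth(x - x0);
  const tz = smooth(z - z0);
  const a = hash2(x0, z0, seed);
  const b = hash2(x0 + 1, z0, seed);
  const c = hash2(x0, z0 + 1, seed);
  const d = hash2(x0 + 1, z0 + 1, seed);
  const near = a + (b - a) * tx;
  const far = c + (d - c) * tx;
  return near + (far - near) * tz;
}

// Two octaves: a broad one for plains vs. mountains, a finer one for hills.
function heightAt(x: number, z: number, seed: number): number {
  const broad = valueNoise(x / 64, z / 64, seed);
  const detail = valueNoise(x / 16, z / 16, seed + 1);
  return Math.floor(16 + broad * 48 + detail * 8); // stays within 16..72
}

function blockAt(x: number, y: number, z: number, seed: number): Block {
  const h = heightAt(x, z, seed);
  if (y > h) return y <= SEA_LEVEL ? "water" : "air";
  if (y === h) {
    if (h <= SEA_LEVEL + 1) return "sand"; // beaches and shallow sea floor
    return h > 60 ? "snow" : "grass";      // assumed snow line
  }
  return y > h - 4 ? "dirt" : "stone";
}
```

Because every function depends only on its arguments and the seed, any chunk can be generated independently and regenerated identically after it is unloaded.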
The model must not stop to ask clarification questions. If something is ambiguous, it should make the best engineering decision it can, state the assumption, and continue.
The second prompt asks the model to build a marketing and information website for the game.
Unlike the first prompt, the framework is intentionally not specified. The model must choose the tooling itself and justify the choice.
The site should include:
- Hero section
- Game pitch
- Screenshot/showcase section
- Features section
- Controls section
- Technical highlights
- Development/status section
- Call to action
- Footer
- Responsive layout
- Good accessibility
- Sensible performance choices
- Honest handling of missing screenshots
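
As an illustration of the last item, a site could build its showcase from the screenshots that actually exist and show an explicit placeholder for the rest. This is only a sketch: `ShowcaseItem`, `ShowcaseEntry`, and `buildShowcase` are hypothetical names, and the file names in the usage note are made up.

```typescript
// Hypothetical types for a screenshot showcase section.
interface ShowcaseItem {
  file: string; // expected screenshot path
  alt: string;  // description used as alt text
}

type ShowcaseEntry =
  | { kind: "image"; src: string; alt: string }
  | { kind: "missing"; note: string };

// Only files that really exist become images; everything else is
// reported as missing instead of being faked with a stock picture.
function buildShowcase(
  expected: ShowcaseItem[],
  available: ReadonlySet<string>,
): ShowcaseEntry[] {
  return expected.map((item): ShowcaseEntry =>
    available.has(item.file)
      ? { kind: "image", src: item.file, alt: item.alt }
      : { kind: "missing", note: `Screenshot not captured yet: ${item.alt}` },
  );
}
```

For example, calling `buildShowcase` with two expected items and a set containing only `"terrain.png"` yields one image entry and one clearly labelled placeholder.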
This prompt tests product judgement, visual design, framework selection, and whether the model overengineers or underengineers the solution.
Many coding benchmarks are too narrow. They test isolated algorithmic tasks or small code snippets, but real software work involves more than that.
VoxelBench is meant to test whether a model can:
- Understand a broad product request
- Research and apply relevant ideas
- Make sensible architecture choices
- Implement multiple interacting systems
- Produce code that actually runs
- Avoid pretending incomplete features are done
- Explain limitations honestly
The game prompt is intentionally demanding because it touches rendering, world generation, input, physics, UI, build tooling, and testing.
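
Several of these systems share the same coordinate bookkeeping. The sketch below shows one way to map block coordinates to chunks and to work out which chunks to load or unload as the player moves. `CHUNK_SIZE`, the square view region, the string key format, and all function names are assumptions made for illustration.

```typescript
const CHUNK_SIZE = 16; // assumed chunk width in blocks

// Integer block coordinate -> chunk index. Math.floor keeps negative
// coordinates correct (block -1 belongs to chunk -1, not chunk 0).
function toChunkCoord(block: number): number {
  return Math.floor(block / CHUNK_SIZE);
}

// Integer block coordinate -> position inside its chunk, always 0..CHUNK_SIZE-1.
function toLocalCoord(block: number): number {
  return ((block % CHUNK_SIZE) + CHUNK_SIZE) % CHUNK_SIZE;
}

function chunkKey(cx: number, cz: number): string {
  return `${cx},${cz}`;
}

// All chunk keys inside a square of "radius" viewDistance around the player.
function requiredChunks(centerCx: number, centerCz: number, viewDistance: number): Set<string> {
  const keys = new Set<string>();
  for (let dx = -viewDistance; dx <= viewDistance; dx++) {
    for (let dz = -viewDistance; dz <= viewDistance; dz++) {
      keys.add(chunkKey(centerCx + dx, centerCz + dz));
    }
  }
  return keys;
}

// Compare what is loaded with what is needed.
function diffChunks(
  loaded: ReadonlySet<string>,
  required: ReadonlySet<string>,
): { toLoad: string[]; toUnload: string[] } {
  const toLoad: string[] = [];
  const toUnload: string[] = [];
  required.forEach((key) => {
    if (!loaded.has(key)) toLoad.push(key);
  });
  loaded.forEach((key) => {
    if (!required.has(key)) toUnload.push(key);
  });
  return { toLoad, toUnload };
}
```

Moving one chunk east with a view distance of 1 should load three chunks and unload three, which is a quick sanity check when reviewing a run's chunk manager.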
The website prompt is intentionally looser because good models should be able to make appropriate tooling and design decisions without being spoon-fed every choice.
Create a fresh folder for each model run.
For the game benchmark:
mkdir voxelbench-game-model-name
cd voxelbench-game-model-name
bun init

Then give the model the full game prompt.
After it finishes, run the commands it provides, usually something like:
bun install
bun run typecheck
bun run build
bun run screenshots
bun run dev

For the marketing site benchmark, either use the game output as context or provide the screenshot files generated by the first benchmark.
There is no official numeric scoring system yet. I score runs manually based on the quality of the result.
Important things to check:
- Does it run without manual fixes?
- Does TypeScript typecheck pass?
- Does the production build pass?
- Does the screenshot script create real screenshots?
- Is the game playable?
- Does movement feel correct?
- Does gravity/collision work?
- Does double-tap Space toggle flight?
- Does Escape, and only Escape, pause the game?
- Is the world generation actually interesting?
- Are chunks generated/unloaded properly?
- Are block breaking and placement implemented?
- Is the code clean and modular?
- Does the model avoid large god classes?
- Does it honestly report incomplete features?
- Does the marketing site use sensible tooling?
- Does the marketing site look specific to the game?
- Is the site responsive and accessible?
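
The double-tap Space check is easier to judge when you know what correct behaviour looks like. Below is a minimal sketch of a timestamp-based detector. The 300 ms window and the class name are assumptions; the prompt does not fix a timing. In a browser, the caller would also want to skip `keydown` events where `KeyboardEvent.repeat` is true, so that holding Space does not count as repeated taps.

```typescript
// Hypothetical helper: reports when a key press completes a double tap.
class DoubleTapDetector {
  private lastTapAt: number | null = null;
  private readonly windowMs: number;

  constructor(windowMs: number = 300) {
    this.windowMs = windowMs; // assumed default, tune to taste
  }

  // Call once per non-repeated key press with a millisecond timestamp.
  press(nowMs: number): boolean {
    if (this.lastTapAt !== null && nowMs - this.lastTapAt <= this.windowMs) {
      this.lastTapAt = null; // consume the pair so a triple tap toggles once
      return true;
    }
    this.lastTapAt = nowMs;
    return false;
  }
}

// Example wiring: flip a flight flag whenever a double tap completes.
function createFlightToggle(detector: DoubleTapDetector): (nowMs: number) => boolean {
  let flying = false;
  return (nowMs: number): boolean => {
    if (detector.press(nowMs)) flying = !flying;
    return flying;
  };
}
```

A single tap should still jump, so a real controller would handle the jump on every press and treat the double tap as an additional signal.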
A strong model run should produce:
- A working local setup
- Clean TypeScript
- Clear architecture
- Sensible system boundaries
- Playable movement
- Working collision and gravity
- Working creative flight
- Interesting terrain generation
- Efficient chunk meshing
- Working block interactions
- Real screenshot automation
- Clear setup instructions
- Honest limitations
- Good self-critique
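
On the chunk meshing point, one common approach is to emit only the block faces that border a non-solid neighbour, then pack them into a single geometry per chunk. The sketch below covers just the face selection step, under assumptions: the `Solid` callback, the `Face` shape, and a cubic chunk of side `size` are hypothetical, and a real run would go on to write vertex data (for example into a Three.js `BufferGeometry`).

```typescript
// Returns true if the block at these coordinates is solid.
// It must also answer for coordinates just outside the chunk.
type Solid = (x: number, y: number, z: number) => boolean;

type Dir = readonly [number, number, number];

const NEIGHBOURS: ReadonlyArray<Dir> = [
  [1, 0, 0], [-1, 0, 0],
  [0, 1, 0], [0, -1, 0],
  [0, 0, 1], [0, 0, -1],
];

interface Face {
  x: number;
  y: number;
  z: number;
  dir: Dir; // outward normal of the visible face
}

// Collect only faces where a solid block touches a non-solid neighbour.
function visibleFaces(size: number, isSolid: Solid): Face[] {
  const faces: Face[] = [];
  for (let x = 0; x < size; x++) {
    for (let y = 0; y < size; y++) {
      for (let z = 0; z < size; z++) {
        if (!isSolid(x, y, z)) continue;
        for (const dir of NEIGHBOURS) {
          if (!isSolid(x + dir[0], y + dir[1], z + dir[2])) {
            faces.push({ x, y, z, dir });
          }
        }
      }
    }
  }
  return faces;
}
```

For a fully solid 4×4×4 chunk surrounded by air, this yields 96 faces instead of the 384 that 64 separate cube meshes would draw, and the gap widens as chunks get larger.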
For the marketing site, strong runs should also show:
- Good framework/tooling judgement
- Clear design direction
- Good responsive layout
- Specific, non-generic copy
- Screenshot integration
- Accessibility awareness
- Lightweight implementation
- Good explanation of trade-offs
Lower-quality runs usually include:
- Code that does not compile
- Missing setup instructions
- Single-file demos
- Static cube scenes
- Fake features hidden behind comments
- Placeholder systems marked as complete
- Broken movement
- No gravity or collision
- No real chunking
- Flat boring terrain
- Rendering every block as a separate mesh
- Overuse of `any`
- Large god classes
- Fake screenshot files
- Generic SaaS-looking landing pages
- Overengineered frameworks without justification
- Claims that do not match the implementation
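
To show what avoiding `any` can look like in practice, here is a sketch of loading saved settings by narrowing from `unknown`. The `GameSettings` shape, the default values, and the view distance limits are assumptions for illustration; only the view distance slider and clouds come from the prompt's feature list.

```typescript
// Hypothetical settings shape; field names and limits are assumed.
interface GameSettings {
  viewDistance: number;
  cloudsEnabled: boolean;
}

const DEFAULTS: GameSettings = { viewDistance: 6, cloudsEnabled: true };
const MIN_VIEW_DISTANCE = 2;
const MAX_VIEW_DISTANCE = 16;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parse untrusted JSON without `any`: every field is checked before use,
// and anything invalid falls back to a default.
function parseSettings(json: string): GameSettings {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return { ...DEFAULTS };
  }
  if (!isRecord(raw)) return { ...DEFAULTS };

  const vd = raw["viewDistance"];
  const clouds = raw["cloudsEnabled"];
  return {
    viewDistance:
      typeof vd === "number" && Number.isFinite(vd)
        ? Math.min(MAX_VIEW_DISTANCE, Math.max(MIN_VIEW_DISTANCE, Math.round(vd)))
        : DEFAULTS.viewDistance,
    cloudsEnabled: typeof clouds === "boolean" ? clouds : DEFAULTS.cloudsEnabled,
  };
}
```

The equivalent code written with `any` would typecheck just as easily but would let a string or out-of-range view distance flow straight into the chunk loader.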
The prompts explicitly tell models not to ask clarification questions.
If something is ambiguous, the model should make the best reasonable decision, state the assumption, and keep going.
This is intentional. VoxelBench tests initiative, judgement, and execution, not just the ability to ask follow-up questions.
VoxelBench is inspired by voxel sandbox games and open-source projects such as Minetest/Luanti, but implementations should not copy source code or copyrighted assets.
Generated games and sites should not use Minecraft branding, Minecraft assets, or imply affiliation with Mojang or Microsoft.
A simple repo layout could be:
.
├── prompts
│ ├── 01-game-prototype.md
│ └── 02-marketing-site.md
├── runs
│ └── README.md
├── screenshots
│ └── README.md
└── README.md
The runs/ folder can be used to store notes, scores, generated screenshots, or links to separate repos for each model attempt.
VoxelBench is experimental.
The prompts are expected to evolve as more model runs expose weaknesses, loopholes, or unfair requirements.