OpenStaticFish/Minebench

VoxelBench

VoxelBench is a prompt-based benchmark for testing how well AI coding models can plan, architect, implement, and honestly evaluate a non-trivial software project.

The benchmark asks models to build a browser-based WebGL voxel sandbox game, then build a marketing/info website for that game. The point is not just whether the model can output code. The point is whether it can make good engineering decisions, produce something runnable, avoid fake compliance, and explain trade-offs clearly.

What this tests

VoxelBench is designed to test model capability across:

  • TypeScript implementation quality
  • Project setup and build correctness
  • WebGL/Three.js usage
  • Game architecture
  • SOLID-style system separation
  • Procedural world generation
  • Chunk loading and meshing
  • Player movement, gravity, collision, and flying
  • UI/menu/settings implementation
  • Automated screenshot capture
  • Product/design judgement
  • Framework/tooling decision-making
  • Honest self-critique

Benchmark prompts

This repo contains two main prompts.

1. VoxelBench Game Prototype

The first prompt asks the model to build a browser-based voxel sandbox game inside a blank Bun TypeScript project.

The game should include:

  • Chunk-based voxel terrain
  • Minetest/Luanti-inspired world generation
  • Multiple biome-like areas
  • Plains, hills, mountains, beaches, water, trees, and caves
  • Gravity-based movement
  • Creative flying mode
  • Double-tap Space to toggle flight
  • Block breaking and placement
  • Main menu and settings menu
  • View distance slider
  • Procedural moving clouds
  • Screenshot capture script
  • TypeScript build and typecheck commands
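As an illustration of how chunk-based terrain and the view distance slider fit together, here is a minimal sketch of working out which chunks should be loaded around the player. The chunk size of 16 and every name here (CHUNK_SIZE, worldToChunk, chunksInView, diffChunks) are assumptions for illustration, not requirements of the prompt.

```typescript
// Hypothetical chunk edge length in blocks; the prompt does not fix a value.
const CHUNK_SIZE = 16;

interface ChunkCoord {
  cx: number;
  cz: number;
}

// Map a world-space block position to the chunk that contains it.
// Math.floor keeps negative coordinates in the correct chunk.
function worldToChunk(x: number, z: number): ChunkCoord {
  return { cx: Math.floor(x / CHUNK_SIZE), cz: Math.floor(z / CHUNK_SIZE) };
}

function chunkKey(c: ChunkCoord): string {
  return `${c.cx},${c.cz}`;
}

// All chunk keys within a square of `viewDistance` chunks around the player.
function chunksInView(playerX: number, playerZ: number, viewDistance: number): Set<string> {
  const center = worldToChunk(playerX, playerZ);
  const wanted = new Set<string>();
  for (let dz = -viewDistance; dz <= viewDistance; dz++) {
    for (let dx = -viewDistance; dx <= viewDistance; dx++) {
      wanted.add(chunkKey({ cx: center.cx + dx, cz: center.cz + dz }));
    }
  }
  return wanted;
}

// Compare the currently loaded set with the wanted set to decide
// which chunks to generate and which to unload.
function diffChunks(loaded: Set<string>, wanted: Set<string>): { toLoad: string[]; toUnload: string[] } {
  const toLoad = Array.from(wanted).filter((k) => !loaded.has(k));
  const toUnload = Array.from(loaded).filter((k) => !wanted.has(k));
  return { toLoad, toUnload };
}
```

A run could call diffChunks whenever the player crosses a chunk boundary or the slider changes. A square region is used only for simplicity; a circular radius is an equally reasonable choice.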

The model must not stop to ask clarification questions. If something is ambiguous, it should make the best engineering decision it can, state the assumption, and continue.

2. VoxelBench Marketing Site

The second prompt asks the model to build a marketing and information website for the game.

Unlike the first prompt, the framework is intentionally not specified. The model must choose the tooling itself and justify the choice.

The site should include:

  • Hero section
  • Game pitch
  • Screenshot/showcase section
  • Features section
  • Controls section
  • Technical highlights
  • Development/status section
  • Call to action
  • Footer
  • Responsive layout
  • Good accessibility
  • Sensible performance choices
  • Honest handling of missing screenshots

This prompt tests product judgement, visual design, framework selection, and whether the model overengineers or underengineers the solution.

Why this exists

Many coding benchmarks are too narrow. They test isolated algorithmic tasks or small code snippets, but real software work involves more than that.

VoxelBench is meant to test whether a model can:

  • Understand a broad product request
  • Research and apply relevant ideas
  • Make sensible architecture choices
  • Implement multiple interacting systems
  • Produce code that actually runs
  • Avoid pretending incomplete features are done
  • Explain limitations honestly

The game prompt is intentionally demanding because it touches rendering, world generation, input, physics, UI, build tooling, and testing.

The website prompt is intentionally looser because good models should be able to make appropriate tooling and design decisions without being spoon-fed every choice.

Suggested workflow

Create a fresh folder for each model run.

For the game benchmark:

mkdir voxelbench-game-model-name
cd voxelbench-game-model-name
bun init

Then give the model the full game prompt.

After it finishes, run the commands it provides, usually something like:

bun install
bun run typecheck
bun run build
bun run screenshots
bun run dev

For the marketing site benchmark, either use the game output as context or provide the screenshot files generated by the first benchmark.

Scoring

There is no official numeric scoring system yet. I score runs manually based on the quality of the result.

Important things to check:

  • Does it run without manual fixes?
  • Does TypeScript typecheck pass?
  • Does the production build pass?
  • Does the screenshot script create real screenshots?
  • Is the game playable?
  • Does movement feel correct?
  • Does gravity/collision work?
  • Does double-tap Space toggle flight?
  • Does Escape, and only Escape, pause the game?
  • Is the world generation actually interesting?
  • Are chunks generated/unloaded properly?
  • Are block breaking and placement implemented?
  • Is the code clean and modular?
  • Does the model avoid large god classes?
  • Does it honestly report incomplete features?
  • Does the marketing site use sensible tooling?
  • Does the marketing site look specific to the game?
  • Is the site responsive and accessible?
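When checking the double-tap Space item above, it helps to know what a correct implementation roughly looks like. The following is a minimal sketch, assuming a 300 ms tap window; the class name DoubleTapDetector and the window length are hypothetical, and the prompt does not prescribe either.

```typescript
// Detects two taps of the same key within a time window.
// Timestamps are passed in so the logic is testable without a browser.
class DoubleTapDetector {
  private lastTapMs: number | null = null;
  private readonly windowMs: number;

  constructor(windowMs: number = 300) {
    this.windowMs = windowMs; // assumed default, not taken from the prompt
  }

  // Returns true when this tap completes a double tap.
  tap(nowMs: number): boolean {
    const isDouble =
      this.lastTapMs !== null && nowMs - this.lastTapMs <= this.windowMs;
    // After a double tap, reset so a third quick tap does not toggle again.
    this.lastTapMs = isDouble ? null : nowMs;
    return isDouble;
  }
}

// Example wiring: flip a flight flag on each completed double tap.
let flying = false;
const spaceTaps = new DoubleTapDetector();

function onSpacePressed(nowMs: number): void {
  if (spaceTaps.tap(nowMs)) {
    flying = !flying;
  }
}
```

In a browser, onSpacePressed would typically be called from a keydown handler with performance.now(), skipping events where event.repeat is true so that holding Space does not count as repeated taps.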

What scores well

A strong model run should produce:

  • A working local setup
  • Clean TypeScript
  • Clear architecture
  • Sensible system boundaries
  • Playable movement
  • Working collision and gravity
  • Working creative flight
  • Interesting terrain generation
  • Efficient chunk meshing
  • Working block interactions
  • Real screenshot automation
  • Clear setup instructions
  • Honest limitations
  • Good self-critique
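To make "efficient chunk meshing" concrete, here is a minimal sketch of hidden-face culling, one common technique (greedy meshing is another). It only counts the faces that would be emitted rather than building vertex buffers. The chunk size of 4 and all names are assumptions chosen to keep the example small.

```typescript
// Hypothetical tiny chunk so the numbers are easy to verify by hand.
const SIZE = 4;

// Flatten (x, y, z) into an index into a SIZE^3 voxel array.
function index(x: number, y: number, z: number): number {
  return x + SIZE * (y + SIZE * z);
}

// A voxel value of 0 means air. Positions outside the chunk are treated
// as air here; a real implementation would consult neighbouring chunks.
function isSolid(voxels: Uint8Array, x: number, y: number, z: number): boolean {
  if (x < 0 || y < 0 || z < 0 || x >= SIZE || y >= SIZE || z >= SIZE) {
    return false;
  }
  return voxels[index(x, y, z)] !== 0;
}

const NEIGHBORS: ReadonlyArray<readonly [number, number, number]> = [
  [1, 0, 0], [-1, 0, 0],
  [0, 1, 0], [0, -1, 0],
  [0, 0, 1], [0, 0, -1],
];

// Count only faces that border air. Faces between two solid blocks are
// never visible, so a mesher can skip them entirely.
function countVisibleFaces(voxels: Uint8Array): number {
  let faces = 0;
  for (let z = 0; z < SIZE; z++) {
    for (let y = 0; y < SIZE; y++) {
      for (let x = 0; x < SIZE; x++) {
        if (!isSolid(voxels, x, y, z)) continue;
        for (const [dx, dy, dz] of NEIGHBORS) {
          if (!isSolid(voxels, x + dx, y + dy, z + dz)) faces++;
        }
      }
    }
  }
  return faces;
}
```

For a completely solid 4×4×4 chunk this emits 96 faces (6 sides of 16 faces each), compared with 384 if each of the 64 blocks were drawn as its own cube.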

For the marketing site, strong runs should also show:

  • Good framework/tooling judgement
  • Clear design direction
  • Good responsive layout
  • Specific, non-generic copy
  • Screenshot integration
  • Accessibility awareness
  • Lightweight implementation
  • Good explanation of trade-offs

What scores poorly

Lower-quality runs usually include:

  • Code that does not compile
  • Missing setup instructions
  • Single-file demos
  • Static cube scenes
  • Fake features hidden behind comments
  • Placeholder systems marked as complete
  • Broken movement
  • No gravity or collision
  • No real chunking
  • Flat boring terrain
  • Rendering every block as a separate mesh
  • Overuse of the TypeScript any type
  • Large god classes
  • Fake screenshot files
  • Generic SaaS-looking landing pages
  • Overengineered frameworks without justification
  • Claims that do not match the implementation
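One cheap way to catch fake screenshot files is to check that each file at least starts with a valid PNG header and reports plausible dimensions. The sketch below is a hypothetical helper, not part of the benchmark, and it assumes the screenshot script writes PNG files. Passing it does not prove the image shows the game, only that the file is not an empty or text placeholder.

```typescript
// The fixed 8-byte signature that begins every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

interface PngSize {
  width: number;
  height: number;
}

// Returns the image dimensions from the IHDR chunk, or null when the
// bytes do not look like a PNG. Only the first 24 bytes are inspected.
function readPngSize(bytes: Uint8Array): PngSize | null {
  if (bytes.length < 24) return null;
  for (let i = 0; i < PNG_SIGNATURE.length; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) return null;
  }
  // Bytes 12..15 hold the first chunk type, which must be "IHDR".
  const type = String.fromCharCode(bytes[12], bytes[13], bytes[14], bytes[15]);
  if (type !== "IHDR") return null;
  // Width and height are big-endian 32-bit integers at offsets 16 and 20.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const width = view.getUint32(16, false);
  const height = view.getUint32(20, false);
  if (width === 0 || height === 0) return null;
  return { width, height };
}
```

Under Bun, the bytes for a given file could be obtained with something like new Uint8Array(await Bun.file(path).arrayBuffer()) before calling readPngSize; the path itself depends on where a run's screenshot script writes its output.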

Important benchmark rule

The prompts explicitly tell models not to ask clarification questions.

If something is ambiguous, the model should make the best reasonable decision, state the assumption, and keep going.

This is intentional. VoxelBench tests initiative, judgement, and execution rather than the ability to ask follow-up questions.

Legal/assets note

VoxelBench is inspired by voxel sandbox games and open-source projects such as Minetest/Luanti, but implementations should not copy source code or copyrighted assets.

Generated games and sites should not use Minecraft branding, Minecraft assets, or imply affiliation with Mojang or Microsoft.

Repository structure suggestion

A simple repo layout could be:

.
├── prompts
│   ├── 01-game-prototype.md
│   └── 02-marketing-site.md
├── runs
│   └── README.md
├── screenshots
│   └── README.md
└── README.md

The runs/ folder can be used to store notes, scores, generated screenshots, or links to separate repos for each model attempt.

Status

VoxelBench is experimental.

The prompts are expected to evolve as more model runs expose weaknesses, loopholes, or unfair requirements.
