Iterative fine-tuning of LLMs on Text-to-SQL using STaR (Self-Taught Reasoner).
This project implements a self-improvement loop for Text-to-SQL generation. For each question, the model samples multiple SQL candidates; the pipeline executes them against the SQLite databases and fine-tunes the model on the candidates that turn out to be correct.
Spider Dataset (7K train, 1K dev)
↓
Ministral-8B generates k=8 SQL candidates
↓
Execute on SQLite → keep correct ones
↓
Fine-tune LoRA on (question, SQL)
↓
Repeat for 3 iterations
↓
Baseline 60.1% → SFT 68.8% → STaR 78.0%
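The "execute on SQLite, keep correct ones" step can be sketched as follows. This is a minimal illustration, not the code in `sql_utils.py`: the helper names `run_query` and `keep_correct` are hypothetical, the demo table is invented, and it assumes "correct" means the candidate returns the same rows as the reference query, compared without regard to row order.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: row order is ignored when comparing results.
    return sorted(rows, key=repr)


def keep_correct(conn, gold_sql, candidates):
    """Keep unique candidates whose execution result matches the gold query."""
    gold_rows = run_query(conn, gold_sql)
    if gold_rows is None:
        return []
    kept, seen = [], set()
    for sql in candidates:
        if sql in seen:
            continue
        seen.add(sql)
        if run_query(conn, sql) == gold_rows:
            kept.append(sql)
    return kept


# Toy in-memory database standing in for a Spider database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ann", 30), (2, "Bob", 45), (3, "Cy", 45)],
)

gold = "SELECT name FROM singer WHERE age > 40"
candidates = [
    "SELECT name FROM singer WHERE age >= 45",                    # same rows as gold
    "SELECT name FROM singer",                                    # extra row
    "SELECT nme FROM singer",                                     # fails: no such column
    "SELECT name FROM singer WHERE age > 40 ORDER BY name DESC",  # same rows, other order
]
kept = keep_correct(conn, gold, candidates)
```

Ignoring row order keeps the comparison simple, but it would also accept a candidate that gets an `ORDER BY` wrong; a stricter matcher would compare ordered results whenever the reference query sorts its output.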
| Model | Accuracy | Method |
|---|---|---|
| Ministral-8B baseline | 60.1% | greedy |
| + SFT on Spider | 68.8% | greedy |
| + STaR (3 iterations) | 78.0% | self-consistency k=16 |
| Iteration | Train Accuracy | Dev Accuracy |
|---|---|---|
| 1 | 82.2% | 71.0% |
| 2 | 88.0% | 71.7% |
| 3 | 88.0% | 72.1% (greedy) / 78.0% (k=16) |
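The gap between greedy decoding and k=16 comes from self-consistency: several candidates are sampled and the final answer is chosen by agreement. One common way to do this for SQL is to vote on execution results, sketched below. The voting rule used in `eval_dev.py` is not shown in this README, so treat this as an assumed variant; `execute` and `self_consistency` are hypothetical helper names and the demo table is invented.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Return a hashable, order-insensitive view of the result, or None on error."""
    try:
        return tuple(sorted(conn.execute(sql).fetchall(), key=repr))
    except sqlite3.Error:
        return None


def self_consistency(conn, candidates):
    """Pick the first candidate whose execution result is the most common one."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    # Candidates that fail to execute do not get a vote.
    votes = Counter(result for _, result in executed if result is not None)
    if not votes:
        return None
    winning_result, _ = votes.most_common(1)[0]
    return next(sql for sql, result in executed if result == winning_result)


# Toy in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concert (id INTEGER, year INTEGER)")
conn.executemany("INSERT INTO concert VALUES (?, ?)", [(1, 2014), (2, 2014), (3, 2015)])

candidates = [
    "SELECT COUNT(*) FROM concert WHERE year = 2014",   # result: 2
    "SELECT COUNT(*) FROM concert",                     # result: 3
    "SELECT COUNT(id) FROM concert WHERE year = 2014",  # result: 2
    "SELECT COUNT(*) FROM concerts WHERE year = 2014",  # fails: no such table
]
best = self_consistency(conn, candidates)
```

Voting on results rather than on SQL strings lets differently written but equivalent queries support each other, which is why two of the four candidates above form the majority.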
| Component | Technology |
|---|---|
| Base Model | Ministral-8B-Instruct-2410 |
| Fine-tuning | LoRA via LLaMA-Factory |
| Inference | vLLM |
| Dataset | Spider (Yale) - 7K train, 1K dev |
| Method | STaR with k=8 candidates |
| Hardware | NVIDIA H200 |
reasonforge/
├── src/
│   ├── star_train.py            # Main STaR training pipeline
│   ├── star_retrain.py          # Mini-STaR retrain on errors
│   ├── sql_utils.py             # SQL execution and matching
│   ├── prompts.py               # Prompts and generation config
│   └── evaluation/
│       └── eval_dev.py          # Dev set evaluation
├── configs/
│   ├── config.yaml              # STaR configuration
│   └── sft_ministral8b.yaml     # SFT configuration
├── data/
│   ├── spider/                  # Spider dataset (166 DBs)
│   └── errors/                  # Error analysis
├── models/
│   ├── merged_iter_1/           # STaR iteration 1
│   ├── merged_iter_2/           # STaR iteration 2
│   ├── merged_iter_3/           # STaR iteration 3 (best)
│   └── history.json             # Training history
├── reports/                     # Evaluation results
└── scripts/
    └── download_spider.py       # Dataset download
# 1. Install dependencies
pip install -r requirements.txt
# 2. Download Spider dataset
python scripts/download_spider.py
# 3. Evaluate baseline model
python src/evaluation/eval_dev.py \
--model_path mistralai/Ministral-8B-Instruct-2410 \
--k 1 \
--temperature 0.0 \
--no_self_consistency
# 4. Run STaR training
python src/star_train.py
# 5. Evaluate fine-tuned model (k=16, self-consistency)
python src/evaluation/eval_dev.py \
--model_path ./models/merged_iter_3 \
--k 16 \
--self_consistency

Main configuration in configs/config.yaml:
star:
  k_candidates: 8              # Candidates per question
  train_from_base: true        # Always start from base model
  difficulty_resampling: true  # Oversample hard questions
self_improvement:
  max_iterations: 5
  eval_sample_size: 7000

License: MIT