
Commit e440228

Merge pull request #22 from ThomasJButler/v1.5
chore: standardise comments to my own style guide
2 parents 3d198a1 + 0901125 commit e440228

18 files changed

Lines changed: 340 additions & 177 deletions

README.md

Lines changed: 147 additions & 47 deletions
@@ -1,62 +1,162 @@
 # SQL-Ball
 
-SQL-Ball is a football analytics platform that transforms football match data into actionable insights through interactive visualizations and natural language queries. This project was developed as a content project for the **CodeCademy Mastering Generative AI for Developers Bootcamp** held from August to September 2025.
+Football analytics platform that converts natural language questions into SQL queries, processes European league match data, and visualises patterns and statistics.
+
+## What It Does
+
+SQL-Ball lets you:
+- Ask questions about football matches in plain English ("Show me upsets in the Premier League")
+- Get automated SQL query generation and execution
+- View interactive dashboards with historical trends, team performance, and statistical anomalies
+- Explore data across 22 European leagues with 7,600+ matches
+
+The backend uses LangChain RAG (Retrieval-Augmented Generation) to parse queries and retrieve relevant schema context. The frontend provides real-time visualisations via Chart.js.
+
+## Setup
+
+### Prerequisites
+- Node.js 18+
+- Python 3.11+
+- Supabase project (free tier works)
+- OpenAI or Anthropic API key (optional, for AI features)
 
-<img width="1261" height="621" alt="image" src="https://github.com/user-attachments/assets/407966f8-e680-458a-aa7a-a3d66b7d1318" />
+### Installation
 
-## Overview
+**Frontend:**
+```bash
+npm install
+npm run dev
+```
 
-SQL-Ball allows users to:
-- View and analyze historical match data.
-- Filter data by league division (e.g., Premier League, La Liga, Bundesliga).
-- Generate visualizations including trends, distribution charts, and performance radar charts.
-- Execute natural language queries converted to SQL for retrieving football statistics.
+Runs on `http://localhost:5173` by default.
 
-## Architecture & Use Cases
+**Backend:**
+```bash
+cd backend
+python -m venv .venv
+source .venv/bin/activate # Windows: .venv\Scripts\activate
+pip install -r requirements.txt
+python main.py
+```
 
-For detailed architectural decisions and design patterns used in this project, please refer to [ARCHITECTURE.md](ARCHITECTURE.md).
+Runs on `http://localhost:8000`.
 
-**Key Use Cases:**
-- **Data Filtering:** Filter match data based on league divisions using unique codes such as:
-- **Premier League (E1)**
-- **La Liga (SP1)**
-- **Bundesliga (G1)**
-- **Interactive Visualizations:** Generate charts for goals trends, result distributions, and team performance.
-- **Natural Language Querying:** Allow users to enter plain-English queries that are converted into SQL to retrieve relevant insights.
-- **Backend Integration:** Support backend data processing (Python-based API) and a frontend built with Svelte and Vite.
+**Environment Variables**
 
-## Project Details
-
-- **Frontend:** Built using Svelte and Vite, with visualizations rendered via Chart.js components.
-- **Backend:** Python scripts and APIs serve match data and handle data processing.
-- **Deployment:** Designed to be deployed to Vercel with separate configuration for the backend. Sensitive files like `backend/.env` should be added to `.gitignore` to protect credentials.
-- **Content Project:** This project was created as part of an intensive bootcamp to master generative AI applications in software development.
-
-## Acknowledgements
-
-Thank you for the incredible datasets, this would not be possible without - https://www.football-data.co.uk/data.php
-
-A special thanks to the CodeCademy Mastering Generative AI for Developers Bootcamp for inspiring this project and providing the learning environment that led to its creation. The course and this project in particular were a big learning curve for me and really enjoyed the experience.
-
-## Getting Started
-
-1. **Installation:**
-- Clone the repository.
-- Install dependencies using `npm install` for the frontend.
-- Set up the backend by following the instructions in `backend/setup.sh`.
-- Ensure you have proper environment variables set up (see `.env.example`).
-
-2. **Development:**
-- Start the frontend by running `npm run dev`.
-- For backend development, run the provided scripts in the `backend` directory.
-
-3. **Deployment:**
-- The project is ready for deployment on Vercel. Make sure to configure environment variables on Vercel and exclude sensitive files (like `backend/.env`) from version control.
+Create `.env` files in both root and `backend/` directories:
+
+```
+# Frontend .env
+VITE_SUPABASE_URL=your_supabase_url
+VITE_SUPABASE_ANON_KEY=your_supabase_key
+VITE_OPENAI_API_KEY=your_openai_key # Optional
+```
+
+```
+# Backend .env
+VITE_SUPABASE_URL=your_supabase_url
+VITE_SUPABASE_ANON_KEY=your_supabase_key
+VITE_OPENAI_API_KEY=your_openai_key
+```
+
+## Architecture
+
+**Frontend Stack**
+- Svelte + Vite for rapid development
+- Chart.js for visualisations
+- TypeScript for type safety
+- Tailwind CSS for styling
+
+**Backend Stack**
+- FastAPI for HTTP API
+- LangChain for RAG and SQL generation
+- ChromaDB for vector embeddings
+- Supabase PostgreSQL for data
+
+**Data Flow**
+1. User types natural language question
+2. Frontend sends to backend `/query` endpoint
+3. Backend retrieves schema context via embeddings
+4. LLM generates SQL with OpenAI/Anthropic
+5. SQL validation and repair layer handles edge cases
+6. Query executes via Supabase RPC
+7. Results return with explanation
+
+## Common Issues
+
+**"No API key configured"**
+- Add `VITE_OPENAI_API_KEY` to `.env` for AI features
+- Platform still works with template-based query generation
+
+**"Failed to initialise ChromaDB"**
+- Ensure `backend/.venv` is activated
+- Check `chromadb_data/` directory is writable
+- Clear directory and restart if schema changed
+
+**Queries returning empty results**
+- Verify league code is correct (E0 = Premier League, SP1 = La Liga, etc.)
+- Check match_date format in WHERE clause
+- Most data is 2024-2025 season
+
+## Deployment
+
+**Vercel (Frontend)**
+```bash
+npm run build
+# Deploy dist/ folder to Vercel
+```
+
+Set environment variables in Vercel project settings.
+
+**Backend Options**
+- Railway.app (easiest)
+- Heroku (cost-free tier removed)
+- Self-hosted VPS
+
+Backend needs environment variables set on deployment platform.
+
+## Tech Details
+
+**Why This Stack?**
+
+- Svelte: Smaller bundle than React, excellent for data visualisations
+- FastAPI: Modern Python async framework, auto-generates OpenAPI docs
+- LangChain: Handles complex RAG pipelines without reinventing the wheel
+- Supabase: PostgreSQL advantage for complex queries, real-time capabilities
+- ChromaDB: Lightweight embedding store, no external dependency
+
+**Performance Notes**
+
+- Initial schema embedding takes ~2-3 seconds
+- Subsequent queries cache embeddings in memory
+- Vector search dramatically improves context retrieval
+- Falls back to text search if embeddings unavailable
+
+## Data Source
+
+Match data sourced from [football-data.co.uk](https://www.football-data.co.uk/data.php).
+
+Covers:
+- Premier League, Championship, League One, League Two (England)
+- La Liga, Segunda División (Spain)
+- Bundesliga, 2. Bundesliga (Germany)
+- Serie A, Serie B (Italy)
+- Ligue 1, Ligue 2 (France)
+- Eredivisie (Netherlands)
+- Pro League (Belgium)
+- Primeira Liga (Portugal)
+- Süper Lig (Turkey)
+- Super League (Greece)
+- Premiership, Championship, League One, League Two (Scotland)
 
 ## Contributing
 
-Contributions are welcome! Please check the issues and submit pull requests for enhancements or bug fixes. For major changes, please open an issue first to discuss the proposed changes.
+Issues and pull requests welcome. For major changes, open an issue first to discuss.
 
 ## License
 
-This project is licensed under the terms described in the [LICENSE](LICENSE) file.
+See [LICENSE](LICENSE) file for details.
+
+---
+
+Built as a CodeCademy Mastering Generative AI for Developers bootcamp project (August-September 2025).
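
The Data Flow list in the new README begins with the frontend posting the question to the backend. Below is a minimal sketch of that first hop from Python. It assumes the route resolves to `/api/query` (the README names `/query`, and `main.py` mounts the query router under the `/api` prefix) and that the JSON body carries a single `question` field, as suggested by `request.question` in `backend/api/query.py`. Neither the full path nor the body shape is confirmed by this diff, and `build_query_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: local backend address from the README plus the "/api" router prefix.
BACKEND_URL = "http://localhost:8000/api/query"


def build_query_request(question: str, url: str = BACKEND_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying a natural language question."""
    # Assumption: the request model exposes one field named "question".
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it, which requires the backend to be running locally.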

backend/api/execute.py

Lines changed: 12 additions & 9 deletions
@@ -1,5 +1,8 @@
 """
-SQL Execution API endpoints for SQL-Ball
+@author Tom Butler
+@date 2025-10-21
+@description SQL execution endpoint. Receives SQL queries, applies PostgreSQL compatibility fixes,
+validates them, and executes via Supabase RPC. Handles error cases and season validation.
 """
 
 from fastapi import APIRouter, HTTPException
@@ -38,7 +41,7 @@ async def execute_sql_query(request: ExecuteRequest):
     """
     try:
         # Debug logging
-        print(f"🔍 DEBUG: Received SQL query: {request.sql}")
+        print(f"DEBUG: Received SQL query: {request.sql}")
 
         # Clean the SQL - remove trailing semicolon for Supabase RPC
         clean_sql = request.sql.strip().rstrip(';')
@@ -56,14 +59,14 @@ async def execute_sql_query(request: ExecuteRequest):
         # AGGRESSIVE FIX: Detect and fix conflicting season conditions
         season_matches = re.findall(r"season\s*=\s*['\"]([^'\"]+)['\"]", clean_sql, re.IGNORECASE)
         if len(season_matches) > 1 and len(set(season_matches)) > 1:
-            print(f"🚨 EXECUTE: Found conflicting seasons: {season_matches}")
+            print(f"EXECUTE: Found conflicting seasons: {season_matches}")
             # Keep only the first season condition
             first_season = season_matches[0]
             clean_sql = re.sub(r"season\s*=\s*['\"][^'\"]+['\"]", f"season = '{first_season}'", clean_sql, flags=re.IGNORECASE)
             # Clean up multiple ANDs
             clean_sql = re.sub(r"\bAND\s+season\s*=\s*['\"][^'\"]+['\"]", "", clean_sql, flags=re.IGNORECASE)
             clean_sql = re.sub(r"\bAND\s+AND\b", "AND", clean_sql, flags=re.IGNORECASE)
-            print(f"🔧 EXECUTE: Fixed to use only '{first_season}': {clean_sql}")
+            print(f"EXECUTE: Fixed to use only '{first_season}': {clean_sql}")
 
         # CRITICAL POSTGRESQL FIXES:
         # Fix boolean comparisons: finished = 1 -> finished = true
@@ -73,9 +76,9 @@ async def execute_sql_query(request: ExecuteRequest):
         # Fix column names that don't exist
         clean_sql = re.sub(r'\bdate\b', 'kickoff_time', clean_sql, flags=re.IGNORECASE)
 
-        print(f"🔧 EXECUTE: Applied PostgreSQL fixes: {clean_sql}")
-
-        print(f"🔍 DEBUG: Fixed SQL query: {clean_sql}")
+        print(f"EXECUTE: Applied PostgreSQL fixes: {clean_sql}")
+
+        print(f"DEBUG: Fixed SQL query: {clean_sql}")
 
         # Validate it's a SELECT query for security
         if not clean_sql.upper().strip().startswith('SELECT'):
@@ -111,8 +114,8 @@ async def execute_sql_query(request: ExecuteRequest):
         raise
     except Exception as e:
         # Log the full error for debugging
-        print(f"🚨 ERROR: SQL execution failed: {str(e)}")
-        print(f"🚨 SQL that failed: {clean_sql}")
+        print(f"ERROR: SQL execution failed: {str(e)}")
+        print(f"SQL that failed: {clean_sql}")
 
         # Return more specific error messages
         if "does not exist" in str(e).lower():
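
The season-conflict handling in the context lines of this file first rewrites every `season = '...'` condition to the first season, then strips every `AND season = '...'`. As written, that second substitution appears to also remove the first condition when it follows an `AND` (for example after a league filter), which would leave the query with no season filter at all. Below is a standalone sketch of one alternative, assuming the intent is to keep exactly one season condition. `keep_first_season` is a hypothetical helper, not code from this commit.

```python
import re

# A single season equality condition, e.g. season = '2024-2025'
SEASON_COND = re.compile(r"\bseason\s*=\s*['\"]([^'\"]+)['\"]", re.IGNORECASE)

# The same condition together with the AND (and surrounding whitespace)
# that joins it to whatever precedes it, when there is one.
JOINED_SEASON_COND = re.compile(
    r"(?:\s+AND\s+)?\bseason\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
)


def keep_first_season(sql: str) -> str:
    """Keep the first season condition in place and drop any later ones."""
    seasons = SEASON_COND.findall(sql)
    if len(set(seasons)) <= 1:
        return sql  # no conflict: leave the query untouched

    seen = 0

    def drop_later(match: re.Match) -> str:
        nonlocal seen
        seen += 1
        # First occurrence is returned unchanged; later ones are removed
        # together with their leading AND.
        return match.group(0) if seen == 1 else ""

    return JOINED_SEASON_COND.sub(drop_later, sql)
```

This sketch only handles conditions chained with `AND`. A later condition joined with `OR`, or one inside a subquery, would need a real SQL parser rather than regexes.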

backend/api/optimize.py

Lines changed: 11 additions & 8 deletions
@@ -1,5 +1,8 @@
 """
-Query Optimization API endpoints
+@author Tom Butler
+@date 2025-10-21
+@description Query optimisation API endpoints. Analyses SQL for performance improvements,
+discovers patterns in queries, generates suggestions, and explains query execution plans.
 """
 
 from fastapi import APIRouter, HTTPException
@@ -194,25 +197,25 @@ async def explain_query_plan(sql: str):
     # Analyze query components
     if "SELECT" in sql_upper:
         if "*" in sql:
-            explanation_parts.append("📊 Selecting all columns (consider specifying only needed columns)")
+            explanation_parts.append("Selecting all columns (consider specifying only needed columns)")
         else:
-            explanation_parts.append("Selecting specific columns (good practice)")
+            explanation_parts.append("Selecting specific columns (good practice)")
 
     if "JOIN" in sql_upper:
         join_count = sql_upper.count("JOIN")
-        explanation_parts.append(f"🔗 Joining {join_count} table(s)")
+        explanation_parts.append(f"Joining {join_count} table(s)")
 
     if "WHERE" in sql_upper:
-        explanation_parts.append("🔍 Filtering results with WHERE clause")
+        explanation_parts.append("Filtering results with WHERE clause")
 
     if "GROUP BY" in sql_upper:
-        explanation_parts.append("📈 Grouping results for aggregation")
+        explanation_parts.append("Grouping results for aggregation")
 
     if "ORDER BY" in sql_upper:
-        explanation_parts.append("🔤 Sorting results")
+        explanation_parts.append("Sorting results")
 
     if "LIMIT" in sql_upper:
-        explanation_parts.append("✂️ Limiting result set size")
+        explanation_parts.append("Limiting result set size")
 
     # Performance estimate
     estimated_speed = "Fast"
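
The keyword checks in this hunk are plain substring tests on the upper-cased SQL, so an identifier such as `rate_limit` would also trigger the `LIMIT` message. Below is a hedged sketch of the same analysis as a standalone function that matches whole keywords with word-boundary regexes. `describe_query` is a hypothetical name, and the message strings are copied from the added lines above.

```python
import re


def describe_query(sql: str) -> list[str]:
    """Return plain-text notes about the clauses present in a SQL string."""
    sql_upper = sql.upper()

    def has(pattern: str) -> bool:
        # \b keeps LIMIT from matching inside identifiers like RATE_LIMIT.
        return re.search(rf"\b{pattern}\b", sql_upper) is not None

    parts: list[str] = []
    if has("SELECT"):
        if "*" in sql:
            parts.append("Selecting all columns (consider specifying only needed columns)")
        else:
            parts.append("Selecting specific columns (good practice)")
    join_count = len(re.findall(r"\bJOIN\b", sql_upper))
    if join_count:
        parts.append(f"Joining {join_count} table(s)")
    if has("WHERE"):
        parts.append("Filtering results with WHERE clause")
    if has(r"GROUP\s+BY"):
        parts.append("Grouping results for aggregation")
    if has(r"ORDER\s+BY"):
        parts.append("Sorting results")
    if has("LIMIT"):
        parts.append("Limiting result set size")
    return parts
```

Keywords that appear inside string literals are still counted. Avoiding that would require tokenising the SQL first.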

backend/api/query.py

Lines changed: 6 additions & 3 deletions
@@ -1,5 +1,8 @@
 """
-Query API endpoints for SQL-Ball
+@author Tom Butler
+@date 2025-10-21
+@description Natural language query API endpoint. Converts plain English questions into SQL queries,
+retrieves schema context, validates input, and returns results with explanations.
 """
 
 from fastapi import APIRouter, HTTPException, Depends
@@ -75,8 +78,8 @@ async def process_natural_language_query(request: QueryRequest):
 
     except Exception as e:
         # Log the full error for debugging
-        print(f"🚨 ERROR: Query processing failed: {str(e)}")
-        print(f"🚨 Question: {request.question}")
+        print(f"ERROR: Query processing failed: {str(e)}")
+        print(f"Question: {request.question}")
 
         # Return more specific error messages
         error_msg = str(e).lower()
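
The handler above reports failures with `print` and a hand-written `ERROR:` prefix. Below is a minimal sketch of the same two messages routed through the standard `logging` module, which supplies the severity level itself and lets the deployment decide where output goes. The logger name `sqlball.query` and the helper `log_query_failure` are assumptions for illustration, not names from the repository.

```python
import logging

# Hypothetical logger name; any dotted name works.
logger = logging.getLogger("sqlball.query")


def log_query_failure(question: str, exc: Exception) -> None:
    """Log a failed natural language query at ERROR level."""
    # Lazy %-style arguments: the message is only formatted if it is emitted.
    logger.error("Query processing failed: %s", exc)
    logger.error("Question: %s", question)
```

Inside the `except Exception as e:` block this would be called as `log_query_failure(request.question, e)`.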

backend/main.py

Lines changed: 7 additions & 4 deletions
@@ -1,6 +1,9 @@
 """
-SQL-Ball FastAPI Backend
-RAG-powered football analytics with natural language to SQL conversion
+@author Tom Butler
+@date 2025-10-21
+@description FastAPI backend server for SQL-Ball. Initialises RAG components (schema embeddings,
+SQL chain, football terminology mapper) on startup. Configures CORS and exposes
+API routers for query processing, optimisation, and execution.
 """
 
 from fastapi import FastAPI, HTTPException
@@ -53,7 +56,7 @@ async def startup_event():
     """Initialize RAG components on startup"""
     global schema_embedder, sql_chain, football_mapper
 
-    print("🚀 Initializing SQL-Ball RAG system...")
+    print("Initialising SQL-Ball RAG system...")
 
     # Initialize schema embedder
     schema_embedder = SchemaEmbedder()
@@ -69,7 +72,7 @@ async def startup_event():
     set_query_deps(sql_chain, schema_embedder)
     set_sql_chain(sql_chain)
 
-    print("RAG system initialized successfully!")
+    print("RAG system initialised successfully!")
 
 # Include routers
 app.include_router(query_router, prefix="/api")
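
`startup_event` hands the initialised components to the routers through setter functions such as `set_query_deps(sql_chain, schema_embedder)`. Below is a minimal sketch of how a router module might hold those dependencies, assuming module-level slots that are filled once at startup. Only the name `set_query_deps` and its argument order come from the diff; the private variables and `get_sql_chain` are hypothetical, and the real module may be organised differently.

```python
from typing import Any, Optional

# Module-level slots, empty until the application's startup hook runs.
_sql_chain: Optional[Any] = None
_schema_embedder: Optional[Any] = None


def set_query_deps(sql_chain: Any, schema_embedder: Any) -> None:
    """Store the shared RAG components for later use by request handlers."""
    global _sql_chain, _schema_embedder
    _sql_chain = sql_chain
    _schema_embedder = schema_embedder


def get_sql_chain() -> Any:
    """Return the SQL chain, failing loudly if startup has not run yet."""
    if _sql_chain is None:
        raise RuntimeError("SQL chain not initialised; has the startup hook run?")
    return _sql_chain
```

Raising from the getter turns a missed initialisation into an explicit error rather than an `AttributeError` on `None` deep inside a request handler.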
