161 changes: 99 additions & 62 deletions README.md
@@ -1,97 +1,134 @@
# Evaluation Function Template Repository
# evaluateProof

This template repository contains the boilerplate code needed in order to create an AWS Lambda function that can be written by any tutor to grade a response area in any way they like.
An AWS Lambda evaluation function for the [LambdaFeedback](https://lambdafeedback.com) platform that uses OpenAI's GPT models to provide automated feedback on undergraduate mathematics submissions.

This version is specifically for Python; however, the ultimate goal is to make similar boilerplate repositories in any language, allowing tutors the freedom to code in what they feel most comfortable with.
The function implements a two-stage LLM pipeline:
1. **Mark scheme generation** — produces a 0–5 rubric specific to the question.
2. **Feedback generation** — uses that rubric to give the student concise, actionable feedback.

## Table of Contents
- [Evaluation Function Template Repository](#evaluation-function-template-repository)
- [Table of Contents](#table-of-contents)
- [Repository Structure](#repository-structure)
- [Usage](#usage)
- [Getting Started](#getting-started)
- [How it works](#how-it-works)
- [Docker & Amazon Web Services (AWS)](#docker--amazon-web-services-aws)
- [Middleware Functions](#middleware-functions)
- [GitHub Actions](#github-actions)
- [Pre-requisites](#pre-requisites)
- [Contact](#contact)
- [Repository Structure](#repository-structure)
- [How It Works](#how-it-works)
- [Two-Stage Evaluation Pipeline](#two-stage-evaluation-pipeline)
- [Custom Workflows](#custom-workflows)
- [Submission Limits](#submission-limits)
- [Docker & AWS](#docker--aws)
- [GitHub Actions](#github-actions)
- [Getting Started](#getting-started)
- [Configuration](#configuration)
- [Test / Debug Mode](#test--debug-mode)
- [Pre-requisites](#pre-requisites)

## Repository Structure

```
app/
__init__.py
evaluation.py # Script containing the main evaluation_function
preview.py # Script containing the preview_function
docs.md # Documentation page for this function (required)
evaluation_tests.py # Unittests for the main evaluation_function
preview_tests.py # Unittests for the preview_function
requirements.txt # list of packages needed for algorithm.py
Dockerfile # for building whole image to deploy to AWS
evaluation.py # Entry point — delegates to MathTutor
math_tutor.py # Two-stage LLM evaluation logic (OpenAI)
preview.py # Preview function (returns response as-is)
config_tutor.json # Production LLM directives
config_tutor_test.json # Test LLM directives (loaded first on startup)
evaluation_tests.py # Unit tests for evaluation_function
preview_tests.py # Unit tests for preview_function
requirements.txt # Python dependencies
Dockerfile # Builds the deployable image
docs/
dev.md # Developer-facing documentation
user.md # Student/teacher-facing documentation
test_configs/ # Per-question workflow configs used in tests

.github/
workflows/
test-and-deploy.yml # Testing and deployment pipeline
test-and-deploy.yml # CI: lint → test → deploy staging → deploy production

config.json # Specify the name of the evaluation function in this file
.gitignore
config.json # Declares the function name: "evaluateProof"
```

## Usage
## How It Works

### Getting Started
### Two-Stage Evaluation Pipeline

1. Clone this repository
2. Change the name of the evaluation function in `config.json`
3. The name must be unique. To view existing grading functions, go to:
When a student submits an answer, `evaluation_function` in `evaluation.py` calls `MathTutor.process_input`, which runs two sequential LLM calls:

- [Staging API Gateway Integrations](https://eu-west-2.console.aws.amazon.com/apigateway/main/develop/integrations/attach?api=c1o0u8se7b&region=eu-west-2&routes=0xsoy4q)
- [Production API Gateway Integrations](https://eu-west-2.console.aws.amazon.com/apigateway/main/develop/integrations/attach?api=cttolq2oph&integration=qpbgva8&region=eu-west-2&routes=0xsoy4q)
1. **Markscheme directive** (`config_tutor.json → directives.markscheme`): the model receives the question and generates a 0–5 marking rubric.
2. **Feedback directive** (`config_tutor.json → directives.feedback`): the model receives the question, the student's solution, and the rubric, then returns feedback. The numeric grade is intentionally withheld to avoid misinterpretation.
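
The two calls above can be sketched as follows. This is a minimal illustration, not the code in `math_tutor.py`: the function name `two_stage_feedback`, the prompt layout, and the assumption that each directive is a plain string are all hypothetical, and the LLM call is passed in as a callable so the sketch runs without an API key.

```python
from typing import Callable, Dict

# Hypothetical type: any function that sends a prompt to an LLM and returns its reply.
Complete = Callable[[str], str]


def two_stage_feedback(question: str, solution: str,
                       directives: Dict[str, str], complete: Complete) -> str:
    """Run the markscheme call, then feed its rubric into the feedback call."""
    # Stage 1: the model sees only the markscheme directive and the question.
    rubric = complete(f"{directives['markscheme']}\n\nQuestion:\n{question}")
    # Stage 2: the model sees the question, the student's solution, and the rubric.
    feedback_prompt = (
        f"{directives['feedback']}\n\nQuestion:\n{question}"
        f"\n\nStudent solution:\n{solution}\n\nMark scheme:\n{rubric}"
    )
    return complete(feedback_prompt)


# Stub standing in for a real OpenAI call, so the sketch runs offline.
calls = []


def fake_complete(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


result = two_stage_feedback(
    "Prove that sqrt(2) is irrational.",
    "Assume sqrt(2) = p/q ...",
    {"markscheme": "Write a 0-5 rubric.", "feedback": "Give concise feedback."},
    fake_complete,
)
```

Passing the completion function in as a parameter keeps the sequencing logic testable without network access; a real deployment would supply a wrapper around the OpenAI client instead of the stub.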

4. Merge commits into the default branch
- This will trigger the `test-and-deploy.yml` workflow, which will build the docker image, push it to a shared ECR repository, then call the backend `grading-function/ensure` route to build the necessary infrastructure to make the function available from the client app.
The `answer` field passed from the platform is expected to be a JSON string:

5. You are now ready to start developing your function:

- Edit the `app/evaluation.py` file, which ultimately gets called when the function is given the `eval` command
- Edit the `app/preview.py` file, which is called when the function is passed the `preview` command.
- Edit the `app/evaluation_tests.py` and `app/preview_tests.py` files to add tests which get run:
- Every time you commit to this repo, before the image is built and deployed
- Whenever the `healthcheck` command is supplied to the deployed function
- Edit the `app/docs.md` file to reflect your changes. This file is baked into the function's image, and is made available using the `docs` command. This feature is used to display this function's documentation on our [Documentation](https://lambda-feedback.github.io/Documentation/) website once it's been hooked up!

---
```json
{
"question": "Prove that ...",
"answer": "An exemplary solution ...",
"workflow": "/app/app/test_configs/config0.json"
}
```

## How it works
If no exemplary solution is available, the function falls back gracefully.
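
As a rough sketch of how such a payload might be unpacked, the helper below parses the JSON string and treats the exemplary solution and the workflow path as optional. The name `parse_answer` and the choice to treat `question` as required are assumptions made for illustration; the actual fallback behaviour lives in the function's own code.

```python
import json
from typing import Optional, Tuple


def parse_answer(raw: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (question, exemplary_solution, workflow_path) from the answer JSON string."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("answer must be a JSON object")
    question = data["question"]  # assumed required here; raises KeyError if absent
    # Missing or empty optional fields are normalised to None.
    return question, data.get("answer") or None, data.get("workflow") or None


# Payload without a "workflow" key.
example = '{"question": "Prove that ...", "answer": "An exemplary solution ..."}'
question, exemplar, workflow = parse_answer(example)
```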

The function is built on top of a custom base layer, [BaseEvaluationFunctionLayer](https://github.com/lambda-feedback/BaseEvalutionFunctionLayer), which provides tools, tests and schema checking relevant to all evaluation functions.
### Custom Workflows

### Docker & Amazon Web Services (AWS)
The `answer` field may include a `"workflow"` key pointing to a JSON config file. This lets different questions use different LLM directive chains without redeploying the function. Workflow configs follow the same `{ "directives": { ... } }` schema as `config_tutor.json`.
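
A workflow file can be checked against that schema before it is used. The sketch below is one possible validation step, assuming only what is stated above (a top-level `directives` object); the helper name `load_directives` and the placeholder directive contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_directives(path: str) -> dict:
    """Load a workflow config and return its "directives" mapping."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    directives = config.get("directives") if isinstance(config, dict) else None
    if not isinstance(directives, dict) or not directives:
        raise ValueError(f"{path}: expected a non-empty top-level 'directives' object")
    return directives


# Demonstration with a throwaway file; the directive contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    workflow_path = Path(tmp) / "config0.json"
    workflow_path.write_text(
        json.dumps({"directives": {"markscheme": "...", "feedback": "..."}}),
        encoding="utf-8",
    )
    loaded = load_directives(str(workflow_path))
```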

The grading scripts are hosted on AWS Lambda, using containers to run a Docker image of the app. Docker is a popular tool in software development that allows programs to be hosted on any machine by bundling all their requirements and dependencies into a single file called an **image**.
### Submission Limits

Images are run within **containers** on AWS, which give us a lot of flexibility over what programming language and packages/libraries can be used. For more information on Docker, read this [introduction to containerisation](https://www.freecodecamp.org/news/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b/). To learn more about AWS Lambda, click [here](https://geekflare.com/aws-lambda-for-beginners/).
The function enforces a cap of **6 submissions per student per response area**. Once reached, further submissions return a message asking the student to contact their instructor. The current count is provided via `params.submission_context.submissions_per_student_per_response_area` and is displayed in the feedback prefix on each submission.
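
A minimal sketch of such a check is shown below. It assumes the count reflects submissions already made (so a count of 6 blocks the next attempt) and that a missing context means zero; both are assumptions, as is the helper name `submission_gate`.

```python
MAX_SUBMISSIONS = 6  # cap stated above


def submission_gate(params: dict):
    """Return (allowed, count) for the current submission."""
    context = params.get("submission_context") or {}
    # Assumption: a missing count is treated as zero previous submissions.
    count = int(context.get("submissions_per_student_per_response_area", 0))
    return count < MAX_SUBMISSIONS, count


under_cap = submission_gate(
    {"submission_context": {"submissions_per_student_per_response_area": 5}}
)
at_cap = submission_gate(
    {"submission_context": {"submissions_per_student_per_response_area": 6}}
)
no_context = submission_gate({})
```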

### Middleware Functions
In order to run the algorithm and schema on AWS Lambda, some middleware functions have been provided to handle, validate and return the data so all you need to worry about is the evaluation script and testing.
### Docker & AWS

The code needed to build the image using all the middleware functions is available in the [BaseEvaluationFunctionLayer](https://github.com/lambda-feedback/BaseEvalutionFunctionLayer) repository.
The function is packaged as a Docker image and deployed to AWS Lambda via Amazon ECR. It extends the shared [BaseEvaluationFunctionLayer](https://github.com/lambda-feedback/BaseEvalutionFunctionLayer) image, which provides the request/response middleware and handler wiring.

### GitHub Actions
Whenever a commit is made to the GitHub repository, the new code will go through a pipeline, where it will be tested for syntax errors and code coverage. The pipeline used is called **GitHub Actions** and the scripts for these can be found in `.github/workflows/`.

On top of that, when starting a new evaluation function, you will have to complete a set of unit test scripts, which not only make sure your code is reliable, but also help you to build a _specification_ for how the code should function before you start programming.
Pushes to `main` trigger `.github/workflows/test-and-deploy.yml`:

Once the code passes all these tests, it will then be uploaded to AWS and will be deployed and ready to go in only a few minutes.
1. **Test** — lint with `flake8`, run `evaluation_tests.py` and `preview_tests.py` (requires `OPENAI_API_KEY` secret).
2. **Deploy Staging** — build and push image to the staging ECR repository, then call the backend `grading-function/ensure` route.
3. **Deploy Production** — same as staging but against the production ECR repository and API.

## Pre-requisites
Although all programming can be done through the GitHub interface, it is recommended you do this locally on your machine. To do this, you must have installed:
## Getting Started

1. Clone this repository.
2. Ensure the following secrets are set in your GitHub repository settings:
- `OPENAI_API_KEY` — used by the evaluation pipeline and tests.
- `LAMBDA_CONTAINER_PIPELINE_AWS_ID` / `LAMBDA_CONTAINER_PIPELINE_AWS_SECRET` — AWS credentials for ECR.
- `FUNCTION_ADMIN_API_KEY` — LambdaFeedback backend API key.
3. Edit `app/evaluation.py` and `app/math_tutor.py` to change evaluation behaviour.
4. Update the LLM prompts in `app/config_tutor.json` to tune the mark scheme and feedback directives.
5. Add or update tests in `app/evaluation_tests.py`.
6. Push to `main` — the CI/CD pipeline handles the rest.

To run tests locally:

- Python 3.8 or higher.
```bash
cd app
pip install -r requirements.txt pytest
OPENAI_API_KEY=sk-... pytest -v evaluation_tests.py preview_tests.py
```

## Configuration

| File | Purpose |
| --- | --- |
| `config.json` | Sets `EvaluationFunctionName` to `"evaluateProof"` — used by CI to tag the Docker image. |
| `app/config_tutor.json` | Production LLM directive chain (markscheme → feedback). |
| `app/config_tutor_test.json` | Loaded first on Lambda cold-start; falls back to `config_tutor.json` if missing. |
| `app/test_configs/` | Per-question workflow overrides referenced from the `answer` JSON. |
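
The test-first loading order in the table could be implemented along these lines. This is a sketch under the assumption that "missing" simply means the file does not exist; `load_first_config` is a hypothetical helper, and the file contents are invented for the demonstration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def load_first_config(candidates: Iterable[Path]) -> dict:
    """Return the parsed JSON of the first candidate path that exists."""
    for path in candidates:
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    raise FileNotFoundError("no config file found")


with tempfile.TemporaryDirectory() as tmp:
    test_cfg = Path(tmp) / "config_tutor_test.json"
    prod_cfg = Path(tmp) / "config_tutor.json"
    prod_cfg.write_text('{"directives": {"source": "production"}}', encoding="utf-8")
    # Only the production file exists, so the loader falls back to it.
    fallback = load_first_config([test_cfg, prod_cfg])
    test_cfg.write_text('{"directives": {"source": "test"}}', encoding="utf-8")
    # Now the test file exists and takes priority.
    preferred = load_first_config([test_cfg, prod_cfg])
```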

- GitHub Desktop or the `git` CLI.
## Test / Debug Mode

- A code editor such as Atom, VS Code, or Sublime.
Prefixing a response with `[[test_mode_temporary]]` activates debug mode, bypassing the normal evaluation flow. Supported commands:

| Command | Behaviour |
| --- | --- |
| `[[test_mode_temporary]] [feedback] <hex>` | Decodes a hex-encoded string and returns it as feedback. |
| `[[test_mode_temporary]] [sleep <n>]` | Sleeps for `n` seconds, then returns a confirmation message. |
| `[[test_mode_temporary]] [tree] [<depth>]` | Returns a directory tree of the app, up to `depth` levels (default 3). |
| `[[test_mode_temporary]] [full trace] <response>` | Runs the normal pipeline and returns the full internal state as JSON. |
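
To illustrate the `[feedback]` command, the sketch below recognises the prefix and decodes the hex payload. It assumes the payload is UTF-8 text encoded as hexadecimal; the regular expression and the helper name `debug_feedback` are illustrative and may not match the function's real parser.

```python
import re

PREFIX = "[[test_mode_temporary]]"
_FEEDBACK = re.compile(r"^\[feedback\]\s*([0-9a-fA-F]+)$")


def debug_feedback(response: str):
    """Return the decoded text for a `[feedback] <hex>` command, else None."""
    text = response.strip()
    if not text.startswith(PREFIX):
        return None  # not a debug command
    match = _FEEDBACK.match(text[len(PREFIX):].strip())
    if match is None:
        return None  # some other debug command
    # Assumption: the hex string encodes UTF-8 bytes.
    return bytes.fromhex(match.group(1)).decode("utf-8")


encoded = "Well done".encode("utf-8").hex()
decoded = debug_feedback(f"{PREFIX} [feedback] {encoded}")
ignored = debug_feedback("An ordinary student response")
```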

## Pre-requisites

Copy this template over by clicking the **Use this template** button found in the repository on GitHub. Save it to the `lambda-feedback` Organisation.
- Python 3.8+
- An OpenAI API key with access to GPT-4 (or the model specified in your config).
- Docker (for local image builds).
- `git` or GitHub Desktop.
- A code editor such as VS Code.