Transformers

Large Language Model Evaluation

Project Description

This project aims to evaluate the performance of different large language models through a series of standardized tests. Using a Python program executed in IPython, we assess the models' ability to solve problems from "Problem Set 4," experimenting with different temperature settings to observe their impact on the generated results.

Technologies Used

  • Language: Python
  • Environment: IPython
  • Libraries:
    • csv: For handling CSV file operations.
    • os: For interacting with the operating system.
    • time: For time-related operations.
    • subprocess: To execute the generated Python files under controlled conditions, with a time limit.
    • openai: For integration with OpenAI models.
    • pathlib: For working with file paths.
    • langchain_google_genai: For integration with Google Generative AI models.
    • langchain.prompts.ChatPromptTemplate: For creating prompts for the models.
    • langchain.output_parsers: For parsing model responses.
    • langchain.chat_models.ChatOpenAI: For using OpenAI's chat models (a short usage sketch follows this list).
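
A minimal sketch of how these libraries can be combined to query a model is shown below. The model identifier, prompt text, and API-key handling are illustrative assumptions; the actual setup lives in the notebook.

```python
# Illustrative sketch only: model name and prompt text are assumptions,
# not the exact values used in the notebook.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Requires OPENAI_API_KEY to be set in the environment.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Prompt template with a placeholder for the problem statement.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a Python programmer. Return only runnable code."),
    ("human", "{problem}"),
])

# Format the prompt and query the model; the reply text is in .content.
messages = prompt.format_messages(problem="Write a function that reverses a list.")
response = llm.invoke(messages)
print(response.content)
```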

Repository

The code is available in this GitHub repository. All necessary documentation is included for easy review.

Models Used

  • Models evaluated (an assumed name-to-identifier mapping follows this list):
    • Gemini 1.0 Pro
    • Gemini 1.5 Pro
    • Gemini 1.5 Flash
    • GPT-3.5 Turbo
    • GPT-4 Turbo
    • GPT-4o
    • GPT-4o mini
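
Below is an assumed mapping from the model names above to API identifiers, together with a small helper that picks the matching client class. The exact identifier strings used in the notebook may differ (for example, version-pinned variants).

```python
# Assumed mapping from display names to API identifiers; the notebook may
# use different (e.g. version-pinned) strings.
from langchain.chat_models import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

MODEL_IDS = {
    "Gemini 1.0 Pro": "gemini-1.0-pro",
    "Gemini 1.5 Pro": "gemini-1.5-pro",
    "Gemini 1.5 Flash": "gemini-1.5-flash",
    "GPT-3.5 Turbo": "gpt-3.5-turbo",
    "GPT-4 Turbo": "gpt-4-turbo",
    "GPT-4o": "gpt-4o",
    "GPT-4o mini": "gpt-4o-mini",
}

def make_model(name: str, temperature: float):
    """Instantiate the matching chat client for a given display name."""
    model_id = MODEL_IDS[name]
    if model_id.startswith("gemini"):
        return ChatGoogleGenerativeAI(model=model_id, temperature=temperature)
    return ChatOpenAI(model_name=model_id, temperature=temperature)
```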

Execution Configuration

  • Temperatures tested:
    • Temperature = 0
    • Temperature = 1
  • Additional parameters:
    • top_p = 1
  • Prompt: Solve problems from "Problem Set 4", with the model instructed to return "True" for successful tests and "False" otherwise (a configuration sketch follows this list).
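
A sketch of this run configuration is given below, under the assumption that top_p is forwarded through model_kwargs for the OpenAI chat models; the prompt wording and helper names are illustrative only.

```python
# Illustrative run configuration: both temperatures, top_p fixed at 1.
from langchain.chat_models import ChatOpenAI

TEMPERATURES = [0, 1]

# Assumed wording of the instruction added to every problem prompt.
PROMPT_INSTRUCTION = (
    "Solve the following problem from Problem Set 4. "
    "Include tests that print 'True' when they pass and 'False' otherwise."
)

def configured_models(model_id: str = "gpt-4o-mini"):
    """Build one client per tested temperature, with top_p = 1."""
    return {
        temp: ChatOpenAI(
            model_name=model_id,
            temperature=temp,
            model_kwargs={"top_p": 1},  # forwarded to the OpenAI API
        )
        for temp in TEMPERATURES
    }
```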

Execution Process

  1. Problem Selection: The first 25 problems from "Problem Set 4".
  2. Execution:
    • The prompt was sent to the models with the indicated configuration.
    • The model's response was extracted from the resulting JSON and stored in a Python file.
    • The Python file was executed using subprocess with a 60-second time limit.
  3. Result Analysis:
    • Counting "True" and "False" responses.
    • Results were stored in a CSV file along with the execution code (see the sketch after this list).
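
The sketch below illustrates steps 2 and 3, assuming each generated solution is saved as its own .py file and prints "True"/"False" lines to stdout; the file layout and CSV columns are assumptions, not the notebook's exact format.

```python
# Illustrative only: paths, file layout, and CSV columns are assumptions.
import csv
import subprocess
import sys
from pathlib import Path

def run_solution(py_file: Path, timeout: int = 60) -> str:
    """Run one generated Python file and capture stdout, enforcing a 60 s limit."""
    try:
        result = subprocess.run(
            [sys.executable, str(py_file)],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return ""  # a timeout counts as no successful output

def tally(output: str) -> tuple[int, int]:
    """Count 'True' and 'False' tokens in the captured output."""
    tokens = output.split()
    return tokens.count("True"), tokens.count("False")

rows = []
for py_file in sorted(Path("solutions").glob("*.py")):
    true_count, false_count = tally(run_solution(py_file))
    rows.append([py_file.name, true_count, false_count])

with open("results.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["file", "true_count", "false_count"])
    writer.writerows(rows)
```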

Main Executable Notebook

The primary notebook responsible for executing the program is located at /Transformers/test_dataset/process_data/Solve_Extract_V2.ipynb. To run the program, navigate to that directory and execute the notebook.

Generated Files

  • Results CSV: Contains the count of "True" and "False" responses along with execution details.
  • Python Code: Contains the code used for the evaluation.