PDF Document Text Extraction using Amazon Bedrock

This project demonstrates how to extract and compare text from PDF documents using two powerful image-capable Large Language Models (LLMs) available through Amazon Bedrock:

Claude 3.7 Sonnet by Anthropic
Amazon Nova Pro by Amazon

Each page of a PDF is converted into an image and sent to both models in parallel to extract textual information. The results are saved as .txt files per page for further analysis or use.

📁 Project Structure

├── documents.ipynb   # Original notebook
├── documents/        # Folder where input PDF files are stored
├── images/           # Folder where images of PDF are stored
├── texts/            # Folder where extracted text results are saved
└── README.md         # This file

🔧 How It Works

Step 1: Convert PDF Pages to Images

Using the fitz library, the notebook converts each page of a PDF into high-quality PNG images.

Step 2: Define LLM Extractor Functions

Two asynchronous Python functions are defined to:

Encode the image in base64
Send it to each model via boto3 and the Amazon Bedrock API
Parse and return the extracted text response

Step 3: Parallel Processing

Using asyncio.gather, the notebook sends the same image to both models concurrently for faster processing.

Step 4: Save Results

The extracted text is saved as .txt files per page with side-by-side comparisons of both models.

📊 LLM Metrics Tracked

For each model, the following metrics are logged and printed:

Claude 3.7 Sonnet:

Input Tokens: 1666
Output Tokens: 1036
Start Time: 1746124263.978323
End Time: 1746124292.999416

Amazon Nova Pro:

Input Tokens: 2223
Output Tokens: 971
Start Time: 1746124263.98382
End Time: 1746124279.478841

📦 Requirements

Python 3.11+
AWS credentials with access to Amazon Bedrock
Dependencies:
```
pip install boto3 pymupdf
```

🚀 How to Run

Upload a PDF to the documents/ folder.
Open and run the documents.ipynb notebook.
Extracted text files will be saved in the texts/ folder.

🤝 Contributing

Feel free to fork this repo and improve model selection, add post-processing like IdP comparison, or integrate evaluation metrics like BLEU or ROUGE scores.

🧠 Author

Gustavo Mainchein — AI Solutions Development Specialist - LinkedIn

📜 License

This project is licensed under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Document Text Extraction using Amazon Bedrock

📁 Project Structure

🔧 How It Works

Step 1: Convert PDF Pages to Images

Step 2: Define LLM Extractor Functions

Step 3: Parallel Processing

Step 4: Save Results

📊 LLM Metrics Tracked

📦 Requirements

🚀 How to Run

🤝 Contributing

🧠 Author

📜 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
documents		documents
images		images
texts		texts
.gitignore		.gitignore
README.md		README.md
documents.ipynb		documents.ipynb

Folders and files

Latest commit

History

Repository files navigation

PDF Document Text Extraction using Amazon Bedrock

📁 Project Structure

🔧 How It Works

Step 1: Convert PDF Pages to Images

Step 2: Define LLM Extractor Functions

Step 3: Parallel Processing

Step 4: Save Results

📊 LLM Metrics Tracked

📦 Requirements

🚀 How to Run

🤝 Contributing

🧠 Author

📜 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages