Summarizer API

Overview

Summarizer API is a structured, production-ready RESTful API for document summarization. It is designed with reliability, maintainability, and extensibility in mind, even though the project scope is relatively small. The architecture and technology choices reflect my experience building robust Flask APIs and my desire to keep the implementation clean and scalable.

Architectural Choices & Thought Process

Database: PostgreSQL

Chosen for its reliability, clear structure, and ease of replication. Tools like pgpool can remove the database as a single point of failure; this is not part of my project but could easily be configured.
SQLAlchemy is used as the ORM for flexible and modern database interactions.

API Framework: Flask-smorest

Built by the Marshmallow team, providing seamless integration with Marshmallow schemas and OpenAPI specifications.
Enables automatic API documentation and validation.
I already knew flask-restful and flask-restx, so I checked whether anything newer was available. As far as I could tell, neither of them is actively maintained anymore, which made flask-smorest the obvious candidate. It is actively maintained, modern, and integrates well with Marshmallow and OpenAPI, making it a future-proof choice for new Flask APIs.

Data Modeling: Marshmallow

Preferred over Pydantic for this project due to better Flask integration.
Direct mapping to dataclasses is not required for my use case, so Marshmallow's flexibility is ideal.

Layered Architecture

The codebase is strictly split into API, Business, and Data layers.
This separation keeps implementations simple, clean, and easy to maintain or extend.
While this structure may seem overkill for a small project, I believe in maintaining best practices regardless of project size.
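
To illustrate the split, here is a minimal sketch of how the three layers could relate to each other. It is not the project's actual code: the names `Document`, `InMemoryDocumentRepository`, `DocumentService`, and `post_document` are hypothetical, and an in-memory dictionary stands in for the SQLAlchemy-backed data layer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


# --- Data layer: owns storage, knows nothing about HTTP or business rules ---
@dataclass
class Document:
    id: int
    url: str
    status: str = "PENDING"
    summary: Optional[str] = None


@dataclass
class InMemoryDocumentRepository:
    """Hypothetical stand-in for a SQLAlchemy-backed repository."""

    _rows: Dict[int, Document] = field(default_factory=dict)

    def add(self, url: str) -> Document:
        doc = Document(id=len(self._rows) + 1, url=url)
        self._rows[doc.id] = doc
        return doc

    def get(self, doc_id: int) -> Optional[Document]:
        return self._rows.get(doc_id)


# --- Business layer: rules and workflow, no HTTP and no SQL ---
class DocumentService:
    def __init__(self, repo: InMemoryDocumentRepository) -> None:
        self._repo = repo

    def submit(self, url: str) -> Document:
        if not url.startswith(("http://", "https://")):
            raise ValueError("url must start with http:// or https://")
        return self._repo.add(url)


# --- API layer: translates requests and errors into response bodies and status codes ---
def post_document(service: DocumentService, payload: dict) -> Tuple[dict, int]:
    try:
        doc = service.submit(payload.get("url", ""))
    except ValueError as exc:
        return {"message": str(exc)}, 422
    return {"id": doc.id, "url": doc.url, "status": doc.status}, 201
```

Because each layer only talks to the one directly below it, the repository could be swapped for a real database implementation without touching the API function.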

Application Server: uWSGI

Chosen for familiarity and its robust support for managing workers and multiprocessing.
Makes scaling and process management straightforward.

Containerization: Docker

The project is containerized for portability and reproducibility.

Model Serving: Ollama

Initially considered using the Python package gemma, but dependency conflicts (SQLAlchemy >2.0 vs. xmanager needing 1.2.19) led me to use the Ollama image directly as its own instance.

My Thought Process

Summarizing Websites with Gemma3: Chunking Approach

Initially, I attempted to summarize entire website content by placing all the text into a single prompt for the Gemma3 model. However, the results were often unsatisfying: the model struggled with long inputs, leading to incomplete, unfocused, or generic summaries. This limitation is common with many LLMs, which have a context window that restricts the amount of text they can process effectively.

To address this, I explored alternative strategies and discovered the concept of chunking—splitting the website content into smaller, manageable pieces. Each chunk is summarized individually, and then the resulting summaries are combined and condensed into a final, concise summary using an additional prompt. This approach leverages the model's strengths with shorter inputs and ensures that the final output is both comprehensive and focused.
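
The two-step idea can be sketched as follows. This is a simplified illustration, not the code used in this project: the character-based chunk limit, the function names, and the `summarize` callable (which would wrap the actual call to the model) are all assumptions.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Split text into chunks of at most max_chars, preferring paragraph boundaries."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single paragraph longer than the limit is cut into fixed-size pieces.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 2000,
) -> str:
    """Summarize each chunk, then condense the partial summaries in a final pass."""
    chunks = split_into_chunks(text, max_chars)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial_summaries))
```

Passing the model call in as `summarize` keeps the chunking logic testable without a running model server. A token-based limit would track the context window more closely than a character count, at the cost of needing a tokenizer.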

Why Chunking?

Model Limitations: LLMs like Gemma3 perform best with prompts that fit well within their context window. Large blocks of text can overwhelm the model, resulting in poor summarization quality.
Quality and Focus: By chunking, each section of the website receives dedicated attention, leading to more accurate and relevant summaries for each part.
Combining Insights: A final summarization step merges the individual chunk summaries, ensuring that the most important points from the entire website are captured without exceeding character or token limits.

Thought Process

I started with a straightforward approach—feeding the whole website content to the model—but quickly realized its limitations.
Researching best practices for LLM summarization led me to chunking, which is widely recommended for handling long texts.
I implemented chunking and a two-step summarization process: first, summarize each chunk; then, combine those summaries into a single, focused output.
This method produced much better results: concise, relevant, and complete summaries that respect the model's constraints.

Personal Note

Implementing this chunking approach was quite a challenge for me, as I had not used large language models (LLMs) this intensively before. My prior experience was mostly with GPT-5 mini for similar use-cases, which did not require such advanced handling of long-form content. Working with Gemma3 and developing a robust summarization workflow deepened my understanding of LLM limitations and best practices for prompt engineering and data processing.

This chunking strategy is now integrated into the codebase, making the summarizer robust and adaptable to websites of varying lengths and complexities.

Thought Processes: Scalability & Reliability

Distributed Manager-Worker Architecture for Full Scalability

To achieve true scalability and reliability, the Summarizer API implements a distributed manager-worker system. I chose this approach because I already have experience with such worker systems:

Multiple Managers: You can run multiple manager processes (or containers) in parallel. Each manager is assigned a unique manager-uuid at startup.
Distributed Locking: Each document in the database has a locked_by_manager field. When a manager wants to enqueue a document for processing, it sets this field to its own manager-uuid. This ensures that no two managers can enqueue the same document at the same time, preventing duplicate processing and race conditions.
Worker Pool: Each manager controls a pool of worker processes. Workers only process documents that are locked by their own manager. This design allows you to scale both the number of managers and the number of workers independently, supporting high-throughput and distributed deployments.
Self-Healing & Error Recovery: If a worker crashes or a manager dies, documents stuck in IN_PROGRESS or ENQUEUED state for more than 10 minutes are automatically reset to PENDING and their locked_by_manager field is cleared. This allows any manager to pick up and re-queue the job, ensuring no document is left unprocessed due to failures.
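
One common way to make such a claim atomic is a conditional UPDATE that only succeeds while the row is still unlocked. The sketch below shows that idea; it is an assumption about how the claim could be implemented, not the project's actual query. It uses `sqlite3` from the standard library as a stand-in for PostgreSQL so that it runs without a database server, and the table layout and function names are hypothetical.

```python
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a minimal documents table (hypothetical layout)."""
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "locked_by_manager TEXT)"
    )


def claim_document(conn: sqlite3.Connection, doc_id: int, manager_uuid: str) -> bool:
    """Try to claim a PENDING, unlocked document for this manager.

    The WHERE clause makes the claim conditional: if another manager has
    already set locked_by_manager, no row matches and nothing is changed.
    """
    cursor = conn.execute(
        "UPDATE documents "
        "SET locked_by_manager = ?, status = 'ENQUEUED' "
        "WHERE id = ? AND status = 'PENDING' AND locked_by_manager IS NULL",
        (manager_uuid, doc_id),
    )
    conn.commit()
    return cursor.rowcount == 1


def new_manager_uuid() -> str:
    """Generate the unique identifier a manager would use for locking."""
    return str(uuid.uuid4())
```

With two managers competing for the same document, the first call to `claim_document` returns `True` and the second returns `False`, so only one of them enqueues the job.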

Implementation Notes

The manager uses a unique UUID (manager-uuid) for distributed locking.
The locked_by_manager field in the database is used to claim documents.
Only documents not locked by any manager are eligible for enqueuing.
Workers release the lock after processing, whether successful or failed.
The system is robust to crashes and supports horizontal scaling across multiple containers or hosts.

Distributed Error Recovery & Self-Healing - Details

Problem

If a worker or manager crashes while processing a document, that document may remain stuck in IN_PROGRESS or ENQUEUED state and never get completed.

Solution

Each document has a last_updated timestamp and a locked_by_manager field. Managers periodically check for documents in IN_PROGRESS or ENQUEUED state that have not been updated for more than 10 minutes. If found, their status is reset to PENDING and their locked_by_manager field is cleared, making them available for re-queuing and processing by any manager.

Implementation Notes

Workers set the document status to IN_PROGRESS and update last_updated before starting work.
After successful summarization, status is set to COMPLETED, last_updated is updated, and the lock is released.
On error, status is set to FAILED, last_updated is updated, and the lock is released.
Managers reset stuck jobs to PENDING and clear the lock if no update for 10 minutes.
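
The periodic reset could look roughly like the sketch below. It is an illustration under assumptions rather than the project's implementation: `sqlite3` stands in for PostgreSQL, `last_updated` is stored as a Unix timestamp for simplicity, and the function name and table layout are hypothetical.

```python
import sqlite3
import time
from typing import Optional

STUCK_TIMEOUT_SECONDS = 10 * 60  # the 10-minute threshold described above


def reset_stuck_documents(
    conn: sqlite3.Connection,
    now: Optional[float] = None,
    timeout_seconds: int = STUCK_TIMEOUT_SECONDS,
) -> int:
    """Reset stale IN_PROGRESS/ENQUEUED documents to PENDING and clear their lock.

    Returns the number of documents that were reset.
    """
    now = time.time() if now is None else now
    cutoff = now - timeout_seconds
    cursor = conn.execute(
        "UPDATE documents "
        "SET status = 'PENDING', locked_by_manager = NULL, last_updated = ? "
        "WHERE status IN ('IN_PROGRESS', 'ENQUEUED') AND last_updated < ?",
        (now, cutoff),
    )
    conn.commit()
    return cursor.rowcount
```

Because the reset is a single UPDATE, several managers can run it concurrently without coordinating: a row that one manager has already reset no longer matches the WHERE clause for the others.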

This approach is robust, self-healing, and fully scalable for distributed or multi-process environments.

Reliability

Reliability was a core consideration throughout the project:

Database Choice: PostgreSQL was selected for its robustness and support for replication, minimizing single points of failure. SQLAlchemy provides reliable ORM interactions and transaction management.
API Validation: Flask-smorest and Marshmallow ensure that all API inputs are validated and documented, reducing the risk of runtime errors and making the API predictable for clients.
Process Isolation: Multiprocessing ensures that long-running or CPU-bound summarization tasks do not block the main API thread, keeping the service responsive even under heavy load.
Containerization: Docker guarantees that the application runs in a controlled environment, reducing dependency conflicts and deployment issues.
Error Handling: The business and API layers are designed to catch and handle exceptions gracefully, returning meaningful error messages and avoiding crashes.

Future Reliability Enhancements

The architecture supports future improvements such as health checks, monitoring, and automated failover (e.g., using pgpool for PostgreSQL).
The codebase is structured to make it easy to add retry logic, circuit breakers, or other reliability patterns as needed.

Throughout the project, I wanted to:

Build a clean, maintainable API with clear separation of concerns.
Use technologies that are proven, well-supported, and easy to scale and maintain.
Avoid technical debt and ensure the project could be extended or deployed in production with minimal changes.
Experiment with best practices even in small projects, as habits formed here translate to larger, more complex systems.

Project Structure

API Layer: Handles HTTP requests, validation, and OpenAPI documentation.
Business Layer: Contains core logic for document summarization and workflow management.
Data Layer: Manages database models and persistence.

API Layer Structure & Blueprint Design

Flask-smorest automatically generates OpenAPI documentation, mapping each registered blueprint to a tag in the docs. To keep the architecture future-proof and maintainable, I decided to implement each resource as its own blueprint—even though currently there is only one resource. This strict separation ensures:

Scalability: Adding new resources is straightforward—just create a new blueprint.
Clarity: Each resource is clearly represented in the OpenAPI docs under its own tag.
Maintainability: The codebase remains organized and easy to extend as requirements grow.

This approach may seem over-engineered for a single resource, but it aligns with best practices for building robust, extensible APIs. As the project evolves, new resources can be added with minimal refactoring, keeping the documentation and implementation clean and consistent.

Getting Started

Required Environment Variables

Below is a list of all environment variables needed to run the Summarizer API and its components. Set these in your docker-compose.yml, .env file, or deployment environment as appropriate:

Variable	Description
DB_HOST	Hostname or IP address of the PostgreSQL database server.
DB_PORT	Port number for the PostgreSQL database (default: 5432).
DB_USER	Username for connecting to the PostgreSQL database.
DB_PASSWORD	Password for the PostgreSQL database user.
DB_NAME	Name of the database to use for the Summarizer API.
UWSGI_CHEAPER	Minimum number of uWSGI worker processes to keep alive (used for scaling).
UWSGI_PROCESSES	Total number of uWSGI worker processes to run.
NUM_SUMMARIZATION_WORKERS	Number of worker processes for document summarization (used by manager/worker service).
OLLAMA_HOST	Hostname or IP address, with port, of the Ollama model server (e.g., 'ollama' if using Docker Compose).

Notes:

All database variables (DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME) must be set for both the API and worker services to connect to the database.
OLLAMA_HOST should point to the running Ollama model server. If using Docker Compose, the service name (e.g., 'ollama') is sufficient.
NUM_SUMMARIZATION_WORKERS controls how many parallel summarization jobs each manager can run.
UWSGI_CHEAPER and UWSGI_PROCESSES are used for tuning the API server's process management and scalability.
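
As an illustration of how the database variables fit together, the sketch below assembles a PostgreSQL connection URL from them and fails early if a required one is missing. The function name `build_database_url` is hypothetical, and whether the project builds its connection string this way is an assumption; only the variable names and the default port come from the table above.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote_plus

REQUIRED_DB_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME")


def build_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a PostgreSQL URL from the DB_* variables, validating that they are set."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_DB_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    port = env.get("DB_PORT") or "5432"  # default port from the table above
    # Escape user and password so characters such as '@' or ':' do not break the URL.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{port}/{env['DB_NAME']}"
```

Checking the variables at startup surfaces configuration mistakes immediately instead of on the first database query.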

Quick Start & Running Components

The summarizer_api Docker image provides two main functionalities:

REST API Service: Handles HTTP requests for document summarization (see app service in docker-compose.yml).
CLI Manager/Worker Service: Processes pending documents using a manager and worker pool (see summarizer_worker service in docker-compose.yml).

You can run all components together using Docker Compose, or deploy each part (API, db, workers, Ollama) on separate servers for full scalability.

Option 1: All-in-one with Docker Compose

Clone the repo:

git clone https://gitlab.com/OliverGras/summarizer_api.git
cd summarizer_api

Update environment variables in docker-compose.yml to match your database and Ollama setup.
Start all services:
```
docker compose up -d
```

Option 2: Distributed Deployment

You can run the API server, db, worker and Ollama model server on separate machines or containers.
Summarizer worker details:
- To run a manager (with its workers) on any server with DB access:
```
uv run flask summarize-documents
```
- Each container runs a manager process that handles queue management and launches its own pool of workers.
- The number of workers per container is controlled by the NUM_SUMMARIZATION_WORKERS environment variable.
- You can run multiple containers (managers) for horizontal scaling; each will coordinate with others via distributed locking in the database.
To run Ollama (model server) separately, follow Ollama's documentation and configure the API to point to its endpoint.

Accessing the API and Documentation

The API will be available at: http://localhost:8080/
The OpenAPI/Swagger documentation is available at: http://localhost:8080/apidocs/swagger-ui
The OpenAPI spec (JSON) is at: http://localhost:8080/apidocs/openapi.json

You can use the Swagger UI to explore and test all available endpoints interactively.
