Summarizer API is a structured, production-ready RESTful API for document summarization. It is designed with reliability, maintainability, and extensibility in mind, even though the project scope is relatively small. The architecture and technology choices reflect my experience building robust Flask APIs and my desire to keep the implementation clean and scalable.
- PostgreSQL was chosen for its reliability, clear structure, and ease of replication. A tool like pgpool can remove the single point of failure; this is not part of the project but could easily be configured.
- SQLAlchemy is used as the ORM for flexible and modern database interactions.
- flask-smorest is built by the Marshmallow team and integrates seamlessly with Marshmallow schemas and OpenAPI specifications.
- Enables automatic API documentation and validation.
- I already knew flask-restful and flask-restx and checked whether there was anything newer. It turns out that neither appears to be actively maintained anymore, which makes flask-smorest the natural successor. It is actively maintained, modern, and integrates well with Marshmallow and OpenAPI, making it a future-proof choice for new Flask APIs.
- Marshmallow is preferred over Pydantic for this project due to its better Flask integration.
- Direct mapping to dataclasses is not required for my use case, so Marshmallow's flexibility is ideal.
- The codebase is strictly split into API, Business, and Data layers.
- This separation keeps implementations simple, clean, and easy to maintain or extend.
- While this structure may seem overkill for a small project, I believe in maintaining best practices regardless of project size.
- uWSGI was chosen for familiarity and its robust support for managing workers and multiprocessing.
- Makes scaling and process management straightforward.
- The project is containerized for portability and reproducibility.
- Initially considered using the Python package gemma, but dependency conflicts (SQLAlchemy >2.0 vs. xmanager needing 1.2.19) led me to use the Ollama image directly as its own instance.
Initially, I attempted to summarize entire website content by placing all the text into a single prompt for the Gemma3 model. However, the results were often unsatisfying: the model struggled with long inputs, leading to incomplete, unfocused, or generic summaries. This limitation is common with many LLMs, which have a context window that restricts the amount of text they can process effectively.
To address this, I explored alternative strategies and discovered the concept of chunking—splitting the website content into smaller, manageable pieces. Each chunk is summarized individually, and then the resulting summaries are combined and condensed into a final, concise summary using an additional prompt. This approach leverages the model's strengths with shorter inputs and ensures that the final output is both comprehensive and focused.
- Model Limitations: LLMs like Gemma3 perform best with prompts that fit well within their context window. Large blocks of text can overwhelm the model, resulting in poor summarization quality.
- Quality and Focus: By chunking, each section of the website receives dedicated attention, leading to more accurate and relevant summaries for each part.
- Combining Insights: A final summarization step merges the individual chunk summaries, ensuring that the most important points from the entire website are captured without exceeding character or token limits.
- I started with a straightforward approach—feeding the whole website content to the model—but quickly realized its limitations.
- Researching best practices for LLM summarization led me to chunking, which is widely recommended for handling long texts.
- I implemented chunking and a two-step summarization process: first, summarize each chunk; then, combine those summaries into a single, focused output.
- This method produced much better results: concise, relevant, and complete summaries that respect the model's constraints.
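The two-step process described above can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: `split_into_chunks`, `summarize_document`, the `max_chars` limit, and the `summarize` callable (which would wrap the actual Gemma3 call) are all hypothetical names chosen for the example.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single paragraph longer than the limit is hard-split.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def summarize_document(
    text: str, summarize: Callable[[str], str], max_chars: int = 2000
) -> str:
    """Summarize each chunk, then condense the partial summaries into one."""
    chunks = split_into_chunks(text, max_chars)
    if not chunks:
        return ""
    if len(chunks) == 1:
        # Short documents need only a single model call.
        return summarize(chunks[0])
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Second step: one additional prompt over the combined chunk summaries.
    return summarize("\n".join(partial_summaries))
```

Passing the model call in as a plain callable keeps the chunking logic testable without a running model server.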
Implementing this chunking approach was quite a challenge for me, as I had not used large language models (LLMs) this intensively before. My prior experience was mostly with GPT-5 mini for similar use-cases, which did not require such advanced handling of long-form content. Working with Gemma3 and developing a robust summarization workflow deepened my understanding of LLM limitations and best practices for prompt engineering and data processing.
This chunking strategy is now integrated into the codebase, making the summarizer robust and adaptable to websites of varying lengths and complexities.
To achieve true scalability and reliability, the Summarizer API implements a distributed manager-worker system. I chose this approach because I already have experience with such worker systems:
- Multiple Managers: You can run multiple manager processes (or containers) in parallel. Each manager is assigned a unique `manager-uuid` at startup.
- Distributed Locking: Each document in the database has a `locked_by_manager` field. When a manager wants to enqueue a document for processing, it sets this field to its own `manager-uuid`. This ensures that no two managers can enqueue the same document at the same time, preventing duplicate processing and race conditions.
- Worker Pool: Each manager controls a pool of worker processes. Workers only process documents that are locked by their own manager. This design allows you to scale both the number of managers and the number of workers independently, supporting high-throughput and distributed deployments.
- Self-Healing & Error Recovery: If a worker crashes or a manager dies, documents stuck in `IN_PROGRESS` or `ENQUEUED` state for more than 10 minutes are automatically reset to `PENDING` and their `locked_by_manager` field is cleared. This allows any manager to pick up and re-queue the job, ensuring no document is left unprocessed due to failures.
- The manager uses a unique UUID (`manager-uuid`) for distributed locking.
- The `locked_by_manager` field in the database is used to claim documents.
- Only documents not locked by any manager are eligible for enqueuing.
- Workers release the lock after processing, whether successful or failed.
- The system is robust to crashes and supports horizontal scaling across multiple containers or hosts.
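The claim-and-release logic above can be sketched with a single conditional `UPDATE`. This is a hedged illustration, not the project's actual SQLAlchemy code: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and the `documents` table layout, the function names, and the extra `status = 'PENDING'` condition are assumptions. Only the `locked_by_manager` field and the status values come from the description above.

```python
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a minimal, hypothetical documents table for the demo."""
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, locked_by_manager TEXT)"
    )


def claim_document(conn: sqlite3.Connection, doc_id: int, manager_uuid: str) -> bool:
    """Try to claim a document; True only if this manager now holds the lock."""
    # The WHERE clause makes the claim atomic: the row is updated only
    # if no other manager has set locked_by_manager in the meantime.
    cursor = conn.execute(
        "UPDATE documents SET locked_by_manager = ?, status = 'ENQUEUED' "
        "WHERE id = ? AND locked_by_manager IS NULL AND status = 'PENDING'",
        (manager_uuid, doc_id),
    )
    conn.commit()
    return cursor.rowcount == 1


def release_document(
    conn: sqlite3.Connection, doc_id: int, manager_uuid: str, final_status: str
) -> None:
    """Release the lock after processing, whether it succeeded or failed."""
    conn.execute(
        "UPDATE documents SET locked_by_manager = NULL, status = ? "
        "WHERE id = ? AND locked_by_manager = ?",
        (final_status, doc_id, manager_uuid),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO documents (id, status) VALUES (1, 'PENDING')")
    manager_a, manager_b = str(uuid.uuid4()), str(uuid.uuid4())
    print(claim_document(conn, 1, manager_a))  # first manager wins the claim
    print(claim_document(conn, 1, manager_b))  # second manager is rejected
```

Because the check and the write happen in one statement, two managers racing for the same row cannot both see it as unlocked.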
If a worker or manager crashes while processing a document, that document may remain stuck in IN_PROGRESS or ENQUEUED state and never get completed.
Each document has a last_updated timestamp and a locked_by_manager field. Managers periodically check for documents in IN_PROGRESS or ENQUEUED state that have not been updated for more than 10 minutes. If found, their status is reset to PENDING and their locked_by_manager field is cleared, making them available for re-queuing and processing by any manager.
- Workers set the document status to `IN_PROGRESS` and update `last_updated` before starting work.
- After successful summarization, status is set to `COMPLETED`, `last_updated` is updated, and the lock is released.
- On error, status is set to `FAILED`, `last_updated` is updated, and the lock is released.
- Managers reset stuck jobs to `PENDING` and clear the lock if no update for 10 minutes.
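The reset rule in the last item can be sketched as a small pure function. This is an illustrative sketch under assumptions: the `Document` dataclass and `reset_stuck_documents` are hypothetical names, and the real implementation would run an equivalent query against the database instead of iterating over in-memory objects. The 10-minute timeout, the status values, and the field names are taken from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

STUCK_TIMEOUT = timedelta(minutes=10)
STUCK_STATES = {"IN_PROGRESS", "ENQUEUED"}


@dataclass
class Document:
    """Hypothetical in-memory stand-in for a database row."""
    id: int
    status: str
    last_updated: datetime
    locked_by_manager: Optional[str] = None


def reset_stuck_documents(
    documents: List[Document], now: Optional[datetime] = None
) -> List[int]:
    """Reset documents without updates for more than STUCK_TIMEOUT; return their ids."""
    now = now or datetime.now(timezone.utc)
    reset_ids: List[int] = []
    for doc in documents:
        if doc.status in STUCK_STATES and now - doc.last_updated > STUCK_TIMEOUT:
            doc.status = "PENDING"
            doc.locked_by_manager = None  # any manager may claim it again
            doc.last_updated = now
            reset_ids.append(doc.id)
    return reset_ids
```

Accepting `now` as a parameter makes the timeout behaviour easy to test without waiting ten minutes.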
This approach is robust, self-healing, and fully scalable for distributed or multi-process environments.
Reliability was a core consideration throughout the project:
- Database Choice: PostgreSQL was selected for its robustness and support for replication, minimizing single points of failure. SQLAlchemy provides reliable ORM interactions and transaction management.
- API Validation: Flask-smorest and Marshmallow ensure that all API inputs are validated and documented, reducing the risk of runtime errors and making the API predictable for clients.
- Process Isolation: Multiprocessing ensures that long-running or CPU-bound summarization tasks do not block the main API thread, keeping the service responsive even under heavy load.
- Containerization: Docker guarantees that the application runs in a controlled environment, reducing dependency conflicts and deployment issues.
- Error Handling: The business and API layers are designed to catch and handle exceptions gracefully, returning meaningful error messages and avoiding crashes.
- The architecture supports future improvements such as health checks, monitoring, and automated failover (e.g., using pgpool for PostgreSQL).
- The codebase is structured to make it easy to add retry logic, circuit breakers, or other reliability patterns as needed.
I wanted to:
- Build a clean, maintainable API with clear separation of concerns.
- Use technologies that are proven, well-supported, and easy to scale and maintain.
- Avoid technical debt and ensure the project could be extended or deployed in production with minimal changes.
- Experiment with best practices even in small projects, as habits formed here translate to larger, more complex systems.
- API Layer: Handles HTTP requests, validation, and OpenAPI documentation.
- Business Layer: Contains core logic for document summarization and workflow management.
- Data Layer: Manages database models and persistence.
Flask-smorest automatically generates OpenAPI documentation, mapping each registered blueprint to a tag in the docs. To keep the architecture future-proof and maintainable, I decided to implement each resource as its own blueprint—even though currently there is only one resource. This strict separation ensures:
- Scalability: Adding new resources is straightforward—just create a new blueprint.
- Clarity: Each resource is clearly represented in the OpenAPI docs under its own tag.
- Maintainability: The codebase remains organized and easy to extend as requirements grow.
This approach may seem over-engineered for a single resource, but it aligns with best practices for building robust, extensible APIs. As the project evolves, new resources can be added with minimal refactoring, keeping the documentation and implementation clean and consistent.
Below is a list of all environment variables needed to run the Summarizer API and its components. Set these in your docker-compose.yml, .env file, or deployment environment as appropriate:
| Variable | Description |
|---|---|
| DB_HOST | Hostname or IP address of the PostgreSQL database server. |
| DB_PORT | Port number for the PostgreSQL database (default: 5432). |
| DB_USER | Username for connecting to the PostgreSQL database. |
| DB_PASSWORD | Password for the PostgreSQL database user. |
| DB_NAME | Name of the database to use for the Summarizer API. |
| UWSGI_CHEAPER | Minimum number of uWSGI worker processes to keep alive (used for scaling). |
| UWSGI_PROCESSES | Total number of uWSGI worker processes to run. |
| NUM_SUMMARIZATION_WORKERS | Number of worker processes for document summarization (used by manager/worker service). |
| OLLAMA_HOST | Hostname or IP address, with port, of the Ollama model server (e.g., 'ollama' if using Docker Compose). |
Notes:
- All database variables (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`) must be set for both the API and worker services to connect to the database.
- `OLLAMA_HOST` should point to the running Ollama model server. If using Docker Compose, the service name (e.g., 'ollama') is sufficient.
- `NUM_SUMMARIZATION_WORKERS` controls how many parallel summarization jobs each manager can run.
- `UWSGI_CHEAPER` and `UWSGI_PROCESSES` are used for tuning the API server's process management and scalability.
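As a rough illustration of how the variables above might be consumed, the sketch below loads them into a settings object and builds a PostgreSQL connection URL. The `Settings` class and `load_settings` function are hypothetical and not taken from this repository. The fallback of one worker for `NUM_SUMMARIZATION_WORKERS` is an assumption; the `DB_PORT` fallback follows the default listed in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import quote_plus


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_port: int
    db_user: str
    db_password: str
    db_name: str
    ollama_host: str
    num_summarization_workers: int

    @property
    def database_url(self) -> str:
        """Connection URL; user and password are percent-encoded."""
        return (
            f"postgresql://{quote_plus(self.db_user)}:{quote_plus(self.db_password)}"
            f"@{self.db_host}:{self.db_port}/{self.db_name}"
        )


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the settings from env and fail early if required values are missing."""
    required = ["DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME", "OLLAMA_HOST"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        db_host=env["DB_HOST"],
        db_port=int(env.get("DB_PORT", "5432")),  # default from the table above
        db_user=env["DB_USER"],
        db_password=env["DB_PASSWORD"],
        db_name=env["DB_NAME"],
        ollama_host=env["OLLAMA_HOST"],
        # Assumed fallback of a single worker when the variable is unset.
        num_summarization_workers=int(env.get("NUM_SUMMARIZATION_WORKERS", "1")),
    )
```

Failing at startup with a list of the missing names tends to be easier to debug than a connection error later on.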
The summarizer_api Docker image provides two main functionalities:
- REST API Service: Handles HTTP requests for document summarization (see `app` service in `docker-compose.yml`).
- CLI Manager/Worker Service: Processes pending documents using a manager and worker pool (see `summarizer_worker` service in `docker-compose.yml`).
You can run all components together using Docker Compose, or deploy each part (API, db, workers, Ollama) on separate servers for full scalability.
Option 1: All-in-one with Docker Compose
- Clone the repo: `git clone https://gitlab.com/OliverGras/summarizer_api.git`, then `cd summarizer_api`
- Update environment variables in `docker-compose.yml` to match your database and Ollama setup.
- Start all services: `docker compose up -d`
Option 2: Distributed Deployment
- You can run the API server, db, worker and Ollama model server on separate machines or containers.
- Details Summarizer Worker:
  - To run a manager (with its workers) on any server with DB access: `uv run flask summarize-documents`
  - Each container runs a manager process that handles queue management and launches its own pool of workers.
  - The number of workers per container is controlled by the `NUM_SUMMARIZATION_WORKERS` environment variable.
  - You can run multiple containers (managers) for horizontal scaling; each will coordinate with others via distributed locking in the database.
- To run Ollama (model server) separately, follow Ollama's documentation and configure the API to point to its endpoint.
- The API will be available at: http://localhost:8080/
- The OpenAPI/Swagger documentation is available at: http://localhost:8080/apidocs/swagger-ui
- The OpenAPI spec (JSON) is at: http://localhost:8080/apidocs/openapi.json
You can use the Swagger UI to explore and test all available endpoints interactively.