ContextEngine: Distributed RAG Platform on Kubernetes using Ray Serve

ContextEngine is a production-style Retrieval-Augmented Generation (RAG) platform for scalable, document-aware AI applications. It uses Ray Serve for distributed model serving, Kubernetes for orchestration, FastAPI for the API layer, and vector-based retrieval for semantic search.

The project demonstrates how modern LLM systems can be architected as modular microservices that separate ingestion, retrieval, generation, and orchestration responsibilities.

1. Highlights

Built an end-to-end distributed RAG platform covering document ingestion, vector retrieval, LLM inference, backend APIs, frontend integration, and Kubernetes deployment.
Designed a microservice-based architecture that separates ingestion, retrieval, generation, and API gateway responsibilities.
Used Ray Serve as the distributed serving layer: it routes requests to the AI services across Kubernetes and lets the retrieval and generation services evolve and scale independently.
Implemented a complete PDF ingestion pipeline with document parsing, chunking, metadata extraction, and embedding generation.
Built a vector-based semantic retrieval system for context-aware search across indexed documents.
Developed dedicated Retrieval and Generation Services to create a modular and maintainable AI application architecture.
Designed the platform to support multiple AI models and future model upgrades without major architectural changes.
Containerized all services using Docker and deployed them as cloud-native workloads on Kubernetes.
Established reusable shared components for configuration management, prompt templates, data models, and common utilities.
Implemented support for metadata-aware retrieval, document filtering, and advanced chunking strategies to improve retrieval quality.
Demonstrated practical application of distributed systems, Ray Serve, FastAPI, Kubernetes, vector search, and Retrieval-Augmented Generation (RAG) in a production-oriented AI platform.
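
The chunking step of the ingestion pipeline can be illustrated with a minimal sketch. It assumes fixed-size character windows with overlap, which is one common strategy and not necessarily the one this repository uses; the names `Chunk` and `chunk_text` and the default sizes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """One slice of a parsed document plus the metadata kept for retrieval."""
    doc_id: str
    index: int
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_text(doc_id: str, text: str, size: int = 200, overlap: int = 40,
               metadata: Optional[dict] = None) -> List[Chunk]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window start advances
    chunks: List[Chunk] = []
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + size]
        # Copy the metadata so chunks never share one mutable dict.
        chunks.append(Chunk(doc_id, index, piece, dict(metadata or {})))
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary fully visible in at least one chunk, at the cost of storing some text twice. Each chunk carries its own metadata copy, which is what later makes metadata-aware filtering possible.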

2. Architecture Overview

3. Technology Stack

Category	Technologies
1. Backend	Python, FastAPI
2. Distributed Serving	Ray Serve
3. Containerization	Docker
4. Orchestration	Kubernetes
5. AI Architecture	RAG
6. Retrieval	Vector Search
7. API Layer	REST APIs
8. Deployment	Cloud Native

4. Project Outcomes

Built a distributed Retrieval-Augmented Generation platform.
Designed a modular microservice architecture for AI workloads.
Implemented the API gateway pattern using FastAPI.
Integrated semantic retrieval workflows.
Implemented scalable LLM orchestration pipelines.
Containerized services using Docker.
Designed a Kubernetes-ready deployment architecture.
Used Ray Serve for distributed request routing and scaling.
Structured the codebase for production-style maintainability.

5. Key Features

5.1 Distributed Serving

Ray Serve-based distributed request handling.
Horizontally scalable service architecture.
Separation of inference and orchestration layers.
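
The separation of inference and orchestration can be sketched with Ray Serve model composition. This is an illustration under stated assumptions, not the repository's code: the class names, replica counts, and placeholder logic are hypothetical, and the handle-based `await handle.method.remote(...)` call style assumes a recent Ray 2.x release. Ray is imported only inside `build_app`, so the plain classes can be read and imported without a cluster.

```python
class Retriever:
    """Placeholder retrieval service; a real one would query a vector index."""

    def __init__(self, corpus: dict):
        self.corpus = corpus

    def retrieve(self, query: str) -> list:
        # Naive keyword overlap stands in for semantic search.
        words = set(query.lower().split())
        return [text for text in self.corpus.values()
                if words & set(text.lower().split())]


class Generator:
    """Placeholder generation service; a real one would call an LLM."""

    def generate(self, query: str, context: list) -> str:
        return f"Q: {query} | context passages: {len(context)}"


class RagPipeline:
    """Orchestration layer: holds handles to both services, owns no model."""

    def __init__(self, retriever, generator):
        self.retriever = retriever
        self.generator = generator

    async def answer(self, query: str) -> str:
        # Each call goes through a deployment handle, so the two services
        # can run on different replicas and scale independently.
        context = await self.retriever.retrieve.remote(query)
        return await self.generator.generate.remote(query, context)


def build_app(corpus: dict):
    """Wrap the classes as Ray Serve deployments and bind them into one app."""
    from ray import serve  # imported lazily; requires `ray[serve]`

    retriever = serve.deployment(num_replicas=2)(Retriever).bind(corpus)
    generator = serve.deployment(num_replicas=1)(Generator).bind()
    return serve.deployment(RagPipeline).bind(retriever, generator)
```

With Ray installed, `serve.run(build_app(corpus))` should return a handle to the pipeline deployment. Because replica counts are set per deployment, retrieval could be scaled out without touching the generation service.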

5.2 Retrieval-Augmented Generation

Semantic document retrieval.
Context-aware response generation.
Modular retrieval pipeline.
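
Semantic retrieval with metadata filtering can be sketched as a brute-force cosine-similarity search. This is a minimal illustration: the in-memory list, the entry layout (`text`, `vector`, `metadata`), and the function names are assumptions, and a production index would typically sit in a vector database.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[dict], query_vec: Sequence[float], top_k: int = 3,
           filters: Optional[dict] = None) -> List[Tuple[float, dict]]:
    """Return the top_k (score, entry) pairs whose metadata matches filters."""
    filters = filters or {}
    # Filter first so excluded documents never compete for a top_k slot.
    candidates = [
        entry for entry in index
        if all(entry["metadata"].get(key) == value
               for key, value in filters.items())
    ]
    scored = [(cosine(query_vec, entry["vector"]), entry)
              for entry in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A call such as `search(index, query_vec, top_k=2, filters={"source": "ray.pdf"})` would restrict results to chunks from one document. The query vector is expected to come from the same embedding model that produced the indexed vectors.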

5.3 Service-Oriented Design

API Gateway.
Ingestion Service.
Retrieval Service.
Generation Service.
Shared Core Components.
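
As an illustration of what the shared core components might hold, the sketch below pairs a data model with a prompt template that both the retrieval and generation services could import. The names `RetrievedChunk`, `PROMPT`, and `build_prompt`, the template wording, and the character budget are all hypothetical.

```python
from dataclasses import dataclass
from string import Template
from typing import List

# Shared prompt template; $context and $question are filled per request.
PROMPT = Template(
    "Answer the question using only the context below.\n\n"
    "Context:\n$context\n\n"
    "Question: $question\nAnswer:"
)


@dataclass(frozen=True)
class RetrievedChunk:
    """Data model passed from the retrieval service to the generation service."""
    text: str
    source: str
    score: float


def build_prompt(question: str, chunks: List[RetrievedChunk],
                 max_chars: int = 2000) -> str:
    """Number the chunks, cite each source, and stop once the budget is spent."""
    lines: List[str] = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        line = f"[{number}] ({chunk.source}) {chunk.text}"
        if used + len(line) > max_chars:
            break  # keep the prompt within the context budget
        lines.append(line)
        used += len(line)
    return PROMPT.substitute(context="\n".join(lines), question=question)
```

Keeping the template and the model in one shared module means a prompt change or a new field is made once, and each service picks it up on its next deployment.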

5.4 Kubernetes-Native Architecture

Containerized services.
Cloud deployment ready.
Horizontal scalability support.
Infrastructure abstraction.

6. Repository Structure

ContextEngine/
|
+-- services/
|   +-- api-gateway/
|   +-- ingestion/
|   +-- retrieval/
|   +-- generation/
|
+-- core/
|   +-- configs/
|   +-- models/
|   +-- utilities/
|   +-- shared/
|
+-- infrastructure/
|   +-- kubernetes/
|   +-- docker/
|   +-- deployment/
|
+-- docs/
|
+-- tests/
|
+-- requirements.txt

7. Getting Started

7.1 Clone Repository

git clone https://github.com/hsb943/contextengine-distributed-rag.git

cd contextengine-distributed-rag

7.2 Install Dependencies

pip install -r requirements.txt

7.3 Start API Gateway

cd services/api-gateway

uvicorn main:app --reload

7.4 Health Check

curl http://127.0.0.1:8000/health

Expected Response:

{
  "status": "healthy"
}
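
The same check can be scripted, for example to wait for the gateway before running tests. The sketch below uses only the Python standard library and assumes the `/health` endpoint and response body shown above; the function names, retry count, and delay are illustrative.

```python
import http.client
import json
import time
import urllib.error
import urllib.request


def parse_health(status_code: int, body: bytes) -> bool:
    """True only for HTTP 200 with a JSON object whose status is "healthy"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and malformed JSON
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Issue one GET and report whether the gateway looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.status, resp.read())
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return False  # unreachable, refused, timed out, or non-2xx


def wait_until_healthy(url: str, attempts: int = 10,
                       delay: float = 0.5) -> bool:
    """Poll the endpoint up to `attempts` times, sleeping between tries."""
    for attempt in range(attempts):
        if is_healthy(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until_healthy("http://127.0.0.1:8000/health")` would poll the locally started gateway and return `False` if it never reports healthy within the allotted attempts.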

8. Design Principles

The project follows several production engineering principles:

Separation of concerns.
Service modularity.
Scalability by design.
Infrastructure abstraction.
Cloud-native deployment patterns.
Reusable core components.
Independent service evolution.

9. Potential Extensions

Distributed vector databases.
Streaming LLM responses.
Authentication and authorization.
Multi-tenant support.
Prometheus monitoring.
Grafana dashboards.
GPU-aware scheduling.
Model versioning.
CI/CD pipelines.

10. Learning Objectives

This project demonstrates practical experience with:

Distributed Systems.
Large Language Model Infrastructure.
Retrieval-Augmented Generation.
Ray Serve.
Kubernetes.
FastAPI.
Microservice Architecture.
Cloud-Native Application Design.

11. License

This repository is intended for educational, research, and portfolio purposes.
