A serverless ETL (Extract, Transform, Load) pipeline deployed on AWS using Terraform and Harness CI/CD. Optimized for AWS Free Tier.
This project implements a production-ready ETL pipeline that:
- Extracts data from S3 (CSV, JSON, Parquet)
- Transforms data using pandas (cleaning, validation, enrichment)
- Loads processed data to S3 in Parquet format
- Tracks job metadata in DynamoDB
- Sends notifications via SNS
```
┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│   S3 Raw     │─────>│   Lambda    │─────>│ S3 Processed │
│  (incoming)  │      │ (ETL Logic) │      │  (parquet)   │
└──────────────┘      └──────┬──────┘      └──────────────┘
                             │
                     ┌───────┴───────┐
                     │               │
                ┌────▼─────┐   ┌─────▼─────┐
                │ DynamoDB │   │    SNS    │
                │(metadata)│   │ (alerts)  │
                └──────────┘   └───────────┘
```
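End to end, the Lambda handler ties these stages together. Here is a minimal sketch of the flow (the real logic lives under etl/src/; the environment variable names, transform steps, and key layout are illustrative assumptions, not the project's actual code):

```python
import io
import json
import os
from datetime import datetime, timezone

import boto3
import pandas as pd

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

def handler(event, context):
    bucket, key = event["source_bucket"], event["source_key"]

    # Extract: read the raw object from S3 (CSV shown; JSON/Parquet analogous).
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_csv(io.BytesIO(raw))

    # Transform: cleaning/validation/enrichment (illustrative steps).
    df = df.dropna(how="all")
    df["processed_at"] = datetime.now(timezone.utc).isoformat()

    # Load: write Parquet to the processed bucket (bucket name assumed via env var).
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    out_key = key.replace("incoming/", "processed/").rsplit(".", 1)[0] + ".parquet"
    s3.put_object(Bucket=os.environ["PROCESSED_BUCKET"], Key=out_key, Body=buf.getvalue())

    # Track job metadata in DynamoDB and notify via SNS.
    dynamodb.Table(os.environ["METADATA_TABLE"]).put_item(
        Item={"job_id": context.aws_request_id, "source_key": key, "rows": len(df)}
    )
    sns.publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Message=json.dumps({"status": "SUCCESS", "output_key": out_key}),
    )
    return {"statusCode": 200, "output_key": out_key}
```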
The easiest way to deploy uses Harness for automated CI/CD:

1. Fork/clone the repository

   ```bash
   git clone https://github.com/tmmsunny012/harness-aws-etl-pipeline.git
   cd harness-aws-etl-pipeline
   ```

2. Set up Harness - follow docs/harness_setup.md:
   - Create a Harness account (free tier)
   - Configure the AWS connector with secrets
   - Import the pipeline from .harness/pipeline.yaml

3. Run the pipeline
   - Go to Harness > Pipelines > ETL Pipeline Deployment
   - Click Run and select an environment (dev/staging/prod)
   - The pipeline will automatically:
     - Run tests
     - Build the Lambda package
     - Create/update all AWS resources
     - Verify the deployment
For local development or manual deployment:

1. Prerequisites
   - Python 3.9+
   - Terraform >= 1.0 (or OpenTofu)
   - AWS CLI v2

2. Configure AWS credentials

   ```bash
   aws configure
   # AWS Access Key ID: <your-access-key>
   # AWS Secret Access Key: <your-secret-key>
   # Default region name: us-east-1
   ```

3. Build the Lambda package

   ```bash
   python scripts/build_lambda.py
   ```

4. Bootstrap the Terraform state backend

   ```bash
   cd infrastructure/terraform-state
   terraform init
   terraform apply -var="aws_region=us-east-1"
   # Note the output values for the next step
   ```

5. Deploy the infrastructure (a post-deploy sanity check sketch follows these steps)

   ```bash
   cd ../terraform

   # Initialize with the S3 backend
   terraform init \
     -backend-config="bucket=etl-pipeline-terraform-state-<ACCOUNT_ID>" \
     -backend-config="dynamodb_table=etl-pipeline-terraform-locks"

   # Deploy
   terraform apply -var="environment=dev" -var="aws_region=us-east-1"
   ```
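Once the apply finishes, a quick check from Python can confirm the core resources exist. A sketch, assuming the etl-pipeline-&lt;env&gt;-* naming used throughout this README:

```python
# Post-deploy sanity check (sketch). Adjust ENV to match your deployment.
import boto3

ENV = "dev"
FUNCTION = f"etl-pipeline-{ENV}-processor"
TABLE = f"etl-pipeline-{ENV}-metadata"

lambda_client = boto3.client("lambda")
ddb = boto3.client("dynamodb")

# The Lambda function exists and is Active.
state = lambda_client.get_function(FunctionName=FUNCTION)["Configuration"]["State"]
print(f"{FUNCTION}: {state}")

# The metadata table exists.
status = ddb.describe_table(TableName=TABLE)["Table"]["TableStatus"]
print(f"{TABLE}: {status}")
```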
Test the ETL pipeline locally without AWS costs:

```bash
# Start LocalStack
docker-compose up -d localstack

# Run the ETL locally
python scripts/run_local.py --full-test

# Run the tests
pytest tests/ -v
```
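The key idea behind local runs is pointing boto3 at LocalStack's edge port (4566 by default) with dummy credentials; scripts/run_local.py presumably wires its clients the same way. A minimal sketch:

```python
# Point boto3 at LocalStack instead of AWS (sketch; bucket name is illustrative).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",  # LocalStack's default edge port
    region_name="us-east-1",
    aws_access_key_id="test",              # any dummy value works locally
    aws_secret_access_key="test",
)
s3.create_bucket(Bucket="etl-pipeline-dev-raw-data-local")
s3.upload_file("data/sample.csv", "etl-pipeline-dev-raw-data-local", "incoming/sample.csv")
print(s3.list_objects_v2(Bucket="etl-pipeline-dev-raw-data-local")["Contents"])
```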
Project layout:

```
harness-aws-etl-pipeline/
├── .harness/
│   └── pipeline.yaml              # Harness CI/CD pipeline definition
│
├── etl/                           # ETL application code
│   ├── lambda_handler.py          # Lambda entry point
│   ├── requirements-lambda.txt    # Lambda dependencies (lightweight)
│   ├── README.md                  # ETL technical documentation
│   └── src/
│       ├── extract/               # Data extraction from S3
│       ├── transform/             # Data transformation logic
│       ├── load/                  # Data loading to S3
│       └── utils/                 # AWS clients, config, metadata
│
├── infrastructure/
│   ├── terraform/                 # Main infrastructure
│   │   ├── main.tf                # AWS resources (Lambda, S3, DynamoDB, etc.)
│   │   ├── variables.tf           # Configuration variables
│   │   ├── outputs.tf             # Output values
│   │   └── README.md              # Terraform documentation
│   │
│   └── terraform-state/           # State backend bootstrap
│       └── main.tf                # S3 bucket + DynamoDB for state
│
├── tests/
│   ├── unit/                      # Unit tests
│   └── integration/               # Integration tests (LocalStack)
│
├── scripts/
│   ├── build_lambda.py            # Build Lambda deployment package
│   └── run_local.py               # Local development runner
│
└── docs/
    ├── SETUP.md                   # Detailed setup guide
    └── harness_setup.md           # Harness CI/CD configuration
```
Further documentation:

| Document | Description |
|---|---|
| docs/harness_setup.md | Harness CI/CD setup and pipeline execution details |
| docs/SETUP.md | Complete setup guide with AWS IAM permissions |
| infrastructure/terraform/README.md | Terraform configuration and state management |
| etl/README.md | ETL technical documentation, AWS services deep dive |
All resources are optimized for AWS Free Tier:
| Resource | Service | Free Tier Limit |
|---|---|---|
| Raw Data Bucket | S3 | 5GB storage |
| Processed Data Bucket | S3 | 5GB storage |
| ETL Processor | Lambda | 1M requests/month |
| Metadata Table | DynamoDB | 25 RCU, 25 WCU |
| Notifications | SNS | 1M publishes/month |
| Logs | CloudWatch | 5GB ingestion |
The CI/CD pipeline works whether resources exist or not:
- Imports existing resources into Terraform state (see the sketch after this list)
- Creates missing resources
- Handles recovery from partial failures
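A sketch of that import-or-create idea, assuming a hypothetical Terraform address aws_s3_bucket.raw_data (the pipeline's actual logic lives in .harness/pipeline.yaml):

```python
# If a bucket already exists, adopt it into Terraform state before
# `terraform apply` so the run stays idempotent (sketch).
import subprocess

import boto3
from botocore.exceptions import ClientError

def import_if_exists(tf_address: str, bucket_name: str) -> None:
    s3 = boto3.client("s3")
    try:
        s3.head_bucket(Bucket=bucket_name)  # raises if the bucket is absent
    except ClientError:
        return  # bucket missing: let `terraform apply` create it
    # Bucket exists: import it into state (tolerate "already managed" errors).
    subprocess.run(["terraform", "import", tf_address, bucket_name], check=False)

import_if_exists("aws_s3_bucket.raw_data", "etl-pipeline-dev-raw-data-<ACCOUNT_ID>")
```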
Heavy dependencies (pandas, numpy, pyarrow) are provided by the AWS SDK for pandas (AWSSDKPandas) Lambda layer:

```hcl
layers = [
  "arn:aws:lambda:${region}:336392948345:layer:AWSSDKPandas-Python39:28"
]
```

This keeps the deployment package under 10MB (vs 300MB+ without the layer).
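For illustration, a packaging step along these lines zips only the application code plus the lightweight requirements, leaving the heavy libraries to the layer. A sketch (paths assumed from the project layout above; the real script is scripts/build_lambda.py):

```python
#!/usr/bin/env python3
"""Sketch of the Lambda packaging step; layout and output paths are assumptions."""
import shutil
import subprocess
import sys
from pathlib import Path

BUILD_DIR = Path("build/lambda")        # staging area (assumed)
PACKAGE = Path("build/lambda_package")  # zip target, without extension

def build() -> None:
    shutil.rmtree(BUILD_DIR, ignore_errors=True)
    BUILD_DIR.mkdir(parents=True)
    # Copy only the application code; pandas/numpy/pyarrow come from the layer.
    shutil.copy("etl/lambda_handler.py", BUILD_DIR)
    shutil.copytree("etl/src", BUILD_DIR / "src")
    # Install only the lightweight dependencies into the package.
    subprocess.run(
        [sys.executable, "-m", "pip", "install",
         "-r", "etl/requirements-lambda.txt", "-t", str(BUILD_DIR)],
        check=True,
    )
    shutil.make_archive(str(PACKAGE), "zip", BUILD_DIR)

if __name__ == "__main__":
    build()
```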
Terraform state is stored in S3 with DynamoDB locking:
- State persists across CI/CD runs
- Prevents concurrent modifications
- Enables state recovery via S3 versioning
Automatic (S3 event): upload a file to the raw data bucket:

```bash
aws s3 cp data/sample.csv s3://etl-pipeline-dev-raw-data-<ACCOUNT_ID>/incoming/
```

Manual invocation:

```bash
aws lambda invoke \
--function-name etl-pipeline-dev-processor \
--payload '{"source_bucket": "etl-pipeline-dev-raw-data-<ACCOUNT_ID>", "source_key": "incoming/sample.csv"}' \
  response.json
```
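The same invocation from Python via boto3 (a sketch; the payload mirrors the CLI example above):

```python
import json

import boto3

client = boto3.client("lambda")
resp = client.invoke(
    FunctionName="etl-pipeline-dev-processor",
    Payload=json.dumps({
        "source_bucket": "etl-pipeline-dev-raw-data-<ACCOUNT_ID>",
        "source_key": "incoming/sample.csv",
    }),
)
print(json.loads(resp["Payload"].read()))  # the handler's return value
```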
Then check the results:

```bash
# Check processed data
aws s3 ls s3://etl-pipeline-dev-processed-data-<ACCOUNT_ID>/processed/ --recursive
# Check job metadata
aws dynamodb scan --table-name etl-pipeline-dev-metadata
# View Lambda logs
aws logs tail /aws/lambda/etl-pipeline-dev-processor --since 30m
```

Terraform input variables:

| Variable | Description | Default |
|---|---|---|
| environment | Target environment (dev, staging, prod) | dev |
| aws_region | AWS region for deployment | us-east-1 |
| Issue | Solution |
|---|---|
| "BucketAlreadyExists" | Pipeline imports existing resources - this is handled automatically |
| Lambda package too large | Dependencies provided by Lambda Layer, not bundled |
| Terraform state lost | State stored in S3, recoverable via imports |
| AWS provider crash | Using OpenTofu instead of Terraform to avoid credential parsing bugs |
Useful debugging commands:

```bash
# Check the Lambda function
aws lambda get-function --function-name etl-pipeline-dev-processor
# View recent errors
aws logs filter-log-events \
--log-group-name /aws/lambda/etl-pipeline-dev-processor \
--filter-pattern "ERROR"
# Check Terraform state
cd infrastructure/terraform
terraform state list
```

Contributions are welcome:

- Fork the repository
- Create a feature branch
- Make changes and add tests
- Submit a pull request
This project is licensed under the MIT License.
Support:

- GitHub Issues: report bugs and feature requests
- Documentation: check the /docs folder