AWS Deployment Journey: From Zero to Running Instance¶
Complete record of every decision and step taken to move the 2Sigma platform from local development (SageMaker) to AWS. Written so anyone can understand the reasoning and reproduce it.
Context¶
- Project: 2Sigma — FastAPI backend + Next.js frontend + PostgreSQL
- Development environment: AWS SageMaker (where code is written)
- Staging environment: AWS EC2 (where manager reviews the app)
- Two separate git repos:
`ai-tutor-backend` (GitHub) and `ai-tutor-ui` (GitHub)
Step 1: Architecture Decisions¶
Should Postgres and Backend share an instance?¶
Decision: Yes, for now. Separate later.
Reasoning: Budget is tight, max ~25 concurrent users, and the heavy I/O (LLM calls) happens externally on Anthropic/Bedrock — not on our server. A single instance handles this load comfortably.
Should we use RDS for Postgres?¶
Decision: No, run Postgres in Docker on the same EC2 instance.
Reasoning: RDS adds ~$15-30/mo. For a staging environment with low traffic, a containerized Postgres with daily S3 backups is sufficient. The setup is designed so switching to RDS later requires changing one environment variable and removing one service from docker-compose.
Should UI and Backend be on the same instance?¶
Decision: Yes. Nginx reverse proxy routes traffic:
- /api/* → FastAPI backend (:9898)
- /* → Next.js frontend (:3000)
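Once the stack is deployed (Step 6), the split is easy to see from any client — a quick sketch using the staging Elastic IP from Step 4:

```bash
# /api/* is proxied to FastAPI; everything else to Next.js
curl -s http://3.151.25.120/api/v1/courses/
curl -s -o /dev/null -w "%{http_code}\n" http://3.151.25.120/
```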
What environment is this?¶
Decision: Staging (not dev, not test).
- Dev = SageMaker (where we write code)
- Staging = EC2 (production-like, manager reviews here)
- Test = CI/CD automated tests (not applicable yet)
Step 2: Instance Sizing¶
Instance type: t3.small¶
| Component | RAM Usage |
|---|---|
| Next.js SSR | ~200-300MB |
| FastAPI (uvicorn, 2 workers) | ~100-150MB |
| PostgreSQL | ~200-300MB |
| Nginx | ~10MB |
| OS overhead | ~200MB |
| Total | ~800MB (leaves headroom on 2GB) |
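These figures are estimates. Once the stack is running (Step 6), the actual footprint on the instance can be checked with standard tooling:

```bash
free -h                    # total / used / available on the 2 GB instance
docker stats --no-stream   # per-container memory usage
```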
Storage: 30 GiB gp3¶
The default 8 GiB root volume fills up fast with Docker images + Postgres data; 30 GiB of gp3 costs only ~$2.40/mo.
Credit specification: Standard (not Unlimited)¶
Unlimited can incur surprise charges if CPU spikes. Standard just throttles — safer for a budget setup.
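If you ever need to confirm or change the credit mode after launch, the AWS CLI can do it — a sketch using the `2sigma` profile and the instance ID from Step 4:

```bash
aws ec2 describe-instance-credit-specifications \
  --instance-ids i-09556cfac92c274e1 --profile 2sigma

# switch back to standard if it was ever set to unlimited
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications "InstanceId=i-09556cfac92c274e1,CpuCredits=standard" \
  --profile 2sigma
```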
Monthly cost: ~$18¶
| Item | Cost |
|---|---|
| EC2 t3.small on-demand | ~$15 |
| 30 GiB gp3 EBS | ~$2.40 |
| Elastic IP | $0 (free while attached to running instance) |
| S3 backups | ~$0.50 |
| Total | ~$18/mo |
Step 3: Containerization Setup¶
Files created¶
All deployment files live in ai-tutor-backend/ since the root edu/ directory is not a git repo.
| File | Purpose |
|---|---|
| `ai-tutor-backend/Dockerfile` | FastAPI container — Python 3.11, uvicorn with 2 workers, non-root user, health check |
| `ai-tutor-ui/Dockerfile` | Next.js container — multi-stage build (builder → runner), standalone output, non-root user |
| `ai-tutor-ui/next.config.js` | Added `output: 'standalone'` (required for Docker deployment) |
| `ai-tutor-backend/.dockerignore` | Excludes venv, tests, logs, .env from Docker image |
| `ai-tutor-ui/.dockerignore` | Excludes node_modules, .next, tests from Docker image |
| `deploy/docker-compose.yml` | 4 services: postgres, backend, frontend, nginx. RDS migration comments inline. |
| `deploy/nginx/nginx.conf` | Reverse proxy with rate limiting, security headers, gzip, LLM-friendly timeouts (120s), SSL template |
| `deploy/scripts/backup-postgres.sh` | Daily pg_dump → gzip → S3 with configurable retention |
| `deploy/.env.production.example` | All config vars with RDS section ready to uncomment |
Key design decisions in Docker setup¶
- Nginx timeouts set to 120s — LLM API calls (Anthropic/Bedrock) can take a long time; default 60s would cause timeouts
- Postgres port bound to 127.0.0.1 only — not exposed to the internet, only accessible within the Docker network and from the host
- Non-root users in all containers — security best practice
- Health checks on every service — Docker Compose waits for dependencies to be healthy before starting dependent services
- docker-compose.yml has inline comments marking exactly what to change when migrating to RDS
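Two of these decisions — the Postgres binding and the health checks — are easy to spot-check on the instance once the stack is up. A sketch, run from `~/ai-tutor-backend/deploy`:

```bash
# Postgres should be listening on 127.0.0.1 only, never 0.0.0.0
sudo ss -tlnp | grep 5432

# every service with a health check should report (healthy)
docker compose --env-file .env.production ps
```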
Step 4: Infrastructure as Code (Terraform)¶
Why Terraform?¶
Manager requested it. While overkill for a single EC2 instance, it provides:
- Reproducible infrastructure (destroy and recreate identically)
- Version-controlled infra changes
- Easier to extend when adding RDS, ALB, etc. later
Terraform files (in ai-tutor-backend/terraform/)¶
| File | Purpose |
|---|---|
| `provider.tf` | AWS provider, region us-east-2 |
| `variables.tf` | Configurable inputs: region, instance type, volume size, SSH CIDRs |
| `main.tf` | EC2 instance, security group, ED25519 key pair (auto-generated), Elastic IP |
| `outputs.tf` | Public IP, SSH command, app URL |
| `terraform.tfvars.example` | Example config to copy and customize |
| `.gitignore` | Excludes .tfstate, .pem files, terraform.tfvars (secrets) |
What Terraform creates (5 resources)¶
- TLS Private Key — ED25519 key pair generated locally, saved as `ai-tutor-staging.pem`
- AWS Key Pair — public key uploaded to AWS for SSH access
- Security Group — SSH (22), HTTP (80), HTTPS (443) inbound; all outbound allowed
- EC2 Instance — t3.small, Amazon Linux 2023, 30 GiB gp3, termination protection enabled, IMDSv2 required, standard credit specification
- Elastic IP — stable public IP that doesn't change on stop/start
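After `terraform apply`, what exists and the generated outputs can be confirmed with standard Terraform commands, run with the same profile:

```bash
cd ai-tutor-backend/terraform
AWS_PROFILE=2sigma terraform state list   # every resource Terraform manages
AWS_PROFILE=2sigma terraform output       # public IP, SSH command, app URL
```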
AWS profile setup¶
We use a named AWS CLI profile since the office AWS account is the default:
aws configure --profile 2sigma
# Access Key ID: (from IAM → Security credentials → Create access key → CLI use case)
# Secret Access Key: (shown once during creation)
# Region: us-east-2
# Output: json
All Terraform commands use this profile. It can be passed per command (as in the execution log below) or exported once for the shell session, for example:
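```bash
# per-command prefix (used throughout this document)
AWS_PROFILE=2sigma terraform plan

# or export it once per shell session — equivalent, since the AWS provider honors AWS_PROFILE
export AWS_PROFILE=2sigma
terraform plan
```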
First-time EC2 creation (manual, then replaced)¶
- Initially created an EC2 instance manually through the AWS Console
- Manager requested Terraform, so:
  - Terminated the manual instance
  - Deleted the leftover key pair (`ai-tutor-staging`) that conflicted
  - Ran `terraform apply` to create identical infrastructure as code
Terraform execution¶
cd ai-tutor-backend/terraform
cp terraform.tfvars.example terraform.tfvars
AWS_PROFILE=2sigma terraform init # Downloaded AWS provider
AWS_PROFILE=2sigma terraform plan # Previewed 6 resources
AWS_PROFILE=2sigma terraform apply # Created everything in ~20 seconds
Current instance details¶
| Property | Value |
|---|---|
| Instance ID | i-09556cfac92c274e1 |
| Public IP (Elastic) | 3.151.25.120 |
| Region | us-east-2 (Ohio) |
| AMI | Amazon Linux 2023 (ami-0afa4cfe74f8b2d38) |
| SSH command | ssh -i terraform/ai-tutor-staging.pem ec2-user@3.151.25.120 |
Step 5: EC2 Setup and Docker Installation¶
After Terraform created the instance, we connected via SSH and installed the runtime dependencies:
ssh -i terraform/ai-tutor-staging.pem ec2-user@3.151.25.120
# Install Docker and Docker Compose
sudo dnf update -y
sudo dnf install -y docker
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker ec2-user
# Install Docker Compose plugin
sudo mkdir -p /usr/local/lib/docker/cli-plugins
sudo curl -SL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64 \
-o /usr/local/lib/docker/cli-plugins/docker-compose
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
# Install Docker Buildx (required for multi-stage builds)
sudo curl -SL https://github.com/docker/buildx/releases/download/v0.20.1/buildx-v0.20.1.linux-amd64 \
  -o /usr/local/lib/docker/cli-plugins/docker-buildx
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx
# Install Git and clone both repos
sudo dnf install -y git
cd ~ && git clone https://github.com/AI-Teacher-POC/ai-tutor-backend.git
cd ~ && git clone https://github.com/AI-Teacher-POC/ai-tutor-ui.git
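Note that the `usermod -aG docker` group change only takes effect on a new login, so log out and back in (or run `newgrp docker`) before using Docker without sudo. A quick sanity check of the toolchain:

```bash
docker --version
docker compose version
docker buildx version
git --version
```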
Step 6: Configuration and First Deploy¶
Environment file¶
Created ~/ai-tutor-backend/deploy/.env.production from the example template, with generated secrets:
cd ~/ai-tutor-backend/deploy
cp .env.production.example .env.production
# Edited with real values: POSTGRES_PASSWORD, SECRET_KEY, CORS_ORIGINS, Bedrock config
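The template leaves the secrets empty; one way to generate strong values (an example, not mandated by the template) is:

```bash
# e.g. for POSTGRES_PASSWORD and SECRET_KEY
openssl rand -hex 32
```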
Build and start¶
Important: The --env-file .env.production flag is required on every docker compose command because the env file is not the default .env.
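The initial build and start, from the deploy directory (the same command the manual fallback at the end of this document uses):

```bash
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d --build
```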
Issues encountered and fixed during deployment¶
- Missing `remark-breaks` dependency — `StandardChatBubble.tsx` imported `remark-breaks` but it was never added to `package.json`. Fixed by adding it to dependencies.
- Missing `public` directory — the Dockerfile's `COPY --from=builder /app/public ./public` failed because the project has no `public/` directory. Fixed by adding `mkdir -p public` before `npm run build` in the Dockerfile.
- Next.js standalone server binding to wrong host — the standalone `server.js` was binding to the container's hostname instead of `0.0.0.0`, causing the Docker health check to fail on `localhost:3000`. Fixed by adding `HOSTNAME=0.0.0.0` and `PORT=3000` environment variables in the Dockerfile.
Database migrations¶
All 22 migrations ran successfully, creating the full database schema.
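The migrations were applied inside the backend container — the same command the CI workflow and the manual fallback use:

```bash
docker compose --env-file .env.production exec -T backend alembic upgrade head
```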
Verification¶
| Endpoint | Result |
|---|---|
| `http://3.151.25.120/health` | `{"status":"healthy"}` |
| `http://3.151.25.120/` | 200 OK (frontend loads) |
| `http://3.151.25.120/api/v1/courses/` | `[]` (API working, no courses seeded) |
All 4 containers running and healthy:
| Container | Status |
|---|---|
| postgres | Healthy |
| backend | Healthy |
| frontend | Healthy |
| nginx | Running |
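These checks boil down to two commands worth keeping handy:

```bash
# on the instance, from ~/ai-tutor-backend/deploy
docker compose --env-file .env.production ps

# from anywhere
curl -s http://3.151.25.120/health
```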
Step 7: CI/CD with GitHub Actions¶
Why automate?¶
The manual deploy flow (SSH → pull → rebuild) takes 2-3 minutes of hands-on work per deploy. GitHub Actions removes all of it — just merge into staging and the EC2 environment updates automatically.
Branch strategy¶
- `main`: Day-to-day development. Push freely — no deploy triggered.
- `staging`: Protected deployment branch. When code is pushed or merged here, GitHub Actions auto-deploys to EC2.
How to deploy¶
- Option A — Merge `main` into `staging` (recommended)
- Option B — Push directly to `staging` (for quick fixes)
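A sketch of both options from a local checkout (assumes `origin` points at the GitHub repo):

```bash
# Option A — merge main into staging
git checkout staging
git pull origin staging
git merge main
git push origin staging

# Option B — commit a quick fix directly on staging
git checkout staging
git commit -am "fix: ..."
git push origin staging
```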
Both options trigger the same GitHub Actions workflow.
What the workflows do¶
Each repo has a .github/workflows/deploy.yml that triggers on push to staging:
Backend workflow (~17s):
1. SSHs into EC2
2. Checks out and pulls staging branch
3. Rebuilds the backend container
4. Runs Alembic database migrations
5. Verifies the health endpoint responds
Frontend workflow (~2-3 min):
1. SSHs into EC2
2. Checks out and pulls staging branch
3. Rebuilds the frontend container (Next.js build is the slow part)
4. Restarts nginx to pick up changes
5. Verifies the frontend loads
Both workflows skip doc-only changes (*.md, docs/**, tests/**) to avoid unnecessary deploys.
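The exact workflow YAML lives in each repo; roughly, the remote script the backend job runs over SSH is equivalent to this sketch (service name and health path per the setup above, not the literal file):

```bash
cd ~/ai-tutor-backend && git checkout staging && git pull
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d --build backend
docker compose --env-file .env.production exec -T backend alembic upgrade head
curl -fsS http://localhost/health
```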
Smart features¶
- Concurrency control: If you push twice quickly, the first deploy gets cancelled — only the latest code deploys.
- Health checks: Deploy fails visibly in GitHub if the app doesn't respond after rebuild.
- Auto-migrations: Backend workflow runs `alembic upgrade head` on every deploy.
- Path filtering: Changes to docs, tests, or markdown files don't trigger a deploy.
How SSH authentication works¶
The GitHub Actions runner needs to SSH into EC2 to deploy. Here's how the key chain works:
- Terraform generated an ED25519 key pair during `terraform apply`
- Private key saved locally: `terraform/ai-tutor-staging.pem` (gitignored, never committed)
- Public key uploaded to EC2: stored in `~/.ssh/authorized_keys` on the instance
- We copied the private key into GitHub Secrets so the workflow runner can use it:

gh secret set EC2_SSH_KEY --repo AI-Teacher-POC/ai-tutor-backend < terraform/ai-tutor-staging.pem
gh secret set EC2_HOST --repo AI-Teacher-POC/ai-tutor-backend <<< "3.151.25.120"
gh secret set EC2_SSH_KEY --repo AI-Teacher-POC/ai-tutor-ui < terraform/ai-tutor-staging.pem
gh secret set EC2_HOST --repo AI-Teacher-POC/ai-tutor-ui <<< "3.151.25.120"
During a deploy, the workflow runner writes the secret to a temp file, SSHs into EC2, and the instance verifies the private key matches its public key. The runner is destroyed after the job finishes.
The key lives in 3 places:
| Location | What's stored | How it got there |
|---|---|---|
| `terraform/ai-tutor-staging.pem` (local, gitignored) | Private key | Terraform generated it |
| EC2 `~/.ssh/authorized_keys` | Public key | Terraform uploaded it via `aws_key_pair` |
| GitHub Secrets (`EC2_SSH_KEY`) on both repos | Private key (encrypted) | We ran `gh secret set` |
GitHub secrets (configured on both repos)¶
| Secret | Purpose |
|---|---|
| `EC2_SSH_KEY` | Contents of `ai-tutor-staging.pem` (SSH private key) |
| `EC2_HOST` | `3.151.25.120` (EC2 Elastic IP) |
These are set in GitHub → repo → Settings → Secrets and variables → Actions. You can also set them via CLI with gh secret set.
EC2 branch tracking¶
The repos on EC2 are checked out to staging (not main). This is intentional — the workflows pull the staging branch on deploy.
Development → Staging Workflow (Manual Fallback)¶
If GitHub Actions is unavailable or you need to deploy manually:
ssh -i terraform/ai-tutor-staging.pem ec2-user@3.151.25.120
cd ~/ai-tutor-backend && git checkout staging && git pull
cd ~/ai-tutor-ui && git checkout staging && git pull
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d --build
docker compose --env-file .env.production exec -T backend alembic upgrade head
File Map¶
All deployment-related files and where they live:
ai-tutor-backend/
Dockerfile # Backend container definition
.dockerignore # Files excluded from Docker build
.github/workflows/deploy.yml # GitHub Actions: auto-deploy backend on push
deploy/
docker-compose.yml # All 4 services (postgres, backend, frontend, nginx)
.env.production.example # Environment variable template
nginx/
nginx.conf # Reverse proxy configuration
scripts/
backup-postgres.sh # Postgres → S3 backup script
terraform/
provider.tf # AWS provider config
variables.tf # Input variables
main.tf # Infrastructure resources
outputs.tf # Output values (IP, SSH command)
terraform.tfvars.example # Config template
.gitignore # Excludes state, keys, secrets
ai-tutor-staging.pem # SSH key (gitignored, local only)
terraform.tfstate # State file (gitignored, local only)
docs/
aws-journey.md # This document
aws-deployment-guide.md # Operational guide (deploy, backup, restore)
aws-ec2-setup.md # Terraform reference and EC2 specs
ai-tutor-ui/
Dockerfile # Frontend container definition
.dockerignore # Files excluded from Docker build
.github/workflows/deploy.yml # GitHub Actions: auto-deploy frontend on push
next.config.js # Added output: 'standalone' for Docker
Future Expansion Path¶
| Phase | Trigger | Change | Cost Impact |
|---|---|---|---|
| 1 | Deploy slowing down the server | Move Docker builds to GitHub Actions + GHCR | +$0 |
| 2 | 50+ users or want managed backups | Move Postgres to RDS | +$15-30/mo |
| 3 | Going to production | Add SSL + custom domain | +$0 (Let's Encrypt is free) |
| 4 | Need CDN / faster page loads | Move frontend to Vercel | +$0 (free tier) |
| 5 | 100+ concurrent users | Move to ECS Fargate with ALB | +$60-130/mo |
| 6 | DB queries become bottleneck | Add ElastiCache (Redis) | +$15-25/mo |
Each phase is independent — do them in any order based on what you need first.
Current vs Future CI/CD approach¶
Current (SSH Deploy — what we have now): push to `staging` → GitHub Actions SSHs into EC2 → `git pull` → Docker builds the images on the instance → containers restart.
Simple and works, but the Docker build runs on EC2 (2GB RAM). During a deploy, the server is under heavy load — especially the ~2 min Next.js frontend build.
Future (Registry Deploy — industry standard):
Push to staging → GitHub Actions builds image (on GitHub's 7GB runner) → pushes to GHCR → SSHs into EC2 → docker pull → restart
EC2 never builds anything — it only pulls a pre-built image and runs it. Deploys go from 2-3 min to ~10 seconds with zero impact on the running app. See aws-deployment-guide.md Phase 1 for implementation steps.