AWS Deployment Journey: From Zero to Running Instance¶
Complete record of every decision and step taken to move the 2Sigma platform from local development (SageMaker) to AWS. Written so anyone can understand the reasoning and reproduce it.
Context¶
- Project: 2Sigma — FastAPI backend + Next.js frontend + PostgreSQL
- Development environment: AWS SageMaker (where code is written)
- Staging environment: AWS EC2 (where manager reviews the app)
- Two separate git repos:
`ai-tutor-backend` (GitHub) and `ai-tutor-ui` (GitHub)
Step 1: Architecture Decisions¶
Should Postgres and Backend share an instance?¶
Decision: Yes, for now. Separate later.
Reasoning: Budget is tight, max ~25 concurrent users, and the heavy I/O (LLM calls) happens externally on Anthropic/Bedrock — not on our server. A single instance handles this load comfortably.
Should we use RDS for Postgres?¶
Decision: No, run Postgres in Docker on the same EC2 instance.
Reasoning: RDS adds ~$15-30/mo. For a staging environment with low traffic, a containerized Postgres with daily S3 backups is sufficient. The setup is designed so switching to RDS later requires changing one environment variable and removing one service from docker-compose.
Should UI and Backend be on the same instance?¶
Decision: Yes. Nginx reverse proxy routes traffic:
- /api/* → FastAPI backend (:9898)
- /* → Next.js frontend (:3000)
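Once the stack is deployed (Step 6), the split is easy to see from any client — a quick sketch using the staging Elastic IP from Step 4:

```bash
# /api/* is proxied to FastAPI; everything else to Next.js
curl -s http://3.151.25.120/api/v1/courses/
curl -s -o /dev/null -w "%{http_code}\n" http://3.151.25.120/
```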
What environment is this?¶
Decision: Staging (not dev, not test).
- Dev = SageMaker (where we write code)
- Staging = EC2 (production-like, manager reviews here)
- Test = CI/CD automated tests (not applicable yet)
Step 2: Instance Sizing¶
Instance type: t3.small¶
| Component | RAM Usage |
|---|---|
| Next.js SSR | ~200-300MB |
| FastAPI (uvicorn, 2 workers) | ~100-150MB |
| PostgreSQL | ~200-300MB |
| Nginx | ~10MB |
| OS overhead | ~200MB |
| Total | ~800MB (leaves headroom on 2GB) |
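These figures are estimates. Once the stack is running (Step 6), the actual footprint on the instance can be checked with standard tooling:

```bash
free -h                    # total / used / available on the 2 GB instance
docker stats --no-stream   # per-container memory usage
```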
Storage: 30 GiB gp3¶
The default 8 GiB root volume fills up fast with Docker images + Postgres data; 30 GiB of gp3 costs only ~$2.40/mo.
Credit specification: Standard (not Unlimited)¶
Unlimited can incur surprise charges if CPU spikes. Standard just throttles — safer for a budget setup.
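If you ever need to confirm or change the credit mode after launch, the AWS CLI can do it — a sketch using the `2sigma` profile and the instance ID from Step 4:

```bash
aws ec2 describe-instance-credit-specifications \
  --instance-ids i-09556cfac92c274e1 --profile 2sigma

# switch back to standard if it was ever set to unlimited
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications "InstanceId=i-09556cfac92c274e1,CpuCredits=standard" \
  --profile 2sigma
```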
Monthly cost: ~$18¶
| Item | Cost |
|---|---|
| EC2 t3.small on-demand | ~$15 |
| 30 GiB gp3 EBS | ~$2.40 |
| Elastic IP | $0 (free while attached to running instance) |
| S3 backups | ~$0.50 |
| Total | ~$18/mo |
Step 3: Containerization Setup¶
Files created¶
All deployment files live in ai-tutor-backend/ since the root edu/ directory is not a git repo.
| File | Purpose |
|---|---|
| `ai-tutor-backend/Dockerfile` | FastAPI container — Python 3.11, uvicorn with 2 workers, non-root user, health check |
| `ai-tutor-ui/Dockerfile` | Next.js container — multi-stage build (builder → runner), standalone output, non-root user |
| `ai-tutor-ui/next.config.js` | Added `output: 'standalone'` (required for Docker deployment) |
| `ai-tutor-backend/.dockerignore` | Excludes venv, tests, logs, .env from Docker image |
| `ai-tutor-ui/.dockerignore` | Excludes node_modules, .next, tests from Docker image |
| `deploy/docker-compose.yml` | 4 services: postgres, backend, frontend, nginx. RDS migration comments inline. |
| `deploy/nginx/nginx.conf` | Reverse proxy with rate limiting, security headers, gzip, LLM-friendly timeouts (120s), SSL template |
| `deploy/scripts/backup-postgres.sh` | Daily pg_dump → gzip → S3 with configurable retention |
| `deploy/.env.production.example` | All config vars with RDS section ready to uncomment |
Key design decisions in Docker setup¶
- Nginx timeouts set to 120s — LLM API calls (Anthropic/Bedrock) can take a long time; default 60s would cause timeouts
- Postgres port bound to 127.0.0.1 only — not exposed to the internet, only accessible within the Docker network and from the host
- Non-root users in all containers — security best practice
- Health checks on every service — Docker Compose waits for dependencies to be healthy before starting dependent services
- docker-compose.yml has inline comments marking exactly what to change when migrating to RDS
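Two of these decisions — the Postgres binding and the health checks — are easy to spot-check on the instance once the stack is up. A sketch, run from `~/ai-tutor-backend/deploy`:

```bash
# Postgres should be listening on 127.0.0.1 only, never 0.0.0.0
sudo ss -tlnp | grep 5432

# every service with a health check should report (healthy)
docker compose --env-file .env.production ps
```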
Step 4: Infrastructure as Code (Terraform)¶
Why Terraform?¶
Manager requested it. While overkill for a single EC2 instance, it provides:
- Reproducible infrastructure (destroy and recreate identically)
- Version-controlled infra changes
- Easier to extend when adding RDS, ALB, etc. later
Terraform files (in ai-tutor-backend/terraform/)¶
| File | Purpose |
|---|---|
| `provider.tf` | AWS provider, region us-east-2 |
| `variables.tf` | Configurable inputs: region, instance type, volume size, SSH CIDRs |
| `main.tf` | EC2 instance, security group, ED25519 key pair (auto-generated), Elastic IP |
| `outputs.tf` | Public IP, SSH command, app URL |
| `terraform.tfvars.example` | Example config to copy and customize |
| `.gitignore` | Excludes .tfstate, .pem files, terraform.tfvars (secrets) |
What Terraform creates (5 resources)¶
- TLS Private Key — ED25519 key pair generated locally, saved as `ai-tutor-staging.pem`
- AWS Key Pair — public key uploaded to AWS for SSH access
- Security Group — SSH (22), HTTP (80), HTTPS (443) inbound; all outbound allowed
- EC2 Instance — t3.small, Amazon Linux 2023, 30 GiB gp3, termination protection enabled, IMDSv2 required, standard credit specification
- Elastic IP — stable public IP that doesn't change on stop/start
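After `terraform apply`, what exists and the generated outputs can be confirmed with standard Terraform commands, run with the same profile:

```bash
cd ai-tutor-backend/terraform
AWS_PROFILE=2sigma terraform state list   # every resource Terraform manages
AWS_PROFILE=2sigma terraform output       # public IP, SSH command, app URL
```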
AWS profile setup¶
We use a named AWS CLI profile since the office AWS account is the default:
aws configure --profile 2sigma
# Access Key ID: (from IAM → Security credentials → Create access key → CLI use case)
# Secret Access Key: (shown once during creation)
# Region: us-east-2
# Output: json
All Terraform commands use this profile. It can be passed per command (as in the execution log below) or exported once for the shell session, for example:
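```bash
# per-command prefix (used throughout this document)
AWS_PROFILE=2sigma terraform plan

# or export it once per shell session — equivalent, since the AWS provider honors AWS_PROFILE
export AWS_PROFILE=2sigma
terraform plan
```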
First-time EC2 creation (manual, then replaced)¶
- Initially created an EC2 instance manually through the AWS Console
- Manager requested Terraform, so:
  - Terminated the manual instance
  - Deleted the leftover key pair (`ai-tutor-staging`) that conflicted
  - Ran `terraform apply` to create identical infrastructure as code
Terraform execution¶
cd ai-tutor-backend/terraform
cp terraform.tfvars.example terraform.tfvars
AWS_PROFILE=2sigma terraform init # Downloaded AWS provider
AWS_PROFILE=2sigma terraform plan # Previewed 6 resources
AWS_PROFILE=2sigma terraform apply # Created everything in ~20 seconds
Current instance details¶
| Property | Value |
|---|---|
| Instance ID | i-09556cfac92c274e1 |
| Public IP (Elastic) | 3.151.25.120 |
| Region | us-east-2 (Ohio) |
| AMI | Amazon Linux 2023 (ami-0afa4cfe74f8b2d38) |
| SSH command | ssh -i terraform/ai-tutor-staging.pem ec2-user@3.151.25.120 |
Step 5: EC2 Setup and Docker Installation¶
After Terraform created the instance, we connected via SSH and installed the runtime dependencies:
ssh -i terraform/ai-tutor-staging.pem ec2-user@3.151.25.120
# Install Docker and Docker Compose
sudo dnf update -y
sudo dnf install -y docker
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker ec2-user
# Install Docker Compose plugin
sudo mkdir -p /usr/local/lib/docker/cli-plugins
sudo curl -SL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64 \
-o /usr/local/lib/docker/cli-plugins/docker-compose
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
# Install Docker Buildx (required for multi-stage builds)
sudo curl -SL https://github.com/docker/buildx/releases/download/v0.20.1/buildx-v0.20.1.linux-amd64 \
  -o /usr/local/lib/docker/cli-plugins/docker-buildx
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx
# Install Git and clone both repos
sudo dnf install -y git
cd ~ && git clone https://github.com/AI-Teacher-POC/ai-tutor-backend.git
cd ~ && git clone https://github.com/AI-Teacher-POC/ai-tutor-ui.git
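Note that the `usermod -aG docker` group change only takes effect on a new login, so log out and back in (or run `newgrp docker`) before using Docker without sudo. A quick sanity check of the toolchain:

```bash
docker --version
docker compose version
docker buildx version
git --version
```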
Step 6: Configuration and First Deploy¶
Environment file¶
Created ~/ai-tutor-backend/deploy/.env.production from the example template, with generated secrets:
cd ~/ai-tutor-backend/deploy
cp .env.production.example .env.production
# Edited with real values: POSTGRES_PASSWORD, SECRET_KEY, CORS_ORIGINS, Bedrock config
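The template leaves the secrets empty; one way to generate strong values (an example, not mandated by the template) is:

```bash
# e.g. for POSTGRES_PASSWORD and SECRET_KEY
openssl rand -hex 32
```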
Build and start¶
Important: The --env-file .env.production flag is required on every docker compose command because the env file is not the default .env.
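The initial build and start, from the deploy directory (the same command the manual fallback at the end of this document uses):

```bash
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d --build
```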
Issues encountered and fixed during deployment¶
- Missing `remark-breaks` dependency — `StandardChatBubble.tsx` imported `remark-breaks` but it was never added to `package.json`. Fixed by adding it to dependencies.
- Missing `public` directory — the Dockerfile's `COPY --from=builder /app/public ./public` failed because the project has no `public/` directory. Fixed by adding `mkdir -p public` before `npm run build` in the Dockerfile.
- Next.js standalone server binding to wrong host — the standalone `server.js` was binding to the container's hostname instead of `0.0.0.0`, causing the Docker health check to fail on `localhost:3000`. Fixed by adding `HOSTNAME=0.0.0.0` and `PORT=3000` environment variables in the Dockerfile.
Database migrations¶
All 22 migrations ran successfully, creating the full database schema.
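The migrations were applied inside the backend container — the same command the CI workflow and the manual fallback use:

```bash
docker compose --env-file .env.production exec -T backend alembic upgrade head
```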
Verification¶
| Endpoint | Result |
|---|---|
| `http://3.151.25.120/health` | `{"status":"healthy"}` |
| `http://3.151.25.120/` | 200 OK (frontend loads) |
| `http://3.151.25.120/api/v1/courses/` | `[]` (API working, no courses seeded) |
All 4 containers running and healthy:
| Container | Status |
|---|---|
| postgres | Healthy |
| backend | Healthy |
| frontend | Healthy |
| nginx | Running |
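These checks boil down to two commands worth keeping handy:

```bash
# on the instance, from ~/ai-tutor-backend/deploy
docker compose --env-file .env.production ps

# from anywhere
curl -s http://3.151.25.120/health
```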
Step 7: CI/CD with GitHub Actions¶
Why automate?¶
The manual deploy flow (SSH → pull → rebuild) takes 2-3 minutes of hands-on work per deploy. GitHub Actions removes all of it — just merge into staging and the EC2 environment updates automatically.
Branch strategy¶
- `main`: Day-to-day development. Push freely — no deploy triggered.
- `staging`: Protected deployment branch. When code is pushed or merged here, GitHub Actions auto-deploys to EC2.
How to deploy¶
- Option A — Merge `main` into `staging` (recommended)
- Option B — Push directly to `staging` (for quick fixes)
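A sketch of both options from a local checkout (assumes `origin` points at the GitHub repo):

```bash
# Option A — merge main into staging
git checkout staging
git pull origin staging
git merge main
git push origin staging

# Option B — commit a quick fix directly on staging
git checkout staging
git commit -am "fix: ..."
git push origin staging
```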
Both options trigger the same GitHub Actions workflow.
What the workflows do¶
Each repo has a .github/workflows/deploy.yml that triggers on push to staging:
Backend workflow (~17s):
1. SSHs into EC2
2. Checks out and pulls staging branch
3. Rebuilds the backend container
4. Runs Alembic database migrations
5. Verifies the health endpoint responds
Frontend workflow (~2-3 min):
1. SSHs into EC2
2. Checks out and pulls staging branch
3. Rebuilds the frontend container (Next.js build is the slow part)
4. Restarts nginx to pick up changes
5. Verifies the frontend loads
Both workflows skip doc-only changes (*.md, docs/**, tests/**) to avoid unnecessary deploys.
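The exact workflow YAML lives in each repo; roughly, the remote script the backend job runs over SSH is equivalent to this sketch (service name and health path per the setup above, not the literal file):

```bash
cd ~/ai-tutor-backend && git checkout staging && git pull
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d --build backend
docker compose --env-file .env.production exec -T backend alembic upgrade head
curl -fsS http://localhost/health
```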
Smart features¶
- Concurrency control: If you push twice quickly, the first deploy gets cancelled — only the latest code deploys.
- Health checks: Deploy fails visibly in GitHub if the app doesn't respond after rebuild.
- Auto-migrations: Backend workflow runs `alembic upgrade head` on every deploy.
- Path filtering: Changes to docs, tests, or markdown files don't trigger a deploy.
How SSH authentication works¶
The GitHub Actions runner needs to SSH into EC2 to deploy. Here's how the key chain works:
- Terraform generated an ED25519 key pair during `terraform apply`
- Private key saved locally: `terraform/ai-tutor-staging.pem` (gitignored, never committed)
- Public key uploaded to EC2: stored in `~/.ssh/authorized_keys` on the instance
- We copied the private key into GitHub Secrets so the workflow runner can use it:

gh secret set EC2_SSH_KEY --repo AI-Teacher-POC/ai-tutor-backend < terraform/ai-tutor-staging.pem
gh secret set EC2_HOST --repo AI-Teacher-POC/ai-tutor-backend <<< "3.151.25.120"
gh secret set EC2_SSH_KEY --repo AI-Teacher-POC/ai-tutor-ui < terraform/ai-tutor-staging.pem
gh secret set EC2_HOST --repo AI-Teacher-POC/ai-tutor-ui <<< "3.151.25.120"
During a deploy, the workflow runner writes the secret to a temp file, SSHs into EC2, and the instance verifies the private key matches its public key. The runner is destroyed after the job finishes.
The key lives in 3 places:
| Location | What's stored | How it got there |
|---|---|---|
| `terraform/ai-tutor-staging.pem` (local, gitignored) | Private key | Terraform generated it |
| EC2 `~/.ssh/authorized_keys` | Public key | Terraform uploaded it via `aws_key_pair` |
| GitHub Secrets (`EC2_SSH_KEY`) on both repos | Private key (encrypted) | We ran `gh secret set` |
GitHub secrets (configured on both repos)¶
| Secret | Purpose |
|---|---|
| `EC2_SSH_KEY` | Contents of `ai-tutor-staging.pem` (SSH private key) |
| `EC2_HOST` | `3.151.25.120` (EC2 Elastic IP) |
These are set in GitHub → repo → Settings → Secrets and variables → Actions. You can also set them via CLI with gh secret set.
EC2 branch tracking¶
The repos on EC2 are checked out to staging (not main). This is intentional — the workflows pull the staging branch on deploy.
Development → Staging Workflow (Manual Fallback)¶
If GitHub Actions is unavailable or you need to deploy manually:
ssh -i terraform/ai-tutor-staging.pem ec2-user@3.151.25.120
cd ~/ai-tutor-backend && git checkout staging && git pull
cd ~/ai-tutor-ui && git checkout staging && git pull
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d --build
docker compose --env-file .env.production exec -T backend alembic upgrade head
File Map¶
All deployment-related files and where they live:
ai-tutor-backend/
Dockerfile # Backend container definition
.dockerignore # Files excluded from Docker build
.github/workflows/deploy.yml # GitHub Actions: auto-deploy backend on push
deploy/
docker-compose.yml # All 4 services (postgres, backend, frontend, nginx)
.env.production.example # Environment variable template
nginx/
nginx.conf # Reverse proxy configuration
scripts/
backup-postgres.sh # Postgres → S3 backup script
terraform/
provider.tf # AWS provider config
variables.tf # Input variables
main.tf # Infrastructure resources
outputs.tf # Output values (IP, SSH command)
terraform.tfvars.example # Config template
.gitignore # Excludes state, keys, secrets
ai-tutor-staging.pem # SSH key (gitignored, local only)
terraform.tfstate # State file (gitignored, local only)
docs/
aws-journey.md # This document
aws-deployment-guide.md # Operational guide (deploy, backup, restore)
aws-ec2-setup.md # Terraform reference and EC2 specs
ai-tutor-ui/
Dockerfile # Frontend container definition
.dockerignore # Files excluded from Docker build
.github/workflows/deploy.yml # GitHub Actions: auto-deploy frontend on push
next.config.js # Added output: 'standalone' for Docker
Future Expansion Path¶
| Phase | Trigger | Change | Cost Impact |
|---|---|---|---|
| 1 | Deploy slowing down the server | Move Docker builds to GitHub Actions + GHCR | +$0 |
| 2 | 50+ users or want managed backups | Move Postgres to RDS | +$15-30/mo |
| 3 | Going to production | Add SSL + custom domain | +$0 (Let's Encrypt is free) |
| 4 | Need CDN / faster page loads | Move frontend to Vercel | +$0 (free tier) |
| 5 | 100+ concurrent users | Move to ECS Fargate with ALB | +$60-130/mo |
| 6 | DB queries become bottleneck | Add ElastiCache (Redis) | +$15-25/mo |
Each phase is independent — do them in any order based on what you need first.
Current vs Future CI/CD approach¶
Current (SSH Deploy — what we have now): push to `staging` → GitHub Actions SSHs into EC2 → `git pull` → Docker builds the images on the instance → containers restart.
Simple and works, but the Docker build runs on EC2 (2GB RAM). During a deploy, the server is under heavy load — especially the ~2 min Next.js frontend build.
Future (Registry Deploy — industry standard):
Push to staging → GitHub Actions builds image (on GitHub's 7GB runner) → pushes to GHCR → SSHs into EC2 → docker pull → restart
EC2 never builds anything — it only pulls a pre-built image and runs it. Deploys go from 2-3 min to ~10 seconds with zero impact on the running app. See aws-deployment-guide.md Phase 1 for implementation steps.