Observability Guide

The application uses four complementary tools for observability: Sentry for error tracking and performance, CloudWatch Logs for container log aggregation, CloudWatch Alarms for infrastructure metrics, and UptimeRobot for external availability checks. Together they cover the most common failure modes without requiring any paid services.

Architecture Overview

          ┌────────────────────────────────────────────┐
          │                EC2 Instance                │
          │                                            │
          │  ┌──────────┐  ┌──────────┐  ┌───────┐     │
          │  │ FastAPI  │  │ Next.js  │  │ Nginx │     │
          │  └────┬─────┘  └────┬─────┘  └───┬───┘     │
          │       │             │            │         │
          │       └─────────────┼────────────┘         │
          │                     │                      │
          │           awslogs Docker driver            │
          └─────────────────────┼──────────────────────┘
    ┌───────────────────────────┼───────────────────────────┐
    │                           │                           │
    ▼                           ▼                           ▼
Sentry Cloud             CloudWatch Logs           CloudWatch Alarms
(app exceptions,        (/ai-tutor/backend,        (CPU > 80%,
 slow endpoints,         /ai-tutor/frontend,        status check
 React crashes)          /ai-tutor/nginx)           failures)

UptimeRobot
(external ping every
 5 min from outside
 your infrastructure)

What's Covered vs Not Covered

| Scenario | Covered By | Status |
| --- | --- | --- |
| Backend exceptions | Sentry | Active |
| Frontend React crashes | Sentry | Active |
| Slow endpoints | Sentry Performance | Active |
| Site completely down | UptimeRobot | Active |
| Container logs | CloudWatch Logs | Active |
| CPU spikes | CloudWatch Alarms | Active |
| EC2 status check failure | CloudWatch Alarms | Active |
| Nginx errors | CloudWatch Logs (nginx container) | Active |
| Database errors | Sentry (captures SQLAlchemy errors) | Active |
| Container OOM/crash | Docker healthchecks + UptimeRobot | Partial |
| Memory usage alerts | Not configured | Not yet |

Sentry (Error Tracking)

What It Is

Sentry is an error tracking service that captures exceptions, records request context, and measures performance. The free tier is sufficient for a small app.

Free Tier Limits

  • 5,000 errors/month
  • 7-day retention
  • 1 user
  • Email alerts only (no Slack on free)

Backend Integration (FastAPI)

Sentry is initialized in app/main.py during the lifespan startup, but only when SENTRY_DSN is set. This means you can run locally without Sentry by simply omitting the variable.
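The conditional startup can be sketched like this (a simplified standalone version of the pattern; the real code lives in the lifespan handler and reads these values via app/config.py):

```python
import os


def init_sentry(dsn: str = "") -> bool:
    """Initialize Sentry only when a DSN is configured; return True if enabled."""
    dsn = dsn or os.getenv("SENTRY_DSN", "")
    if not dsn:
        return False  # local dev: no DSN means Sentry stays off, nothing else changes
    import sentry_sdk  # imported lazily so local runs don't need the package

    sentry_sdk.init(
        dsn=dsn,
        traces_sample_rate=float(os.getenv("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
        environment=os.getenv("SENTRY_ENVIRONMENT") or os.getenv("ENVIRONMENT", "development"),
    )
    return True
```

Because the check is on the DSN alone, there is no separate "enable Sentry" flag to keep in sync.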

What gets captured automatically:

  • Unhandled exceptions in any route handler
  • Request context (URL, method, headers, user ID if authenticated)
  • SQLAlchemy query errors
  • Performance traces for a configurable percentage of requests

Files involved:

| File | Role |
| --- | --- |
| app/main.py | Sentry initialization in lifespan |
| app/config.py | Settings: SENTRY_DSN, SENTRY_TRACES_SAMPLE_RATE, SENTRY_ENVIRONMENT |
| requirements.txt | sentry-sdk[fastapi]>=1.40.0 |

Frontend Integration (Next.js 14)

The frontend uses @sentry/nextjs, which handles three separate runtimes: server-side rendering, edge functions, and the browser. Source maps are uploaded during build so stack traces in Sentry show your original TypeScript rather than minified output.

Error boundaries (error.tsx and global-error.tsx) call Sentry.captureException so React rendering errors are reported even when they don't propagate to the server.

Files involved:

| File | Role |
| --- | --- |
| sentry.server.config.ts | Server-side Sentry init |
| sentry.edge.config.ts | Edge runtime Sentry init |
| instrumentation.ts | Next.js instrumentation hook (loads correct config per runtime) |
| instrumentation-client.ts | Client-side Sentry init with browser tracing |
| next.config.js | Wrapped with withSentryConfig() |
| app/error.tsx | Calls Sentry.captureException |
| app/global-error.tsx | Calls Sentry.captureException |

Environment Variables

| Variable | Where | Purpose |
| --- | --- | --- |
| SENTRY_DSN | Backend .env | Backend Sentry project DSN |
| SENTRY_TRACES_SAMPLE_RATE | Backend .env | Fraction of transactions to trace (0.0-1.0, default 0.1) |
| SENTRY_ENVIRONMENT | Backend .env | Override environment name (optional, falls back to ENVIRONMENT) |
| NEXT_PUBLIC_SENTRY_DSN | Frontend .env.local | Frontend Sentry project DSN |
| NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE | Frontend .env.local | Fraction of transactions to trace (default 0.1) |

Sentry Dashboard

  • URL: https://sentry.io (org: 2sigma)
  • Projects: python-fastapi (backend), javascript-nextjs (frontend)
  • Issues page: errors grouped by type and stack trace
  • Performance page: slow endpoints and page loads, sorted by p95 latency

Disabling Sentry

Set SENTRY_DSN to an empty string or remove it entirely. Sentry is only initialized when the DSN is present, so there's no code to change.

CloudWatch Logs

What It Is

AWS CloudWatch Logs collects and stores container output. The free tier covers 5 GB/month, which is more than enough for a small app.

How It Works

Docker Compose uses the awslogs logging driver instead of the default json-file driver. Each container ships its stdout and stderr directly to a CloudWatch Log Group in us-east-2. There's no log agent to manage.
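Per service, the driver configuration might look like this (a hypothetical docker-compose.yml excerpt; the option names are the standard awslogs driver options, and the region and log group match this guide):

```yaml
# Hypothetical excerpt: route one container's stdout/stderr to CloudWatch.
services:
  backend:
    logging:
      driver: awslogs
      options:
        awslogs-region: us-east-2
        awslogs-group: /ai-tutor/backend
        awslogs-create-group: "true"   # create the group on first run if missing
```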

Log Groups

| Log Group | Container | What's In It |
| --- | --- | --- |
| /ai-tutor/backend | FastAPI | API request logs, app errors, DB query logs |
| /ai-tutor/frontend | Next.js | SSR logs, build output, server-side errors |
| /ai-tutor/nginx | Nginx | Access logs, proxy errors, 502/504 errors |

Postgres is intentionally excluded from CloudWatch to avoid shipping potentially sensitive query data off the instance. Postgres logs stay in the container and are accessible via docker compose logs postgres.

IAM Permissions

The EC2 instance has an IAM role (ai-tutor-staging-ec2-role) with a policy granting write access to /ai-tutor/* log groups. This is managed in terraform/main.tf and attached to the instance via an instance profile.
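The attached policy is shaped roughly like this (an illustrative sketch, not the exact document in terraform/main.tf; the actions are the standard CloudWatch Logs write actions, and the account ID is wildcarded):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-2:*:log-group:/ai-tutor/*"
    }
  ]
}
```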

Viewing Logs

AWS Console:

  1. Go to CloudWatch → Log Groups
  2. Select /ai-tutor/backend (or frontend/nginx)
  3. Click a log stream to view recent output

AWS CLI:

# Stream logs in real time
aws logs tail /ai-tutor/backend --follow --profile 2sigma

# Filter for errors
aws logs filter-log-events \
  --log-group-name /ai-tutor/backend \
  --filter-pattern "ERROR" \
  --profile 2sigma

Searching Logs (CloudWatch Logs Insights)

Go to CloudWatch → Logs Insights, select a log group, and run queries like:

Find recent errors:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

Count 500 responses in 5-minute buckets:

fields @timestamp, @message
| filter @message like /500/
| stats count() by bin(5m)

CloudWatch Alarms

What's Configured

| Alarm | Metric | Threshold | What It Means |
| --- | --- | --- | --- |
| ai-tutor-staging-cpu-high | CPUUtilization | > 80% for 10 min | Server under heavy load |
| ai-tutor-staging-status-check | StatusCheckFailed | Any failure | EC2 instance unreachable or unhealthy |

Alert Actions

Both alarms currently have no notification actions (alarm_actions = []). They'll show as ALARM in the AWS Console but won't send emails or pages.

To add email alerts:

  1. Create an SNS topic in AWS Console → SNS → Topics → Create topic
  2. Subscribe your email to the topic (confirm the subscription email)
  3. Copy the topic ARN
  4. Add it to alarm_actions in terraform/main.tf for both alarms
  5. Run terraform apply -var-file=terraform.tfvars from ai-tutor-backend/terraform/
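Step 4 amounts to a one-attribute change per alarm. A hypothetical excerpt (attribute names follow the Terraform AWS provider; aws_sns_topic.alerts and aws_instance.app are placeholders for your actual resources):

```hcl
# Hypothetical terraform/main.tf sketch: wire an SNS topic into the CPU alarm.
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "ai-tutor-staging-cpu-high"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  period              = 300 # 5-minute EC2 basic metrics
  evaluation_periods  = 2   # 2 x 5 min = 10 min sustained
  dimensions          = { InstanceId = aws_instance.app.id }
  alarm_actions       = [aws_sns_topic.alerts.arn] # step 4: your topic ARN
}
```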

Viewing Alarms

AWS Console → CloudWatch → Alarms → All alarms

New alarms show "Insufficient data" for the first 10 minutes. That's normal. EC2 basic metrics report every 5 minutes.

UptimeRobot (External Monitoring)

What It Is

UptimeRobot is a free external monitoring service that pings your URLs every 5 minutes from servers outside your infrastructure. If a check fails, it sends an email alert.

Monitors Configured

| Monitor | URL | What It Checks |
| --- | --- | --- |
| 2Sigma - Health Check | https://learn.2sigma.io/health | Backend health endpoint responds |
| 2Sigma - Backend API | https://learn.2sigma.io/api/v1/docs | API documentation accessible |
| 2Sigma - Landing Page | https://2sigma.io | Landing page loads |
| 2Sigma - App | https://learn.2sigma.io | App frontend loads |

Why It's Needed

Sentry only captures errors when the application is running. If the EC2 instance crashes, nginx dies, or the server becomes unreachable, Sentry can't report anything because there's nothing left to report. UptimeRobot checks from the outside, completely independent of your infrastructure.
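An external check of this kind is conceptually tiny; here is a minimal sketch of what such a probe does, using only the standard library (the URL, timeout, and return convention are illustrative, not UptimeRobot's actual implementation):

```python
from urllib.error import URLError
from urllib.request import urlopen


def check(url: str, timeout: float = 10.0) -> bool:
    """Return True when the endpoint answers with an HTTP 2xx within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        # DNS failure, refused connection, timeout: all count as "down"
        return False
```

The value of the external service is not the probe itself but that it runs from someone else's network and alerts you when yours disappears.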

Dashboard

https://uptimerobot.com shows uptime percentage, response time trends, and incident history for each monitor.

Terraform Resources

All observability infrastructure is managed in ai-tutor-backend/terraform/main.tf:

| Resource | Purpose |
| --- | --- |
| aws_iam_role.ec2 | EC2 role for CloudWatch Logs access |
| aws_iam_role_policy.cloudwatch_logs | Policy granting log write permissions to /ai-tutor/* |
| aws_iam_instance_profile.ec2 | Links IAM role to EC2 instance |
| aws_cloudwatch_metric_alarm.cpu_high | CPU > 80% alarm |
| aws_cloudwatch_metric_alarm.status_check_failed | Instance status check alarm |

The EC2 instance has lifecycle { ignore_changes = [ami] } to prevent Terraform from replacing the instance when AWS releases a newer AMI.

Prometheus Metrics (Available but Not Scraped)

The backend exposes Prometheus metrics at /metrics covering HTTP request counts, request durations, DB query metrics, and LLM cache performance.

LLM Cache Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| llm_cache_hits_total | Counter | content_item_id | Total cache hits (responses served from database) |
| llm_cache_misses_total | Counter | content_item_id | Total cache misses (LLM called) |

LLM Cache Analytics (SQL Queries)

Until Prometheus + Grafana is set up, use these SQL queries directly against PostgreSQL to monitor LLM cache performance.

Overall Cache Hit Rate

SELECT
    meta->>'source' AS source,
    COUNT(*) AS count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM chat_messages
WHERE role = 'assistant' AND meta ? 'source'
GROUP BY meta->>'source';

Cache Hit Rate by Topic

SELECT
    cs.content_item_id,
    ci.title,
    COUNT(*) FILTER (WHERE cm.meta->>'source' = 'cache') AS cache_hits,
    COUNT(*) FILTER (WHERE cm.meta->>'source' = 'llm') AS llm_calls,
    ROUND(
        COUNT(*) FILTER (WHERE cm.meta->>'source' = 'cache') * 100.0 / NULLIF(COUNT(*), 0), 1
    ) AS hit_rate_pct
FROM chat_messages cm
JOIN chat_sessions cs ON cm.session_id = cs.id
JOIN content_items ci ON cs.content_item_id = ci.id
WHERE cm.role = 'assistant' AND cm.meta ? 'source'
GROUP BY cs.content_item_id, ci.title
ORDER BY hit_rate_pct DESC;

Most Frequently Hit Cache Entries

Shows the most frequently served cache entries; for MCQ content, this reveals which answer choices users most commonly pick:

SELECT
    content_item_id,
    user_message,
    hit_count,
    LEFT(response_content, 100) AS response_preview
FROM llm_response_cache
ORDER BY hit_count DESC
LIMIT 20;

Cache Tree Size per Topic

SELECT
    content_item_id,
    COUNT(*) AS total_nodes,
    SUM(hit_count) AS total_hits,
    COUNT(*) FILTER (WHERE parent_id IS NULL) AS root_nodes
FROM llm_response_cache
GROUP BY content_item_id
ORDER BY total_hits DESC;

Estimated Cost Savings

Assuming ~$0.01 per LLM call (approximate for Claude Haiku-class models):

SELECT
    COUNT(*) FILTER (WHERE meta->>'source' = 'cache') AS cache_hits,
    COUNT(*) FILTER (WHERE meta->>'source' = 'llm') AS llm_calls,
    ROUND(COUNT(*) FILTER (WHERE meta->>'source' = 'cache') * 0.01, 2) AS estimated_savings_usd
FROM chat_messages
WHERE role = 'assistant' AND meta ? 'source';

Langfuse Impact

| Scenario | Langfuse Trace | chat_messages.meta |
| --- | --- | --- |
| Cache hit | No trace (LLM not called) | source: "cache" |
| Cache miss | Full trace (normal) | source: "llm" |
| Diverged session | Full trace (normal) | source: "llm" |

Langfuse continues to track all actual LLM calls accurately. Cache hits are invisible to Langfuse by design: they involve no LLM call, so there is nothing to trace. Use the SQL queries above for cache analytics.

The Prometheus counters are defined in app/core/metrics.py. Cache hit rate is calculated as:

hit_rate = llm_cache_hits_total / (llm_cache_hits_total + llm_cache_misses_total)
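In code, the same calculation is a one-liner with a guard for the no-traffic case (a standalone sketch; the real counters live in the Prometheus registry, not as plain ints):

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate from the two counter values; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


print(cache_hit_rate(75, 25))  # 0.75
```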

Activating Prometheus + Grafana

Nothing currently scrapes these metrics. To activate full Prometheus + Grafana:

  1. Add Prometheus and Grafana containers to docker-compose.yml
  2. Configure Prometheus to scrape backend:9898/metrics
  3. Configure Grafana to use Prometheus as a data source
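Step 2 might look like this (a hypothetical prometheus.yml; the backend:9898 target comes from this guide, while the job name and interval are illustrative):

```yaml
# Hypothetical prometheus.yml: scrape the backend's /metrics endpoint.
scrape_configs:
  - job_name: ai-tutor-backend
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["backend:9898"]
```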

The catch: Prometheus + Grafana adds roughly 500 MB of RAM usage. The t3.small has 2 GB total, and the current stack already uses most of it. Defer this until upgrading to a larger instance.

Troubleshooting Quick Reference

For quick diagnostic checks, see below. For a full log of specific errors encountered in production (with root causes and fixes), see Troubleshooting & Known Issues.

Sentry not receiving events

  1. Check the DSN is set: echo $SENTRY_DSN on the EC2 instance (or echo $NEXT_PUBLIC_SENTRY_DSN for frontend)
  2. Check Sentry initialized: look for "Sentry initialized" in backend startup logs (docker compose --env-file .env.production logs backend)
  3. Check free tier quota: Sentry dashboard → Settings → Subscription
  4. Test manually: trigger a deliberate error and check the Sentry Issues page within 1 minute

CloudWatch Logs not appearing

  1. Check the IAM role is attached: AWS Console → EC2 → your instance → Security tab → IAM Role
  2. Check the container is running: docker compose --env-file .env.production ps
  3. Check the Docker log driver: docker inspect <container_name> | grep -A5 LogConfig (should show awslogs)
  4. Check the region matches: awslogs-region in docker-compose.yml should be us-east-2

CloudWatch Alarms stuck in "Insufficient Data"

Normal for the first 10 minutes after creation. EC2 basic monitoring reports metrics every 5 minutes. If it persists beyond 15 minutes, check that the EC2 instance ID in the alarm matches the actual instance.

Cost

Everything runs within free tiers:

| Service | Free Tier | Typical Usage |
| --- | --- | --- |
| Sentry | 5,000 errors/month | Well under for a small app |
| CloudWatch Logs | 5 GB/month ingestion | 1-3 GB typical |
| CloudWatch Alarms | 10 alarms free | Using 2 |
| UptimeRobot | 50 monitors, 5-min interval | Using 4 |
| Total | | $0/month |

Future Improvements

| Improvement | When to Do It | Cost |
| --- | --- | --- |
| Add SNS email alerts to CloudWatch alarms | When you want proactive CPU/status alerts | $0 |
| Upgrade Sentry to Team plan | When you hit 5K errors/month or need Slack alerts | $26/mo |
| Add Prometheus + Grafana | When you upgrade to a larger instance | $0 (self-hosted) |
| Add OpenTelemetry SDK | When you add multiple microservices | $0 |
| Add Grafana Cloud | When you want managed metrics without self-hosting | $0-50/mo |