Observability Guide¶
The application uses four complementary tools for observability: Sentry for error tracking and performance, CloudWatch Logs for container log aggregation, CloudWatch Alarms for infrastructure metrics, and UptimeRobot for external availability checks. Together they cover the most common failure modes without requiring any paid services.
Architecture Overview¶
┌─────────────────────────────────────────┐
│ EC2 Instance │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────┐ │
│ │ FastAPI │ │ Next.js │ │ Nginx │ │
│ └────┬─────┘ └────┬─────┘ └───┬───┘ │
│ │ │ │ │
│ └─────────────┴─────────────┘ │
│ │ │
│ awslogs Docker driver │
└────────────────────┼─────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
Sentry Cloud CloudWatch Logs CloudWatch Alarms
(app exceptions, (/ai-tutor/backend, (CPU > 80%,
slow endpoints, /ai-tutor/frontend, status check
React crashes) /ai-tutor/nginx) failures)
▲
│
UptimeRobot
(external ping every
5 min from outside
your infrastructure)
What's Covered vs Not Covered¶
| Scenario | Covered By | Status |
|---|---|---|
| Backend exceptions | Sentry | Active |
| Frontend React crashes | Sentry | Active |
| Slow endpoints | Sentry Performance | Active |
| Site completely down | UptimeRobot | Active |
| Container logs | CloudWatch Logs | Active |
| CPU spikes | CloudWatch Alarms | Active |
| EC2 status check failure | CloudWatch Alarms | Active |
| Nginx errors | CloudWatch Logs (nginx container) | Active |
| Database errors | Sentry (captures SQLAlchemy errors) | Active |
| Container OOM/crash | Docker healthchecks + UptimeRobot | Partial |
| Memory usage alerts | Not configured | Not yet |
Sentry (Error Tracking)¶
What It Is¶
Sentry is an error tracking service that captures exceptions, records request context, and measures performance. The free tier is sufficient for a small app.
Free Tier Limits¶
- 5,000 errors/month
- 7-day retention
- 1 user
- Email alerts only (no Slack on free)
Backend Integration (FastAPI)¶
Sentry is initialized in app/main.py during the lifespan startup, but only when SENTRY_DSN is set. This means you can run locally without Sentry by simply omitting the variable.
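A minimal sketch of this conditional pattern (function and variable names here are illustrative, not the app's actual code):

```python
import os


def init_sentry() -> bool:
    """Initialize Sentry only when SENTRY_DSN is set; return True if initialized."""
    dsn = os.getenv("SENTRY_DSN", "")
    if not dsn:
        # Local development: no DSN, so Sentry is never imported or initialized.
        return False
    import sentry_sdk  # imported lazily so the dependency stays optional

    sentry_sdk.init(
        dsn=dsn,
        traces_sample_rate=float(os.getenv("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
        environment=os.getenv("SENTRY_ENVIRONMENT", os.getenv("ENVIRONMENT", "development")),
    )
    return True
```

Called from the lifespan startup hook, this keeps local runs Sentry-free with zero configuration.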
What gets captured automatically:
- Unhandled exceptions in any route handler
- Request context (URL, method, headers, user ID if authenticated)
- SQLAlchemy query errors
- Performance traces for a configurable percentage of requests
Files involved:
| File | Role |
|---|---|
| `app/main.py` | Sentry initialization in lifespan |
| `app/config.py` | Settings: `SENTRY_DSN`, `SENTRY_TRACES_SAMPLE_RATE`, `SENTRY_ENVIRONMENT` |
| `requirements.txt` | `sentry-sdk[fastapi]>=1.40.0` |
Frontend Integration (Next.js 14)¶
The frontend uses @sentry/nextjs, which handles three separate runtimes: server-side rendering, edge functions, and the browser. Source maps are uploaded during build so stack traces in Sentry show your original TypeScript rather than minified output.
Error boundaries (error.tsx and global-error.tsx) call Sentry.captureException so React rendering errors are reported even when they don't propagate to the server.
Files involved:
| File | Role |
|---|---|
| `sentry.server.config.ts` | Server-side Sentry init |
| `sentry.edge.config.ts` | Edge runtime Sentry init |
| `instrumentation.ts` | Next.js instrumentation hook (loads correct config per runtime) |
| `instrumentation-client.ts` | Client-side Sentry init with browser tracing |
| `next.config.js` | Wrapped with `withSentryConfig()` |
| `app/error.tsx` | Calls `Sentry.captureException` |
| `app/global-error.tsx` | Calls `Sentry.captureException` |
Environment Variables¶
| Variable | Where | Purpose |
|---|---|---|
| `SENTRY_DSN` | Backend `.env` | Backend Sentry project DSN |
| `SENTRY_TRACES_SAMPLE_RATE` | Backend `.env` | Fraction of transactions to trace (0.0–1.0, default 0.1) |
| `SENTRY_ENVIRONMENT` | Backend `.env` | Override environment name (optional, falls back to `ENVIRONMENT`) |
| `NEXT_PUBLIC_SENTRY_DSN` | Frontend `.env.local` | Frontend Sentry project DSN |
| `NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE` | Frontend `.env.local` | Fraction of transactions to trace (default 0.1) |
Sentry Dashboard¶
- URL: https://sentry.io (org: 2sigma)
- Projects: `python-fastapi` (backend), `javascript-nextjs` (frontend)
- Issues page: errors grouped by type and stack trace
- Performance page: slow endpoints and page loads, sorted by p95 latency
Disabling Sentry¶
Set SENTRY_DSN to an empty string or remove it entirely. Sentry is only initialized when the DSN is present, so there's no code to change.
CloudWatch Logs¶
What It Is¶
AWS CloudWatch Logs collects and stores container output. The free tier covers 5 GB/month, which is more than enough for a small app.
How It Works¶
Docker Compose uses the awslogs logging driver instead of the default json-file driver. Each container ships its stdout and stderr directly to a CloudWatch Log Group in us-east-2. There's no log agent to manage.
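The per-service wiring looks roughly like this (a sketch; the service name and group are taken from the log-group table below, other option values are assumptions):

```yaml
services:
  backend:
    logging:
      driver: awslogs
      options:
        awslogs-region: us-east-2
        awslogs-group: /ai-tutor/backend
        awslogs-create-group: "true"  # create the group on first write
```

The same stanza is repeated for the frontend and nginx services with their own log groups.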
Log Groups¶
| Log Group | Container | What's In It |
|---|---|---|
| `/ai-tutor/backend` | FastAPI | API request logs, app errors, DB query logs |
| `/ai-tutor/frontend` | Next.js | SSR logs, build output, server-side errors |
| `/ai-tutor/nginx` | Nginx | Access logs, proxy errors, 502/504 errors |
Postgres is intentionally excluded from CloudWatch to avoid shipping potentially sensitive query data off the instance. Postgres logs stay in the container and are accessible via docker compose logs postgres.
IAM Permissions¶
The EC2 instance has an IAM role (ai-tutor-staging-ec2-role) with a policy granting write access to /ai-tutor/* log groups. This is managed in terraform/main.tf and attached to the instance via an instance profile.
Viewing Logs¶
AWS Console:
- Go to CloudWatch → Log Groups
- Select `/ai-tutor/backend` (or `frontend`/`nginx`)
- Click a log stream to view recent output
AWS CLI:
```shell
# Stream logs in real time
aws logs tail /ai-tutor/backend --follow --profile 2sigma

# Filter for errors
aws logs filter-log-events \
  --log-group-name /ai-tutor/backend \
  --filter-pattern "ERROR" \
  --profile 2sigma
```
Searching Logs (CloudWatch Logs Insights)¶
Go to CloudWatch → Logs Insights, select a log group, and run queries like:
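For example, to surface the 50 most recent error lines (a representative query; adjust the filter pattern to your log format):

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
```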
CloudWatch Alarms¶
What's Configured¶
| Alarm | Metric | Threshold | What It Means |
|---|---|---|---|
| `ai-tutor-staging-cpu-high` | CPUUtilization | > 80% for 10 min | Server under heavy load |
| `ai-tutor-staging-status-check` | StatusCheckFailed | Any failure | EC2 instance unreachable or unhealthy |
Alert Actions¶
Both alarms currently have no notification actions (alarm_actions = []). They'll show as ALARM in the AWS Console but won't send emails or pages.
To add email alerts:
- Create an SNS topic in AWS Console → SNS → Topics → Create topic
- Subscribe your email to the topic (confirm the subscription email)
- Copy the topic ARN
- Add it to `alarm_actions` in `terraform/main.tf` for both alarms
- Run `terraform apply --var-file=terraform.tfvars` from `ai-tutor-backend/terraform/`
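In Terraform, the steps above amount to roughly the following (a sketch; resource names and the email address are illustrative, not taken from the repo):

```hcl
resource "aws_sns_topic" "alerts" {
  name = "ai-tutor-staging-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "you@example.com" # must be confirmed via the subscription email
}

# Then, on each existing aws_cloudwatch_metric_alarm resource:
#   alarm_actions = [aws_sns_topic.alerts.arn]
```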
Viewing Alarms¶
AWS Console → CloudWatch → Alarms → All alarms
New alarms show "Insufficient data" for the first 10 minutes. That's normal. EC2 basic metrics report every 5 minutes.
UptimeRobot (External Monitoring)¶
What It Is¶
UptimeRobot is a free external monitoring service that pings your URLs every 5 minutes from servers outside your infrastructure. If a check fails, it sends an email alert.
Monitors Configured¶
| Monitor | URL | What It Checks |
|---|---|---|
| 2Sigma - Health Check | https://learn.2sigma.io/health | Backend health endpoint responds |
| 2Sigma - Backend API | https://learn.2sigma.io/api/v1/docs | API documentation accessible |
| 2Sigma - Landing Page | https://2sigma.io | Landing page loads |
| 2Sigma - App | https://learn.2sigma.io | App frontend loads |
Why It's Needed¶
Sentry only captures errors when the application is running. If the EC2 instance crashes, nginx dies, or the server becomes unreachable, Sentry can't report anything because there's nothing left to report. UptimeRobot checks from the outside, completely independent of your infrastructure.
Dashboard¶
https://uptimerobot.com shows uptime percentage, response time trends, and incident history for each monitor.
Terraform Resources¶
All observability infrastructure is managed in ai-tutor-backend/terraform/main.tf:
| Resource | Purpose |
|---|---|
| `aws_iam_role.ec2` | EC2 role for CloudWatch Logs access |
| `aws_iam_role_policy.cloudwatch_logs` | Policy granting log write permissions to `/ai-tutor/*` |
| `aws_iam_instance_profile.ec2` | Links IAM role to EC2 instance |
| `aws_cloudwatch_metric_alarm.cpu_high` | CPU > 80% alarm |
| `aws_cloudwatch_metric_alarm.status_check_failed` | Instance status check alarm |
The EC2 instance has lifecycle { ignore_changes = [ami] } to prevent Terraform from replacing the instance when AWS releases a newer AMI.
Prometheus Metrics (Available but Not Scraped)¶
The backend exposes Prometheus metrics at /metrics covering HTTP request counts, request durations, DB query metrics, and LLM cache performance.
LLM Cache Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `llm_cache_hits_total` | Counter | `content_item_id` | Total cache hits (responses served from database) |
| `llm_cache_misses_total` | Counter | `content_item_id` | Total cache misses (LLM called) |
LLM Cache Analytics (SQL Queries)¶
Until Prometheus + Grafana is set up, use these SQL queries directly against PostgreSQL to monitor LLM cache performance.
Overall Cache Hit Rate¶
```sql
SELECT
  meta->>'source' AS source,
  COUNT(*) AS count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM chat_messages
WHERE role = 'assistant' AND meta ? 'source'
GROUP BY meta->>'source';
```
Cache Hit Rate by Topic¶
```sql
SELECT
  cs.content_item_id,
  ci.title,
  COUNT(*) FILTER (WHERE cm.meta->>'source' = 'cache') AS cache_hits,
  COUNT(*) FILTER (WHERE cm.meta->>'source' = 'llm') AS llm_calls,
  ROUND(
    COUNT(*) FILTER (WHERE cm.meta->>'source' = 'cache') * 100.0 / NULLIF(COUNT(*), 0), 1
  ) AS hit_rate_pct
FROM chat_messages cm
JOIN chat_sessions cs ON cm.session_id = cs.id
JOIN content_items ci ON cs.content_item_id = ci.id
WHERE cm.role = 'assistant' AND cm.meta ? 'source'
GROUP BY cs.content_item_id, ci.title
ORDER BY hit_rate_pct DESC;
```
Most Popular Cache Paths¶
Shows which MCQ answer choices are most commonly taken by users:
```sql
SELECT
  content_item_id,
  user_message,
  hit_count,
  LEFT(response_content, 100) AS response_preview
FROM llm_response_cache
ORDER BY hit_count DESC
LIMIT 20;
```
Cache Tree Size per Topic¶
```sql
SELECT
  content_item_id,
  COUNT(*) AS total_nodes,
  SUM(hit_count) AS total_hits,
  COUNT(*) FILTER (WHERE parent_id IS NULL) AS root_nodes
FROM llm_response_cache
GROUP BY content_item_id
ORDER BY total_hits DESC;
```
Estimated Cost Savings¶
Assuming ~$0.01 per LLM call (approximate for Claude Haiku-class models):
```sql
SELECT
  COUNT(*) FILTER (WHERE meta->>'source' = 'cache') AS cache_hits,
  COUNT(*) FILTER (WHERE meta->>'source' = 'llm') AS llm_calls,
  ROUND(COUNT(*) FILTER (WHERE meta->>'source' = 'cache') * 0.01, 2) AS estimated_savings_usd
FROM chat_messages
WHERE role = 'assistant' AND meta ? 'source';
```
Langfuse Impact¶
| Scenario | Langfuse Trace | chat_messages.meta |
|---|---|---|
| Cache hit | No trace (LLM not called) | source: "cache" |
| Cache miss | Full trace (normal) | source: "llm" |
| Diverged session | Full trace (normal) | source: "llm" |
Langfuse continues to track all actual LLM calls accurately. Cache hits are invisible to Langfuse by design — they involve no LLM call, so there's nothing to trace. Use the SQL queries above for cache analytics.
The `llm_cache_hits_total` and `llm_cache_misses_total` counters are defined in `app/core/metrics.py`; cache hit rate is `hits / (hits + misses)` over a given window.
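Once Prometheus is scraping the backend, the hit rate could be computed with a PromQL expression along these lines (a sketch using the counter names above; the 1h window is arbitrary):

```promql
sum(rate(llm_cache_hits_total[1h]))
  /
(sum(rate(llm_cache_hits_total[1h])) + sum(rate(llm_cache_misses_total[1h])))
```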
Activating Prometheus + Grafana¶
Nothing currently scrapes these metrics. To activate full Prometheus + Grafana:
- Add Prometheus and Grafana containers to `docker-compose.yml`
- Configure Prometheus to scrape `backend:9898/metrics`
- Configure Grafana to use Prometheus as a data source
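The scrape step would look roughly like this in `prometheus.yml` (a sketch; the job name and interval are assumptions, and Prometheus defaults to the `/metrics` path):

```yaml
scrape_configs:
  - job_name: ai-tutor-backend
    scrape_interval: 15s
    static_configs:
      - targets: ["backend:9898"]
```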
The catch: Prometheus + Grafana adds roughly 500 MB of RAM usage. The t3.small has 2 GB total, and the current stack already uses most of it. Defer this until upgrading to a larger instance.
Troubleshooting Quick Reference¶
For quick diagnostic checks, see below. For a full log of specific errors encountered in production (with root causes and fixes), see Troubleshooting & Known Issues.
Sentry not receiving events¶
- Check the DSN is set: `echo $SENTRY_DSN` on the EC2 instance (or `echo $NEXT_PUBLIC_SENTRY_DSN` for the frontend)
- Check Sentry initialized: look for "Sentry initialized" in backend startup logs (`docker compose --env-file .env.production logs backend`)
- Check free tier quota: Sentry dashboard → Settings → Subscription
- Test manually: trigger a deliberate error and check the Sentry Issues page within 1 minute
CloudWatch Logs not appearing¶
- Check the IAM role is attached: AWS Console → EC2 → your instance → Security tab → IAM Role
- Check the container is running: `docker compose --env-file .env.production ps`
- Check the Docker log driver: `docker inspect <container_name> | grep -A5 LogConfig` (should show `awslogs`)
- Check the region matches: `awslogs-region` in `docker-compose.yml` should be `us-east-2`
CloudWatch Alarms stuck in "Insufficient Data"¶
Normal for the first 10 minutes after creation. EC2 basic monitoring reports metrics every 5 minutes. If it persists beyond 15 minutes, check that the EC2 instance ID in the alarm matches the actual instance.
Cost¶
Everything runs within free tiers:
| Service | Free Tier | Typical Usage |
|---|---|---|
| Sentry | 5,000 errors/month | Well under for a small app |
| CloudWatch Logs | 5 GB/month ingestion | 1-3 GB typical |
| CloudWatch Alarms | 10 alarms free | Using 2 |
| UptimeRobot | 50 monitors, 5-min interval | Using 4 |
| Total | | $0/month |
Future Improvements¶
| Improvement | When to Do It | Cost |
|---|---|---|
| Add SNS email alerts to CloudWatch alarms | When you want proactive CPU/status alerts | $0 |
| Upgrade Sentry to Team plan | When you hit 5K errors/month or need Slack alerts | $26/mo |
| Add Prometheus + Grafana | When you upgrade to a larger instance | $0 (self-hosted) |
| Add OpenTelemetry SDK | When you add multiple microservices | $0 |
| Add Grafana Cloud | When you want managed metrics without self-hosting | $0-50/mo |