Observability Guide¶
The application uses four complementary tools for observability: Sentry for error tracking and performance, CloudWatch Logs for container log aggregation, CloudWatch Alarms for infrastructure metrics, and UptimeRobot for external availability checks. Together they cover the most common failure modes without requiring any paid services.
Architecture Overview¶
┌─────────────────────────────────────────┐
│ EC2 Instance │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────┐ │
│ │ FastAPI │ │ Next.js │ │ Nginx │ │
│ └────┬─────┘ └────┬─────┘ └───┬───┘ │
│ │ │ │ │
│ └─────────────┴─────────────┘ │
│ │ │
│ awslogs Docker driver │
└────────────────────┼─────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
Sentry Cloud CloudWatch Logs CloudWatch Alarms
(app exceptions, (/ai-tutor/backend, (CPU > 80%,
slow endpoints, /ai-tutor/frontend, status check
React crashes) /ai-tutor/nginx) failures)
▲
│
UptimeRobot
(external ping every
5 min from outside
your infrastructure)
What's Covered vs Not Covered¶
| Scenario | Covered By | Status |
|---|---|---|
| Backend exceptions | Sentry | Active |
| Frontend React crashes | Sentry | Active |
| Slow endpoints | Sentry Performance | Active |
| Site completely down | UptimeRobot | Active |
| Container logs | CloudWatch Logs | Active |
| CPU spikes | CloudWatch Alarms | Active |
| EC2 status check failure | CloudWatch Alarms | Active |
| Nginx errors | CloudWatch Logs (nginx container) | Active |
| Database errors | Sentry (captures SQLAlchemy errors) | Active |
| Container OOM/crash | Docker healthchecks + UptimeRobot | Partial |
| Memory usage alerts | Not configured | Not yet |
Sentry (Error Tracking)¶
What It Is¶
Sentry is an error tracking service that captures exceptions, records request context, and measures performance. The free tier is sufficient for a small app.
Free Tier Limits¶
- 5,000 errors/month
- 7-day retention
- 1 user
- Email alerts only (no Slack on free)
Backend Integration (FastAPI)¶
Sentry is initialized in app/main.py during the lifespan startup, but only when SENTRY_DSN is set. This means you can run locally without Sentry by simply omitting the variable.
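A minimal sketch of this conditional pattern (function and variable names here are illustrative, not the app's actual code):

```python
import os


def init_sentry() -> bool:
    """Initialize Sentry only when SENTRY_DSN is set; return True if initialized."""
    dsn = os.getenv("SENTRY_DSN", "")
    if not dsn:
        # Local development: no DSN, so Sentry is never imported or initialized.
        return False
    import sentry_sdk  # imported lazily so the dependency stays optional

    sentry_sdk.init(
        dsn=dsn,
        traces_sample_rate=float(os.getenv("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
        environment=os.getenv("SENTRY_ENVIRONMENT", os.getenv("ENVIRONMENT", "development")),
    )
    return True
```

Called from the lifespan startup hook, this keeps local runs Sentry-free with zero configuration.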
What gets captured automatically:
- Unhandled exceptions in any route handler
- Request context (URL, method, headers, user ID if authenticated)
- SQLAlchemy query errors
- Performance traces for a configurable percentage of requests
Files involved:
| File | Role |
|---|---|
| `app/main.py` | Sentry initialization in lifespan |
| `app/config.py` | Settings: `SENTRY_DSN`, `SENTRY_TRACES_SAMPLE_RATE`, `SENTRY_ENVIRONMENT` |
| `requirements.txt` | `sentry-sdk[fastapi]>=1.40.0` |
Frontend Integration (Next.js 14)¶
The frontend uses @sentry/nextjs, which handles three separate runtimes: server-side rendering, edge functions, and the browser. Source maps are uploaded during build so stack traces in Sentry show your original TypeScript rather than minified output.
Error boundaries (error.tsx and global-error.tsx) call Sentry.captureException so React rendering errors are reported even when they don't propagate to the server.
Files involved:
| File | Role |
|---|---|
| `sentry.server.config.ts` | Server-side Sentry init |
| `sentry.edge.config.ts` | Edge runtime Sentry init |
| `instrumentation.ts` | Next.js instrumentation hook (loads correct config per runtime) |
| `instrumentation-client.ts` | Client-side Sentry init with browser tracing |
| `next.config.js` | Wrapped with `withSentryConfig()` |
| `app/error.tsx` | Calls `Sentry.captureException` |
| `app/global-error.tsx` | Calls `Sentry.captureException` |
Environment Variables¶
| Variable | Where | Purpose |
|---|---|---|
| `SENTRY_DSN` | Backend `.env` | Backend Sentry project DSN |
| `SENTRY_TRACES_SAMPLE_RATE` | Backend `.env` | Fraction of transactions to trace (0.0–1.0, default 0.1) |
| `SENTRY_ENVIRONMENT` | Backend `.env` | Override environment name (optional, falls back to `ENVIRONMENT`) |
| `NEXT_PUBLIC_SENTRY_DSN` | Frontend `.env.local` | Frontend Sentry project DSN |
| `NEXT_PUBLIC_SENTRY_TRACES_SAMPLE_RATE` | Frontend `.env.local` | Fraction of transactions to trace (default 0.1) |
Sentry Dashboard¶
- URL: https://sentry.io (org: 2sigma)
- Projects: `python-fastapi` (backend), `javascript-nextjs` (frontend)
- Issues page: errors grouped by type and stack trace
- Performance page: slow endpoints and page loads, sorted by p95 latency
Disabling Sentry¶
Set SENTRY_DSN to an empty string or remove it entirely. Sentry is only initialized when the DSN is present, so there's no code to change.
CloudWatch Logs¶
What It Is¶
AWS CloudWatch Logs collects and stores container output. The free tier covers 5 GB/month, which is more than enough for a small app.
How It Works¶
Docker Compose uses the awslogs logging driver instead of the default json-file driver. Each container ships its stdout and stderr directly to a CloudWatch Log Group in us-east-2. There's no log agent to manage.
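The per-service wiring looks roughly like this (a sketch; the service name and group are taken from the log-group table below, other option values are assumptions):

```yaml
services:
  backend:
    logging:
      driver: awslogs
      options:
        awslogs-region: us-east-2
        awslogs-group: /ai-tutor/backend
        awslogs-create-group: "true"  # create the group on first write
```

The same stanza is repeated for the frontend and nginx services with their own log groups.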
Log Groups¶
| Log Group | Container | What's In It |
|---|---|---|
| `/ai-tutor/backend` | FastAPI | API request logs, app errors, DB query logs |
| `/ai-tutor/frontend` | Next.js | SSR logs, build output, server-side errors |
| `/ai-tutor/nginx` | Nginx | Access logs, proxy errors, 502/504 errors |
Postgres is intentionally excluded from CloudWatch to avoid shipping potentially sensitive query data off the instance. Postgres logs stay in the container and are accessible via docker compose logs postgres.
IAM Permissions¶
The EC2 instance has an IAM role (ai-tutor-staging-ec2-role) with a policy granting write access to /ai-tutor/* log groups. This is managed in terraform/main.tf and attached to the instance via an instance profile.
Viewing Logs¶
AWS Console:
- Go to CloudWatch → Log Groups
- Select `/ai-tutor/backend` (or `frontend`/`nginx`)
- Click a log stream to view recent output
AWS CLI:
```shell
# Stream logs in real time
aws logs tail /ai-tutor/backend --follow --profile 2sigma

# Filter for errors
aws logs filter-log-events \
  --log-group-name /ai-tutor/backend \
  --filter-pattern "ERROR" \
  --profile 2sigma
```
Searching Logs (CloudWatch Logs Insights)¶
Go to CloudWatch → Logs Insights, select a log group, and run queries like:
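For example, to surface the 50 most recent error lines (a representative query; adjust the filter pattern to your log format):

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
```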
CloudWatch Alarms¶
What's Configured¶
| Alarm | Metric | Threshold | What It Means |
|---|---|---|---|
| `ai-tutor-staging-cpu-high` | CPUUtilization | > 80% for 10 min | Server under heavy load |
| `ai-tutor-staging-status-check` | StatusCheckFailed | Any failure | EC2 instance unreachable or unhealthy |
Alert Actions¶
Both alarms currently have no notification actions (alarm_actions = []). They'll show as ALARM in the AWS Console but won't send emails or pages.
To add email alerts:
- Create an SNS topic in AWS Console → SNS → Topics → Create topic
- Subscribe your email to the topic (confirm the subscription email)
- Copy the topic ARN
- Add it to `alarm_actions` in `terraform/main.tf` for both alarms
- Run `terraform apply --var-file=terraform.tfvars` from `ai-tutor-backend/terraform/`
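In Terraform, the steps above amount to roughly the following (a sketch; resource names and the email address are illustrative, not taken from the repo):

```hcl
resource "aws_sns_topic" "alerts" {
  name = "ai-tutor-staging-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "you@example.com" # must be confirmed via the subscription email
}

# Then, on each existing aws_cloudwatch_metric_alarm resource:
#   alarm_actions = [aws_sns_topic.alerts.arn]
```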
Viewing Alarms¶
AWS Console → CloudWatch → Alarms → All alarms
New alarms show "Insufficient data" for the first 10 minutes. That's normal. EC2 basic metrics report every 5 minutes.
UptimeRobot (External Monitoring)¶
What It Is¶
UptimeRobot is a free external monitoring service that pings your URLs every 5 minutes from servers outside your infrastructure. If a check fails, it sends an email alert.
Monitors Configured¶
| Monitor | URL | What It Checks |
|---|---|---|
| 2Sigma - Health Check | https://learn.2sigma.io/health | Backend health endpoint responds |
| 2Sigma - Backend API | https://learn.2sigma.io/api/v1/docs | API documentation accessible |
| 2Sigma - Landing Page | https://2sigma.io | Landing page loads |
| 2Sigma - App | https://learn.2sigma.io | App frontend loads |
Why It's Needed¶
Sentry only captures errors when the application is running. If the EC2 instance crashes, nginx dies, or the server becomes unreachable, Sentry can't report anything because there's nothing left to report. UptimeRobot checks from the outside, completely independent of your infrastructure.
Dashboard¶
https://uptimerobot.com shows uptime percentage, response time trends, and incident history for each monitor.
Terraform Resources¶
All observability infrastructure is managed in ai-tutor-backend/terraform/main.tf:
| Resource | Purpose |
|---|---|
| `aws_iam_role.ec2` | EC2 role for CloudWatch Logs access |
| `aws_iam_role_policy.cloudwatch_logs` | Policy granting log write permissions to `/ai-tutor/*` |
| `aws_iam_instance_profile.ec2` | Links IAM role to EC2 instance |
| `aws_cloudwatch_metric_alarm.cpu_high` | CPU > 80% alarm |
| `aws_cloudwatch_metric_alarm.status_check_failed` | Instance status check alarm |
The EC2 instance has lifecycle { ignore_changes = [ami] } to prevent Terraform from replacing the instance when AWS releases a newer AMI.
Prometheus Metrics (Available but Not Scraped)¶
The backend exposes Prometheus metrics at /metrics covering HTTP request counts, request durations, DB query metrics, and LLM cache performance.
LLM Cache Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `llm_cache_hits_total` | Counter | `content_item_id` | Total cache hits (responses served from database) |
| `llm_cache_misses_total` | Counter | `content_item_id` | Total cache misses (LLM called) |
LLM Cache Analytics (SQL Queries)¶
Until Prometheus + Grafana is set up, use these SQL queries directly against PostgreSQL to monitor LLM cache performance.
Overall Cache Hit Rate¶
```sql
SELECT
  meta->>'source' AS source,
  COUNT(*) AS count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM chat_messages
WHERE role = 'assistant' AND meta ? 'source'
GROUP BY meta->>'source';
```
Cache Hit Rate by Topic¶
```sql
SELECT
  cs.content_item_id,
  ci.title,
  COUNT(*) FILTER (WHERE cm.meta->>'source' = 'cache') AS cache_hits,
  COUNT(*) FILTER (WHERE cm.meta->>'source' = 'llm') AS llm_calls,
  ROUND(
    COUNT(*) FILTER (WHERE cm.meta->>'source' = 'cache') * 100.0 / NULLIF(COUNT(*), 0), 1
  ) AS hit_rate_pct
FROM chat_messages cm
JOIN chat_sessions cs ON cm.session_id = cs.id
JOIN content_items ci ON cs.content_item_id = ci.id
WHERE cm.role = 'assistant' AND cm.meta ? 'source'
GROUP BY cs.content_item_id, ci.title
ORDER BY hit_rate_pct DESC;
```
Most Popular Cache Paths¶
Shows which MCQ answer choices are most commonly taken by users:
```sql
SELECT
  content_item_id,
  user_message,
  hit_count,
  LEFT(response_content, 100) AS response_preview
FROM llm_response_cache
ORDER BY hit_count DESC
LIMIT 20;
```
Cache Tree Size per Topic¶
```sql
SELECT
  content_item_id,
  COUNT(*) AS total_nodes,
  SUM(hit_count) AS total_hits,
  COUNT(*) FILTER (WHERE parent_id IS NULL) AS root_nodes
FROM llm_response_cache
GROUP BY content_item_id
ORDER BY total_hits DESC;
```
Estimated Cost Savings¶
Assuming ~$0.01 per LLM call (approximate for Claude Haiku-class models):
```sql
SELECT
  COUNT(*) FILTER (WHERE meta->>'source' = 'cache') AS cache_hits,
  COUNT(*) FILTER (WHERE meta->>'source' = 'llm') AS llm_calls,
  ROUND(COUNT(*) FILTER (WHERE meta->>'source' = 'cache') * 0.01, 2) AS estimated_savings_usd
FROM chat_messages
WHERE role = 'assistant' AND meta ? 'source';
```
Langfuse Impact¶
| Scenario | Langfuse Trace | chat_messages.meta |
|---|---|---|
| Cache hit | No trace (LLM not called) | source: "cache" |
| Cache miss | Full trace (normal) | source: "llm" |
| Diverged session | Full trace (normal) | source: "llm" |
Langfuse continues to track all actual LLM calls accurately. Cache hits are invisible to Langfuse by design — they involve no LLM call, so there's nothing to trace. Use the SQL queries above for cache analytics.
The `llm_cache_hits_total` and `llm_cache_misses_total` counters are defined in `app/core/metrics.py`; cache hit rate is `hits / (hits + misses)` over a given window.
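Once Prometheus is scraping the backend, the hit rate could be computed with a PromQL expression along these lines (a sketch using the counter names above; the 1h window is arbitrary):

```promql
sum(rate(llm_cache_hits_total[1h]))
  /
(sum(rate(llm_cache_hits_total[1h])) + sum(rate(llm_cache_misses_total[1h])))
```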
Activating Prometheus + Grafana¶
Nothing currently scrapes these metrics. To activate full Prometheus + Grafana:
- Add Prometheus and Grafana containers to `docker-compose.yml`
- Configure Prometheus to scrape `backend:9898/metrics`
- Configure Grafana to use Prometheus as a data source
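The scrape step would look roughly like this in `prometheus.yml` (a sketch; the job name and interval are assumptions, and Prometheus defaults to the `/metrics` path):

```yaml
scrape_configs:
  - job_name: ai-tutor-backend
    scrape_interval: 15s
    static_configs:
      - targets: ["backend:9898"]
```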
The catch: Prometheus + Grafana adds roughly 500 MB of RAM usage. The t3.small has 2 GB total, and the current stack already uses most of it. Defer this until upgrading to a larger instance.
Troubleshooting Quick Reference¶
For quick diagnostic checks, see below. For a full log of specific errors encountered in production (with root causes and fixes), see Troubleshooting & Known Issues.
Sentry not receiving events¶
- Check the DSN is set: `echo $SENTRY_DSN` on the EC2 instance (or `echo $NEXT_PUBLIC_SENTRY_DSN` for the frontend)
- Check Sentry initialized: look for "Sentry initialized" in backend startup logs (`docker compose --env-file .env.production logs backend`)
- Check free tier quota: Sentry dashboard → Settings → Subscription
- Test manually: trigger a deliberate error and check the Sentry Issues page within 1 minute
CloudWatch Logs not appearing¶
- Check the IAM role is attached: AWS Console → EC2 → your instance → Security tab → IAM Role
- Check the container is running: `docker compose --env-file .env.production ps`
- Check the Docker log driver: `docker inspect <container_name> | grep -A5 LogConfig` (should show `awslogs`)
- Check the region matches: `awslogs-region` in `docker-compose.yml` should be `us-east-2`
CloudWatch Alarms stuck in "Insufficient Data"¶
Normal for the first 10 minutes after creation. EC2 basic monitoring reports metrics every 5 minutes. If it persists beyond 15 minutes, check that the EC2 instance ID in the alarm matches the actual instance.
Cost¶
Everything runs within free tiers:
| Service | Free Tier | Typical Usage |
|---|---|---|
| Sentry | 5,000 errors/month | Well under for a small app |
| CloudWatch Logs | 5 GB/month ingestion | 1-3 GB typical |
| CloudWatch Alarms | 10 alarms free | Using 2 |
| UptimeRobot | 50 monitors, 5-min interval | Using 4 |
| Total | | $0/month |
Future Improvements¶
| Improvement | When to Do It | Cost |
|---|---|---|
| Add SNS email alerts to CloudWatch alarms | When you want proactive CPU/status alerts | $0 |
| Upgrade Sentry to Team plan | When you hit 5K errors/month or need Slack alerts | $26/mo |
| Add Prometheus + Grafana | When you upgrade to a larger instance | $0 (self-hosted) |
| Add OpenTelemetry SDK | When you add multiple microservices | $0 |
| Add Grafana Cloud | When you want managed metrics without self-hosting | $0-50/mo |