
Troubleshooting & Known Issues

A running log of errors encountered in this project, with root causes and fixes. Add new entries as they come up so the team has a searchable reference.

How to Use This Document

  • Search by error message or symptom to find relevant entries
  • Each entry includes: symptoms, root cause, fix, and prevention
  • Entries are grouped by category and ordered newest-first within each group

Frontend Performance

Admin pages slow to load despite fast backend responses

Date: 2026-04-01 Severity: Medium — poor UX on all admin pages, student pages unaffected

Symptoms:

  • Admin pages (/admin/analytics, /admin/courses, /admin/prompts) take 2-5 seconds to show content
  • Student dashboard loads instantly
  • Network tab shows backend APIs respond in <200ms, but a full-screen "Loading..." overlay blocks the page
  • Server-side pre-fetched data is in the HTML but hidden behind the overlay

Root cause: Three layered issues, listed in order of user-visible impact:

  1. LoadingLink component in AdminSidebar — Every admin sidebar link used <LoadingLink> instead of <Link>. On click, it called e.preventDefault(), showed a full-screen BookLoader portal (z-[9999]), then called router.push(). The overlay stayed until the old component unmounted, hiding the server-rendered content that arrived in ~200ms.

  2. Split-brain auth (JS cookie vs localStorage) — Auth tokens were stored in localStorage (for client-side API calls) and synced to a browser cookie via document.cookie (for server components). This JS-set cookie was unreliable — not updated on token refresh/rotation — causing server-side serverAuthFetch to get 401s on staging. Server components fell back to empty data, forcing client-side re-fetching across the WAN.

  3. force-dynamic on all admin pages — Admin pages used export const dynamic = 'force-dynamic' which disabled all Next.js caching. Every navigation triggered a full server render. Student dashboard used revalidate: 3600 (ISR) and served from cache.

Fix:

  1. Replaced LoadingLink with standard <Link> in AdminSidebar.tsx and AdminBreadcrumbs.tsx. Next.js handles route transitions natively without blocking overlays.

  2. Backend now sets httpOnly cookies on login/refresh via Set-Cookie header — single source of truth, always in sync. Auth dependency (get_current_user) accepts Bearer token OR cookie (backward compatible). Removed JS document.cookie writes from authClient.ts. Added credentials: "include" to all client-side fetch calls.

  3. Replaced force-dynamic with per-fetch caching — admin pages use { revalidate: 30, tags: ["admin:courses"] } for 30-second ISR caching. serverAuthFetch no longer defaults to cache: "no-store" when revalidate/tags are set.
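The caching decision in item 3 can be sketched as follows. This is an illustrative model, not the project's actual serverAuthFetch code; buildFetchInit, CacheOpts, and FetchInit are invented names:

```typescript
// Hypothetical sketch: a server-side fetch helper that defaults to
// no-store ONLY when the caller provides no caching hints.
type CacheOpts = { revalidate?: number; tags?: string[] };
type FetchInit = { cache?: "no-store"; next?: CacheOpts };

function buildFetchInit(opts: CacheOpts = {}): FetchInit {
  const { revalidate, tags } = opts;
  // No caching hints: keep the old behavior and bypass the cache entirely.
  if (revalidate === undefined && tags === undefined) {
    return { cache: "no-store" };
  }
  // Caching hints present: let Next.js cache this fetch (ISR-style).
  return { next: { revalidate, tags } };
}

// An admin page opts in per fetch:
const init = buildFetchInit({ revalidate: 30, tags: ["admin:courses"] });
// init → { next: { revalidate: 30, tags: ["admin:courses"] } }
```

The key design point is that caching is opt-in per fetch, so one page can mix cached and uncached requests instead of forcing the whole route dynamic.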

Additional fixes applied:

  • Converted all 9 admin pages to hybrid Server + Client Component pattern (server pre-fetches data, passes to client component as props)
  • Fixed N+1 query in _per_course_breakdown() — 4 batch queries instead of 4×N
  • Added 5-min TTL in-memory cache on analytics endpoints
  • Removed <RoleGuard> from admin client components (layout already checks auth server-side)
  • Fixed Docker .next/cache permissions (EACCES: permission denied, mkdir '/app/.next/cache')
  • Fixed slowapi crash on auth endpoints (missing response: Response parameter)

Prevention:

  • Never use LoadingLink / full-screen overlay loaders for navigation in layouts where server-rendered content should be immediately visible. Use standard <Link> and let Next.js handle transitions.
  • Use backend-set httpOnly cookies for auth — never rely on document.cookie for security-critical tokens.
  • Prefer per-fetch caching with tags (revalidate: 30, tags: [...]) over blanket force-dynamic on admin pages. Invalidate with revalidateTag() on mutations.
  • Always test page performance from the user's geographic location (or simulate with network throttling) — not just local dev.
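The cookie-auth pattern above amounts to a thin client-side wrapper: once the backend sets httpOnly cookies, the client never touches document.cookie and only needs to opt in to sending cookies. A minimal sketch — withAuthCredentials is a hypothetical name, not the project's authClient.ts API:

```typescript
// Hypothetical sketch: the client never reads or writes the auth cookie;
// it only ensures the browser sends it with every API call.
type Init = {
  method?: string;
  credentials?: "include" | "omit" | "same-origin";
  headers?: Record<string, string>;
};

function withAuthCredentials(init: Init = {}): Init {
  // The httpOnly cookie is attached by the browser automatically;
  // credentials: "include" ensures it is sent even cross-origin.
  return { ...init, credentials: "include" };
}

// Usage: fetch("/api/v1/courses", withAuthCredentials({ method: "GET" }));
```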


Infrastructure & EC2

EC2 instance unresponsive (SSH and HTTP timeout) during Docker builds

Date: 2026-02-24 Severity: High — full outage during deploys

Symptoms:

  • SSH: Connection timed out during banner exchange
  • HTTP: requests hang indefinitely, curl times out
  • AWS Console shows instance as "running" with status checks "ok"
  • CloudWatch shows sustained 50-95% CPU for 30+ minutes

Root cause: The t3.small (2 vCPU, 2GB RAM) had no swap configured. When Docker builds the Next.js frontend (especially with Sentry source maps), it consumes nearly all available memory. The OOM pressure starves the SSH daemon and nginx of CPU/memory, making the instance completely unreachable even though the kernel is still alive.

Fix:

# Add 2GB swap (one-time setup)
sudo dd if=/dev/zero of=/swapfile bs=128M count=16
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make persistent across reboots
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab

Prevention: Always configure swap on memory-constrained instances. The long-term fix is moving Docker builds to GitHub Actions (see Deployment Guide - Phase 1).


AWS_REGION mismatch — containers fail to start with CloudWatch Logs

Date: 2026-02-24 Severity: High — backend, frontend, and nginx all fail to start

Symptoms:

  • docker compose up -d shows containers as "Created" but not "Up"
  • Error in Docker output:

failed to create task for container: failed to initialize logging driver:
failed to create Cloudwatch log stream: ... AccessDeniedException:
User: arn:aws:sts::...:assumed-role/ai-tutor-staging-ec2-role/...
is not authorized to perform: logs:CreateLogStream on resource:
arn:aws:logs:eu-west-2:...:log-group:/ai-tutor/frontend:log-stream:...
  • Note the region in the ARN (eu-west-2) doesn't match where the infrastructure lives (us-east-2)

Root cause: .env.production had AWS_REGION=eu-west-2 but all infrastructure (EC2 instance, IAM role, CloudWatch Log Groups) was created in us-east-2. The Docker awslogs driver tried to create log streams in eu-west-2, where the IAM policy doesn't grant permissions.

The docker-compose.yml uses awslogs-region: ${AWS_REGION:-us-east-2}, so the env var overrode the default.

Fix:

# On EC2, edit the production env file
sed -i 's/AWS_REGION=eu-west-2/AWS_REGION=us-east-2/' ~/ai-tutor-backend/deploy/.env.production

# Restart containers
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d

Prevention: When setting up a new environment, verify AWS_REGION in .env.production matches the region where Terraform provisioned the infrastructure. The docker-compose default (us-east-2) is correct for our setup.


Docker awslogs-stream-prefix not supported

Date: 2026-02-19 Severity: High — containers fail to start

Symptoms:

  • docker compose up -d fails
  • Error mentions awslogs-stream-prefix as an unsupported option

Root cause: awslogs-stream-prefix is an ECS task-definition option, not a Docker logging-driver option. The plain Docker awslogs driver accepts options like awslogs-group, awslogs-stream, and awslogs-region, but not awslogs-stream-prefix, so the daemon rejects the logging configuration regardless of Docker version.

Fix: Remove awslogs-stream-prefix from all logging blocks in deploy/docker-compose.yml. Log streams will use container IDs instead of friendly prefixes.

Prevention: Validate Compose logging options against the Docker awslogs driver documentation before deploying. Don't assume options from ECS task-definition examples work with the plain awslogs driver — the two accept different option sets.


Terraform wants to destroy and recreate EC2 instance (AMI drift)

Date: 2026-02-19 Severity: Critical — would destroy the production instance and all data

Symptoms:

  • terraform plan shows the EC2 instance will be destroyed and recreated
  • The change is on the ami attribute: a newer Amazon Linux AMI was found

Root cause: The Terraform config uses data.aws_ami.amazon_linux with most_recent = true. When AWS publishes a new AMI, Terraform sees the AMI ID has changed and plans to replace the instance.

Fix: Added lifecycle rule to aws_instance.app in terraform/main.tf:

lifecycle {
  ignore_changes = [ami]
}

Prevention: Always use ignore_changes = [ami] on long-lived EC2 instances that use most_recent AMI lookups. Alternatively, pin the AMI ID directly.


ECS agent container auto-starting and consuming resources

Date: 2026-02-24 Severity: Low — wastes memory but doesn't break anything

Symptoms:

  • docker ps -a shows an ecs-agent container (image: amazon/amazon-ecs-agent:latest)
  • Container keeps restarting on every instance boot

Root cause: The Amazon Linux 2023 AMI comes pre-configured with the ECS agent for use with Amazon ECS. Since we're using Docker Compose (not ECS), this agent is unnecessary but starts automatically.

Fix:

sudo systemctl stop ecs 2>/dev/null
sudo systemctl disable ecs 2>/dev/null
docker stop ecs-agent && docker rm ecs-agent
docker rmi amazon/amazon-ecs-agent:latest

Prevention: After launching a new EC2 instance from the Amazon Linux AMI, disable the ECS agent if you're not using ECS.


LLM Response Cache

Cache not hitting — every request calls the LLM

Date: 2026-02-25 Severity: Low — functionality correct, but no cost savings

Symptoms:

  • chat_messages.meta->>'source' is always "llm", never "cache"
  • llm_response_cache table is empty or not growing
  • LLM costs not decreasing after caching was deployed

Possible causes and fixes:

  1. Session is DIVERGED: If any user message in the session was free-form text or a follow-up, cache_state flips to "DIVERGED" permanently. Check: SELECT cache_state FROM chat_sessions WHERE id = <session_id>;

  2. Prompt version or model changed: Cache keys include prompt_version and llm_model_id. After an admin updates a prompt, the old cache tree is missed (by design). New tree grows from the first user on the new version.

  3. Content item has no prompt_name: If content_items.prompt_name is NULL, prompt resolution returns version 0 and the cache lookup key won't match future requests that resolve a real prompt. Fix: Ensure content items have prompt_name set.

  4. User messages not normalized correctly: MCQ answers must be JSON with questionId and answer fields. If the frontend sends a different format, normalize_for_cache() returns None (uncacheable). Check the raw user_message column in llm_response_cache to see what's being cached.
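As a sanity check for cause 4, this sketch shows the MCQ message shape the normalizer expects. The field names (questionId, answer) come from the entry above; the builder function itself is illustrative, not project code:

```typescript
// Sketch: the cache normalizer expects MCQ answers as JSON with exactly
// these fields; any other shape is treated as free-form (uncacheable).
type McqAnswer = { questionId: string; answer: string };

function buildMcqMessage(questionId: string, answer: string): string {
  const payload: McqAnswer = { questionId, answer };
  return JSON.stringify(payload);
}

// buildMcqMessage("q-42", "B") → '{"questionId":"q-42","answer":"B"}'
```

If the frontend builds this payload in one place, a format drift that silently disables caching becomes much harder to introduce.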

Diagnostic query:

-- Check cache state of recent sessions
SELECT id, cache_state, cache_node_id, created_at
FROM chat_sessions
ORDER BY created_at DESC
LIMIT 20;

-- Check if cache tree exists for a topic
SELECT content_item_id, COUNT(*) AS nodes, SUM(hit_count) AS total_hits
FROM llm_response_cache
GROUP BY content_item_id;


Cache serving stale/wrong responses after prompt update

Date: 2026-02-25 Severity: N/A — this cannot happen by design

Explanation: The cache key includes prompt_version and llm_model_id. When a prompt is updated to a new version, the resolved version number changes, so all cache lookups naturally miss. A new tree grows for the new version. Old entries sit unused until cleanup_stale_cache.py removes them.

If you suspect stale responses, verify:

-- Check which prompt version the cache was built with
SELECT DISTINCT prompt_version, llm_model_id FROM llm_response_cache WHERE content_item_id = <id>;

-- Compare with current production prompt version
SELECT version FROM prompts WHERE name = '<prompt_name>' AND 'production' = ANY(labels);


llm_response_cache table growing too large

Date: 2026-02-25 Severity: Low — doesn't affect correctness, may affect DB performance over time

Symptoms:

  • Table has many rows with hit_count = 0
  • Multiple prompt versions cached for the same topic

Fix:

# Run cleanup to remove stale cache entries
cd ai-tutor-backend
python scripts/cleanup_stale_cache.py --dry-run  # preview first
python scripts/cleanup_stale_cache.py              # then delete

Prevention: Set up a weekly cron job to run cleanup_stale_cache.py. Consider adding a TTL-based cleanup (e.g., delete nodes older than 90 days with 0 hits) if the table grows beyond expectations.


Nginx crash loop — unknown directive "gzip_level"

Date: 2026-04-01 Severity: High — site completely unreachable, nginx restarts every 60s

Symptoms:

  • docker compose ps shows nginx as Restarting (1) Less than a second ago
  • Site returns HTTP 000 (connection refused)
  • Backend and frontend containers are healthy
  • Nginx logs show:

[emerg] 1#1: unknown directive "gzip_level" in /etc/nginx/conf.d/default.conf:49

Root cause: gzip_level 6; was added to deploy/nginx/nginx.conf but the correct nginx directive is gzip_comp_level. The typo was introduced in commit fc886f1 on main but didn't reach staging until later, so the bug was dormant until the next staging merge.

Fix:

# In deploy/nginx/nginx.conf, change:
gzip_level 6;
# To:
gzip_comp_level 6;

Prevention: Test nginx config changes with nginx -t before committing. Consider adding a CI step that validates the nginx config syntax (e.g., docker run --rm -v ./nginx.conf:/etc/nginx/conf.d/default.conf:ro nginx:alpine nginx -t).


Application

Frontend double API prefix: /api/v1/api/v1/...

Date: 2026-02-24 (observed, pre-existing) Severity: Medium — API calls fail with ERR_INVALID_URL

Symptoms:

  • Frontend logs show: Error fetching courses: TypeError: Failed to parse URL from /api/v1/api/v1/courses/?is_active=true
  • The path /api/v1/ appears twice

Root cause: The API base URL configuration in the frontend already includes /api/v1/, and individual fetch calls also prepend /api/v1/. The prefix gets duplicated.

Fix: Pending — needs investigation in the frontend lib/ directory to find where the base URL is configured and where individual API calls add the prefix. One of the two should be removed.

Prevention: Define the API base URL in one place and use it consistently. Avoid concatenating path prefixes in multiple layers.
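One way to enforce the single-source rule: a URL builder that owns the prefix and rejects already-prefixed paths, so the double prefix fails loudly at the call site instead of producing a broken URL. apiUrl and API_BASE are hypothetical names for illustration:

```typescript
// Sketch: one module owns the API prefix; callers pass only resource paths.
const API_BASE = "/api/v1";

function apiUrl(path: string): string {
  const clean = path.startsWith("/") ? path : `/${path}`;
  // Catch the exact bug from this entry: a caller prepending the prefix again.
  if (clean.startsWith(`${API_BASE}/`)) {
    throw new Error(`apiUrl: path already contains ${API_BASE}: ${path}`);
  }
  return `${API_BASE}${clean}`;
}

// apiUrl("/courses/?is_active=true") → "/api/v1/courses/?is_active=true"
```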


Deployment

SSH timeouts during docker compose up --build

Date: 2026-02-19, 2026-02-24 Severity: Medium — can't monitor deploy progress, but deploy continues

Symptoms:

  • SSH session drops with Connection timed out during banner exchange during a Docker build
  • The build continues on the EC2 instance even after SSH disconnects (Docker daemon runs independently)

Root cause: The Next.js Docker build (especially with Sentry source map generation) consumes nearly all CPU and memory on the t3.small. The SSH daemon can't complete the handshake within the timeout.

Workarounds:

  1. Use nohup to detach the build from the SSH session:

cd ~/ai-tutor-backend/deploy
nohup docker compose --env-file .env.production up -d --build > /tmp/deploy.log 2>&1 &
  2. Monitor via CloudWatch instead of SSH — watch CPU utilization; when it drops below 10%, the build is done.

  3. Add swap (see the swap entry above) to reduce OOM pressure and keep SSH responsive.

Long-term fix: Move Docker builds to GitHub Actions so EC2 only pulls pre-built images.


Instance force-stop takes 3-5 minutes when under heavy load

Date: 2026-02-24 Severity: Low — just slow, not dangerous

Symptoms:

  • aws ec2 stop-instances returns immediately but instance stays in "stopping" state for 3-5 minutes
  • Even the --force flag doesn't speed it up significantly

Root cause: The instance is under heavy CPU/memory load (e.g., during a Docker build). The OS takes time to gracefully shut down processes. AWS eventually force-kills after a timeout.

Workaround: Just wait. Check status with:

aws ec2 describe-instances --instance-ids <id> --query 'Reservations[0].Instances[0].State.Name' --output text --region us-east-2


Template for New Entries

Copy this template when adding a new entry:

### Short description of the error

**Date**: YYYY-MM-DD
**Severity**: Critical / High / Medium / Low

**Symptoms**:
- What you see (error messages, behavior)

**Root cause**: Why it happens.

**Fix**:
How to resolve it (commands, code changes).

**Prevention**: How to avoid it in the future.