Troubleshooting & Known Issues¶
A running log of errors encountered in this project, with root causes and fixes. Add new entries as they come up so the team has a searchable reference.
How to Use This Document¶
- Search by error message or symptom to find relevant entries
- Each entry includes: symptoms, root cause, fix, and prevention
- Entries are grouped by category and ordered newest-first within each group
Frontend Performance¶
Admin pages slow to load despite fast backend responses¶
Date: 2026-04-01 Severity: Medium — poor UX on all admin pages, student pages unaffected
Symptoms:
- Admin pages (/admin/analytics, /admin/courses, /admin/prompts) take 2-5 seconds to show content
- Student dashboard loads instantly
- Network tab shows backend APIs respond in <200ms, but a full-screen "Loading..." overlay blocks the page
- Server-side pre-fetched data is in the HTML but hidden behind the overlay
Root cause: Three layered issues, listed in order of user-visible impact:
- LoadingLink component in AdminSidebar — Every admin sidebar link used <LoadingLink> instead of <Link>. On click, it called e.preventDefault(), showed a full-screen BookLoader portal (z-[9999]), then called router.push(). The overlay stayed until the old component unmounted, hiding the server-rendered content that arrived in ~200ms.
- Split-brain auth (JS cookie vs localStorage) — Auth tokens were stored in localStorage (for client-side API calls) and synced to a browser cookie via document.cookie (for server components). This JS-set cookie was unreliable — not updated on token refresh/rotation — causing server-side serverAuthFetch to get 401s on staging. Server components fell back to empty data, forcing client-side re-fetching across the WAN.
- force-dynamic on all admin pages — Admin pages used export const dynamic = 'force-dynamic', which disabled all Next.js caching. Every navigation triggered a full server render. The student dashboard used revalidate: 3600 (ISR) and served from cache.
Fix:
- Replaced LoadingLink with standard <Link> in AdminSidebar.tsx and AdminBreadcrumbs.tsx. Next.js handles route transitions natively without blocking overlays.
- Backend now sets httpOnly cookies on login/refresh via the Set-Cookie header — a single source of truth, always in sync. The auth dependency (get_current_user) accepts a Bearer token OR cookie (backward compatible). Removed JS document.cookie writes from authClient.ts. Added credentials: "include" to all client-side fetch calls.
- Replaced force-dynamic with per-fetch caching — admin pages use { revalidate: 30, tags: ["admin:courses"] } for 30-second ISR caching. serverAuthFetch no longer defaults to cache: "no-store" when revalidate/tags are set.
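The caching decision in the last fix can be sketched as a small helper (a sketch only — buildFetchInit is a hypothetical name, and serverAuthFetch's real signature may differ):

```typescript
// Hypothetical helper illustrating the serverAuthFetch caching rule:
// only force "no-store" when the caller supplies no revalidate/tags.
type CacheOpts = { revalidate?: number; tags?: string[] };
type FetchInit = { cache?: "no-store"; next?: CacheOpts };

export function buildFetchInit(opts: CacheOpts): FetchInit {
  if (opts.revalidate !== undefined || (opts.tags && opts.tags.length > 0)) {
    // Next.js ISR: cache the response, revalidate after N seconds, and
    // let revalidateTag("admin:courses") invalidate it on mutations.
    return { next: opts };
  }
  // No caching hints: keep the old uncached behavior.
  return { cache: "no-store" };
}
```

For example, buildFetchInit({ revalidate: 30, tags: ["admin:courses"] }) produces a fetch init that Next.js will cache for 30 seconds, while buildFetchInit({}) stays uncached.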
Additional fixes applied:
- Converted all 9 admin pages to hybrid Server + Client Component pattern (server pre-fetches data, passes to client component as props)
- Fixed N+1 query in _per_course_breakdown() — 4 batch queries instead of 4×N
- Added 5-min TTL in-memory cache on analytics endpoints
- Removed <RoleGuard> from admin client components (layout already checks auth server-side)
- Fixed Docker .next/cache permissions (EACCES: permission denied, mkdir '/app/.next/cache')
- Fixed slowapi crash on auth endpoints (missing response: Response parameter)
Prevention:
- Never use LoadingLink / full-screen overlay loaders for navigation in layouts where server-rendered content should be immediately visible. Use standard <Link> and let Next.js handle transitions.
- Use backend-set httpOnly cookies for auth — never rely on document.cookie for security-critical tokens.
- Prefer per-fetch caching with tags (revalidate: 30, tags: [...]) over blanket force-dynamic on admin pages. Invalidate with revalidateTag() on mutations.
- Always test page performance from the user's geographic location (or simulate with Network throttling) — not just local dev.
Infrastructure & EC2¶
EC2 instance unresponsive (SSH and HTTP timeout) during Docker builds¶
Date: 2026-02-24 Severity: High — full outage during deploys
Symptoms:
- SSH: Connection timed out during banner exchange
- HTTP: requests hang indefinitely, curl times out
- AWS Console shows instance as "running" with status checks "ok"
- CloudWatch shows sustained 50-95% CPU for 30+ minutes
Root cause: The t3.small (2 vCPU, 2GB RAM) had no swap configured. When Docker builds the Next.js frontend (especially with Sentry source maps), it consumes nearly all available memory. The OOM pressure starves the SSH daemon and nginx of CPU/memory, making the instance completely unreachable even though the kernel is still alive.
Fix:
# Add 2GB swap (one-time setup)
sudo dd if=/dev/zero of=/swapfile bs=128M count=16
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make persistent across reboots
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab
Prevention: Always configure swap on memory-constrained instances. The long-term fix is moving Docker builds to GitHub Actions (see Deployment Guide - Phase 1).
AWS_REGION mismatch — containers fail to start with CloudWatch Logs¶
Date: 2026-02-24 Severity: High — backend, frontend, and nginx all fail to start
Symptoms:
- docker compose up -d shows containers as "Created" but not "Up"
- Error in Docker output:
failed to create task for container: failed to initialize logging driver:
failed to create Cloudwatch log stream: ... AccessDeniedException:
User: arn:aws:sts::...:assumed-role/ai-tutor-staging-ec2-role/...
is not authorized to perform: logs:CreateLogStream on resource:
arn:aws:logs:eu-west-2:...:log-group:/ai-tutor/frontend:log-stream:...
- The region in the error (eu-west-2) doesn't match where the infrastructure lives (us-east-2)
Root cause: .env.production had AWS_REGION=eu-west-2 but all infrastructure (EC2 instance, IAM role, CloudWatch Log Groups) was created in us-east-2. The Docker awslogs driver tried to create log streams in eu-west-2, where the IAM policy doesn't grant permissions.
The docker-compose.yml uses awslogs-region: ${AWS_REGION:-us-east-2}, so the env var overrode the default.
Fix:
# On EC2, edit the production env file
sed -i 's/AWS_REGION=eu-west-2/AWS_REGION=us-east-2/' ~/ai-tutor-backend/deploy/.env.production
# Restart containers
cd ~/ai-tutor-backend/deploy
docker compose --env-file .env.production up -d
Prevention: When setting up a new environment, verify AWS_REGION in .env.production matches the region where Terraform provisioned the infrastructure. The docker-compose default (us-east-2) is correct for our setup.
Docker awslogs-stream-prefix not supported¶
Date: 2026-02-19 Severity: High — containers fail to start
Symptoms:
- docker compose up -d fails
- Error mentions awslogs-stream-prefix as an unsupported option
Root cause: The Docker version on the EC2 instance (installed via Amazon Linux 2023's default package) doesn't support the awslogs-stream-prefix option in the logging configuration.
Fix: Remove awslogs-stream-prefix from all logging blocks in deploy/docker-compose.yml. Log streams will use container IDs instead of friendly prefixes.
Prevention: Test Docker Compose config on the target Docker version before deploying. Don't assume all awslogs driver options are available on all Docker versions.
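A logging block without the unsupported option might look like this (log group name and region are illustrative — match them to your compose file):

```yaml
logging:
  driver: awslogs
  options:
    awslogs-region: us-east-2
    awslogs-group: /ai-tutor/backend
    # awslogs-stream-prefix removed — not supported by this Docker version;
    # streams will be named by container ID instead
```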
Terraform wants to destroy and recreate EC2 instance (AMI drift)¶
Date: 2026-02-19 Severity: Critical — would destroy the production instance and all data
Symptoms:
- terraform plan shows the EC2 instance will be destroyed and recreated
- The change is on the ami attribute: a newer Amazon Linux AMI was found
Root cause: The Terraform config uses data.aws_ami.amazon_linux with most_recent = true. When AWS publishes a new AMI, Terraform sees the AMI ID has changed and plans to replace the instance.
Fix: Added a lifecycle rule to aws_instance.app in terraform/main.tf so Terraform ignores AMI drift instead of replacing the instance.
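A minimal sketch of that lifecycle block (other resource arguments elided; aws_instance.app is the resource name from terraform/main.tf):

```hcl
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.small"
  # ... other arguments unchanged ...

  lifecycle {
    # Don't replace the instance just because AWS published a newer AMI
    ignore_changes = [ami]
  }
}
```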
Prevention: Always use ignore_changes = [ami] on long-lived EC2 instances that use most_recent AMI lookups. Alternatively, pin the AMI ID directly.
ECS agent container auto-starting and consuming resources¶
Date: 2026-02-24 Severity: Low — wastes memory but doesn't break anything
Symptoms:
- docker ps -a shows an ecs-agent container (image: amazon/amazon-ecs-agent:latest)
- Container keeps restarting on every instance boot
Root cause: The Amazon Linux 2023 AMI comes pre-configured with the ECS agent for use with Amazon ECS. Since we're using Docker Compose (not ECS), this agent is unnecessary but starts automatically.
Fix:
sudo systemctl stop ecs 2>/dev/null
sudo systemctl disable ecs 2>/dev/null
docker stop ecs-agent && docker rm ecs-agent
docker rmi amazon/amazon-ecs-agent:latest
Prevention: After launching a new EC2 instance from the Amazon Linux AMI, disable the ECS agent if you're not using ECS.
LLM Response Cache¶
Cache not hitting — every request calls the LLM¶
Date: 2026-02-25 Severity: Low — functionality correct, but no cost savings
Symptoms:
- chat_messages.meta->>'source' is always "llm", never "cache"
- llm_response_cache table is empty or not growing
- LLM costs not decreasing after caching was deployed
Possible causes and fixes:
- Session is DIVERGED: If any user message in the session was free-form text or a follow-up, cache_state flips to "DIVERGED" permanently. Check: SELECT cache_state FROM chat_sessions WHERE id = <session_id>;
- Prompt version or model changed: Cache keys include prompt_version and llm_model_id. After an admin updates a prompt, the old cache tree is missed (by design). A new tree grows from the first user on the new version.
- Content item has no prompt_name: If content_items.prompt_name is NULL, prompt resolution returns version 0 and the cache lookup key won't match future requests that resolve a real prompt. Fix: ensure content items have prompt_name set.
- User messages not normalized correctly: MCQ answers must be JSON with questionId and answer fields. If the frontend sends a different format, normalize_for_cache() returns None (uncacheable). Check the raw user_message column in llm_response_cache to see what's being cached.
Diagnostic query:
-- Check cache state of recent sessions
SELECT id, cache_state, cache_node_id, created_at
FROM chat_sessions
ORDER BY created_at DESC
LIMIT 20;
-- Check if cache tree exists for a topic
SELECT content_item_id, COUNT(*) AS nodes, SUM(hit_count) AS total_hits
FROM llm_response_cache
GROUP BY content_item_id;
Cache serving stale/wrong responses after prompt update¶
Date: 2026-02-25 Severity: N/A — this cannot happen by design
Explanation: The cache key includes prompt_version and llm_model_id. When a prompt is updated to a new version, the resolved version number changes, so all cache lookups naturally miss. A new tree grows for the new version. Old entries sit unused until cleanup_stale_cache.py removes them.
If you suspect stale responses, verify:
-- Check which prompt version the cache was built with
SELECT DISTINCT prompt_version, llm_model_id FROM llm_response_cache WHERE content_item_id = <id>;
-- Compare with current production prompt version
SELECT version FROM prompts WHERE name = '<prompt_name>' AND 'production' = ANY(labels);
llm_response_cache table growing too large¶
Date: 2026-02-25 Severity: Low — doesn't affect correctness, may affect DB performance over time
Symptoms:
- Table has many rows with hit_count = 0
- Multiple prompt versions cached for the same topic
Fix:
# Run cleanup to remove stale cache entries
cd ai-tutor-backend
python scripts/cleanup_stale_cache.py --dry-run # preview first
python scripts/cleanup_stale_cache.py # then delete
Prevention: Set up a weekly cron job to run cleanup_stale_cache.py. Consider adding a TTL-based cleanup (e.g., delete nodes older than 90 days with 0 hits) if the table grows beyond expectations.
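A weekly cron entry might look like this (paths and user are illustrative — adjust to where the repo lives on the instance):

```shell
# /etc/cron.d/cache-cleanup (illustrative): run cleanup every Sunday at 03:00
0 3 * * 0 ec2-user cd /home/ec2-user/ai-tutor-backend && python scripts/cleanup_stale_cache.py >> /var/log/cache-cleanup.log 2>&1
```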
Nginx¶
Nginx crash loop — unknown directive "gzip_level"¶
Date: 2026-04-01 Severity: High — site completely unreachable, nginx restarts every 60s
Symptoms:
- docker compose ps shows nginx as Restarting (1) Less than a second ago
- Site returns HTTP 000 (connection refused)
- Backend and frontend containers are healthy
- Nginx logs show an unknown directive "gzip_level" error on startup, and the container exits
Root cause: gzip_level 6; was added to deploy/nginx/nginx.conf but the correct nginx directive is gzip_comp_level. The typo was introduced in commit fc886f1 on main but didn't reach staging until later, so the bug was dormant until the next staging merge.
Fix: Rename gzip_level to gzip_comp_level in deploy/nginx/nginx.conf, then rebuild and restart the nginx container.
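The corrected directive in deploy/nginx/nginx.conf (compression level 6 is from the original change):

```nginx
# gzip_level is not a real nginx directive; gzip_comp_level is
gzip on;
gzip_comp_level 6;
```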
Prevention: Test nginx config changes with nginx -t before committing. Consider adding a CI step that validates the nginx config syntax (e.g., docker run --rm -v ./nginx.conf:/etc/nginx/conf.d/default.conf:ro nginx:alpine nginx -t).
Application¶
Frontend double API prefix: /api/v1/api/v1/...¶
Date: 2026-02-24 (observed, pre-existing)
Severity: Medium — API calls fail with ERR_INVALID_URL
Symptoms:
- Frontend logs show: Error fetching courses: TypeError: Failed to parse URL from /api/v1/api/v1/courses/?is_active=true
- The path /api/v1/ appears twice
Root cause: The API base URL configuration in the frontend already includes /api/v1/, and individual fetch calls also prepend /api/v1/. The prefix gets duplicated.
Fix: Pending — needs investigation in the frontend lib/ directory to find where the base URL is configured and where individual API calls add the prefix. One of the two should be removed.
Prevention: Define the API base URL in one place and use it consistently. Avoid concatenating path prefixes in multiple layers.
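One way to enforce the single-definition rule is a tiny URL builder that defensively strips a duplicated prefix (a sketch — buildApiUrl is a hypothetical helper, not existing frontend code):

```typescript
// Hypothetical helper: the ONE place the /api/v1 prefix is defined.
const API_BASE = "/api/v1";

export function buildApiUrl(path: string): string {
  // Strip a leading /api/v1 the caller accidentally included,
  // so the prefix can never appear twice.
  const cleaned = path.replace(/^\/api\/v1/, "");
  return API_BASE + (cleaned.startsWith("/") ? "" : "/") + cleaned;
}
```

With this in place, both buildApiUrl("/courses/") and buildApiUrl("/api/v1/courses/") resolve to /api/v1/courses/.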
Deployment¶
SSH timeouts during docker compose up --build¶
Date: 2026-02-19, 2026-02-24 Severity: Medium — can't monitor deploy progress, but deploy continues
Symptoms:
- SSH session drops with Connection timed out during banner exchange during a Docker build
- The build continues on the EC2 instance even after SSH disconnects (Docker daemon runs independently)
Root cause: The Next.js Docker build (especially with Sentry source map generation) consumes nearly all CPU and memory on the t3.small. The SSH daemon can't complete the handshake within the timeout.
Workarounds:
1. Use nohup to detach the build from the SSH session:
cd ~/ai-tutor-backend/deploy
nohup docker compose --env-file .env.production up -d --build > /tmp/deploy.log 2>&1 &
Long-term fix: Move Docker builds to GitHub Actions so EC2 only pulls pre-built images.
Instance force-stop takes 3-5 minutes when under heavy load¶
Date: 2026-02-24 Severity: Low — just slow, not dangerous
Symptoms:
- aws ec2 stop-instances returns immediately but instance stays in "stopping" state for 3-5 minutes
- Even --force flag doesn't speed it up significantly
Root cause: The instance is under heavy CPU/memory load (e.g., during a Docker build). The OS takes time to gracefully shut down processes. AWS eventually force-kills after a timeout.
Workaround: Just wait. Check status with:
aws ec2 describe-instances --instance-ids <id> --query 'Reservations[0].Instances[0].State.Name' --output text --region us-east-2
Template for New Entries¶
Copy this template when adding a new entry:
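A minimal template, matching the fields listed in "How to Use This Document":

```
Entry title — one-line summary of the problem
Date: YYYY-MM-DD Severity: Low/Medium/High/Critical — one-line impact summary
Symptoms:
- What you observed (exact error messages, behavior)
Root cause: Why it happened
Fix: What resolved it (commands or code where applicable)
Prevention: How to avoid it next time
```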