Monitoring & Alerting
The monitoring stack
| Tool | Purpose | Who uses it |
|---|---|---|
| AWS CloudWatch | Structured JSON logs from all EB instances | Developers, on-call |
| AWS X-Ray | Distributed request tracing | Developers |
| PostHog | Product analytics / feature usage | Product team |
| Error models (Errors, ErrAPI, Err400) | Per-agency error deduplication table | Developers, support |
| Cron metrics | Job success/failure tracking | Developers |
| External API metrics | Third-party API latency and errors | Developers |
CloudWatch — structured logs
All logs from flaskapp.py and cronapp.py are shipped to CloudWatch as structured JSON via watchtower.
Finding logs for a specific request
Every request has a request_id (UUID4). To find all logs for a failing request:
- Get the X-Request-ID from the failing HTTP response (browser DevTools → Network → response headers)
- Open CloudWatch → Log Groups → find the EB environment’s log group
- Filter events by the request_id value
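The same lookup can be scripted. The sketch below is a minimal illustration, not the project's own tooling: it assumes each log event's message is the JSON object itself, so a CloudWatch Logs JSON filter pattern on `$.request_id` matches it. The helper names `request_filter_pattern` and `find_request_logs` are hypothetical, and `client` is expected to behave like `boto3.client("logs")`.

```python
import uuid


def request_filter_pattern(request_id: str) -> str:
    """Build a CloudWatch Logs JSON filter pattern matching one request_id."""
    # Round-trip through uuid.UUID so malformed input fails before any API call.
    canonical = str(uuid.UUID(request_id))
    return f'{{ $.request_id = "{canonical}" }}'


def find_request_logs(client, log_group: str, request_id: str) -> list:
    """Collect the message of every event logged for one request.

    `client` is assumed to expose filter_log_events like boto3.client("logs").
    """
    kwargs = {
        "logGroupName": log_group,
        "filterPattern": request_filter_pattern(request_id),
    }
    messages = []
    while True:
        page = client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        # Results are paginated; pass the token back to fetch the next page.
        kwargs["nextToken"] = token
```

Passing the client in as an argument keeps the function easy to exercise with a stub; in practice you would likely also pass `startTime` and `endTime` to narrow the search window.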
Key log fields
{
  "level": "ERROR",
  "request_id": "550e8400-...",
  "endpoint": "admin_views.case_detail",
  "status": 500,
  "duration_ms": 234,
  "query_count": 12,
  "query_total_ms": 89,
  "agency_db": "agency_production_db",
  "user_type": "admin"
}
CloudWatch circuit breaker
If logs stop appearing in CloudWatch, the circuit breaker may have tripped. After 3 consecutive CloudWatch failures, the logger suspends CloudWatch shipping for 60 seconds. Logs fall back to console (EB instance system logs) during this time.
Check the EB instance’s system logs directly if CloudWatch logs are missing.
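To make the behaviour concrete, here is a minimal sketch of a circuit breaker with the same thresholds (3 consecutive failures, 60-second suspension, console fallback). It is an illustration only; the class `ShippingCircuitBreaker` and the function `emit` are hypothetical names, and the real logger may be structured differently.

```python
import time


class ShippingCircuitBreaker:
    """Suspends shipping after repeated consecutive failures."""

    def __init__(self, max_failures=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._open_until = None  # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if shipping to CloudWatch may be attempted."""
        if self._open_until is None:
            return True
        if self._clock() >= self._open_until:
            # Cooldown elapsed: close the breaker and try CloudWatch again.
            self._open_until = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        # Any success resets the consecutive-failure counter.
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._open_until = self._clock() + self.cooldown_seconds


def emit(breaker, ship, fallback, record) -> str:
    """Ship a record, falling back to the console while the breaker is open."""
    if breaker.allow():
        try:
            ship(record)
            breaker.record_success()
            return "cloudwatch"
        except Exception:
            breaker.record_failure()
    fallback(record)
    return "console"
```

The injectable `clock` exists only so the cooldown can be tested without waiting a real minute.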
AWS X-Ray — distributed tracing
X-Ray provides latency breakdowns for each request — how much time was spent in Python vs. database vs. external APIs.
To use X-Ray:
- Open the AWS X-Ray console
- Find the service map (shows Orchid → MySQL → external services)
- Filter by time range or trace ID
- Drill into a specific trace to see the waterfall of operations
X-Ray trace IDs can be correlated with CloudWatch logs by searching for the trace ID string.
Error models — per-agency error tracking
The Errors, ErrAPI, and Err400 models in each agency’s database store deduplicated error records. These are queryable directly:
-- Recent errors in an agency's DB
SELECT message, count, last_seen_at, path
FROM err_api
ORDER BY last_seen_at DESC
LIMIT 20;
Content-based deduplication means each unique error message appears only once (with an incrementing count and updated last_seen_at). High-count errors are your highest-priority bugs.
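The deduplication idea can be sketched in a few lines. This is an in-memory illustration under assumptions, not the models' actual code: it keys records on a SHA-256 hash of the message, and the names `ErrorRecord` and `record_error` are hypothetical. The real models may key on different content or normalise messages first.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErrorRecord:
    message: str
    path: str
    count: int
    last_seen_at: datetime


def record_error(store: dict, message: str, path: str, now=None) -> ErrorRecord:
    """Insert a new error or bump the counter of an identical earlier one."""
    now = now or datetime.now(timezone.utc)
    # Identical messages hash to the same key, so they share one record.
    key = hashlib.sha256(message.encode("utf-8")).hexdigest()
    existing = store.get(key)
    if existing is None:
        store[key] = ErrorRecord(message, path, 1, now)
    else:
        # Only count and last_seen_at change; the first path is kept.
        existing.count += 1
        existing.last_seen_at = now
    return store[key]
```

With this scheme, a message that embeds a variable value (an ID, a timestamp) produces a new record each time, which is worth keeping in mind when reading counts.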
Watching for slow pages
The query_count and query_total_ms fields in every request log tell you immediately whether a slow page is a Python problem or a database problem:
- High duration_ms, low query_total_ms → Python is slow (computation, external API call)
- High query_count (40+) → N+1 query problem; add eager loading
- High query_total_ms with normal query_count → individual queries are slow; check MySQL indexes
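These rules are simple enough to apply mechanically to a parsed log record. The sketch below is one possible encoding: only the 40-query threshold comes from the list above, while `slow_ms`, `db_share`, and the function name `diagnose_slow_request` are assumptions chosen for illustration.

```python
def diagnose_slow_request(log: dict, slow_ms: int = 1000,
                          n_plus_one_queries: int = 40,
                          db_share: float = 0.5) -> str:
    """Classify a request log record using the rules of thumb above.

    slow_ms and db_share are illustrative defaults, not documented values.
    """
    duration = log["duration_ms"]
    if duration < slow_ms:
        return "ok"
    if log["query_count"] >= n_plus_one_queries:
        # Many queries on one page usually means an N+1 pattern.
        return "n_plus_one"
    if log["query_total_ms"] >= db_share * duration:
        # Few queries, but they account for most of the request time.
        return "slow_queries"
    # The database is a small share of the time: look at the Python side.
    return "python_slow"
```

For example, `diagnose_slow_request({"duration_ms": 3000, "query_count": 8, "query_total_ms": 100})` returns `"python_slow"`.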
Checking email sync health
For email sync issues:
-- Check sync status for all admins in an agency
SELECT es.admin_id, es.sync_type, es.status, es.last_synced_at, dr.reason AS disconnect_reason
FROM email_sync es
LEFT JOIN email_sync_disconnect_reason dr ON es.disconnect_reason_id = dr.id
ORDER BY es.last_synced_at DESC;
A row with status = 'error' or status = 'disconnected' carries a disconnect_reason that tells you why sync stopped.
Checking cron job health
-- Recent cron job outcomes (in cron metrics table)
SELECT job_name, db_name, status, duration_ms, error_message, run_at
FROM cron_metrics
ORDER BY run_at DESC
LIMIT 50;
Alerting
No automated alerts are configured yet. When adding them, start with the highest-value signals:
- CloudWatch Alarms can trigger on metric thresholds (error rate, latency p95)
- Consider adding alarms for: 5xx error rate > 1%, request latency p95 > 5s, cron job failure rate > 10%
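As a starting point, the p95 latency alarm could be defined along these lines. This is a sketch under assumptions: the namespace `Orchid/Requests` and metric `RequestDurationMs` are hypothetical and would first have to be published (for example through a metric filter on duration_ms), and `latency_p95_alarm` is an illustrative helper. The keys are parameters of CloudWatch's `put_metric_alarm` call.

```python
def latency_p95_alarm(environment: str, threshold_ms: float = 5000.0) -> dict:
    """Build keyword arguments for a p95 latency alarm (5 s by default)."""
    return {
        "AlarmName": f"{environment}-latency-p95",
        "Namespace": "Orchid/Requests",      # hypothetical custom namespace
        "MetricName": "RequestDurationMs",   # hypothetical custom metric
        "ExtendedStatistic": "p95",          # percentile statistic
        "Period": 300,                       # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet periods do not alarm
    }
```

The resulting dictionary would be passed as `boto3.client("cloudwatch").put_metric_alarm(**latency_p95_alarm("my-env"))`; an `AlarmActions` entry (such as an SNS topic ARN) is still needed before anyone is actually notified.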
Emergency access to logs
If the CloudWatch console is unavailable, logs are also written to the EB instance’s system log:
# SSH into the EB instance (requires AWS key and correct security group)
eb ssh [environment-name]
# View recent app logs
tail -n 100 /var/log/app-1.log
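Once on the instance, the request_id lookup can be reproduced against the local file. The sketch below assumes each relevant line contains one JSON object, possibly after a plain-text prefix; that format is an assumption about the fallback output, and `lines_for_request` is a hypothetical helper.

```python
import json


def lines_for_request(lines, request_id: str) -> list:
    """Return the parsed JSON records whose request_id matches."""
    matches = []
    for line in lines:
        # Skip any plain-text prefix (timestamp, logger name) before the JSON.
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        if isinstance(record, dict) and record.get("request_id") == request_id:
            matches.append(record)
    return matches
```

Usage would look like `with open("/var/log/app-1.log") as fh: records = lines_for_request(fh, request_id)`.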