Monitoring & Alerting

The monitoring stack

Tool	Purpose	Who uses it
AWS CloudWatch	Structured JSON logs from all EB instances	Developers, on-call
AWS X-Ray	Distributed request tracing	Developers
PostHog	Product analytics / feature usage	Product team
Error models (`Errors`, `ErrAPI`, `Err400`)	Per-agency error deduplication table	Developers, support
Cron metrics	Job success/failure tracking	Developers
External API metrics	Third-party API latency and errors	Developers

CloudWatch — structured logs

All logs from flaskapp.py and cronapp.py are shipped to CloudWatch as structured JSON via watchtower.

Finding logs for a specific request

Every request has a request_id (UUID4). To find all logs for a failing request:

1. Get the X-Request-ID from the failing HTTP response (browser DevTools → Network → response headers)
2. Open CloudWatch → Log Groups → find the EB environment’s log group
3. Filter events by the request_id value
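The same lookup can be scripted. The sketch below is a minimal illustration, not part of the codebase: it assumes a boto3 CloudWatch Logs client is passed in (for example `boto3.client("logs")`), and the helper names and `log_group` value are hypothetical. It uses the CloudWatch Logs JSON filter pattern syntax to match the request_id field and follows `nextToken` until all pages are read.

```python
import time
from typing import Any, Dict, Iterator


def request_id_filter(request_id: str) -> str:
    """Build a CloudWatch Logs JSON filter pattern matching one request_id."""
    return '{ $.request_id = "%s" }' % request_id


def find_request_logs(logs_client: Any, log_group: str, request_id: str,
                      lookback_minutes: int = 60) -> Iterator[Dict[str, Any]]:
    """Yield every log event for request_id from the last lookback_minutes.

    logs_client is expected to behave like boto3.client("logs");
    log_group is whatever name the EB environment's log group has.
    """
    now_ms = int(time.time() * 1000)
    kwargs: Dict[str, Any] = {
        "logGroupName": log_group,
        "filterPattern": request_id_filter(request_id),
        "startTime": now_ms - lookback_minutes * 60 * 1000,
        "endTime": now_ms,
    }
    while True:
        page = logs_client.filter_log_events(**kwargs)
        yield from page.get("events", [])
        token = page.get("nextToken")
        if not token:
            break
        # Ask for the next page of results on the following call.
        kwargs["nextToken"] = token
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise against a stub without AWS credentials.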

Key log fields

{
  "level": "ERROR",
  "request_id": "550e8400-...",
  "endpoint": "admin_views.case_detail",
  "status": 500,
  "duration_ms": 234,
  "query_count": 12,
  "query_total_ms": 89,
  "agency_db": "agency_production_db",
  "user_type": "admin"
}

CloudWatch circuit breaker

If logs stop appearing in CloudWatch, the circuit breaker may have tripped. After 3 consecutive CloudWatch failures, the logger suspends CloudWatch shipping for 60 seconds. Logs fall back to console (EB instance system logs) during this time.

Check the EB instance’s system logs directly if CloudWatch logs are missing.
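To make the behaviour concrete, here is a minimal sketch of a circuit breaker with the thresholds described above (3 consecutive failures, 60-second suspension, console fallback). It is an illustration under stated assumptions, not the logger's actual code: the class name, the `ship` and `fallback` callables, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Optional


class ShippingCircuitBreaker:
    """Suspend remote log shipping for a cooldown after repeated failures."""

    def __init__(self, ship: Callable[[str], None],
                 fallback: Callable[[str], None] = print,
                 max_failures: int = 3, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ship = ship              # e.g. a function that sends to CloudWatch
        self.fallback = fallback      # e.g. write to console
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.open_until: Optional[float] = None

    def emit(self, line: str) -> None:
        if self.open_until is not None:
            if self.clock() < self.open_until:
                # Breaker is open: skip the remote call entirely.
                self.fallback(line)
                return
            # Cooldown has elapsed: close the breaker and try shipping again.
            self.open_until = None
            self.failures = 0
        try:
            self.ship(line)
            self.failures = 0         # any success resets the streak
        except Exception:
            self.failures += 1
            self.fallback(line)       # never lose the line
            if self.failures >= self.max_failures:
                self.open_until = self.clock() + self.cooldown_s
```

In this sketch every line that cannot be shipped still reaches the fallback, which mirrors why the EB instance system logs are the place to look during an outage.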

AWS X-Ray — distributed tracing

X-Ray provides latency breakdowns for each request — how much time was spent in Python vs. database vs. external APIs.

To use X-Ray:

1. Open the AWS X-Ray console
2. Find the service map (shows Orchid → MySQL → external services)
3. Filter by time range or trace ID
4. Drill into a specific trace to see the waterfall of operations

X-Ray trace IDs can be correlated with CloudWatch logs by searching for the trace ID string.

Error models — per-agency error tracking

The Errors, ErrAPI, and Err400 models in each agency’s database store deduplicated error records. These are queryable directly:

-- Recent errors in an agency's DB
SELECT message, count, last_seen_at, path
FROM err_api
ORDER BY last_seen_at DESC
LIMIT 20;

Content-based deduplication means each unique error message appears only once, with a count that increments and a last_seen_at that updates on every repeat. Errors with a high count are usually the best candidates to fix first.
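The idea behind content-based deduplication can be sketched in a few lines. This is an in-memory illustration only: the real models live in each agency's database, and the class names, the SHA-256 fingerprint, and the choice to key on the message text alone are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class ErrorRecord:
    message: str
    path: str
    count: int
    last_seen_at: datetime


class ErrorStore:
    """In-memory stand-in for a deduplicated error table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, ErrorRecord] = {}

    @staticmethod
    def fingerprint(message: str) -> str:
        # Assumption: identical message text means "the same error".
        return hashlib.sha256(message.encode("utf-8")).hexdigest()

    def record(self, message: str, path: str,
               now: Optional[datetime] = None) -> ErrorRecord:
        now = now or datetime.now(timezone.utc)
        key = self.fingerprint(message)
        row = self._rows.get(key)
        if row is None:
            # First sighting: create the single row for this message.
            row = ErrorRecord(message, path, 1, now)
            self._rows[key] = row
        else:
            # Repeat: bump the counter instead of inserting a new row.
            row.count += 1
            row.last_seen_at = now
        return row

    def top(self, n: int = 20) -> List[ErrorRecord]:
        """Return the n most frequent errors, highest count first."""
        return sorted(self._rows.values(),
                      key=lambda r: r.count, reverse=True)[:n]
```

Sorting by count, as `top` does, is the in-memory equivalent of ordering the SQL query by count rather than last_seen_at when hunting for the most frequent failures.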

Watching for slow pages

The query_count and query_total_ms fields in every request log tell you immediately whether a slow page is a Python problem or a database problem:

High duration_ms, low query_total_ms → Python is slow (computation, external API call)
High query_count (40+) → N+1 query problem — add eager loading
High query_total_ms with normal query_count → individual queries are slow — check MySQL indexes
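These rules are simple enough to apply mechanically to a parsed log record. The sketch below is one possible encoding: the 40-query threshold comes from the rule above, while the 1000 ms "slow" cutoff, the 50% database share, and the function and label names are assumptions chosen for illustration.

```python
from typing import Any, Dict

N_PLUS_ONE_QUERY_COUNT = 40   # from the rule above
SLOW_REQUEST_MS = 1000        # assumed cutoff for "high duration_ms"
DB_BOUND_SHARE = 0.5          # assumed: DB-bound if queries take half the time


def triage_slow_request(record: Dict[str, Any]) -> str:
    """Classify one structured request log record.

    Returns "n_plus_one", "slow_queries", "python_bound", or "ok".
    """
    duration = record.get("duration_ms", 0)
    queries = record.get("query_count", 0)
    query_ms = record.get("query_total_ms", 0)

    if queries >= N_PLUS_ONE_QUERY_COUNT:
        return "n_plus_one"       # many queries: look at eager loading
    if duration < SLOW_REQUEST_MS:
        return "ok"
    if query_ms >= duration * DB_BOUND_SHARE:
        return "slow_queries"     # few but slow queries: check indexes
    return "python_bound"         # time spent in Python or external calls


# The sample record from "Key log fields" is fast and makes few queries.
print(triage_slow_request(
    {"duration_ms": 234, "query_count": 12, "query_total_ms": 89}))
```

The query-count check runs first so that a page making many fast queries is still flagged even when its total duration looks acceptable.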

Checking email sync health

To diagnose email sync issues, start with the sync status of each admin:

-- Check sync status for all admins in an agency
SELECT es.admin_id, es.sync_type, es.status, es.last_synced_at,
       dr.reason as disconnect_reason
FROM email_sync es
LEFT JOIN email_sync_disconnect_reason dr ON es.disconnect_reason_id = dr.id
ORDER BY es.last_synced_at DESC;

A row with status = 'error' or status = 'disconnected' should have a disconnect_reason explaining why sync stopped.

Checking cron job health

-- Recent cron job outcomes (in cron metrics table)
SELECT job_name, db_name, status, duration_ms, error_message, run_at
FROM cron_metrics
ORDER BY run_at DESC
LIMIT 50;

Alerting

No automated alerts are currently configured. When adding them, start with the highest-value signals:

CloudWatch Alarms can trigger on metric thresholds (error rate, latency p95)
Consider adding alarms for: 5xx error rate > 1%, request latency p95 > 5s, cron job failure rate > 10%
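Before wiring up real alarms, the request thresholds above can be checked offline against a batch of parsed log records. The sketch below is only an illustration of what the 5xx-rate and p95 thresholds mean; it does not create CloudWatch Alarms, the function names are hypothetical, and the nearest-rank percentile is one of several common definitions. The cron failure-rate threshold is left out because the status values stored in cron_metrics are not specified here.

```python
from typing import Any, Dict, List, Sequence

MAX_5XX_RATE = 0.01        # 5xx error rate > 1%
MAX_P95_LATENCY_MS = 5000  # request latency p95 > 5s


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct is 1..100)."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def breached_thresholds(records: List[Dict[str, Any]]) -> List[str]:
    """Return the names of thresholds breached by a batch of request logs.

    Each record needs the "status" and "duration_ms" log fields.
    """
    breaches: List[str] = []
    if not records:
        return breaches
    errors = sum(1 for r in records if r["status"] >= 500)
    if errors / len(records) > MAX_5XX_RATE:
        breaches.append("5xx_error_rate")
    if percentile([r["duration_ms"] for r in records], 95) > MAX_P95_LATENCY_MS:
        breaches.append("latency_p95")
    return breaches
```

Running this over a recent window of logs gives a rough sense of how often each alarm would fire at the proposed thresholds, which helps tune them before anyone gets paged.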

Emergency access to logs

If the CloudWatch console is unavailable, logs are also written to the EB instance’s system log:

# SSH into the EB instance (requires AWS key and correct security group)
eb ssh [environment-name]

# View the last 100 lines of the app log
tail -n 100 /var/log/app-1.log