Observability Guide
LuckyPlans includes a full observability stack covering metrics, logs, and traces. This guide covers how to use it day-to-day for development and debugging.
Quick Reference
| Tool | Local URL | K8s Access | Purpose |
|---|---|---|---|
| Grafana | http://localhost:3002 | kubectl -n monitoring port-forward svc/grafana 3002:3000 | Dashboards, log/trace exploration |
| Prometheus | http://localhost:9090 | kubectl -n monitoring port-forward svc/prometheus 9090:9090 | Metrics queries, target health |
| Loki | http://localhost:3100 | (via Grafana) | Log aggregation |
| Tempo | http://localhost:3200 | (via Grafana) | Distributed traces |
Production host routing:
- Grafana is exposed at https://admin.luckyplans.xyz/grafana
- App/API traffic is split across app.luckyplans.xyz and api.luckyplans.xyz
- Legacy API remains available at https://v0.api.luckyplans.xyz through k3d ingress passthrough
Getting Started
Local development
The observability stack starts automatically with docker compose up -d:
docker compose up -d
pnpm dev
Open http://localhost:3002 — Grafana is pre-configured with anonymous admin access, datasources, and dashboards. No login required.
K8s (local k3d or prod)
The observability stack is deployed via Helm:
# Included automatically in full deploy:
pnpm deploy:local
# Or deploy observability only:
helm upgrade --install luckyplans-observability infrastructure/helm/observability \
--namespace monitoring --create-namespace \
-f infrastructure/helm/observability/values.yaml
# Access Grafana:
kubectl -n monitoring port-forward svc/grafana 3002:3000
Grafana Dashboards
RED Metrics Dashboard
Open Grafana → LuckyPlans folder → RED Metrics.
This dashboard applies the RED method (Rate, Errors, Duration) to each service:
| Panel | What it shows | What to look for |
|---|---|---|
| Request Rate by Service | Requests per second per service | Sudden drops = service may be down |
| Error Rate by Service (5xx) | Percentage of 5xx responses | Yellow > 1%, Red > 5% — investigate immediately |
| Request Duration (p50/p95/p99) | Latency percentiles | p95 > 2s triggers an alert |
| Request Rate by Route (Top 10) | Busiest endpoints | Identify hot paths for optimization |
| Slowest Endpoints (p95) | Table of slowest routes | Find performance bottlenecks |
Use the service dropdown at the top to filter by api-gateway or service-core.
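The color thresholds in the Error Rate panel can be expressed as a small helper, which is handy when the same rule is needed in a script or test. This is a minimal sketch: the function names are hypothetical, and only the 1% and 5% thresholds come from the table above.

```typescript
// Thresholds taken from the "Error Rate by Service (5xx)" row above.
type ErrorRateStatus = "ok" | "yellow" | "red";

// Hypothetical helper: classify a 5xx percentage the way the panel colors it.
function classifyErrorRate(percent5xx: number): ErrorRateStatus {
  if (percent5xx > 5) return "red";
  if (percent5xx > 1) return "yellow";
  return "ok";
}

// Hypothetical helper: turn raw request counts into a percentage.
// Returns 0 when there is no traffic, to avoid dividing by zero.
function errorPercent(totalRequests: number, errors5xx: number): number {
  return totalRequests === 0 ? 0 : (errors5xx / totalRequests) * 100;
}

// 30 errors out of 2000 requests is roughly 1.5%, which lands in the yellow band.
const gatewayErrorStatus = classifyErrorRate(errorPercent(2000, 30));
```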
Infrastructure Dashboard
Open Grafana → LuckyPlans folder → Infrastructure.
| Panel | What it shows |
|---|---|
| Service Targets Up/Down | Whether Prometheus can reach each scrape target |
| OTel Collector — Exported Spans/Metrics/Logs | Throughput of the telemetry pipeline |
| OTel Collector — Dropped Telemetry | If the collector is dropping data (memory pressure) |
Viewing Traces
Traces show the full journey of a request across services.
Find traces for a request
- Open Grafana → Explore (compass icon in sidebar)
- Select Tempo datasource (top-left dropdown)
- Use Search tab:
  - Service Name: api-gateway or service-core
  - Span Name: e.g., GET /graphql, HTTP POST
  - Min Duration: e.g., 100ms to find slow requests
- Click Run query
- Click any trace row to see the full span waterfall
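The same search can be written as a TraceQL query and pasted into the TraceQL tab of the Tempo datasource. The sketch below builds such a query from the three Search fields. `buildTraceQL` and `TraceSearch` are hypothetical names. Note that `duration` in TraceQL filters on span duration, which may differ slightly from how the Search tab applies Min Duration.

```typescript
// Hypothetical shape mirroring the three Search tab fields above.
interface TraceSearch {
  serviceName?: string;
  spanName?: string;
  minDuration?: string; // e.g. "100ms"
}

// Builds a TraceQL span selector. Values are inserted as-is, so this sketch
// assumes they contain no double quotes.
function buildTraceQL(search: TraceSearch): string {
  const conditions: string[] = [];
  if (search.serviceName) {
    conditions.push(`resource.service.name = "${search.serviceName}"`);
  }
  if (search.spanName) {
    conditions.push(`name = "${search.spanName}"`);
  }
  if (search.minDuration) {
    conditions.push(`duration > ${search.minDuration}`);
  }
  // An empty selector matches every span.
  return conditions.length === 0 ? "{}" : `{ ${conditions.join(" && ")} }`;
}

const slowGatewaySpans = buildTraceQL({
  serviceName: "api-gateway",
  minDuration: "100ms",
});
```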
Reading a trace
A typical GraphQL request trace looks like:
HTTP POST /graphql [api-gateway, 45ms]
├── graphql.resolve getItems [api-gateway, 38ms]
│ ├── ioredis: PUBLISH [api-gateway, 2ms]
│ └── microservice.getItems [service-core, 30ms]
│ └── ioredis: GET/SET [service-core, 1ms]
└── graphql.serialize [api-gateway, 1ms]
- Parent span: The HTTP request hitting the gateway
- GraphQL resolver span: Auto-instrumented by OTel
- Redis spans: Auto-instrumented by ioredis instrumentation
- Microservice span: Created by TraceContextExtractor from the propagated W3C trace context
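The W3C trace context travels as a `traceparent` value of the form `version-traceId-parentSpanId-flags`. The sketch below parses that value, which helps when checking by hand whether a message carried a valid context. It illustrates the header format only and is not the actual TraceContextExtractor implementation. `parseTraceparent` is a hypothetical name.

```typescript
interface TraceParent {
  version: string;
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the flags byte
}

// Parses a version-00 style traceparent value; returns null if it is malformed.
function parseTraceparent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim()
  );
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the W3C specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}
```

If the value fails to parse on the service-core side, the microservice span starts a new trace instead of joining the gateway's trace.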
Trace-to-log correlation
Click the Logs for this span button on any span to jump to Loki with the trace ID pre-filtered. This shows all log lines emitted during that span’s execution.
Querying Logs
Basic log queries
- Open Grafana → Explore
- Select Loki datasource
- Use LogQL queries:
# All logs from api-gateway
{service_name="api-gateway"}
# Error-level logs only
{service_name="api-gateway"} | json | level="error"
# Logs containing a specific trace ID
{service_name=~"api-gateway|service-core"} |= "abc123def456"
# Logs from a specific route
{service_name="api-gateway"} | json | req_url=~"/graphql.*"
# Count errors per minute
count_over_time({service_name="api-gateway"} | json | level="error" [1m])
Log format
Every log line is structured JSON with these fields:
| Field | Description |
|---|---|
| level | Log level (debug, info, warn, error) |
| msg | Log message |
| traceId | OpenTelemetry trace ID (for correlation) |
| spanId | OpenTelemetry span ID |
| req.method | HTTP method (api-gateway only) |
| req.url | Request URL (api-gateway only) |
| res.statusCode | Response status code (api-gateway only) |
| responseTime | Request duration in ms (api-gateway only) |
| context | NestJS context (class name) |
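When working with exported log lines outside Grafana, the fields above map naturally onto a typed shape. This sketch assumes that `req.*` and `res.*` are nested objects in the raw JSON, which matches the `req_url` label the LogQL `json` stage produces. The interface and function names are hypothetical.

```typescript
// Assumed raw shape of one log line, based on the field table above.
interface GatewayLogLine {
  level: string;
  msg: string;
  traceId?: string;
  spanId?: string;
  req?: { method?: string; url?: string };
  res?: { statusCode?: number };
  responseTime?: number;
  context?: string;
}

// Produces a one-line summary, or null when the line is not a JSON object.
function summarizeLogLine(rawLine: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawLine);
  } catch {
    return null; // not JSON, e.g. a startup banner
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const log = parsed as GatewayLogLine;
  const parts: string[] = [`[${log.level}]`, log.msg];
  if (log.req && log.req.method && log.req.url) {
    parts.push(`${log.req.method} ${log.req.url}`);
  }
  if (log.res && log.res.statusCode !== undefined) {
    parts.push(String(log.res.statusCode));
  }
  if (log.responseTime !== undefined) parts.push(`${log.responseTime}ms`);
  if (log.traceId) parts.push(`trace=${log.traceId}`);
  return parts.join(" ");
}
```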
Log-to-trace correlation
When viewing a log line that contains a traceId, click the TraceID link to jump to the full trace in Tempo.
Observing Edge Connectivity and Upgrades
Edge fleet operations are visible through worker state and upgrade status transitions.
What to watch
- lastSeenAt freshness for each worker (connectivity heartbeat health)
- version vs targetVersion drift (pending upgrade intent)
- upgradeStatus transitions:
  - UPGRADE_PENDING
  - DOWNLOADING
  - VERIFYING
  - RESTARTING
  - SUCCEEDED / FAILED
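When reading a sequence of upgradeStatus values from logs, it can help to check them against the expected order. The sketch below assumes the statuses progress linearly in the order listed above and that FAILED can follow any non-terminal status. That transition map is an assumption for illustration, not the gateway's actual rules.

```typescript
type UpgradeStatus =
  | "UPGRADE_PENDING"
  | "DOWNLOADING"
  | "VERIFYING"
  | "RESTARTING"
  | "SUCCEEDED"
  | "FAILED";

// Assumed transition map: linear progress, with FAILED reachable from any
// status that is not already terminal.
const ALLOWED_NEXT: Record<UpgradeStatus, UpgradeStatus[]> = {
  UPGRADE_PENDING: ["DOWNLOADING", "FAILED"],
  DOWNLOADING: ["VERIFYING", "FAILED"],
  VERIFYING: ["RESTARTING", "FAILED"],
  RESTARTING: ["SUCCEEDED", "FAILED"],
  SUCCEEDED: [],
  FAILED: [],
};

// Returns the index of the first status that does not follow from the one
// before it, or -1 when the whole sequence matches the assumed map.
function firstUnexpectedTransition(observed: UpgradeStatus[]): number {
  for (let i = 1; i < observed.length; i++) {
    if (ALLOWED_NEXT[observed[i - 1]].indexOf(observed[i]) === -1) return i;
  }
  return -1;
}
```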
Useful log queries for edge flows
# Registration and connectivity endpoint logs
{service_name="api-gateway"} |= "/internal/edges/register"
{service_name="api-gateway"} |= "/internal/edges/connectivity"
# Upgrade lifecycle status logs
{service_name="api-gateway"} |= "upgradeStatus"
# Credential audit events
{service_name="api-gateway"} |= "audit credential."
Common edge operational signals
- Stale lastSeenAt: worker likely offline, blocked, or unable to reach gateway.
- targetVersion set but status never leaves UPGRADE_PENDING: worker is likely busy (idle-only upgrade policy) or connectivity is unstable.
- Frequent FAILED upgrades: verify release URLs, checksum/signature metadata, and edge host permissions.
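The signals above can be derived mechanically from a worker record. This is a minimal sketch with several assumptions: the `EdgeWorker` shape, the ISO timestamp format of lastSeenAt, the signal labels, and the staleness threshold passed by the caller are all hypothetical, since this guide does not define them.

```typescript
// Assumed worker record shape; only the field names come from this guide.
interface EdgeWorker {
  lastSeenAt: string; // assumed to be an ISO 8601 timestamp
  version: string;
  targetVersion?: string;
  upgradeStatus?: string;
}

// Returns the operational signals that apply to one worker.
// staleAfterMs is a caller-chosen threshold, not a documented value.
function edgeSignals(worker: EdgeWorker, nowMs: number, staleAfterMs: number): string[] {
  const signals: string[] = [];
  if (nowMs - Date.parse(worker.lastSeenAt) > staleAfterMs) {
    signals.push("stale-heartbeat");
  }
  if (worker.targetVersion && worker.targetVersion !== worker.version) {
    signals.push("version-drift");
    if (worker.upgradeStatus === "UPGRADE_PENDING") {
      signals.push("upgrade-pending");
    }
  }
  if (worker.upgradeStatus === "FAILED") {
    signals.push("upgrade-failed");
  }
  return signals;
}
```

A worker that reports both stale-heartbeat and upgrade-pending points at connectivity rather than the idle-only upgrade policy.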
Querying Metrics
Using Prometheus directly
Open http://localhost:9090 (or port-forward in K8s) for the Prometheus UI.
Useful PromQL queries
# Request rate per service (last 5 minutes)
sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
# Error rate (5xx) as percentage
100 * sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m])) by (service_name)
/
sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
# p95 latency per service
histogram_quantile(0.95,
sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name)
)
# p99 latency for a specific route
histogram_quantile(0.99,
sum(rate(http_server_request_duration_seconds_bucket{http_route="/graphql"}[5m])) by (le)
)
# Redis connected clients
redis_connected_clients
# Redis memory usage
redis_memory_used_bytes
# OTel Collector: exported spans per second
rate(otelcol_exporter_sent_spans_total[5m])
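The p95 and p99 queries rely on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea for classic histograms to show why results depend on bucket boundaries. It is a simplified illustration, not Prometheus's exact implementation. The bucket values in the usage example are made up.

```typescript
interface HistogramBucket {
  le: number;    // upper bound; Infinity stands for the "+Inf" bucket
  count: number; // cumulative count of observations <= le
}

// Simplified quantile estimate over cumulative buckets.
// This sketch only accepts 0 < q <= 1 and returns NaN otherwise.
function histogramQuantile(q: number, buckets: HistogramBucket[]): number {
  if (!(q > 0 && q <= 1)) return NaN;
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length < 2 || sorted[sorted.length - 1].le !== Infinity) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank.
  const rank = q * total;
  let i = 0;
  while (i < sorted.length - 1 && sorted[i].count < rank) i++;

  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  // Interpolate linearly between the bucket's lower and upper bounds.
  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const countBelow = i === 0 ? 0 : sorted[i - 1].count;
  const countInBucket = sorted[i].count - countBelow;
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Made-up data: 100 requests, 90 of them at or below 0.5s, all at or below 2s.
const exampleP95 = histogramQuantile(0.95, [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 2, count: 100 },
  { le: Infinity, count: 100 },
]);
```

In this example the estimate lands between 0.5s and 2s purely because of where the bucket edges sit, so a wide bucket near the 2s alert threshold makes the p95 value correspondingly coarse.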
Checking scrape targets
Open http://localhost:9090/targets to verify all targets are UP:
| Target | Expected |
|---|---|
| otel-collector | UP: app metrics from NestJS services |
| prometheus | UP: self-monitoring |
| redis | UP: Redis metrics (K8s only, via redis-exporter) |
| keycloak | UP: Keycloak Micrometer metrics (K8s only) |
Alerts
Prometheus evaluates these alert rules (defined in the Prometheus configmap):
| Alert | Condition | Severity | What to do |
|---|---|---|---|
| HighErrorRate | >5% of requests return 5xx for 5 min | warning | Check api-gateway logs for errors, look at traces for failing requests |
| HighLatency | p95 latency >2s for 5 min | warning | Find slow traces in Tempo, check for Redis connection issues |
| RedisDown | Redis exporter unreachable for 1 min | critical | Check Redis pod status, restart if needed |
| ServiceTargetDown | Any scrape target down for 2 min | critical | Check if the OTel Collector or exporters are running |
View active alerts: http://localhost:9090/alerts
Note: Alert notifications (email, Slack, PagerDuty) are not yet configured. Alerts currently only show in the Prometheus UI.
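The "for 5 min" part of each condition means an alert first shows as pending and only becomes firing once the condition has held continuously for that long. The sketch below models that behavior to explain why a brief spike appears as pending on the alerts page and then disappears. The type and function names are hypothetical, and it simplifies how Prometheus actually tracks alert state.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// One rule evaluation: when it ran and whether the condition was true.
interface AlertEvaluation {
  timestampMs: number;
  conditionMet: boolean;
}

// Walks evaluations in time order and returns the state after the last one.
// Any evaluation where the condition is false resets the timer.
function alertState(evaluations: AlertEvaluation[], forMs: number): AlertState {
  let activeSinceMs: number | null = null;
  let state: AlertState = "inactive";
  for (let i = 0; i < evaluations.length; i++) {
    const evaluation = evaluations[i];
    if (!evaluation.conditionMet) {
      activeSinceMs = null;
      state = "inactive";
      continue;
    }
    if (activeSinceMs === null) activeSinceMs = evaluation.timestampMs;
    state = evaluation.timestampMs - activeSinceMs >= forMs ? "firing" : "pending";
  }
  return state;
}
```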
Debugging Common Issues
"No data" in Grafana dashboards
- Check OTel Collector: docker compose logs otel-collector (local) or kubectl -n monitoring logs deployment/otel-collector (K8s)
- Check Prometheus targets: http://localhost:9090/targets (all should be UP)
- Check NestJS apps are sending telemetry: Look for OTEL_EXPORTER_OTLP_ENDPOINT in the app’s env. Default is http://localhost:4317 (local dev)
- Make some requests: The dashboards need traffic. Hit http://localhost:3000/graphql with a query
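For the telemetry endpoint check, it is worth knowing whether an app is silently falling back to the default. The sketch below resolves the endpoint from an environment map and reports whether the default was used. `resolveOtlpEndpoint` is a hypothetical helper, and only the variable name and default URL come from this guide.

```typescript
// Default documented above for local development.
const DEFAULT_OTLP_ENDPOINT = "http://localhost:4317";

interface ResolvedEndpoint {
  endpoint: string;
  usingDefault: boolean;
}

// Hypothetical helper: pass in process.env (or any key/value map).
// Empty or whitespace-only values are treated as unset.
function resolveOtlpEndpoint(env: Record<string, string | undefined>): ResolvedEndpoint {
  const configured = (env["OTEL_EXPORTER_OTLP_ENDPOINT"] || "").trim();
  return configured
    ? { endpoint: configured, usingDefault: false }
    : { endpoint: DEFAULT_OTLP_ENDPOINT, usingDefault: true };
}
```

Inside a K8s pod, `usingDefault: true` usually explains missing data, because localhost there refers to the pod itself rather than the collector.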
Traces missing the service-core span
The Redis trace propagation requires injectTraceContext() in the gateway resolver and TraceContextExtractor in service-core. Check:
- Gateway resolver uses injectTraceContext() when calling ClientProxy.send()
- service-core AppModule has { provide: APP_INTERCEPTOR, useClass: TraceContextExtractor }
Logs not showing traceId
The Pino mixin() function reads the active OTel span. If traceId is missing:
- Verify instrument.ts is imported as the first line of main.ts
- Check that OTEL_EXPORTER_OTLP_ENDPOINT is set (the SDK starts even without a collector, but auto-instrumentation needs the SDK initialized)
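Pino's `mixin` option is a function whose return value is merged into every log line, which is how traceId and spanId end up in the output. The sketch below shows the idea with the span lookup passed in as a function, so it runs without any OTel packages. In the real app that lookup would presumably read the active span via `@opentelemetry/api`. The names `createTraceMixin` and `SpanContextLike` are hypothetical, and this is not the project's actual mixin.

```typescript
// Minimal stand-in for the parts of an OTel span context used here.
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// Builds a mixin function suitable for Pino's `mixin` option.
// getActiveSpanContext is injected so the sketch stays dependency-free.
function createTraceMixin(
  getActiveSpanContext: () => SpanContextLike | undefined
): () => Record<string, string> {
  return function mixin(): Record<string, string> {
    const context = getActiveSpanContext();
    // With no active span, nothing is added and the log line has no traceId.
    return context ? { traceId: context.traceId, spanId: context.spanId } : {};
  };
}
```

This makes the failure mode visible: if the SDK was not initialized before the logger ran, the lookup returns nothing and the fields are simply absent rather than logged as errors.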
OTel Collector dropping data
Check the Infrastructure dashboard — “Dropped Telemetry” panel. If data is being dropped:
- Increase memory_limiter.limit_mib in the collector config
- Increase the collector’s memory limits in values.yaml
High memory usage
The observability stack uses ~900Mi total. If the k3s node is constrained:
- Use --no-observability flag: ./deploy-local.sh --no-observability
- Or reduce retention: edit values.yaml → prometheus.retention, loki.retention, tempo.retention
Architecture Reference
┌─────────────────────────────────────────────────────┐
│ luckyplans namespace │
│ │
│ api-gateway ──OTLP──┐ service-core ──OTLP──┐ │
│ │ │ │
└───────────────────────┼────────────────────────┼────┘
│ │
┌───────────────────────┼────────────────────────┼────┐
│ monitoring namespace │
│ ▼ ▼ │
│ ┌──────────────┐ │
│ │ OTel Collector│ │
│ └──┬───┬───┬───┘ │
│ │ │ │ │
│ metrics │ │ │ traces │
│ ▼ │ ▼ │
│ Prometheus│ Tempo │
│ │ │ │ │
│ │ │logs│ │
│ │ ▼ │ │
│ │ Loki │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ │
│ │ Grafana │ ← dashboards │
│ └──────────────┘ │
│ │
│ Promtail (DaemonSet) ────────▶ Loki (pod logs) │
│ Redis Exporter ──────────────▶ Prometheus │
│ │
└─────────────────────────────────────────────────────┘
See ADR: Full-Stack Observability for the architectural decision and alternatives considered.