Operations Manual
This manual is the day-to-day reference for the team operating the Transform Platform. It covers every observability tool in the local stack: what it does, when to reach for it, and exactly what to type when something goes wrong.
The Observability Stack at a Glance
| Tool | URL | Credentials | Purpose |
|---|---|---|---|
| Prometheus | localhost:9090 | None | Metrics storage and alerting |
| Grafana | localhost:3001 | admin / admin | Metrics dashboards |
| Kibana | localhost:5601 | None | Structured log search |
| Jaeger | localhost:16686 | None | Distributed tracing |
| Kafka UI | localhost:8090 | None | Kafka topic browser |
| App Metrics | localhost:8080/actuator/prometheus | None | Raw Prometheus scrape endpoint |
| App Health | localhost:8080/actuator/health | None | Liveness / readiness probe |
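The table above doubles as a smoke-test checklist. A minimal sketch that probes each service, assuming the ports from the table plus each tool's stock status path (`/-/healthy` for Prometheus, `/api/health` for Grafana, `/api/status` for Kibana; these paths are assumptions, not taken from this manual):

```shell
#!/bin/sh
# Probe each service from the table above and print UP/DOWN per URL.
check_stack() {
  for url in \
    http://localhost:9090/-/healthy \
    http://localhost:3001/api/health \
    http://localhost:5601/api/status \
    http://localhost:16686/ \
    http://localhost:8080/actuator/health
  do
    # -f: treat HTTP errors as failures; short timeout so a dead port fails fast
    if curl -fsS --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      echo "UP   $url"
    else
      echo "DOWN $url"
    fi
  done
}

check_stack
```

Any `DOWN` line points you at the container to inspect first.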
When to Use What

```
Something is wrong: where do I look?
│
├─ App not responding / health check failing
│    └─ Start with: localhost:8080/actuator/health
│
├─ I need to see numbers (latency, error rate, memory)
│    └─ Grafana dashboard → Prometheus queries for detail
│
├─ I need to read log lines / find an error message
│    └─ Kibana → search by level, traceId, or keyword
│
├─ I need to trace one slow or failing request end-to-end
│    └─ Jaeger → find by traceId (copy it from the log)
│
├─ Kafka messages not flowing
│    └─ Kafka UI → inspect topic, consumer group lag
│
└─ Alert fired in Prometheus
     └─ Prometheus /alerts → Grafana panel for context → Kibana for logs
```
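For the "I need to see numbers" branch, Prometheus can also be queried from a terminal via its HTTP API, which is handy when Grafana is down. A sketch: the metric name `http_server_requests_seconds_count` matches the counter named elsewhere in this manual, but the exact PromQL expression here is illustrative, not prescribed by it:

```shell
#!/bin/sh
# POST a PromQL query to the Prometheus HTTP API; print a fallback message
# when Prometheus is not reachable (e.g. the stack is not started).
prom_query() {
  curl -fsS --max-time 2 http://localhost:9090/api/v1/query \
    --data-urlencode "query=$1" 2>/dev/null || echo 'prometheus unreachable'
}

# Per-second rate of HTTP 5xx responses over the last 5 minutes
prom_query 'sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))'
```

`--data-urlencode` handles the braces and quotes in PromQL, so no manual escaping is needed.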
The Three Signals and How They Connect
Every request handled by the platform emits all three signals. The log line and the span carry the same traceId; the metric records the request in aggregate:

```
HTTP Request arrives
        │
        ▼
CorrelationIdFilter → writes correlationId to MDC + response header
        │
        ▼
TracingMdcFilter → writes traceId + spanId to MDC (from OTel span)
        │
        ├── LOG LINE emitted   → Kibana             (contains traceId, correlationId, level, message)
        │
        ├── SPAN emitted       → Jaeger             (contains traceId, HTTP method, URI, duration)
        │
        └── METRIC incremented → Prometheus/Grafana (http_server_requests_seconds)
```
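You can watch the `CorrelationIdFilter` step from the outside by inspecting response headers on any app endpoint; the health endpoint is used here only because it is always available:

```shell
#!/bin/sh
# Dump response headers (-D -), discard the body, and keep just the
# correlation header; print "no response" when the app is down.
correlation_header() {
  hdr=$(curl -fsS -D - -o /dev/null --max-time 2 \
          http://localhost:8080/actuator/health 2>/dev/null \
        | grep -i '^X-Correlation-ID' || true)
  echo "${hdr:-no response}"
}

correlation_header
```

Each request should show a fresh UUID in `X-Correlation-ID`, matching the `correlationId` on its log lines.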
Typical debug workflow:
- Grafana shows a spike in HTTP 5xx errors → note the time window.
- Kibana: filter `level: "ERROR"` in that time window → copy a `traceId`.
- Jaeger: paste the `traceId` → see every span that made up that request.
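The copy-a-traceId step is easy to script once you have a log line in hand. A sketch on a sample line (illustrative values, same shape as the correlation-fields table later in this manual); `sed` is used so it works without `jq`:

```shell
#!/bin/sh
# Sample structured log line, shaped like the app's real output.
line='{"@timestamp":"2026-03-09T10:45:32.123Z","level":"ERROR","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","message":"Transform failed for specId=csv-to-json"}'

# Extract the traceId field from the JSON.
trace_id=$(printf '%s' "$line" | sed -n 's/.*"traceId":"\([^"]*\)".*/\1/p')

echo "$trace_id"                               # → 4bf92f3577b34da6a3ce929d0e0e4736
echo "http://localhost:16686/trace/$trace_id"  # paste-ready Jaeger deep link
```

The `/trace/<id>` deep link opens the trace directly in the Jaeger UI, skipping the search form.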
Starting the Stack

```shell
# From the project root
docker compose -f .docker/docker-compose.yml up -d

# Check all containers are healthy
docker compose -f .docker/docker-compose.yml ps

# Tail logs from all containers
docker compose -f .docker/docker-compose.yml logs -f

# Reload Prometheus config without restart (after changing rules)
curl -X POST http://localhost:9090/-/reload
```
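`up -d` returns before the containers are actually ready. A hypothetical helper that polls the app health endpoint until it reports UP; the attempt count and sleep interval are arbitrary choices, not values from this manual:

```shell
#!/bin/sh
# Poll the app health endpoint; succeed as soon as it reports UP,
# give up after $1 attempts (default 30), 2 seconds apart.
wait_healthy() {
  attempts=${1:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS --max-time 2 http://localhost:8080/actuator/health 2>/dev/null \
       | grep -q '"status":"UP"'; then
      echo "app healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "timed out waiting for app health" >&2
  return 1
}
```

Chain it after startup to gate anything that needs the app, e.g. `docker compose -f .docker/docker-compose.yml up -d && wait_healthy`.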
Log Correlation Fields
Every structured log line written by the app contains these fields:
| Field | Example | Description |
|---|---|---|
| `traceId` | 4bf92f3577b34da6a3ce929d0e0e4736 | OTel trace ID (use it to find the trace in Jaeger) |
| `spanId` | 00f067aa0ba902b7 | OTel span ID |
| `correlationId` | 1de41fa4-3d2c-48e7-acc4-297f0800bc5b | Per-request UUID, also returned in the `X-Correlation-ID` response header |
| `level` | ERROR | Log level: TRACE / DEBUG / INFO / WARN / ERROR |
| `logger` | c.t.api.service.TransformService | Logger class |
| `message` | Transform failed for specId=csv-to-json | Log message |
| `@timestamp` | 2026-03-09T10:45:32.123Z | UTC timestamp |
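These fields can drive searches outside Kibana as well. A sketch that builds an Elasticsearch term query for one request's log lines, keyed on `correlationId`; it assumes Kibana's backing Elasticsearch listens on localhost:9200 and the app logs land in indices matching `logs-*` (neither appears in the stack table above, so verify both before relying on this):

```shell
#!/bin/sh
# Build a term query selecting every log line of a single request.
correlation_id='1de41fa4-3d2c-48e7-acc4-297f0800bc5b'   # example value from the table
query="{\"query\":{\"term\":{\"correlationId\":\"$correlation_id\"}}}"
echo "$query"

# When the stack is up, POST it to Elasticsearch directly:
# curl -fsS 'http://localhost:9200/logs-*/_search' \
#   -H 'Content-Type: application/json' -d "$query"
```

The same term query works for `traceId` when you are starting from a Jaeger trace instead of a response header.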