
# Operations Manual

This manual is the day-to-day reference for the team operating the Transform Platform. It covers every observability tool in the local stack: what it does, when to reach for it, and exactly what to type when something goes wrong.


## The Observability Stack at a Glance

| Tool | URL | Credentials | Purpose |
|------|-----|-------------|---------|
| Prometheus | localhost:9090 | None | Metrics storage and alerting |
| Grafana | localhost:3001 | admin / admin | Metrics dashboards |
| Kibana | localhost:5601 | None | Structured log search |
| Jaeger | localhost:16686 | None | Distributed tracing |
| Kafka UI | localhost:8090 | None | Kafka topic browser |
| App Metrics | localhost:8080/actuator/prometheus | None | Raw Prometheus scrape endpoint |
| App Health | localhost:8080/actuator/health | None | Liveness / readiness probe |
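Before digging into any one tool, it can help to confirm the whole stack is up. A minimal reachability sweep, assuming the ports from the table above and each tool's usual health path (Prometheus `/-/healthy`, Grafana `/api/health`, Kibana `/api/status`); adjust the paths if your versions differ:

```shell
# One URL per tool in the table above; the bare "/" entries just check
# that the UI answers at all.
STACK_URLS='
http://localhost:9090/-/healthy
http://localhost:3001/api/health
http://localhost:5601/api/status
http://localhost:16686/
http://localhost:8090/
http://localhost:8080/actuator/health
'

for url in $STACK_URLS; do
  # -s silent, -f fail on HTTP errors, short timeout so a dead port
  # does not hang the sweep
  if curl -sf --max-time 2 "$url" > /dev/null; then
    echo "OK   $url"
  else
    echo "DOWN $url"
  fi
done
```

Anything reported `DOWN` is where to start looking.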

## When to Use What

```text
Something is wrong: where do I look?
│
├─ App not responding / health check failing
│  └─ Start with: localhost:8080/actuator/health
│
├─ I need to see numbers (latency, error rate, memory)
│  └─ Grafana dashboard → Prometheus queries for detail
│
├─ I need to read log lines / find an error message
│  └─ Kibana → search by level, traceId, or keyword
│
├─ I need to trace one slow or failing request end-to-end
│  └─ Jaeger → find by traceId (copy it from the log)
│
├─ Kafka messages not flowing
│  └─ Kafka UI → inspect topic, consumer group lag
│
└─ Alert fired in Prometheus
   └─ Prometheus /alerts → Grafana panel for context → Kibana for logs
```
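For the "numbers" and alert branches, the underlying question is usually a PromQL query. A hedged sketch of the HTTP 5xx error-ratio query, assuming the `http_server_requests_seconds` metric this platform exports (the `_count` suffix is the standard Prometheus counter series behind a timer; verify the label names against your own scrape):

```shell
# Fraction of requests returning 5xx over the last 5 minutes.
PROMQL='sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m]))'

# Paste it into the Prometheus UI at localhost:9090, or query the HTTP API:
#   curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=$PROMQL"
echo "$PROMQL"
```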

## The Three Signals and How They Connect

Every request handled by the platform emits all three signals (a log line, a span, and a metric increment), all linked by the same traceId:

```text
HTTP Request arrives
│
▼
CorrelationIdFilter - writes correlationId to MDC + response header
│
▼
TracingMdcFilter - writes traceId + spanId to MDC (from OTel span)
│
├── LOG LINE emitted    → Kibana (contains traceId, correlationId, level, message)
│
├── SPAN emitted        → Jaeger (contains traceId, HTTP method, URI, duration)
│
└── METRIC incremented  → Prometheus / Grafana (http_server_requests_seconds)
```

Typical debug workflow:

1. Grafana shows a spike in HTTP 5xx errors → note the time window.
2. Kibana: filter `level: "ERROR"` in that time window → copy a traceId.
3. Jaeger: paste the traceId → see every span that made up that request.
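The hand-off from step 2 to step 3 can be scripted: Jaeger's UI serves a trace directly at `/trace/<traceId>`, so a copied traceId becomes a shareable deep link. A sketch using the example trace ID from this manual:

```shell
# traceId copied from a Kibana log line in step 2 (example value)
TRACE_ID='4bf92f3577b34da6a3ce929d0e0e4736'

# Jaeger UI deep link for step 3
JAEGER_LINK="http://localhost:16686/trace/${TRACE_ID}"
echo "$JAEGER_LINK"
```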

## Starting the Stack

```shell
# From the project root
docker compose -f .docker/docker-compose.yml up -d

# Check all containers are healthy
docker compose -f .docker/docker-compose.yml ps

# Tail logs from all containers
docker compose -f .docker/docker-compose.yml logs -f

# Reload Prometheus config without restart (after changing rules)
curl -X POST http://localhost:9090/-/reload
```
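When scripting against a freshly started stack, a small poll loop avoids racing the app's startup. A sketch assuming the health endpoint above returns Spring-style `{"status":"UP"}` JSON; the attempt count and sleep interval are arbitrary defaults:

```shell
# Block until the health endpoint reports UP, or give up after N attempts.
wait_for_health() {
  url="${1:-http://localhost:8080/actuator/health}"
  attempts="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # grep for the UP status in the health JSON; -f makes curl fail on 5xx
    if curl -sf --max-time 2 "$url" | grep -q '"status":"UP"'; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Example: wait_for_health && echo "stack ready"
```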

## Log Correlation Fields

Every structured log line written by the app contains these fields:

| Field | Example | Description |
|-------|---------|-------------|
| traceId | `4bf92f3577b34da6a3ce929d0e0e4736` | OTel trace ID; use in Jaeger |
| spanId | `00f067aa0ba902b7` | OTel span ID |
| correlationId | `1de41fa4-3d2c-48e7-acc4-297f0800bc5b` | Per-request UUID in response header `X-Correlation-ID` |
| level | `ERROR` | Log level: TRACE / DEBUG / INFO / WARN / ERROR |
| logger | `c.t.api.service.TransformService` | Logger class |
| message | `Transform failed for specId=csv-to-json` | Log message |
| @timestamp | `2026-03-09T10:45:32.123Z` | UTC timestamp |
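When copying a traceId out of a raw log line rather than the Kibana UI, a quick shell extraction works too. A sketch using a sample line built from the example values in the table above:

```shell
# Sample structured log line mirroring the fields documented above.
LOG_LINE='{"@timestamp":"2026-03-09T10:45:32.123Z","level":"ERROR","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7","message":"Transform failed for specId=csv-to-json"}'

# Pull out just the traceId value: match the "traceId":"..." pair,
# then take the fourth double-quote-delimited field.
echo "$LOG_LINE" | grep -o '"traceId":"[0-9a-f]*"' | cut -d'"' -f4
```

The printed value can be pasted straight into Jaeger's trace search.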