Grafana
Grafana is the primary metrics dashboard for the Transform Platform. The pre-provisioned "Transform Platform" dashboard gives a real-time visual overview of every layer of the system — from JVM internals to business pipeline throughput.
URL: http://localhost:3001
Login: admin / admin
Navigating to the Dashboard
- Open http://localhost:3001 and log in.
- Click the hamburger menu (top-left) → Dashboards.
- Select Transform Platform from the list.
Or use the direct link: http://localhost:3001/d/transform-platform-main
Dashboard Sections
The dashboard is divided into five collapsible rows. Click the row header to expand or collapse it.
⚙️ JVM Health
The top row holds six stat cards that give an instant health snapshot:
| Panel | What to watch |
|---|---|
| Heap Used % | Yellow > 70%, Red > 85%. Sustained red = OOM risk |
| Non-Heap Used | Metaspace creep = classloader leak |
| Process CPU % | Red > 85%. Sustained high = runaway loop or GC thrash |
| Open File Descriptors | Red > 900. If hitting the OS limit → Too many open files errors |
| Process Uptime | Unexpected reset = crash-loop |
| Live Threads | Yellow > 200, Red > 400. Growing thread count = thread leak |
Below the stat cards, three time-series charts:
- JVM Memory — Heap used / committed / max + non-heap. A used line approaching the max line means GC is struggling to free memory.
- JVM Thread States — Stacked view of runnable / waiting / timed-waiting / blocked. A spike in blocked threads means lock contention. Growing waiting threads often means slow downstream I/O.
- GC Pause Duration — Rate of wall time spent in GC. If this approaches 1.0 s/s the JVM is spending almost all its time collecting garbage instead of executing your code.
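The JVM charts above can also be explored ad hoc in Prometheus with queries along these lines. The metric names assume Micrometer's default Prometheus export and may differ in your setup:

```promql
# Heap used as a percentage of max (the Heap Used % stat card)
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) * 100

# Fraction of wall time spent in GC pauses (the GC Pause Duration chart);
# values approaching 1.0 mean the JVM is mostly collecting garbage
sum(rate(jvm_gc_pause_seconds_sum[5m]))

# Thread counts broken down by state (the JVM Thread States chart)
jvm_threads_states_threads
```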
🌐 HTTP Traffic
Three charts covering all API endpoints:
- HTTP Requests / sec — Per endpoint and status. A sudden drop to zero means the app stopped receiving traffic (or stopped running).
- HTTP Latency p50 / p95 / p99 — p50 is typical, p99 shows your worst 1%. If p99 is high but p50 is fine, you have occasional slow outliers (GC pause, slow DB query, etc.).
- HTTP Error Rate % (4xx + 5xx) — 4xx are client errors (bad requests, not-found). 5xx are your bugs. A rising 5xx line means something broke.
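These three charts map to queries roughly like the following, assuming Spring Boot's Micrometer instrumentation with its default `http_server_requests` metric (the percentile query additionally requires histogram buckets to be enabled):

```promql
# Requests per second, broken down by endpoint and status
sum by (uri, status) (rate(http_server_requests_seconds_count[5m]))

# p99 latency per endpoint (requires percentile histograms)
histogram_quantile(0.99, sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m])))

# Error rate %: share of 4xx + 5xx responses across all requests
sum(rate(http_server_requests_seconds_count{status=~"4..|5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m])) * 100
```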
🗄️ Database / HikariCP
Six stat cards + two latency charts:
| Stat card | Healthy value | Concern |
|---|---|---|
| Active Connections | Low, variable | Consistently at max = pool exhausted |
| Idle Connections | Should be > 0 | Zero idle + pending > 0 = under-provisioned |
| Pending Threads | 0 | Any non-zero sustained value = serious problem |
| Pool Utilization % | < 70% | > 90% for > 5m = alert fires |
| Connection Timeouts | 0 total | Any value = requests failed to get a DB connection |
| Max Pool Size | Config value | Set via spring.datasource.hikari.maximum-pool-size |
- Connection Acquire Time p50/p95/p99 — How long threads wait to borrow a connection. p99 > 100ms means the pool is under pressure.
- Connection Usage Duration p50/p95/p99 — How long connections are held by application code. High p99 here often means a slow or unindexed SQL query.
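The key pool-health signals above can be queried directly; these sketches assume Micrometer's default HikariCP metric names, which may vary by version:

```promql
# Pool utilization % (active connections vs configured max)
hikaricp_connections_active / hikaricp_connections_max * 100

# Threads currently waiting for a connection;
# any sustained non-zero value here is a serious problem
hikaricp_connections_pending

# p99 connection acquire time (requires histogram buckets)
histogram_quantile(0.99, sum by (le) (rate(hikaricp_connections_acquire_seconds_bucket[5m])))
```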
🔄 Pipeline Metrics
Business-level throughput. Empty until Counter and Timer beans are registered in the app code.
- Records Processed / sec — Per spec ID and status.
- Records Failed / sec — Per spec ID and severity. Any sustained value here warrants investigation.
- File Transform Duration p95 / p99 — End-to-end time to transform a file. Spikes here correlate with slow DB queries or large payloads.
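Once the Counter and Timer beans are in place, the panels above boil down to rate and quantile queries of this shape. The metric names below are illustrative placeholders only; substitute whatever names the application registers:

```promql
# Records processed per second, per spec ID and status
# (placeholder names -- match them to your Counter/Timer registrations)
sum by (spec_id, status) (rate(records_processed_total[5m]))

# p95 end-to-end file transform duration (Timer with histogram buckets)
histogram_quantile(0.95, sum by (le) (rate(file_transform_duration_seconds_bucket[5m])))
```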
🪟 Window & Action Metrics
Coming Soon — these panels will populate once the Window and Action pipeline is instrumented.
Common Workflows
Investigating a latency spike
- Look at HTTP Latency p95/p99 — which URI is slow?
- Check GC Pause Duration — is a full GC happening at the same time?
- Check Connection Acquire Time p99 — is the DB pool backed up?
- If neither, jump to Jaeger and find a trace from that time window.
Investigating an OOM / memory alert
- Watch JVM Memory — does heap used sawtooth down after each GC (normal cycle), or does the baseline keep climbing even after collections (leak)?
- Look at GC Pause Duration — is GC running frequently but failing to reclaim?
- Check JVM Thread States — too many threads each holding references?
- Check Non-Heap Used — growing non-heap = Metaspace leak (class loading).
Investigating a DB slowdown
- Pending Threads stat > 0 → pool exhausted.
- Connection Acquire Time p99 > 200ms → threads waiting for a connection.
- Connection Usage Duration p99 high → a query is running slowly.
- Jump to Kibana and search for slow query log lines.
Changing the Time Range
- Use the time picker at the top right (default: last 1 hour).
- Common presets: Last 15 minutes, Last 1 hour, Last 6 hours, Last 24 hours.
- Click and drag on any graph to zoom into a specific time window.
- Press Ctrl+Z (or Cmd+Z on Mac) to undo a zoom.
- The refresh interval is set to 30 seconds by default. Change it in the time picker.
Adding a Custom Panel
- Click Edit (pencil icon, top right) to enter edit mode.
- Click Add panel → Add new visualization.
- Pick Prometheus as the data source.
- Paste any PromQL query from the Prometheus guide.
- Choose a visualization type (Time series, Stat, Gauge, Bar chart, Table, etc.).
- Click Apply then Save dashboard.
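As a starting point, a query like the following can be pasted into a new Time series panel. It assumes the Micrometer default `http_server_requests` metric described earlier:

```promql
# Per-endpoint request rate over the last 5 minutes
sum by (uri) (rate(http_server_requests_seconds_count[5m]))
```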
Study Material
- Grafana Getting Started
- Grafana Panel Types
- Grafana Alerting
- Dashboard Best Practices
- Grafana + Prometheus Tutorial (official)
- USE Method for metrics — a framework for thinking about utilisation, saturation, and errors
- RED Method for services — Rate, Errors, Duration