Observability is more than just monitoring—it’s about understanding the internal state of your systems through external outputs. Prometheus and Grafana form the industry-standard stack for collecting, visualizing, and alerting on metrics. In this comprehensive guide, we’ll build a complete observability solution.
Why Prometheus & Grafana?
This powerful combination offers:
- Open Source: No vendor lock-in, fully customizable
- Powerful Query Language: PromQL for complex metric analysis
- Multi-dimensional Data: Labels for flexible aggregation
- Rich Dashboards: Grafana's flexible, shareable visualizations
- Active Alerting: Proactive problem detection
- Kubernetes Native: First-class support for cloud-native environments
- Large Ecosystem: Hundreds of exporters for different systems
Architecture Overview
┌─────────────────┐
│   Application   │──► exposes a /metrics endpoint
└────────┬────────┘
         │
    ┌────▼─────┐
    │ Exporter │──► exposes metrics for systems that can't be instrumented directly
    └────┬─────┘
         │
  ┌──────▼───────┐  alerts   ┌──────────────┐
  │  Prometheus  │──────────►│ Alertmanager │──► routes & sends notifications
  │ (TSDB, :9090)│           └──────────────┘
  └──────┬───────┘
         │ PromQL queries
  ┌──────▼───────┐
  │   Grafana    │──► visualizes metrics
  │ (Dashboards) │
  └──────────────┘
Applications and exporters expose metrics over HTTP; Prometheus scrapes and stores them in its time-series database, evaluates alerting rules, and forwards firing alerts to Alertmanager, while Grafana queries Prometheus with PromQL to render dashboards.
Installation: Kubernetes Deployment
Option 1: Using Helm (Recommended)
# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install kube-prometheus-stack (includes Prometheus, Grafana, Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
--set grafana.adminPassword='YourSecurePassword' \
--set alertmanager.enabled=true
# Verify installation
kubectl get pods -n monitoring
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Visit http://localhost:3000 (admin / YourSecurePassword)
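Rather than a long list of --set flags, the same overrides can live in a values file, which is easier to review and version-control. A sketch mirroring the flags above (adjust storage size and password to your environment):

# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
grafana:
  adminPassword: 'YourSecurePassword'
alertmanager:
  enabled: true

# Install (or upgrade) using the values file
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f values.yaml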
Option 2: Manual Kubernetes Manifests
# prometheus-deployment.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production'
        region: 'us-east-1'

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093

    rule_files:
    - /etc/prometheus/rules/*.yml

    scrape_configs:
    # Prometheus itself
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    # Kubernetes API server
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Kubernetes nodes
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Kubernetes pods
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.45.0
        args:
        - '--config.file=/etc/prometheus/prometheus.yml'
        - '--storage.tsdb.path=/prometheus'
        - '--storage.tsdb.retention.time=30d'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage
        persistentVolumeClaim:
          claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
  - port: 9090
    targetPort: 9090
    name: web
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
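Note that the Deployment mounts a PersistentVolumeClaim named prometheus-pvc that is not defined above. A minimal claim might look like the following sketch (the size matches the Helm example; set storageClassName explicitly if your cluster has no default class):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  # storageClassName: <your-storage-class>   # uncomment if no default StorageClass is set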
Apply the configuration:
kubectl apply -f prometheus-deployment.yaml
Instrumenting Applications
Node.js Application
// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = new promClient.Registry();

// Add default metrics (process, event loop, GC, etc.)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);

// Middleware to track request counts and durations
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, route, res.statusCode).inc();
  });
  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/api/health', (req, res) => {
  res.json({ status: 'healthy' });
});

app.get('/api/users', (req, res) => {
  // Simulate processing time
  setTimeout(() => {
    res.json({ users: [] });
  }, Math.random() * 100);
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Kubernetes deployment with Prometheus annotations:
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - containerPort: 3000
          name: http
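The Deployment assumes a myapp:latest image is available to the cluster. A minimal Dockerfile for the Node.js service might look like this sketch (the base image tag and file layout are assumptions):

# Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY app.js ./
EXPOSE 3000
CMD ["node", "app.js"]

Build and push the image to a registry your cluster can pull from, then point the image field at that tag.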
Python Application (Flask)
# app.py
from flask import Flask, request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST, REGISTRY
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(request_duration)
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    ACTIVE_REQUESTS.dec()
    return response

@app.route('/metrics')
def metrics():
    # Expose metrics in the Prometheus text format with the correct content type
    return Response(generate_latest(REGISTRY), mimetype=CONTENT_TYPE_LATEST)

@app.route('/api/health')
def health():
    return {'status': 'healthy'}

@app.route('/api/data')
def data():
    # Simulate processing
    time.sleep(0.1)
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
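One caveat: the default prometheus_client registry is per-process, so under a multi-worker server such as gunicorn each worker reports only its own counters. The library's multiprocess mode aggregates across workers; a sketch that would replace the /metrics route above, assuming the PROMETHEUS_MULTIPROC_DIR environment variable points to a writable directory shared by all workers:

# metrics endpoint for multi-process deployments (sketch)
from flask import Response
from prometheus_client import CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST, multiprocess

@app.route('/metrics')
def metrics_multiprocess():
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)  # gathers samples from all worker processes
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)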
PromQL: Query Language Basics
Essential Queries
# CPU usage per pod
rate(container_cpu_usage_seconds_total[5m])
# Memory usage
container_memory_usage_bytes / 1024 / 1024
# Request rate
rate(http_requests_total[5m])
# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Top 5 CPU consuming pods
topk(5, rate(container_cpu_usage_seconds_total[5m]))
# Pods using more than 80% memory
container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
Advanced Queries
# Predict disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Alert if request rate drops by 50%
rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
# Network traffic aggregated by namespace
sum by (namespace) (rate(container_network_receive_bytes_total[5m]))
# Pod restart count in last hour
changes(kube_pod_container_status_restarts_total[1h]) > 0
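All of these can be run interactively in the Prometheus web UI (the Graph page) or scripted against the HTTP API. For example, assuming a port-forward to the Prometheus service (the service name shown matches the manual manifests; the Helm install names it differently):

# Forward the Prometheus UI/API to localhost
kubectl port-forward -n monitoring svc/prometheus 9090:9090

# Run an instant query via the HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total[5m])'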
Alert Rules Configuration
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yml: |
    groups:
    - name: infrastructure
      interval: 30s
      rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (pod, namespace) > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage detected on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has CPU usage above 80% for 5 minutes"

      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of its memory limit"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"

      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.node }} has disk pressure"
          description: "Node {{ $labels.node }} is experiencing disk pressure"

    - name: application
      interval: 30s
      rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.job }}"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"

      - alert: LowRequestRate
        expr: rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Low request rate on {{ $labels.job }}"
          description: "Request rate has dropped by more than 50% compared to 1 hour ago"
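Note that prometheus.yml in the manual setup loads rules from /etc/prometheus/rules/*.yml, but the Deployment shown earlier only mounts the main config ConfigMap. For these rules to be picked up, the prometheus-rules ConfigMap has to be mounted as well; a sketch of the extra fields on that Deployment:

        volumeMounts:
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules
      volumes:
      - name: prometheus-rules
        configMap:
          name: prometheus-rules

With the Helm-based install, rules are instead defined as PrometheusRule custom resources, which the operator discovers automatically.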
Alertmanager Configuration
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-critical'
        continue: true
      - match:
          severity: warning
        receiver: 'slack-warnings'
      - match:
          team: platform
        receiver: 'slack-platform'

    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'pagerduty-critical'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .CommonAnnotations.summary }}'
    - name: 'slack-warnings'
      slack_configs:
      - channel: '#warnings'
        color: 'warning'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'slack-platform'
      slack_configs:
      - channel: '#platform-alerts'
        title: 'Platform Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'cluster', 'service']
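Both the alert rules and the Alertmanager configuration can be validated offline before being applied; promtool ships with Prometheus and amtool with Alertmanager:

# Validate the Prometheus config and alert rules
promtool check config prometheus.yml
promtool check rules alert-rules.yml

# Validate the Alertmanager config
amtool check-config alertmanager.yml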
Grafana Dashboard as Code
{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)",
            "legendFormat": "{{namespace}}"
          }
        ],
        "yaxes": [
          {"format": "percentunit"}
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes) by (namespace) / 1024 / 1024 / 1024",
            "legendFormat": "{{namespace}}"
          }
        ]
      }
    ]
  }
}
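With the kube-prometheus-stack chart, the Grafana pod runs a sidecar that imports any ConfigMap labelled grafana_dashboard: "1" as a dashboard, so JSON models like the one above can be provisioned declaratively. A sketch (the data key holds the full dashboard JSON):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana dashboard sidecar
data:
  cluster-overview.json: |
    { ...dashboard JSON from above... }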
Best Practices
✅ Cardinality Management: Avoid high-cardinality labels (e.g., user IDs, timestamps)
✅ Retention Policy: Balance storage costs with data needs (15-30 days typical)
✅ Recording Rules: Pre-compute expensive queries (see the sketch after this list)
✅ Alert Fatigue: Set appropriate thresholds, use inhibition rules
✅ Service Discovery: Use Kubernetes SD instead of static configs
✅ Dashboard Organization: Use folders, tags, and consistent naming
✅ SLO-based Alerting: Alert on user-facing issues, not system metrics
✅ Federation: Use Prometheus federation for multi-cluster setups
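As an example of the recording-rules point above, expensive expressions such as the error-rate and p95-latency queries used in the alerts can be pre-computed and then referenced by their recorded name in dashboards and alert rules; a sketch:

# recording-rules.yml (sketch; names follow the level:metric:operation convention)
groups:
- name: recording-rules
  interval: 30s
  rules:
  - record: job:http_requests:error_rate5m
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
  - record: job:http_request_duration_seconds:p95_5m
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))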
Conclusion
Prometheus and Grafana provide a powerful, flexible observability stack. By instrumenting your applications properly, writing effective PromQL queries, and setting up actionable alerts, you’ll gain deep insights into your systems and catch issues before they impact users.
Questions about observability? Let’s discuss in the comments!