Observability with Prometheus & Grafana: Metrics That Matter

How to set up end-to-end observability for your infrastructure using Prometheus and Grafana, with actionable dashboards and alerting.

Hari Prasad · June 13, 2024 · 5 min read

Observability is more than just monitoring—it’s about understanding the internal state of your systems through external outputs. Prometheus and Grafana form the industry-standard stack for collecting, visualizing, and alerting on metrics. In this comprehensive guide, we’ll build a complete observability solution.

Why Prometheus & Grafana?

This powerful combination offers:

  • Open Source: No vendor lock-in, fully customizable
  • Powerful Query Language: PromQL for complex metric analysis
  • Multi-dimensional Data: Labels for flexible aggregation
  • Beautiful Dashboards: Grafana’s stunning visualizations
  • Active Alerting: Proactive problem detection
  • Kubernetes Native: First-class support for cloud-native environments
  • Large Ecosystem: Hundreds of exporters for different systems

Architecture Overview

┌─────────────────┐
│   Application   │──► exposes /metrics endpoint
└────────┬────────┘
         │
    ┌────▼─────┐
    │ Exporter │──► exposes metrics for systems without native instrumentation
    └────┬─────┘
         │
    ┌────▼─────────┐      ┌──────────────┐
    │  Prometheus  │─────►│ Alertmanager │──► routes & sends alerts
    │   (TSDB)     │      └──────────────┘
    └────┬─────────┘
         │
    ┌────▼─────────┐
    │   Grafana    │──► visualizes metrics
    │  (Dashboard) │
    └──────────────┘

Installation: Kubernetes Deployment

Option 1: Helm (kube-prometheus-stack)

# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create monitoring namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack (includes Prometheus, Grafana, Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
  --set grafana.adminPassword='YourSecurePassword' \
  --set alertmanager.enabled=true

# Verify installation
kubectl get pods -n monitoring

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Visit http://localhost:3000 (admin / YourSecurePassword)
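
For anything beyond a quick trial, the same settings are easier to keep in version control as a values file than as a chain of --set flags. A minimal sketch, assuming a file named monitoring-values.yaml (the keys simply mirror the flags above; consult the chart's default values.yaml for the full set):

# monitoring-values.yaml (illustrative file name)
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi

grafana:
  adminPassword: 'YourSecurePassword'

alertmanager:
  enabled: true

# Install or upgrade with the values file
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f monitoring-values.yaml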

Option 2: Manual Kubernetes Manifests

# prometheus-deployment.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production'
        region: 'us-east-1'
    
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093
    
    rule_files:
      - /etc/prometheus/rules/*.yml
    
    scrape_configs:
      # Prometheus itself
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090']
      
      # Kubernetes API server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
      
      # Kubernetes nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
      
      # Kubernetes pods
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.45.0
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage
        persistentVolumeClaim:
          claimName: prometheus-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
  - port: 9090
    targetPort: 9090
    name: web

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

Apply the configuration:

kubectl apply -f prometheus-deployment.yaml
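
Note that the Deployment above mounts a PersistentVolumeClaim named prometheus-pvc that is not defined in the manifest. A minimal sketch of one, sized to match the Helm example (capacity and storage class are assumptions; adjust for your cluster and retention window):

# prometheus-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi   # assumption; size this for your retention window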

Instrumenting Applications

Node.js Application

// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = new promClient.Registry();

// Add default metrics
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration.labels(req.method, req.route?.path || req.path, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, req.route?.path || req.path, res.statusCode).inc();
  });
  
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/api/health', (req, res) => {
  res.json({ status: 'healthy' });
});

app.get('/api/users', (req, res) => {
  // Simulate processing time
  setTimeout(() => {
    res.json({ users: [] });
  }, Math.random() * 100);
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});

Kubernetes deployment with Prometheus annotations:

# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - containerPort: 3000
          name: http

Python Application (Flask)

# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST, REGISTRY
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time
    
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(request_duration)
    
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    
    ACTIVE_REQUESTS.dec()
    return response

@app.route('/metrics')
def metrics():
    # Serve metrics with the Prometheus exposition content type
    return generate_latest(REGISTRY), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.route('/api/health')
def health():
    return {'status': 'healthy'}

@app.route('/api/data')
def data():
    # Simulate processing
    time.sleep(0.1)
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

PromQL: Query Language Basics

Essential Queries

# CPU usage (cores) per pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Memory usage per container (MiB)
container_memory_usage_bytes / 1024 / 1024

# Request rate
rate(http_requests_total[5m])

# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# Top 5 CPU-consuming pods
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

# Containers using more than 80% of their memory limit
container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8

Advanced Queries

# Predict disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

# Alert if request rate drops by 50%
rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)

# Network traffic aggregated by namespace
sum by (namespace) (rate(container_network_receive_bytes_total[5m]))

# Pod restart count in last hour
changes(kube_pod_container_status_restarts_total[1h]) > 0
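
Heavier expressions like the percentile and prediction queries above can be pre-computed with recording rules instead of being evaluated on every dashboard refresh. A minimal sketch, loaded the same way as the alert rules in the next section (the rule names follow the common level:metric:operation convention but are otherwise illustrative):

# recording-rules.yml (illustrative; picked up via rule_files: /etc/prometheus/rules/*.yml)
groups:
- name: precomputed
  interval: 30s
  rules:
  # Pre-computed 95th percentile request latency per job
  - record: job:http_request_duration_seconds:p95
    expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
  # Pre-computed per-job request rate
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))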

Alert Rules Configuration

# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yml: |
    groups:
    - name: infrastructure
      interval: 30s
      rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (pod, namespace) > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage detected on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has CPU usage above 80% for 5 minutes"
      
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of memory limit"
      
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"
      
      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.node }} has disk pressure"
          description: "Node {{ $labels.node }} is experiencing disk pressure"
    
    - name: application
      interval: 30s
      rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
      
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.job }}"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
      
      - alert: LowRequestRate
        expr: rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Low request rate on {{ $labels.job }}"
          description: "Request rate has dropped by 50% compared to 1 hour ago"

Alertmanager Configuration

# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-critical'
        continue: true
      - match:
          severity: warning
        receiver: 'slack-warnings'
      - match:
          team: platform
        receiver: 'slack-platform'
    
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'pagerduty-critical'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .CommonAnnotations.summary }}'
    
    - name: 'slack-warnings'
      slack_configs:
      - channel: '#warnings'
        color: 'warning'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'slack-platform'
      slack_configs:
      - channel: '#platform-alerts'
        title: 'Platform Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'cluster', 'service']
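
This ConfigMap on its own does not run anything: prometheus.yml above expects an Alertmanager reachable at alertmanager:9093. The Helm chart deploys one for you; for the manual route, a minimal sketch of a matching Deployment and Service (image tag and sizing are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.26.0   # assumption; pin to your preferred version
        args:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
        ports:
        - containerPort: 9093
          name: web
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
      volumes:
      - name: alertmanager-config
        configMap:
          name: alertmanager-config
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
  - port: 9093
    targetPort: 9093
    name: web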

Grafana Dashboard as Code

{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)",
            "legendFormat": ""
          }
        ],
        "yaxes": [
          {"format": "percentunit"}
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes) by (namespace) / 1024 / 1024 / 1024",
            "legendFormat": ""
          }
        ]
      }
    ]
  }
}
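
Committing dashboard JSON to Git only pays off if Grafana actually loads it from disk. Grafana's file-based provisioning does that at startup; a minimal provider sketch (paths and the provider name are illustrative):

# /etc/grafana/provisioning/dashboards/dashboards.yaml (illustrative path)
apiVersion: 1
providers:
- name: 'kubernetes-dashboards'
  orgId: 1
  folder: 'Kubernetes'
  type: file
  disableDeletion: false
  updateIntervalSeconds: 30
  options:
    # Directory where dashboard JSON files such as the example above are placed
    path: /var/lib/grafana/dashboards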

Best Practices

  • Cardinality Management: Avoid high-cardinality labels (e.g., user IDs, timestamps)
  • Retention Policy: Balance storage costs with data needs (15-30 days is typical)
  • Recording Rules: Pre-compute expensive queries (see the recording-rules sketch in the PromQL section)
  • Alert Fatigue: Set appropriate thresholds and use inhibition rules
  • Service Discovery: Use Kubernetes service discovery instead of static configs
  • Dashboard Organization: Use folders, tags, and consistent naming
  • SLO-based Alerting: Alert on user-facing symptoms, not raw system metrics
  • Federation: Use Prometheus federation for multi-cluster setups (see the sketch below)
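
A sketch of a federation scrape job on a "global" Prometheus that pulls selected series from per-cluster instances (the match[] selectors and target addresses are illustrative):

# Scrape job on a global Prometheus federating from per-cluster instances
scrape_configs:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job="kubernetes-pods"}'
    - '{__name__=~"job:.*"}'   # aggregated series produced by recording rules
  static_configs:
  - targets:
    - 'prometheus.cluster-a.example.com:9090'
    - 'prometheus.cluster-b.example.com:9090'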

Conclusion

Prometheus and Grafana provide a powerful, flexible observability stack. By instrumenting your applications properly, writing effective PromQL queries, and setting up actionable alerts, you’ll gain deep insights into your systems and catch issues before they impact users.

Questions about observability? Let’s discuss in the comments!

About the Author

Hari Prasad

Seasoned DevOps Lead with 11+ years of expertise in cloud infrastructure, CI/CD automation, and infrastructure as code. Proven track record in designing scalable, secure systems on AWS using Terraform, Kubernetes, Jenkins, and Ansible. Strong leadership in mentoring teams and implementing cost-effective cloud solutions.
