Observability is more than just monitoring—it’s about understanding the internal state of your systems through external outputs. Prometheus and Grafana form the industry-standard stack for collecting, visualizing, and alerting on metrics. In this comprehensive guide, we’ll build a complete observability solution.
Why Prometheus & Grafana?
This powerful combination offers:
- Open Source: No vendor lock-in, fully customizable
- Powerful Query Language: PromQL for complex metric analysis
- Multi-dimensional Data: Labels for flexible aggregation
- Rich Dashboards: Grafana's flexible, shareable visualizations
- Active Alerting: Proactive problem detection
- Kubernetes Native: First-class support for cloud-native environments
- Large Ecosystem: Hundreds of exporters for different systems
Architecture Overview
┌─────────────────┐
│   Application   │──► exposes a /metrics endpoint
└────────┬────────┘
         │
    ┌────▼─────┐
    │ Exporter │──► exposes metrics for systems that can't be instrumented directly
    └────┬─────┘
         │
  ┌──────▼───────┐  alerts   ┌──────────────┐
  │  Prometheus  │──────────►│ Alertmanager │──► routes & sends notifications
  │ (TSDB, :9090)│           └──────────────┘
  └──────┬───────┘
         │ PromQL queries
  ┌──────▼───────┐
  │   Grafana    │──► visualizes metrics
  │ (Dashboards) │
  └──────────────┘
Applications and exporters expose metrics over HTTP; Prometheus scrapes and stores them in its time-series database, evaluates alerting rules, and forwards firing alerts to Alertmanager, while Grafana queries Prometheus with PromQL to render dashboards.
Installation: Kubernetes Deployment
Option 1: Using Helm (Recommended)
# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create monitoring namespace
kubectl create namespace monitoring
# Install kube-prometheus-stack (includes Prometheus, Grafana, Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
--set grafana.adminPassword='YourSecurePassword' \
--set alertmanager.enabled=true
# Verify installation
kubectl get pods -n monitoring
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Visit http://localhost:3000 (admin / YourSecurePassword)
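Rather than a long list of --set flags, the same overrides can live in a values file, which is easier to review and version-control. A sketch mirroring the flags above (adjust storage size and password to your environment):

# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
grafana:
  adminPassword: 'YourSecurePassword'
alertmanager:
  enabled: true

# Install (or upgrade) using the values file
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f values.yaml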
Option 2: Manual Kubernetes Manifests
# prometheus-deployment.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production'
        region: 'us-east-1'

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093

    rule_files:
    - /etc/prometheus/rules/*.yml

    scrape_configs:
    # Prometheus itself
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    # Kubernetes API server
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Kubernetes nodes
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Kubernetes pods
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.45.0
        args:
        - '--config.file=/etc/prometheus/prometheus.yml'
        - '--storage.tsdb.path=/prometheus'
        - '--storage.tsdb.retention.time=30d'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage
        persistentVolumeClaim:
          claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP
  selector:
    app: prometheus
  ports:
  - port: 9090
    targetPort: 9090
    name: web
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
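Note that the Deployment mounts a PersistentVolumeClaim named prometheus-pvc that is not defined above. A minimal claim might look like the following sketch (the size matches the Helm example; set storageClassName explicitly if your cluster has no default class):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  # storageClassName: <your-storage-class>   # uncomment if no default StorageClass is set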
Apply the configuration:
kubectl apply -f prometheus-deployment.yaml
Instrumenting Applications
Node.js Application
// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = new promClient.Registry();

// Add default metrics (process, event loop, GC, etc.)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);

// Middleware to track request counts and durations
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, route, res.statusCode).inc();
  });
  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/api/health', (req, res) => {
  res.json({ status: 'healthy' });
});

app.get('/api/users', (req, res) => {
  // Simulate processing time
  setTimeout(() => {
    res.json({ users: [] });
  }, Math.random() * 100);
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Kubernetes deployment with Prometheus annotations:
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - containerPort: 3000
          name: http
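The Deployment assumes a myapp:latest image is available to the cluster. A minimal Dockerfile for the Node.js service might look like this sketch (the base image tag and file layout are assumptions):

# Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY app.js ./
EXPOSE 3000
CMD ["node", "app.js"]

Build and push the image to a registry your cluster can pull from, then point the image field at that tag.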
Python Application (Flask)
# app.py
from flask import Flask, request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST, REGISTRY
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Active HTTP requests'
)

@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(request_duration)
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()
    ACTIVE_REQUESTS.dec()
    return response

@app.route('/metrics')
def metrics():
    # Expose metrics in the Prometheus text format with the correct content type
    return Response(generate_latest(REGISTRY), mimetype=CONTENT_TYPE_LATEST)

@app.route('/api/health')
def health():
    return {'status': 'healthy'}

@app.route('/api/data')
def data():
    # Simulate processing
    time.sleep(0.1)
    return {'data': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
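One caveat: the default prometheus_client registry is per-process, so under a multi-worker server such as gunicorn each worker reports only its own counters. The library's multiprocess mode aggregates across workers; a sketch that would replace the /metrics route above, assuming the PROMETHEUS_MULTIPROC_DIR environment variable points to a writable directory shared by all workers:

# metrics endpoint for multi-process deployments (sketch)
from flask import Response
from prometheus_client import CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST, multiprocess

@app.route('/metrics')
def metrics_multiprocess():
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)  # gathers samples from all worker processes
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)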
PromQL: Query Language Basics
Essential Queries
# CPU usage per pod
rate(container_cpu_usage_seconds_total[5m])
# Memory usage
container_memory_usage_bytes / 1024 / 1024
# Request rate
rate(http_requests_total[5m])
# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Top 5 CPU consuming pods
topk(5, rate(container_cpu_usage_seconds_total[5m]))
# Pods using more than 80% memory
container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
Advanced Queries
# Predict disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Alert if request rate drops by 50%
rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
# Network traffic aggregated by namespace
sum by (namespace) (rate(container_network_receive_bytes_total[5m]))
# Pod restart count in last hour
changes(kube_pod_container_status_restarts_total[1h]) > 0
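All of these can be run interactively in the Prometheus web UI (the Graph page) or scripted against the HTTP API. For example, assuming a port-forward to the Prometheus service (the service name shown matches the manual manifests; the Helm install names it differently):

# Forward the Prometheus UI/API to localhost
kubectl port-forward -n monitoring svc/prometheus 9090:9090

# Run an instant query via the HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total[5m])'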
Alert Rules Configuration
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yml: |
    groups:
    - name: infrastructure
      interval: 30s
      rules:
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (pod, namespace) > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage detected on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has CPU usage above 80% for 5 minutes"

      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of its memory limit"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"

      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.node }} has disk pressure"
          description: "Node {{ $labels.node }} is experiencing disk pressure"

    - name: application
      interval: 30s
      rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.job }}"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"

      - alert: LowRequestRate
        expr: rate(http_requests_total[5m]) < 0.5 * rate(http_requests_total[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Low request rate on {{ $labels.job }}"
          description: "Request rate has dropped by more than 50% compared to 1 hour ago"
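Note that prometheus.yml in the manual setup loads rules from /etc/prometheus/rules/*.yml, but the Deployment shown earlier only mounts the main config ConfigMap. For these rules to be picked up, the prometheus-rules ConfigMap has to be mounted as well; a sketch of the extra fields on that Deployment:

        volumeMounts:
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules
      volumes:
      - name: prometheus-rules
        configMap:
          name: prometheus-rules

With the Helm-based install, rules are instead defined as PrometheusRule custom resources, which the operator discovers automatically.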
Alertmanager Configuration
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-critical'
        continue: true
      - match:
          severity: warning
        receiver: 'slack-warnings'
      - match:
          team: platform
        receiver: 'slack-platform'

    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'pagerduty-critical'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .CommonAnnotations.summary }}'
    - name: 'slack-warnings'
      slack_configs:
      - channel: '#warnings'
        color: 'warning'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'slack-platform'
      slack_configs:
      - channel: '#platform-alerts'
        title: 'Platform Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'cluster', 'service']
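Both the alert rules and the Alertmanager configuration can be validated offline before being applied; promtool ships with Prometheus and amtool with Alertmanager:

# Validate the Prometheus config and alert rules
promtool check config prometheus.yml
promtool check rules alert-rules.yml

# Validate the Alertmanager config
amtool check-config alertmanager.yml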
Grafana Dashboard as Code
{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "tags": ["kubernetes", "cluster"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)",
            "legendFormat": "{{namespace}}"
          }
        ],
        "yaxes": [
          {"format": "percentunit"}
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes) by (namespace) / 1024 / 1024 / 1024",
            "legendFormat": "{{namespace}}"
          }
        ]
      }
    ]
  }
}
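With the kube-prometheus-stack chart, the Grafana pod runs a sidecar that imports any ConfigMap labelled grafana_dashboard: "1" as a dashboard, so JSON models like the one above can be provisioned declaratively. A sketch (the data key holds the full dashboard JSON):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana dashboard sidecar
data:
  cluster-overview.json: |
    { ...dashboard JSON from above... }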
Best Practices
✅ Cardinality Management: Avoid high-cardinality labels (e.g., user IDs, timestamps)
✅ Retention Policy: Balance storage costs with data needs (15-30 days typical)
✅ Recording Rules: Pre-compute expensive queries (see the sketch after this list)
✅ Alert Fatigue: Set appropriate thresholds, use inhibition rules
✅ Service Discovery: Use Kubernetes SD instead of static configs
✅ Dashboard Organization: Use folders, tags, and consistent naming
✅ SLO-based Alerting: Alert on user-facing issues, not system metrics
✅ Federation: Use Prometheus federation for multi-cluster setups
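As an example of the recording-rules point above, expensive expressions such as the error-rate and p95-latency queries used in the alerts can be pre-computed and then referenced by their recorded name in dashboards and alert rules; a sketch:

# recording-rules.yml (sketch; names follow the level:metric:operation convention)
groups:
- name: recording-rules
  interval: 30s
  rules:
  - record: job:http_requests:error_rate5m
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
  - record: job:http_request_duration_seconds:p95_5m
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))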
Conclusion
Prometheus and Grafana provide a powerful, flexible observability stack. By instrumenting your applications properly, writing effective PromQL queries, and setting up actionable alerts, you’ll gain deep insights into your systems and catch issues before they impact users.
Questions about observability? Let’s discuss in the comments!