
Site Reliability Engineering: Practical Guide to SLOs, SLIs, and Error Budgets

Comprehensive guide to implementing Site Reliability Engineering practices including SLOs, SLIs, error budgets, and automation strategies.

Hari Prasad
October 12, 2024

Site Reliability Engineering (SRE) bridges the gap between development and operations by applying software engineering principles to infrastructure and operations problems. This comprehensive guide covers implementing SRE practices in your organization.

What is SRE?

SRE is Google’s approach to DevOps that emphasizes:

  • Reliability as a feature: Treating reliability with the same importance as features
  • Engineering solutions: Automating operations work
  • Measurable objectives: Using SLOs and SLIs to measure reliability
  • Error budgets: Balancing innovation with reliability
  • Blameless postmortems: Learning from failures

The SRE Principles

1. Embrace Risk

Accept that 100% reliability is impossible and unnecessary.

Availability Target: 99.9% (Three Nines)
Allowed downtime: 43.8 minutes/month
Error budget: 0.1%

Availability Target: 99.99% (Four Nines)
Allowed downtime: 4.38 minutes/month
Error budget: 0.01%
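
A quick way to sanity-check these figures is to derive them from the target directly. The sketch below is illustrative (not from any tooling discussed later) and assumes an average calendar month of 365.25 / 12 ≈ 30.44 days:

# downtime_budget.py - derive allowed downtime from an availability target
# (illustrative sketch; assumes an average month of 365.25 / 12 days)

MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # ~43,830 minutes

def allowed_downtime(availability_target: float) -> dict:
    """Return the error budget and allowed monthly downtime for a target."""
    error_budget = 1 - availability_target / 100       # e.g. 0.001 for 99.9%
    return {
        "target": f"{availability_target}%",
        "error_budget_percent": round(error_budget * 100, 4),
        "allowed_downtime_minutes_per_month": round(MINUTES_PER_MONTH * error_budget, 2),
    }

for target in (99.9, 99.99):
    print(allowed_downtime(target))
# 99.9%  -> 0.1% budget, ~43.83 minutes/month
# 99.99% -> 0.01% budget, ~4.38 minutes/month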

2. Service Level Objectives (SLOs)

SLOs define target levels of reliability:

# Example SLO Definition
slo:
  name: "API Availability"
  description: "Percentage of successful API requests"
  target: 99.9
  window: 30d
  
  sli:
    type: availability
    numerator: "sum(rate(http_requests_total{status!~'5..'}[5m]))"
    denominator: "sum(rate(http_requests_total[5m]))"

3. Service Level Indicators (SLIs)

Quantitative measures of service level:

Common SLIs:

  • Availability: Fraction of time service is usable
  • Latency: Time to complete a request
  • Throughput: Requests per second
  • Error Rate: Fraction of failed requests
  • Correctness: Fraction of correct responses
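
To make these concrete, here is a minimal sketch (illustrative only; the sample records and field layout are assumptions) of computing two of the SLIs above, availability and p95 latency, from raw request records:

# sli_sketch.py - compute availability and p95 latency from raw request records
# (illustrative sketch; the sample data below is made up)

# Each record: (HTTP status code, latency in seconds)
records = [
    (200, 0.12), (200, 0.30), (500, 0.45), (200, 0.08),
    (200, 0.95), (503, 1.20), (200, 0.22), (200, 0.18),
]

# Availability SLI: fraction of non-5xx responses
good = sum(1 for status, _ in records if status < 500)
availability = good / len(records) * 100

# Latency SLI: 95th percentile response time (simple nearest-rank method)
latencies = sorted(latency for _, latency in records)
rank = max(1, int(round(0.95 * len(latencies))))
p95_latency = latencies[rank - 1]

print(f"Availability SLI: {availability:.2f}%")   # 75.00% for this toy sample
print(f"p95 latency SLI:  {p95_latency:.2f}s")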

Implementing SLOs Step-by-Step

Step 1: Identify Critical User Journeys

# user-journeys.yaml
user_journeys:
  - name: "User Login"
    steps:
      - "Load login page"
      - "Submit credentials"
      - "Receive auth token"
      - "Redirect to dashboard"
    criticality: high
    
  - name: "Product Search"
    steps:
      - "Enter search query"
      - "Display results"
      - "Filter results"
    criticality: high
    
  - name: "Checkout Process"
    steps:
      - "Add to cart"
      - "Review cart"
      - "Enter payment"
      - "Complete order"
    criticality: critical

Step 2: Define SLIs for Each Journey

# slis.yaml
slis:
  api_availability:
    description: "Percentage of successful API requests"
    query: |
      sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="api"}[5m]))
    unit: "percent"
    
  api_latency_p95:
    description: "95th percentile API response time"
    query: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket{job="api"}[5m])
      )
    unit: "seconds"
    threshold: 0.5
    
  api_error_rate:
    description: "Percentage of failed API requests"
    query: |
      sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="api"}[5m]))
    unit: "percent"
    threshold: 1.0

Step 3: Set SLO Targets

# slos.yaml
slos:
  - name: "API Availability SLO"
    sli: api_availability
    target: 99.9
    window: 30d
    alerting:
      burn_rate_1h: 14.4
      burn_rate_6h: 6.0
    
  - name: "API Latency SLO"
    sli: api_latency_p95
    target_less_than: 500  # milliseconds
    percentile: 95
    window: 30d
    
  - name: "Search Availability"
    sli: search_availability
    target: 99.5
    window: 30d
    dependencies:
      - elasticsearch
      - cache

Step 4: Implement SLO Monitoring

# slo_calculator.py
from prometheus_api_client import PrometheusConnect
from datetime import datetime, timedelta

class SLOCalculator:
    def __init__(self, prometheus_url):
        self.prom = PrometheusConnect(url=prometheus_url, disable_ssl=True)
    
    def calculate_availability_slo(self, service, window_days=30):
        """Calculate availability SLO for a service"""
        end_time = datetime.now()
        start_time = end_time - timedelta(days=window_days)
        
        # Query for successful requests
        success_query = f'''
            sum(rate(http_requests_total{{
                job="{service}",
                code!~"5.."
            }}[5m]))
        '''
        
        # Query for total requests
        total_query = f'''
            sum(rate(http_requests_total{{
                job="{service}"
            }}[5m]))
        '''
        
        success_data = self.prom.custom_query_range(
            success_query,
            start_time=start_time,
            end_time=end_time,
            step='5m'
        )
        
        total_data = self.prom.custom_query_range(
            total_query,
            start_time=start_time,
            end_time=end_time,
            step='5m'
        )
        
        # Calculate SLI.
        # Summing the per-step rate samples gives values proportional to request
        # counts; since the same factor appears in numerator and denominator,
        # the ratio still approximates availability over the window.
        # (Assumes each query returns a single time series.)
        total_successful = sum([float(d[1]) for d in success_data[0]['values']])
        total_requests = sum([float(d[1]) for d in total_data[0]['values']])
        
        sli = (total_successful / total_requests) * 100 if total_requests > 0 else 0
        
        return {
            'service': service,
            'window_days': window_days,
            'sli': round(sli, 4),
            'total_requests': int(total_requests),
            'successful_requests': int(total_successful),
            'failed_requests': int(total_requests - total_successful)
        }
    
    def calculate_error_budget(self, service, slo_target=99.9, window_days=30):
        """Calculate remaining error budget"""
        result = self.calculate_availability_slo(service, window_days)
        current_sli = result['sli']
        
        # Calculate error budget
        allowed_failure_rate = 100 - slo_target
        actual_failure_rate = 100 - current_sli
        
        error_budget_consumed = (actual_failure_rate / allowed_failure_rate) * 100
        error_budget_remaining = 100 - error_budget_consumed
        
        return {
            **result,
            'slo_target': slo_target,
            'allowed_failure_rate': allowed_failure_rate,
            'actual_failure_rate': round(actual_failure_rate, 4),
            'error_budget_consumed_percent': round(error_budget_consumed, 2),
            'error_budget_remaining_percent': round(error_budget_remaining, 2),
            'status': 'HEALTHY' if error_budget_remaining > 10 else 'WARNING' if error_budget_remaining > 0 else 'CRITICAL'
        }

# Usage
calculator = SLOCalculator('http://prometheus:9090')
budget = calculator.calculate_error_budget('api-service', slo_target=99.9)

print(f"Service: {budget['service']}")
print(f"Current SLI: {budget['sli']}%")
print(f"SLO Target: {budget['slo_target']}%")
print(f"Error Budget Remaining: {budget['error_budget_remaining_percent']}%")
print(f"Status: {budget['status']}")

Error Budgets

Understanding Error Budgets

SLO: 99.9% availability
Error Budget: 100% - 99.9% = 0.1%

For a 30-day window:
Total time: 30 days × 24 hours × 60 minutes = 43,200 minutes
Allowed downtime: 43,200 × 0.001 = 43.2 minutes
(The 43.8-minute figure quoted earlier uses an average calendar month of about 30.44 days.)

Error budget remaining determines:
- Can we deploy new features?
- Should we focus on reliability?
- What's our risk appetite?

Error Budget Policy

# error-budget-policy.yaml
error_budget_policy:
  service: api-service
  slo_target: 99.9
  window: 30d
  
  thresholds:
    green:
      min: 50
      actions:
        - "Normal feature development"
        - "2 deployments per day allowed"
        - "Experimental features permitted"
    
    yellow:
      min: 10
      max: 50
      actions:
        - "Increased caution on deployments"
        - "1 deployment per day allowed"
        - "Require staging validation"
        - "No experimental features"
    
    red:
      max: 10
      actions:
        - "Feature freeze"
        - "Emergency fixes only"
        - "Focus on reliability improvements"
        - "Incident review required"
        - "Rollback recent changes"
  
  alerts:
    - threshold: 25
      severity: warning
      notification: slack
    - threshold: 10
      severity: critical
      notification: pagerduty

Implementing Error Budget Enforcement

# error_budget_enforcer.py
class ErrorBudgetEnforcer:
    def __init__(self, calculator, policy):
        self.calculator = calculator
        self.policy = policy
    
    def check_deployment_allowed(self, service):
        """Check if deployment is allowed based on error budget"""
        budget = self.calculator.calculate_error_budget(
            service,
            slo_target=self.policy['slo_target']
        )
        
        remaining = budget['error_budget_remaining_percent']
        
        if remaining > 50:
            return {
                'allowed': True,
                'risk_level': 'LOW',
                'max_deployments_per_day': 2,
                'message': 'Normal operations - proceed with deployment'
            }
        elif remaining > 10:
            return {
                'allowed': True,
                'risk_level': 'MEDIUM',
                'max_deployments_per_day': 1,
                'message': 'Increased caution - require additional validation'
            }
        else:
            return {
                'allowed': False,
                'risk_level': 'HIGH',
                'max_deployments_per_day': 0,
                'message': 'Feature freeze - focus on reliability improvements'
            }
    
    def generate_report(self, service):
        """Generate error budget report"""
        budget = self.calculator.calculate_error_budget(service)
        deployment_status = self.check_deployment_allowed(service)
        
        report = f"""
        ═══════════════════════════════════════════
        ERROR BUDGET REPORT - {service.upper()}
        ═══════════════════════════════════════════
        
        Current SLI:          {budget['sli']}%
        SLO Target:           {budget['slo_target']}%
        
        Error Budget:
        - Consumed:           {budget['error_budget_consumed_percent']}%
        - Remaining:          {budget['error_budget_remaining_percent']}%
        - Status:             {budget['status']}
        
        Deployment Status:
        - Allowed:            {'✓ YES' if deployment_status['allowed'] else '✗ NO'}
        - Risk Level:         {deployment_status['risk_level']}
        - Max Deployments:    {deployment_status['max_deployments_per_day']}/day
        
        Recommendation:
        {deployment_status['message']}
        
        ═══════════════════════════════════════════
        """
        
        return report

# Usage
enforcer = ErrorBudgetEnforcer(calculator, error_budget_policy)
print(enforcer.generate_report('api-service'))

Alerting on SLO Burn Rate

Multi-Window Multi-Burn-Rate Alerts

# prometheus-slo-alerts.yaml
groups:
- name: slo_alerts
  interval: 30s
  rules:
  # Fast burn rate (1 hour window) - page immediately
  - alert: SLOBurnRateFast
    expr: |
      (
        sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
        /
        sum(rate(http_requests_total{job="api"}[1h]))
      ) < 0.9856  # 14.4x burn rate for a 99.9% SLO (1 - 14.4 * 0.001)
    for: 2m
    labels:
      severity: critical
      slo: api_availability
    annotations:
      summary: "Fast SLO burn detected - {{ $value | humanizePercentage }}"
      description: "API availability is burning through error budget 14.4x faster than acceptable"
      
  # Slow burn rate (6 hour window) - ticket
  - alert: SLOBurnRateSlow
    expr: |
      (
        sum(rate(http_requests_total{job="api",code!~"5.."}[6h]))
        /
        sum(rate(http_requests_total{job="api"}[6h]))
      ) < 0.994  # 6x burn rate for a 99.9% SLO (1 - 6 * 0.001)
    for: 15m
    labels:
      severity: warning
      slo: api_availability
    annotations:
      summary: "Slow SLO burn detected - {{ $value | humanizePercentage }}"
      description: "API availability is burning through error budget faster than target"
  
  # Latency SLO
  - alert: LatencySLOViolation
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket{job="api"}[5m])
      ) > 0.5
    for: 5m
    labels:
      severity: warning
      slo: api_latency
    annotations:
      summary: "P95 latency exceeds 500ms - {{ $value }}s"
      description: "API latency SLO is being violated"

Toil Reduction

Identifying Toil

Toil is operational work that:

  • Is manual
  • Is repetitive
  • Can be automated
  • Is tactical, not strategic
  • Grows with service size
  • Lacks enduring value

Measuring Toil

# toil_tracker.py
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class ToilTask:
    name: str
    time_minutes: int
    frequency: str  # daily, weekly, monthly
    category: str
    automatable: bool
    
class ToilCalculator:
    def __init__(self):
        self.tasks: List[ToilTask] = []
    
    def add_task(self, task: ToilTask):
        self.tasks.append(task)
    
    def calculate_monthly_toil(self):
        """Calculate total monthly toil hours"""
        total_minutes = 0
        
        for task in self.tasks:
            if task.frequency == 'daily':
                total_minutes += task.time_minutes * 30
            elif task.frequency == 'weekly':
                total_minutes += task.time_minutes * 4
            elif task.frequency == 'monthly':
                total_minutes += task.time_minutes
        
        return total_minutes / 60  # Convert to hours
    
    def toil_by_category(self):
        """Group toil by category"""
        categories = {}
        
        for task in self.tasks:
            freq_multiplier = {'daily': 30, 'weekly': 4, 'monthly': 1}
            minutes = task.time_minutes * freq_multiplier[task.frequency]
            
            if task.category not in categories:
                categories[task.category] = 0
            categories[task.category] += minutes / 60
        
        return categories
    
    def automation_opportunities(self):
        """Identify high-value automation opportunities"""
        opportunities = []
        
        for task in self.tasks:
            if task.automatable:
                freq_multiplier = {'daily': 30, 'weekly': 4, 'monthly': 1}
                monthly_hours = (task.time_minutes * freq_multiplier[task.frequency]) / 60
                
                opportunities.append({
                    'task': task.name,
                    'monthly_hours_saved': monthly_hours,
                    'category': task.category
                })
        
        return sorted(opportunities, key=lambda x: x['monthly_hours_saved'], reverse=True)

# Example usage
tracker = ToilCalculator()

# Add toil tasks
tracker.add_task(ToilTask("Manual deployment verification", 30, "daily", "Deployments", True))
tracker.add_task(ToilTask("Certificate renewal", 45, "monthly", "Security", True))
tracker.add_task(ToilTask("Log analysis for errors", 60, "daily", "Monitoring", True))
tracker.add_task(ToilTask("Database backup verification", 15, "daily", "Backups", True))
tracker.add_task(ToilTask("Capacity planning review", 120, "weekly", "Planning", False))

print(f"Total monthly toil: {tracker.calculate_monthly_toil():.1f} hours")
print(f"\nToil by category: {tracker.toil_by_category()}")
print(f"\nTop automation opportunities:")
for opp in tracker.automation_opportunities()[:3]:
    print(f"  - {opp['task']}: {opp['monthly_hours_saved']:.1f} hours/month")

Automation Script Example

#!/bin/bash
# automate-deployment-verification.sh

set -euo pipefail

SERVICE_NAME="$1"
DEPLOYMENT_ID="$2"
ENVIRONMENT="$3"

echo "🔍 Automated Deployment Verification"
echo "Service: $SERVICE_NAME"
echo "Deployment: $DEPLOYMENT_ID"
echo "Environment: $ENVIRONMENT"

# 1. Check deployment status
echo "✓ Checking deployment status..."
kubectl rollout status deployment/$SERVICE_NAME -n $ENVIRONMENT --timeout=5m

# 2. Verify pod health
echo "✓ Verifying pod health..."
DESIRED_REPLICAS=$(kubectl get deployment $SERVICE_NAME -n $ENVIRONMENT -o jsonpath='{.spec.replicas}')
READY_REPLICAS=$(kubectl get deployment $SERVICE_NAME -n $ENVIRONMENT -o jsonpath='{.status.readyReplicas}')

if [ "$DESIRED_REPLICAS" != "$READY_REPLICAS" ]; then
    echo "❌ Not all replicas are ready: $READY_REPLICAS/$DESIRED_REPLICAS"
    exit 1
fi

# 3. Run smoke tests
echo "✓ Running smoke tests..."
ENDPOINT="https://$SERVICE_NAME.$ENVIRONMENT.example.com"

for endpoint in "/health" "/ready" "/metrics"; do
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" $ENDPOINT$endpoint)
    if [ "$STATUS" != "200" ]; then
        echo "❌ Smoke test failed for $endpoint: HTTP $STATUS"
        exit 1
    fi
    echo "  ✓ $endpoint: OK"
done

# 4. Check error rate
echo "✓ Checking error rate..."
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(http_requests_total{service=\"$SERVICE_NAME\",code=~\"5..\"}[5m]))/sum(rate(http_requests_total{service=\"$SERVICE_NAME\"}[5m]))" \
    | jq -r '.data.result[0].value[1]')

# Treat an empty result (e.g. no traffic in the window) as a zero error rate
if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "null" ]; then
    ERROR_RATE=0
fi

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "❌ Error rate too high: $ERROR_RATE"
    exit 1
fi

# 5. Record deployment
echo "✓ Recording deployment..."
curl -X POST http://deployment-tracker/api/deployments \
    -H "Content-Type: application/json" \
    -d "{
        \"service\": \"$SERVICE_NAME\",
        \"deployment_id\": \"$DEPLOYMENT_ID\",
        \"environment\": \"$ENVIRONMENT\",
        \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
        \"status\": \"success\"
    }"

echo "✅ Deployment verification complete!"

# Time saved: 30 minutes per day → 15 hours per month

Incident Management

Incident Severity Levels

# incident-severity.yaml
severity_levels:
  SEV1:
    name: "Critical"
    description: "Service is down or severely degraded"
    examples:
      - "Complete service outage"
      - "Data loss or corruption"
      - "Security breach"
    response_time: "15 minutes"
    response_team:
      - on_call_sre
      - engineering_manager
      - cto
    
  SEV2:
    name: "High"
    description: "Significant feature unavailable"
    examples:
      - "Major feature broken"
      - "Performance severely degraded"
      - "Affecting >10% of users"
    response_time: "1 hour"
    response_team:
      - on_call_sre
      - product_owner
    
  SEV3:
    name: "Medium"
    description: "Minor feature degraded"
    examples:
      - "Non-critical feature broken"
      - "Minor performance degradation"
      - "Affecting <10% of users"
    response_time: "4 hours"
    response_team:
      - on_call_sre
    
  SEV4:
    name: "Low"
    description: "Cosmetic issues"
    examples:
      - "UI issues"
      - "Documentation errors"
    response_time: "next business day"
    response_team:
      - assigned_engineer

Incident Response Runbook

# Incident Response Runbook

## Phase 1: Detection & Triage (0-5 minutes)

1. **Acknowledge Alert**

       # Acknowledge in PagerDuty
       pd incident ack <incident-id>

2. **Assess Severity**
   - Is the service down?
   - How many users affected?
   - Is data at risk?

3. **Declare Incident**

       # Create incident channel
       /incident create --severity SEV2 --title "API Latency Spike"

## Phase 2: Investigation (5-30 minutes)

1. **Gather Information**

       # Check service health
       kubectl get pods -n production

       # View recent logs
       kubectl logs -n production deployment/api --tail=100

       # Check metrics
       curl "http://prometheus:9090/api/v1/query?query=up{job='api'}"

2. **Form Hypothesis**
   - Recent deployments?
   - Infrastructure changes?
   - External dependencies?

3. **Test Hypothesis**

       # Check recent deployments
       kubectl rollout history deployment/api -n production

       # Compare with previous version
       kubectl diff -f deployment.yaml

## Phase 3: Mitigation (30-60 minutes)

1. **Implement Fix**

       # Option 1: Rollback
       kubectl rollout undo deployment/api -n production

       # Option 2: Scale up
       kubectl scale deployment/api --replicas=10 -n production

       # Option 3: Emergency patch
       kubectl set image deployment/api api=api:hotfix-123

2. **Verify Fix**
   - Check metrics improved
   - Verify error rate decreased
   - Confirm latency normalized

## Phase 4: Recovery (60+ minutes)

1. **Monitor Stability**
   - Watch metrics for 30 minutes
   - Ensure no regression

2. **Close Incident**

       /incident close --resolution "Rolled back to v1.2.3"

## Phase 5: Post-Incident

1. **Schedule Postmortem**
   - Within 48 hours
   - Blameless culture
   - Focus on system improvements

Blameless Postmortem Template

# Postmortem: API Latency Spike

**Date**: 2024-10-12
**Duration**: 2 hours 15 minutes
**Severity**: SEV2
**Impact**: 35% of API requests experienced >5s latency

## Summary

Between 14:00 and 16:15 UTC, our API experienced significant latency spikes affecting approximately 35% of requests. The issue was caused by a database connection pool exhaustion following a deployment that increased default query timeout.

## Timeline

| Time  | Event |
|-------|-------|
| 14:00 | Deployment of v2.5.0 completed |
| 14:05 | First latency alerts triggered |
| 14:10 | Incident declared (SEV2) |
| 14:15 | Investigation began |
| 14:30 | Hypothesis: Database connection issue |
| 14:45 | Confirmed: Connection pool exhausted |
| 15:00 | Rollback initiated |
| 15:15 | Rollback completed |
| 15:30 | Metrics returned to normal |
| 16:15 | Incident closed |

## Root Cause

The new code version increased the default database query timeout from 5s to 30s. During peak traffic, this caused connections to be held longer, eventually exhausting the connection pool (max 100 connections).

## Impact

- **Users Affected**: ~50,000 users
- **Requests Impacted**: 2.1M requests with >5s latency
- **Revenue Impact**: Estimated $12,000 in lost transactions
- **Error Budget**: Consumed 15% of monthly budget

## What Went Well

✅ Monitoring detected the issue within 5 minutes
✅ Team responded quickly and followed runbook
✅ Rollback was smooth and effective
✅ Communication was clear and timely

## What Went Wrong

❌ Deployment didn't include load testing with new timeout
❌ Connection pool monitoring was not in place
❌ No automated rollback on SLO violation

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool metrics to Grafana | Alice | 2024-10-19 | P0 |
| Implement load testing in CI/CD | Bob | 2024-10-26 | P0 |
| Create automated rollback on SLO violation | Charlie | 2024-11-02 | P1 |
| Document database timeout best practices | David | 2024-10-26 | P2 |
| Review all timeout configurations | Team | 2024-11-09 | P2 |

## Lessons Learned

1. **Always load test timeout changes** - Seemingly small configuration changes can have major impact
2. **Monitor resource exhaustion** - Connection pools, file descriptors, memory
3. **Implement progressive rollouts** - Canary deployments would have caught this
4. **Trust but verify** - Staging didn't replicate production load

## Prevention

Going forward:
- Mandatory load testing for any timeout/connection configuration changes
- Connection pool utilization must be <70% in production
- Automated rollback if error rate exceeds 5% for 5 minutes
- Monthly review of all timeout configurations
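
One of the prevention items above, automated rollback when the error rate stays high, can be sketched in a few lines. This is illustrative only: the Prometheus URL, labels, deployment name, and thresholds are assumptions, not values from the incident.

# auto_rollback_sketch.py - roll back a deployment if the error rate stays high
# (illustrative sketch; URLs, labels, and thresholds are assumptions)
import subprocess
import time

import requests

PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{job="api",code=~"5.."}[5m])) / '
    'sum(rate(http_requests_total{job="api"}[5m]))'
)
THRESHOLD = 0.05          # 5% error rate
SUSTAINED_SECONDS = 300   # must stay high for 5 minutes before rolling back

def current_error_rate() -> float:
    resp = requests.get(PROMETHEUS_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def watch_and_rollback() -> None:
    breached_since = None
    while True:
        if current_error_rate() > THRESHOLD:
            breached_since = breached_since or time.time()
            if time.time() - breached_since >= SUSTAINED_SECONDS:
                # Sustained breach: roll back the deployment and stop watching
                subprocess.run(
                    ["kubectl", "rollout", "undo", "deployment/api", "-n", "production"],
                    check=True,
                )
                return
        else:
            breached_since = None
        time.sleep(30)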

On-Call Best Practices

On-Call Rotation

# oncall-schedule.yaml
schedule:
  rotation_length: 1 week
  handoff_time: "09:00 local"
  
  primary_rotation:
    - alice
    - bob
    - charlie
    - david
  
  secondary_rotation:
    - eve
    - frank
    - grace
  
  coverage:
    weekdays: "24/7"
    weekends: "on-call-only"
  
  escalation_policy:
    - level: 1
      delay: 5 minutes
      notify: primary
    - level: 2
      delay: 15 minutes
      notify: secondary
    - level: 3
      delay: 30 minutes
      notify: engineering_manager

On-Call Checklist

# On-Call Engineer Checklist

## Before Your Shift

- [ ] Review open incidents from previous shift
- [ ] Check current system health dashboard
- [ ] Review error budget status
- [ ] Test alert notifications (SMS, email, app)
- [ ] Ensure VPN access working
- [ ] Have laptop charged and available
- [ ] Review recent deployments
- [ ] Check calendar for scheduled maintenance

## During Your Shift

- [ ] Respond to alerts within 15 minutes
- [ ] Update incident channels regularly
- [ ] Document all actions taken
- [ ] Escalate if unable to resolve in 1 hour
- [ ] Monitor error budget consumption
- [ ] Keep stakeholders informed

## After Your Shift

- [ ] Complete handoff document
- [ ] Brief next on-call engineer
- [ ] Close resolved incidents
- [ ] File any necessary follow-up tickets
- [ ] Update runbooks if needed

SRE Tools and Automation

Chaos Engineering

# chaos_monkey.py
import random
import time
from kubernetes import client, config

class ChaosMonkey:
    def __init__(self):
        config.load_kube_config()
        self.api = client.CoreV1Api()
    
    def kill_random_pod(self, namespace, label_selector):
        """Kill a random pod matching the selector"""
        pods = self.api.list_namespaced_pod(
            namespace=namespace,
            label_selector=label_selector
        )
        
        if not pods.items:
            print("No pods found")
            return
        
        target_pod = random.choice(pods.items)
        print(f"Terminating pod: {target_pod.metadata.name}")
        
        self.api.delete_namespaced_pod(
            name=target_pod.metadata.name,
            namespace=namespace
        )
    
    def introduce_latency(self, namespace, deployment, delay_ms=1000):
        """Add network latency to a deployment (placeholder).

        Latency injection typically requires a fault-injection proxy such as
        Toxiproxy or a service mesh fault filter; it is not implemented here.
        """
        pass
    
    def run_experiment(self, namespace, label_selector, duration_minutes=5):
        """Run chaos experiment"""
        print(f"Starting chaos experiment for {duration_minutes} minutes")
        end_time = time.time() + (duration_minutes * 60)
        
        while time.time() < end_time:
            self.kill_random_pod(namespace, label_selector)
            time.sleep(random.randint(30, 120))  # Wait 30-120 seconds
        
        print("Chaos experiment complete")

# Usage
chaos = ChaosMonkey()
chaos.run_experiment(
    namespace="production",
    label_selector="app=api,tier=backend",
    duration_minutes=5
)

Key Metrics for SRE

# golden-signals.yaml
golden_signals:
  latency:
    description: "Time to service a request"
    metrics:
      - p50_latency
      - p95_latency
      - p99_latency
    target: "p95 < 500ms"
  
  traffic:
    description: "Demand on the system"
    metrics:
      - requests_per_second
      - concurrent_connections
    target: "Handle 10,000 RPS"
  
  errors:
    description: "Rate of failed requests"
    metrics:
      - error_rate
      - 5xx_rate
    target: "< 0.1% error rate"
  
  saturation:
    description: "How full the service is"
    metrics:
      - cpu_utilization
      - memory_utilization
      - disk_utilization
      - connection_pool_utilization
    target: "< 70% utilization"

Conclusion

Site Reliability Engineering provides a framework for building and maintaining reliable systems at scale. By implementing SLOs, error budgets, and automation, you can balance innovation with reliability while maintaining high service quality.

Key Takeaways

  • Define clear SLOs based on user experience
  • Use error budgets to balance reliability and innovation
  • Automate toil to free up engineering time
  • Practice blameless postmortems to learn from failures
  • Monitor the right metrics - SLIs that matter to users
  • Build for failure - expect and plan for incidents
  • Continuous improvement - iterate on processes

How do you implement SRE in your organization? Share your experiences!

Author

Hari Prasad

Seasoned DevOps Lead with 11+ years of expertise in cloud infrastructure, CI/CD automation, and infrastructure as code. Proven track record in designing scalable, secure systems on AWS using Terraform, Kubernetes, Jenkins, and Ansible. Strong leadership in mentoring teams and implementing cost-effective cloud solutions.
