
Site Reliability Engineering: Practical Guide to SLOs, SLIs, and Error Budgets

Comprehensive guide to implementing Site Reliability Engineering practices including SLOs, SLIs, error budgets, and automation strategies.

Hari Prasad
October 12, 2024

Site Reliability Engineering (SRE) bridges the gap between development and operations by applying software engineering principles to infrastructure and operations problems. This comprehensive guide covers implementing SRE practices in your organization.

What is SRE?

SRE is Google’s approach to DevOps that emphasizes:

  • Reliability as a feature: Treating reliability with the same importance as features
  • Engineering solutions: Automating operations work
  • Measurable objectives: Using SLOs and SLIs to measure reliability
  • Error budgets: Balancing innovation with reliability
  • Blameless postmortems: Learning from failures

The SRE Principles

1. Embrace Risk

Accept that 100% reliability is impossible and unnecessary.

Availability Target: 99.9% (Three Nines)
Allowed downtime: 43.8 minutes/month
Error budget: 0.1%

Availability Target: 99.99% (Four Nines)
Allowed downtime: 4.38 minutes/month
Error budget: 0.01%
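
A quick way to sanity-check these figures is to derive them from the target directly. The sketch below is illustrative (not from any tooling discussed later) and assumes an average calendar month of 365.25 / 12 ≈ 30.44 days:

# downtime_budget.py - derive allowed downtime from an availability target
# (illustrative sketch; assumes an average month of 365.25 / 12 days)

MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # ~43,830 minutes

def allowed_downtime(availability_target: float) -> dict:
    """Return the error budget and allowed monthly downtime for a target."""
    error_budget = 1 - availability_target / 100       # e.g. 0.001 for 99.9%
    return {
        "target": f"{availability_target}%",
        "error_budget_percent": round(error_budget * 100, 4),
        "allowed_downtime_minutes_per_month": round(MINUTES_PER_MONTH * error_budget, 2),
    }

for target in (99.9, 99.99):
    print(allowed_downtime(target))
# 99.9%  -> 0.1% budget, ~43.83 minutes/month
# 99.99% -> 0.01% budget, ~4.38 minutes/month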

2. Service Level Objectives (SLOs)

SLOs define target levels of reliability:

# Example SLO Definition
slo:
  name: "API Availability"
  description: "Percentage of successful API requests"
  target: 99.9
  window: 30d
  
  sli:
    type: availability
    numerator: "sum(rate(http_requests_total{status!~'5..'}[5m]))"
    denominator: "sum(rate(http_requests_total[5m]))"

3. Service Level Indicators (SLIs)

Quantitative measures of service level:

Common SLIs:

  • Availability: Fraction of time service is usable
  • Latency: Time to complete a request
  • Throughput: Requests per second
  • Error Rate: Fraction of failed requests
  • Correctness: Fraction of correct responses
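
To make these concrete, here is a minimal sketch (illustrative only; the sample records and field layout are assumptions) of computing two of the SLIs above, availability and p95 latency, from raw request records:

# sli_sketch.py - compute availability and p95 latency from raw request records
# (illustrative sketch; the sample data below is made up)

# Each record: (HTTP status code, latency in seconds)
records = [
    (200, 0.12), (200, 0.30), (500, 0.45), (200, 0.08),
    (200, 0.95), (503, 1.20), (200, 0.22), (200, 0.18),
]

# Availability SLI: fraction of non-5xx responses
good = sum(1 for status, _ in records if status < 500)
availability = good / len(records) * 100

# Latency SLI: 95th percentile response time (simple nearest-rank method)
latencies = sorted(latency for _, latency in records)
rank = max(1, int(round(0.95 * len(latencies))))
p95_latency = latencies[rank - 1]

print(f"Availability SLI: {availability:.2f}%")   # 75.00% for this toy sample
print(f"p95 latency SLI:  {p95_latency:.2f}s")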

Implementing SLOs Step-by-Step

Step 1: Identify Critical User Journeys

# user-journeys.yaml
user_journeys:
  - name: "User Login"
    steps:
      - "Load login page"
      - "Submit credentials"
      - "Receive auth token"
      - "Redirect to dashboard"
    criticality: high
    
  - name: "Product Search"
    steps:
      - "Enter search query"
      - "Display results"
      - "Filter results"
    criticality: high
    
  - name: "Checkout Process"
    steps:
      - "Add to cart"
      - "Review cart"
      - "Enter payment"
      - "Complete order"
    criticality: critical

Step 2: Define SLIs for Each Journey

# slis.yaml
slis:
  api_availability:
    description: "Percentage of successful API requests"
    query: |
      sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="api"}[5m]))
    unit: "percent"
    
  api_latency_p95:
    description: "95th percentile API response time"
    query: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket{job="api"}[5m])
      )
    unit: "seconds"
    threshold: 0.5
    
  api_error_rate:
    description: "Percentage of failed API requests"
    query: |
      sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="api"}[5m]))
    unit: "percent"
    threshold: 1.0

Step 3: Set SLO Targets

# slos.yaml
slos:
  - name: "API Availability SLO"
    sli: api_availability
    target: 99.9
    window: 30d
    alerting:
      burn_rate_1h: 14.4
      burn_rate_6h: 6.0
    
  - name: "API Latency SLO"
    sli: api_latency_p95
    target_less_than: 500  # milliseconds
    percentile: 95
    window: 30d
    
  - name: "Search Availability"
    sli: search_availability
    target: 99.5
    window: 30d
    dependencies:
      - elasticsearch
      - cache

Step 4: Implement SLO Monitoring

# slo_calculator.py
from prometheus_api_client import PrometheusConnect
from datetime import datetime, timedelta

class SLOCalculator:
    def __init__(self, prometheus_url):
        self.prom = PrometheusConnect(url=prometheus_url, disable_ssl=True)
    
    def calculate_availability_slo(self, service, window_days=30):
        """Calculate availability SLO for a service"""
        end_time = datetime.now()
        start_time = end_time - timedelta(days=window_days)
        
        # Query for successful requests
        success_query = f'''
            sum(rate(http_requests_total{{
                job="{service}",
                code!~"5.."
            }}[5m]))
        '''
        
        # Query for total requests
        total_query = f'''
            sum(rate(http_requests_total{{
                job="{service}"
            }}[5m]))
        '''
        
        success_data = self.prom.custom_query_range(
            success_query,
            start_time=start_time,
            end_time=end_time,
            step='5m'
        )
        
        total_data = self.prom.custom_query_range(
            total_query,
            start_time=start_time,
            end_time=end_time,
            step='5m'
        )
        
        # Calculate SLI.
        # Summing the per-step rate samples gives values proportional to request
        # counts; since the same factor appears in numerator and denominator,
        # the ratio still approximates availability over the window.
        # (Assumes each query returns a single time series.)
        total_successful = sum([float(d[1]) for d in success_data[0]['values']])
        total_requests = sum([float(d[1]) for d in total_data[0]['values']])
        
        sli = (total_successful / total_requests) * 100 if total_requests > 0 else 0
        
        return {
            'service': service,
            'window_days': window_days,
            'sli': round(sli, 4),
            'total_requests': int(total_requests),
            'successful_requests': int(total_successful),
            'failed_requests': int(total_requests - total_successful)
        }
    
    def calculate_error_budget(self, service, slo_target=99.9, window_days=30):
        """Calculate remaining error budget"""
        result = self.calculate_availability_slo(service, window_days)
        current_sli = result['sli']
        
        # Calculate error budget
        allowed_failure_rate = 100 - slo_target
        actual_failure_rate = 100 - current_sli
        
        error_budget_consumed = (actual_failure_rate / allowed_failure_rate) * 100
        error_budget_remaining = 100 - error_budget_consumed
        
        return {
            **result,
            'slo_target': slo_target,
            'allowed_failure_rate': allowed_failure_rate,
            'actual_failure_rate': round(actual_failure_rate, 4),
            'error_budget_consumed_percent': round(error_budget_consumed, 2),
            'error_budget_remaining_percent': round(error_budget_remaining, 2),
            'status': 'HEALTHY' if error_budget_remaining > 10 else 'WARNING' if error_budget_remaining > 0 else 'CRITICAL'
        }

# Usage
calculator = SLOCalculator('http://prometheus:9090')
budget = calculator.calculate_error_budget('api-service', slo_target=99.9)

print(f"Service: {budget['service']}")
print(f"Current SLI: {budget['sli']}%")
print(f"SLO Target: {budget['slo_target']}%")
print(f"Error Budget Remaining: {budget['error_budget_remaining_percent']}%")
print(f"Status: {budget['status']}")

Error Budgets

Understanding Error Budgets

SLO: 99.9% availability
Error Budget: 100% - 99.9% = 0.1%

For a 30-day window:
Total time: 30 days × 24 hours × 60 minutes = 43,200 minutes
Allowed downtime: 43,200 × 0.001 = 43.2 minutes
(The 43.8-minute figure quoted earlier uses an average calendar month of about 30.44 days.)

Error budget remaining determines:
- Can we deploy new features?
- Should we focus on reliability?
- What's our risk appetite?

Error Budget Policy

# error-budget-policy.yaml
error_budget_policy:
  service: api-service
  slo_target: 99.9
  window: 30d
  
  thresholds:
    green:
      min: 50
      actions:
        - "Normal feature development"
        - "2 deployments per day allowed"
        - "Experimental features permitted"
    
    yellow:
      min: 10
      max: 50
      actions:
        - "Increased caution on deployments"
        - "1 deployment per day allowed"
        - "Require staging validation"
        - "No experimental features"
    
    red:
      max: 10
      actions:
        - "Feature freeze"
        - "Emergency fixes only"
        - "Focus on reliability improvements"
        - "Incident review required"
        - "Rollback recent changes"
  
  alerts:
    - threshold: 25
      severity: warning
      notification: slack
    - threshold: 10
      severity: critical
      notification: pagerduty

Implementing Error Budget Enforcement

# error_budget_enforcer.py
class ErrorBudgetEnforcer:
    def __init__(self, calculator, policy):
        self.calculator = calculator
        self.policy = policy
    
    def check_deployment_allowed(self, service):
        """Check if deployment is allowed based on error budget"""
        budget = self.calculator.calculate_error_budget(
            service,
            slo_target=self.policy['slo_target']
        )
        
        remaining = budget['error_budget_remaining_percent']
        
        if remaining > 50:
            return {
                'allowed': True,
                'risk_level': 'LOW',
                'max_deployments_per_day': 2,
                'message': 'Normal operations - proceed with deployment'
            }
        elif remaining > 10:
            return {
                'allowed': True,
                'risk_level': 'MEDIUM',
                'max_deployments_per_day': 1,
                'message': 'Increased caution - require additional validation'
            }
        else:
            return {
                'allowed': False,
                'risk_level': 'HIGH',
                'max_deployments_per_day': 0,
                'message': 'Feature freeze - focus on reliability improvements'
            }
    
    def generate_report(self, service):
        """Generate error budget report"""
        budget = self.calculator.calculate_error_budget(service)
        deployment_status = self.check_deployment_allowed(service)
        
        report = f"""
        ═══════════════════════════════════════════
        ERROR BUDGET REPORT - {service.upper()}
        ═══════════════════════════════════════════
        
        Current SLI:          {budget['sli']}%
        SLO Target:           {budget['slo_target']}%
        
        Error Budget:
        - Consumed:           {budget['error_budget_consumed_percent']}%
        - Remaining:          {budget['error_budget_remaining_percent']}%
        - Status:             {budget['status']}
        
        Deployment Status:
        - Allowed:            {'✓ YES' if deployment_status['allowed'] else '✗ NO'}
        - Risk Level:         {deployment_status['risk_level']}
        - Max Deployments:    {deployment_status['max_deployments_per_day']}/day
        
        Recommendation:
        {deployment_status['message']}
        
        ═══════════════════════════════════════════
        """
        
        return report

# Usage
enforcer = ErrorBudgetEnforcer(calculator, error_budget_policy)
print(enforcer.generate_report('api-service'))

Alerting on SLO Burn Rate

Multi-Window Multi-Burn-Rate Alerts

# prometheus-slo-alerts.yaml
groups:
- name: slo_alerts
  interval: 30s
  rules:
  # Fast burn rate (1 hour window) - page immediately
  - alert: SLOBurnRateFast
    expr: |
      (
        sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
        /
        sum(rate(http_requests_total{job="api"}[1h]))
      ) < 0.9856  # 14.4x burn rate for a 99.9% SLO (1 - 14.4 * 0.001)
    for: 2m
    labels:
      severity: critical
      slo: api_availability
    annotations:
      summary: "Fast SLO burn detected - {{ $value | humanizePercentage }}"
      description: "API availability is burning through error budget 14.4x faster than acceptable"
      
  # Slow burn rate (6 hour window) - ticket
  - alert: SLOBurnRateSlow
    expr: |
      (
        sum(rate(http_requests_total{job="api",code!~"5.."}[6h]))
        /
        sum(rate(http_requests_total{job="api"}[6h]))
      ) < 0.994  # 6x burn rate for a 99.9% SLO (1 - 6 * 0.001)
    for: 15m
    labels:
      severity: warning
      slo: api_availability
    annotations:
      summary: "Slow SLO burn detected - {{ $value | humanizePercentage }}"
      description: "API availability is burning through error budget faster than target"
  
  # Latency SLO
  - alert: LatencySLOViolation
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket{job="api"}[5m])
      ) > 0.5
    for: 5m
    labels:
      severity: warning
      slo: api_latency
    annotations:
      summary: "P95 latency exceeds 500ms - {{ $value }}s"
      description: "API latency SLO is being violated"

Toil Reduction

Identifying Toil

Toil is operational work that:

  • Is manual
  • Is repetitive
  • Can be automated
  • Is tactical, not strategic
  • Grows with service size
  • Lacks enduring value

Measuring Toil

# toil_tracker.py
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class ToilTask:
    name: str
    time_minutes: int
    frequency: str  # daily, weekly, monthly
    category: str
    automatable: bool
    
class ToilCalculator:
    def __init__(self):
        self.tasks: List[ToilTask] = []
    
    def add_task(self, task: ToilTask):
        self.tasks.append(task)
    
    def calculate_monthly_toil(self):
        """Calculate total monthly toil hours"""
        total_minutes = 0
        
        for task in self.tasks:
            if task.frequency == 'daily':
                total_minutes += task.time_minutes * 30
            elif task.frequency == 'weekly':
                total_minutes += task.time_minutes * 4
            elif task.frequency == 'monthly':
                total_minutes += task.time_minutes
        
        return total_minutes / 60  # Convert to hours
    
    def toil_by_category(self):
        """Group toil by category"""
        categories = {}
        
        for task in self.tasks:
            freq_multiplier = {'daily': 30, 'weekly': 4, 'monthly': 1}
            minutes = task.time_minutes * freq_multiplier[task.frequency]
            
            if task.category not in categories:
                categories[task.category] = 0
            categories[task.category] += minutes / 60
        
        return categories
    
    def automation_opportunities(self):
        """Identify high-value automation opportunities"""
        opportunities = []
        
        for task in self.tasks:
            if task.automatable:
                freq_multiplier = {'daily': 30, 'weekly': 4, 'monthly': 1}
                monthly_hours = (task.time_minutes * freq_multiplier[task.frequency]) / 60
                
                opportunities.append({
                    'task': task.name,
                    'monthly_hours_saved': monthly_hours,
                    'category': task.category
                })
        
        return sorted(opportunities, key=lambda x: x['monthly_hours_saved'], reverse=True)

# Example usage
tracker = ToilCalculator()

# Add toil tasks
tracker.add_task(ToilTask("Manual deployment verification", 30, "daily", "Deployments", True))
tracker.add_task(ToilTask("Certificate renewal", 45, "monthly", "Security", True))
tracker.add_task(ToilTask("Log analysis for errors", 60, "daily", "Monitoring", True))
tracker.add_task(ToilTask("Database backup verification", 15, "daily", "Backups", True))
tracker.add_task(ToilTask("Capacity planning review", 120, "weekly", "Planning", False))

print(f"Total monthly toil: {tracker.calculate_monthly_toil():.1f} hours")
print(f"\nToil by category: {tracker.toil_by_category()}")
print(f"\nTop automation opportunities:")
for opp in tracker.automation_opportunities()[:3]:
    print(f"  - {opp['task']}: {opp['monthly_hours_saved']:.1f} hours/month")

Automation Script Example

#!/bin/bash
# automate-deployment-verification.sh

set -euo pipefail

SERVICE_NAME="$1"
DEPLOYMENT_ID="$2"
ENVIRONMENT="$3"

echo "🔍 Automated Deployment Verification"
echo "Service: $SERVICE_NAME"
echo "Deployment: $DEPLOYMENT_ID"
echo "Environment: $ENVIRONMENT"

# 1. Check deployment status
echo "✓ Checking deployment status..."
kubectl rollout status deployment/$SERVICE_NAME -n $ENVIRONMENT --timeout=5m

# 2. Verify pod health
echo "✓ Verifying pod health..."
DESIRED_REPLICAS=$(kubectl get deployment $SERVICE_NAME -n $ENVIRONMENT -o jsonpath='{.spec.replicas}')
READY_REPLICAS=$(kubectl get deployment $SERVICE_NAME -n $ENVIRONMENT -o jsonpath='{.status.readyReplicas}')

if [ "$DESIRED_REPLICAS" != "$READY_REPLICAS" ]; then
    echo "❌ Not all replicas are ready: $READY_REPLICAS/$DESIRED_REPLICAS"
    exit 1
fi

# 3. Run smoke tests
echo "✓ Running smoke tests..."
ENDPOINT="https://$SERVICE_NAME.$ENVIRONMENT.example.com"

for endpoint in "/health" "/ready" "/metrics"; do
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" $ENDPOINT$endpoint)
    if [ "$STATUS" != "200" ]; then
        echo "❌ Smoke test failed for $endpoint: HTTP $STATUS"
        exit 1
    fi
    echo "  ✓ $endpoint: OK"
done

# 4. Check error rate
echo "✓ Checking error rate..."
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(http_requests_total{service=\"$SERVICE_NAME\",code=~\"5..\"}[5m]))/sum(rate(http_requests_total{service=\"$SERVICE_NAME\"}[5m]))" \
    | jq -r '.data.result[0].value[1]')

# Treat an empty result (e.g. no traffic in the window) as a zero error rate
if [ -z "$ERROR_RATE" ] || [ "$ERROR_RATE" = "null" ]; then
    ERROR_RATE=0
fi

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "❌ Error rate too high: $ERROR_RATE"
    exit 1
fi

# 5. Record deployment
echo "✓ Recording deployment..."
curl -X POST http://deployment-tracker/api/deployments \
    -H "Content-Type: application/json" \
    -d "{
        \"service\": \"$SERVICE_NAME\",
        \"deployment_id\": \"$DEPLOYMENT_ID\",
        \"environment\": \"$ENVIRONMENT\",
        \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
        \"status\": \"success\"
    }"

echo "✅ Deployment verification complete!"

# Time saved: 30 minutes per day → 15 hours per month

Incident Management

Incident Severity Levels

# incident-severity.yaml
severity_levels:
  SEV1:
    name: "Critical"
    description: "Service is down or severely degraded"
    examples:
      - "Complete service outage"
      - "Data loss or corruption"
      - "Security breach"
    response_time: "15 minutes"
    response_team:
      - on_call_sre
      - engineering_manager
      - cto
    
  SEV2:
    name: "High"
    description: "Significant feature unavailable"
    examples:
      - "Major feature broken"
      - "Performance severely degraded"
      - "Affecting >10% of users"
    response_time: "1 hour"
    response_team:
      - on_call_sre
      - product_owner
    
  SEV3:
    name: "Medium"
    description: "Minor feature degraded"
    examples:
      - "Non-critical feature broken"
      - "Minor performance degradation"
      - "Affecting <10% of users"
    response_time: "4 hours"
    response_team:
      - on_call_sre
    
  SEV4:
    name: "Low"
    description: "Cosmetic issues"
    examples:
      - "UI issues"
      - "Documentation errors"
    response_time: "next business day"
    response_team:
      - assigned_engineer

Incident Response Runbook

# Incident Response Runbook

## Phase 1: Detection & Triage (0-5 minutes)

1. **Acknowledge Alert**

       # Acknowledge in PagerDuty
       pd incident ack <incident-id>

2. **Assess Severity**
   - Is the service down?
   - How many users affected?
   - Is data at risk?

3. **Declare Incident**

       # Create incident channel
       /incident create --severity SEV2 --title "API Latency Spike"

## Phase 2: Investigation (5-30 minutes)

1. **Gather Information**

       # Check service health
       kubectl get pods -n production

       # View recent logs
       kubectl logs -n production deployment/api --tail=100

       # Check metrics
       curl "http://prometheus:9090/api/v1/query?query=up{job='api'}"

2. **Form Hypothesis**
   - Recent deployments?
   - Infrastructure changes?
   - External dependencies?

3. **Test Hypothesis**

       # Check recent deployments
       kubectl rollout history deployment/api -n production

       # Compare with previous version
       kubectl diff -f deployment.yaml

## Phase 3: Mitigation (30-60 minutes)

1. **Implement Fix**

       # Option 1: Rollback
       kubectl rollout undo deployment/api -n production

       # Option 2: Scale up
       kubectl scale deployment/api --replicas=10 -n production

       # Option 3: Emergency patch
       kubectl set image deployment/api api=api:hotfix-123

2. **Verify Fix**
   - Check metrics improved
   - Verify error rate decreased
   - Confirm latency normalized

## Phase 4: Recovery (60+ minutes)

1. **Monitor Stability**
   - Watch metrics for 30 minutes
   - Ensure no regression

2. **Close Incident**

       /incident close --resolution "Rolled back to v1.2.3"

## Phase 5: Post-Incident

1. **Schedule Postmortem**
   - Within 48 hours
   - Blameless culture
   - Focus on system improvements

Blameless Postmortem Template

# Postmortem: API Latency Spike

**Date**: 2024-10-12
**Duration**: 2 hours 15 minutes
**Severity**: SEV2
**Impact**: 35% of API requests experienced >5s latency

## Summary

Between 14:00 and 16:15 UTC, our API experienced significant latency spikes affecting approximately 35% of requests. The issue was caused by a database connection pool exhaustion following a deployment that increased default query timeout.

## Timeline

| Time  | Event |
|-------|-------|
| 14:00 | Deployment of v2.5.0 completed |
| 14:05 | First latency alerts triggered |
| 14:10 | Incident declared (SEV2) |
| 14:15 | Investigation began |
| 14:30 | Hypothesis: Database connection issue |
| 14:45 | Confirmed: Connection pool exhausted |
| 15:00 | Rollback initiated |
| 15:15 | Rollback completed |
| 15:30 | Metrics returned to normal |
| 16:15 | Incident closed |

## Root Cause

The new code version increased the default database query timeout from 5s to 30s. During peak traffic, this caused connections to be held longer, eventually exhausting the connection pool (max 100 connections).

## Impact

- **Users Affected**: ~50,000 users
- **Requests Impacted**: 2.1M requests with >5s latency
- **Revenue Impact**: Estimated $12,000 in lost transactions
- **Error Budget**: Consumed 15% of monthly budget

## What Went Well

✅ Monitoring detected the issue within 5 minutes
✅ Team responded quickly and followed runbook
✅ Rollback was smooth and effective
✅ Communication was clear and timely

## What Went Wrong

❌ Deployment didn't include load testing with new timeout
❌ Connection pool monitoring was not in place
❌ No automated rollback on SLO violation

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool metrics to Grafana | Alice | 2024-10-19 | P0 |
| Implement load testing in CI/CD | Bob | 2024-10-26 | P0 |
| Create automated rollback on SLO violation | Charlie | 2024-11-02 | P1 |
| Document database timeout best practices | David | 2024-10-26 | P2 |
| Review all timeout configurations | Team | 2024-11-09 | P2 |

## Lessons Learned

1. **Always load test timeout changes** - Seemingly small configuration changes can have major impact
2. **Monitor resource exhaustion** - Connection pools, file descriptors, memory
3. **Implement progressive rollouts** - Canary deployments would have caught this
4. **Trust but verify** - Staging didn't replicate production load

## Prevention

Going forward:
- Mandatory load testing for any timeout/connection configuration changes
- Connection pool utilization must be <70% in production
- Automated rollback if error rate exceeds 5% for 5 minutes
- Monthly review of all timeout configurations
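
One of the prevention items above, automated rollback when the error rate stays high, can be sketched in a few lines. This is illustrative only: the Prometheus URL, labels, deployment name, and thresholds are assumptions, not values from the incident.

# auto_rollback_sketch.py - roll back a deployment if the error rate stays high
# (illustrative sketch; URLs, labels, and thresholds are assumptions)
import subprocess
import time

import requests

PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{job="api",code=~"5.."}[5m])) / '
    'sum(rate(http_requests_total{job="api"}[5m]))'
)
THRESHOLD = 0.05          # 5% error rate
SUSTAINED_SECONDS = 300   # must stay high for 5 minutes before rolling back

def current_error_rate() -> float:
    resp = requests.get(PROMETHEUS_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def watch_and_rollback() -> None:
    breached_since = None
    while True:
        if current_error_rate() > THRESHOLD:
            breached_since = breached_since or time.time()
            if time.time() - breached_since >= SUSTAINED_SECONDS:
                # Sustained breach: roll back the deployment and stop watching
                subprocess.run(
                    ["kubectl", "rollout", "undo", "deployment/api", "-n", "production"],
                    check=True,
                )
                return
        else:
            breached_since = None
        time.sleep(30)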

On-Call Best Practices

On-Call Rotation

# oncall-schedule.yaml
schedule:
  rotation_length: 1 week
  handoff_time: "09:00 local"
  
  primary_rotation:
    - alice
    - bob
    - charlie
    - david
  
  secondary_rotation:
    - eve
    - frank
    - grace
  
  coverage:
    weekdays: "24/7"
    weekends: "on-call-only"
  
  escalation_policy:
    - level: 1
      delay: 5 minutes
      notify: primary
    - level: 2
      delay: 15 minutes
      notify: secondary
    - level: 3
      delay: 30 minutes
      notify: engineering_manager

On-Call Checklist

# On-Call Engineer Checklist

## Before Your Shift

- [ ] Review open incidents from previous shift
- [ ] Check current system health dashboard
- [ ] Review error budget status
- [ ] Test alert notifications (SMS, email, app)
- [ ] Ensure VPN access working
- [ ] Have laptop charged and available
- [ ] Review recent deployments
- [ ] Check calendar for scheduled maintenance

## During Your Shift

- [ ] Respond to alerts within 15 minutes
- [ ] Update incident channels regularly
- [ ] Document all actions taken
- [ ] Escalate if unable to resolve in 1 hour
- [ ] Monitor error budget consumption
- [ ] Keep stakeholders informed

## After Your Shift

- [ ] Complete handoff document
- [ ] Brief next on-call engineer
- [ ] Close resolved incidents
- [ ] File any necessary follow-up tickets
- [ ] Update runbooks if needed

SRE Tools and Automation

Chaos Engineering

# chaos_monkey.py
import random
import time
from kubernetes import client, config

class ChaosMonkey:
    def __init__(self):
        config.load_kube_config()
        self.api = client.CoreV1Api()
    
    def kill_random_pod(self, namespace, label_selector):
        """Kill a random pod matching the selector"""
        pods = self.api.list_namespaced_pod(
            namespace=namespace,
            label_selector=label_selector
        )
        
        if not pods.items:
            print("No pods found")
            return
        
        target_pod = random.choice(pods.items)
        print(f"Terminating pod: {target_pod.metadata.name}")
        
        self.api.delete_namespaced_pod(
            name=target_pod.metadata.name,
            namespace=namespace
        )
    
    def introduce_latency(self, namespace, deployment, delay_ms=1000):
        """Add network latency to a deployment (placeholder).

        Latency injection typically requires a fault-injection proxy such as
        Toxiproxy or a service mesh fault filter; it is not implemented here.
        """
        pass
    
    def run_experiment(self, namespace, label_selector, duration_minutes=5):
        """Run chaos experiment"""
        print(f"Starting chaos experiment for {duration_minutes} minutes")
        end_time = time.time() + (duration_minutes * 60)
        
        while time.time() < end_time:
            self.kill_random_pod(namespace, label_selector)
            time.sleep(random.randint(30, 120))  # Wait 30-120 seconds
        
        print("Chaos experiment complete")

# Usage
chaos = ChaosMonkey()
chaos.run_experiment(
    namespace="production",
    label_selector="app=api,tier=backend",
    duration_minutes=5
)

Key Metrics for SRE

# golden-signals.yaml
golden_signals:
  latency:
    description: "Time to service a request"
    metrics:
      - p50_latency
      - p95_latency
      - p99_latency
    target: "p95 < 500ms"
  
  traffic:
    description: "Demand on the system"
    metrics:
      - requests_per_second
      - concurrent_connections
    target: "Handle 10,000 RPS"
  
  errors:
    description: "Rate of failed requests"
    metrics:
      - error_rate
      - 5xx_rate
    target: "< 0.1% error rate"
  
  saturation:
    description: "How full the service is"
    metrics:
      - cpu_utilization
      - memory_utilization
      - disk_utilization
      - connection_pool_utilization
    target: "< 70% utilization"

Conclusion

Site Reliability Engineering provides a framework for building and maintaining reliable systems at scale. By implementing SLOs, error budgets, and automation, you can balance innovation with reliability while maintaining high service quality.

Key Takeaways

  • Define clear SLOs based on user experience
  • Use error budgets to balance reliability and innovation
  • Automate toil to free up engineering time
  • Practice blameless postmortems to learn from failures
  • Monitor the right metrics - SLIs that matter to users
  • Build for failure - expect and plan for incidents
  • Continuous improvement - iterate on processes

How do you implement SRE in your organization? Share your experiences!

Author

Hari Prasad

Seasoned DevOps Lead with 11+ years of expertise in cloud infrastructure, CI/CD automation, and infrastructure as code. Proven track record in designing scalable, secure systems on AWS using Terraform, Kubernetes, Jenkins, and Ansible. Strong leadership in mentoring teams and implementing cost-effective cloud solutions.
