Site Reliability Engineering (SRE) bridges the gap between development and operations by applying software engineering principles to infrastructure and operations problems. This comprehensive guide covers implementing SRE practices in your organization.
What is SRE?
SRE is Google's approach to running production systems, often described as a concrete implementation of DevOps. It emphasizes:
- Reliability as a feature: Treating reliability with the same importance as features
- Engineering solutions: Automating operations work
- Measurable objectives: Using SLOs and SLIs to measure reliability
- Error budgets: Balancing innovation with reliability
- Blameless postmortems: Learning from failures
The SRE Principles
1. Embrace Risk
Accept that 100% reliability is neither achievable nor necessary: past a certain point users cannot tell the difference, but each additional nine costs dramatically more.
- Availability target 99.9% ("three nines"): error budget 0.1%, roughly 43.8 minutes of allowed downtime per calendar month (43.2 minutes over a 30-day window)
- Availability target 99.99% ("four nines"): error budget 0.01%, roughly 4.4 minutes of allowed downtime per month
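To make these targets concrete, a few lines of Python turn any availability target into an error budget and an allowed-downtime figure. This is a minimal sketch; the 30-day window is an assumption (a calendar month gives the slightly larger 43.8-minute figure):

```python
def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over the window."""
    window_minutes = window_days * 24 * 60
    error_budget = 1 - target_percent / 100
    return window_minutes * error_budget

for target in (99.9, 99.99, 99.999):
    print(f"{target}%: {error_budget_minutes(target):.2f} minutes of downtime per 30 days")
```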
2. Service Level Objectives (SLOs)
SLOs define target levels of reliability:
# Example SLO Definition
slo:
name: "API Availability"
description: "Percentage of successful API requests"
target: 99.9
window: 30d
sli:
type: availability
numerator: "sum(rate(http_requests_total{status!~'5..'}[5m]))"
denominator: "sum(rate(http_requests_total[5m]))"
3. Service Level Indicators (SLIs)
Quantitative measures of service level:
Common SLIs:
- Availability: Fraction of time service is usable
- Latency: Time to complete a request
- Throughput: Requests per second
- Error Rate: Fraction of failed requests
- Correctness: Fraction of correct responses
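The same indicators can be computed anywhere you have raw request data, not just inside a metrics system. The sketch below is illustrative only (the `Request` record type is invented for the example) and derives availability, error rate, and p95 latency from a batch of requests:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:  # hypothetical record type for this example
    duration_seconds: float
    status_code: int

def compute_slis(requests: list[Request]) -> dict:
    """Derive basic SLIs from raw request records."""
    total = len(requests)
    failures = sum(1 for r in requests if r.status_code >= 500)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    p95 = quantiles([r.duration_seconds for r in requests], n=20)[18]
    return {
        "availability_percent": 100 * (total - failures) / total,
        "error_rate_percent": 100 * failures / total,
        "latency_p95_seconds": p95,
    }
```

Feeding a day of access-log entries through something like this is often enough to get a first cut at realistic SLO targets.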
Implementing SLOs Step-by-Step
Step 1: Identify Critical User Journeys
# user-journeys.yaml
user_journeys:
- name: "User Login"
steps:
- "Load login page"
- "Submit credentials"
- "Receive auth token"
- "Redirect to dashboard"
criticality: high
- name: "Product Search"
steps:
- "Enter search query"
- "Display results"
- "Filter results"
criticality: high
- name: "Checkout Process"
steps:
- "Add to cart"
- "Review cart"
- "Enter payment"
- "Complete order"
criticality: critical
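A journey inventory is most useful when something consumes it. As a small sketch (assuming PyYAML and the `user-journeys.yaml` file above), this lists journeys in order of criticality so the team knows where to define SLIs first:

```python
import yaml  # PyYAML

ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

with open("user-journeys.yaml") as f:
    journeys = yaml.safe_load(f)["user_journeys"]

# Work through journeys most-critical-first when defining SLIs and SLOs
for journey in sorted(journeys, key=lambda j: ORDER.get(j["criticality"], 99)):
    print(f"[{journey['criticality'].upper():>8}] {journey['name']} "
          f"({len(journey['steps'])} steps)")
```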
Step 2: Define SLIs for Each Journey
# slis.yaml
slis:
api_availability:
description: "Percentage of successful API requests"
query: |
sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
unit: "percent"
api_latency_p95:
description: "95th percentile API response time"
query: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{job="api"}[5m])
)
unit: "seconds"
threshold: 0.5
api_error_rate:
description: "Percentage of failed API requests"
query: |
sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
unit: "percent"
threshold: 1.0
Step 3: Set SLO Targets
# slos.yaml
slos:
- name: "API Availability SLO"
sli: api_availability
target: 99.9
window: 30d
alerting:
burn_rate_1h: 14.4
burn_rate_6h: 6.0
- name: "API Latency SLO"
sli: api_latency_p95
target_less_than: 500 # milliseconds
percentile: 95
window: 30d
- name: "Search Availability"
sli: search_availability
target: 99.5
window: 30d
dependencies:
- elasticsearch
- cache
Step 4: Implement SLO Monitoring
# slo_calculator.py
from prometheus_api_client import PrometheusConnect
from datetime import datetime, timedelta
class SLOCalculator:
def __init__(self, prometheus_url):
self.prom = PrometheusConnect(url=prometheus_url, disable_ssl=True)
def calculate_availability_slo(self, service, window_days=30):
"""Calculate availability SLO for a service"""
end_time = datetime.now()
start_time = end_time - timedelta(days=window_days)
# Query for successful requests
success_query = f'''
sum(rate(http_requests_total{{
job="{service}",
code!~"5.."
}}[5m]))
'''
# Query for total requests
total_query = f'''
sum(rate(http_requests_total{{
job="{service}"
}}[5m]))
'''
success_data = self.prom.custom_query_range(
success_query,
start_time=start_time,
end_time=end_time,
step='5m'
)
total_data = self.prom.custom_query_range(
total_query,
start_time=start_time,
end_time=end_time,
step='5m'
)
# Calculate SLI: each sample is a per-second rate, so multiply by the
# 300-second step to approximate request counts over the window
total_successful = sum(float(d[1]) for d in success_data[0]['values']) * 300
total_requests = sum(float(d[1]) for d in total_data[0]['values']) * 300
sli = (total_successful / total_requests) * 100 if total_requests > 0 else 0
return {
'service': service,
'window_days': window_days,
'sli': round(sli, 4),
'total_requests': int(total_requests),
'successful_requests': int(total_successful),
'failed_requests': int(total_requests - total_successful)
}
def calculate_error_budget(self, service, slo_target=99.9, window_days=30):
"""Calculate remaining error budget"""
result = self.calculate_availability_slo(service, window_days)
current_sli = result['sli']
# Calculate error budget
allowed_failure_rate = 100 - slo_target
actual_failure_rate = 100 - current_sli
error_budget_consumed = (actual_failure_rate / allowed_failure_rate) * 100
error_budget_remaining = 100 - error_budget_consumed
return {
**result,
'slo_target': slo_target,
'allowed_failure_rate': allowed_failure_rate,
'actual_failure_rate': round(actual_failure_rate, 4),
'error_budget_consumed_percent': round(error_budget_consumed, 2),
'error_budget_remaining_percent': round(error_budget_remaining, 2),
'status': 'HEALTHY' if error_budget_remaining > 10 else 'WARNING' if error_budget_remaining > 0 else 'CRITICAL'
}
# Usage
calculator = SLOCalculator('http://prometheus:9090')
budget = calculator.calculate_error_budget('api-service', slo_target=99.9)
print(f"Service: {budget['service']}")
print(f"Current SLI: {budget['sli']}%")
print(f"SLO Target: {budget['slo_target']}%")
print(f"Error Budget Remaining: {budget['error_budget_remaining_percent']}%")
print(f"Status: {budget['status']}")
Error Budgets
Understanding Error Budgets
SLO: 99.9% availability
Error Budget: 100% - 99.9% = 0.1%
For 30 days:
Total time: 30 days × 24 hours × 60 minutes = 43,200 minutes
Allowed downtime: 43,200 × 0.001 = 43.2 minutes
Error budget remaining determines:
- Can we deploy new features?
- Should we focus on reliability?
- What's our risk appetite?
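As a worked example of the same arithmetic: with a 99.9% SLO over a 30-day window and (hypothetically) 20 minutes of downtime so far, the budget math looks like this:

```python
slo_target = 99.9
window_minutes = 30 * 24 * 60                             # 43,200 minutes
budget_minutes = window_minutes * (1 - slo_target / 100)  # 43.2 minutes

downtime_so_far = 20.0   # hypothetical downtime this window, in minutes
consumed = downtime_so_far / budget_minutes
print(f"Error budget consumed: {consumed:.0%}")                                    # ~46%
print(f"Error budget remaining: {budget_minutes - downtime_so_far:.1f} minutes")   # 23.2
```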
Error Budget Policy
# error-budget-policy.yaml
error_budget_policy:
service: api-service
slo_target: 99.9
window: 30d
thresholds:
green:
min: 50
actions:
- "Normal feature development"
- "2 deployments per day allowed"
- "Experimental features permitted"
yellow:
min: 10
max: 50
actions:
- "Increased caution on deployments"
- "1 deployment per day allowed"
- "Require staging validation"
- "No experimental features"
red:
max: 10
actions:
- "Feature freeze"
- "Emergency fixes only"
- "Focus on reliability improvements"
- "Incident review required"
- "Rollback recent changes"
alerts:
- threshold: 25
severity: warning
notification: slack
- threshold: 10
severity: critical
notification: pagerduty
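Rather than hard-coding thresholds in tooling, the policy file itself can drive decisions. A minimal sketch, assuming PyYAML and the `error-budget-policy.yaml` layout above, that resolves the current tier from the remaining budget:

```python
import yaml  # PyYAML

def resolve_tier(policy: dict, remaining_percent: float) -> tuple[str, list[str]]:
    """Map remaining error budget (in percent) to a policy tier and its actions."""
    for tier, spec in policy["thresholds"].items():
        low = spec.get("min", 0)
        high = spec.get("max", 100)
        if low <= remaining_percent <= high:
            return tier, spec["actions"]
    return "red", policy["thresholds"]["red"]["actions"]

with open("error-budget-policy.yaml") as f:
    policy = yaml.safe_load(f)["error_budget_policy"]

tier, actions = resolve_tier(policy, remaining_percent=37.5)
print(f"Tier: {tier}")                       # yellow for this example
print("\n".join(f"- {a}" for a in actions))
```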
Implementing Error Budget Enforcement
# error_budget_enforcer.py
class ErrorBudgetEnforcer:
def __init__(self, calculator, policy):
self.calculator = calculator
self.policy = policy
def check_deployment_allowed(self, service):
"""Check if deployment is allowed based on error budget"""
budget = self.calculator.calculate_error_budget(
service,
slo_target=self.policy['slo_target']
)
remaining = budget['error_budget_remaining_percent']
if remaining > 50:
return {
'allowed': True,
'risk_level': 'LOW',
'max_deployments_per_day': 2,
'message': 'Normal operations - proceed with deployment'
}
elif remaining > 10:
return {
'allowed': True,
'risk_level': 'MEDIUM',
'max_deployments_per_day': 1,
'message': 'Increased caution - require additional validation'
}
else:
return {
'allowed': False,
'risk_level': 'HIGH',
'max_deployments_per_day': 0,
'message': 'Feature freeze - focus on reliability improvements'
}
def generate_report(self, service):
"""Generate error budget report"""
budget = self.calculator.calculate_error_budget(service)
deployment_status = self.check_deployment_allowed(service)
report = f"""
═══════════════════════════════════════════
ERROR BUDGET REPORT - {service.upper()}
═══════════════════════════════════════════
Current SLI: {budget['sli']}%
SLO Target: {budget['slo_target']}%
Error Budget:
- Consumed: {budget['error_budget_consumed_percent']}%
- Remaining: {budget['error_budget_remaining_percent']}%
- Status: {budget['status']}
Deployment Status:
- Allowed: {'✓ YES' if deployment_status['allowed'] else '✗ NO'}
- Risk Level: {deployment_status['risk_level']}
- Max Deployments: {deployment_status['max_deployments_per_day']}/day
Recommendation:
{deployment_status['message']}
═══════════════════════════════════════════
"""
return report
# Usage (assumes error_budget_policy was loaded from error-budget-policy.yaml, e.g. with yaml.safe_load)
enforcer = ErrorBudgetEnforcer(calculator, error_budget_policy)
print(enforcer.generate_report('api-service'))
Alerting on SLO Burn Rate
Multi-Window Multi-Burn-Rate Alerts
# prometheus-slo-alerts.yaml
groups:
- name: slo_alerts
interval: 30s
rules:
# Fast burn rate (1 hour window) - page immediately
- alert: SLOBurnRateFast
expr: |
(
sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
/
sum(rate(http_requests_total{job="api"}[1h]))
) < 0.9856 # 14.4x burn rate for a 99.9% SLO (1 - 14.4 * 0.001)
for: 2m
labels:
severity: critical
slo: api_availability
annotations:
summary: "Fast SLO burn detected - {{ $value | humanizePercentage }}"
description: "API availability is burning through error budget 14.4x faster than acceptable"
# Slow burn rate (6 hour window) - ticket
- alert: SLOBurnRateSlow
expr: |
(
sum(rate(http_requests_total{job="api",code!~"5.."}[6h]))
/
sum(rate(http_requests_total{job="api"}[6h]))
) < 0.994 # 6x burn rate for a 99.9% SLO (1 - 6 * 0.001)
for: 15m
labels:
severity: warning
slo: api_availability
annotations:
summary: "Slow SLO burn detected - {{ $value | humanizePercentage }}"
description: "API availability is burning through error budget faster than target"
# Latency SLO
- alert: LatencySLOViolation
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{job="api"}[5m])
) > 0.5
for: 5m
labels:
severity: warning
slo: api_latency
annotations:
summary: "P95 latency exceeds 500ms - {{ $value }}s"
description: "API latency SLO is being violated"
Toil Reduction
Identifying Toil
Toil is the operational work attached to running a service that:
- Is manual
- Is repetitive
- Can be automated
- Is tactical, not strategic
- Grows linearly with service size
- Lacks enduring value
Measuring Toil
# toil_tracker.py
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List
@dataclass
class ToilTask:
name: str
time_minutes: int
frequency: str # daily, weekly, monthly
category: str
automatable: bool
class ToilCalculator:
def __init__(self):
self.tasks: List[ToilTask] = []
def add_task(self, task: ToilTask):
self.tasks.append(task)
def calculate_monthly_toil(self):
"""Calculate total monthly toil hours"""
total_minutes = 0
for task in self.tasks:
if task.frequency == 'daily':
total_minutes += task.time_minutes * 30
elif task.frequency == 'weekly':
total_minutes += task.time_minutes * 4
elif task.frequency == 'monthly':
total_minutes += task.time_minutes
return total_minutes / 60 # Convert to hours
def toil_by_category(self):
"""Group toil by category"""
categories = {}
for task in self.tasks:
freq_multiplier = {'daily': 30, 'weekly': 4, 'monthly': 1}
minutes = task.time_minutes * freq_multiplier[task.frequency]
if task.category not in categories:
categories[task.category] = 0
categories[task.category] += minutes / 60
return categories
def automation_opportunities(self):
"""Identify high-value automation opportunities"""
opportunities = []
for task in self.tasks:
if task.automatable:
freq_multiplier = {'daily': 30, 'weekly': 4, 'monthly': 1}
monthly_hours = (task.time_minutes * freq_multiplier[task.frequency]) / 60
opportunities.append({
'task': task.name,
'monthly_hours_saved': monthly_hours,
'category': task.category
})
return sorted(opportunities, key=lambda x: x['monthly_hours_saved'], reverse=True)
# Example usage
tracker = ToilCalculator()
# Add toil tasks
tracker.add_task(ToilTask("Manual deployment verification", 30, "daily", "Deployments", True))
tracker.add_task(ToilTask("Certificate renewal", 45, "monthly", "Security", True))
tracker.add_task(ToilTask("Log analysis for errors", 60, "daily", "Monitoring", True))
tracker.add_task(ToilTask("Database backup verification", 15, "daily", "Backups", True))
tracker.add_task(ToilTask("Capacity planning review", 120, "weekly", "Planning", False))
print(f"Total monthly toil: {tracker.calculate_monthly_toil():.1f} hours")
print(f"\nToil by category: {tracker.toil_by_category()}")
print(f"\nTop automation opportunities:")
for opp in tracker.automation_opportunities()[:3]:
print(f" - {opp['task']}: {opp['monthly_hours_saved']:.1f} hours/month")
Automation Script Example
#!/bin/bash
# automate-deployment-verification.sh
set -euo pipefail
SERVICE_NAME="$1"
DEPLOYMENT_ID="$2"
ENVIRONMENT="$3"
echo "🔍 Automated Deployment Verification"
echo "Service: $SERVICE_NAME"
echo "Deployment: $DEPLOYMENT_ID"
echo "Environment: $ENVIRONMENT"
# 1. Check deployment status
echo "✓ Checking deployment status..."
kubectl rollout status deployment/$SERVICE_NAME -n $ENVIRONMENT --timeout=5m
# 2. Verify pod health
echo "✓ Verifying pod health..."
DESIRED_REPLICAS=$(kubectl get deployment $SERVICE_NAME -n $ENVIRONMENT -o jsonpath='{.spec.replicas}')
READY_REPLICAS=$(kubectl get deployment $SERVICE_NAME -n $ENVIRONMENT -o jsonpath='{.status.readyReplicas}')
if [ "$DESIRED_REPLICAS" != "$READY_REPLICAS" ]; then
echo "❌ Not all replicas are ready: $READY_REPLICAS/$DESIRED_REPLICAS"
exit 1
fi
# 3. Run smoke tests
echo "✓ Running smoke tests..."
ENDPOINT="https://$SERVICE_NAME.$ENVIRONMENT.example.com"
for endpoint in "/health" "/ready" "/metrics"; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" $ENDPOINT$endpoint)
if [ "$STATUS" != "200" ]; then
echo "❌ Smoke test failed for $endpoint: HTTP $STATUS"
exit 1
fi
echo " ✓ $endpoint: OK"
done
# 4. Check error rate
echo "✓ Checking error rate..."
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
--data-urlencode "query=sum(rate(http_requests_total{service=\"$SERVICE_NAME\",code=~\"5..\"}[5m]))/sum(rate(http_requests_total{service=\"$SERVICE_NAME\"}[5m]))" \
| jq -r '.data.result[0].value[1] // "0"')  # default to 0 when Prometheus returns no matching series
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "❌ Error rate too high: $ERROR_RATE"
exit 1
fi
# 5. Record deployment
echo "✓ Recording deployment..."
curl -X POST http://deployment-tracker/api/deployments \
-H "Content-Type: application/json" \
-d "{
\"service\": \"$SERVICE_NAME\",
\"deployment_id\": \"$DEPLOYMENT_ID\",
\"environment\": \"$ENVIRONMENT\",
\"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
\"status\": \"success\"
}"
echo "✅ Deployment verification complete!"
# Time saved: 30 minutes per day → 15 hours per month
Incident Management
Incident Severity Levels
# incident-severity.yaml
severity_levels:
SEV1:
name: "Critical"
description: "Service is down or severely degraded"
examples:
- "Complete service outage"
- "Data loss or corruption"
- "Security breach"
response_time: "15 minutes"
response_team:
- on_call_sre
- engineering_manager
- cto
SEV2:
name: "High"
description: "Significant feature unavailable"
examples:
- "Major feature broken"
- "Performance severely degraded"
- "Affecting >10% of users"
response_time: "1 hour"
response_team:
- on_call_sre
- product_owner
SEV3:
name: "Medium"
description: "Minor feature degraded"
examples:
- "Non-critical feature broken"
- "Minor performance degradation"
- "Affecting <10% of users"
response_time: "4 hours"
response_team:
- on_call_sre
SEV4:
name: "Low"
description: "Cosmetic issues"
examples:
- "UI issues"
- "Documentation errors"
response_time: "next business day"
response_team:
- assigned_engineer
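Severity is applied most consistently when the first responder does not have to reason about it under pressure. The helper below is illustrative only; its inputs and cut-offs are assumptions that mirror, but are not defined by, the policy above:

```python
def classify_severity(service_down: bool, data_at_risk: bool,
                      percent_users_affected: float) -> str:
    """Rough SEV classification; tune the rules to your own severity definitions."""
    if service_down or data_at_risk:
        return "SEV1"
    if percent_users_affected > 10:
        return "SEV2"
    if percent_users_affected > 0:
        return "SEV3"
    return "SEV4"

print(classify_severity(service_down=False, data_at_risk=False,
                        percent_users_affected=22.0))  # SEV2
```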
Incident Response Runbook
# Incident Response Runbook
## Phase 1: Detection & Triage (0-5 minutes)
1. **Acknowledge Alert**
   ```bash
   # Acknowledge in PagerDuty
   pd incident ack <incident-id>
   ```
2. **Assess Severity**
   - Is the service down?
   - How many users affected?
   - Is data at risk?
3. **Declare Incident**
   ```bash
   # Create incident channel
   /incident create --severity SEV2 --title "API Latency Spike"
   ```

## Phase 2: Investigation (5-30 minutes)
1. **Gather Information**
   ```bash
   # Check service health
   kubectl get pods -n production

   # View recent logs
   kubectl logs -n production deployment/api --tail=100

   # Check metrics
   curl "http://prometheus:9090/api/v1/query?query=up{job='api'}"
   ```
2. **Form Hypothesis**
   - Recent deployments?
   - Infrastructure changes?
   - External dependencies?
3. **Test Hypothesis**
   ```bash
   # Check recent deployments
   kubectl rollout history deployment/api -n production

   # Compare with previous version
   kubectl diff -f deployment.yaml
   ```

## Phase 3: Mitigation (30-60 minutes)
1. **Implement Fix**
   ```bash
   # Option 1: Rollback
   kubectl rollout undo deployment/api -n production

   # Option 2: Scale up
   kubectl scale deployment/api --replicas=10 -n production

   # Option 3: Emergency patch
   kubectl set image deployment/api api=api:hotfix-123
   ```
2. **Verify Fix**
   - Check metrics improved
   - Verify error rate decreased
   - Confirm latency normalized

## Phase 4: Recovery (60+ minutes)
1. **Monitor Stability**
   - Watch metrics for 30 minutes
   - Ensure no regression
2. **Close Incident**
   ```bash
   /incident close --resolution "Rolled back to v1.2.3"
   ```

## Phase 5: Post-Incident
1. **Schedule Postmortem**
   - Within 48 hours
   - Blameless culture
   - Focus on system improvements
Blameless Postmortem Template
# Postmortem: API Latency Spike
**Date**: 2024-10-12
**Duration**: 2 hours 15 minutes
**Severity**: SEV2
**Impact**: 35% of API requests experienced >5s latency
## Summary
Between 14:00 and 16:15 UTC, our API experienced significant latency spikes affecting approximately 35% of requests. The issue was caused by a database connection pool exhaustion following a deployment that increased default query timeout.
## Timeline
| Time | Event |
|-------|-------|
| 14:00 | Deployment of v2.5.0 completed |
| 14:05 | First latency alerts triggered |
| 14:10 | Incident declared (SEV2) |
| 14:15 | Investigation began |
| 14:30 | Hypothesis: Database connection issue |
| 14:45 | Confirmed: Connection pool exhausted |
| 15:00 | Rollback initiated |
| 15:15 | Rollback completed |
| 15:30 | Metrics returned to normal |
| 16:15 | Incident closed |
## Root Cause
The new code version increased the default database query timeout from 5s to 30s. During peak traffic, this caused connections to be held longer, eventually exhausting the connection pool (max 100 connections).
## Impact
- **Users Affected**: ~50,000 users
- **Requests Impacted**: 2.1M requests with >5s latency
- **Revenue Impact**: Estimated $12,000 in lost transactions
- **Error Budget**: Consumed 15% of monthly budget
## What Went Well
✅ Monitoring detected the issue within 5 minutes
✅ Team responded quickly and followed runbook
✅ Rollback was smooth and effective
✅ Communication was clear and timely
## What Went Wrong
❌ Deployment didn't include load testing with new timeout
❌ Connection pool monitoring was not in place
❌ No automated rollback on SLO violation
## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool metrics to Grafana | Alice | 2024-10-19 | P0 |
| Implement load testing in CI/CD | Bob | 2024-10-26 | P0 |
| Create automated rollback on SLO violation | Charlie | 2024-11-02 | P1 |
| Document database timeout best practices | David | 2024-10-26 | P2 |
| Review all timeout configurations | Team | 2024-11-09 | P2 |
## Lessons Learned
1. **Always load test timeout changes** - Seemingly small configuration changes can have major impact
2. **Monitor resource exhaustion** - Connection pools, file descriptors, memory
3. **Implement progressive rollouts** - Canary deployments would have caught this
4. **Trust but verify** - Staging didn't replicate production load
## Prevention
Going forward:
- Mandatory load testing for any timeout/connection configuration changes
- Connection pool utilization must be <70% in production
- Automated rollback if error rate exceeds 5% for 5 minutes
- Monthly review of all timeout configurations
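The automated-rollback action item above can start out very simple. Below is a hedged sketch, not the team's actual tooling: the Prometheus URL, job label, deployment name, and 5% threshold are placeholders, and it assumes the `requests` library plus `kubectl` on the path:

```python
import subprocess
import requests  # assumed to be installed

PROMETHEUS = "http://prometheus:9090"
QUERY = ('sum(rate(http_requests_total{job="api",code=~"5.."}[5m])) / '
         'sum(rate(http_requests_total{job="api"}[5m]))')

def current_error_rate() -> float:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def maybe_rollback(threshold: float = 0.05) -> None:
    """Roll back the deployment if the 5-minute error rate exceeds the threshold."""
    rate = current_error_rate()
    if rate > threshold:
        print(f"Error rate {rate:.2%} exceeds {threshold:.0%}; rolling back")
        subprocess.run(["kubectl", "rollout", "undo", "deployment/api", "-n", "production"],
                       check=True)
    else:
        print(f"Error rate {rate:.2%} is within budget")

if __name__ == "__main__":
    maybe_rollback()
```

Run from a cron job or CI step every few minutes, and paired with the burn-rate alerts above, this gives both a page and an automatic mitigation path.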
On-Call Best Practices
On-Call Rotation
# oncall-schedule.yaml
schedule:
rotation_length: 1 week
handoff_time: "09:00 local"
primary_rotation:
- alice
- bob
- charlie
- david
secondary_rotation:
- eve
- frank
- grace
coverage:
weekdays: "24/7"
weekends: "on-call-only"
escalation_policy:
- level: 1
delay: 5 minutes
notify: primary
- level: 2
delay: 15 minutes
notify: secondary
- level: 3
delay: 30 minutes
notify: engineering_manager
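To make the rotation machine-readable end to end, a short helper can resolve who is on call for any given date. This is a sketch only: the rotation start date is a made-up assumption, and most teams let PagerDuty or Opsgenie own this logic:

```python
from datetime import date

PRIMARY = ["alice", "bob", "charlie", "david"]
SECONDARY = ["eve", "frank", "grace"]
ROTATION_EPOCH = date(2024, 1, 1)  # hypothetical Monday on which the rotation started

def on_call(today: date) -> dict:
    """Resolve primary and secondary on-call for a weekly rotation."""
    week = (today - ROTATION_EPOCH).days // 7
    return {
        "primary": PRIMARY[week % len(PRIMARY)],
        "secondary": SECONDARY[week % len(SECONDARY)],
    }

print(on_call(date.today()))
```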
On-Call Checklist
# On-Call Engineer Checklist
## Before Your Shift
- [ ] Review open incidents from previous shift
- [ ] Check current system health dashboard
- [ ] Review error budget status
- [ ] Test alert notifications (SMS, email, app)
- [ ] Ensure VPN access working
- [ ] Have laptop charged and available
- [ ] Review recent deployments
- [ ] Check calendar for scheduled maintenance
## During Your Shift
- [ ] Respond to alerts within 15 minutes
- [ ] Update incident channels regularly
- [ ] Document all actions taken
- [ ] Escalate if unable to resolve in 1 hour
- [ ] Monitor error budget consumption
- [ ] Keep stakeholders informed
## After Your Shift
- [ ] Complete handoff document
- [ ] Brief next on-call engineer
- [ ] Close resolved incidents
- [ ] File any necessary follow-up tickets
- [ ] Update runbooks if needed
SRE Tools and Automation
Chaos Engineering
# chaos_monkey.py
import random
import time
from kubernetes import client, config
class ChaosMonkey:
def __init__(self):
config.load_kube_config()
self.api = client.CoreV1Api()
def kill_random_pod(self, namespace, label_selector):
"""Kill a random pod matching the selector"""
pods = self.api.list_namespaced_pod(
namespace=namespace,
label_selector=label_selector
)
if not pods.items:
print("No pods found")
return
target_pod = random.choice(pods.items)
print(f"Terminating pod: {target_pod.metadata.name}")
self.api.delete_namespaced_pod(
name=target_pod.metadata.name,
namespace=namespace
)
def introduce_latency(self, namespace, deployment, delay_ms=1000):
"""Add network latency to a deployment"""
# Using toxiproxy or similar tool
pass
def run_experiment(self, namespace, label_selector, duration_minutes=5):
"""Run chaos experiment"""
print(f"Starting chaos experiment for {duration_minutes} minutes")
end_time = time.time() + (duration_minutes * 60)
while time.time() < end_time:
self.kill_random_pod(namespace, label_selector)
time.sleep(random.randint(30, 120)) # Wait 30-120 seconds
print("Chaos experiment complete")
# Usage
chaos = ChaosMonkey()
chaos.run_experiment(
namespace="production",
label_selector="app=api,tier=backend",
duration_minutes=5
)
Key Metrics for SRE
# golden-signals.yaml
golden_signals:
latency:
description: "Time to service a request"
metrics:
- p50_latency
- p95_latency
- p99_latency
target: "p95 < 500ms"
traffic:
description: "Demand on the system"
metrics:
- requests_per_second
- concurrent_connections
target: "Handle 10,000 RPS"
errors:
description: "Rate of failed requests"
metrics:
- error_rate
- 5xx_rate
target: "< 0.1% error rate"
saturation:
description: "How full the service is"
metrics:
- cpu_utilization
- memory_utilization
- disk_utilization
- connection_pool_utilization
target: "< 70% utilization"
Conclusion
Site Reliability Engineering provides a framework for building and maintaining reliable systems at scale. By implementing SLOs, error budgets, and automation, you can balance innovation with reliability while maintaining high service quality.
Key Takeaways
✅ Define clear SLOs based on user experience
✅ Use error budgets to balance reliability and innovation
✅ Automate toil to free up engineering time
✅ Practice blameless postmortems to learn from failures
✅ Monitor the right metrics - SLIs that matter to users
✅ Build for failure - expect and plan for incidents
✅ Continuous improvement - iterate on processes
How do you implement SRE in your organization? Share your experiences!