Toil Reduction and Automation
What is Toil?
Google's Definition:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Characteristics of Toil
| Characteristic | What it means | Example |
|---|---|---|
| Manual | Requires human intervention | Manually restarting crashed services |
| Repetitive | Done over and over | Weekly log cleanup |
| Automatable | Could be done by a machine | Deploying the same service 10 times |
| Tactical | Interrupt-driven, reactive | Responding to the same alert repeatedly |
| No enduring value | Leaves the service in the same state afterwards | Temporary workarounds |
| Scales linearly | More load = more work | Manual user provisioning |
What is NOT Toil?
✅ NOT toil:
- Writing code
- Engineering/design work
- Strategic planning
- Improving automation
- Learning new skills
- One-time project work
- Incident response (when each incident requires novel investigation)
❌ IS toil:
- Manually running the same script every day
- Copy-pasting config files
- Clicking through UI wizards repeatedly
- Manually scaling resources
- Running the same SQL queries
- Manual log analysis for known patterns
Measuring Toil
Track Your Time
# Example: Weekly toil tracking
toil_categories = {
"manual_deployments": 4.5, # hours
"log_analysis": 3.0,
"user_provisioning": 2.5,
"alert_ack_without_action": 2.0,
"manual_scaling": 1.5,
"password_resets": 1.0,
"config_updates": 1.5,
}
total_toil = sum(toil_categories.values()) # 16 hours
total_work_hours = 40
toil_percentage = (total_toil / total_work_hours) * 100 # 40%
# Google's target: Keep toil < 50%
# Elite SRE teams: Keep toil < 30%
Toil Metrics
metrics_to_track:
- toil_hours_per_week: "Total hours spent on toil"
- toil_percentage: "Toil as % of total work time"
- toil_by_category: "Where is toil concentrated?"
- toil_trend: "Is toil increasing or decreasing?"
- automation_roi: "Hours saved by automation projects"
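These can be derived directly from the tracking data; a minimal sketch that reuses the toil_categories and total_work_hours values from the weekly tracking example (the previous-week and automation-savings figures are hypothetical):
# Minimal sketch: deriving the metrics above from the weekly tracking data.
# last_week_toil and automation_savings are hypothetical example values.
last_week_toil = 18.0  # hours of toil the previous week
automation_savings = {"cert_renewal": 3.0, "self_service_onboarding": 2.0}

toil_hours_per_week = sum(toil_categories.values())
toil_percentage = toil_hours_per_week / total_work_hours * 100
toil_by_category = sorted(toil_categories.items(), key=lambda kv: kv[1], reverse=True)
toil_trend = toil_hours_per_week - last_week_toil   # negative = improving
automation_roi = sum(automation_savings.values())   # hours saved per week

print(f"Toil: {toil_hours_per_week:.1f} hrs/week ({toil_percentage:.0f}%), "
      f"trend {toil_trend:+.1f} hrs, automation saving {automation_roi:.1f} hrs/week")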
Toil Dashboard
┌─────────────────────────────────────────────────────┐
│ TOIL DASHBOARD - Q1 2024 │
├─────────────────────────────────────────────────────┤
│ Current Toil: 35% (14 hrs/week) Target: <30% │
│ Trend: ↓ Down 5% from last quarter │
│ │
│ TOP TOIL SOURCES: │
│ 1. Manual deployments 4.5 hrs [█████████ ]│
│ 2. Log analysis 3.0 hrs [██████ ]│
│ 3. User provisioning 2.5 hrs [█████ ]│
│ 4. Alert acknowledgment 2.0 hrs [████ ]│
│ 5. Manual scaling 1.5 hrs [███ ]│
│ │
│ AUTOMATION IMPACT: │
│ - Certificate renewal automation: -3 hrs/week │
│ - Self-service user onboarding: -2 hrs/week │
│ - Auto-scaling implementation: -1.5 hrs/week │
│ │
│ IN PROGRESS: │
│ - Automated deployment pipeline (est. -4 hrs/week) │
│ - Structured logging + dashboards (est. -2 hrs/week)│
└─────────────────────────────────────────────────────┘
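A dashboard like this can start as a simple script; a minimal sketch that renders the top toil sources as a text bar chart from the toil_categories dict above:
def render_toil_bars(categories, width=20):
    """Render toil sources as a simple text bar chart, largest first."""
    ranked = sorted(categories.items(), key=lambda kv: kv[1], reverse=True)
    max_hours = ranked[0][1]
    for name, hours in ranked:
        bar = "█" * int(round(hours / max_hours * width))
        print(f"{name:<28} {hours:>4.1f} hrs [{bar:<{width}}]")

render_toil_bars(toil_categories)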
Automation Strategy
Automation Hierarchy
Level 0: No automation (100% manual)
↓
Level 1: Documentation (runbook exists)
↓
Level 2: Semi-automated (script exists, run manually)
↓
Level 3: Automated execution (triggered manually)
↓
Level 4: Fully automated (triggered automatically)
↓
Level 5: Self-healing (detects issues and auto-remediates)
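To make the hierarchy actionable, record the current level for each recurring task and work on the lowest-level, highest-toil items first; an illustrative sketch (the task inventory is hypothetical):
from enum import IntEnum

class AutomationLevel(IntEnum):
    MANUAL = 0           # 100% manual
    DOCUMENTED = 1       # runbook exists
    SEMI_AUTOMATED = 2   # script exists, run manually
    AUTOMATED = 3        # automated execution, triggered manually
    FULLY_AUTOMATED = 4  # triggered automatically
    SELF_HEALING = 5     # detects issues and auto-remediates

# Hypothetical inventory of where each recurring task sits today
task_levels = {
    "certificate_renewal": AutomationLevel.MANUAL,
    "deployments": AutomationLevel.SEMI_AUTOMATED,
    "scaling": AutomationLevel.FULLY_AUTOMATED,
}

for task, level in sorted(task_levels.items(), key=lambda kv: kv[1]):
    print(f"{task:<22} level {int(level)}: {level.name}")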
Automation ROI Calculator
def calculate_automation_roi(
task_frequency_per_week,
time_per_task_minutes,
time_to_automate_hours,
maintenance_hours_per_month=0
):
"""
Calculate ROI for automation project
"""
# Weekly time saved
weekly_savings_hours = (
task_frequency_per_week * time_per_task_minutes
) / 60
# Monthly time saved
monthly_savings = weekly_savings_hours * 4
# Annual time saved
annual_savings = monthly_savings * 12
# Account for maintenance
annual_maintenance = maintenance_hours_per_month * 12
net_annual_savings = annual_savings - annual_maintenance
# Break-even point
break_even_weeks = time_to_automate_hours / weekly_savings_hours
# 2-year ROI
two_year_roi = (net_annual_savings * 2) - time_to_automate_hours
return {
"weekly_savings_hours": weekly_savings_hours,
"annual_savings_hours": net_annual_savings,
"break_even_weeks": break_even_weeks,
"two_year_roi_hours": two_year_roi,
"automate": two_year_roi > 0
}
# Example: Should we automate certificate renewal?
result = calculate_automation_roi(
task_frequency_per_week=5, # 5 times per week
time_per_task_minutes=30, # 30 min each time
time_to_automate_hours=16, # Will take 2 days to automate
maintenance_hours_per_month=1 # 1 hour/month to maintain
)
print(f"Break even in: {result['break_even_weeks']:.1f} weeks")
print(f"2-year ROI: {result['two_year_roi_hours']:.0f} hours saved")
print(f"Recommendation: {'AUTOMATE' if result['automate'] else 'KEEP MANUAL'}")
# Output:
# Break even in: 6.4 weeks
# 2-year ROI: 200 hours saved
# Recommendation: AUTOMATE
Prioritizing Automation Projects
automation_candidates = [
{
"task": "Certificate renewal",
"frequency": "5x/week",
"time_per_task": 30, # minutes
"pain_level": 8, # 1-10 scale
"error_prone": 9, # 1-10 scale
"time_to_automate": 16 # hours
},
{
"task": "Database backup verification",
"frequency": "1x/day",
"time_per_task": 15,
"pain_level": 4,
"error_prone": 7,
"time_to_automate": 8
},
# ... more tasks
]
def prioritize_automation(tasks):
"""Score each task for automation priority"""
scored_tasks = []
for task in tasks:
        # Calculate weekly time spent (frequency is "<N>x/week" or "<N>x/day")
        count = int(task["frequency"].split("x/")[0])
        per_week = count if "week" in task["frequency"] else count * 7
        weekly_time = per_week * task["time_per_task"] / 60
# Composite score
score = (
weekly_time * 2 + # Time savings (2x weight)
task["pain_level"] + # How annoying is it?
task["error_prone"] + # How risky is manual?
-task["time_to_automate"]/10 # Implementation cost
)
scored_tasks.append({
**task,
"score": score,
"weekly_hours": weekly_time
})
# Sort by score
return sorted(scored_tasks, key=lambda x: x["score"], reverse=True)
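Running the scorer over the candidate list gives a ranked automation backlog; a quick usage sketch:
for task in prioritize_automation(automation_candidates):
    print(f"{task['task']:<32} score={task['score']:5.1f} "
          f"({task['weekly_hours']:.1f} hrs/week of toil)")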
Automation Best Practices
Start Small, Iterate
❌ Don't:
# Try to build the perfect automation from day 1
def fully_automated_deployment_with_ai_and_ml_and_blockchain():
# 6 months of development
# Never actually finishes
# Meanwhile, still doing manual deployments
pass
✅ Do:
# Week 1: Basic script
def deploy_v1():
"""Simple script that automates the deployment steps"""
run("git pull")
run("./build.sh")
run("./deploy.sh")
# Week 2: Add safety checks
def deploy_v2():
ensure_tests_pass()
ensure_on_correct_branch()
deploy_v1()
verify_health_check()
# Week 3: Add rollback capability
def deploy_v3():
previous_version = get_current_version()
try:
deploy_v2()
except DeploymentError:
rollback(previous_version)
# Month 2: CI/CD integration
# Month 3: Canary deployments
# Month 4: Automated rollback on metrics
Make Automation Observable
# Bad: Silent automation
def cleanup_old_logs():
os.system("rm -rf /var/log/*.old")
# Good: Observable automation
def cleanup_old_logs():
"""Clean old logs and track metrics"""
start_time = time.time()
log_dir = "/var/log"
# Find old logs
old_logs = find_files(log_dir, pattern="*.old")
# Track what we're doing
logger.info(f"Found {len(old_logs)} old log files to delete")
# Delete and track
deleted_count = 0
deleted_bytes = 0
for log_file in old_logs:
size = os.path.getsize(log_file)
os.remove(log_file)
deleted_count += 1
deleted_bytes += size
duration = time.time() - start_time
# Emit metrics
metrics.gauge("log_cleanup.files_deleted", deleted_count)
metrics.gauge("log_cleanup.bytes_freed", deleted_bytes)
metrics.timing("log_cleanup.duration_seconds", duration)
# Log results
logger.info(
f"Cleanup complete: {deleted_count} files, "
f"{deleted_bytes/1024/1024:.2f} MB freed in {duration:.2f}s"
)
return {
"files_deleted": deleted_count,
"bytes_freed": deleted_bytes,
"duration_seconds": duration
}
Build Safety into Automation
# Automation safety checklist
class SafeAutomation:
"""Framework for safe automation"""
def __init__(self, operation_name):
self.operation = operation_name
self.dry_run = os.getenv("DRY_RUN", "false") == "true"
def run(self, action, **kwargs):
"""Run action with safety checks"""
# 1. Pre-flight checks
self.preflight_checks()
# 2. Rate limiting
self.check_rate_limit()
# 3. Dry run mode
if self.dry_run:
logger.info(f"DRY RUN: Would execute {action}")
return
# 4. Confirm destructive actions
if self.is_destructive(action):
self.require_confirmation()
# 5. Create backup/snapshot before destructive operations
if self.is_destructive(action):
self.create_backup()
# 6. Execute with timeout
try:
result = self.execute_with_timeout(action, **kwargs)
except TimeoutError:
logger.error(f"{action} timed out")
self.handle_timeout()
raise
# 7. Verify success
self.verify_result(result)
# 8. Log for audit trail
self.audit_log(action, result)
return result
def is_destructive(self, action):
"""Check if action is destructive"""
destructive_keywords = [
"delete", "remove", "drop", "truncate",
"destroy", "terminate", "kill"
]
return any(kw in action.lower() for kw in destructive_keywords)
def create_backup(self):
"""Create backup before destructive operation"""
timestamp = datetime.now().isoformat()
backup_id = f"auto_backup_{timestamp}"
logger.info(f"Creating safety backup: {backup_id}")
# ... backup logic ...
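A usage sketch, assuming the remaining helper methods (preflight_checks, check_rate_limit, execute_with_timeout, and so on) are implemented, and a hypothetical delete_old_snapshots action:
# The action name drives the destructive-action checks in run()
automation = SafeAutomation("snapshot_cleanup")
automation.run("delete_old_snapshots", older_than_days=30)

# Always test in dry-run mode first:
#   DRY_RUN=true python snapshot_cleanup.py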
Idempotent Automation
# Bad: Not idempotent - running twice causes problems
def setup_user():
create_user("alice") # Fails if user exists
create_home_directory("/home/alice")
set_password("alice", "temp123")
# Good: Idempotent - safe to run multiple times
def setup_user_idempotent():
# Check if user exists
if not user_exists("alice"):
create_user("alice")
logger.info("Created user alice")
else:
logger.info("User alice already exists, skipping creation")
# Ensure home directory exists
ensure_directory_exists("/home/alice")
# Set password only if not already set
if not password_is_set("alice"):
set_password("alice", "temp123")
# Always ensure correct permissions (idempotent)
set_permissions("/home/alice", owner="alice", mode="0755")
logger.info("User alice setup complete")
Common Toil Reduction Patterns
1. Self-Service Portals
Problem: Team spends hours provisioning resources for developers
Solution: Build self-service portal
# Before: Manual provisioning (30 min per request, 10 requests/week)
# SRE receives ticket → creates database → configures access → notifies requester
# After: Self-service portal (0 min SRE time, unlimited requests)
@app.route('/api/provision/database', methods=['POST'])
def provision_database():
"""Developers provision their own databases"""
# Validate request
request_data = validate_request(request.json)
# Apply guardrails
if request_data['size_gb'] > MAX_DB_SIZE:
return error("Database too large, contact SRE")
# Provision automatically
database = create_database(
name=request_data['db_name'],
size_gb=request_data['size_gb'],
owner=request.user,
environment=request_data['environment']
)
# Auto-configure access
grant_access(database, user=request.user, role="admin")
# Notify
send_notification(
to=request.user,
subject=f"Database {database.name} is ready!",
connection_string=database.connection_string
)
return success(database)
# Result: 260 hours/year saved (10 req/week * 30 min * 52 weeks / 60)
2. Automated Remediation
Problem: The same alerts require the same manual fixes every time
Solution: Auto-remediate known issues
# Pattern: Alert → Auto-remediate → Notify
alert_remediations = {
"disk_space_low": auto_cleanup_old_logs,
"connection_pool_exhausted": restart_connection_pool,
"cache_thrashing": clear_cache,
"stuck_worker": restart_worker,
}
def handle_alert(alert):
"""Auto-remediate if known issue, else page human"""
remediation = alert_remediations.get(alert.type)
if remediation:
logger.info(f"Auto-remediating {alert.type}")
try:
remediation(alert)
# Verify fix worked
if alert.is_resolved():
notify_slack(
f"✅ Auto-resolved {alert.type} on {alert.host}"
)
return
except Exception as e:
logger.error(f"Auto-remediation failed: {e}")
# Fall through to paging
# Unknown issue or remediation failed - page human
page_oncall(alert)
3. ChatOps for Common Tasks
Problem: Common tasks require SSH'ing into servers and running commands by hand
Solution: Expose commands via Slack/chat
# Slack bot for common SRE tasks
@slack_bot.command('/restart-service')
def restart_service_command(service_name, environment):
"""
Usage: /restart-service user-api production
"""
# Verify permissions
if not user_can_restart(service_name, environment, user=slack.user):
return "❌ You don't have permission to restart that service"
# Safety check
if environment == "production":
require_confirmation(
f"Are you sure you want to restart {service_name} in PRODUCTION?"
)
# Execute
result = restart_service(service_name, environment)
# Report back
return f"✅ {service_name} restarted in {environment}. Health: {result.health}"
# Before: SSH to server, find process, restart, verify (5 min)
# After: Type command in Slack (30 seconds)
4. Infrastructure as Code
Problem: Manual infrastructure changes are slow and error-prone
Solution: Codify infrastructure
# Before: 30-step runbook to provision new environment
# After: terraform apply
resource "aws_instance" "web_server" {
count = var.server_count
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
tags = {
Name = "web-${count.index}"
Environment = var.environment
ManagedBy = "Terraform"
}
# Automatically configure on boot
user_data = templatefile("${path.module}/init-script.sh", {
environment = var.environment
})
}
# Result: Consistent, repeatable, version-controlled infrastructure
5. Automated Testing
Problem: Manual testing before deployments takes hours
Solution: Comprehensive automated testing
# CI/CD Pipeline
pipeline:
- stage: Unit Tests
script: npm test
duration: 2 min
- stage: Integration Tests
script: npm run test:integration
duration: 5 min
- stage: Security Scan
script: npm audit && docker scan
duration: 3 min
- stage: Deploy to Staging
script: ./deploy.sh staging
- stage: Smoke Tests
script: ./smoke-tests.sh staging
duration: 2 min
- stage: Deploy to Production
script: ./deploy.sh production
requires: manual_approval
# Before: 2 hours of manual testing per deployment
# After: 12 minutes automated, high confidence
Toil Reduction Workflow
1. IDENTIFY TOIL
↓
Track time spent on repetitive tasks
Categorize and measure
2. PRIORITIZE
↓
Calculate ROI
Consider: frequency, time per task, pain level, error-proneness
3. AUTOMATE
↓
Start simple
Build in safety and observability
Make it idempotent
4. MEASURE IMPACT
↓
Track time saved
Monitor automation reliability
Gather feedback
5. ITERATE
↓
Improve automation based on usage
Find new toil sources
Repeat
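For step 4, impact can be tracked per automation; a minimal sketch with hypothetical records:
# Hypothetical per-automation records: runs this quarter, failures, and the
# minutes of manual work each run replaces.
automation_impact = [
    {"name": "cert_renewal", "runs": 20, "failures": 1, "manual_minutes_replaced": 30},
    {"name": "user_onboarding", "runs": 12, "failures": 0, "manual_minutes_replaced": 25},
]

for a in automation_impact:
    hours_saved = a["runs"] * a["manual_minutes_replaced"] / 60
    reliability = (a["runs"] - a["failures"]) / a["runs"] * 100
    print(f"{a['name']:<18} saved {hours_saved:.1f} hrs, reliability {reliability:.0f}%")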
Key Takeaways
- Target: <50% toil (Google's recommendation, aim for <30%)
- Measure your toil - Track hours spent on repetitive tasks
- Calculate ROI - Not everything is worth automating
- Start small, iterate - Perfect is the enemy of done
- Make automation observable - Emit logs, metrics, and alerts
- Build in safety - Dry-run mode, confirmations, backups
- Idempotency - Safe to run multiple times
- Self-service > tickets - Empower users to help themselves
- Auto-remediate known issues - Don't wake humans for known problems
- Continuous improvement - Toil reduction is never "done"
Remember: Every hour spent reducing toil is an investment that pays dividends forever. The best time to automate was yesterday. The second-best time is now.