On-Call Best Practices
The Philosophy of On-Call
Being on-call means you're the first responder when systems fail. Your job is to:
- Respond quickly to pages and alerts
- Stabilize the service and restore it to normal operation
- Communicate clearly with stakeholders
- Document incidents for future learning
- Prevent recurrence through follow-up work
The Golden Rule: You should be able to hand off an incident to anyone on the team at any time with complete context.
On-Call Rotation Design
Rotation Patterns
1. Primary + Secondary
```yaml
rotation_pattern: "Primary + Secondary"
schedule:
  primary:
    - week_1: Alice
    - week_2: Bob
    - week_3: Charlie
  secondary:
    - week_1: Bob      # Backs up Alice
    - week_2: Charlie  # Backs up Bob
    - week_3: Alice    # Backs up Charlie
escalation:
  - level_1: Primary (0-5 min)
  - level_2: Secondary (5-10 min)
  - level_3: Team Lead (10-15 min)
  - level_4: Engineering Manager (15+ min)
```
**Pros:** Backup coverage, mentorship opportunities.
**Cons:** Two people need to be available.
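To make the escalation ladder concrete, here is a minimal Python sketch of the timing logic (role names and the 5/10/15-minute thresholds come from the example above; in practice a paging tool such as PagerDuty evaluates this for you):

```python
from datetime import datetime, timezone

# Escalation ladder from the example above; roles and thresholds are illustrative.
ESCALATION_LADDER = [
    (5, "primary"),
    (10, "secondary"),
    (15, "team_lead"),
    (float("inf"), "engineering_manager"),
]

def who_to_page(alert_fired_at, now=None):
    """Return the role responsible for an alert that is still unacknowledged."""
    now = now or datetime.now(timezone.utc)
    minutes_elapsed = (now - alert_fired_at).total_seconds() / 60
    for limit_minutes, role in ESCALATION_LADDER:
        if minutes_elapsed < limit_minutes:
            return role
    return ESCALATION_LADDER[-1][1]

# Example: 12 minutes without an ack lands on the team lead.
fired = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
print(who_to_page(fired, now=datetime(2024, 1, 15, 3, 12, tzinfo=timezone.utc)))  # team_lead
```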
2. Follow-the-Sun
```yaml
rotation_pattern: "Follow-the-Sun"
schedule:
  # Handoff every 8 hours based on timezone
  00:00-08:00_UTC: APAC_team      # Sydney, Singapore
  08:00-16:00_UTC: EMEA_team      # London, Berlin
  16:00-00:00_UTC: Americas_team  # SF, NYC
benefits:
  - No one gets paged at night
  - Fresh eyes on each shift
  - Better work-life balance
```
**Pros:** No nighttime pages, global coverage.
**Cons:** Requires a global team, handoff overhead.
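The handoff boundaries are simple enough to encode directly. A sketch, assuming the three team names from the schedule above, that maps a UTC timestamp to the owning team:

```python
from datetime import datetime, timezone

def team_on_shift(now=None):
    """Return the regional team that owns the current 8-hour UTC window."""
    now = now or datetime.now(timezone.utc)
    hour = now.astimezone(timezone.utc).hour
    if hour < 8:
        return "APAC_team"       # Sydney, Singapore
    if hour < 16:
        return "EMEA_team"       # London, Berlin
    return "Americas_team"       # SF, NYC

# 21:30 UTC falls in the Americas window.
print(team_on_shift(datetime(2024, 1, 15, 21, 30, tzinfo=timezone.utc)))  # Americas_team
```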
3. Quarterly Rotation
```yaml
rotation_pattern: "Quarterly with Blackout Dates"
q1_2024:
  jan: Alice
  feb: Bob
  mar: Charlie
blackout_dates:
  - Alice: ["2024-01-15 to 2024-01-20"]  # Vacation
  - Bob: ["2024-02-10 to 2024-02-12"]    # Conference
swap_process:
  - Request swap at least 1 week in advance
  - Find your own coverage
  - Update PagerDuty schedule
  - Notify team in #oncall channel
```
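Blackout dates are easy to violate when schedules are edited by hand, so a quick automated check helps. A minimal sketch, assuming shifts and blackouts are stored as plain date ranges (the data below mirrors the example schedule):

```python
from datetime import date

# Illustrative data mirroring the example schedule above.
SCHEDULE = {  # person -> (start, end) of their on-call month
    "Alice":   (date(2024, 1, 1), date(2024, 1, 31)),
    "Bob":     (date(2024, 2, 1), date(2024, 2, 29)),
    "Charlie": (date(2024, 3, 1), date(2024, 3, 31)),
}
BLACKOUTS = {
    "Alice": [(date(2024, 1, 15), date(2024, 1, 20))],  # Vacation
    "Bob":   [(date(2024, 2, 10), date(2024, 2, 12))],  # Conference
}

def find_conflicts(schedule, blackouts):
    """Return (person, blackout_range) pairs where a shift overlaps a blackout."""
    conflicts = []
    for person, (shift_start, shift_end) in schedule.items():
        for b_start, b_end in blackouts.get(person, []):
            if b_start <= shift_end and shift_start <= b_end:  # ranges overlap
                conflicts.append((person, (b_start, b_end)))
    return conflicts

print(find_conflicts(SCHEDULE, BLACKOUTS))
```

Run against the example data, it flags both Alice's and Bob's blackouts, which is exactly where the swap process above comes in.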
Rotation Best Practices
✅ Do's:
- Rotate regularly (weekly or bi-weekly)
- Give at least 48 hours notice before a shift starts
- Allow schedule swaps with proper notification
- Honor blackout dates (vacation, conferences, personal)
- Ensure at least 2 people know how to handle critical issues

❌ Don'ts:
- Don't have the same person on-call for more than 2 weeks straight
- Don't schedule new team members alone for their first rotation
- Don't rotate too frequently (daily rotations are exhausting)
- Don't forget to account for timezones
- Don't make on-call mandatory during major life events
On-Call Responsibilities
During Your Shift
☐ Acknowledge pages within 5 minutes
☐ Respond to incidents following incident response process
☐ Keep #oncall channel updated
☐ Escalate if you need help
☐ Document all actions taken
☐ Hand off active incidents at shift change
☐ Update runbooks with any new learnings
Before Your Shift
☐ Review recent incidents and ongoing issues
☐ Check for scheduled maintenance or deployments
☐ Test your alerting (phone, SMS, app)
☐ Ensure you have access to all necessary systems
☐ Review runbooks and escalation procedures
☐ Know who your backup is
☐ Be in a location with internet access
After Your Shift
☐ Hand off any ongoing incidents
☐ Document unresolved issues
☐ Update team on shift summary
☐ File tickets for follow-up work
☐ Update runbooks if gaps were found
☐ Provide feedback on alerts (too noisy? actionable?)
Alert Quality
The Anatomy of a Good Alert
```yaml
alert: DatabaseConnectionPoolNearExhaustion
severity: warning
description: |
  Database connection pool is {{ $value }}% full on {{ $labels.instance }}.
  This may lead to connection errors if the pool becomes exhausted.
impact: |
  - Users may experience "cannot connect to database" errors
  - New requests may be rejected
  - Application may become unresponsive
runbook: https://wiki.company.com/runbooks/db-connection-pool
dashboard: https://grafana.company.com/d/database-health
thresholds:
  warning: 80%   # Alert but don't page
  critical: 95%  # Page on-call
suggested_actions:
  - Check for connection leaks in application logs
  - Review slow query log for blocking queries
  - Consider temporarily increasing pool size
  - Identify and kill long-running transactions if safe
escalation:
  - Primary: 0-5 minutes
  - Secondary: 5-10 minutes
  - Database team: 10-15 minutes
```
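Because every alert should carry a runbook, dashboard, impact statement, and thresholds, it is worth linting alert definitions automatically. A small sketch, assuming alerts are loaded as dictionaries with the field names used in the example above (not any particular tool's schema):

```python
REQUIRED_FIELDS = ["severity", "description", "impact", "runbook", "dashboard", "thresholds"]

def lint_alert(alert: dict) -> list:
    """Return a list of problems with an alert definition (empty list = OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not alert.get(field):
            problems.append(f"missing {field}")
    runbook = alert.get("runbook", "")
    if runbook and not runbook.startswith("https://"):
        problems.append("runbook is not a link")
    return problems

# Example: an alert without a runbook or impact statement fails the lint.
print(lint_alert({"alert": "HighCPU", "severity": "warning", "thresholds": {"warning": "80%"}}))
# ['missing description', 'missing impact', 'missing runbook', 'missing dashboard']
```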
Alert Fatigue Prevention
Symptom: Too Many Alerts
```python
# Calculate alert noise
def analyze_alert_quality(alerts_last_30_days):
    """Identify problematic alerts.

    Each alert object is expected to expose resolved_without_action(),
    was_false_positive(), is_duplicate_of_recent(), and flapping_count().
    """
    alerts_analysis = {
        "total_alerts": len(alerts_last_30_days),
        "false_positives": 0,
        "no_action_taken": 0,
        "duplicates": 0,
        "flapping": 0,
    }

    for alert in alerts_last_30_days:
        # Alert that resolved itself without action
        if alert.resolved_without_action():
            alerts_analysis["no_action_taken"] += 1
        # Alert fired but was incorrect
        if alert.was_false_positive():
            alerts_analysis["false_positives"] += 1
        # Alert fired multiple times for same issue
        if alert.is_duplicate_of_recent():
            alerts_analysis["duplicates"] += 1
        # Alert flapping (firing and resolving repeatedly)
        if alert.flapping_count() > 3:
            alerts_analysis["flapping"] += 1

    # Calculate noise percentage
    noise = (
        alerts_analysis["false_positives"]
        + alerts_analysis["no_action_taken"]
        + alerts_analysis["duplicates"]
        + alerts_analysis["flapping"]
    )
    alerts_analysis["noise_percentage"] = (
        noise / alerts_analysis["total_alerts"] * 100
        if alerts_analysis["total_alerts"]
        else 0.0
    )

    # Target: <10% noise
    alerts_analysis["quality_score"] = 100 - alerts_analysis["noise_percentage"]
    return alerts_analysis


# Example output:
# {
#     "total_alerts": 450,
#     "false_positives": 25,
#     "no_action_taken": 60,
#     "duplicates": 30,
#     "flapping": 15,
#     "noise_percentage": 28.9,
#     "quality_score": 71.1,
# }
# Action: Need to improve alerts - too much noise!
```
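The function above duck-types its input: each alert object is assumed to expose resolved_without_action(), was_false_positive(), is_duplicate_of_recent(), and flapping_count(), none of which come from a real monitoring SDK. A hypothetical stand-in to exercise it:

```python
from dataclasses import dataclass

@dataclass
class FakeAlert:
    """Minimal stand-in for whatever your alerting system exports."""
    no_action: bool = False
    false_positive: bool = False
    duplicate: bool = False
    flaps: int = 0

    def resolved_without_action(self): return self.no_action
    def was_false_positive(self): return self.false_positive
    def is_duplicate_of_recent(self): return self.duplicate
    def flapping_count(self): return self.flaps

alerts = [FakeAlert(false_positive=True), FakeAlert(no_action=True), FakeAlert(), FakeAlert(flaps=5)]
print(analyze_alert_quality(alerts))
# {'total_alerts': 4, 'false_positives': 1, 'no_action_taken': 1, 'duplicates': 0,
#  'flapping': 1, 'noise_percentage': 75.0, 'quality_score': 25.0}
```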
Solutions for Alert Fatigue
1. Adjust Thresholds
```yaml
# Too sensitive (pages every night)
alert: HighCPU
expr: cpu_usage > 50%

# Better (only alerts when actually problematic)
alert: HighCPU
expr: cpu_usage > 80%
for: 15m  # Must be sustained
```
2. Use Alert Grouping
```yaml
# Before: 10 alerts for 10 unhealthy instances
alert: InstanceUnhealthy
for_each: instance

# After: 1 alert for cluster degradation
alert: ClusterDegraded
expr: (unhealthy_instances / total_instances) > 0.2
```
3. Add "for" Clause
```yaml
# Prevents flapping alerts
alert: HighErrorRate
expr: error_rate > 5%
for: 10m  # Must be true for 10 minutes
```
4. Time-Based Muting
```yaml
# Don't page for expected batch job load
alert: HighDatabaseLoad
expr: db_load > 80%
mute_windows:
  - start: "02:00"
    end: "04:00"
    days: ["Mon", "Wed", "Fri"]
    reason: "Nightly ETL job"
```
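If your alerting tool does not support mute windows natively, the check is easy to express yourself. A sketch, with window fields mirroring the YAML above (times assumed to be in the alert's local timezone):

```python
from datetime import datetime, time

MUTE_WINDOWS = [
    {"start": time(2, 0), "end": time(4, 0), "days": {"Mon", "Wed", "Fri"}, "reason": "Nightly ETL job"},
]

def is_muted(now: datetime, windows=MUTE_WINDOWS):
    """Return the reason string if the alert should be muted right now, else None."""
    day = now.strftime("%a")  # e.g. "Mon"
    for w in windows:
        if day in w["days"] and w["start"] <= now.time() < w["end"]:
            return w["reason"]
    return None

# 03:00 on a Wednesday falls inside the ETL window.
print(is_muted(datetime(2024, 1, 17, 3, 0)))  # "Nightly ETL job"
```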
On-Call Handbook
Quick Response Guide
```
┌─────────────────────────────────────────────────────┐
│               ON-CALL QUICK RESPONSE                 │
├─────────────────────────────────────────────────────┤
│ STEP 1: ACKNOWLEDGE (within 5 minutes)              │
│ - Ack the alert in PagerDuty                        │
│ - Check #incidents channel                          │
│ - Open monitoring dashboards                        │
│                                                     │
│ STEP 2: ASSESS SEVERITY                             │
│ - How many users affected?                          │
│ - Is data at risk?                                  │
│ - Are we violating SLOs/SLAs?                       │
│ - Determine SEV level (1-5)                         │
│                                                     │
│ STEP 3: COMMUNICATE                                 │
│ - Create #incident-YYYY-MM-DD-description           │
│ - Post initial status update                        │
│ - Update status page if user-facing                 │
│ - Escalate if needed                                │
│                                                     │
│ STEP 4: INVESTIGATE                                 │
│ - Check runbook for this alert                      │
│ - Review recent changes/deployments                 │
│ - Examine logs and metrics                          │
│ - Test hypothesis                                   │
│                                                     │
│ STEP 5: MITIGATE                                    │
│ - Apply fix (rollback, scale, restart, etc.)        │
│ - Monitor for improvement                           │
│ - Don't hesitate to escalate                        │
│                                                     │
│ STEP 6: DOCUMENT                                    │
│ - Timeline of events                                │
│ - Actions taken                                     │
│ - What worked, what didn't                          │
│ - Follow-up tasks                                   │
│                                                     │
│ REMEMBER: Ask for help early and often!             │
└─────────────────────────────────────────────────────┘
```
Common On-Call Scenarios
Scenario 1: Alert Fires, Service Seems Fine
**What happened:**
Alert fired for high error rate, but dashboard shows everything normal.
**What to do:**
1. ✅ Don't immediately dismiss - investigate
2. ✅ Check if alert is looking at correct metrics
3. ✅ Look for intermittent issues (may have self-resolved)
4. ✅ Check if alert threshold is too sensitive
5. ✅ Document findings
6. ✅ Create ticket to fix or remove alert
**What NOT to do:**
❌ Silence alert without investigation
❌ Assume alert is broken without verification
Scenario 2: Multiple Alerts Firing
**What happened:**
10 different alerts firing simultaneously.
**What to do:**
1. ✅ Find the root cause alert (others are likely symptoms)
2. ✅ Check infrastructure first (network, cloud provider)
3. ✅ Look for recent changes (deployment, config change)
4. ✅ Focus on user-impacting issues first
5. ✅ Escalate if overwhelmed
**Priority order:**
1. User-facing services
2. Data integrity issues
3. Internal tools
4. Infrastructure alerts
Scenario 3: Incident During Off-Hours
**What happened:**
Paged at 3 AM for production issue.
**What to do:**
1. ✅ Acknowledge page within 5 minutes
2. ✅ Assess if it needs immediate action or can wait
3. ✅ If immediate: follow incident response process
4. ✅ If can wait: document assessment, schedule for morning
5. ✅ Communicate decision in #oncall channel
**Mental health check:**
- You're not expected to be at 100% at 3 AM
- It's OK to escalate if you're not confident
- It's OK to bring in backup if needed
Scenario 4: Alert You Don't Understand
**What happened:**
Alert fired for a system you're not familiar with.
**What to do:**
1. ✅ Check runbook (should explain alert)
2. ✅ Look for recent incidents with same alert
3. ✅ Post in #oncall: "Need help with [alert_name]"
4. ✅ Escalate to service owner
5. ✅ Document that runbook needs improvement
**What NOT to do:**
❌ Ignore the alert
❌ Try random fixes without understanding
❌ Suffer in silence
On-Call Tools
Essential Toolkit
**Alerting/Paging:**
- PagerDuty / Opsgenie / VictorOps
- Mobile app + SMS + phone call backup
- Test alerts weekly
**Monitoring:**
- Grafana / Datadog / New Relic dashboards
- Mobile app for on-the-go viewing
- Pre-built dashboard links in runbooks
**Communication:**
- Slack / Teams for incident channels
- Zoom / Google Meet for war rooms
- Status page for customer updates
**Documentation:**
- Runbooks (searchable, up-to-date)
- Incident response templates
- Architecture diagrams
**Access:**
- VPN (if needed)
- AWS/GCP/Azure console access
- SSH keys / bastions
- Database query tools
- Feature flag controls
On-Call Laptop Setup
```bash
# Quick setup script for on-call laptop
cat > ~/.oncall_setup.sh << 'EOF'
#!/bin/bash
# Essential tools for on-call
brew install --cask \
  slack \
  pagerduty \
  zoom \
  iterm2

brew install \
  awscli \
  kubectl \
  postgresql \
  jq \
  httpie

# SSH keys
if [ ! -f ~/.ssh/id_rsa ]; then
  echo "⚠️  Set up SSH keys for production access"
fi

# VPN
if [ ! -d "/Applications/Cisco AnyConnect.app" ]; then
  echo "⚠️  Install VPN client"
fi

# Test alerts
echo "Testing PagerDuty alerts..."
# (Add PagerDuty test alert command)

echo "✅ On-call setup complete!"
EOF
chmod +x ~/.oncall_setup.sh
```
On-Call Compensation and Well-being
Fair Compensation
```yaml
compensation_models:
  model_1_stipend:
    description: "Fixed payment for being on-call"
    example:
      - on_call_week: $500
      - per_incident_response: $0
    pros: "Predictable, fair even if quiet week"
    cons: "Doesn't account for incident volume"

  model_2_hourly:
    description: "Paid for hours worked on incidents"
    example:
      - on_call_availability: $100/week
      - incident_response: $75/hour
    pros: "Pay matches work done"
    cons: "Can incentivize slow resolution"

  model_3_time_off:
    description: "Comp time for on-call work"
    example:
      - quiet_week: 0.5 days comp time
      - busy_week: 2 days comp time
    pros: "Flexible, promotes recovery"
    cons: "Requires tracking, approval process"

  model_4_hybrid:
    description: "Base stipend + incident hours"
    example:
      - on_call_week: $300
      - after_hours_incident: $100/hour
      - weekend_incident: $150/hour
    pros: "Balances availability and incident work"
    cons: "More complex to administer"
```
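To see how the hybrid model plays out over a week, a quick sketch using the example rates above (the rates and the function are illustrative, not a payroll formula):

```python
def hybrid_week_pay(after_hours_incident_hours, weekend_incident_hours,
                    base=300, after_hours_rate=100, weekend_rate=150):
    """Compute one week of on-call pay under the hybrid model above."""
    return (base
            + after_hours_incident_hours * after_hours_rate
            + weekend_incident_hours * weekend_rate)

# A quiet week vs. a week with a 2h weeknight incident and a 3h weekend incident.
print(hybrid_week_pay(0, 0))  # 300
print(hybrid_week_pay(2, 3))  # 300 + 200 + 450 = 950
```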
Burnout Prevention
Warning Signs of Burnout
🚨 Warning Signs:
Individual level:
- Dreading on-call rotation
- Anxiety when phone buzzes
- Trouble sleeping during on-call week
- Resentment toward job/team
- Decreased quality of incident response
Team level:
- High turnover
- Frequent alerts being ignored
- Decreased code quality
- Increased incident severity/frequency
- Team members calling in sick during on-call
Burnout Prevention Strategies
1. Limit On-Call Frequency
```yaml
max_oncall_frequency:
  - rule: "No more than 1 week in 6"
  - rule: "At least 4 weeks between rotations if possible"
  - rule: "New parents: reduced rotation for 3 months"
  - rule: "After major incident: 2 week break from on-call"
```
2. Alert Quality Program
Monthly alert review:
☐ Review all alerts that fired
☐ Identify noisy/low-value alerts
☐ Tune or remove the 10% noisiest alerts (see the sketch after this list)
☐ Ensure all alerts have runbooks
☐ Track alert-to-incident ratio
Target: >80% of alerts require action
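The "tune or remove the noisiest alerts" step can be driven straight from alert history. A sketch, assuming you can export fired alerts as (alert_name, was_actionable) pairs from your alerting tool:

```python
from collections import Counter

def noisiest_alerts(fired_alerts, top_fraction=0.10):
    """Return the most frequently firing alert names (the top fraction of distinct alerts)."""
    counts = Counter(name for name, _ in fired_alerts)
    top_n = max(1, int(len(counts) * top_fraction))
    return counts.most_common(top_n)

def actionable_percentage(fired_alerts):
    """Percentage of fired alerts that actually required action."""
    actionable = sum(1 for _, was_actionable in fired_alerts if was_actionable)
    return 100 * actionable / len(fired_alerts) if fired_alerts else 0.0

# Illustrative month of history.
history = [("CacheMisses", False)] * 60 + [("HighErrorRate", True)] * 10 + [("DiskFull", True)] * 5
print(noisiest_alerts(history))        # [('CacheMisses', 60)]
print(actionable_percentage(history))  # 20.0 -- well below the >80% target
```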
3. Mandatory Time Off After Incidents
```yaml
incident_recovery:
  sev1_incident:
    if_duration: "> 4 hours"
    then: "Next day off"
  weekend_incidents:
    if_duration: "> 2 hours"
    then: "Comp day Monday"
  multi_day_incident:
    if_duration: "> 24 hours"
    then: "Week off after resolution"
```
4. On-Call Buddy System
Pair an experienced engineer with a less experienced one:
- Buddy reviews on-call person's responses
- Quick Slack for "does this make sense?"
- Reduces stress of being alone
- Knowledge transfer happens naturally
On-Call Metrics
Track These Metrics
```yaml
oncall_metrics:
  alerting_metrics:
    - total_alerts_per_week: "How many alerts fired?"
    - after_hours_alerts: "How many outside business hours?"
    - actionable_alert_percentage: "% that required action"
    - false_positive_rate: "% that were false alarms"
    - time_to_acknowledge: "How fast did on-call respond?"
  incident_metrics:
    - incidents_per_week: "How many incidents occurred?"
    - incident_duration: "How long to resolve?"
    - escalation_rate: "% that required escalation"
    - weekend_incident_rate: "Incidents during weekends"
  human_metrics:
    - sleep_disruption: "Alerts during sleep hours (11pm-7am)"
    - on_call_satisfaction: "Survey: 1-10 rating"
    - burnout_indicators: "Sick days during/after on-call"
    - rotation_fairness: "Even distribution across team?"
```
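Several of these metrics fall out of the raw page events. A sketch for two of them, sleep disruption and time to acknowledge, assuming each page is a dict with fired_at and acknowledged_at timestamps (hypothetical field names):

```python
from datetime import datetime

def sleep_disruptions(pages, start_hour=23, end_hour=7):
    """Count pages that fired during sleep hours (11pm-7am local time)."""
    return sum(1 for p in pages
               if p["fired_at"].hour >= start_hour or p["fired_at"].hour < end_hour)

def median_time_to_ack_minutes(pages):
    """Median minutes from page firing to acknowledgement."""
    deltas = sorted((p["acknowledged_at"] - p["fired_at"]).total_seconds() / 60 for p in pages)
    mid = len(deltas) // 2
    return deltas[mid] if len(deltas) % 2 else (deltas[mid - 1] + deltas[mid]) / 2

pages = [
    {"fired_at": datetime(2024, 1, 15, 2, 30), "acknowledged_at": datetime(2024, 1, 15, 2, 34)},
    {"fired_at": datetime(2024, 1, 15, 14, 0), "acknowledged_at": datetime(2024, 1, 15, 14, 2)},
]
print(sleep_disruptions(pages))           # 1
print(median_time_to_ack_minutes(pages))  # 3.0
```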
Example Dashboard
```
┌─────────────────────────────────────────────────────┐
│         ON-CALL HEALTH DASHBOARD - Q1 2024          │
├─────────────────────────────────────────────────────┤
│ ALERT QUALITY: Score: 73%                           │
│ Total alerts: 420                                   │
│ Actionable: 306 (73%) [████████░░]                  │
│ False positives: 84 (20%) [████░░░░░░]              │
│ Duplicate: 30 (7%) [██░░░░░░░░]                     │
│                                                     │
│ INCIDENT LOAD:                                      │
│ Incidents/week: 3.2 [████░░░░░░]                    │
│ Avg duration: 45 min [█████░░░░░]                   │
│ After-hours: 35% [█████████░]                       │
│                                                     │
│ TEAM HEALTH:                                        │
│ On-call satisfaction: 7.2/10 [███████░░░]           │
│ Sleep disruptions/week: 1.8 [████████░░]            │
│ Escalation rate: 12% [███░░░░░░░]                   │
│                                                     │
│ 🎯 GOALS:                                           │
│ ☑ Reduce false positives to <10%                    │
│ ☑ Reduce after-hours incidents to <20%              │
│ ☐ Increase satisfaction to 8.0/10                   │
│ ☐ Reduce sleep disruptions to <1/week               │
└─────────────────────────────────────────────────────┘
```
On-Call Handoff Process
Shift Start Handoff
## On-Call Handoff Template
**From:** Alice
**To:** Bob
**Date:** 2024-01-15
**Time:** 9:00 AM
### Ongoing Incidents
- None currently active
### Ongoing Issues (Not Incidents)
1. **Elevated API latency on EU region**
- Status: Monitoring
- Impact: p95 latency 250ms (normally 150ms)
- Ticket: PROD-1234
- Context: Gradual increase over 3 days, investigating
- Action: Monitor, escalate if > 400ms
### Recent Incidents (Last 7 Days)
1. **Database Connection Pool Exhaustion - SEV-2**
- When: Jan 12, 2:30 AM
- Duration: 45 minutes
- Resolution: Increased pool size from 100→150
- Follow-up: PROD-1230 (investigate connection leaks)
### Upcoming Maintenance/Deployments
1. **Kubernetes upgrade - Jan 16, 2 PM**
- Runbook: https://wiki.company.com/k8s-upgrade
- Rollback plan: documented in runbook
- On-call should monitor during upgrade
### Known Issues/Workarounds
1. **Monitoring alert "CacheMisses" is noisy**
- Fires 2-3x per day
- Safe to ack if cache hit rate > 90%
- Fix in progress: PROD-1235
### Important Context
- Holiday weekend coming up (Jan 20-21)
- New feature flag "beta-checkout-flow" rolled out to 5% yesterday
- Database backup window: Every day 2-4 AM UTC
### Questions?
Reachable on Slack @alice until 6 PM today.
Key Takeaways
- On-call is a team sport - Don't suffer alone, escalate early
- Quality > Quantity - Fix noisy alerts, reduce false positives
- Runbooks are essential - Every alert should have clear instructions
- Compensate fairly - Being on-call has real cost
- Prevent burnout - Monitor team health, rotate regularly
- Good handoffs - Context transfer is critical
- Continuous improvement - Learn from every incident
- It's OK to not know - Escalate when uncertain
- Recovery > diagnosis - During incidents, restore service first
- Take care of yourself - Sleep, eat, take breaks
Remember: The best on-call shift is one where nothing happens because systems are reliable and alerts are well-tuned. Work toward that goal.