The On-Call Challenge
On-call is a critical part of running reliable production systems, but poorly designed on-call programs lead to burnout, attrition, and, ironically, more incidents.
A sustainable on-call program balances reliability requirements with engineer well-being.
Designing Fair Rotations
1. Rotation Length
The ideal rotation length depends on team size:
- Small teams (3-5): Weekly rotations
- Medium teams (6-10): Weekly rotations with secondary on-call
- Large teams (11+): Daily or 3-day rotations
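As a rough illustration of the guidance above, the sketch below maps team size to a shift length and assigns shifts round-robin. The function names are hypothetical, a team of exactly 10 is treated as medium, and large teams get 3-day shifts (daily would be the other option); all of these are assumptions rather than part of any particular scheduling tool.

```python
from datetime import date, timedelta


def rotation_length_days(team_size: int) -> int:
    """Map team size to a shift length in days, per the list above."""
    if team_size < 3:
        raise ValueError("the guidance above starts at teams of 3")
    if team_size <= 10:
        return 7  # small and medium teams: weekly rotations
    return 3      # large teams: 3-day rotations (assumed; daily also fits)


def build_schedule(engineers: list, start: date, shifts: int) -> list:
    """Assign shifts round-robin; returns (shift_start, shift_end, engineer)."""
    length = rotation_length_days(len(engineers))
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * length)
        shift_end = shift_start + timedelta(days=length)
        # Cycle through the team so each engineer takes every Nth shift.
        schedule.append((shift_start, shift_end, engineers[i % len(engineers)]))
    return schedule
```

With three engineers, the fourth shift wraps back to the first engineer, which makes the on-call frequency for small teams easy to see before committing to a rotation.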
2. Coverage Models
- Follow-the-sun: Distribute across time zones for 24/7 coverage without overnight pages
- Primary/secondary: Primary handles alerts, secondary provides backup
- Tiered escalation: L1 → L2 → L3 based on severity and response time
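Tiered escalation can be sketched as a list of tiers, each with an acknowledgement timeout. The tier names follow the L1 → L2 → L3 model above, but the timeout values and the `tier_to_page` helper are hypothetical, and this version escalates on response time only (severity-based routing is left out for brevity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    ack_timeout_min: int  # minutes to wait for an ack before escalating


# Hypothetical policy; the timeouts are illustrative, not recommendations.
POLICY = [Tier("L1", 5), Tier("L2", 10), Tier("L3", 15)]


def tier_to_page(minutes_unacked: int, policy=POLICY) -> str:
    """Return the tier that should currently hold an unacknowledged page."""
    elapsed = 0
    for tier in policy:
        elapsed += tier.ack_timeout_min
        if minutes_unacked < elapsed:
            return tier.name
    # Every timeout has passed: the page stays with the last tier.
    return policy[-1].name
```

Under this assumed policy, a page unacknowledged for 4 minutes is still with L1, moves to L2 at the 5-minute mark, and reaches L3 after 15 minutes.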
3. Compensation
Fair compensation for on-call includes:
- Flat on-call stipend per shift
- Additional pay per incident response
- Comp time for overnight pages
- Clear escalation policies that avoid unnecessary wake-ups
Reducing Alert Noise
One of the most common complaints from on-call engineers is too many false alarms.
Alert Hygiene Practices
- Every alert must be actionable — If no action is needed, delete the alert
- Set appropriate thresholds — Use dynamic baselines, not static values
- Deduplicate alerts — Group related alerts into single notifications
- Add context — Include runbook links, affected services, and recent changes
- Review alerts monthly — Delete or tune alerts that haven't fired in 30 days
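To make the deduplication practice concrete, here is a minimal sketch that collapses alerts sharing a service and alert name into one notification when they fire close together. The alert fields (`service`, `name`, `ts` in seconds) and the 5-minute window are assumptions for illustration; real alerting tools expose their own grouping settings.

```python
def group_alerts(alerts: list, window_s: int = 300) -> list:
    """Collapse alerts with the same (service, name) that fire within
    window_s seconds of the first alert in their group."""
    groups = []      # one entry per notification to send
    latest = {}      # (service, name) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["name"])
        group = latest.get(key)
        if group is not None and alert["ts"] - group["first_ts"] <= window_s:
            group["count"] += 1  # duplicate: fold into the open group
        else:
            group = {
                "service": alert["service"],
                "name": alert["name"],
                "first_ts": alert["ts"],
                "count": 1,
            }
            groups.append(group)
            latest[key] = group
    return groups
```

Each returned group corresponds to a single page, and its `count` field tells the responder how many raw alerts it represents.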
Effective Runbooks
Every alert should link to a runbook that includes:
- Description of what the alert means
- Impact assessment — What's affected?
- Diagnosis steps — How to investigate
- Remediation steps — How to fix
- Escalation criteria — When to escalate
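One way to enforce this checklist is a small lint step that flags runbooks with missing sections. The sketch below assumes each runbook is represented as a dictionary keyed by section; the key names and the `missing_sections` helper are hypothetical.

```python
# Section keys mirror the five items listed above (names are assumed).
REQUIRED_SECTIONS = (
    "description",
    "impact",
    "diagnosis",
    "remediation",
    "escalation",
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or blank."""
    missing = []
    for section in REQUIRED_SECTIONS:
        value = runbook.get(section)
        # Treat None, empty values, and whitespace-only strings as missing.
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(section)
    return missing
```

Running a check like this in CI for the alert definitions would keep an alert from shipping with a half-written runbook.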
On-Call Health Metrics
Track these metrics to assess your on-call program:
- Pages per shift — Target fewer than 2 per shift
- Time to acknowledge — Target under 5 minutes
- Time to resolve — Track trends over time
- False positive rate — Target under 10%
- Sleep interruptions — Track overnight pages
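Most of these metrics can be computed from a simple log of pages. The sketch below assumes each page record is a dictionary with `fired_at` and `acked_at` datetimes and an `actionable` flag, and it treats 10 PM to 7 AM as overnight; those field names and the window are assumptions. Time to resolve is omitted because it is tracked as a trend rather than against a target.

```python
from datetime import datetime
from statistics import mean


def is_overnight(ts: datetime) -> bool:
    """Assumed overnight window: 22:00 up to (not including) 07:00."""
    return ts.hour >= 22 or ts.hour < 7


def on_call_metrics(pages: list, shift_count: int) -> dict:
    """Summarize page records across shift_count shifts."""
    if shift_count <= 0:
        raise ValueError("shift_count must be positive")
    if not pages:
        return {"pages_per_shift": 0.0, "mean_ack_min": None,
                "false_positive_rate": 0.0, "overnight_pages": 0}
    ack_minutes = [
        (p["acked_at"] - p["fired_at"]).total_seconds() / 60 for p in pages
    ]
    false_positives = sum(1 for p in pages if not p["actionable"])
    return {
        "pages_per_shift": len(pages) / shift_count,     # target: under 2
        "mean_ack_min": mean(ack_minutes),               # target: under 5
        "false_positive_rate": false_positives / len(pages),  # target: under 0.10
        "overnight_pages": sum(1 for p in pages if is_overnight(p["fired_at"])),
    }
```

Comparing the output against the targets listed above on a monthly basis gives a quick read on whether the program is getting healthier or noisier.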
Preventing Burnout
- Rotate fairly — No one should be on-call more than 25% of the time
- Provide quiet hours — Suppress non-critical alerts 10PM-7AM
- Allow recovery time — Day off after overnight incidents
- Celebrate wins — Recognize on-call heroes
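Two of these rules are easy to check mechanically: the quiet-hours window and the 25% cap. The following sketch assumes alerts carry a severity label where "critical" is the only level allowed to page during quiet hours, and that on-call load is measured by shift count; both are simplifying assumptions, and the function names are hypothetical.

```python
from datetime import time

QUIET_START = time(22, 0)  # 10 PM
QUIET_END = time(7, 0)     # 7 AM


def should_page(severity: str, now: time) -> bool:
    """Page unless the alert is non-critical and it is quiet hours."""
    # The window wraps past midnight, so it is an OR of two comparisons.
    in_quiet_hours = now >= QUIET_START or now < QUIET_END
    return severity == "critical" or not in_quiet_hours


def over_limit(shifts_by_engineer: dict, limit: float = 0.25) -> list:
    """Return engineers whose share of all shifts exceeds the limit."""
    total = sum(shifts_by_engineer.values())
    if total == 0:
        return []
    return sorted(
        name for name, count in shifts_by_engineer.items()
        if count / total > limit
    )
```

Suppressed alerts should still be delivered once quiet hours end; this sketch only decides whether to page right now.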
How SRExpert Supports On-Call
SRExpert's smart alerting reduces on-call noise by up to 70% with intelligent deduplication and correlation. Our on-call scheduling feature manages rotations, escalations, and 10+ notification channels — so your team only gets paged for real incidents.

