Playbook: Post-Incident Validation¶
Goal¶
Ensure system state is healthy and operational after incident resolution.
Trigger¶
Immediately after resolving a major incident (e.g. downtime, crash, config error).
Checklist¶
- Verify affected services are running:
systemctl status <service>
- Run end-user test (UI/API ping)
- Review logs:
journalctl -u <service> --since "10 min ago"
- Confirm alerts are cleared
- Update incident log or ticket
Output¶
Postmortem summary (markdown or shared doc) + retrospective scheduling