Learning from Operational mistakes
In the world of Ops, it’s always good to learn from mistakes. It’s not enough to solve a problem (*fix*); we must also do a post-mortem to understand what went wrong (*root cause*) and what we can do to prevent it in the future (*long term solution*).
I am of the opinion that long term solutions are preferable to short term fixes (hacks!). But long term solutions are not easy: they almost always require understanding the root cause, and that is not always obvious.
After any incident, crisis, or problem, whatever you want to call it, make sure you have a *blame-free* post-mortem. This is very important. We are not looking to blame anyone; we should focus on the root cause and how to prevent it from happening again. Going into a post-mortem with the right mind-set also helps the process go much more smoothly. You will get better cooperation from the involved parties. It’s a team effort to improve everyone’s job.
The process should be something like this.
- Assign an owner of the post-mortem process. Usually the lead engineer involved in the incident. That person is empowered to call for help from anyone needed.
- Set a specific deadline by which the post-mortem must conclude. You do not want to let it drag on. Let’s get it done and move on. I recommend no more than two weeks from the date of the incident.
- Communicate what is expected of the post-mortem output:
  - When: timeline of the incident
  - What: specific details of alerts, failures, etc.
  - Communications during the incident: within the team, with other teams, internal and external (customers, press)
  - Root cause analysis
  - Training: better training
  - Monitoring: better monitoring (add monitors, alerts)
  - Failure detection: failures that were missed
  - SPOF (single point of failure): add redundancy, re-architect
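
If you track post-mortems in a tool or script, the process above can be captured as a small record. What follows is only a sketch, not a prescribed format: the class name `PostMortem`, its field names, and the `MAX_DAYS` constant are all hypothetical, chosen to mirror the list. The one value taken from the text is the two-week limit.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumption: "no more than two weeks from date of incident" as a day count.
MAX_DAYS = 14


@dataclass
class PostMortem:
    """Hypothetical record of one post-mortem, mirroring the list above."""
    incident_date: date
    owner: str                                               # usually the lead engineer
    timeline: list[str] = field(default_factory=list)        # When
    details: list[str] = field(default_factory=list)         # What
    communications: list[str] = field(default_factory=list)  # who was told, and when
    root_cause: str = ""
    # Follow-up actions keyed by area, e.g. "training", "monitoring",
    # "failure_detection", "spof".
    follow_ups: dict[str, list[str]] = field(default_factory=dict)

    @property
    def due_date(self) -> date:
        """Date by which the post-mortem should conclude."""
        return self.incident_date + timedelta(days=MAX_DAYS)

    def missing_sections(self) -> list[str]:
        """Names of the expected output sections that are still empty."""
        required = {
            "timeline": self.timeline,
            "details": self.details,
            "communications": self.communications,
            "root_cause": self.root_cause,
        }
        return [name for name, value in required.items() if not value]

    def is_overdue(self, today: date) -> bool:
        """True when the deadline has passed and sections are still missing."""
        return today > self.due_date and bool(self.missing_sections())


if __name__ == "__main__":
    pm = PostMortem(incident_date=date(2024, 1, 1), owner="lead engineer")
    pm.timeline.append("02:10 first alert fired")
    print(pm.due_date, pm.missing_sections(), pm.is_overdue(date.today()))
```

The point of `missing_sections` is to make the expected output explicit up front, so the owner can see at a glance what is still outstanding as the deadline approaches.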
It’s good if we can learn from past mistakes. It is even better if we can learn from others’ mistakes!
Here is the start of a list of Operational mistakes published on the web. I will be adding more as I find them. Feel free to submit any that I missed. Thanks!
Very nice post from David Henke: