Incidents are inherent to software engineering, no matter how high the quality of your code, so it’s crucial to find ways to handle and prevent situations that can be costly for your business. For instance, imagine an e-commerce site going down for two hours during Black Friday; scary, right? To avoid nightmares like this, I’ve listed four tips that will better prepare you for a future incident in your company.
The Power of Documentation
We can agree that improvising is the worst way to handle an incident. I’m not talking about a temporary fix to get your systems back online. I’m talking about situations where, due to improvised processes, engineers are unsure of how to proceed and find themselves lost in the middle of a crisis. This tends to happen when a company doesn’t prioritize documentation. While many see documentation as a cost, something that slows the delivery of features and value, we should always keep its benefits in mind:
- Faster development in future related deliveries;
- Easier context recovery for legacy or “forgotten” artifacts;
- A shorter learning curve for new team members;
- A clearer understanding of a service’s operations and processes;
- Early detection of misunderstandings or errors.
The items above are just a few of the many benefits of documenting processes and services. With that said, make sure your team takes documentation seriously and neither postpones nor rushes it. That way, when the next incident arrives, there will be less room for improvisation, and engineers will have a better idea of how to proceed, which leads to quicker resolution. Also, make sure to document the communication process during incidents, which we’ll discuss next.
Communication and Transparency
A significant mistake during incident resolution is the lack of communication and transparency with those affected. Of course, engineers don’t need to live-stream every action they take. However, basic communication is vital. Keeping those affected by the incident informed is a simple act that maintains the company’s trust and image amidst chaos.
But how would this communication work? Firstly, there should be a communication channel with stakeholders. The communication method isn’t the focus of this article, as it varies based on your company’s size and market. However, it can range from simple email exchanges to dedicated websites displaying your company’s service status. Although the method may vary, the communicated message tends to be similar and follows a kind of script:
- Initiate communication with those affected, informing them that the team is aware of the service degradation and is investigating.
- After the initial message, focus on finding the root cause. Once identified, inform the affected parties. Detailed explanations aren’t necessary at this point; simply stating that the cause has been found and a solution is underway is sufficient.
- After mitigating the issue and monitoring service normalization, inform stakeholders that the problem has been resolved.
Notice the simplicity in the above script; sending just three messages is enough to establish good communication and transparency. Messages don’t require technical details or lengthy texts; save that for the post-incident analysis.
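The three-message script above can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the stage names, the message wording, and the `status_update` helper are all assumptions made for the example, and the delivery channel (email, status page, chat) is deliberately left out.

```python
from enum import Enum


class Stage(Enum):
    """The three moments at which stakeholders receive a message (hypothetical names)."""
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    RESOLVED = "resolved"


# Hypothetical templates: short, non-technical, one per stage.
# Adapt the wording to your company's voice.
TEMPLATES = {
    Stage.INVESTIGATING: (
        "We are aware of degraded performance in {service} and are investigating."
    ),
    Stage.IDENTIFIED: (
        "We have identified the cause of the issue affecting {service} "
        "and a fix is underway."
    ),
    Stage.RESOLVED: (
        "The issue affecting {service} has been resolved "
        "and the service is operating normally."
    ),
}


def status_update(stage: Stage, service: str) -> str:
    """Return the stakeholder message for the given incident stage."""
    return TEMPLATES[stage].format(service=service)


if __name__ == "__main__":
    # Print the full script for an example service, in order.
    for stage in Stage:
        print(f"[{stage.value}] {status_update(stage, 'Checkout')}")
```

Keeping the templates in one place means nobody has to compose a message from scratch in the middle of a crisis, which ties back to the earlier point about documenting the communication process.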
The Importance of Postmortems
After an incident, it’s crucial for your team to take the time to investigate in detail why it happened. You might say that “the root cause was found and fixed,” but the real questions are: how did that root cause occur in the first place? Is there a risk of similar or identical causes leading to future incidents? These are the questions an internal postmortem should answer. This document is essential to prevent repeated mistakes. Documenting incidents strengthens processes and makes a company’s services more resilient.
In general, improvements from a postmortem result from identifying hidden issues. These are non-obvious flaws that escalated into an incident. Under normal circumstances, these issues might go unnoticed or even be ignored despite their potential to weaken services and products. Therefore, when an incident occurs, it’s an opportunity to find improvements. Remember that a postmortem isn’t about placing blame, which we’ll discuss next.
No One Is To Blame
Although it might seem like the natural response, blaming one or more individuals for an incident is the easiest way to waste a learning opportunity. Even if the mistake made by the person involved seems “unbelievable,” rest assured that they aren’t to blame for the entire incident; the company is. We should assume that everyone involved did their best with the information they had and acted in good faith. With that in mind, the cause becomes a systemic issue: why did these individuals lack complete or accurate information?
This mindset is crucial for thoroughly investigating the incident’s cause without fear of harming others or oneself. Otherwise, individuals might hesitate to identify and point out issues due to fear of retaliation. Therefore, avoid pointing fingers during incident investigations. Be pragmatic and focus on what truly matters and will bring systemic improvements when addressed.
The Essence of Effective Incident Management
Although managing, resolving, and investigating incidents isn’t simple, the tips outlined in this article can benefit any team that applies them. These tips work regardless of your company’s industry or size because they focus on the systematic approach you should adopt when solving problems. Remember, when it comes to incidents, the “why” will always be more important than the “who” or “how.”