Introduction to Postmortem Reports

Ric Hincapie
Oct 5, 2020

Here you will find an example of how to write a proper postmortem, or incident report.

Working in the tech industry, we find that problems arise once in a while no matter how much planning our teams do or how good their code is.

The best way to deal with this reality is to look at incidents as a good opportunity to learn, so that the same mistakes don't happen again.

I recommend focusing not on "who is to blame" but on "what conditions led this to happen." A punitive approach makes it harder for you and your organization to get accurate information about the issue, since the engineers responsible for or close to it will try to cover themselves if retaliation is what they expect.

So, below you will find a postmortem report about a load balancer algorithm issue that surfaced when requests spiked. It has four parts:

  • Issue summary: the time window, the impact, and the root cause.
  • Timeline: the actions taken, the assumptions made, the corrective steps, and any other information about the resolution process.
  • Root cause: what, in detail, caused the incident and how it was fixed.
  • Corrective and preventive measures: in general terms, what needs to be improved and which tasks are being or will be done to prevent it from happening again.

Open Democracy outage report

Friday, October 2, 2020

By Ricardo Hincapié

Issue summary

The outage started on September 25th at 22:27 EDT and ended on September 26th at 7:31 EDT.

We had a 24% spike in requests, which impacted 38% of users: they experienced response time increases of up to 250%, specifically in URLs serving electoral data from the northern part of Colombia.

The root cause of the failure was the HAProxy load balancer's algorithm configuration: it was set to round robin, but one of the two available servers had less CPU than the other, so the weaker one eventually got overloaded. A misconfigured monitoring alarm made things worse, since nobody was properly notified and therefore nobody responded.

Timeline

  • 22:20 EDT: 173% spike in requests.
  • 22:27 EDT: monitoring tools record response time increases between 80% and 138%. An alarm goes off, but nobody on the team responds.
  • Sep 26th 06:14 EDT: health check on both servers. They are up and running normally. No server crash happened and response time is back to normal. We assume there was a load balancing problem or an issue with the DB server.
  • 06:23 EDT: DB server monitoring shows no overload or unusual activity besides an increase in workload from 22:20 EDT Sep 25th to 00:56 EDT Sep 26th. Load balancer logs point to Server 2's response as the cause. Server 2 had a CPU overload from 22:18 to 23:27 EDT on Sep 25th.
  • 06:50 EDT: We confirm the load balancer algorithm was set to round robin even though the servers have unequal computing resources. This is identified as the root cause of the incident.
  • 07:31 EDT: The HAProxy load balancing algorithm is reconfigured to weighted round robin.

Root cause

On Sep 25th at 22:10 EDT, a nationwide broadcaster released a post announcing that the Open Democracy service had opened electoral data for Barranquilla, Santa Marta, and Cartagena.

Users from these locations and nearby requested information from our servers at a rate 80% to 138% higher than usual. Web Server 1 has 30% more CPU capacity than Web Server 2, but the round robin algorithm was distributing the workload evenly between these unequal servers, so users routed to Web Server 2 experienced slow responses.
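For illustration only, here is a minimal sketch of what the relevant HAProxy backend might have looked like at the time; the backend name, server names and addresses are hypothetical:

```
backend open_democracy_web
    balance roundrobin               # plain round robin: each server gets an equal share of requests
    server web1 10.0.0.11:80 check   # Web Server 1 (30% more CPU)
    server web2 10.0.0.12:80 check   # Web Server 2 (weaker), same default weight as web1
```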

The issue was solved by measuring the exact computing power gap between the two servers and using it to calculate the ratio for a load balancing algorithm capable of distributing workload across uneven servers: weighted round robin. We determined that a 5/3 workload ratio would prevent Web Server 2 from overloading while still not overloading Web Server 1 under conditions similar to those experienced that night.
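Sticking with the same hypothetical names, the fix amounts to adding explicit weights so that HAProxy's round robin behaves as a weighted round robin, sending roughly 5 requests to Web Server 1 for every 3 sent to Web Server 2 (5/8 versus 3/8 of the traffic):

```
backend open_democracy_web
    balance roundrobin                        # with explicit weights this behaves as weighted round robin
    server web1 10.0.0.11:80 weight 5 check   # Web Server 1 takes 5/8 of the traffic
    server web2 10.0.0.12:80 weight 3 check   # Web Server 2 takes 3/8 of the traffic
```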

Corrective and preventive measures

It is important to keep an eye on the computing resources available on each server, and for the team leader to make sure that information is clearly accessible to the engineers. Although there was a person responsible for being on the front line of this kind of emergency, the monitoring alarm was misconfigured and the notification never reached our engineer.

  • Load tests will be scheduled more frequently to make sure high demand can be met. (Completed)
  • Budget in 2021 for an elastic infrastructure service that responds automatically to demand spikes.
  • Create a channel with the communications department to get early information about possible demand spikes. (Completed)

Conclusions

As you could read, a postmortem report needs to be brief yet informative enough that managers, engineers, and even clients can understand the problem's root cause and how you worked it out.

You're expected to deliver and publish a report like this within one week of the incident, so concentrate on gathering all the facts surrounding the problem from every possible source and delivering them in a comprehensible report, so that you and your organization can learn from it.

If a decision of yours was the cause of the problem, don't worry. Now you're the team's expert on it and can improve reliability by making sure it doesn't happen again. That's what making mistakes is all about.

If you found this article useful, it would cheer me up if you clapped for it 😀!

Thanks!
