A Human Approach to Postmortem Reviews

Dave Zwieback on how considering the human side of outages and postmortems can help build more resilient systems and teams.

There is nothing pleasant about postmortem reviews following an outage, and many companies struggle to execute positive, effective reviews. In a recent interview, Dave Zwieback (@mindweather), head of infrastructure at Knewton, said that we often focus only on technical issues during postmortems, to the exclusion of human elements. We also tend to fall into the “blame game” and point fingers when assessing particularly bad outages, he said.

In the following interview, Zwieback addresses the importance of including human and organizational elements in postmortem reviews, and outlines contributing factors to take into consideration, such as particular stressors and cognitive biases. He will address these issues further in a free online webcast, The Human Side of Postmortems, at 1 p.m. PT on April 30.

How are postmortems typically approached, and why is it so important to make human and organizational factors more of a concern?

Dave Zwieback: First, it’s worth noting that we as an industry have come a long way in terms of both routinely conducting postmortems after outages as well as sharing their results publicly. There’s an emergent culture of analyzing and learning from postmortems, thanks in part to folks like Tim Freeman, who have been collecting lots of them. However, we still largely focus on the technical details of outages and exclude human and organizational factors from both the postmortems and the subsequent documentation. As John Allspaw says, engineers “like to simplify complex problems so we can work on them in a reductionist fashion.”

Human factors can be difficult to analyze, and engineers generally lack training to do so. Still, excluding human factors is a glaring omission: arguably all of the failures in complex systems have human components. Examples of such conditions of failure include “human error” during the design or operations of the system, communication breakdowns due to a culture of blaming and shaming engineers for outages, or the effects of stress and fatigue on people dealing with outages. A deeper focus on the human side of outages and postmortems can ultimately help us build more resilient systems and teams, and reduce the duration and severity of outages.

What are the major human and organizational factors that arise during an outage, and how do they affect people trying to address the outage?

Dave Zwieback: The two major factors that I’ve been researching are the effects of stress and cognitive biases during outages.

We certainly know that outages are stressful events, but is all stress bad? For instance, encountering a dangerous animal is certainly a stressful event that will reliably produce a fight-or-flight reaction. However, this instinctual reaction can be quite useful, and it has helped keep humans safe for millennia.

While outages are typically not life-or-death events, the “softer” stressors endemic to outages nonetheless produce measurable stress responses by the body. Specifically, the following four “relative stressors” can negatively impact humans and their decisions during outages:

  • A situation that is interpreted as novel
  • A situation that is interpreted as unpredictable
  • A feeling of a lack of control over a situation
  • A situation where one can be judged negatively by others (the “social evaluative threat”)

In general, the duration and severity of outages have much to do with the quality of decision making. In addition to stress, the extent to which we jump to conclusions without fully considering the available data—in other words, the extent to which cognitive biases are clouding our judgments—will greatly impact the quality of our decisions.

What is the Yerkes-Dodson law, and how does it apply to a postmortem review?

Dave Zwieback: The Yerkes-Dodson law establishes a relationship between stress and performance. It was initially discovered by psychologists Robert Yerkes and John Dodson in the early 20th century after a series of experiments with mice. There’s been a wealth of subsequent research that has confirmed the validity of the law for humans in a variety of circumstances.

The essence of the Yerkes-Dodson law is that as stress increases, so does performance, at least for some time. However, after a point (which is different for each individual and also varies between simple and complex tasks), additional stress causes performance to deteriorate due to impaired attention and reduced ability to make sound decisions. The length of time that one is subject to stress also impacts the severity of its effects.

Finding the exact point at which stress becomes harmful is very difficult. We can, however, bring more awareness of the effects of stress into our field, specifically by discussing them during postmortems. This enables organizations to put in place simple procedures similar to those followed by the Heroku ops team, which institutes “an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all time” in case of lengthy outages.

What is “stress surface,” and how can it be used to facilitate more effective postmortem reviews?

Dave Zwieback: As I mentioned, it’s difficult to measure the effects of stress precisely. To address this, I’ve introduced the concept of “stress surface,” which measures the perception of the four relative stressors during an outage: the novelty of the situation, its unpredictability, lack of control, and social evaluative threat. These four stressors are selected because they are present during most outages, are known to cause a stress response by the body, and therefore have the potential to impact performance. The data can be collected via a simple online survey prior to the postmortem.

Stress surface is similar to the computer security concept of “attack surface,” a measure of the collection of ways in which an attacker can damage a system. An outage with a larger stress surface is more susceptible to the effects of stress than one with a smaller stress surface. We can use stress surface to compare the potential impact of stress on different outages as well as assess the impact of efforts to reduce stress surface over time. Knowing the stress surface score and asking questions like “Why did we feel a lack of control during the outage?” also opens the door to understanding the causes and effects of stress in real-world situations. Furthermore, we can use the data to determine if any particular dimension of the stress surface (for example, the threat of being negatively judged) remains stable between various outages.
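
As a rough illustration of how pre-postmortem survey data might be turned into a stress surface score, here is a minimal Python sketch. The interview does not describe a scoring method, so everything specific here is an assumption: the 1-to-5 rating scale, the stressor key names, the choice to sum per-stressor averages into one score, and the function names stress_surface and dimension_spread are all hypothetical.

```python
from statistics import mean, pstdev

# The four relative stressors named in the interview (key names are hypothetical).
STRESSORS = (
    "novelty",
    "unpredictability",
    "lack_of_control",
    "social_evaluative_threat",
)


def stress_surface(responses):
    """Summarize one outage's survey responses.

    responses: list of dicts, one per respondent, mapping each stressor
    to a rating on an assumed 1 (low) to 5 (high) scale.
    Returns (per-stressor mean ratings, sum of those means).
    """
    if not responses:
        raise ValueError("need at least one survey response")
    per_dimension = {s: mean([r[s] for r in responses]) for s in STRESSORS}
    return per_dimension, sum(per_dimension.values())


def dimension_spread(outages):
    """Spread of each stressor's mean rating across several outages.

    outages: list of response lists, one per outage.
    Returns the population standard deviation per stressor; a value
    near zero suggests that dimension stays stable between outages.
    """
    profiles = [stress_surface(responses)[0] for responses in outages]
    return {s: pstdev([p[s] for p in profiles]) for s in STRESSORS}


# Made-up survey data for two outages, for illustration only.
outage_a = [
    {"novelty": 5, "unpredictability": 4, "lack_of_control": 4, "social_evaluative_threat": 3},
    {"novelty": 3, "unpredictability": 4, "lack_of_control": 2, "social_evaluative_threat": 3},
]
outage_b = [
    {"novelty": 2, "unpredictability": 2, "lack_of_control": 1, "social_evaluative_threat": 3},
]

profile_a, score_a = stress_surface(outage_a)
profile_b, score_b = stress_surface(outage_b)
spread = dimension_spread([outage_a, outage_b])
```

With this made-up data, outage A scores higher than outage B, and the social evaluative threat dimension shows zero spread, which is the kind of stable dimension a team might want to ask about during the postmortem. A real survey would need its own scale and weighting; the sum of averages is only one simple option.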

What are some classic cognitive biases that present during an outage, and how should they be taken into account during a postmortem?

Dave Zwieback: There are more than 100 cognitive biases listed in Wikipedia. The following biases are almost always present during postmortems:

  1. Hindsight bias: During postmortems, we evaluate what happened during an outage with the benefit of currently available information, that is, with hindsight. As we aim to identify the conditions that were necessary and sufficient for an outage to occur, we often uncover things that could have prevented or shortened the outage. We hear statements like “You shouldn’t have made the change without backing up the system first” or “I don’t know how I overlooked this obvious step” from solemn postmortem participants. When we do, we need to remember that we’re likely being affected by hindsight bias and that this information may not have been available or obvious during the outage.
  2. Outcome bias: When the results of an outage are especially bad, hindsight bias is often accompanied by outcome bias, which is a major contributor to the “blame game” during postmortems. Under the influence of outcome bias, we judge the quality of the actions or decisions that contributed to the outage in proportion to how “bad” the outage was. The worse the outage, the more we tend to blame the human committing the error—starting with overlooking information due to “a lack of training,” and quickly escalating to the more nefarious “carelessness,” “irresponsibility,” and “negligence.” People become “root causes” of failure and therefore something that must be remediated.
  3. Availability bias: In preparing for future outages or mitigating effects of past outages, we tend to consider scenarios that appear more likely but are, in fact, only easier to remember, either because of the attention they received or because they occurred recently. That is, we might invest heavily in remediating a less likely condition of failure simply because it was part of a memorable outage. Furthermore, especially under stress, we often fall back to familiar responses from prior outages. As a result, we’re likely to try approaches that worked before, even though there might be evidence that they do not apply to the current outage.

It’s often easier to recognize other people’s mistakes than our own. Working in groups and openly asking the following questions can illuminate people’s quick judgments and cognitive biases at work:

  • How is this outage different from previous outages?
  • What is the relationship between these two pieces of information—causation, correlation, or neither?
  • What evidence do we have to support this explanation of events? Can there be a different explanation for this event?
  • What is the risk of this action? (Or, what could possibly go wrong?)

Finally, as Edward Tufte says, one “must always ask: How do I know that? That’s probably the most powerful question of all time.”

This interview was edited and condensed.
