One of the problems with doing “root cause analysis” within complex systems is that there’s almost never “one bad thing” that’s truly at the root of the problem, and talking about the incident as if there’s One True Root is probably not productive. It’s important to identify the full range of contributing factors, so that you can do something about those elements individually as well as de-risk the system as a whole.
I recently heard someone talk about struggling to shift the language of their org around root cause, and it occurred to me that adapting Macneil’s 5 P factors model from medicine/psychology would be very useful in SRE “blameless postmortems” (or traditional ITIL problem management RCAs). I’ve never seen anything about using this model in IT, and a casual Google search turned up nothing, so I figured I’d write a blog post about it.
The 5 Ps, described in IT terms (well, really six Ps: a problem and five P factors), are as follows:
- The presenting problem is not only the core impact, but also its broader consequences, all of which should be examined and addressed. For instance, “The FizzBots service was down” becomes “Our network was unstable, resulting in FizzBots service failure. Our call center was overwhelmed, our customers are mad at us, and we need to pay out on our SLAs.”
- The precipitating factors are the things that triggered the incident. There might not be a single trigger, and the trigger might not be a one-time event (i.e. it could be a growing issue that eventually crossed a threshold, such as exhaustion of a connection pool or running out of server capacity). For example, “A network engineer made a typo in a router configuration.”
- The perpetuating factors are the things that resulted in the incident continuing (or becoming worse) once triggered. For instance, “When the network was down, application components queued requests, ran out of memory, crashed, and had to be manually recovered.”
- The predisposing factors are the long-standing things that made it more likely that a bad situation would result. For instance, “We don’t have automation that checks for bad configurations and prevents their propagation,” or “We’re running outdated software on our load-balancers that contains a known bug that sometimes results in sending requests to unresponsive backends.”
- The protective factors are things that helped to limit the impact and scope (essentially, your resilience mechanisms). For instance, “We have automation that detected the problem and reverted the configuration change, so the network outage duration was brief.”
- The present factors are other factors that were relevant to the outcome (including “where we got lucky”). For instance, “A new version of an application component had just been pushed shortly before the network outage, complicating problem diagnosis,” or “The incident began at noon, when much of the ops team was out having lunch, delaying response.”
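If you wanted to turn this classification into a postmortem template, one possible shape is sketched below in Python. Everything named here (`Factor`, `Postmortem`, `add`, `missing`, `render`) is hypothetical and invented for illustration; it is not part of Macneil’s model or any existing tool. The example entries are taken from the FizzBots scenario above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Factor(Enum):
    """The five P factor categories (hypothetical names for this sketch)."""
    PRECIPITATING = "precipitating"
    PERPETUATING = "perpetuating"
    PREDISPOSING = "predisposing"
    PROTECTIVE = "protective"
    PRESENT = "present"


@dataclass
class Postmortem:
    """One presenting problem plus a list of findings per factor category."""
    presenting_problem: str
    factors: dict = field(default_factory=lambda: {f: [] for f in Factor})

    def add(self, kind: Factor, description: str) -> None:
        """Record one contributing factor under its category."""
        self.factors[kind].append(description)

    def missing(self) -> list:
        """Categories with nothing recorded yet, as a prompt for reviewers."""
        return [f for f in Factor if not self.factors[f]]

    def render(self) -> str:
        """Plain-text outline of the postmortem, one section per category."""
        lines = [f"Presenting problem: {self.presenting_problem}"]
        for f in Factor:
            lines.append(f"{f.value.capitalize()} factors:")
            lines.extend(f"  - {d}" for d in self.factors[f] or ["(none recorded)"])
        return "\n".join(lines)


pm = Postmortem("Our network was unstable, resulting in FizzBots service failure.")
pm.add(Factor.PRECIPITATING, "A network engineer made a typo in a router configuration.")
pm.add(Factor.PROTECTIVE, "Automation detected the problem and reverted the change.")
print(pm.render())
print("Still to discuss:", [f.value for f in pm.missing()])
```

The `missing()` check is the part that might actually help shift the language: a write-up that only has a precipitating factor filled in is visibly incomplete.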
If you think about the October 2021 Facebook outage in these terms, the presenting problem was the outage of multiple major Facebook properties and their attendant consequences. The precipitating factor was the bad network config change, but it’s clearly not really the “root cause”. (If your conclusion is “they should fire the careless engineer who made a typo”, your thinking is Wrong.) There were tons of contributing factors, all of which should be addressed. “Blame” can’t be laid at the feet of anyone in particular, though some of the predisposing and perpetuating factors clearly had more impact than others (and therefore should be addressed with higher priority).
I like this terminology because it’s a clean classification that encompasses a lot of different types of contributing factors, and it’s intended to be used in situations that have a fair amount of uncertainty to them. I think it could be useful for structuring incident postmortems, and I’d be keen to know how it works for you, if you try it out.