Intuition is a handy thing to have in IT. As a former IT Jack-of-all-trades, I could join an outage call and frequently identify what was wrong more quickly than the experts. As a newbie with DevOps, data science, and other emerging IT fields, I don’t have that advantage anymore. It is uncomfortable.
But it is exciting.
That does not mean I am entirely bereft of intuition in these new areas. Logic is logic, regardless of subject. However, it is invigorating to discover new, Counterintuitive IT™ facts or assertions. This is the first in an ongoing series of posts where I will share Counterintuitive IT™ examples as I run into them.
Counterintuitive IT™ #1: CABs Don’t Help, They Hinder
Having spent too much time on outage calls, I can appreciate any effort to reduce change-induced incidents (CIIs). Avoiding those is especially important in healthcare IT (my present career). At a minimum, a CII inconveniences our clinicians and/or patients. In the worst case, it affects patient safety.
One popular approach to minimizing CIIs is to require that all non-standard, non-emergency modifications of production systems go before a change approval board (CAB). The thinking is that having multiple sets of eyes will ensure you have all your ducks in a row, significantly reducing the likelihood of a problem.
Well, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations by Nicole Forsgren, Jez Humble, and Gene Kim provides evidence that all that CABs do is slow things down:
We wanted to investigate the impact of change approval processes on software delivery performance. Thus, we asked about four possible scenarios: All production changes must be approved by an external body (such as a manager or CAB). Only high-risk changes, such as database changes, require approval. We rely on peer review to manage changes. We have no change approval process. The results were surprising. We found that approval only for high-risk changes was not correlated with software delivery performance. Teams that reported no approval process or used peer review achieved higher software delivery performance. Finally, teams that required approval by an external body achieved lower performance.
Well, that still seems intuitive, right? If you add a step, it’ll slow things down. So what?
But what if slowing things down gains you nothing safety-wise? The authors continue:
We investigated further the case of approval by an external body to see if this practice correlated with stability. We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.
Of course, this does not mean there should be a product change free-for-all. Instead, if you follow the other learnings of DevOps, you will implement other solution life cycle adjustments that will make it safe for folks close to the changes to press the deploy button without involving a board that may really only have authority, not knowledge.
Counterintuitive IT™ #2: Trying to Get the Most Out of Resources Slows Delivery
Budgets are tight everywhere. We need to squeeze as much as we can out of every resource, system, human, etc. I’ve personally been guilty of saying it’s best if we have 110% on our individual plates. It maximizes our output.
Oh, foolish me, as noted by Mik Kersten in From Project to Product:
As Reinertsen points out, the tendency for the business is to maximize utilization of the resources in the value stream. In manufacturing, this could suggest ensuring that each robot is 100% utilized. Goldratt demonstrates how flawed this approach is for manufacturing, while Reinertsen provides proof of the negative effects of over-utilization for product development. The corresponding practice in software delivery is the tendency to set the delivery of flow items such as features to 100% allocation for the teams building the software. As DeGrandis summarizes, the result of seeking full utilization is similarly problematic for software delivery as it is for manufacturing, having a negative effect on flow velocity and flow time.
It does not matter if it is a person, a machine, or a computer: attempt to get 100% utilization and you’ll slow things down. Potentially horribly.
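For a rough feel of why, here is a minimal sketch using the textbook M/M/1 single-server queue. That model is my assumption for illustration and is not taken from Kersten, Reinertsen, or Goldratt; the function name and the choice of one unit of work time are made up as well. In this model, the average time an item spends waiting plus being worked on is the bare work time divided by (1 - utilization).

```python
def mean_time_in_system(utilization: float, service_time: float = 1.0) -> float:
    """Average wait-plus-work time for one item in an M/M/1 queue.

    Assumes random arrivals and a single worker (a simplification).
    Formula: W = service_time / (1 - utilization).
    """
    if not 0 <= utilization < 1:
        # At 100% the queue never drains, so there is no finite average.
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Compare how long an item takes at different levels of "busy".
for pct in (50, 80, 90, 95, 99):
    turnaround = mean_time_in_system(pct / 100)
    print(f"{pct:>3}% busy -> {turnaround:6.1f}x the bare work time")
```

Under these assumptions, a worker who is 50% busy turns an item around in twice the bare work time. At 90% it is ten times, and at 99% it is a hundred times. Real teams are messier than this model, but the shape of the curve is the point: the last few percent of utilization are the expensive ones.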
Not convinced? Check out this “The Resource Utilization Trap” video for visual proof:
To quote The Goal by Eliyahu Goldratt, “A system of local optimums is not an optimum system at all; it is a very inefficient system.” Trying to utilize each piece (human or otherwise) at 100% is trying to build “a system of local optimums.”
And that will lead to “a very inefficient system,” “having a negative effect on flow velocity and flow time” (to mash up Goldratt and Kersten).
Counterintuitive IT™ #3: There Is No Single Root Cause with a Complex System
This next and final one might have been the hardest for me to digest and accept, and I’ll admit I’m still wrestling with it a bit. However, in his paper, “How Complex Systems Fail,” Richard I. Cook, MD makes a great argument against the common practice of identifying the root cause of any system failure. You really should review the whole document (it is filled with gems), but here is an excerpt directly speaking to Counterintuitive IT™ #3:
7) Post-accident attribution to a ‘root cause’ is fundamentally wrong.
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
That portion of his article does not do the overall discussion justice, but you get the gist. Complex systems are, well, complex. Failures don’t happen because of a single factor. Our need to identify a root cause is because of “the social, cultural need to blame specific, localized forces or events for outcomes.”
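If it helps to see the logic spelled out, here is a toy sketch of “insufficient alone, sufficient only jointly.” The three faults are hypothetical, invented purely for illustration, and the all-or-nothing rule is a deliberate oversimplification of what Dr. Cook describes.

```python
# Hypothetical contributing faults, made up for this example.
CONTRIBUTORS = frozenset({"expired_certificate", "muted_alert", "untested_failover"})


def outage_occurs(active_faults) -> bool:
    """Toy model: the outage happens only when every contributor is present."""
    return CONTRIBUTORS <= set(active_faults)


# No single fault is enough to cause the outage by itself.
assert not any(outage_occurs({fault}) for fault in CONTRIBUTORS)

# Yet removing any one fault prevents the outage, so each one has an
# equal claim to being "the" root cause.
preventable_by = [
    fault for fault in sorted(CONTRIBUTORS)
    if not outage_occurs(CONTRIBUTORS - {fault})
]
print(preventable_by)
```

In this sketch, every fault lands on the list. Picking one of them as the root cause says more about who is doing the picking than about the failure.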
I’m not quite ready to throw out my ASQ root cause process books :-), but…counterintuitively…I think Dr. Cook is right. Well, that is going to be an interesting conversation with my boss. 🙂
Your Turn: What Examples of Counterintuitive IT™ Do You Have?
Well, there are three Counterintuitive IT™ examples from me. What would you add to the list? Things you were surprised to find out. Things that seem to go against common sense. Things you don’t want others to learn the hard way.
If you have any, please add them as a comment below…