AGLX Thinking

Beyond the Bug: The CrowdStrike Affair

Written by Steven McCrone | Aug 26, 2024 2:20:38 AM

Summary:

  • A faulty software update from cybersecurity firm CrowdStrike caused global IT meltdowns.
  • While the impact of the incident was unprecedented, the nature of the event was paradoxically common.
  • Firms underestimate how frequently similar events occur, and overestimate their own ability to prevent them.
  • Standard-practice responses make identical events less likely, but do little to address similar events or different events with similar consequences.
  • Instead of relying entirely on robustness, we recommend that firms also add resilience and improve their ability to adapt to unpredictable events.

By JP Castlin, independent management consultant, and Steven McCrone, AGLX

On July 19, a flawed software update from CrowdStrike, a US cybersecurity firm, triggered one of the biggest IT outages in history. Immediately after it rolled out, organizations across the private and public sectors began suffering catastrophic losses of performance as their computer systems crashed; Microsoft estimated that 8.5 million Windows computers were affected. As a result, banks such as JPMorgan Chase, Bank of America, and Wells Fargo experienced business interruptions, VISA customers were unable to make payments, supply chains were disrupted, more than 41,000 flights around the world were delayed (over 4,600 were canceled outright) – and those were merely the headline-grabbing items of the first 24 hours. The broader economic aftermath will undoubtedly be dramatic, but the unreported costs are likely to go beyond the financial: many hospitals and crucial emergency functions, such as 911 call centers, also saw their systems fail.

As for CrowdStrike itself, the company’s stock has, predictably, declined significantly; although it has regained some of its losses from the initial fall at the time of writing, tens of billions of dollars in shareholder value have been abruptly wiped out. CEO George Kurtz has gone on a PR tour to limit the reputational damage, apologize, and accept blame, saying on NBC’s morning news program, for example, that he was “deeply sorry for the impact” that had been caused “to customers, to travelers, to anyone affected by this, including our company”. Similar mea culpas are rare, not least because of the legal implications (many of which are currently playing out).

Although the grand-scale ramifications of CrowdStrike’s snafu were unprecedented, as almost all media channels were quick to point out, the incident as such was far from unique. As the interconnectedness of modern systems increases, similar disruptions (whether due to human error or Black Swan events) occur far more frequently than commonly thought. It is a paradoxical fact of life: the probability of any one event may be effectively zero, but applied at scale in an ever-changing environment, the likelihood of something happening is substantial. To borrow an illustration from Mandelbrot and Hudson, a Citigroup study conducted in 2002 found unusually sharp price swings in a large number of currencies (1). On one occasion, the US dollar vaulted over the yen by 7.92% (the equivalent of 10.7 standard deviations in a Gaussian distribution). The normal odds of the event occurring were practically nonexistent. Even if Citigroup had begun trading dollars and yen at the moment of the Big Bang 15 billion years ago, the company should never have seen such a deviation. Five years later, at the dawn of the financial crisis in 2007, the Chief Financial Officer of Goldman Sachs, David Viniar, announced that his firm was seeing 25-standard-deviation moves several days in a row. That was even more improbable, yet we all know what followed.
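
As a back-of-the-envelope illustration of just how extreme those odds are (our own sketch, not Citigroup’s or Mandelbrot’s calculation), one can compute the Gaussian tail probability of a 10.7-standard-deviation daily move and compare it with the number of trading days since the Big Bang; the figure of roughly 250 trading days per year is our assumption.

```python
# Illustrative back-of-the-envelope check (ours, not from the article or the
# cited Citigroup study): how likely is a 10.7-standard-deviation daily move
# if currency returns really followed a Gaussian distribution?
from scipy.stats import norm

sigma_move = 10.7                 # the dollar/yen swing expressed in standard deviations
p_per_day = norm.sf(sigma_move)   # one-sided Gaussian tail probability for a single day

# Trading days since the Big Bang, using the article's 15-billion-year figure
# and an assumed ~250 trading days per year.
trading_days = 15e9 * 250

expected_occurrences = p_per_day * trading_days

print(f"Probability on any given day: {p_per_day:.1e}")                      # roughly 5e-27
print(f"Expected such days since the Big Bang: {expected_occurrences:.1e}")  # roughly 2e-14
```

Under Gaussian assumptions, the expected number of such days over the entire 15-billion-year span is on the order of 10⁻¹⁴ – effectively never. That such moves nonetheless occur is precisely the point: the surprise lies in the model, not in the market.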

IT breakdowns are, by comparison, practically everyday occurrences (coincidentally, Kurtz was the CTO of McAfee when it launched a similarly flawed security update in 2010 that caused tens of thousands of computers to crash), but the principle applies broadly. All systems – no matter how supposedly reliable – are constantly challenged at the edges. Consequently, reflexively stating that “something like this has never happened before” (what one might call the Greenspan defense) and labeling it a crisis cannot be deemed an acceptable response, nor is it adequate to rely on so-called business continuity planning. Modern organizations share the historical idiosyncrasy of assuming that the systems within which they act, and upon which they rely, are fundamentally ordered: possible to model, accurately forecast, and precisely predict. In reality, as we are repeatedly reminded, they are not. Rather, they are complex. Complex adaptive systems (CAS) cannot be modeled using standard approaches; they display behaviors such as self-organization, emergence, and adaptation (2). This means that CAS cannot be understood as sums of their parts, nor will their past behavior reliably predict their future behavior. Planning for all possible perturbations is thus impossible a priori. Contingencies will, by nature, be emergent.

Adding further robustness, which has become the de facto “solution”, may serve to address the precise problem that arose but does little more. While the chance of identical events may be reduced by the modification of initial conditions, neither similar events nor different events with similar consequences are necessarily prevented. On the contrary, as Carlson and Doyle demonstrated at the turn of the millennium, systems that are robust to perturbations that they were designed to handle become fragile to unexpected perturbations and design flaws (3). That is to say, the more optimal (and robust) a firm gets in relation to foreseeable events, the more vulnerable it becomes in relation to surprise. Occurrences such as the CrowdStrike incident must therefore also be understood from a higher systemic perspective; they constitute unavoidable consequences of a technological world that is increasingly interconnected and entangled. Although there was a time at which computer applications operated in an environment isolated from any global grid, today’s IT infrastructure is a dense network of interdependencies between applications, data, and people. The likelihood of a small error in one part of the system – a weak signal – amplifying latent conditions in other parts of the system thereby increases dramatically. While most of these errors are caught, contained, and fixed before there is significant damage, some inevitably cascade throughout the system, resulting in widespread performance losses (4).

In other words, firms must move beyond a complete reliance on robustness, if for no other reason than that there are so many causative factors, feedback loops, points of escalation, points of inflection, and so on that it is typically impossible to know which to fix and how. The current modus operandi of, sometimes metaphorically and sometimes literally, building a slightly higher seawall in the hope that the biggest wave observed so far is as high as waves will ever be simply does not work – as we know only too well from financial market crashes, power plant catastrophes, and so forth. As David Noble once wrote, technology leads a double life (5). The first conforms to the intentions of its designers. The second contradicts them, yielding unintended consequences and unanticipated possibilities.

Accordingly, companies must increase their adaptive capacity and add resilience if they wish to better manage unforeseen events. In practice, this requires an understanding of tradeoffs; change may be constant, but resources are finite. Building upon Woods, a balance may be achieved by recognizing two distinct capabilities: the competence envelope (performance far from the limits of organizational knowledge) and graceful extensibility (performance close to the limits of organizational knowledge) (6). Adding resources to the former adds robustness but subtracts from sources of resilience, making the system more brittle. Adding resources to the latter improves the system’s ability to adapt to surprise, but subtracts from processes that are known (or believed to be known), potentially hurting the bottom line. Where the precise balance lies will vary from firm to firm, but it may be established by analyzing how well the organization takes advantage of frequently occurring events, patterns, and variations, and how effectively it responds to novel surprises and opportunities.

Events such as the CrowdStrike affair are an unavoidable part of the complex reality in which modern companies exist. Although individual incidents may be prevented from recurring, others will, so to speak, take their place. It is thus imperative that firms stop focusing solely on process improvement and robust systems control under the assumption that they are building benign technological systems; that assumption contradicts the inherently unpredictable and unintended nature of those very systems. Instead of devoting their time to making contingency plans, executives need to ensure they have the requisite organizational resilience to adapt to events that were not planned for.

CrowdStrike may never make the same mistake again, but companies consist of people, and people will always make mistakes. Even more than that, though, unintended consequences are a feature of the interconnected world, not a bug. And while it is acceptable to be surprised, it cannot be acceptable to be baffled.

 

Want to find out more about how to increase resilience in your organization?

Get in touch to find out how we can help.

 

 

References:

(1) Mandelbrot, B. & Hudson, R. L., The (Mis)Behavior of Markets: A Fractal View of Risk, Ruin, and Reward. Basic Books (2004).

(2) Carmichael, T. & Hadzikadic, M., The Fundamentals of Complex Adaptive Systems. Springer Nature (2019).

(3) Carlson, J. M. & Doyle, J., Highly Optimized Tolerance: Robustness and Design in Complex Systems. American Physical Society (2000).

(4) Alexander, D., & Pescaroli, G., What are cascading disasters? UCL Open Environ (2019).

(5) Noble, D., Forces of production: a social history of industrial automation. Routledge (2011).

(6) Woods, D., The Theory of Graceful Extensibility: Basic rules that govern adaptive systems. Springer Nature (2018).