
My take on the CrowdStrike outage: A culture of hubris creating an inevitable failure

  • Writer: Noah Guttman
  • Jul 23, 2024
  • 3 min read

I realize that this post comes a bit late to the party, as everyone has already had their fun bashing CrowdStrike and decided whom to blame. I want to add my take as someone with more than 30 years of coding experience who has worked at companies ranging from tiny startups to Fortune 50s.

This is fundamentally a failure of process that comes from a bad company culture.

Now that we have the conclusion out of the way, let's go over what exactly the failures were. These can be divided into multiple areas and sub-areas.


Design

Way back when I was learning to write drivers and kernel modules in university (more than 20 years ago), our professor made sure to impart the critical problem with these kinds of modules: they load during boot, when no user input is allowed. If an application has a failure that causes it (or the system) to crash, it is very easy to simply not run that software - but kernel modules are not like that. If they crash, they are likely to do so during boot, and the system can go into a crash loop. This is exactly what happened with the CrowdStrike outage.


For this reason we were taught that, on installation or update of such a module, it was industry standard to have some form of auto-detection/auto-removal/auto-rollback code within the module.


This can be done fairly easily: at module load, check whether this is the first time running this version; if it is, write a marker to disk to that effect and schedule a post-boot job to remove that marker and set another one recording that the module worked. If the system crashes at any point during boot after the module loads, the leftover marker is detected on the next boot and the module can roll itself back, disable itself, or remove itself as desired.
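As a rough illustration of that marker scheme (simplified to userspace Python rather than real kernel/driver code; the file paths and function names are purely my own assumptions, not anyone's actual implementation), the logic can be as small as this:

```python
import os

# Hypothetical marker paths; a real kernel module/driver would use whatever
# early-boot-accessible storage the platform provides, not these files.
PENDING_MARKER = "/var/lib/mymodule/boot_pending"
HEALTHY_MARKER = "/var/lib/mymodule/boot_ok"


def _read(path: str) -> str:
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return ""


def on_module_load(version: str) -> bool:
    """Called as the module loads. Returns True if it is safe to run."""
    if _read(PENDING_MARKER) == version:
        # The last boot loaded this version and never reported success:
        # assume we crashed the boot, so roll back instead of looping.
        rollback_or_disable(version)
        return False
    if _read(HEALTHY_MARKER) != version:
        # First boot with this version: leave a pending marker so a crash
        # before the post-boot job runs is detectable on the next boot.
        with open(PENDING_MARKER, "w") as f:
            f.write(version)
    return True


def post_boot_job(version: str) -> None:
    """Scheduled to run once boot has completed successfully."""
    if os.path.exists(PENDING_MARKER):
        os.remove(PENDING_MARKER)
    with open(HEALTHY_MARKER, "w") as f:
        f.write(version)


def rollback_or_disable(version: str) -> None:
    """Placeholder: restore the previous version, disable the module, alert."""
    pass
```

The point is not the specific mechanism but how little code a basic "did the last boot survive me?" check requires.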


So why was an industry standard not followed by CrowdStrike?


Release day

Ask anyone knowledgeable about the SaaS industry about pushing things to production/customers, and you will hear concerns about making changes too close to weekends or other times when critical staff are going to be away. Exceptions are made only when the change addresses a critical bug already affecting customers, or a security flaw that is putting them at risk. I have found nothing in any release notes or statements showing that either justification existed here.
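To show how cheap such a guard is, here is a minimal sketch of a release-window check a pipeline could enforce (the frozen days and the critical-fix override are assumed policy for illustration, not anything CrowdStrike has published):

```python
from datetime import date

# Assumed policy, purely illustrative: no routine pushes on Friday or the weekend.
FROZEN_WEEKDAYS = {4, 5, 6}  # Mon=0 ... Fri=4, Sat=5, Sun=6


def release_allowed(today: date, is_critical_fix: bool) -> bool:
    """Routine releases are blocked near the weekend; critical fixes override."""
    if is_critical_fix:
        return True
    return today.weekday() not in FROZEN_WEEKDAYS


# Example: a routine push attempted on Friday, 19 July 2024 would be refused.
print(release_allowed(date(2024, 7, 19), is_critical_fix=False))  # False
```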


So why was an industry standard not followed by CrowdStrike?


Rollout

Another backbone of the SaaS industry is the phased rollout: pushing out updates in waves to selected sites/customers/devices. I can't say exactly how long this has been the standard, but it has been for the 15 years I have spent in SaaS. It is just another easy-to-implement way of reducing overall risk. Like the above, exceptions are made for critical updates, but as previously discussed, that was not the case here.
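A minimal sketch of how wave membership can be decided (the hash bucketing and the wave percentages are my own assumptions, purely for illustration):

```python
import hashlib

# Illustrative rollout waves: fraction of the fleet eligible at each stage.
WAVES = [0.01, 0.10, 0.50, 1.00]


def device_bucket(device_id: str) -> float:
    """Map a device id to a stable value in [0, 1) so wave membership is sticky."""
    digest = hashlib.sha256(device_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0x1_0000_0000


def eligible_for_update(device_id: str, current_wave: int) -> bool:
    """A device receives the update only once its bucket falls inside the wave."""
    return device_bucket(device_id) < WAVES[current_wave]


# Example: roughly 1% of devices get the update in wave 0; the rest wait until
# health data from the early waves looks good and the wave index advances.
print(sum(eligible_for_update(f"device-{i}", 0) for i in range(10_000)))
```

Because the bucket is derived from a stable hash of the device id, each device stays in the same wave across pushes, and a bad build can be halted after the first small wave instead of reaching the whole fleet at once.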


What NOT to Blame


Testing/QA

The reality of testing is that you can only test the interactions the developers and QA personnel think of, and that will always be fewer than the total number of possible interactions.


In short, no one's testing is perfect, and it is unfair to blame the testers.


The developer(s) who wrote the code

As with QA/Testing, it is unfair to blame developers for making a mistake - we all make them. If the code worked in the development environment and passed all tests, then there is no reason for the developers to doubt their work. Nor do I find it likely that they violated process (as is being intimated) in order to push out the code. Even if they did, it would still ultimately be a reflection of a bad company culture.


Lower Management

It is never the team leads who get to decide on official process, nor do they have any control over company culture. Those directives always come from the more senior managers, directors, and VPs.


What about DEI?

I have seen several posts blaming DEI for this outage, given that CrowdStrike publicly declares its commitment to the movement, but this is at best an oversimplification. DEI is a result of the same kind of hubris that caused this outage; it did not produce said hubris. No small company can afford to look at anything other than the competence of prospective employees, and those who make such a mistake rapidly go out of business.


Conclusion: Hubris powered incompetence

It was hubris that led CrowdStrike to the incompetent choice not to build automated protection against a kernel boot crash loop. It was hubris that made them ignore standard rollout practices.

