Code Crash Investigations

I’m quite a fan of a Canadian documentary series called Air Crash Investigations. The show re-enacts aviation incidents and the investigations that follow. Admittedly, it’s a little dark, but it’s absolutely fascinating. The debris field can be vast and there’s often very little left of the aircraft. Nonetheless, the investigators painstakingly piece together the chain of events and sequence of errors that led to a plane falling out of the sky.

If you watch enough episodes you start to notice a pattern. It is almost never one person’s fault or one single failure that causes an incident. It is inevitably a sequence of events that come together to form the perfect storm that brings a plane down. The recommendations that result from the investigation always outline a series of changes to the checks and balances around the operation and maintenance of the aircraft. Processes are tightened to ensure that the same sequence of errors is simply no longer possible. The processes surrounding aviation must be repeatable by numerous individuals with the same deterministic outcome. Anything less is a recipe for disaster.

What does this have to do with software, you say? Well …

It’s not uncommon for large software systems to endure periods of instability as they are built out and modified to fit the ever-changing landscape of the business that conceived them. When this happens, alerts and incidents are common, often dragging engineers out of bed in the wee hours to reboot the server yet again. It’s unpleasant and everyone gets very grumpy very quickly, especially our end users.

Engineering teams can be susceptible to accepting the status quo that the system crashes sometimes and needs a reboot. Kubernetes even automates this process for us but, ultimately, rebooting and restarting is not a solution. It makes the symptom go away but the root cause is still there waiting to bring the house down at the least opportune moment.
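To make that concrete, here is a sketch of how easily the reboot gets institutionalised. The deployment name, image, and health endpoint below are hypothetical; the point is that a standard Kubernetes liveness probe will silently kill and restart a failing container, making the symptom vanish without anyone ever looking at why it failed:

```yaml
# Hypothetical deployment fragment. The liveness probe tells Kubernetes to
# restart the container whenever the health endpoint stops responding.
# The crash "goes away" on its own, but the root cause is never investigated.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service        # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example-service:1.0   # hypothetical image
          livenessProbe:
            httpGet:
              path: /healthz           # hypothetical health endpoint
              port: 8080
            periodSeconds: 10
            failureThreshold: 3        # after 3 failed checks, restart silently
```

Automated restarts like this are valuable for availability, but they are a mitigation, not a fix; the investigation still has to happen.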

What if we were to apply the same thinking as aviation investigators? Leave no stone unturned until we find the root cause of that process crash or slow performance. Once we find the issue, we fix it. But we don’t stop there. We keep investigating. How was it possible the error made it into our code? How did it get all the way to production undetected? How might we have prevented that error from being made, or ensured that we caught it early? We keep going until we prevent that error from ever happening again.

Once we start thinking this way, error rates tend to fall, incidents become a rare freak occurrence and everyone gets a peaceful night’s sleep. More importantly, our customers don’t hate us and we actually deliver on that SLA we touted so proudly.

The aviation industry has taken something fraught with risk and made it so safe that almost no one questions boarding a plane that flies ten kilometres high at just shy of the speed of sound. This is engineering excellence, and we can learn much from such an industry.