Thinking About Failures

Recently, I was listening to a re:Invent talk where Werner Vogels, Amazon's CTO, mentioned that everything fails all the time. As I was getting close to home, I started thinking about failures. Power failures. Hardware failures. Software failures. Software developers also encounter failures while building applications. One day it might be a database issue. Another day it might be a hardware issue that is preventing you from getting things done.

One of the main problems is that developers are not wired to think about failures. We are builders. We get paid to create new software programs. If you take a look at job descriptions, you will not find many references to failures. In many instances, there is no planning for failures at all.

We can learn a lot from Netflix, which pioneered chaos engineering. Netflix relies on AWS infrastructure to run all its operations. At the beginning, every outage was an opportunity to ask questions without blaming anybody. It was an opportunity to improve the system. By asking the difficult questions, Netflix learned more about its strengths and weaknesses. Soon Netflix realized that it needed to trigger these failures deliberately, under complete control. A dedicated team was created to bring chaos to Netflix's systems and processes. Without notice, the team would take down an availability zone. On another day, it would take out an entire AWS region. When these failures occurred, Netflix's system would stop sending traffic to the failing region and start re-routing it to a healthy one.
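To make the re-routing idea concrete, here is a minimal sketch in Python. It is not how Netflix does it; the `RegionRouter` class, its methods, and the region list are all hypothetical names I made up for illustration. It only shows the basic shape: keep a priority list of regions, mark one as down, and send traffic to the next healthy one.

```python
from dataclasses import dataclass, field


@dataclass
class RegionRouter:
    """Hypothetical router: picks the first healthy region in priority order."""

    regions: list[str]
    unhealthy: set[str] = field(default_factory=set)

    def mark_down(self, region: str) -> None:
        # Simulate a failure (or a chaos experiment) taking a region out.
        self.unhealthy.add(region)

    def mark_up(self, region: str) -> None:
        # The region recovered, so it can receive traffic again.
        self.unhealthy.discard(region)

    def route(self) -> str:
        # Walk the priority list and return the first region still healthy.
        for region in self.regions:
            if region not in self.unhealthy:
                return region
        raise RuntimeError("no healthy region available")


# Illustrative region names only; the order is an assumed priority.
router = RegionRouter(regions=["us-east-1", "us-west-2", "eu-west-1"])
primary = router.route()

router.mark_down("us-east-1")  # inject the failure on purpose
fallback = router.route()      # traffic moves to the next healthy region
```

The interesting part is the `mark_down` call. A chaos experiment is just that line run on purpose, so you find out whether `route` really has somewhere else to go before a real outage asks the same question.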

We, as software developers, need to spend more time thinking about failures because they will happen sooner or later.

What do you think? Do software developers think enough about failures?