Prepared for Success

In a prior role as Director of IT Operations, I found myself reporting to a new Vice President.  As expected, he was curious about the operations of the systems and datacenter.  Were the databases clustered?  Were the applications highly available?  Was the network design redundant?  Were all of the drives mirrored?

“Oh, all of the drives are mirrored?” he confirmed as we walked through the datacenter.  “Then you won’t mind if I pull this drive from this server?” he inquired as he smiled and placed his hand on a hot-swappable drive.  After my confident yet nervous confirmation, he pulled the drive from a production server.

That was nearly 10 years ago.  I was confident in a simple drive-mirroring configuration, and the test caused no outage.  The reality is that component failure should be expected, not treated as an unanticipated exception.  Hard drives fail to spin.  Power supplies die.  Web servers crash.  Network cards start losing packets.  Ethernet cables get crimped.  The only question is whether your applications can tolerate the failures.  That simple demonstration planted the seed of my philosophy of expecting and embracing failure.

Do you have confidence that your application, database, web server, firewall, or storage array can lose a component and keep running without interruption?  Or do you simply hope that a failure won’t require too much intervention or cause too much impact?  Do you hesitate to reboot or patch a redundant system because the load balancing or the application may not function properly?

If you are not confident that you can tolerate failure with ease, then here are some steps to move from fear to confidence.

Anticipate failure.  Build a culture where engineers are always considering outlier scenarios.  Just as developers naturally design and test for invalid data or input, they should also consider the impact of system failures across a distributed system.  Make sure engineers understand when to reach for tools such as connection failover, session persistence, queuing, and asynchronous processing.
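
As a concrete illustration, here is a minimal Python sketch of that mindset, assuming a hypothetical billing endpoint and a stand-in work queue: the caller retries a flaky dependency with backoff and, if it stays down, hands the work to a queue for asynchronous processing instead of losing it.

```python
import queue
import time
import urllib.request

# Hypothetical endpoint and retry policy -- substitute your own services.
BILLING_URL = "http://billing.internal.example/charge"
MAX_RETRIES = 3
retry_later = queue.Queue()  # stand-in for a durable message queue

def submit_charge(payload: bytes) -> bool:
    """Try the service a few times; on persistent failure, queue the work
    for asynchronous processing rather than failing the whole request."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            req = urllib.request.Request(BILLING_URL, data=payload, method="POST")
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            time.sleep(2 ** attempt)  # simple exponential backoff
    retry_later.put(payload)  # failover path: process later, don't lose it
    return False
```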

Design for failure.  Applications should be designed from the beginning to anticipate failure scenarios.  Consider what will happen for any given failure.  What will happen if a disk fails?  What will happen if a SAN switch fails?  What will happen if the application cannot connect to a web service?  Engineers sometimes design applications and systems without considering the impact when other components are not working properly.  At other times, they may build a redundant component without configuring the surrounding components to handle the failover.  Failure should be considered at every level, whether in system configuration, code design, network design, or any other area.
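
As one small example of answering “what will happen if the database host is unreachable?” in the design itself, here is a sketch, with hypothetical host names, of simple connection failover across a primary and its replicas.

```python
import socket

# Hypothetical database hosts: primary first, replicas as fallbacks.
DB_HOSTS = [("db-primary.internal", 5432),
            ("db-replica-1.internal", 5432),
            ("db-replica-2.internal", 5432)]

def connect_with_failover(hosts, timeout=3):
    """Walk the host list until one accepts a connection, so a single
    unreachable host does not take the application down with it."""
    last_error = None
    for host, port in hosts:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err  # host unreachable; try the next one
    raise ConnectionError("no database host reachable") from last_error
```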

Test for failure.  Researching, designing, and engineering a system with the most brilliant and thorough redundancy at every layer, according to the most advanced world-class practices, does not guarantee high availability.  Testing how a component responds to failure is as important as designing for failure in the first place.  A system that is designed for failure, but does not actually function at the time of failure, is a failure.  Verify the fault-tolerant design against every failure scenario you can produce.
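
Failure testing can start as small as a unit test that injects the failure.  The sketch below, using a hypothetical recommendations call, simulates an outage of a dependency and verifies that the caller degrades gracefully rather than crashing.

```python
import unittest
from unittest import mock

def fetch_recommendations(get):
    """Call an external service; fall back to a cached default on failure."""
    try:
        return get("http://recs.internal.example/top")
    except ConnectionError:
        return ["default-item"]  # degraded but still functional

class FailureInjectionTest(unittest.TestCase):
    def test_service_outage_falls_back_to_default(self):
        # Inject the failure: the dependency raises instead of responding.
        broken_get = mock.Mock(side_effect=ConnectionError("service down"))
        result = fetch_recommendations(broken_get)
        self.assertEqual(result, ["default-item"])
        broken_get.assert_called_once()

if __name__ == "__main__":
    unittest.main()
```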

Operationalize failure and recovery scenarios.  Because production failure is not routine, teams are usually caught by surprise and unable to respond as quickly as they could.  The moment an application will not start is not the time to dust off the manual and figure things out.  Engineers can create new user accounts with their eyes closed, but the once-every-few-years production database recovery might require significant trial and error.  Instead, build failure and recovery processes into daily operational routines.  Refresh test databases using the same procedures you would use for a production recovery.  Use storage snapshots as a backout strategy when iterating on test e-mail upgrades.
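
As a sketch of this idea, assuming PostgreSQL and hypothetical database names, a daily test-database refresh can run the exact dump-and-restore commands a production recovery would use, so the recovery procedure is rehearsed every day.

```python
import datetime
import subprocess

# Hypothetical database names and backup location -- substitute your own.
PROD_DB = "benefits_prod"
TEST_DB = "benefits_test"
BACKUP = f"/backups/{PROD_DB}-{datetime.date.today()}.dump"

def refresh_test_database():
    """Refresh the test database with the same commands a production
    recovery would use, so the procedure is exercised routinely."""
    # Take a custom-format dump of production (the same artifact a restore needs).
    subprocess.run(["pg_dump", "-Fc", "-f", BACKUP, PROD_DB], check=True)
    # Restore it over the test database -- this is the recovery rehearsal.
    subprocess.run(["pg_restore", "--clean", "--if-exists", "-d", TEST_DB, BACKUP],
                   check=True)

if __name__ == "__main__":
    refresh_test_database()
```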

Automate failure recovery.  Automation of recovery scenarios is the nirvana state for handling failure.  Automation eliminates trial and error, guesswork, and uncertainty.  Automate application restarts.  Automate the detection and restart of failed web servers.  Automate restore processes.  At Benefitfocus, an automated process rebuilds our test systems at our Disaster Recovery site as a full copy of production, using the exact same process we would use in the event of a disaster.  This process runs every single day, supports continuous live testing, and gives us great confidence in our ability to handle a disaster.  No guessing is required.
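
A minimal sketch of automated detection and restart, assuming a hypothetical health endpoint and a systemd-managed web application, might look like this:

```python
import subprocess
import time
import urllib.request

# Hypothetical health endpoint and systemd unit -- adjust for your stack.
HEALTH_URL = "http://localhost:8080/healthz"
SERVICE = "webapp.service"

def healthy() -> bool:
    """Report whether the web server answers its health check."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def watch(interval=30):
    """Detect a failed web server and restart it without human intervention."""
    while True:
        if not healthy():
            subprocess.run(["systemctl", "restart", SERVICE], check=False)
        time.sleep(interval)

if __name__ == "__main__":
    watch()
```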

Create chaos.  Most platforms have evolved, out of necessity, to handle the common failure scenarios, but they may not be prepared for the uncommon ones.  If a component has a high failure rate, you design around it.  If a system only occasionally fails in a test environment, you may just live with the occasional pain.  Netflix introduced the philosophy of Chaos Engineering, in which failure is intentionally injected into production systems to expose weakness and ensure fault tolerance.  The uncommon issues are then addressed.  While not every organization will be comfortable intentionally injecting chaos into production systems (no matter how confident they are in their survivability), introducing failure scenarios into test environments and scheduled maintenance windows can provide tremendous learning opportunities and build tremendous confidence.  So, in addition to testing recovery scenarios as an operational process, also introduce random chaos.  Routinely kill processes in a test environment to see how an application responds.  Shut down web servers to make sure load balancing functions properly.  Pull a network cable from a server.  And when a developer complains about losing connectivity from a test application to a test database, don’t just restart things.  When queued JMS transactions generate errors, don’t just manually re-queue them.  Document and resolve the failover scenarios!
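
A small chaos sketch along these lines, assuming the psutil library and hypothetical target process names, refuses to run outside a test environment and then kills one randomly chosen process so you can watch whether the platform detects and recovers from the loss.

```python
import os
import random
import sys

import psutil  # third-party: pip install psutil

# Hypothetical names of processes that are fair game in the TEST environment.
TARGETS = {"java", "nginx", "postgres"}

def unleash_chaos():
    """Kill one randomly chosen target process to verify recovery actually works."""
    if os.environ.get("ENVIRONMENT") != "test":
        sys.exit("refusing to run outside the test environment")
    victims = [p for p in psutil.process_iter(["name"])
               if p.info["name"] in TARGETS]
    if not victims:
        sys.exit("no target processes found")
    victim = random.choice(victims)
    print(f"killing pid {victim.pid} ({victim.info['name']})")
    victim.kill()  # now watch: does the platform detect the loss and recover?

if __name__ == "__main__":
    unleash_chaos()
```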