Building Software with Failure in Mind

How do you handle failure within your application software? When an exception occurs, how is it handled? Do you write to a log file, generate a tracking ID, and show the user an error message? Does that message give the user enough information to understand what to do next? Or does the user retry several times until it suddenly works?

Any of the above can happen when your application cannot connect to the database or a web service is timing out. How will the user know how to proceed? Let’s assume you show an error message or error screen with a friendly statement that the request could not be processed, along with a technical support phone number. Is that enough?
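One way to tie the log entry, the tracking ID, and the user-facing message together is sketched below. This is a minimal illustration, not a prescribed design: `handle_request`, the 12-character ID format, and the wording of the message are all assumptions made for the example.

```python
import logging
import uuid

logger = logging.getLogger("app")


def handle_request(action):
    """Run action(); on failure, log the details and return a friendly message.

    `action` is any zero-argument callable standing in for real request logic.
    """
    try:
        return {"ok": True, "result": action()}
    except Exception:
        # Short random ID that appears both in the log and in the user message,
        # so support can find the matching log entry quickly.
        tracking_id = uuid.uuid4().hex[:12]
        # logger.exception logs at ERROR level and includes the stack trace.
        logger.exception("request failed tracking_id=%s", tracking_id)
        return {
            "ok": False,
            "tracking_id": tracking_id,
            "message": (
                "We could not process your request. Please try again in a few "
                f"minutes, or contact support and quote reference {tracking_id}."
            ),
        }
```

The user sees what to do next and a reference to quote, while the technical detail stays in the log under the same ID.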

What if the application fails to connect to the database? Can it tell from the status returned that the condition is recoverable, so that the connection and query can be retried? One example of this is Oracle’s Fast Connection Failover. When it is configured, the application receives an Oracle status indicating a recoverable error. As an engineer or architect, have you considered retrying the connection when this technology is available?

Applications should examine every connection they make, whether to a database or to a service, and determine whether an automatic retry can be built in. Not every failure can be retried, but have you at least considered this in your design?
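A minimal sketch of an automatic retry follows, assuming the application can classify a failure as recoverable. `TransientError` is a hypothetical marker exception invented for this example; in practice you would map your driver's or client's "recoverable" status onto something like it. The attempt count and delays are illustrative defaults.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying only on TransientError with exponential backoff.

    Any other exception propagates immediately, because not every failure
    can be retried safely.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait 0.5s, 1s, 2s, ... (with the defaults) before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
```

Keeping the retry decision in one helper makes it easy to see, and to test, which operations are retried and which fail fast.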

Background or batch processes need to be designed with failure in mind. If you are using an event-driven architecture and a background process hits an error while processing an event message, how is the error tracked? If you are using a message broker, is the failed message written to an error message queue? You cannot lose a message; you must work out how to reprocess it. Alerting must be part of this as well, so that someone knows when the background process fails. Again, how the exception is handled is critical.
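The error-queue idea can be sketched with Python's in-memory `queue.Queue` standing in for a real message broker. The function names `drain` and `replay` and the shape of the error entry are assumptions for illustration; a real broker would have its own mechanism for error or dead-letter queues.

```python
import logging
import queue

logger = logging.getLogger("worker")


def drain(work_queue, error_queue, handler):
    """Process every queued message; a failed message is parked, never dropped."""
    while True:
        try:
            message = work_queue.get_nowait()
        except queue.Empty:
            return
        try:
            handler(message)
        except Exception as exc:
            # Log for alerting, then keep the message plus the reason it failed.
            logger.error("message %r failed: %r", message, exc)
            error_queue.put({"message": message, "error": repr(exc)})


def replay(error_queue, work_queue):
    """Move parked messages back to the work queue; return how many were moved."""
    moved = 0
    while True:
        try:
            entry = error_queue.get_nowait()
        except queue.Empty:
            return moved
        work_queue.put(entry["message"])
        moved += 1
```

One failing message does not stop the others from being processed, and the parked entry carries enough context for someone to investigate before calling `replay`.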

Hopefully, once the issue has been debugged and corrected, you can simply move the message off the error message queue and back onto the normal processing queue, where it will then be processed successfully. But what about the massive batch process that reads a 10,000-record file to update your data? If the process fails after processing 9,998 records, are the recovery instructions simply to rerun the job? Does that mean you have to process those 9,998 records again?

Consider using a checkpoint / restart methodology in your batch processing, where you create a checkpoint after each commit point. For example, commit after every 100 records and then record the last record processed. When the batch process is restarted, it needs logic to detect that this file is being restarted, skip the records that were already committed, and resume after the last recorded checkpoint. If you do log an error, what information is being logged? You should ensure you capture the key information the support engineer needs to figure out what happened. Ideally, the application has a standard exception manager that is visible to the critical support team and tracks the number and types of errors so that metrics can be gathered.
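The checkpoint / restart approach described above might look like the following sketch. It assumes the checkpoint is a small JSON file next to the job and that `apply_batch` (a hypothetical callable supplied by the job) commits each chunk as a unit; the function names and file format are illustrative only.

```python
import json
import os


def load_checkpoint(path):
    """Return the number of records already committed (0 on a fresh run)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["records_done"]


def save_checkpoint(path, records_done):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"records_done": records_done}, f)
    os.replace(tmp, path)  # atomic rename


def run_batch(records, apply_batch, checkpoint_path, commit_every=100):
    """Process records in chunks, resuming after the last recorded checkpoint."""
    done = load_checkpoint(checkpoint_path)  # a value above 0 means a restart
    while done < len(records):
        chunk = records[done:done + commit_every]
        apply_batch(chunk)  # assumed to commit the whole chunk or raise
        done += len(chunk)
        save_checkpoint(checkpoint_path, done)
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # clean finish: the next run starts fresh
    return done
```

With this layout, a failure near the end of a 10,000-record file costs at most one chunk of rework. Note that a crash between the commit and the checkpoint write would replay that one chunk, so the updates should be idempotent, or the checkpoint should be stored in the same transaction as the data.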

The ability to categorize your exceptions is important as well. If you see that you have had over X errors in the last 2 minutes, and they are repeating, then this should set off alarm bells. It is important to ensure the application works correctly and persists the data as it is designed to do. It is just as important to test and understand how the application handles exceptions.
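The "over X errors in the last 2 minutes" check can be sketched as a sliding-window counter per error category. `ErrorRateMonitor`, its parameters, and the category names are assumptions made for this example; a real system would more likely feed these counts into an existing monitoring tool.

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Count errors per category over a sliding time window."""

    def __init__(self, threshold, window_seconds=120, clock=time.monotonic):
        self.threshold = threshold            # the "X" in "over X errors"
        self.window_seconds = window_seconds  # 120 seconds = 2 minutes
        self.clock = clock                    # injectable so tests can fake time
        self.events = {}                      # category -> deque of timestamps

    def record(self, category):
        """Record one error; return True if the category is over the threshold."""
        now = self.clock()
        times = self.events.setdefault(category, deque())
        times.append(now)
        # Drop timestamps that have aged out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.threshold
```

A caller would invoke `monitor.record("db_connection")` from the exception handler and raise an alert whenever it returns `True`.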

In summary, when you design application software you should expect failure. It is not just about redundancy and high availability in the infrastructure; it also needs to be part of the application software design. Think about it as if you are the engineer getting that phone call at 3 am. Do you have enough information in the exception log to tell you what is going on? Once you figure out the issue, can you restart the process, or does the user have to re-enter the data? Ideally, you build in error detection, retry where you know it is safe, and, more importantly, design an automatic restart path.