Ever been frustrated when your software stops responding and crashes, throwing up a popup asking you to send the information to xyz developer? As software grows more complex, hangs and crashes seem to be getting more common. The classic case was the Windows blue screen. Thanks to Microsoft for largely getting rid of the ugly blue screen and keeping the Windows OS running when one program misbehaves.
Even though the whole world was fed up with the blue screen, programmers have not yet learned that one misbehaving component should not bring the whole system down. With the rise of Web 2.0 we have seen a rise in rich internet applications and a rise in hanging browsers. One misbehaving plugin or tab can crash the whole browser. Gosh, did they forget about keeping the program stable while writing the initial code? Or was introducing a new uber-cool unstable feature more important than overall software performance?
It is very easy to blame software authors for all the mess, but let's spend some time trying to understand what causes software to fail and how to avoid failures.
A software failure can be broken into three parts: fault, error and failure. A fault (or bug) is the defect that produces an error, and an error in turn leads to a failure. An error is a state of the system under investigation, a state that can bring down the whole system. So our discussion will focus on handling the system states that can lead to failures. Fault tolerance is the set of techniques aimed at detecting, isolating and recovering from a computational state that can lead to failure. In "Software Fault Tolerance Techniques and Implementation", Laura Pullum identifies four steps for fault tolerance:
- Error detection, to identify that the system is in an erroneous state.
- Error diagnosis, to identify the cause of the error.
- Error containment, to prevent further damage.
- Error recovery, the transition from an erroneous state to an error-free state.
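To make the four steps concrete, here is a minimal sketch in Python built on exception handling. Everything in it is made up for illustration: the `SensorError` class, the `read_sensor` and `safe_average` functions, and the plausible range of -50 to 150 are my assumptions, not something taken from Pullum's book. It is one possible way to map the steps onto code, not the way.

```python
class SensorError(Exception):
    """Hypothetical error: a reading is outside the plausible range."""


def read_sensor(raw):
    """Convert one raw reading to a float, rejecting implausible values."""
    value = float(raw)  # raises ValueError if raw is not numeric text
    # Step 1, detection: notice the erroneous state before it spreads.
    if not -50.0 <= value <= 150.0:  # assumed plausible range
        raise SensorError(f"reading {value} out of range")
    return value


def safe_average(raw_readings, default=0.0):
    """Average the good readings and report why the bad ones were dropped."""
    good = []
    diagnostics = []
    for raw in raw_readings:
        try:
            good.append(read_sensor(raw))
        except (ValueError, SensorError) as exc:
            # Step 2, diagnosis: record what went wrong and for which input.
            diagnostics.append(f"{raw!r}: {exc}")
            # Step 3, containment: the bad reading is dropped here and
            # never reaches the average computed below.
            continue
    # Step 4, recovery: carry on from an error-free state, falling back
    # to a default when no usable reading is left.
    average = sum(good) / len(good) if good else default
    return average, diagnostics


if __name__ == "__main__":
    average, problems = safe_average(["20", "oops", "999", "22"])
    print(average)   # 21.0
    for line in problems:
        print(line)
```

The two bad readings ("oops" and "999") are reported in the diagnostics list, but they do not stop the other readings from being averaged. One bad input does not bring the whole computation down, which is the whole point.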
The simplest approach to fault tolerance is the try-catch block found in most object-oriented languages. As soon as an error is detected an exception is thrown, and a catch block isolates the error, giving you an option to recover from the fault. A simple solution which works… But exception handling is a programming language feature, whereas software is made of components, so one example is not enough. In this series I will pick examples from FOSS projects and show how fault tolerance can be built into a system. So see you for more…