No Mind

It's the mind that makes you miss the shot
July 5th, 2009

Scalable application architectures – stability

Recently I started working on an application that will have to cater to the needs of thousands of users. It is not just the number of users: the application also needs to aggregate data from multiple web services and push data back to multiple web services. This might sound simple, but it isn't when you have to talk to about 30 web services that have nothing in common except HTTP and XML. Each web service represents data in a different format, even though most of them deal with a simple text document. This means we need to figure out a way to create the business object from multiple sources while keeping the application's flow easy to follow. The complexity of the requirements increases by leaps and bounds when you have to work with live data. Yup, live, up-to-date data. So the only way out seems to be a stateless, asynchronous design. But it is not easy to write stateless asynchronous applications :(
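To make the stateless, asynchronous idea a little more concrete, here is a minimal sketch in plain Java of the kind of aggregator I have in mind. The endpoints are made-up placeholders and each real web service would of course need its own parser; the only point is that the fetches run concurrently and the tasks share no state:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class Aggregator {

        // Hypothetical endpoints; each real service would need its own parsing.
        private static final String[] ENDPOINTS = {
            "http://service-a.example.com/doc.xml",
            "http://service-b.example.com/doc.xml"
        };

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(ENDPOINTS.length);
            List<Future<String>> results = new ArrayList<Future<String>>();

            // Fire all requests concurrently; no shared mutable state between tasks.
            for (final String endpoint : ENDPOINTS) {
                results.add(pool.submit(new Callable<String>() {
                    public String call() throws Exception {
                        return fetch(endpoint);
                    }
                }));
            }

            // Assemble the business object once every source has answered (or failed).
            for (Future<String> f : results) {
                try {
                    System.out.println("got " + f.get(10, TimeUnit.SECONDS).length() + " bytes");
                } catch (Exception e) {
                    System.err.println("source failed, continuing: " + e);
                }
            }
            pool.shutdown();
        }

        private static String fetch(String endpoint) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
            in.close();
            return sb.toString();
        }
    }

A failed source is logged and skipped, so one slow or dead web service does not hold the whole business object hostage.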

You may argue: why am I worried about the scalability of the application this early? Let the design evolve over time. My experience with building applications is that you cannot have a scalable design that "evolves". Not without tons of hard work later, and not without breaking a few things. Writing scalable applications is like building an earthquake-resistant skyscraper. You cannot wait for the earthquake to arrive before you start making the building earthquake resistant. You have to design for it up front and test the model in the lab before you lay the foundation stone of the building.

So what exactly is scalable? The sad part of the computer industry is that we still don't have a scale to measure scalability. What works for one set of data may fail for another. A friend of mine suggested that he measures his application's profitability by checking whether the cost per transaction is less than the revenue per transaction. I think the logical way to measure scalability would be to measure how far the application can scale while keeping the cost per transaction lower than the revenue per transaction :)
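A toy illustration of that yardstick, with every figure made up: assume each extra server adds a little less capacity than the previous one because of coordination overhead, and watch where the cost per transaction crosses the revenue per transaction. That crossing point is "how far" this imaginary application scales.

    public class ScalabilityYardstick {
        public static void main(String[] args) {
            double serverCostPerMonth = 500.0;   // assumed infrastructure cost
            double revenuePerTxn = 0.0007;       // assumed revenue per transaction

            long txnsPerMonth = 0;
            for (int servers = 1; servers <= 8; servers++) {
                // each extra server adds 20% less capacity than the previous one
                txnsPerMonth += (long) (1000000 * Math.pow(0.8, servers - 1));
                double costPerTxn = (servers * serverCostPerMonth) / txnsPerMonth;
                System.out.printf("%d server(s): %,d txns, cost/txn = %.6f -> %s%n",
                    servers, txnsPerMonth, costPerTxn,
                    costPerTxn < revenuePerTxn ? "still scales" : "stops scaling here");
            }
        }
    }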

So let's try to define stability. To an end user, stability means that the system is available and capable of processing transactions irrespective of load. So first we need to identify what hampers system availability.

  1. A sudden surge of requests (like being slashdotted).
  2. A large number of requests received continuously over a period of time.
  3. Internal problems like memory leaks.

For point 1 we do have a solution: load testing. That should give you an indication of how long the system will survive under a sudden surge of requests, or in short, what category of earthquake the building can handle.
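A crude surge test does not need fancy tooling. The sketch below (the target URL is an assumed local test deployment, and the numbers are arbitrary) releases all clients at the same instant and counts how many requests fail:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicInteger;

    public class SurgeTest {
        public static void main(String[] args) throws Exception {
            final String target = "http://localhost:8080/";  // assumed test deployment
            final int clients = 500;
            final CountDownLatch start = new CountDownLatch(1);
            final CountDownLatch done = new CountDownLatch(clients);
            final AtomicInteger failures = new AtomicInteger();

            for (int i = 0; i < clients; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            start.await();   // everyone fires together
                            HttpURLConnection c =
                                (HttpURLConnection) new URL(target).openConnection();
                            c.setConnectTimeout(2000);
                            c.setReadTimeout(2000);
                            if (c.getResponseCode() != 200) failures.incrementAndGet();
                        } catch (Exception e) {
                            failures.incrementAndGet();
                        } finally {
                            done.countDown();
                        }
                    }
                }).start();
            }
            start.countDown();   // the "earthquake" hits
            done.await();
            System.out.println(failures.get() + "/" + clients + " requests failed");
        }
    }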

What about point number 2? How do you test a system under a large number of continuous requests? Do you do load testing for a couple of days before releasing a new build to production? One may argue that given the way most internet companies work, you have to release very often. Fair point, but what is the use of shipping that cool new feature your marketing guy wants so badly, without testing the system's stability? If your cool new feature crashes, it is only going to shake users' confidence. To handle point number 2, you need to test your application under different load conditions continuously for a few days.

I remember building a stock market ticker which would pass all the tests in development but crash in production. We found out later that once the application had been in production for 3 days continuously, some parts of it suffered from data overflow. It might sound like a stupid mistake by a developer, but the fact is the company suffered considerable losses due to the repeatedly crashing application. And this was in the era when a stock ticker fed from web services was a new feature on the internet, and every business head of a financial site wanted it because some competitor had it.
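For the curious, here is a sketch of that class of bug, not the original ticker code: a running total sized for one day's traffic silently wraps around after a few days of uptime.

    // Sketch of the class of bug described above (not the original code):
    // a counter sized for a day's traffic silently overflows after a few days.
    public class TickerStats {
        private int cumulativeVolume = 0;   // fine for one day, fatal for a week

        public void onTrade(int tradeVolume) {
            cumulativeVolume += tradeVolume;   // wraps past Integer.MAX_VALUE
            if (cumulativeVolume < 0) {
                // In the real incident there was no such guard; the negative
                // value propagated and crashed downstream consumers.
                throw new IllegalStateException("volume overflow: " + cumulativeVolume);
            }
        }

        public static void main(String[] args) {
            TickerStats stats = new TickerStats();
            // ~25 trades/sec for 3 days at an average volume of 400 shares
            // pushes the running total past Integer.MAX_VALUE (~2.1 billion).
            for (long i = 0; i < 25L * 86400 * 3; i++) {
                stats.onTrade(400);
            }
        }
    }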

Testing for the longevity of an application is a very important test that is ignored more often than it is conducted. A longevity test can bring out bugs that will go untraced in any other type of testing. It needs to cover different load conditions at different times: it is equally important to measure the application's performance under night conditions (low load) and peak conditions (daytime). Watching how different parts of the system behave as the load ramps up or down can reveal some startling facts about your application.
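Here is a minimal sketch of such a longevity run, again assuming a local test deployment: it ramps the request rate up and down on a 24-hour cycle and keeps hammering for three days while logging latencies, which is exactly the traffic shape a single short load test never exercises.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SoakTest {
        public static void main(String[] args) throws Exception {
            String target = "http://localhost:8080/";   // assumed test deployment
            long hours = 72;                            // three days
            for (long h = 0; h < hours; h++) {
                // Sinusoidal load: ~10 requests/min at "night", ~100 at "peak".
                int reqPerMin = (int) (55 + 45 * Math.sin(2 * Math.PI * h / 24.0));
                for (int minute = 0; minute < 60; minute++) {
                    for (int r = 0; r < reqPerMin; r++) {
                        long t0 = System.currentTimeMillis();
                        try {
                            HttpURLConnection c =
                                (HttpURLConnection) new URL(target).openConnection();
                            c.setConnectTimeout(2000);
                            c.setReadTimeout(5000);
                            c.getResponseCode();
                            System.out.println("hour " + h + " latency "
                                + (System.currentTimeMillis() - t0) + "ms");
                        } catch (Exception e) {
                            System.err.println("hour " + h + " failure: " + e);
                        }
                        Thread.sleep(60000L / reqPerMin);  // spread requests over the minute
                    }
                }
            }
        }
    }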

What about point number 3? It takes some experience to identify internal problems. For instance, a memory leak is far more likely to be spotted by a seasoned programmer than by a rookie, so code review plays an important part here. But whatever you do, some internal problem or the other will arise. You need to build safety nets for such situations, like air bags that inflate automatically when the car is hit. Such impact absorbers will be able to handle internal problems and yet let the system keep performing, which is what we call fault tolerance.
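As a sketch of such a safety net, the wrapper below quarantines a component after a few consecutive failures and serves a fallback instead. The RiskyComponent interface is just a hypothetical stand-in for any unreliable subsystem; the pattern itself is commonly known today as a circuit breaker.

    public class ImpactAbsorber {
        private final int threshold;
        private int consecutiveFailures = 0;

        public ImpactAbsorber(int threshold) {
            this.threshold = threshold;
        }

        public String call(RiskyComponent component, String fallback) {
            if (consecutiveFailures >= threshold) {
                return fallback;                 // component is quarantined
            }
            try {
                String result = component.invoke();
                consecutiveFailures = 0;         // healthy again
                return result;
            } catch (Exception e) {
                consecutiveFailures++;
                return fallback;                 // absorb the impact
            }
        }

        // Hypothetical interface standing in for any unreliable subsystem.
        public interface RiskyComponent {
            String invoke() throws Exception;
        }

        public static void main(String[] args) {
            ImpactAbsorber absorber = new ImpactAbsorber(3);
            RiskyComponent flaky = new RiskyComponent() {
                public String invoke() throws Exception {
                    throw new Exception("boom");   // always fails, for the demo
                }
            };
            for (int i = 0; i < 5; i++) {
                System.out.println(absorber.call(flaky, "cached copy"));
            }
        }
    }

The threshold and the fallback are deliberately simplistic; the point is that the caller never sees the component's exception, so the rest of the system keeps performing.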

So keeping the above points in mind, I have started designing the application. Currently I am evaluating whether to use an RDBMS or go with NoSQL. Will post about it when I arrive at a decision :) .

More later…

December 29th, 2008

Software fault tolerance

Ever got frustrated when your software stops responding and crashes, throwing up a popup asking you to send the information to xyz developer? With increasing complexity in software we see an increasing trend of software hang-ups. The classic case was the Windows blue screen. Thanks to Microsoft for getting rid of the ugly blue screen and keeping the Windows OS running when one program misbehaves.

Even though the whole world was fed up with the blue screen, programmers have not yet learned that one action by a misbehaving component should not bring the whole system down. With the rise of web 2.0 we have seen a rise in rich internet applications, and a rise in hanging browsers. One misbehaving plugin or one misbehaving tab can crash the whole browser. Gosh, have they forgotten about keeping the program stable while writing the initial code? Or was introducing the uber-cool new unstable feature more important than overall software performance?

It is very easy to blame software authors for all the mess, but let's spend some time trying to understand what causes software to fail and how to avoid failures.

Software failure can be divided into 3 parts: fault, error and failure. A fault, or bug, is what produces an error, and an error leads to failure. An error is a state of the system under investigation, a state that can bring down the whole system. So our discussion will focus on handling the system states that are liable to cause failures. Fault tolerance is the set of techniques aimed at detecting, isolating and recovering from computational states that can lead to failure. In "Software Fault Tolerance Techniques and Implementation", Laura Pullum identifies 4 steps for fault tolerance (a toy example follows the list), viz.:

  1. Error identification or detection.
  2. Error diagnosis, to identify the cause of the error.
  3. Error containment, to prevent further damage.
  4. Error recovery, the transition from an erroneous state to an error-free state.
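Here is that toy example in plain Java, using division by zero as the injected fault; it detects the error, diagnoses its cause, contains it, and recovers with a safe default:

    public class FaultToleranceDemo {
        public static void main(String[] args) {
            int[] samples = {10, 0, 5};
            for (int divisor : samples) {
                int result;
                try {
                    result = 100 / divisor;                // (1) error detected here
                } catch (ArithmeticException e) {          // (2) diagnosis: we know the cause
                    System.err.println("bad divisor: " + e.getMessage());  // (3) contained
                    result = 0;                            // (4) recovery: safe default
                }
                System.out.println("result = " + result);
            }
        }
    }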

The simplest approach to fault tolerance is the try-catch block, and the snippet above already walks through all four steps. As soon as an error is detected an exception is thrown, and the catch block isolates the error, giving us the option to recover from the fault. A simple solution which works… But exception handling is a programming-language feature, whereas software is made of components, so one example is not enough. In this series I will pick examples from FOSS projects and show how fault tolerance can be built into a system. So see you for more…