Recently I started working on an application that will have to cater to the needs of thousands of users. It is not just the number of users: the application also needs to aggregate data from multiple web services and push data to multiple web services. This might sound simple, but not when you have to talk to about 30 web services that have nothing in common except HTTP and XML. Each web service represents data in a different format, even though most of them deal with a simple text document. This means we need to figure out a way to create the business object from multiple sources and, at the same time, keep the application linear. The complexity of the requirements increases by leaps and bounds when you have to work with live data. Yup, live, up-to-date data. So the only way out seems to be a stateless, asynchronous design. But it is not easy to write stateless asynchronous applications.
You may ask why I am worried about the scalability of the application. Why not let the design evolve over time? My experience with building applications is that you cannot have a scalable design that “evolves”. Not without tons of hard work later, and not without breaking a few things. Writing scalable applications is like building an earthquake-resistant skyscraper. You cannot wait for the earthquake to come before you start making the building earthquake resistant. You have to design it up front and test the model in the lab before you lay the foundation stone of the building.
So what exactly is scalable? The sad part of the computer industry is that we still don't have a scale to measure scalability. What works for one set of data may fail for another. A friend of mine suggested that he considers his application profitable if the cost per transaction is less than the revenue per transaction. I think the logical way to measure scalability would be to measure how far the application can scale while keeping the cost per transaction lower than the revenue per transaction.
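To make that measure concrete, here is a rough sketch in Python. It assumes you can estimate the total cost of running the system at a given load. The function names, the cost model and every number in it are made up for illustration and do not come from any real system.

```python
from typing import Callable, Optional, Sequence


def max_profitable_load(
    loads: Sequence[int],
    cost_at_load: Callable[[int], float],
    revenue_per_txn: float,
) -> Optional[int]:
    """Return the largest tested load whose cost per transaction
    stays below the revenue per transaction, or None if none does."""
    best = None
    for load in sorted(loads):
        if load <= 0:
            continue  # cost per transaction is undefined at zero load
        cost_per_txn = cost_at_load(load) / load
        if cost_per_txn < revenue_per_txn:
            best = load
    return best


def example_cost(load: int) -> float:
    # Hypothetical cost model: a fixed cost, a linear per-transaction
    # cost, and a quadratic term standing in for contention at high load.
    return 100.0 + 0.01 * load + 0.000001 * load * load


if __name__ == "__main__":
    tested_loads = [1000, 5000, 10000, 50000, 100000]
    print(max_profitable_load(tested_loads, example_cost, revenue_per_txn=0.05))
```

With this made-up cost model the application stops being profitable somewhere between 10,000 and 50,000 transactions, so 10,000 is the "scale" of this imaginary system. The real work is in measuring `cost_at_load` honestly.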
So let's try to define stability. To an end user, stability means that the system is available and capable of completing transactions irrespective of load. So first we need to identify what hampers system availability.
- Sudden surge of requests (like being slashdotted)
- Large number of requests being received continuously over a period of time.
- Internal problems like memory leaks.
For point 1 we do have a solution: load testing. That should give you an indication of how long the system will survive before crashing under a sudden surge of requests or, in short, what category of earthquake the building can handle.
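A burst test of this kind can be sketched in a few lines of Python. This is only a minimal illustration: `burst_test` and `fake_handler` are hypothetical names, and the handler is a stand-in for whatever actually sends a request to your system.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def burst_test(handler, n_requests, concurrency):
    """Fire n_requests at handler using `concurrency` worker threads.

    Returns (successes, failures, worst_latency_seconds)."""

    def one_call(i):
        start = time.perf_counter()
        try:
            handler(i)
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(n_requests)))

    successes = sum(1 for ok, _ in results if ok)
    worst = max((latency for _, latency in results), default=0.0)
    return successes, n_requests - successes, worst


def fake_handler(i):
    # Hypothetical stand-in for a real request: every 10th call fails.
    if i % 10 == 9:
        raise RuntimeError("overloaded")


if __name__ == "__main__":
    ok, failed, worst = burst_test(fake_handler, n_requests=100, concurrency=20)
    print(ok, failed, worst)
```

Raising `concurrency` step by step until failures or latency shoot up tells you roughly which category of earthquake the building survives.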
What about point number 2? How do you test a system under a large number of continuous requests? Do you run load tests for a couple of days before releasing a new build to production? One may argue that, given the way most internet companies work, you have to release very often. Fair point, but what is the use of adding that one cool new feature, which your marketing guy badly wants, without testing the system's stability? If your cool new feature crashes, it is only going to shake users' confidence. To handle point number 2, you need to test your application under different load conditions continuously for a few days. I remember building a stock market ticker that would pass all the tests in development but crash in production. We found later that when the application had been in production for 3 days continuously, some parts of the application suffered from data overflow. It might sound like a stupid mistake on the developer's part, but the fact is the company suffered considerable losses because the application kept crashing. And this was in the era when a stock ticker fed by web services was a new feature on the internet, and every business head of a financial site wanted the feature on the site because some competitor had it.
Testing for the longevity of an application is a very important test that is ignored more often than it is conducted. A longevity test can bring out bugs in the application that would go undetected in any other type of testing. The test needs to cover different load conditions at different times. It is equally important to measure the performance of the application from night conditions (low load) through to peak conditions (daytime). The performance of different systems as the application load ramps up or down could reveal some startling facts about your application.
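One way to drive such a test is to generate a load schedule that follows a day and night cycle for several days. The sketch below is only an assumption about what such a profile could look like: the trough at 02:00, the peak at 14:00 and the request rates are invented numbers, and `target_rps` and `soak_schedule` are hypothetical helpers.

```python
import math


def target_rps(hour_of_day, night_rps=5.0, peak_rps=100.0):
    """Smooth daily load curve, in requests per second.

    Assumed shape: lowest at 02:00, highest at 14:00."""
    phase = (hour_of_day - 2.0) / 24.0 * 2.0 * math.pi
    weight = (1.0 - math.cos(phase)) / 2.0  # 0 at 02:00, 1 at 14:00
    return night_rps + (peak_rps - night_rps) * weight


def soak_schedule(days, step_hours=1.0):
    """Yield (elapsed_hours, rps) pairs covering a run of `days` days."""
    steps = int(days * 24 / step_hours)
    for i in range(steps):
        elapsed = i * step_hours
        yield elapsed, target_rps(elapsed % 24.0)


if __name__ == "__main__":
    for elapsed, rps in soak_schedule(days=3, step_hours=6.0):
        print(f"t+{elapsed:5.1f}h  target {rps:6.1f} req/s")
```

Feeding each step's rate into a load generator and recording memory, latency and error counts alongside it is what exposes problems that only show up after days, like the overflow in the stock ticker.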
What about point number 3? It takes some experience to identify internal problems. For instance, a memory leak is far more likely to be spotted by a seasoned programmer than by a newcomer. So code review plays an important part here. But whatever you do, some internal problem or other will arise. You need to build safety nets for such situations, like the air bags for front passengers that inflate automatically when the car is hit. Such impact absorbers handle internal problems and yet let the system keep performing, which is what is known as fault tolerance.
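One common safety net of this kind is a circuit breaker: after a few consecutive failures it stops calling the broken component for a while and serves a fallback instead. The post does not say which mechanism I will end up using, so treat the following as one possible sketch. It is simplified, and the class and its parameters are illustrative rather than taken from any library.

```python
import time


class CircuitBreaker:
    """Simplified circuit breaker: after `max_failures` consecutive
    failures, skip the call and use the fallback for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # still cooling down: do not touch func
            # Cool-down is over: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result


if __name__ == "__main__":
    def flaky_service():
        # Hypothetical dependency that is currently down.
        raise IOError("service unavailable")

    breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
    for _ in range(3):
        print(breaker.call(flaky_service, fallback=lambda: "cached value"))
```

The user still gets an answer, even if a stale one, while the failing part is left alone to recover. That is the air bag inflating.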
So, keeping the above points in mind, I have started designing the application. Currently I am evaluating whether to use an RDBMS or go with NoSQL. I will post about it when I arrive at a decision.