Friday, March 5, 2010

System Availability & Reliability

The ability to provide services from any number of loosely coupled servers is essential to deploying redundant systems, and redundancy is the key to continued availability. Traditional mainframe environments were designed to be monolithic and had fewer components. Following the precept that if you put all your eggs in one basket, you had better make sure it's a good basket, mainframe systems were designed to be extremely reliable and came embedded with high-availability features. SOA systems, on the other hand, tend to include more moving parts, and more moving parts mean a higher possibility of failure. These parts may also be components that were not engineered or manufactured with the same level of quality control applied to the more expensive mainframe. No use debating it: out of the box, most mainframe systems deliver far higher availability levels.
SOA must overcome these inherent availability issues. The way SOA compensates is by introducing redundant elements, usually via clustering. To take full advantage of the clustering capabilities provided by application server vendors, you should reduce state-dependent services. This reduction facilitates the logical decoupling that allows you to design a very resilient system consisting of active-active components in each layer of the stack, from dual communication links, to redundant routers and switches, to clustered servers and redundant databases.
In the diagram below, a sample mainframe system has, for the sake of discussion, 90% availability (mainframe systems usually have much higher availability ratings; I am using this number to simplify the calculations that follow).

Now, let’s say that you deploy a two-component SOA environment, with each component offering 90% availability…

In this latter SOA system, you should expect the overall system availability to be no greater than 0.90 * 0.90 = 0.81! That is, simply by adding another component to the flow, you have gone from 90% availability to 81%. The reason is that the two components are in series: both have to be functional for the system to operate. In SOA you must compensate by adding fallback components:
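The series calculation above generalizes to any number of components: multiply the individual availabilities together. Here is a minimal sketch, using the hypothetical 90% figures from the example:

```python
# Availability of components in series is the product of their
# individual availabilities: every component must be up for the
# system to be up.
def series_availability(availabilities):
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Two 90% components in series, as in the example above:
print(round(series_availability([0.90, 0.90]), 2))  # 0.81
```

Note how quickly this erodes: a chain of five 90% components would yield roughly 59% availability.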

The overall availability of two systems working in parallel, such as in Cluster B above, is calculated with this formula:
          B = 1 – ((1 – B1) * (1 – B2))
In other words, Cluster B has an availability of 1 – (0.10 * 0.10) = 0.99, or 99%.
The total system availability thus obtained is now:
          0.90 * 0.99 ≈ 0.89
Not quite the 90% delivered by the mainframe solution, but very close. Increasing the number of Node B systems further would raise availability somewhat, but the overall system availability can never exceed that of the weakest link: the least available cluster in the chain. In this case, we can increase availability further by adding a second “A” component.
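The parallel formula above also generalizes: the cluster is unavailable only when every node in it is unavailable. A short sketch, again using the hypothetical 90%-per-node figure, shows both the formula and the weakest-link effect:

```python
# Availability of N redundant nodes in parallel:
# 1 minus the product of the individual unavailabilities.
def parallel_availability(availabilities):
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability

# Two 90% nodes in Cluster B:
print(round(parallel_availability([0.90, 0.90]), 4))  # 0.99

# Adding a third Node B helps only marginally, because the single
# 90% Node A still caps the overall series result:
print(round(0.90 * parallel_availability([0.90, 0.90, 0.90]), 4))  # 0.8991
```

This is the arithmetic behind the weakest-link observation: piling nodes onto Cluster B pushes its availability toward 100%, yet the system as a whole never climbs above Node A's 90%.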

The combined availability of the Node A cluster is now 0.99, and thus the combined system availability is 0.99 * 0.99 > 98%. The resulting system availability is higher than that of any single one of its components! This is where the concepts of decoupling services via interfaces, avoiding state, and encapsulating business services and data so that services can be deployed for horizontal scalability and availability make the SOA approach truly shine.
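Putting the two formulas together for the fully redundant layout, still with the hypothetical 90%-per-node figure:

```python
# Each cluster of two 90% nodes in parallel, then the two clusters
# in series (Cluster A feeding Cluster B).
cluster_a = 1 - (1 - 0.90) * (1 - 0.90)  # 0.99
cluster_b = 1 - (1 - 0.90) * (1 - 0.90)  # 0.99
system = cluster_a * cluster_b

print(round(system, 4))  # 0.9801 -- higher than any single 90% node
```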
Remember that the system will only be as resilient as its weakest point. This means that the fallback philosophy outlined here must be replicated at each level of the operational stack. It serves no purpose to put dual servers in an environment that, for instance, relies on a single communications link or depends on a single router.
Second, never forget the importance of controlling the introduction of new software, configuration changes, and data into the system. Take Google as an example. Google is reputed to have a supremely parallelized environment consisting of not one, not two, but hundreds of thousands of CPUs. In theory, such an environment ought to be almost infallible. Yet, on January 31, 2009, as a result of a trivial mistake during a configuration change, the entire Google system failed. For almost an hour, millions of search requests received results carrying the same warning message: “This site may harm your computer”[1]. It turns out that someone had accidentally added the generic URL ‘/’ to Google’s list of potentially malicious addresses, thereby flagging every URL on the Internet! The computers did not fail, and the network did not fail, but thanks to perennial human frailty, the “super-available” Google system failed just the same.

[1] “Google Glitch Briefly Disrupts World’s Search,” The New York Times.