Friday, March 26, 2010

The Systems Management Stack

Managing an SOA environment requires a unified view of all levels of the system components. As mentioned before, the way to ensure a unified management view across all layers is to create a Management Dashboard, a Centralized Logging Repository, and Single Sign-On capability. These components ultimately rely on the introduction of probes to monitor all resources and to track the flow of services across the SOA system.
Unfortunately, the software utility industry has yet to catch up with the overall SOA management demands. After all, it took decades for the systems management suite to evolve around the mainframe model, and the integrated management view required by modern SOA systems is still evolving. This does not mean that you should take the “see no evil, hear no evil” view of the NASA administrator who ignored the request to check out the Shuttle. It simply means that you should endeavor to create the needed probes and components that will give you a minimum of capability in this area.

Notice from the diagram that security is managed at each layer of the management stack. Security is not a layer, but rather an attribute of each layer.
In so far as the entirety of the management cycle is concerned, you should have capabilities for:
·         Continuously monitoring the overall health of your system, with the ability to be notified on a trigger basis of events demanding immediate attention.
·         Providing the ability to direct specific diagnostic checks to any component or layer in your system on an on-demand basis.
·         Maintaining a comprehensive logging repository for all events and traffic taking place in your system. Clearly, this repository could grow to prohibitive levels, but you should at least have the ability to keep a solid log of all messages and events in your system for a period of time, with appropriate summary analytics for the log events that might have to be discarded.
Ideally there should be a unified view where alerts from one layer can be correlated to alerts from another. For this to occur, you will need a canonical way to represent all alerts and events of the various system layers. Unfortunately, chances are that you will have to deal with the formats and interfaces provided by the vendor of choice for each specific layer. If you can afford it, you could add an additional integration component that normalizes the various formats and events around a canonical form that can be used for future analytics.
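To make the idea of a canonical form concrete, here is a minimal sketch. The two vendor formats (a syslog-style line and an application-server event dict), the field names, and the severity codes are all hypothetical; the point is only that each adapter maps its native format onto one shared shape that downstream analytics can rely on.

```python
from datetime import datetime, timezone

def normalize_syslog(line):
    """Map a simplified syslog-style record ('severity | component | message')
    onto the canonical alert form."""
    severity, component, message = line.split("|", 2)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "layer": "os",
        "component": component.strip(),
        "severity": severity.strip().upper(),
        "message": message.strip(),
    }

def normalize_appserver(event):
    """Map a hypothetical application-server event dict onto the same form,
    translating the vendor's one-letter severity codes."""
    return {
        "timestamp": event["time"],
        "layer": "application",
        "component": event["source"],
        "severity": {"E": "ERROR", "W": "WARNING", "I": "INFO"}[event["level"]],
        "message": event["text"],
    }
```

Once every layer's alerts pass through an adapter like these, correlating an operating-system warning with an application error becomes a query over one schema rather than a format-by-format translation exercise.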
As for the system probes measuring performance within each component, you should ensure that these monitors never add more than a few percentage points of overhead to the system (<5% is as high as it should be, in my opinion). Ideally, you will have the option of heightening or lowering the degree of monitoring, depending upon circumstances. You can have low level monitoring for steady-state operations and more intrusive monitoring for those cases where more detailed diagnosis is needed. 
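One common way to keep probe overhead within that budget is sampling: measure only a small fraction of calls in steady state, and raise the fraction to 100% when detailed diagnosis is needed. The sketch below illustrates the idea; the class name, the 1% default rate, and the print-based reporting are all illustrative choices, not a prescription.

```python
import random
import time

class Probe:
    """A probe whose intrusiveness can be dialed up or down at runtime.

    In steady state only a small sample of calls is timed, keeping the
    overhead low; in diagnostic mode every call is measured.
    """

    def __init__(self, sample_rate=0.01):
        self.sample_rate = sample_rate  # 1% sampling in steady state

    def set_diagnostic_mode(self, on):
        # Full tracing when diagnosing; back to light sampling otherwise.
        self.sample_rate = 1.0 if on else 0.01

    def measure(self, fn, *args, **kwargs):
        if random.random() >= self.sample_rate:
            return fn(*args, **kwargs)  # fast path: no instrumentation at all
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__} took {elapsed * 1000:.2f} ms")
```

Flipping `set_diagnostic_mode` on and off is exactly the kind of change that should go through the change-management controls discussed next: a probe left at full tracing in production is a self-inflicted outage waiting to happen.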
Finally, make sure to exercise appropriate change management controls in configuring the monitoring and tracing levels. I can’t count the number of times I have witnessed failures caused by someone “forgetting” to remove diagnostic tools from a production system.

Friday, March 19, 2010

SOA Systems Management

On February 1, 2003, the Columbia space shuttle disintegrated over Texas upon re-entry. The cause of the tragedy was the damage sustained to the wing during liftoff. This had been a two-week space mission, mind you, and many at NASA had been aware for days of the potential problem after seeing videos of debris hitting the leading edge of the shuttle’s wing during liftoff. In fact, after receiving a request from the Debris Assessment Team (DAT) to have spy satellites take pictures of the shuttle as it circled the Earth, NASA’s Columbia Mission Management Team leader answered with a “. . . this is not worth pursuing, for even if we see some damage to the shuttle, we can’t do anything about it”.
Just as with Apollo XIII, when NASA ingenuity and genius saved, against all odds, a mission in great peril, it is now believed that, had they been given the opportunity, NASA engineers could have come up with at least two strategies that would have saved the Columbia crew, if not the shuttle itself.
It’s only human nature to close one’s eyes when we believe we are powerless to rectify a problem. I’ve been there. There were times, after deployment of a complex system, when I wished I could simply close my eyes in order to avoid “finding” any problems. I suppose we never outgrow our instinct to “peek-a-boo” with reality, but alas, part of adulthood is the realization that reality often finds a way to bite us on the behind.  Despite the unspoken desire to avoid looking at brewing problems, hoping they will go away (or at least pretending they don’t exist), it’s better to recognize that operating an SOA environment is, in fact, a more complex proposition than operating and managing a traditional mainframe-based environment. SOA demands our full attention and it necessitates the deployment of system and network management components to enable proactive identification and resolution of issues before it is too late to handle them with grace. Successful control of this environment requires that these concepts and tools be in place:
·         Management and Monitoring at each level of the system stack
·         Deployment of a centralized Logging Server
·         Real-time operational dashboards
It also must be said that none of the above would be useful without adequate planning of remediation strategies to deal with failure. These strategies must be part of the overall system organizational governance and will be covered later on when I discuss the administrative and management aspects related to managing the IT transformation.  Next week, I’ll cover the Management and Monitoring components.

Friday, March 12, 2010

Security & Continuance

Security and Continuance aspects should be dealt with simultaneously. They represent two sides of the same coin. As discussed earlier, the continuance cause is advanced by the design of fault-tolerant systems. If you don’t believe me, consider this example: Soon after the People’s Republic of China opened its economy and began the process of establishing a Chinese stock exchange, a group of advisors from a leading US computer company was asked to check out the country’s new electronic trading system developed to support that exchange. Upon inspection they found that the entire system was based on a single server with no fallbacks and no backups. When they were told by a very proud systems engineer that the system was able to process upwards of 300 transactions per second, the American team was flabbergasted. How were they able to achieve such throughput on what was, after all, no more than a single mid-size server? “Well, everything is being kept in memory,” was the response. “But . . . doesn’t that mean that if someone hacks the system, or the system goes down for whatever reason, you are bound to lose all the stock exchange transactions held in memory?” the baffled Americans asked[1]. The Chinese programmer, who clearly at the time was not yet well versed in the principles of Capitalism, thought it over for a moment and then replied, “Well . . . Stock Market . . . very risky business!”
The fault tolerance issues of the trading system could be resolved by using redundant servers and by handling the transactions according to ACID rules, but then system security should also become an intrinsic element of this design. But how much security is appropriate? Instinctively, most security managers would love to encase the system in layer-upon-layer of firewalls and encryption—something I like to call the “Fort Knox in a Box” approach. If you were to carry out the most rigorous of these security recommendations, you would end up with a system that’s not only expensive but also so heavy and burdensome that no one would be able to use it.
There is always a tradeoff between security, business continuance, cost, and performance. What’s the right level? Therein lies the conundrum.
This might be considered controversial but, in my view, as long as the relaxed position does not compromise the core business the way the stock market application in the anecdote did, the right level can only be found by starting from the more relaxed position and calibrating the amount of security or continuance upward. In other words: start simple. Simpler security guidelines are more likely to be followed than complicated rules (in my experience, the strictest parents always wind up with the most rebellious kids!). However, this approach only works well when you have designed the system to be flexible, so that it can quickly accommodate new security layers, and when you can act proactively to preempt any security exposure.
The stock market solution in my story was too flimsy from the get-go, but at least there was a chance to harden the system. In my experience, trying to loosen up a system that was initially over-engineered often results in a structurally weakened system.
When it comes to security and business continuance, one should apply reasonable criteria that can be measured against the actual likelihood and impact of exposure. Paranoia is a good attribute to have when it comes to designing security systems, but hysteria is not. I knew a security manager who wanted to encrypt all the messages flowing in the central server complex, no matter that this complex was decoupled from the outside world by virtue of a DMZ. The argument was that disgruntled employees would still be able to snoop on the unencrypted messages. Assuming, of course, that those disgruntled employees had access to the central complex (not all employees did), the proposed security “solution” was one that would have cost the company many millions of dollars in extra hardware to protect against a possibility that was strictly speculative.
I once witnessed a large web project that initially contemplated placing encryption on every web page via a series of password layers, causing the overall system to perform at a snail’s pace. An effort was made to remove many of the security layers and encryption in order to improve performance, but by then the system had been designed with such an inherently complex structure that it could not be improved upon. The entire effort had to be scrapped, and a less burdensome, more efficient system had to be created from scratch.
Naturally, the encrypt-everything argument could have made sense in the context of a specific critical business system. After all, the degree of security should be commensurate with the consequences of a breach. If we are to protect a nuclear silo, massive security layers make sense; trying to apply that level of security to protect your web server might be overkill. In the case above, the decision not to encrypt all the internal traffic was deemed an acceptable risk given the circumstances.
For instance, consider the need for compliance and certification of industry standards such as the Payment Card Industry (PCI) security standard requiring encryption of all critical credit card information. Even if a literal reading of the standard might allow the transfer of plain credit card information in an internal, controlled environment, one can make the decision to encrypt this information anyway. However, an acceptable compromise implies that only those fields related to the PCI certification need to be encrypted, not all the messages flowing in the core system.
A security strategy whereby assets are safe-guarded on a case-by-case basis according to their criticality is more appropriate than trying to encase the entire system in accordance with its most critical element.
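The field-level compromise above can be sketched in a few lines. Note that this is illustrative only: the field names are hypothetical, and the xor-based “cipher” is merely a stand-in so the example stays self-contained; a real system would use a vetted algorithm such as AES through a proper cryptographic library with managed keys.

```python
# Fields covered by the compliance requirement, per this sketch (hypothetical).
SENSITIVE_FIELDS = {"card_number", "cvv"}

def toy_encrypt(plaintext, key=0x5A):
    """Placeholder cipher for illustration ONLY. Never use xor as real
    encryption; substitute a vetted algorithm (e.g. AES) in practice."""
    return bytes(b ^ key for b in plaintext.encode()).hex()

def protect(message):
    """Encrypt only the compliance-relevant fields of a message,
    leaving the rest of the payload in the clear."""
    return {
        field: toy_encrypt(value) if field in SENSITIVE_FIELDS else value
        for field, value in message.items()
    }
```

The design choice is in `SENSITIVE_FIELDS`: the set of protected fields is driven by the criticality of each asset, not by the most critical element in the system.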

[1] If you remember the discussion on ACID attributes for transaction systems, this would be an example of a transaction environment lacking inherent durability.

Friday, March 5, 2010

System Availability & Reliability

Achieving the ability to provide services from any number of loosely coupled servers is essential in facilitating the deployment of redundant systems. Redundancy is the key to continued availability. Traditional mainframe environments were designed to be monolithic and had fewer components. Following the precept that, if you put all your eggs in one basket, you had better make sure it’s a good basket, mainframe systems were designed to be extremely reliable and to come embedded with high-availability features. On the other hand, SOA systems tend to include more moving parts. More moving parts means a higher possibility of failure. Also, these moving parts may be components that have not been engineered or manufactured with the same high level of quality control applied to the more expensive mainframe. No use debating it: out of the box, most mainframe systems deliver far higher availability levels.
SOA must overcome these inherent availability issues. The way to achieve redundancy in SOA is to introduce redundant elements, usually via clustering. To enable full utilization of the clustering capabilities provided by application server vendors, you should reduce state-dependent services. This reduction facilitates the logical decoupling that allows you to design a very resilient system consisting of active-active components in each layer of the stack, from dual communication links, to redundant routers and switches, to clustered servers and redundant databases.
In the diagram below, a sample mainframe system has, for the sake of discussion, a 90% availability (mainframe systems usually have much higher availability ratings. I am using this number to simplify the following calculations).

Now, let’s say that you deploy a two-component SOA environment with each component giving 90% availability. . .

In this latter SOA system, you should expect the overall system availability to be no greater than 0.90 * 0.90 = 0.81! That is, by virtue of having added another component to the flow, you have gone from 90% availability to 81%. The reason for this is that both components are in a series and both have to be functional for the system to operate. In SOA you must adjust by adding additional fallback components:

The overall availability of two systems working in parallel such as in Cluster B above is calculated by this formula:
          B = 1 – ((1-B1) * (1 – B2))
In other words, Cluster B has an availability of 1- (0.10 * 0.10) = 0.99 or 99%
The total system availability thus obtained is now:
          0.90 * 0.99 ≈ 0.89
Not quite the 90% received from the mainframe solution, but very close.  By increasing the number of Node B systems even more, the availability will increase somewhat. However, the overall system availability can never exceed the availability of the weakest link: the lowest available cluster in the chain. In this case, we could increase the availability further by adding a second “A” component. 

The combined availability of the Node A cluster is now 0.99, and thus the combined system availability is now 0.99 * 0.99 ≈ 98%. This resulting system availability is higher than the availability of any single one of its components! This is when the concepts of decoupling services via interfaces, avoiding state, and encapsulating business services and data so that services can be deployed for horizontal scalability and availability make the SOA approach truly shine.
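The arithmetic above generalizes to any chain of clusters and can be sketched in a few lines (Python here, purely for illustration): components in series multiply their availabilities, while redundant components in parallel multiply their downtimes.

```python
def series(*avail):
    """Combined availability of components in series: all must be up."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def parallel(*avail):
    """Combined availability of redundant components: at least one must be up."""
    downtime = 1.0
    for a in avail:
        downtime *= 1 - a
    return 1 - downtime

# Two 90% components in series:
print(round(series(0.90, 0.90), 4))                                  # 0.81
# A single 90% Node A in front of a two-node 90% Cluster B:
print(round(series(0.90, parallel(0.90, 0.90)), 4))                  # 0.891
# Both tiers clustered with two 90% nodes each:
print(round(series(parallel(0.90, 0.90), parallel(0.90, 0.90)), 4))  # 0.9801
```

The `series` call also makes the weakest-link rule visible: since every factor is at most 1, the product can never exceed the availability of the lowest-rated cluster in the chain.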
Remember that the system will only be as resilient as its weakest point.  This means that the fall-back philosophy outlined here must be replicated for each level of the operational stack. It serves no purpose to put dual servers in an environment that, for instance, relies on a single communications link or that depends on a single router.  
Secondly, you should never forget the importance of controlling the introduction of new software, configuration changes, and data to the system. Take Google as an example. Google is reputed to have a supremely parallelized environment consisting of not one, not two, but hundreds of thousands of CPUs. In theory, such an environment ought to be almost infallible. Yet, on January 31, 2009, as a result of a trivial mistake during a configuration change, the entire Google system failed. For almost an hour, millions of search requests received results flagged with the same warning message: “This site may harm your computer”[1]. It turns out that someone accidentally added the URL ‘/’ to Google’s list of potentially malicious addresses, resulting in the flagging of every URL on the Internet! In this case, the computers did not fail, the network didn’t fail, but thanks to perennial human frailty, the “super-available” Google system failed just the same.

[1] “Google Glitch Briefly Disrupts World’s Search” The New York Times.