Friday, March 26, 2010

The Systems Management Stack

Managing an SOA environment requires a unified view of all levels of the system components. As mentioned before, the way to ensure a unified management view across all layers is to create a Management Dashboard, a Centralized Logging Repository, and a Single Sign-On Capability. These components ultimately rely on the introduction of probes to monitor all resources and to track the flow of services across the SOA system.
Unfortunately, the software utility industry has yet to catch up with the overall demands of SOA management. After all, it took decades for the systems management suite to evolve around the mainframe model, and the integrated management view required by modern SOA systems is still evolving. This does not mean that you should take the “see no evil, hear no evil” view of the NASA administrator who ignored the request to check out the Shuttle. It simply means that you should endeavor to create the probes and components that will give you at least a minimum of capability in this area.

Notice from the diagram the suggestion that you manage security at each layer of the management stack. Security is not a layer, but rather an attribute of each layer.
Insofar as the entire management cycle is concerned, you should have capabilities for:
· Continuously monitoring the overall health of your system, with the ability to be notified on a trigger basis of events demanding immediate attention.
· Providing the ability to direct specific diagnostic checks to any component or layer in your system on an on-demand basis.
· Maintaining a comprehensive logging repository for all events and traffic taking place in your system. Clearly, this repository could grow to prohibitive levels, but you should at least have the ability to keep a solid log of all messages and events in your system for a period of time, with appropriate summary analytics for the log events that might have to be discarded.
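To make the last capability more concrete, here is a minimal sketch of a log store that keeps full events for a retention window and rolls anything older into summary counts before discarding it. The class name `LogRepository`, its fields, and the (day, layer, severity) summary key are all my own assumptions for illustration; a real repository would sit on a database or a log management product rather than an in-memory list.

```python
from collections import Counter
from datetime import datetime, timedelta


class LogRepository:
    """Hypothetical in-memory log store with a retention window."""

    def __init__(self, retention: timedelta):
        self.retention = retention
        # Full events kept inside the retention window:
        # (timestamp, layer, severity, message)
        self.events = []
        # Summary of discarded events: (day, layer, severity) -> count
        self.summary = Counter()

    def record(self, timestamp: datetime, layer: str, severity: str, message: str):
        self.events.append((timestamp, layer, severity, message))

    def purge(self, now: datetime):
        """Drop events older than the retention window, keeping only counts."""
        cutoff = now - self.retention
        kept = []
        for event in self.events:
            timestamp, layer, severity, _message = event
            if timestamp < cutoff:
                self.summary[(timestamp.date(), layer, severity)] += 1
            else:
                kept.append(event)
        self.events = kept
```

The point of the design is that discarding detail does not have to mean discarding knowledge: the per-day counts remain available for trend analysis long after the individual messages are gone.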
Ideally there should be a unified view where alerts from one layer can be correlated to alerts from another. For this to occur, you will need a canonical way to represent all alerts and events of the various system layers. Unfortunately, chances are that you will have to deal with the formats and interfaces provided by the vendor of choice for each specific layer. If you can afford it, you could add an additional integration component that normalizes the various formats and events around a canonical form that can be used for future analytics.
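As a rough illustration of such a normalizing component, the sketch below maps two vendor-specific alert formats onto one canonical record and then pairs up alerts from different layers that occur close together in time. Both vendor formats, the field names, the severity scale, and the 60-second window are invented for the example; real products will each have their own formats and interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CanonicalAlert:
    layer: str
    source: str
    severity: int  # 0 = info, 1 = warning, 2 = error, 3 = critical
    timestamp: datetime
    message: str


SEVERITY = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def from_network_vendor(raw: dict) -> CanonicalAlert:
    # Hypothetical format: {"dev": ..., "sev": "WARNING", "epoch": ..., "text": ...}
    return CanonicalAlert(
        layer="network",
        source=raw["dev"],
        severity=SEVERITY[raw["sev"].lower()],
        timestamp=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        message=raw["text"],
    )


def from_app_vendor(raw: dict) -> CanonicalAlert:
    # Hypothetical format: {"service": ..., "level": 2, "time": ISO 8601, "msg": ...}
    return CanonicalAlert(
        layer="application",
        source=raw["service"],
        severity=raw["level"],
        timestamp=datetime.fromisoformat(raw["time"]),
        message=raw["msg"],
    )


def correlate(alerts, window_seconds=60):
    """Return pairs of alerts from different layers within the time window."""
    ordered = sorted(alerts, key=lambda a: a.timestamp)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if (second.timestamp - first.timestamp).total_seconds() > window_seconds:
                break  # later alerts are even further away in time
            if first.layer != second.layer:
                pairs.append((first, second))
    return pairs
```

Once every layer speaks the same canonical form, the correlation logic no longer needs to know which vendor produced an alert, and the same records can feed whatever analytics you add later.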
As for the system probes measuring performance within each component, you should ensure that these monitors never add more than a few percentage points of overhead to the system (in my opinion, 5% is the ceiling). Ideally, you will have the option of raising or lowering the degree of monitoring, depending upon circumstances. You can have low-level monitoring for steady-state operations and more intrusive monitoring for those cases where more detailed diagnosis is needed.
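One way to picture an adjustable probe is the sketch below: a wrapper that only counts calls in steady state, adds per-call timing when switched to a diagnostic level, and does nothing at all when turned off. The names `MonitorLevel`, `Probe`, and `overhead_fraction` are hypothetical, and the overhead figure is simply the relative slowdown between a baseline run and a monitored run that you would measure yourself.

```python
import time
from enum import IntEnum
from functools import wraps


class MonitorLevel(IntEnum):
    OFF = 0
    STEADY = 1      # cheap: call counts only
    DIAGNOSTIC = 2  # more intrusive: per-call timings as well


class Probe:
    """Hypothetical probe whose level can be changed at run time."""

    def __init__(self, level=MonitorLevel.STEADY):
        self.level = level
        self.call_counts = {}
        self.timings = {}

    def watch(self, func):
        name = func.__name__

        @wraps(func)
        def wrapper(*args, **kwargs):
            if self.level == MonitorLevel.OFF:
                return func(*args, **kwargs)
            self.call_counts[name] = self.call_counts.get(name, 0) + 1
            if self.level < MonitorLevel.DIAGNOSTIC:
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                self.timings.setdefault(name, []).append(elapsed)

        return wrapper


def overhead_fraction(baseline_seconds, monitored_seconds):
    """Relative slowdown caused by monitoring (0.04 means 4%)."""
    return (monitored_seconds - baseline_seconds) / baseline_seconds
```

A run that takes 100 seconds without probes and 104 seconds with them gives an overhead of 0.04, inside the 5% budget. Because the level is a single attribute on the probe, returning it to steady state after a diagnostic session is a one-line change that can be put under the same change controls as any other configuration.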
Finally, make sure to exercise appropriate change management controls in configuring the monitoring and tracing levels. I can’t count the number of times I have witnessed failures caused by someone “forgetting” to remove diagnostic tools from a production system.