Friday, November 20, 2009

The Data Sentinel


Data is what we put into the system and information is what we expect to get out of it (actually, there’s an epistemological argument that what we really crave is knowledge. For now, however, I’ll use the term ‘information’ to refer to the system output). Data is the dough; Information the cake. When we seek information, we want it to taste good, to be accurate, relevant, current, and understandable. Data is another matter. Data must be acquired and stored in whatever is best from a utilitarian perspective. Data can be anything. This explains why two digits were used to store the date years in the pre-millennium system, leading to the big Y2K brouhaha (more on this later).  Also, data is not always flat and homogeneous. It can have a hierarchical structure and come from multiple sources. In fact, data is whatever we choose to call the source of our information.
Google has reputedly hundreds of thousands of servers with Petabytes of data (1 Petabyte = 1,024 Terabytes), which you and I can access in a manner of milliseconds by typing free context searches. For many, a response from Google represents information, but to others this output is data to be used in the cooking of new information. As a matter of fact, one of the most exciting areas of research today is the emergence of Collective Intelligence via the mining of free text information on the web. Or consider the very promising WolframAlpha knowledge engine effort (wolframalpha.com) which very ambitiously taps a variety of databases to provide consolidated knowledge to users. There are still other mechanisms to provide information that rely on the human element as a source of data. Sites such as Mahalo.com or Chacha.com actually use carbon-based intelligent life forms to respond to questions.
Data can be stored in people’s neurons, spreadsheets, 3 x 5 index cards, papyrus scrolls, punched cards, magnetic media, optical disk or futuristic quantum storage. The point is that the user doesn’t care how the data is stored or how it is structured. In the end, Schemas, SQL, Rows, Columns, Indexes, Tables, are the ways we IT people store and manage data for our own convenience. But as long as the user can access data in a reliable, prompt, and comprehensive fashion, she could care less whether the data comes from a super-sophisticated object oriented data base or from a tattered printed copy of the World Almanac.
How should data be accessed then? I don’t recommend handling data in an explicit manner the way RDBMs vendors tell you to handle it. Data is at the core of the enterprise, but it does not have to be a “visible” core. You don’t visualize data with SQL. Instead, I suggest that you handle all access to data in an abstract way. You visualize data with services and this brings up the need via a Data Sentinel Layer. This layer should be, you guessed it, an SOA enabled component providing data accesses and maintenance services.
To put it simply, the Data Sentinel is the gatekeeper and abstraction layer for data. Nothing goes into the data storages without the Sentinel first passing it through; nothing gets out without the Sentinel allowing it. Furthermore, the Sentinel allows decoupling of how the data is ultimately stored from the way the data is perceived to be stored. Depending upon your needs, you may choose consolidated data storages or, alternatively, you may choose to follow a federated approach to heterogeneous data. It doesn’t matter. The Data Sentinel is responsible for presenting a common SOA façade to the outside world. 
Clearly, a key tenet should be to not allow willy-nilly access to data by bypassing the Sentinel. You should not allow applications or services (whether atomic or composite) to fire their own SQL statements against a data base. If you want to maintain the integrity of your SOA design, make sure to access data via the data abstraction services provided by the Sentinel services only.
Then again, this being a world filled with frailty, there are three exceptions where you will have to allow SOA entities to bypass the abstraction layer provided by the Sentinel. Every castle has secret passageways. I will cover the situations where exceptions may apply later: Security/Monitoring, Batch/Reporting, and the Data Joiner Pattern.
Obviously, data abstraction requires attention to performance, data persistence, and data integrity aspects. Thankfully, there are off-the-shelf tools to help facilitate this abstraction and the implementation of a Sentinel layer, such as Object-Relational mapping, automated data replication, and data caching products (e.g. Hibernate). Whether you choose to use an off-the-shelf tool or to write your own will depend upon your needs, but the use of those tools is not always sufficient to implement a proper Sentinel.  Object-Relational mapping or use of Stored Procedures, for example, are means to more easily map data access into SOA-like services, but you still need to ensure that the interfaces comply with the SOA interface criteria covered earlier. In the end, the use of a Data Sentinel Layer is a case of applying abstraction techniques to deal with the challenges of an SOA-based system, but one that also demands engineering work in order to deploy the Sentinel services in front of the Data Bases/Sources. There are additional techniques and considerations that also apply, and these will be discussed later on.