Friday, January 8, 2010

Data Mapping & Transformation—Part I


A basic tenet should be to keep transformations to a minimum. However, it is not always feasible to create completely homogeneous systems. Even if one wishes to use standard means of communication between systems, the reality is that there will sometimes be a need to handle data and protocol mismatches. We frequently must interface the new system with legacy components, or support a federated environment with differing protocols and presentations. More often than not, the system will need to interface with a third-party component that does not use the standard format, semantics, or protocols. Two questions now emerge: how best to make these components talk to each other, and in which components should we deploy the transformation logic?
There are various schools of thought about how to approach the thorny subject of transformation. We end up with the following alternatives: 
·      Broker makes right
·      Sender makes right
·      Receiver makes right
·      Sender and Receiver use a common (“canonical”) format
Broker-Makes-Right is akin to the United Nations, where a diplomat gives a speech in, say, Russian, and the various translators render it into the languages of the listeners.
This works in the UN because the translators are actual human beings capable of human understanding. When relying on automation, the most you can expect the intermediary to perform is straightforward X-to-Y transformations that follow pre-defined mapping rules. Unlike UN translators, automated brokers lack the understanding needed to adapt the mapping intelligently. Broker-mediated transformations only make sense when mapping low-level formatting mismatches. Do you need to convert an integer to a string? No problem. Use an intermediate component. Want to append a null termination to strings coming from A and destined for B? Again, fine.
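To make the idea concrete, here is a minimal sketch of such a rule-driven broker in Python. The field names, the rule table, and the `broker_transform` function are all hypothetical, invented for illustration; the only point is that the broker applies mechanical, pre-defined rules and nothing more.

```python
from typing import Any, Callable, Dict

# A rule maps one field value from A's format to B's format.
FieldRule = Callable[[Any], Any]

# Hypothetical rule table for messages travelling from system A to system B.
A_TO_B_RULES: Dict[str, FieldRule] = {
    "order_id": str,                     # integer -> string
    "customer": lambda s: s + "\x00",    # append a null terminator
}

def broker_transform(message: Dict[str, Any],
                     rules: Dict[str, FieldRule]) -> Dict[str, Any]:
    """Apply the pre-defined rule for each field; pass other fields through."""
    return {name: rules.get(name, lambda v: v)(value)
            for name, value in message.items()}

from_a = {"order_id": 1042, "customer": "ACME", "quantity": 3}
to_b = broker_transform(from_a, A_TO_B_RULES)
```

Note that the broker has no idea what an order or a customer is. It only knows which conversion to run on which field, which is exactly why it should be limited to formatting mismatches.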
Then there are cases where it is never advisable to let the broker perform automated conversions. For instance, currency conversions from Euros to Dollars may require knowledge of specific exchange discounts or the application of exchange fees, or anything else that can be dreamed up by government bureaucracies. These types of “business-based” conversions must be placed in the upper layers that are capable of handling business rules, and not in an intermediate broker.
Sender-Makes-Right advocates argue that the sender of the message should know the capabilities of the recipient and adjust the format and characteristics of the delivery to match them. An analogy is that of a teacher speaking to a group of kindergarteners. The teacher will not use complex words and will make an effort to adapt her language so that the children understand. Sender-Makes-Right assumes the sender typically knows it all and has the power to adapt its messages to nearly any recipient.
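A rough Python sketch of Sender-Makes-Right might look like the following. The recipient names, the `RECIPIENT_FORMATS` registry, and the `render_for` function are assumptions made up for this example; the key feature is that the sender carries the knowledge of what each recipient can accept.

```python
import json
from typing import Dict

# Hypothetical registry: what the sender believes each recipient can accept.
RECIPIENT_FORMATS: Dict[str, str] = {
    "billing": "json",
    "legacy_ledger": "csv",
}

def render_for(recipient: str, record: Dict[str, object]) -> str:
    """Sender adapts the same record to the recipient's expected format."""
    fmt = RECIPIENT_FORMATS[recipient]
    if fmt == "json":
        return json.dumps(record, sort_keys=True)
    if fmt == "csv":
        # Values only, in sorted key order, comma separated.
        return ",".join(str(record[key]) for key in sorted(record))
    raise ValueError(f"no rendering known for format {fmt!r}")

record = {"id": 7, "amount": 12.5}
for_billing = render_for("billing", record)
for_ledger = render_for("legacy_ledger", record)
```

The weakness is visible in the registry itself: every new recipient, and every change in an existing one, forces a change in the sender.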
Receiver-Makes-Right proponents make two arguments. First, there is no way a sender will always know the capabilities and limitations of the receiver. Second, the sender is not necessarily the most powerful component of the system. They therefore hold that the recipient of a message should be able to extract what it needs and transform, as appropriate, the sender’s format. Obviously, if the scope of information delivered by the sender exceeds what the receiver can handle, it is the receiver’s prerogative to dispose of the excess. If the sender has less knowledge than the receiver, then it is easier for the receiver to map the sender’s format and complement the needed information through other means. An example is the manner in which Google applies complex heuristics to infer what the user is requesting.
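Here is a small Python sketch of the receiver's side of that bargain. The field names, the `WAREHOUSE_BY_SKU` lookup, and the `receive` function are hypothetical; they simply illustrate extracting what is needed, discarding the excess, and filling gaps from the receiver's own knowledge.

```python
from typing import Any, Dict

# Fields this (hypothetical) receiver actually needs from the sender.
NEEDED_FIELDS = ("sku", "quantity")

# Hypothetical local knowledge used to fill in what the sender omits.
WAREHOUSE_BY_SKU = {"A-100": "east"}

def receive(message: Dict[str, Any]) -> Dict[str, Any]:
    """Map whatever the sender sent onto the receiver's own record."""
    # Keep only the needed fields; anything extra is discarded.
    record = {name: message[name] for name in NEEDED_FIELDS if name in message}
    # Transform the sender's representation into the receiver's types.
    record["quantity"] = int(record.get("quantity", 0))
    # Complement missing information through other means (a local table here).
    record["warehouse"] = (message.get("warehouse")
                           or WAREHOUSE_BY_SKU.get(record.get("sku"), "unknown"))
    return record

received = receive({"sku": "A-100", "quantity": "3", "promo_code": "X1"})
```

In this sketch the sender's `promo_code` is dropped, the string quantity becomes an integer, and the missing warehouse is looked up locally.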
A final form is what I call the Esperanto approach, more commonly referred to as the canonical style. Here, both the sender and the receiver agree to use a common language, and both take responsibility for translating their respective formats into this common standard.
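As an illustration only, the canonical style could be sketched in Python like this. The `CanonicalCustomer` type, both native formats, and the two translation functions are invented for the example; what matters is that each side knows only its own format and the shared one.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class CanonicalCustomer:
    """The shared format both sides agree on (hypothetical)."""
    full_name: str
    country_code: str

def sender_to_canonical(native: Dict[str, str]) -> CanonicalCustomer:
    # The sender owns the translation from its native format.
    return CanonicalCustomer(
        full_name=f"{native['first']} {native['last']}",
        country_code=native["country"].upper(),
    )

def canonical_to_receiver(customer: CanonicalCustomer) -> Dict[str, str]:
    # The receiver owns the translation into its native format.
    return {"NAME": customer.full_name, "CTRY": customer.country_code}

sent = sender_to_canonical({"first": "Ada", "last": "Lovelace", "country": "gb"})
delivered = canonical_to_receiver(sent)
```

With N systems, each one writes a single pair of translations to and from the canonical form, instead of one translation per partner.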
Which is the best approach? Clearly each method has its own issues and advantages. Next week I’ll go over the recommended approaches.