Friday, January 15, 2010

Data Mapping & Transformation—Part II



Last week, I outlined the various mapping options. The question I left hanging was this: Which approach is the best one?

My experience is that the transformation should take place as soon as possible. For starters, this means that broker-mediated transformations should be avoided, if possible.  The entity doing the transformation must have an understanding of the business processes being mapped and intermediate brokers usually lack this knowledge.
Best is to establish a canonical (i.e. standard) format and then allow both the receiver and the sender translate their respective formats into the chosen canonical form (performance considerations can be dealt with later).  For example, in the modern world, English can be seen as the standard used by all people—A German can communicate with Japanese in English.  In SOA terms, this canonical form may well be a specific set of XML structures.
If a standard protocol is feasible, you will need to decide whether this format will be a subset (a lowest common denominator) of all formats, or whether you will allow the format to carry functions that exceed the capabilities of either one or both of communicating entities.   If the former, you will be forced to “dumb-down” the functionality; if the latter, you will need to restrict the information conveyed by the canonical format in a case-by-case basis. Still, it’s best to make the standard format as comprehensive as possible. It’s always easier to restrict usage of excess functionality than it is to introduce new features during implementation.
If no standard format is feasible because you can’t control the sender or receiver, then you should adopt either a Sender-Makers-Right or a Receiver-Makes-Right approach. In general, the entity that has the better understanding of the business process should take ownership of the mapping.  For example, if you a tourist in another country and use of a canonical language (aka “English) is not possible, then it behooves you to try to speak their local language (i.e. Sender-Makes-Right). After all, it’s unrealistic to expect the local folks will speak your language. On the other hand, if you are visiting the tourism board in a foreign country then you may reasonably assume someone there might speak your language.
Typically, the Sender has a better understanding of the meaning (i.e. “semantics”) of a request. Consider the example where the requester searches for an employee record using the name. The name is in a structured fashion: LastName, FirstName. The server, on the other hand, expects to get the request with a string that contains the “last_name+first_name” (this is a common scenario when the server is a legacy application). The scenario is obvious (I mentioned this was a trivial example!). The requestor (the sender) should create the necessary string. Building the string is much easier for the sender than it is for the receiver. The sender knows the true nature of the last name, while the server’s logic could fail if it tried to derive the last name from regular expression parsing.  (I can’t tell you the number of times I have encountered systems that assume that DEL is my middle name!) Cleary the simple parser used by such software fails to understand that some last names have a space.
This recommendation still leaves open the question of where to do the mapping everything else being equal. My personal view is that when everything is equal, you should put the mapping logic in the server of the request (i.e. Receiver-Makes-Right), simply because it gives you a centralized, single point of control for the mappings. Relying on a Sender-makes-Right scenario places much of the burden on what could eventually become an unmanageable variety of clients. Also, I do suggest that if you decide for one or the other, that you don’t ever mix the approach. That is, if you decide to do a Sender-Makes-Right, do so throughout the system, or vice versa. The hybrid case with mixing Receive-Making-Right with Sender-Making-Right can make the system far too complex and unmanageable.
The corollary to this discussion is that there is a hybrid approach that I believe provides the most flexibility and solves the great majority of transformation needs: using a comprehensive canonical form combined with a Receiver-Makes-Right for cases where the super-set capability exceeds the receiver’s ability. The logic to this approach is that it is easier to down-scope features than it is to second-guess a more powerful capability.
Consider a typical search application scenario: A client sends a search request and the server then prepares a response which includes the found elements; plus ranking scores related to each item returned. The Sender converts the ranking weight factors from a relational database into a “standardized” ranking score system defined by the canonical form. Now, let’s assume the client (the receiver of the response) is not prepared to get or use this extra information. The receiver simply discards the extra information. The down-scoped information loses some of its value, but the client will still be able to present the search results, even if not in a ranked fashion. As long the key results are obtained no major harm occurs. A future, more competent client will be able to use the ranking information. Note that this approach only works if the information being ignored is not essential to the response. If you have a need to ensure essential information is not discarded, you’ll have to define this information as core to the canonical standards.
Yes. Transformation work is sure to have an impact on performance. Next I will cover a technique used to remediate this problem: Caching.