Tuesday, April 29, 2014

Paving the yellow brick road to Big Data

One of my favorite treats as a young child in Mexico City was a candy called “Suertes” (“Luckies”). It consisted of a cardboard roll containing little round candies known as “chochitos” and a small plastic toy (the toy was the lucky surprise: usually a cheap top, a miniature car, or a soldier figurine). It was a cheap treat; think of a third-world version of Kinder Eggs. Less third-world was the way these “Suertes” were packaged. I now know that each roll was formed with a recycled IBM punch card further wrapped in rice paper to prevent the diminutive round chochitos from falling through the used card’s EBCDIC-encoded perforations[1].
Since the cards were essentially eighty-column data encoders, I came to this conclusion: Data is fungible; it can even be used to wrap candies!
While the world’s population has more than doubled since the punch card days, data storage capability has grown exponentially during the same period. In fact, storage capacity is poised to outstrip the maximum information content humanity is able to generate. According to a research study published in the Science Express Journal[2], 2002 was the year when digital storage capacity exceeded analogue capacity. By 2007, 94% of all data stored was digital. While it is estimated that the world had reached 2.75 Zettabytes of total data storage in 2012[3], we are expected to hit the 40 Zettabyte mark by 2020, which comes to about 5.2 Terabytes of data for every human being alive.
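For readers who want to check the per-person figure, here is a quick back-of-the-envelope sketch. The 2020 population of roughly 7.7 billion is my assumption for illustration; it does not come from the study cited above.

```python
# Rough check of the per-capita storage figure quoted in the text.
ZETTABYTE = 10**21  # bytes (decimal units)
TERABYTE = 10**12   # bytes (decimal units)

projected_storage_bytes = 40 * ZETTABYTE  # 40 ZB projected for 2020
assumed_population_2020 = 7.7e9           # ASSUMPTION: not taken from the cited study

per_capita_tb = projected_storage_bytes / assumed_population_2020 / TERABYTE
print(f"{per_capita_tb:.1f} TB per person")  # prints "5.2 TB per person"
```

A different population estimate moves the result by a few tenths of a Terabyte, so the figure is best read as an order of magnitude.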
Not only has digital storage become a dirt-cheap commodity, but advances in compression and search algorithms have turned storage into a dynamically accessible asset—a true source of information. The emergence of the Cloud also allows further storage optimization. (I would be surprised to learn Amazon is storing a copy of your online books in your cloud space versus simply maintaining an index pointing to a single master copy of each book in their catalogue.)
The ability to store huge amounts of data in a digital form speaks to the phenomenon of “Datification”. True, most of what we are now placing in digitized form are pictures and videos, and studies show that less than 1% of all this data has been analyzed. But even as more than half a billion pictures are being posted to social media sites every day, new machine learning techniques to help us analyze this type of graphic content are being developed. There is no doubt that we are truly in the midst of the Digital Era. Or rather, the Era of Big Data . . .

Big Data has been defined as having the following attributes: Volume (obviously!), Velocity (dealing with the need to get data on an on-demand, even streaming, basis), Variety (encompassing non-structured data), and Veracity (making sure the data is trusted). The field of Data Science is being formed around the exploitation of big data, particularly in ways that take advantage of the emergent properties derived from the four-V attributes. The result is that Data is now viewed as a product in its own right.
One of the most exciting ways in which the Data Science/Big-Data phenomenon has delivered value is through the unexpected data correlations that can appear and be exploited for surprising business purposes. You are probably familiar with how Google is able to track flu epidemics based on search patterns, and how companies are finding ways to market to various demographics based on ancillary consumption data (Wal-Mart noticed that, prior to a hurricane, sales of Pop-Tarts increased along with sales of flashlights).
But while all this is fine from a theoretical and anecdotal perspective, as a CIO, CTO, or IT executive for a small or medium-sized company, you would do well to ask: What does all this hype have to do with my company’s bottom line?
In my last article I recommended evaluating potential big-data applications for your business. Even if you do not know precisely how all this big data transformation will impact you, there are steps you can proactively take now. Just as in the story of the Wizard of Oz, this is a case where the journey is part of the destination. You should pave the yellow brick road that will take you there:
  1. Revisit the state of Data Governance in your organization. Obviously you should maintain the traditional SQL-related roles, but transforming towards big data requires a fresh look at storage engineering, data integrity, and data security, as well as the need to train for and secure emerging skills such as those of data scientists.
  2. Establish a “Datification” strategy for your business. Have you ever seen those reality shows about Hoarders? That’s it. You must become a fanatical data hoarder. This is not the time to dismiss any of the data you capture as too insignificant or expensive to store. Part of the strategy is the creation and documentation of a taxonomy of data to better organize and understand potential data interrelations.
  3. Re-focus on data quality and integrity. Review your data cleansing and deduplication processes. Adapt them to meet the higher volumes presented by Datification. The ideal time to ensure the data you capture is as clean as possible is at the point of data acquisition. The old adage of Garbage-In/Garbage-Out still applies with big data, except now the motto is Big Garbage In/Big Garbage Out.
  4. Normalize the data. Just because the data is in digital form does not mean you can use it. Big-data practitioners estimate that about 80% of their work goes into preparing the data in a manner that can be exploited.
  5. Review and adapt the data security strategy. Design your data security strategy from the get-go. I recommend you visit two of my previous blogs discussing the subject of security: “The Systems Management Stack” and “Security & Continuance”. Bottom line, your security strategy should be part of the core data strategy.
  6. Move to the Cloud, even if the cloud is internal. Too much time is being spent deciding whether or not to “Move to the Cloud”. Most businesses I have come across are wary of placing strategic data assets in a public cloud. You should separate the debate as to whether or not to make the move to a public cloud from the need to ensure the data can be in a “cloud” form. You cannot have Datification without Cloudification. This means that you should be using virtualized access and storage of your data to the nth degree. You should decouple all access to the data from its physical location via appropriate service-level interfaces. The decision as to whether to use a local private cloud, network private cloud, public cloud, or any other variation (Platform as a Service, Infrastructure as a Service, etc.) is a topic for another blog article. Be aware that if you try to create and manage your own cloud you will need to secure the appropriate internal engineering resources. This is not an inexpensive proposition. Also, you and your cloud consultant will need to define a Storage Area Network strategy that allows placement of heterogeneous data with large scalable capabilities. Following this route will also require you to define non-SQL data replication, data sharding, and backup strategies. The time to start this process is now.
  7. Conduct a census of useful externally available data. A key premise of big data is the view of data as a product in its own right. Not only are you positioning your company’s data as a capitalizable asset that could potentially be made available to others as a revenue-generating option, but you will also be in a position to access and exploit data assets made available by others. At a minimum, you should conduct a census of potential data set sources openly available from public entities and governments and define a strategy for how you can better exploit these assets.
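To make steps 3 and 4 above a bit more concrete, here is a minimal Python sketch of cleansing, normalizing, and deduplicating records at the point of acquisition. Everything in it is hypothetical and for illustration only: the `AcquisitionFilter` class, the helper functions, and the `name` and `email` fields are my own inventions, and the in-memory set of fingerprints would have to be replaced with a persistent or distributed store at real big-data volumes.

```python
import hashlib
import unicodedata


def normalize_record(raw: dict) -> dict:
    """Normalize keys and string values so equivalent records compare equal."""
    clean = {}
    for key, value in raw.items():
        if isinstance(value, str):
            # Unify Unicode forms, trim, and collapse internal whitespace.
            value = unicodedata.normalize("NFKC", value).strip()
            value = " ".join(value.split())
        clean[key.strip().lower()] = value
    # Hypothetical rule: email addresses are compared case-insensitively.
    if isinstance(clean.get("email"), str):
        clean["email"] = clean["email"].lower()
    return clean


def record_fingerprint(record: dict, key_fields: tuple) -> str:
    """Hash the identifying fields into a stable deduplication key."""
    parts = [str(record.get(field, "")).casefold() for field in key_fields]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class AcquisitionFilter:
    """Cleans each record as it arrives and rejects duplicates."""

    def __init__(self, key_fields):
        self.key_fields = tuple(key_fields)
        self._seen = set()  # in-memory only; an assumption made for brevity

    def accept(self, raw: dict):
        record = normalize_record(raw)
        fingerprint = record_fingerprint(record, self.key_fields)
        if fingerprint in self._seen:
            return None  # duplicate: do not store it again
        self._seen.add(fingerprint)
        return record


# Made-up sample data: the first two rows describe the same person.
incoming = [
    {"Name": "  Ana  Pérez ", "Email": "ANA@Example.com"},
    {"name": "Ana Pérez", "email": "ana@example.com"},
    {"name": "Luis Gómez", "email": "luis@example.com"},
]
gate = AcquisitionFilter(key_fields=("name", "email"))
stored = [r for r in (gate.accept(raw) for raw in incoming) if r is not None]
print(len(stored))  # prints 2
```

The point of the design is simply that normalization runs before the duplicate check, so two sloppy spellings of the same record collapse into one before they ever reach storage.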
Obviously you will have to face the task of justifying the needed investment to your CEO and financial controllers. Projects related to data virtualization intrinsically improve availability, and they, along with other projects dealing with security (PCI or otherwise), should all be justifiable purely on a best-practice, business-continuance basis. You will need to tap into traditional operational budgets to better fund them. Also, this is one of those cases where you will need to find obvious functional features that you can jointly sponsor with your business partners (the proverbial “low-hanging fruit”). If there is not enough money (when is there?), you don’t have to do everything at once. You can begin with the data elements your taxonomy has identified as most essential.
Furthermore, there is an increasing realization that big data can actually be accounted as a company asset. After all, the company valuations of Facebook and Twitter are primarily based on the strength of their data sets. For example, it is currently estimated that the value of each member to Facebook is about $100. Customer acquisition costs in the social media space are usually estimated to be in the range of $5 to $15; so properly structuring consumable data sets can be used as part of your financial justification.
That’s it. This endeavor should keep you busy for a while.  At the end of the road you will have proven you had courage and a heart all along; plus you’ll get a Big Data diploma too!  

[1] Of course, as a child I did not know the punch cards were being repurposed to hold the candy, and so I always wondered why someone would “design” perforated cards to hold the chochitos!
[2] “The World’s Technological Capacity to Store, Communicate, and Compute Information” by Martin Hilbert and Priscila Lopez.
[3] Optimally compressed. One Zettabyte equals one thousand Exabytes. One Exabyte equals one billion Gigabytes or one million Terabytes. The actual digitized speech of all words ever spoken by human beings could be stored in 42 Zettabytes (16 kHz, 16-bit audio). What follows after Zettabytes, in case you are wondering, is: Yottabyte, Xenottabyte, Shilentnobyte, and Domegemegrottebyte, which, in addition to having 18 Scrabble-busting letters in its name, represents 10^33 bytes.