Tuesday, September 08, 2015

CIO Deep Dive In The Data Lake

The data lake concept has been around for a few years. It really came into its own in 2014 after PwC implied it can reduce information silos and Gartner warned of its limits. It is certainly a marketing boon for Hadoop architecture vendors and their integration specialists. A data lake that can scale as fast as cloud storage expands means more efficient IT spending, at least initially. It may also mean more IT spending at the back end of a large project if CIOs do not define the lake's governance up front.

The fourth stage of maturity in Edd Dumbill's data lake dream throws down a challenge to IT pros who design strong application clouds. Any CIO with an eye on the long game must design analytics, governance, and security into the data lake at the very beginning. The CIO's budget proposal to the CFO will then be a realistic estimate that won't come back to haunt the company a year later, when the Chief Data Officer (or the CIO again, if they also wear the data hat) asks for more money to make the cloud apps work. If the CIO is realistic about the data lake's eventual requirements, the CFO should not need to explain a negative capex surprise to analysts on a future conference call.

The emergence of XML as a strong industry standard means data lakes should be portable if an enterprise switches to another public cloud IaaS provider. An immature data lake will pose problems for data lifecycle management (DLM) if it ignores the midlife activities of processing and analytics just to save money on storage and retrieval. The Data Management Association (DAMA) Data Management Body of Knowledge (DMBOK) is a CIO's help file for optimizing a data lake's DLM. Call the process data administration or DLM; either way, the result is the same: investment outlay matched to the enterprise's Big Data goals.

Dumping dirty data into a data lake may seem like a cheap and easy way to assemble an integrated database. Completing the dump without layering analytics into the Hadoop structure risks turning it into a data swamp. The swamp metaphor became a joke among IT pros soon after the Gartner release mentioned above hit the wires. Calculating the Cloudonomics risk/return tradeoff of a data lake approach is the CIO's key to staying out of a swamp.