Originally posted 8/30/16 on Data Points by Daniel Graham
We had a Q&A session with Dan about Data Lakes and Hadoop, to get the scoop on what he’ll present at the Teradata PARTNERS Conference. Here’s what we discovered:
1. What is the biggest opportunity right now with a Data Lake and Hadoop?
Answer: Data lakes are best at gathering data from dozens of sources and distributing it to downstream workloads. This is still the biggest opportunity for data lakes. That doesn’t mean ALL data, as some want to believe. Data lakes are great for new workloads, web clicks, Internet of Things data, and other large data volumes. The data lake is not great at storing small files and small data sets, but it often needs them anyway.
Think of the data lake as a huge yogurt shop where programmers and business users go to get data dispensed. Pull the handle, get your data. Add your own sprinkles. Programmers and users are doing this anyway. A data lake is the best way to institutionalize the culture of data hunter-gatherers.
Recommendation: Set the ‘dispensary’ vision in place early. Make it easy for users to get the data they need.
2. Who is finding the most success with Data Lakes and Hadoop in the marketplace?
Answer: Web properties were quick to get involved with Hadoop back in 2010-2012. eBay, Netflix, and Blizzard come to mind first. Data lakes are now mission critical to their businesses. Communications companies like T-Mobile, Telefonica, Comcast, and Verizon quickly followed. Fast forward to today, and companies in every industry are exploiting the value of the data lake. It’s a long list of success stories: CVS Caremark, Home Depot, Auchan, Volvo, Siemens, Western Digital, Bank of America, Pepsi, Dow, and BNSF. These are a few I know about personally.
Recommendation: Read reference stories in your industry and neighbor industries.
3. What is the biggest misconception about Data Lakes and Hadoop?
Answer: The biggest fallacy has always been that Hadoop replaces the data warehouse. That was a rallying cry about cost elimination by early Hadoop aficionados. To date, not one Teradata data warehouse has been replaced by Hadoop. Five years of confusion and failed attempts by some have led to clarity, but not one replacement. Now everyone agrees that the data warehouse and data lake are complementary technologies. Everyone includes Gartner, 451 Group, Cloudera, MapR, and Hortonworks. Hadoop provides new capabilities outside the role of the data warehouse. It’s complementary.
Another myth: too many believe that a data lake is petabytes in size. This is only true for a couple dozen companies. The majority of data lakes are a dozen servers and a few hundred terabytes of storage. Most of the data is cold storage, so it’s easy to bulk up on big disk drives. Petabyte-size data lakes evolve from many mature workloads and full governance.
Recommendation: Understand fully what a data lake is and isn’t. Get references in your industry.
4. If a company were to do only one thing about Data Lakes and/or Hadoop, what would you recommend it be?
Answer: Hire quality consultants like Think Big to start your first data lake or repair a failed effort. Data lakes are still do-it-yourself projects. A trusted partner in the beginning saves a world of pain, detours, and rebuilding in the future. There’s still a lot of emotion in data lake development because so much is still changing every week. And it takes a lot more than HDFS, Pig, and Sqoop programming skills to build a data lake. Good consultants debunk myths and steer developers in the right direction from the outset. Experts like Think Big have done many data lake implementations. They know what works and what doesn’t.
Recommendation: I’m biased and impressed with every Think Big person I know.
5. What is the biggest mistake people are making with Data Lakes and Hadoop?
Answer: The biggest mistake has always been avoiding data governance. Early open source developers loved the lawless wild west. They flew under the radar of operations managers, administrators, auditors, and CIOs. Many IT shops rushed their data lakes into production without basic data management disciplines. Even today, few data lakes have strong security, data cleansing, master data, compliance, roles, standards, or metrics. Metadata lineage is one of the biggest challenges in governing Hadoop. Imagine trying to navigate five million files in the data lake. Do you have the right file? How do you know you calculated the right answer? Don’t confuse white paper claims with governance. Most governance disciplines take years of implementation.
Recommendation: Form a data lake governance council if there isn’t one. Business users and operations managers should be part of the data lake team.
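To make the lineage question concrete — "which of the five million files produced this result?" — here is a minimal sketch of lineage tracking in Python. This is not any particular Hadoop tool’s API (real deployments would use a governance product such as Apache Atlas); the `LineageRegistry` class, its methods, and the file names are all hypothetical, purely for illustration.

```python
# Hypothetical sketch of metadata lineage tracking for a data lake.
# Real governance tools (e.g. Apache Atlas) do this at scale; this
# toy registry only illustrates the idea.

class LineageRegistry:
    def __init__(self):
        # Maps each derived file to the set of files it was computed from.
        self._parents = {}

    def record(self, output_file, input_files):
        """Record that output_file was derived from input_files."""
        self._parents.setdefault(output_file, set()).update(input_files)

    def lineage(self, file):
        """Return every upstream file that contributed to `file`."""
        seen = set()
        stack = [file]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


registry = LineageRegistry()
registry.record("clicks_clean.parquet", ["clicks_raw_2016_08.log"])
registry.record("daily_report.csv", ["clicks_clean.parquet", "customers.csv"])

# Answering "did this report use the right files?"
print(sorted(registry.lineage("daily_report.csv")))
# → ['clicks_clean.parquet', 'clicks_raw_2016_08.log', 'customers.csv']
```

Without some registry like this — maintained automatically as jobs run, not by hand — the "do you have the right file?" question becomes unanswerable once the lake grows past a few hundred data sets.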
6. What is your go to resource on Data Lakes and Hadoop?
Answer: Yikes. Well, I read a lot of Gartner materials on a weekly basis. They are the most accurate and well written. I’m lucky that I can call the architects at Think Big. I read 451 Group research weekly for startup news. Last, I rely on the Netflix and LinkedIn technical blogs. They offer well-written insights from the front lines of data lakes.
7. Where will Data Lakes and Hadoop be in five years?
Answer: Many data lakes will be well governed. Not as well as data warehouses today, but good enough for many purposes, even for some auditors. Second, SQL-on-Hadoop will deliver fast data-mart-style queries (aka joins). Presto is a step in the right direction since it is Hadoop-distribution neutral and fast.
Finally, we will see the separation of storage from compute resources. This will be much more than just using storage area networks. It includes full data virtualization, like QueryGrid, and mature versions of Docker and other virtualization techniques. That’s when the second wave of data lake designs and investments begins.