
Anarchy Doesn't Scale: Why Your Data Lake Needs A Government

June 4, 2015   |   7:23 PM
Teradata Articles

With limited access to resources, you don’t need a government. Anarchy works just fine. To put this in context, I’ll start with a story.

Once there was a pristine lake surrounded by rugged mountains. The lake was accessible only by a few intrepid scientists who settled its shores and studied its depths. All were free to explore, filling their notebooks with insights and periodically sharing findings around the campfire. It was a simple time, where a handshake and a chat were all that was needed to keep things tidy.

And then along came the eight-lane highway.


It wasn’t long ago that the Hadoop data lake was the exclusive domain of a handful of data scientists and architects, who used Hadoop to gain insights from large but relatively homogeneous data sources (often Apache weblogs) within the vast lake.

Now more people are getting into the data lake every day. At the same time, that lake is filling with data from an expanding array of heterogeneous sources. When only a few data experts accessed the lake, it was easy for each person to know what the others were working on. Now a marketing analyst might create a script for working with app data, never knowing that the person next door had already spent 20 hours developing the exact same thing. There’s chaos on the data lake: work is being duplicated, users are inadvertently getting stale data, and no one can see where the data came from.

If you are a chief data officer or lead architect, this should give you pause. It’s time to begin thinking about institutional structures that will enable your people to be productive in this suddenly bustling environment. You need some type of governance because anarchy simply doesn’t scale. Let’s consider some requirements.

It’s no longer possible for manual cataloguing to keep pace with the volume and variety of data coming in. You need an automated mechanism that understands when data is flowing into Hadoop, what’s happening to it, where it is, and what format it’s in. To perform accurate analytics, users need information about the data: whether it’s the latest version, whether it’s been cleaned, and where it came from. This information must be captured as data enters the lake. It’s the foundation for a fully documented data catalogue, created automatically, so people can see the expanding variety of data in your Hadoop environment and make optimal use of it.
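To make this concrete, here is a minimal sketch of what automatic capture at ingestion time might look like. All names here (`CatalogEntry`, `register`, the example paths) are hypothetical illustrations, not any particular product's API; a real catalogue would hook into the ingestion pipeline rather than be called by hand.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """Metadata captured automatically when a dataset lands in the lake."""
    path: str        # where the data lives in the lake
    source: str      # originating system or job
    fmt: str         # detected format, e.g. "log", "csv", "parquet"
    ingested_at: str # when it arrived (answers "is this the latest version?")
    checksum: str    # content fingerprint, useful for spotting stale or duplicate copies
    lineage: list = field(default_factory=list)  # upstream datasets it was derived from

def register(catalog, path, source, fmt, raw_bytes, lineage=None):
    """Record a catalogue entry at ingestion time, so users can later
    check freshness, provenance, and format without asking around."""
    entry = CatalogEntry(
        path=path,
        source=source,
        fmt=fmt,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        checksum=hashlib.sha256(raw_bytes).hexdigest(),
        lineage=list(lineage or []),
    )
    catalog[path] = entry
    return entry

catalog = {}
# Raw weblogs arrive from the web servers...
register(catalog, "/lake/raw/weblogs/2015-06-04.log",
         source="apache-httpd", fmt="log", raw_bytes=b"GET /index.html ...")
# ...and a cleaned version records where it came from.
register(catalog, "/lake/clean/weblogs/2015-06-04.parquet",
         source="etl-job", fmt="parquet", raw_bytes=b"cleaned rows ...",
         lineage=["/lake/raw/weblogs/2015-06-04.log"])
print(json.dumps(asdict(catalog["/lake/clean/weblogs/2015-06-04.parquet"]), indent=2))
```

The point of the sketch is the shape of the record, not the storage: because lineage and timestamps are captured the moment data arrives, the "is this stale, and where did it come from?" questions answer themselves.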

Work on the data lake needs transparency so people can share and build on each other’s work. That transparency is also essential for looking across the environment, spotting duplication, and consolidating efforts to get the most from available resources.
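One simple way such duplication can surface is by fingerprinting content: if two datasets hash to the same value, someone has redundantly copied or rebuilt the same thing. This is a toy sketch (it assumes files are small enough to hash in memory, and the paths are invented for illustration):

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """Group lake files by content fingerprint; any group with more than
    one path is a redundant copy or duplicated effort worth consolidating."""
    by_hash = defaultdict(list)
    for path, data in files.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]

files = {
    "/lake/marketing/app_sessions.csv": b"user,ts\n1,100\n",
    "/lake/analytics/sessions_copy.csv": b"user,ts\n1,100\n",  # same bytes, different team
    "/lake/finance/ledger.csv": b"acct,amt\n7,9.5\n",
}
print(find_duplicates(files))  # the two identical session files group together
```

Exact-hash matching only catches byte-identical copies; catching the subtler case, where two analysts independently build near-identical derived datasets, is exactly why the environment-wide visibility described above matters.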

It is also vital to consider the means of access to the data lake. This starts with education. It used to be that if you did not have the skills and training of a data scientist, you could not participate. If you are going to get the most from your data resources, that can no longer be the case. Fortunately, easy-to-use, highly visual data preparation tools are emerging. These solutions democratize access to data and provide greater transparency, making it clear what’s in the data lake and offering visibility into where data came from and how it has been changed or aggregated.

If all of this sounds a bit familiar, that’s no surprise. The data lake was a frontier. Its early success was in part spurred by the lack of rules, which encouraged exploration by a few bold adventurers. As this environment becomes part of mainstream operations, we need structures and technologies to manage the influx of people, data, and use cases. This is not order for order’s sake, but a means of ensuring productivity and encouraging the discovery of insights.

More people and more data are crossing the mountains and approaching your data lake every day. We need structure and transparency to make expanded access and exploding data volume and variety work to our advantage. Now is the time to think about how you can make big data analytics work for multitudes of stakeholders, enabling you to get the most value from your data lake.