When I first edited Wikipedia’s big data topic back in 2010, it was a hodgepodge of confusion and passionate commentary. My original text has survived, albeit with some embellishments:
“Big data usually includes data sets with sizes beyond the ability of commonly used [hardware and] software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.” 1
Most software and hardware products cannot process big data fast enough. This has been going on since the first stacks of IBM punched cards emerged in the 1960s. Data size always outpaces machine sizes and budgets. That’s because machines are finite and data is almost infinite. Today, big data means terabytes or petabytes processed by large parallel processing servers. No, your laptop copying 4TB of iTunes to another hard disk is not big data. The aging SQL Server data mart stuck in a closet is not struggling with big data; it just needs a bigger server.
The current Wikipedia big data definition is unusable because it’s too large and complex. Indeed, it has more than 128 citations. Like most pundits, it also belabors the popular big data three V’s mantra: big data is volume, variety, velocity. Volume – that’s the ‘big’ part. Variety refers to complex data types such as video, JSON, and webpage log files. Velocity means streaming data that arrives continuously throughout the day. Since the three V’s emerged, pundits have added more attributes to the list: veracity, variability, value, and complexity. Exploring the V’s is entertaining, but it is not a definition. A text message arriving every millisecond is not big data. A two-megabyte video is not big data. A petabyte of data is definitely big data. The simple answer is often the best, and in this case it’s intuitively obvious.
Big data is huge quantities of data, of any kind, arriving at any speed. Big data overwhelms most software and hardware combinations. Once data gets really big, scalability enabled by parallel processing is necessary to unlock business value.
Teradata has been in the big data business for decades. Our first system was the DBC/1012 -- the DataBase Computer first sold in 1984. In the 1980s, large, expensive disk drives held only two gigabytes of storage. Analyzing a terabyte was a goal and the inspiration for the name Teradata. Then in 1992, we installed the world’s first terabyte-sized data warehouse. It was an astonishing big data accomplishment at that time. In 2007, we shipped the first petabyte-sized data warehouse. Today, our largest system is 36 petabytes (36,000 terabytes) with 4096 Intel® cores. More than 30 Teradata customers have a single system larger than a petabyte.
Around 2010, an open source product called Hadoop sparked big data publicity. Web properties like Google and Yahoo! turned to MapReduce – the programming model at the core of Hadoop – to cope with overwhelming data volumes. This fostered Hadoop startups such as Hortonworks, MapR, and Cloudera. Hadoop is primarily used to build data lakes. A data lake is part data refinery and part archive: it collects millions of files, refines some, and dispenses files to downstream systems. Petabyte-sized Hadoop data lakes exist at many internet web properties around the world. Early Hadoop implementations focused on ETL data integration and weblog (clickstream) analysis. It was the ability to process visitor mouse clicks on corporate web pages that popularized the term big data, and it has taken off like a rocket ever since.
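To make the MapReduce idea concrete, here is a toy sketch of the pattern applied to clickstream counting – the kind of weblog analysis described above. The log format and field positions are illustrative assumptions, and real Hadoop jobs distribute these phases across many machines; this single-process version only shows the map and reduce steps.

```python
from collections import defaultdict

# Hypothetical log line format: "<user> <page> <status>"
def map_phase(log_lines):
    """Map step: emit a (page, 1) pair for every page hit."""
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2:
            page = fields[1]  # assume the second field is the URL path
            yield (page, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each page key."""
    counts = defaultdict(int)
    for page, n in pairs:
        counts[page] += n
    return dict(counts)

logs = [
    "alice /home 200",
    "bob /products 200",
    "alice /products 200",
]
print(reduce_phase(map_phase(logs)))  # {'/home': 1, '/products': 2}
```

The value of the pattern is that the map step is independent per line and the reduce step is independent per key, which is exactly what lets a cluster run both phases in parallel over petabytes of files.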
Beyond data volume, a consistent theme in every discussion of big data is analytics. Industry analysts, customers, and vendors all agree that analyzing big data has enormous value. If you sift through all the research and press articles on big data, about half will be technical how-to discussions and half will be about analyzing the data. For decades, Teradata has managed big data with two V’s: volume and value. Everything else is a distant second place.
The definition of big data is still ambiguous. Even so, the phrase has been a huge boon to the data management industry. It has captured the imagination of two new communities. First, Java and Python programmers discovered parallelism and scalability. Big data use cases give them ways to enrich their corporations with new applications. Second, journalists found a gold mine of interesting stories to write about big data. Those hot stories have lasted almost five years. Many press articles explore how big data analytics benefit corporations and society. These two communities have boosted the visibility of big data analytics. Corporations, consumers, HR departments, grandmothers, and students now understand big data. The phrase big data is a BIG success.
So let’s keep this going another five years. What is YOUR definition of big data?
1 Big data: https://en.wikipedia.org/wiki/Big_data