Thursday, October 3, 2013

The Big Data Conundrum: How to Define It?

Big Data is revolutionising 21st-century business without anybody knowing what it actually means. Now computer scientists have come up with a definition they hope everyone can agree on. 




One of the biggest new ideas in computing is “big data”. There is unanimous agreement that big data is revolutionising commerce in the 21st century. When it comes to business, big data promises unprecedented insight and improved decision-making, and reveals untapped sources of profit.

And yet ask a chief technology officer to define big data and he or she will scuff their feet and stare at the floor. Chances are, you will get as many definitions as there are people you ask. And that’s a problem for anyone attempting to buy, sell or use big data services: what exactly is on offer?

Today, Jonathan Stuart Ward and Adam Barker at the University of St Andrews in Scotland take the issue in hand. These guys survey the definitions offered by the world’s biggest and most influential hi-tech organisations, then attempt to distil from all this noise a definition that everyone can agree on.

Stuart Ward and Barker cast their net far and wide, but the results are mixed. Formal definitions are hard to come by, with many organisations preferring to give anecdotal examples.

In particular, the notion of ‘big’ is tricky to pin down, not least because a dataset that seems large today will almost certainly seem small in the not-too-distant future. Where one organisation gives hard figures for what constitutes ‘big’, another gives a relative definition, implying that big data will always be more than conventional techniques can handle.

Some organisations point out that large datasets are not always complex and small datasets are not always simple. Their point is that the complexity of a dataset is an important factor in deciding whether it is ‘big’.

Here is a summary of the kinds of descriptions Stuart Ward and Barker gathered from various influential organisations:

1. Gartner. In 2001, a META Group (now Gartner) report noted the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. This report predated the term ‘big data’ but proposed a three-fold definition encompassing the “three Vs”: Volume, Velocity and Variety. This idea has since become popular and sometimes includes a fourth V, Veracity, to cover questions of trust and uncertainty.

2. Oracle. Big data is the derivation of value from traditional relational-database-driven business decision making, augmented with new sources of unstructured data.

3. Intel. Big data opportunities emerge in organisations generating a median of 300 terabytes of data a week. The most common forms of data analysed in this way are business transactions stored in relational databases followed by documents, email, sensor data, blogs and social media.

4. Microsoft. “Big data is the term increasingly used to describe the process of applying serious computing power - the latest in machine learning and artificial intelligence - to seriously massive and often highly complex sets of information.”  

5. The Method for an Integrated Knowledge Environment open source project. The MIKE project argues that big data is not a function of the size of a dataset but its complexity. Consequently, it is the high degree of permutations and interactions within a dataset that defines big data.

6. The National Institute of Standards and Technology. NIST argues that big data is data which “exceed(s) the capacity or capability of current or conventional methods and systems”. In other words, the notion of “big” is relative to the current standard of computation.

A mixed bag if ever there was one.

In addition to the search for definitions, Stuart Ward and Barker attempted to better understand the way people use the phrase “big data” by searching Google Trends to see which words are most commonly associated with it. They say these are: data analytics, Hadoop, NoSQL, Google, IBM and Oracle.

These guys bravely finish their survey with a definition of their own in which they attempt to bring together these disparate ideas. Here’s their definition:

“Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”
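
To make that last clause concrete, here is a minimal sketch of the MapReduce idea in plain Python, using the word-count example traditionally used to introduce the model. It is illustrative only and not drawn from the paper; real frameworks such as Hadoop distribute the two phases across many machines, whereas this runs on one.

```python
# A minimal, single-machine sketch of the MapReduce programming model:
# count word frequencies across a collection of documents.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum the emitted counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["Big data is big", "data about data"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The point of the split is that the map phase is embarrassingly parallel, which is what lets the technique scale to datasets too large for a single machine, the very property the definitions above keep circling.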

A game attempt at a worthy goal: a definition that everyone can agree on is certainly overdue.

But will this do the trick? Answers please in the comments section below.

Ref: arxiv.org/abs/1309.5821: Undefined By Data: A Survey of Big Data Definitions


