Big Data

[Go BIG - or go home!]

Today's topics

Next lecture: the 'HOW' of Big Data [storage, processing]

Context (setting the stage)

The slides that follow are very general in nature - they present the 'big picture' concepts of Big Data. By nature, the content is, relatively speaking, soft/squishy/fluffy..

It *is* important to understand the context in which we will discuss data mining etc. in upcoming lectures, otherwise the material will seem dry/irrelevant.

Big Data, Wordle (TM) summary :)

How many of these buzzwords do _you_ know? :)

What is 'Big Data'?

Big Data has indeed become somewhat of a catch-phrase/buzzword.

But we can provide an operational definition: Big Data is data that is 'too big' to be stored on a single machine, and/or processed by a single machine. In fact, 'single machine' might even mean an entire site (a cluster of machines), if the data is 'too' big to fit in one.
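To make the definition concrete, here is a minimal sketch of the storage half of it. The function names and the capacity/dataset figures are hypothetical, chosen only for illustration; a real sizing exercise would also account for replication, processing memory and headroom, which this ignores.

```python
TB = 10**12  # one terabyte, decimal (SI) convention


def fits_on_one_machine(dataset_bytes: int, machine_capacity_bytes: int) -> bool:
    """True if the dataset can be stored on a single machine."""
    return dataset_bytes <= machine_capacity_bytes


def machines_needed(dataset_bytes: int, machine_capacity_bytes: int) -> int:
    """Smallest number of equal-capacity machines that can hold the dataset."""
    if machine_capacity_bytes <= 0:
        raise ValueError("machine capacity must be positive")
    # Integer ceiling division, so very large byte counts stay exact.
    return max(1, -(-dataset_bytes // machine_capacity_bytes))


# Hypothetical numbers: a 50 TB dataset and machines with 4 TB of storage each.
dataset = 50 * TB
capacity = 4 * TB

print(fits_on_one_machine(dataset, capacity))  # False: 'too big' for one machine
print(machines_needed(dataset, capacity))      # 13
```

Note that the same dataset stops being 'Big' by this test the moment a single machine with 50 TB of storage is available, which is exactly why the definition is tied to machine capacity rather than to a fixed size.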

This definition is intentionally vague, to keep it relevant for the future as well.

Ways to characterize Big Data

What makes Big Data 'big'? The following three (or seven!) 'V's do.

Big Data is data that has:

In other words, it is data that is varied in nature (comprises diverse types), needs to be accessed and used quickly, and comes in large quantities.

Often, three more Vs are also used to characterize Big Data:

More, MORE! (how big?)

Data measurement units are getting bigger - Gigabytes (GB) to Zettabytes (ZB) to Yottabytes (YB) to... Each change is orders of magnitude bigger!!
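To see just how many orders of magnitude separate these units, here is a small sketch that assumes the decimal (SI) convention, where each named unit is 1000 times the previous one. The helper names `bytes_in` and `ratio` are made up for this example.

```python
# Decimal (SI) byte units, smallest to largest; each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bytes_in(unit: str) -> int:
    """Number of bytes in one of the given unit, e.g. 'GB' -> 10**9."""
    return 1000 ** UNITS.index(unit)


def ratio(larger: str, smaller: str) -> int:
    """How many of the smaller unit fit into one of the larger unit."""
    return bytes_in(larger) // bytes_in(smaller)


# Print the power of ten for each unit from GB upward.
for unit in UNITS[3:]:
    print(f"1 {unit} = 10^{3 * UNITS.index(unit)} bytes")

print(ratio("ZB", "GB"))  # 1000000000000
print(ratio("YB", "ZB"))  # 1000
```

So one Zettabyte is a trillion Gigabytes, and a Yottabyte is a thousand times bigger again.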

Big Data is not only big, but getting bigger at a rapid rate..

Big Data infographics

Appropriately enough, we can get an overview, from this pic:

Here is another summarization.

Concern over security/privacy

Purchase history of products+services could reveal a lot.

Vehicle tracking: license plate pic capture is legal.

TACMA - OMG. And, More OMG.

Privacy and security are at odds at times: NSA large-scale surveillance, 'No Fly' List, real-time face tracking..

Are you being... spied on?

Some data-related issues

Big Data can be quite useful if collected, analyzed and interpreted properly. Here are things that can be problematic:

TMD (Too Much Data)!

Storing huge amounts of data costs time and money; retrieval could be problematic, and analysis will cost as well. Somewhat oblivious to such concerns, we are creating a data deluge! How come? Because sensors are ubiquitous, storage is cheap, and we feel we 'need to' [FOMO].

Maybe we need to be prudent, in our quest to squeeze wisdom out of Big Data: "The purpose of [scientific] computing is insight, not numbers." - Richard Hamming (1962), in 'Numerical Methods for Scientists and Engineers'

Is it all (not) just hot air?

So, which is it? BOTH!



Why deal with Big Data?

So far, we looked at what Big Data is. Now we turn to the 'why' - what are the sources, reasons ("drivers") etc.

Sources of Big Data

Big Data can result from [you have seen this before]:

"Datafication"

Wikipedia: Datafication is a modern technological trend turning many aspects of our life into computerized data and transforming this information into new forms of value. Examples of datafication as applied to social and communication media are how Twitter datafies stray thoughts or datafication of HR by LinkedIn and others.

In other words, it is the new notion that people, our built environment (e.g. number of freeways in the US), etc. can create, i.e. lead to, (large scale) data.

"Once we datafy things, we can transform their purpose and turn the information into new forms of value."

IoT

IoT is the 'Internet of Things' - what if (almost) every lightbulb, tire, building, plane engine, bridge, fridge etc. had an IP address and a sensor, and transmitted data periodically through a network? Among other things, this by itself will lead to an *explosion* of data :)
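A quick back-of-the-envelope sketch shows why. Every input below is a hypothetical round number picked for illustration (not a measurement or a forecast), and the function name `daily_bytes` is made up for this example.

```python
TB = 10**12  # one terabyte, decimal (SI) convention


def daily_bytes(num_devices: int, bytes_per_reading: int, readings_per_day: int) -> int:
    """Total bytes produced per day by a fleet of identical devices."""
    return num_devices * bytes_per_reading * readings_per_day


# Hypothetical assumptions, for illustration only:
devices = 1_000_000_000     # one billion connected 'things'
bytes_per_reading = 100     # a small sensor payload
readings_per_day = 24 * 60  # one reading per minute

total = daily_bytes(devices, bytes_per_reading, readings_per_day)
print(total // TB, "TB per day")  # 144 TB per day
```

Even with tiny 100-byte readings, these assumed numbers add up to 144 TB every day; more devices, richer payloads (images, audio) or more frequent readings scale the total up proportionally.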

Here is an IoT infographic:

Big Data in Manufacturing

Big Data is starting to play a 'big' role in modernizing manufacturing operations:

Big Data - redux

Again, these are Big Data's characteristics:

Why is this useful again?

What are things we can do now that we couldn't before?

* combine multiple sources of data (however small or seemingly insignificant) for a better 'bigger picture'

* exploit unstructured data - voice, video, images, tweets, blog posts..

* provide insights to [internal] frontline managers in near-real-time (to enable making more agile business decisions)

* experiment with the marketplace (fluid price-setting) as often as needed!

So here's what is new: better insight, quicker action.

How long will this be useful/relevant?

According to IEEE (and others), a long time.

Conferences, organizations, LA user group

Here are some links (to a variety of conferences, a group and meetup):

Summary

We are at the start of a transformative phase, fed by our relatively-new ability to collect, store, analyze and benefit from MASSIVE amounts of data from every walk of life.