




Data Science - overview

Topics

In this lecture, we are going to get a bird's-eye view of 'data science':

A basic question

WHAT is "data science"?

We are aware of physical sciences, biological sciences, and social sciences, but what is data science?

Data science is... science with data :) Let us find out what this means...

What is data science?

One answer: https://www.youtube.com/watch?v=L7CdHnuR4pE

Another answer: https://www.youtube.com/watch?v=xC-c7E5PK0Y

What is data science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

- Wikipedia


Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems.

At the core is data. Troves of raw information, streaming in and stored in enterprise data warehouses. Much to learn by mining it. Advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value.

- datajobs.com


Data science is "inter-disciplinary"

Data science lies at the intersection of multiple pre-existing fields, ie. it uses and is built upon them.


We live in a data-driven world, where massive computers analyze the traces of all human activity in our quest to predict the future. Data Science is a rapidly emerging discipline at the intersection of statistics, machine learning, data visualization, and mathematical modeling, but it remains mysterious and even threatening to the broader public.

- Steven Skiena


"inter-disciplinary" - diagrams

A diagram, esp. a 'Venn' diagram, is useful in depicting the idea of something involving multiple entities.

Here are some 'data science' Venn diagrams:

And here is a 'concordance' diagram that lists terms related to data science:

Boiling it all down, there are three core areas that come into play:

A fourth 'pillar' can be added, actually:


Who is a data scientist?

Just as data science can be depicted using a Venn diagram, a data scientist can be, as well:


Data, information, knowledge

Another way to look at data science is this: it is the process of transforming (converting) data into information, information into knowledge: data → information → knowledge.


Data is/are 'raw facts'.

Information is 'useful data'.

Knowledge is 'actionable information'.
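
To make the data → information → knowledge chain concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the pickup counts, the period boundaries, and the helper names `to_information` and `to_knowledge` are hypothetical, not taken from any real dataset or library.

```python
from statistics import mean

# Hypothetical raw facts: (hour of day, number of cab pickups).
# These values are invented purely for illustration.
raw_data = [(8, 120), (9, 150), (13, 60), (14, 55), (18, 170), (19, 160)]


def to_information(records):
    """Data -> information: summarize raw records as average pickups per period."""
    periods = {"morning": [], "afternoon": [], "evening": []}
    for hour, pickups in records:
        if hour < 12:
            periods["morning"].append(pickups)
        elif hour < 17:
            periods["afternoon"].append(pickups)
        else:
            periods["evening"].append(pickups)
    # Skip periods with no records so mean() never sees an empty list.
    return {name: mean(values) for name, values in periods.items() if values}


def to_knowledge(information):
    """Information -> knowledge: turn the summary into something we can act on."""
    busiest = max(information, key=information.get)
    return f"Schedule more drivers in the {busiest}."


information = to_information(raw_data)
knowledge = to_knowledge(information)
print(information)
print(knowledge)
```

The raw list is just 'facts'; the per-period averages are 'useful data'; the scheduling recommendation is 'actionable information'.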


Sometimes you will see a fourth component, "wisdom":

Data informatics vs data science

We just saw that data science has to do with extracting knowledge/value from raw data - this can be regarded as a purely scientific/technical process (more on this soon). What happens AFTER the knowledge is extracted?

Informatics is defined, in the medical field, this way: "Biomedical informatics (BMI) is the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving and decision making, motivated by efforts to improve human health." The emphasis is on USING, ie. ACTING UPON, the resulting knowledge.

So 'data informatics' is a superset of 'data science', since it (informatics) has a broader scope/context - one in which data science is a crucial component of societal and other (eg. business-oriented) problem-solving.

The process (of data science)

Here is a diagram, with a linear sequence of steps:

Q: what does this remind you of, ie. what other VERY WELL KNOWN method does it resemble?

Summary

This clip (till 11:42) is a nice summary of what we discussed: https://www.youtube.com/watch?v=KxryzSO1Fjs

Also, here are Skiena's slides on introductory data science.

Data lifecycle

The use of data science is an ongoing process. This diagram (from UC Berkeley) shows the cyclical nature of the process:

Notice the 'Communicate' -> 'Capture' link in particular - after utilizing the results of an analysis, we CONTINUE capturing MORE data, for ONGOING use.

Lifecycle, in a bit more detail

Here (30:18 till end) is a description of the various phases of a typical data science project: https://www.youtube.com/watch?v=KxryzSO1Fjs

Gazing into the crystal ball - musician/band success

Wouldn't it be cool (and lucrative!) to be able to predict rising music stars? A group at USC is working on enabling just that, using data (concert venue data in particular): https://viterbischool.usc.edu/news/2018/08/predicting-the-next-taylor-swift-with-algorithms/

Walkthrough - cab driver tip prediction

This SQL Server documentation contains a detailed walkthrough that shows how to ingest data, analyze (learn from) it, and predict tipping behavior - it uses 173 million (!) data points [of NYC cab rides in 2013]... Three variations are presented (Python, R+SQL, SQL+R):


Remember - MOST of 'data science' involves the three steps named above: ingest, analyze (learn), and predict.
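
As a toy illustration of the ingest, analyze, predict pattern, here is a minimal Python sketch. It is NOT the method used in the SQL Server walkthrough: the four rides are invented values (not the NYC dataset), the function names are hypothetical, and a one-variable least-squares line stands in for whatever model a real project would use.

```python
import csv
import io

# Hypothetical rides (fare and tip in dollars); invented values for illustration only.
RIDES_CSV = """fare,tip
10.0,2.0
20.0,4.0
30.0,6.0
40.0,8.0
"""


def ingest(text):
    """Ingest: parse raw CSV text into a list of (fare, tip) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["fare"]), float(row["tip"])) for row in reader]


def analyze(rides):
    """Analyze: fit tip = slope * fare + intercept by ordinary least squares."""
    n = len(rides)
    mean_fare = sum(fare for fare, _ in rides) / n
    mean_tip = sum(tip for _, tip in rides) / n
    sxx = sum((fare - mean_fare) ** 2 for fare, _ in rides)
    sxy = sum((fare - mean_fare) * (tip - mean_tip) for fare, tip in rides)
    slope = sxy / sxx
    intercept = mean_tip - slope * mean_fare
    return slope, intercept


def predict(model, fare):
    """Predict: apply the fitted line to a fare the model has not seen."""
    slope, intercept = model
    return slope * fare + intercept


rides = ingest(RIDES_CSV)
model = analyze(rides)
print(predict(model, 25.0))
```

A real project would ingest from a database or files rather than an in-memory string, and would evaluate the model on held-out data before trusting its predictions.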