Data Science - overview
In this lecture, we are going to get a bird's-eye view of 'data science':
WHAT is "data science"?
We are aware of physical sciences, biological sciences, and social sciences, but what is data science?
Data science is... science with data :) Let us find out what this means...
One answer: https://www.youtube.com/watch?v=L7CdHnuR4pE
Another answer: https://www.youtube.com/watch?v=xC-c7E5PK0Y
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.
- Wikipedia
Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems.
At the core is data. Troves of raw information, streaming in and stored in enterprise data warehouses. Much to learn by mining it. Advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value:
- datajobs.com
Data science lies at the intersection of multiple pre-existing fields, i.e. it uses, and is built upon, them.
We live in a data-driven world, where massive computers analyze the traces of all human activity in our quest to predict the future. Data Science is a rapidly emerging discipline at the intersection of statistics, machine learning, data visualization, and mathematical modeling, but it remains mysterious and even threatening to the broader public.
- Steven Skiena
A diagram, especially a Venn diagram, is useful for depicting an idea that involves multiple overlapping entities.
Here are some 'data science' Venn diagrams:
And here is a 'concordance' diagram that lists terms related to data science:
Boiling it all down, there are three core areas that come into play:
A fourth 'pillar' can be added, actually:
Just as data science can be depicted using a Venn diagram, so can a data scientist:
Another way to look at data science is this: it is the process of transforming (converting) data into information, information into knowledge: data → information → knowledge.
Data is/are 'raw facts'.
Information is 'useful data'.
Knowledge is 'actionable information'.
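The data → information → knowledge chain can be sketched in a few lines of Python. This is only an illustration: the temperature readings, the function names `to_information` and `to_knowledge`, and the 30-degree threshold are all made up for the example.

```python
from statistics import mean

# Data: raw facts (hypothetical daily temperature readings, in Celsius).
readings = [
    ("Mon", 21.5), ("Tue", 22.0), ("Wed", 29.5),
    ("Thu", 30.0), ("Fri", 31.5),
]

def to_information(data):
    """Information: summarize the raw readings into something useful."""
    temps = [temp for _, temp in data]
    hottest_day, hottest_temp = max(data, key=lambda row: row[1])
    return {
        "mean_temp": mean(temps),
        "hottest_day": hottest_day,
        "hottest_temp": hottest_temp,
    }

def to_knowledge(info, threshold=30.0):
    """Knowledge: turn the summary into an action (the threshold is an assumed rule)."""
    if info["hottest_temp"] >= threshold:
        return f"Issue a heat advisory for {info['hottest_day']}"
    return "No action needed"

info = to_information(readings)
print(info)
print(to_knowledge(info))
```

The raw list by itself tells us little; the summary is useful; the advisory is something we can act upon.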
Sometimes you'll see a fourth component, "wisdom":
We just saw that data science is about extracting knowledge/value from raw data; this can be regarded as a purely scientific/technical process (more on this soon). What happens AFTER the knowledge is extracted?
In the medical field, informatics is defined this way: "Biomedical informatics (BMI) is the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving and decision making, motivated by efforts to improve human health." The emphasis is on USING, i.e. ACTING UPON, the resulting knowledge.
So 'data informatics' is a superset of 'data science': informatics has the broader scope/context, and data science is a crucial component within it for societal and other (e.g. business-oriented) problem-solving.
Here is a diagram, with a linear sequence of steps:
Q: What does this remind you of, i.e. what other VERY WELL KNOWN method does it resemble?
This clip (till 11:42) is a nice summary of what we discussed: https://www.youtube.com/watch?v=KxryzSO1Fjs
Also, here are Skiena's slides on introductory data science.
The use of data science is an ongoing process. This diagram (from UC Berkeley) shows the cyclical nature of the process:
Notice the 'Communicate' -> 'Capture' link in particular - after utilizing the results of an analysis, we CONTINUE capturing MORE data, for ONGOING use.
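A rough sketch of that loop, in Python: each round captures a new batch of data, re-analyzes everything held so far, and communicates the result before the next capture. The batches are invented numbers, and the single `analyze` function is a simplification that stands in for every stage between capturing and communicating.

```python
def capture(store, batch):
    """Add newly captured records to the data we already hold."""
    return store + batch

def analyze(store):
    """Stand-in for all intermediate stages: compute the mean of the stored data."""
    return sum(store) / len(store)

def communicate(round_no, count, result):
    """Report the current result; in practice this would be a chart or a report."""
    return f"round {round_no}: mean of {count} records = {result:.2f}"

# Hypothetical batches arriving over time.
batches = [[4, 6], [10], [1, 3, 6]]

store = []
reports = []
for round_no, batch in enumerate(batches, start=1):
    store = capture(store, batch)           # Capture ...
    result = analyze(store)                 # ... analyze ...
    reports.append(communicate(round_no, len(store), result))  # ... communicate, then loop

for line in reports:
    print(line)
```

Each pass works on MORE data than the one before, so the reported result can shift from round to round.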
Here (30:18 till end) is a description of the various phases of a typical data science project: https://www.youtube.com/watch?v=KxryzSO1Fjs
Wouldn't it be cool (and lucrative!) to be able to predict rising music stars? A group at USC is working on enabling just that, using data (concert venue data in particular): https://viterbischool.usc.edu/news/2018/08/predicting-the-next-taylor-swift-with-algorithms/
This SQL Server documentation contains a detailed walkthrough that shows how to ingest data, analyze (learn from) it, and predict tipping behavior. It uses 173 million (!) data points [of NYC cab rides in 2013]. Three variations are presented (Python, R+SQL, SQL+R):
Remember - MOST of 'data science' involves the three steps named above (ingest, analyze, predict):
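To make the ingest, analyze, predict sequence concrete, here is a toy sketch in plain Python. It is NOT the walkthrough's code, data, or model: the four fare/tip rows are made up, and a simple least-squares line is swapped in as the 'learning' step purely for illustration.

```python
import csv
import io

# Hypothetical raw data: four made-up cab rides (fare and tip, in dollars).
RAW_CSV = """fare,tip
5.0,1.0
10.0,2.0
15.0,3.0
20.0,4.0
"""

def ingest(text):
    """Ingest: parse CSV text into (fare, tip) pairs of floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["fare"]), float(row["tip"])) for row in reader]

def analyze(rows):
    """Analyze: fit tip = intercept + slope * fare by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, fare):
    """Predict: estimate the tip for a fare the model has not seen."""
    slope, intercept = model
    return intercept + slope * fare

rows = ingest(RAW_CSV)
model = analyze(rows)
print(f"predicted tip on a $30 fare: ${predict(model, 30.0):.2f}")
```

A real project differs mainly in scale and in the choice of model; the three-step shape stays the same.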