Data Science
Let us begin at the beginning, and define what data science is.
v1
"Data science is an emerging interdisciplinary field that combines elements of mathematics, statistics, computer science, and knowledge in a particular application domain for the purpose of extracting meaningful information from the increasingly sophisticated array of data available in many settings."
v2
From Wikipedia:
Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high performance computing. Methods that scale to big data are of particular interest in data science, although the discipline is not generally considered to be restricted to such big data, and big data solutions are often focused on organizing and preprocessing the data instead of analysis. The development of machine learning has enhanced the growth and importance of data science.
Data science affects academic and applied research in many domains, including machine translation, speech recognition, robotics, search engines, digital economy, but also the biological sciences, medical informatics, health care, social sciences and the humanities. It heavily influences economics, business and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.
v3
Data science is OSEMN ('awesome') - it involves Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data - Jeroen Janssens.
v4
Data Scientist (noun): Person who is better at statistics than any software engineer and better at software engineering than any statistician - Josh Wills.
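The OSEMN steps in definition v3 can be sketched as a tiny program. This is only an illustration, not anyone's official method: the CSV contents, the column names (`hours`, `score`) and the function names are all made up for the example, and the "model" is a plain least-squares line fit using only the Python standard library.

```python
import csv
import io
import statistics

# Hypothetical raw data (invented for this sketch): hours studied vs. exam
# score, with one row that is missing its score.
RAW_CSV = """hours,score
1,52
2,55
3,61
4,
5,70
6,74
"""


def obtain(text):
    """Obtain: read rows from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop rows with missing values, convert the rest to floats."""
    clean = []
    for row in rows:
        if row["hours"] and row["score"]:
            clean.append((float(row["hours"]), float(row["score"])))
    return clean


def explore(points):
    """Explore: compute a couple of simple summary statistics."""
    scores = [y for _, y in points]
    return {"n": len(points), "mean_score": statistics.mean(scores)}


def model(points):
    """Model: fit score = intercept + slope * hours by ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def interpret(slope, intercept):
    """iNterpret: turn the fitted numbers into a plain-language statement."""
    return (f"Each extra hour is associated with about {slope:.1f} more points "
            f"(baseline about {intercept:.1f}).")


rows = obtain(RAW_CSV)
points = scrub(rows)
summary = explore(points)
slope, intercept = model(points)
print(summary)
print(interpret(slope, intercept))
```

Each function maps to one letter of OSEMN, so a step can be swapped out (a database query for `obtain`, say, or a different model) without touching the others.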
FYI tidbit: DJ Patil, who served as Chief Data Scientist of the United States and was previously the Head of Data Products at LinkedIn, is widely credited (along with Jeff Hammerbacher) with coining the job title 'data scientist'; the term 'data science' itself is older.
Since Data Science is interdisciplinary, job titles could take on a variety of forms:
Note: you don't 'NEED' a PhD to work in this field! Good analytical skills, knowledge of the tools and algorithms, and coding ability - these would help immensely.
As you can imagine, the single word answer would be "insight" - we would like to **extract** meaningful info, "actionable" knowledge, insight, wisdom - whatever it is called - from "cold, hard data". That insight could then bring in profits, change societies, save lives..
This isn't hype or a vague concept. It means that we start with collected/measured data, analyze/process it, and obtain, as a result, something **new** that we did not know or had not realized.
As we mentioned earlier, data science is an **interdisciplinary** field. It encompasses the following areas:
Math
Algorithms
Coding
Domain knowledge
You need to KNOW the area/topic/subject/field/domain in which you want to work!
This could be one of a variety of things: agriculture, business, climate, consumer-oriented, ecology, education, finance, government, health/medicine, manufacturing, societal, science.. - each area has MANY sub areas!
Why is subject matter important to know? Because you need to understand the input data, know the terminology, ask the right questions (regarding the data you want to analyze), understand and then communicate the results of your analysis. Without domain knowledge, you will feel like an outsider, and likely be perceived as one.
Here is a ranking of relevant skills..
There are several courses available, both at 'SC and elsewhere:
There are a lot of schools throughout the US and the world that offer degree programs in data science. Columbia U has a pretty comprehensive program; here is their course list [a good checklist, for putting together your own data science track]:
Here is another good course, and this one is put together by Microsoft.
Data science based analysis consists of a defined series of steps, as shown below:
Notice that the "pipeline" shown above [in blue] is strictly sequential/linear - it is how ALL science is carried out (it is classical 'hypothesis verification', i.e. 'the scientific method'). So, "data science" is the application of the scientific method to ALL forms of data.
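To make 'hypothesis verification' on data concrete, here is a minimal sketch of a permutation test. It is one common way to check a hypothesis, not the specific method from the pipeline figure. The two groups of measurements are made-up numbers, and the function names are hypothetical. The hypothesis being tested is "the treated group differs from the control group"; the test asks how often randomly relabeling the data produces a difference at least as large as the one observed.

```python
import random
import statistics

# Hypothetical measurements (invented for this sketch).
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [12.9, 12.6, 13.1, 12.7, 12.5, 13.0]


def mean_difference(a, b):
    """Difference in group means: mean(b) - mean(a)."""
    return statistics.mean(b) - statistics.mean(a)


def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value by randomly relabeling the pooled data.

    Returns the fraction of shuffles whose absolute mean difference is at
    least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean_difference(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean_difference(pooled[:len(a)], pooled[len(a):]))
        # Small tolerance guards against floating-point rounding on ties.
        if diff >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles


p_value = permutation_p_value(control, treated)
print(f"observed difference: {mean_difference(control, treated):.2f}")
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value means random relabeling rarely reproduces the observed gap, which supports the hypothesis that the groups really differ; a large one means the data do not support it.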
'Mountains' of Big Data exist at the federal, state and city levels, free for the taking - to analyze, get insights from, and as a result, transform society:
"From scratch" analyses can be performed using homegrown and open-source libraries in R, Python, Java etc.
Weka is a comprehensive data mining tool, as are KNIME and RapidMiner [all free] - all are quite popular. There are a lot of tools! Even Excel, MATLAB and Mathematica could be used for data analysis..
For visualizing, there are many options again - QlikView, Tableau, periscope.io..
The following companies aim to provide productivity and efficiency boosts, for data scientists:
Platforms provide 'end to end' ("productionized") functionality, e.g. ETL, analytics, viz. Here are some, in no particular order:
There are also 'data science appliances' - these are turnkey systems for more casual users. Anodot and GeoStrategies are two companies that provide such products.
Internships are a good way to get into the field (or test the waters!)..
You can consider applying to these highly competitive, intensive and immersive programs:
Opportunities like the above will let you exercise your (Big Data, analytics) knowledge, (coding, communication, teamwork) skills and passion for effecting social change.
Kaggle: https://www.kaggle.com/jobs
DataJobs: https://datajobs.com/data-science-jobs
Here are MANY more job postings...
Here is a list of 100 basic DS questions.
Excellent books:
Here are some sample blogs and sites:
Below are gatherings - attend several of these, to learn, and to network.