HW1: EDA

Summary: In this HW, you'll use some basic Python commands, and a cloud platform for doing EDA on provided datasets.

Description

As a first step, you need to set up your data science environment. You'll use 'JuPyteR' (which we'll call 'Jupyter' from now on) to do HW1, working in a specialized web-page-like document called a 'notebook'. Jupyter will continue to be useful to you long after you finish this course! It is even possible to run Jupyter notebooks 'on the cloud', eg. by storing them in your GDrive.

Summary of the workflow, for Jupyter:

* install 'miniconda'

* use miniconda to install Jupyter, and to install any additional libraries you'll need

* launch Jupyter (via your web browser)

* using Jupyter, open your HW1 notebook

* modify code in your notebook, run, modify, run... ['save' often]

To install Jupyter, do this:

* install 'miniconda': https://conda.io/miniconda.html

* once miniconda is installed, open up an 'Anaconda prompt' [a black terminal window], and type in:

conda install jupyter

* after that, typing

jupyter notebook

at the prompt should pop up a web browser page, with a 'localhost' URL - congrats, you're ready to do Data Science :)

Here are my screenshots for Jupyter installation (on a Win10 laptop):




Next, you need to install three Python 'modules' [libraries] that help with data analysis and plotting - pandas, seaborn, matplotlib [run the following commands in the conda shell (ie. prompt)]:

conda install matplotlib

conda install pandas

conda install seaborn

It'll take a little while for the modules to be fetched from the web (from known repositories), and installed locally on your machine.
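Once the installs finish, a quick way to confirm all three libraries are usable is to import them and print their versions. Here is a minimal sketch you could run in a Python prompt or a notebook cell:

```python
import importlib

# Try importing each library; record its version, or None if it's missing.
status = {}
for name in ("matplotlib", "pandas", "seaborn"):
    try:
        status[name] = importlib.import_module(name).__version__
    except ImportError:
        status[name] = None

for name, version in status.items():
    print(name, version if version else "MISSING - run: conda install " + name)
```

If any library shows up as MISSING, re-run the corresponding conda install command.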

Now, onto EDA :)


Here is your first dataset - 51 years (1955-2005, both years inclusive) of Fortune 500 companies' revenues and profits (each year, top 500 companies [ie. 'Fortune 500'] for that year are listed).

This is your second dataset - a popular one used for introductory data science. It's the 'Iris dataset': 50 rows of data (4 columns of flower measurements) for each of three different iris species, 150 rows in total.

Create a folder/directory called 'INF250-HW1' on your machine, and download/move the above two datasets into that folder. This will keep things organized.

Q1 (5 points). Compute summary statistics for the 'Revenue' column of the Fortune500 dataset.

To do this, you'd launch Jupyter, then load in this notebook [here's how - BROWSE for the .ipynb notebook, via the Jupyter localhost:8888/tree page that popped up when you ran the 'jupyter notebook' command in the conda shell]. As mentioned in class, a notebook is a 'literate' entity, containing text+images, and runnable code (in Julia, Python, R, and many more languages); for this HW, you'll just use Python, which Jupyter supports out of the box. FYI, as you might have noticed when you clicked on the notebook link above, a notebook .ipynb file is plaintext JSON!

Copy/download the notebook to your HW1 folder [the little video in the link shows two different ways - you can download the file, or copy and paste its contents from the browser into a new file - either way, the result is the same], then open it in Jupyter (as you can see, the notebook shows up as if it were a browser page). You can edit the notebook by inserting one of two types of 'cells': a Markdown cell, for adding text+images, or a code cell, for adding (Python) code. Jupyter creates a code cell by default; you'd switch it to Markdown where needed (for this HW you don't need to create Markdown cells). Markdown cells are where text, images, videos, links etc go, and code cells are where code (in Julia, Python, R, JavaScript...) goes. By inter-mixing text and code cells, we are able to create a 'live' document where, unlike a book with printed code, the code is runnable and editable - cool! And you can save your notebooks' versions incrementally as you work, creating a record of what you learned, so that you can keep going back to them when you need to (much better than static notes!).

For practice, insert a code cell (again, which is the cell type that gets inserted by default), type in

print("Hello INF250!")

and 'Run' it :) You can delete this cell if you like.

Please get in the habit of pressing 'Ctrl+S' often, to save the notebook - that way you won't lose too many of the changes you make, if you need to restart the browser/machine for whatever reason. Your notebook is autosaved at regular intervals, but you can press Ctrl+S to force a save anytime (eg. after typing in some code).

To compute the 'Revenue' summary, you'd read the entire rectangular dataset (rows x columns) into memory, then work on the 'Revenue' column.

pandas has a function called read_csv(), that's what you'll use, to read in a dataset: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

The result of reading in a dataset using read_csv() is a 'DataFrame' object (a rectangular/tabular block of data): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html A very useful DataFrame method is .describe() [hint - you can use it to get your result :)]

As the notebook says, add in code to read the dataset and compute the summary.
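For example, here is a sketch using a tiny inline stand-in for the CSV (column names like 'Revenue' are assumptions - match them to the header in your actual fortune500.csv):

```python
import io
import pandas as pd

# Tiny inline stand-in for fortune500.csv (the real file has 500 rows per
# year); the rows and column names here are made up for illustration.
csv_text = """Year,Rank,Company,Revenue,Profit
1955,1,General Motors,9823.5,806.0
1955,2,Exxon Mobil,5661.4,584.8
1955,3,U.S. Steel,3250.4,195.4
"""

# With the real file, you'd write: df = pd.read_csv("fortune500.csv")
df = pd.read_csv(io.StringIO(csv_text))

summary = df["Revenue"].describe()  # count, mean, std, min, quartiles, max
print(summary)
```

The .describe() output is itself a pandas Series, so you can also pick out individual statistics, eg. summary["mean"].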

Q2 (5 points). Read in the 'iris' dataset [link above] into the notebook, and follow the steps listed in the notebook (describe the data, generate column histograms).
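A minimal sketch of the Q2 workflow, using a few inline rows standing in for the iris file (the column names below are assumptions - use the header from your copy of the dataset):

```python
import io
import pandas as pd

# Three rows standing in for the iris dataset (the real file has 150 rows).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# With the real file: iris = pd.read_csv("iris.csv")
iris = pd.read_csv(io.StringIO(csv_text))
print(iris.describe())  # summary stats for the four numeric columns

# In the notebook, column histograms are then one call away, eg.:
#   iris.hist()                         # pandas (uses matplotlib)
# or per column, with seaborn:
#   import seaborn as sns
#   sns.histplot(iris["sepal_length"])
```

Note that .describe() automatically skips the non-numeric 'species' column.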

Q3 (5 points). Here, you will 'wrangle' (pre-process) data.

For this question, you need to:

Useful: https://www.youtube.com/watch?v=Y72qtcCxDb0

Update: use Alteryx, instead of Trifacta: https://www.alteryx.com/about-us/trifacta-is-now-alteryx-designer-cloud and https://help.alteryx.com/20231/designer/build-workflows. Sign up and get started, LIKE SO:

Create a new 'flow', and import your fortune500.csv dataset into it (Import datasets -> Upload from your computer). Then clean up the data: delete rows where the profit column contains 'N.A.'. You'd add a 'recipe', edit it to specify what to do (ie delete rows), then execute the cleanup recipe (click on the output block of your flow, then click 'Run'). Download the resulting data as fortune500_cleanedup.csv - you'll need to submit it: look for an 'Export results' button on your job's result page (the page URL might look like 'https://cloud.trifacta.com/jobs/3703'). Can't figure out how to download the csv file to your machine? Look carefully, it's in there :) Also, take a screenshot of the cleanup workflow (recipe), and submit the .jpg or .png screenshot.

Hint - here is what you'd see, before cleanup:

ALTERNATIVE to Alteryx [if you couldn't get Alteryx to install/work]: what would be the Python commands (on the DataFrame object) you'd run in your notebook to do this cleanup (delete rows that have 'N.A.' in the profit column)? HINT:
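Here is one way this could look in pandas - a sketch with two made-up rows; in the real notebook you'd start from pd.read_csv("fortune500.csv") instead:

```python
import io
import pandas as pd

# Two made-up rows, one with the 'N.A.' marker in the Profit column:
csv_text = """Year,Rank,Company,Revenue,Profit
1955,291,Norton,135.0,N.A.
1955,292,Schlitz Brewing,100.0,9.9
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows whose Profit is NOT the literal string 'N.A.':
cleaned = df[df["Profit"] != "N.A."]
print(cleaned)

# Alternative: have read_csv treat 'N.A.' as missing, then drop those rows:
#   df = pd.read_csv("fortune500.csv", na_values=["N.A."])
#   cleaned = df.dropna(subset=["Profit"])
# Afterwards, cleaned["Profit"].astype(float) gives you numeric profits,
# and cleaned.to_csv("fortune500_cleanedup.csv", index=False) saves the result.
```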


What to submit (as a .zip, with your name, eg. HW1_Melanie.zip):

  • your notebook, eg. hw1.ipynb - running the cells in this should produce answers for Q1, Q2
  • Alteryx/Trifacta screenshot (of the cleanup recipe) plus the cleaned-up dataset (eg. fortune500_cleanedup.csv), OR a Q3.txt file with just the Python cmd(s) to delete 'N.A.' rows in fortune500.csv

Please upload your submission to D2L, in the 'HW1' folder.

HAVE FUN!