INF250: HW2

# HW2: Data Mining (tool-based)

###**Summary**: In this homework, you are going to use three UI-based tools (no coding!), to carry out data mining: **WEKA, KNIME, Orange.** There are 6 questions you need to answer.

###**Description**

####Start by downloading WEKA, from https://www.cs.waikato.ac.nz/ml/weka/ [Getting Started -> Download]. FYI WEKA is written in Java, so you need to install Java [most likely you already have it] prior to installing WEKA. WEKA is powerful and capable - you can continue using WEKA long after this course, and in the future, even consider extending it by writing plugins for it.

####Take a few hours to go through WEKA's tutorials: https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/. You can also look at YouTube videos to come up to speed. It is just a matter of getting familiar with the UI, and with the overall workflow (read in data, possibly clean data, do analysis, possibly export results).

####<a href="data/housing.arff">Here</a> is a famous (in the ML/DM community) dataset called the <a href="https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html">'Boston Housing Dataset'.</a> As you can read from the description, it is a dataset that contains data regarding houses in several Boston suburbs, published in 1993. It has 506 rows (records) of data, and 14 columns (attributes). For this HW, we'll use the 'MEDV' (median home price) attribute as the "class" (the output to predict). In other words, using existing data from the other 13 columns, we want to be able to learn to predict MEDV for a new record that contains values for those 13 columns.

####As you can see, the data is in a WEKA-native format called ARFF [https://www.cs.waikato.ac.nz/ml/weka/arff.html], which resembles, but is more descriptive than, CSV.

####Q1. Build a linear regression equation, to predict MEDV. Include a screenshot that shows the linear equation. How many terms are in the equation, and 'why'?
In other words, discuss the resulting equation.

####Q2. Create a 'MultilayerPerceptron' neural network that learns the data. You can affect the root mean squared error ('RMSE') by setting different values for the learning rate (try to keep this between 0.1 and 0.3) and the momentum (again, 0.1 to 0.3). What is the lowest RMSE you are able to achieve? E.g., an RMSE of 5.0 would mean that the MEDV predictions for our 506 rows were off by roughly 5.0 units on average (the actual values max out at 50, so this represents about a 10% error relative to that maximum). Again, include (two) screenshots that show the NN and the RMSE.

<div style="height:1px;border:1px solid #BBBBBB;"/>

####<a href="data/shells.arff">Here</a> is another dataset to use. It consists of 4177 rows of data regarding <a href="https://www.google.com/search?q=abalone+shells&num=100&rlz=1C1CHBF_enUS723US723&source=lnms&tbm=isch&sa=X&ved=0ahUKEwihrcyRq8zXAhWJCpAKHXDdDDUQ_AUICygC&biw=1280&bih=541">abalone shells</a>, where each row resulted from measuring 9 parameters/features/values for each shell. The data is in text format (.arff format, for input to WEKA, like above); do take a look at it. The idea is to predict the 9th value, number-of-rings, given the other 8 values, using the existing dataset to learn how to predict.

####Next, download and install <a href="https://www.knime.com/">KNIME</a> ("nime"), and work through the quickstart tutorial. KNIME is also UI-driven, like WEKA; additionally, it is visual-dataflow-driven, which means we can do data mining with it by 'connecting the boxes' (where each box reads data or does mining or writes data, etc).

####Q3. Use KNIME to perform linear regression [on all parameters, not a subset]. You need these nodes: ARFF Reader, Linear Regression Learner. Create and connect the nodes, and execute each. What is the linear equation?

####Q4. Set up a 'Decision Tree Learner' predictor, where 'sex' is the predicted variable.
Note - think "simple" - no need to partition the data into training and test data, etc! Provide a snapshot (.jpg or .png) of the \*entire\* decision tree [OK if the nodes are too zoomed out and are therefore unreadable] - hint: look at the \*right\* side of the split-pane window.

<div style="height:1px;border:1px solid #BBBBBB;"/>

####Download the oh-so-fruity <a href="https://orangedatamining.com/">Orange</a>, and play with it for a bit - it is also dataflow-based, just like KNIME. It comes with ALL these widgets/nodes/operators/'boxes': https://orangedatamining.com/widget-catalog/

####Q5. Bring in the shells.arff data as .csv [use an online converter such as https://pulipulichen.github.io/jieba-js/weka/arff2csv/ to convert .arff to .csv], and only work with these 4 params/columns: length,diameter,height,num_rings. Create 6 clusters (using all 4 attrs together, i.e., you'd be clustering a 4D dataset) out of the 4177 pieces of data (use the 'k-Means' node). Question: how many data points are in each cluster?

####Q6. Next, do a linear regression to predict num_rings, given these three columns: length,diameter,height. Question: what is the equation? You'd be using the 'Linear Regression' operator for this.

<div style="height:1px;border:1px solid #BBBBBB;"/>

####Q7 [Bonus, 1 point]. Pick a dataset using https://toolbox.google.com/datasetsearch. Using WEKA, KNIME, Orange or even <a href="https://rapidminer.com/">RapidMiner</a>, do some analysis on it (run a mining algorithm that is not among those used in the 6 questions above), and present your result[s].

<div style="height:1px;border:1px solid #BBBBBB;"/>

####It's highly worth knowing how to use such tools for analysis, as opposed to only knowing how to do so using Python, R, JS, Julia or SQL code - **the interface-driven tools are JUST AS powerful**, because they encapsulate all the data-mining algorithms/code in an easy-to-use UI that (even) non-programmers can use.

####Please upload your (.zip) submission to D2L as usual.
####ENJOY!!

<div style="height:1px;border:1px solid #BBBBBB;"/>
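
####A side note on the RMSE example in Q2, for anyone curious about the arithmetic (it is not needed for the homework, which stays code-free). The sketch below, in Python, computes an RMSE by hand. The `rmse` helper and the four MEDV values are made up for illustration and are not taken from housing.arff; they are chosen so that every prediction is off by exactly 5 units.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical MEDV values (NOT from housing.arff), for illustration only.
actual_medv = [24.0, 22.0, 35.0, 50.0]
predicted_medv = [29.0, 17.0, 40.0, 45.0]  # each prediction misses by 5 units

error = rmse(actual_medv, predicted_medv)
relative_to_max = error / max(actual_medv)  # error as a fraction of the max value
print(f"RMSE = {error:.1f}, i.e. {relative_to_max:.0%} of the largest actual value")
```

Because RMSE squares the errors before averaging, a few large misses raise it more than a plain average of the errors would, so treat "off by 5.0 units on average" as a rough reading of the number that WEKA reports.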