
Instructor: Dr. Shieu-Hong Lin

Class: TR 1:30-2:45 pm at Lim 41

Office Hours: Lim 137, MW 1:30-3:30 pm and T/Th 3:00-5:00 pm

**Submission of all your work**: go to Biola Canvas

**Your grades**: see them under Biola Canvas

*************************************************************************************************

**Week 1**. Overview of the Landscape of (i) Machine Learning, (ii) the WEKA toolkit, and (iii) the SciPy ecosystem for Data Science

**Lab #1: WEKA**: Report due: Thursday, Jan. 24

**Exploration**: Install WEKA. Download and unzip this zip file to get the zoo data set and the Iris data set as two files in ARFF format. Open and examine the data files as text files using a text editor such as Notepad. Browse Chapter 2 of *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account) to understand the ARFF data file format used by WEKA.

**Experiment**: (i) Run WEKA and use the **Explorer**. In the Explorer, open the zoo data set. Select the classifier J48 (under the trees section of the classifier menu) and apply it to learn a decision tree from the data set. Copy and paste the decision tree you got from WEKA into a Word document. (ii) Do the same with the Iris data set to learn a decision tree, and copy and paste that tree into the same Word document.

**Submission**: Upload the Word document through Biola Canvas to show your findings.
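For orientation, an ARFF file is plain text: a header declaring the relation and its attributes, then a `@data` section of comma-separated rows. A minimal sketch in the style of the Iris data set (abridged; the real file has 150 rows):

```
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1, 3.5, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
```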

**Reading 1**: Report due: Thursday Jan. 24

- Download and install Anaconda on your own computer so that you can play with the Jupyter notebooks in the reading assignment.
- Read Chapter 1 of *Introduction to Machine Learning with Python* (available as an e-book through your Biola Library account). Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse Chapter 1 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse this Jupyter Notebook tutorial.
- Browse this general introduction to the key tools in the SciPy ecosystem.
**Submission**: Submit your reading report through Biola Canvas.

**Thoughts about the Project: Rock-Paper-Scissors as an example**

**Sample agents #1 and #2**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program asks you to play 100 matches against the agent and writes a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder.

**A simple learning project**: Apply machine learning techniques to build predictive models that predict the agent's actions based on the transcripts of previous interactions.

**Presentation in Biola CBR or elsewhere**: A paper based on student work from CSCI 480 on Mining the Social Web in 2017.

*************************************************************************************************

**Week 2**. Intro to Python and SciPy (I) **|** Machine Learning: Decision Trees

**Presentation #1** (10-20 minutes per person): Tuesday, Jan. 29

- Topics: Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Jacob: Basic Python Semantics (Variables and Objects), Basic Python Semantics (Operators), Built-In Types, Control Flow
- Josiah:
- Stephen: Functions, Iterators
- Trystan: List Comprehensions, Generators

**Lab #2 (Entropy, information gain, and numpy basics)**:

**Exploration**: Play with the Jupyter notebook Lab2.ipynb to see how you may calculate the entropy of a given distribution in a NumPy array.
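For a preview of what the notebook does, here is a minimal sketch (the function name `entropy` and the sample counts are illustrative, not taken from Lab2.ipynb):

```python
import numpy as np

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts or probabilities."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()          # normalize to probabilities
    p = p[p > 0]             # drop zeros so log2 stays finite
    return -np.sum(p * np.log2(p))

# 9 "yes" and 5 "no" instances, as in the book's toy weather data
print(entropy([9, 5]))       # ≈ 0.940 bits
```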

**Experiment**: Examine and expand the Jupyter notebook Lab2.ipynb to do the following: (i) Calculate the entropy and the information gain shown in Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* regarding the original toy weather data set in the book for determining the root of the decision tree. (ii) Instead of the original toy weather data set, see this csv file for a modified weather data set. Repeat the steps above to determine the entropy and the information gain for the modified weather data set and the root of its decision tree.

**Submission**: Upload your expanded Lab2.ipynb through Biola Canvas to show your work.

**Reading 2**: Report due: Thursday Jan. 31

- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Read Sections 2.1-2.2 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account) on the concepts of entropy and information gain and their application in the construction of decision trees for machine learning.

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 3**. NumPy (I) **|** Machine Learning: Naïve Bayes

**Presentation #2** (10-20 minutes per person): Tuesday, Feb. 5

- Topics: Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.1-2.2 Basics of NumPy Arrays
- Trystan: 2.3 Computation on NumPy Arrays: Universal Functions
- Stephen: 2.4 Aggregations: Min, Max, and Everything In Between
- Josiah:
- Jacob: 2.6 Comparisons, Masks, and Boolean Logic

**Homework #1** (Decision tree induction based on entropy and information gain): Thursday, Feb. 7

- Note: If there is a tie in entropy reduction (i.e., information gain), break the tie arbitrarily.
- Purpose: Concepts of entropy and decision tree induction.
- Send in your work through Biola Canvas.
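On the tie-breaking note above: if you keep the candidate attributes' information gains in an array, `np.argmax` already breaks ties by taking the first maximum, which is one acceptable "arbitrary" choice (the gain values below are made up for illustration):

```python
import numpy as np

# Hypothetical information gains for four candidate attributes;
# attributes 0 and 2 tie for the maximum.
gains = np.array([0.247, 0.029, 0.247, 0.152])

best = np.argmax(gains)      # returns the first of the tied maxima
print(best)                  # → 0
```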

**Reading 3**: Report due: Thursday Feb. 7

- Read Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 4**. NumPy (II) and Pandas (I)

**Presentation #3** (10-20 minutes per person): Tuesday, Feb. 12

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 4**: Report due: Thursday Feb. 14

- Read Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.6 of *Data Mining: Practical Machine Learning Tools and Techniques* on linear models (full text of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

**Lab #3 (Finding the information gain given the distribution information stored in a 2-dimensional NumPy array)**: Thursday, Feb. 14

**Task**: Examine the Jupyter notebook Lab3.ipynb and add your own code to complete the definitions of the *infoGain* function in [4] and the *infoGain2* function in [5] so that they calculate the information gain of a selected attribute (such as "Outlook" in the weather data set) given a 2-dimensional NumPy array recording the distributions of the class attribute in the context of the selected attribute. For *infoGain2*, you should allow the use of an optional keyword argument *more* to determine whether additional information is printed.

**Submission**: Upload your expanded Lab3.ipynb through Biola Canvas to show your work.
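One possible shape for such a function (a sketch under the lab's conventions, not the official solution; the row/column orientation of the table is an assumption):

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(table, more=False):
    """Information gain of an attribute from a 2-D contingency table
    (rows = attribute values, columns = class values)."""
    total = table.sum()
    before = entropy(table.sum(axis=0))            # class entropy, attribute ignored
    after = sum(row.sum() / total * entropy(row)   # expected entropy after the split
                for row in table if row.sum() > 0)
    if more:
        print("entropy before split:", before)
        print("weighted entropy after split:", after)
    return before - after

# "Outlook" vs. "play" in the toy weather data: sunny [2,3], overcast [4,0], rainy [3,2]
outlook = np.array([[2.0, 3.0], [4.0, 0.0], [3.0, 2.0]])
print(info_gain(outlook, more=True))               # ≈ 0.247
```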

*************************************************************************************************

**Week 5**. More on NumPy (II) and Pandas (I)

**Presentation #3** continued (10-20 minutes per person): Tuesday, Feb. 19

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 5**: Report due: Thursday Feb. 21

- Revisit Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* one more time. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Homework #2** (Naïve Bayes classification): Thursday, Feb. 21

- Purpose: Naïve Bayes classification.
- You may want to revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes.
- Send in your work through Biola Canvas.

**Lab #4 (Analysis of rock-paper-scissors transcripts using NumPy)**: Thursday, Feb. 28

**Rock-paper-scissors transcripts**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program asks you to play 100 matches against the agent and writes a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder. Here is a zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For this lab, please conduct basic data analysis on the data collected in these two transcripts.

**Loading data into a NumPy array**: Create a Jupyter notebook, Lab4.ipynb. In the notebook, for each transcript in the zip file do the following: (i) use *numpy.loadtxt* to load the data into a 2-D NumPy array (for example, use *numpy.loadtxt("RPS_transcript1.txt", delimiter=',', usecols=range(0,3), dtype=np.int32)* to load the data from RPS_transcript1.txt; see more details in the documentation); (ii) determine the percentage of time **the agent** played rock, paper, and scissors respectively; (iii)-(v) determine the same percentages conditioned on **the agent** having played rock, paper, or scissors respectively in the previous match; (vi)-(viii) determine the same percentages conditioned on **the user** having played rock, paper, or scissors respectively in the previous match.

**Sample outputs**: For testing purposes, here are the sample outputs you should see.

**Hints**: There is an elegant way to make it work using Boolean masks and slicing of NumPy arrays.

**Submission of your work**: Upload your Lab4.ipynb (to show your code and results) under Canvas.
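The Boolean-mask hint can be sketched as follows, using a tiny made-up array in place of a real transcript (the column layout — agent action, user action, outcome, coded 0/1/2 — is an assumption; check it against the actual files):

```python
import numpy as np

# Stand-in for np.loadtxt("RPS_transcript1.txt", delimiter=',',
# usecols=range(0, 3), dtype=np.int32): rows are matches, column 0 is
# the agent's action (0=rock, 1=paper, 2=scissors).
data = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2],
                 [1, 0, 1],
                 [0, 2, 1]])
agent = data[:, 0]

# (ii) overall percentage of rock / paper / scissors for the agent
overall = np.array([np.mean(agent == a) for a in range(3)]) * 100
print(overall)

# (iii) the same percentages, conditioned on the agent having played
# rock in the previous match: mask the "previous" rows, slice forward.
prev_rock = agent[:-1] == 0           # matches whose predecessor was rock
following = agent[1:][prev_rock]      # the agent's action right after those
cond = np.array([np.mean(following == a) for a in range(3)]) * 100
print(cond)
```

Swapping `agent` for the user column, or `== 0` for the other actions, covers the remaining cases.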

*************************************************************************************************

**Weeks 6-7**. More on Pandas (II) | Spring Break

**Presentation #4** (10-20 minutes per person): Thursday, March 14

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join
- Jacob: 3.8 Aggregation and Grouping

**Reading 6**: Report due: Thursday Feb. 28

- Read Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Reading 7**: Report due: Thursday March 14

- Review Chapter 2 on NumPy arrays and Sections 3.4-3.8 of *Python Data Science Handbook* on Pandas. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Lab #5 (Analysis of information gain from rock-paper-scissors transcripts using NumPy)**: Thursday, March 14

**Rock-paper-scissors transcripts**: As in Lab 4, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic data analysis described below.

**Information gain**: We want to determine how much each of the agent's actions in the previous two matches and each of the user's actions in the previous two matches may affect the agent's action in the current match. Create a Jupyter notebook, Lab5.ipynb. In the notebook, for the data collected in these two transcripts, determine the information gain of knowing (i) the agent's action two matches ago, (ii) the agent's action in the previous match, (iii) the user's action two matches ago, and (iv) the user's action in the previous match, respectively, for predicting the agent's action in the current match. The results can inform us whether and how much it may help us to predict the action of the agent if we know the agent's and the user's actions in the previous two matches. **Note that you can accomplish this task conveniently based on what you learned and did for Lab 3 and Lab 4.**

**Sample outputs and a partial script**: Here (updated 5:30 pm, Feb. 28) is the zip file, in which you can find a partial script for Lab 5 and the sample outputs you should see.

**Submission of your work**: Upload your Lab5.ipynb (to show your code and results) under Canvas.
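Combining Lab 3's information-gain computation with Lab 4's slicing looks roughly like this (a sketch with a simulated action column instead of a real transcript, so the gain here will be close to zero):

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Stand-in for the agent-action column of a loaded transcript
# (0=rock, 1=paper, 2=scissors); a real run would use np.loadtxt.
rng = np.random.default_rng(0)
agent = rng.integers(0, 3, size=100)

# Contingency table: rows = previous action, columns = current action
table = np.zeros((3, 3))
np.add.at(table, (agent[:-1], agent[1:]), 1)

total = table.sum()
h_current = entropy(table.sum(axis=0))                 # entropy, history ignored
h_given_prev = sum(row.sum() / total * entropy(row)
                   for row in table if row.sum() > 0)  # expected entropy given history
gain = h_current - h_given_prev
print("information gain:", gain)
```

Shifting by two positions instead of one gives the "two matches ago" variants; using the user's column for the rows gives the user-based variants.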

*************************************************************************************************

**Weeks 8-9**. Review | Test 1 | Missions Conference

**Test #1 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using NumPy)**: Thursday, March 28

**Rock-paper-scissors transcripts**: As in Lab 4 and Lab 5, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic Naïve Bayes data analysis described below. **See the specification here. To get a concrete picture of the tasks, please carefully examine the sample partial script and the sample outputs shown here.**

**Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test1.ipynb (to show your code and results) under Canvas.

**Review**: Naïve Bayes classification

- To do well on Test 1, revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes classification. Revisit Homework 2 too.

*************************************************************************************************

**Week 10**. More on Pandas (III)

**Presentation #4** continued (10-20 minutes per person): Tuesday, April 2

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data (finished)
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join (finished)
- Jacob: 3.8 Aggregation and Grouping

**Reading 10**: Thursday, April 4

- Carefully read or review the following sections of *Python Data Science Handbook*: (i) Sections 3.1-3.3 on the fundamentals of Pandas, (ii) Section 3.8 on Aggregation and Grouping, and (iii) Section 3.9 on Pivot Tables. These sections are important for working on Lab 6.
- Download and play with the Jupyter notebook here (updated 04/04) to see how you may (i) import a Rock-Paper-Scissors transcript to create a Pandas dataframe and (ii) conveniently collect Naïve Bayes statistics using **either** *groupby* plus *unstack*, **or** *pivot_table*, **or** *crosstab*.
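All three routes produce the same conditional count table; a toy illustration (the frame and column names are made up, not the notebook's):

```python
import pandas as pd

# Toy stand-in for an imported transcript: the agent's previous and
# current actions, encoded as R / P / S.
df = pd.DataFrame({"prev_agent": ["R", "R", "P", "S", "R", "P"],
                   "agent":      ["P", "P", "S", "R", "S", "P"]})

t1 = df.groupby(["prev_agent", "agent"]).size().unstack(fill_value=0)
t2 = df.pivot_table(index="prev_agent", columns="agent",
                    aggfunc="size", fill_value=0)
t3 = pd.crosstab(df["prev_agent"], df["agent"])

print(t3)   # all three tables hold identical counts
```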

*************************************************************************************************

**Week 11**. More on Pandas (IV) | Supervised Learning: Linear Models

**Reading 11**: Thursday, April 11

- **(i)** Review Sections 3.1-3.3 and 3.8-3.9 of *Python Data Science Handbook*. **(ii)** *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account; also see the ppts of Chapter 4 here).

**Lab 6** (Naïve Bayes classification using Pandas): Thursday, April 11

**Goal**: Redo **Homework #2** (Naïve Bayes classification) using Pandas to collect the Naïve Bayes statistics from the data set in this csv file. This time you are required to create a Jupyter notebook that uses Pandas to read the csv data file into a Pandas dataframe and automate the steps you went through in **Homework #2** for collecting statistics, to (i) build a Naïve Bayes model using the dataframe, *groupby*, and *unstack* and (ii) conduct Naïve Bayes classification.

**Tasks to complete**: Please carefully examine the partial script and the sample output here (v2, updated 04/03) to understand the framework. Then fill in the code to complete 3 tasks (defining 3 functions and testing them) to accomplish the goal above.

- Send in your Jupyter notebook through Biola Canvas.
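A minimal sketch of the groupby/unstack route on a made-up two-column frame (the function and column names here are illustrative; the lab's partial script defines its own interface):

```python
import pandas as pd

# Tiny stand-in for the homework's csv data
df = pd.DataFrame({"outlook": ["sunny", "sunny", "overcast",
                               "rainy", "rainy", "overcast"],
                   "play":    ["no", "no", "yes", "yes", "no", "yes"]})

def nb_tables(df, cls="play"):
    """Class prior and per-attribute conditional probability tables."""
    prior = df[cls].value_counts(normalize=True)
    tables = {}
    for col in df.columns.drop(cls):
        counts = df.groupby([cls, col]).size().unstack(fill_value=0)
        tables[col] = counts.div(counts.sum(axis=1), axis=0)
    return prior, tables

def classify(row, prior, tables):
    """Pick the class maximizing prior * product of conditionals."""
    scores = prior.copy()
    for col, val in row.items():
        for c in scores.index:
            scores[c] *= tables[col].loc[c].get(val, 0.0)
    return scores.idxmax()

prior, tables = nb_tables(df)
print(classify({"outlook": "sunny"}, prior, tables))   # → no
```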

*************************************************************************************************

**Week 12**. Matplotlib | Supervised Learning: Linear Models

**Presentation #4** continued (10-20 minutes per person): Tuesday, April 16

- Topics: Sections 4.1-4.5 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 4.1 Simple Line Plots
- Trystan:
- Josiah: 4.3 Visualizing Errors
- Stephen: 4.4 Density and Contour Plots
- Jacob: 4.5 Histograms, Binnings, and Density

**Reading 12**: Thursday, April 18

- **(i)** Read the first 5 sections (up to the section on *Visualizing Errors*) in Chapter 4 of *Python Data Science Handbook* on Matplotlib. **(ii)** *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account; also see the ppts of Chapter 4 here).

*************************************************************************************************

**Week 13**. Machine Learning Using Scikit-learn (I)

**Reading 13**: Thursday, April 25

- Read the first 4 sections in Chapter 5 on machine learning using scikit-learn in *Python Data Science Handbook* by Jake VanderPlas.
- Examine the linear regression demo program here.

**Test #2 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using Pandas)**: Thursday, April 25

**Purpose**: We are going to implement functions for Naïve Bayes classification using Pandas dataframe objects as the main data structures.

**Tasks**: **Please carefully read the sample partial script and examine the sample outputs in the zip file here.** Please complete the implementation of the functions specified in the tasks and test your implementation to see whether you get the same results as the sample outputs.

**Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test2.ipynb (to show your code and results) under Canvas.

**Homework #3** (Linear Regression): Thursday, April 25

- Purpose: Linear regression.
- Note: You may adapt the linear regression demo program here to verify whether you got the right coefficients for this homework.
- Submit your work through Biola Canvas.
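To sanity-check coefficients the way the note suggests, scikit-learn's `LinearRegression` recovers them directly (synthetic data below; the homework's own data and coefficients come from the assignment, not from here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data with known coefficients: y = 2*x1 - x2 + 3
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # recovers ≈ [2, -1] and 3
```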

*************************************************************************************************

**Week 14**. Machine Learning Using Scikit-learn (II)

**Reading 14**: Thursday, May 2

- Read Sections 5-6 and 8 (Naïve Bayes, Linear Regression, Decision Trees) in Chapter 5 on machine learning using scikit-learn in *Python Data Science Handbook* by Jake VanderPlas.

**Test #3**: due: Thursday, May 9

- Problem set: See here (updated version posted 17:20, April 30).

- **Overview: Rock-Paper-Scissors and classification algorithms using scikit-learn**. In the previous tests, you implemented functions for Naïve Bayes classification based on NumPy and Pandas. You also applied these functions to Rock-Paper-Scissors transcripts to learn to predict the behavior of a specific agent based on either (i) the actions of the two sides in the previous two matches or (ii) the actions of the two sides in the previous match only. In this exam, you'll be given the transcripts of 3 different agents to play with. Your task is to load the agent transcripts as Pandas *dataframe* objects and use scikit-learn to apply Naïve Bayes (see Chapter 5 of *Python Data Science Handbook*) to analyze the behavior of the agents. Applying Naïve Bayes using scikit-learn, you should (i) learn a classifier for each agent and (ii) get an empirical estimate of each classifier's prediction accuracy based on its accuracy on the testing data.
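The workflow described above — fit on part of the data, estimate accuracy on the rest — looks roughly like this in scikit-learn (the feature layout, the toy cycling agent, and the choice of `CategoricalNB` for integer-coded actions are all assumptions; the Handbook chapter demonstrates `GaussianNB`):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

# Hypothetical features: both sides' actions in the previous match
# (0=rock, 1=paper, 2=scissors); target: the agent's current action.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 2))
y = (X[:, 0] + 1) % 3       # a toy agent that cycles on its own last action

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
clf = CategoricalNB().fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```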

- **Overview: Linear regression and the kernel trick using scikit-learn**. You need to use scikit-learn to apply linear regression (covered in Chapter 5 of *Python Data Science Handbook*) to conduct some basic learning tasks.
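For the kernel-trick part, one common scikit-learn pattern is linear regression on polynomial basis features via a pipeline (a sketch only; the test's actual tasks and data come from the problem set):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A quadratic target that a plain linear fit cannot capture
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = x.ravel() ** 2

# Linear regression on degree-2 polynomial features fits it exactly
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))   # ≈ [4.]
```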

- **Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Jupyter notebooks (to show your code and results) under Canvas.

*************************************************************************************************

**Links to online resources**

- About Jupyter Notebook
- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository)

*************************************************************************************************
