
Instructor: Dr. Shieu-Hong Lin

Class: TR 1:30-2:45 pm at Lim 41

Office Hours: Lim 137, MW 1:30-3:30 pm and T/Th 3:00-5:00 pm

**Submission of all your work**: go to Biola Canvas

**Your grades**: see them under Biola Canvas

*************************************************************************************************

**Week 1**. Overview of the Landscape of (i) Machine Learning, (ii) the WEKA toolkit, and (iii) the SciPy ecosystem for Data Science

**Lab #1: WEKA**: Report due: Thursday, Jan. 24

**Exploration**: Install WEKA. Download and unzip this zip file to get the zoo data set and the Iris data set as two files in ARFF format. Open and examine the data files as text files using a text editor such as Notepad. Browse Chapter 2 of *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account) to understand the ARFF data file format used by WEKA.

**Experiment**: (i) Run WEKA and use the **Explorer**. In the Explorer, open the zoo data set. Select the classifier J48 (under the trees section of the classifier menu) and apply it to learn a decision tree from the data set. Copy and paste the decision tree you got from WEKA into a Word document. (ii) Do the same with the Iris data set to learn a decision tree, and copy and paste that tree into the same Word document.

**Submission**: Upload the Word document through Biola Canvas to show your findings.
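For orientation, an ARFF file is plain text: a header declaring the relation and its attributes, then a `@data` section of comma-separated rows. A minimal sketch in the style of the Iris data set (abridged; the real file has 150 rows):

```
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1, 3.5, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
```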

**Reading 1**: Report due: Thursday Jan. 24

- Download and install Anaconda on your own computer so that you can play with the Jupyter notebooks in the reading assignment.
- Read Chapter 1 of *Introduction to Machine Learning with Python* (available as an e-book through your Biola Library account). Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse Chapter 1 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse this Jupyter Notebook tutorial.
- Browse this general introduction to the key tools in the SciPy ecosystem.
**Submission**: Submit your reading report through Biola Canvas.

**Thoughts about the Project: Rock-Paper-Scissors as an example**

**Sample agents #1 and #2**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program asks you to play 100 matches against the agent and writes a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder.

**A simple learning project**: Apply machine learning techniques to build predictive models that predict the agent's actions based on the transcripts of previous interactions.

**Presentation in Biola CBR or elsewhere**: A paper based on student work from CSCI 480 on Mining the Social Web in 2017.

*************************************************************************************************

**Week 2**. Intro to Python and SciPy (I) **|** Machine Learning: Decision Trees

**Presentation #1** (10-20 minutes per person): Tuesday, Jan. 29

- Topics: Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Jacob: Basic Python Semantics (Variables and Objects), Basic Python Semantics (Operators), Built-In Types, Control Flow
- Josiah:
- Stephen: Functions, Iterators
- Trystan: List Comprehensions, Generators

**Lab #2 (Entropy, information gain, and numpy basics)**:

**Exploration**: Play with the Jupyter notebook Lab2.ipynb to see how you may calculate the entropy of a given distribution in a NumPy array.
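For a preview of what the notebook does, here is a minimal sketch (the function name `entropy` and the sample counts are illustrative, not taken from Lab2.ipynb):

```python
import numpy as np

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts or probabilities."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()          # normalize to probabilities
    p = p[p > 0]             # drop zeros so log2 stays finite
    return -np.sum(p * np.log2(p))

# 9 "yes" and 5 "no" instances, as in the book's toy weather data
print(entropy([9, 5]))       # ≈ 0.940 bits
```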

**Experiment**: Examine and expand the Jupyter notebook Lab2.ipynb to do the following: (i) Calculate the entropy and the information gain shown in Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* regarding the original toy weather data set in the book for determining the root of the decision tree. (ii) Instead of the original toy weather data set, see this csv file for a modified weather data set. Repeat the steps above to determine the entropy and the information gain for the modified weather data set and the root of its decision tree.

**Submission**: Upload your expanded Lab2.ipynb through Biola Canvas to show your work.

**Reading 2**: Report due: Thursday Jan. 31

- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Read Sections 2.1-2.2 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account) on the concepts of entropy and information gain and their application in the construction of decision trees for machine learning.

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 3**. NumPy (I) **|** Machine Learning: Naïve Bayes

**Presentation #2** (10-20 minutes per person): Tuesday, Feb. 5

- Topics: Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.1-2.2 Basics of NumPy Arrays
- Trystan: 2.3 Computation on NumPy Arrays: Universal Functions
- Stephen: 2.4 Aggregations: Min, Max, and Everything In Between
- Josiah:
- Jacob: 2.6 Comparisons, Masks, and Boolean Logic

**Homework #1** (Decision tree induction based on entropy and information gain): Thursday, Feb. 7

- Note: If there is a tie in entropy reduction (i.e., information gain), break the tie arbitrarily.
- Purpose: Concepts of entropy and decision tree induction.
- Send in your work through Biola Canvas.
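On the tie-breaking note above: if you keep the candidate attributes' information gains in an array, `np.argmax` already breaks ties by taking the first maximum, which is one acceptable "arbitrary" choice (the gain values below are made up for illustration):

```python
import numpy as np

# Hypothetical information gains for four candidate attributes;
# attributes 0 and 2 tie for the maximum.
gains = np.array([0.247, 0.029, 0.247, 0.152])

best = np.argmax(gains)      # returns the first of the tied maxima
print(best)                  # → 0
```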

**Reading 3**: Report due: Thursday Feb. 7

- Read Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 4**. NumPy (II) and Pandas (I)

**Presentation #3** (10-20 minutes per person): Tuesday, Feb. 12

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 4**: Report due: Thursday Feb. 14

- Read Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.6 of *Data Mining: Practical Machine Learning Tools and Techniques* on linear models (full text of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

**Lab #3 (Finding the information gain given the distribution information stored in a 2-dimensional NumPy array)**: Thursday, Feb. 14

**Task**: Examine the Jupyter notebook Lab3.ipynb and add your own code to complete the definitions of the *infoGain* function in [4] and the *infoGain2* function in [5] so that they calculate the information gain of a selected attribute (such as "Outlook" in the weather data set) given a 2-dimensional NumPy array recording the distributions of the class attribute in the context of the selected attribute. For *infoGain2*, you should allow the use of an optional keyword argument *more* to determine whether additional information is printed.

**Submission**: Upload your expanded Lab3.ipynb through Biola Canvas to show your work.
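One possible shape for such a function (a sketch under the lab's conventions, not the official solution; the row/column orientation of the table is an assumption):

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(table, more=False):
    """Information gain of an attribute from a 2-D contingency table
    (rows = attribute values, columns = class values)."""
    total = table.sum()
    before = entropy(table.sum(axis=0))            # class entropy, attribute ignored
    after = sum(row.sum() / total * entropy(row)   # expected entropy after the split
                for row in table if row.sum() > 0)
    if more:
        print("entropy before split:", before)
        print("weighted entropy after split:", after)
    return before - after

# "Outlook" vs. "play" in the toy weather data: sunny [2,3], overcast [4,0], rainy [3,2]
outlook = np.array([[2.0, 3.0], [4.0, 0.0], [3.0, 2.0]])
print(info_gain(outlook, more=True))               # ≈ 0.247
```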

*************************************************************************************************

**Week 5**. More on NumPy (II) and Pandas (I)

**Presentation #3** continued (10-20 minutes per person): Tuesday, Feb. 19

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 5**: Report due: Thursday Feb. 21

- Revisit Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* one more time. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Homework #2** (Naïve Bayes classification): Thursday, Feb. 21

- Purpose: Naïve Bayes classification.
- You may want to revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes.
- Send in your work through Biola Canvas.

**Lab #4 (Analysis of rock-paper-scissors transcripts using NumPy)**: Thursday, Feb. 28

**Rock-paper-scissors transcripts**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program asks you to play 100 matches against the agent and writes a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder. Here is a zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For this lab, please conduct basic data analysis on the data collected in these two transcripts.

**Loading data into a NumPy array**: Create a Jupyter notebook, Lab4.ipynb. In the notebook, for each transcript in the zip file do the following: (i) use *numpy.loadtxt* to load the data into a 2-D NumPy array (for example, use *numpy.loadtxt("RPS_transcript1.txt", delimiter=',', usecols=range(0,3), dtype=np.int32)* to load the data from RPS_transcript1.txt; see more details in the documentation); (ii) determine the percentage of time **the agent** played rock, paper, and scissors respectively; (iii)-(v) determine the same percentages conditioned on **the agent** having played rock, paper, or scissors respectively in the previous match; (vi)-(viii) determine the same percentages conditioned on **the user** having played rock, paper, or scissors respectively in the previous match.

**Sample outputs**: For testing purposes, here are the sample outputs you should see.

**Hints**: There is an elegant way to make it work using Boolean masks and slicing of NumPy arrays.

**Submission of your work**: Upload your Lab4.ipynb (to show your code and results) under Canvas.
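The Boolean-mask hint can be sketched as follows, using a tiny made-up array in place of a real transcript (the column layout — agent action, user action, outcome, coded 0/1/2 — is an assumption; check it against the actual files):

```python
import numpy as np

# Stand-in for np.loadtxt("RPS_transcript1.txt", delimiter=',',
# usecols=range(0, 3), dtype=np.int32): rows are matches, column 0 is
# the agent's action (0=rock, 1=paper, 2=scissors).
data = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2],
                 [1, 0, 1],
                 [0, 2, 1]])
agent = data[:, 0]

# (ii) overall percentage of rock / paper / scissors for the agent
overall = np.array([np.mean(agent == a) for a in range(3)]) * 100
print(overall)

# (iii) the same percentages, conditioned on the agent having played
# rock in the previous match: mask the "previous" rows, slice forward.
prev_rock = agent[:-1] == 0           # matches whose predecessor was rock
following = agent[1:][prev_rock]      # the agent's action right after those
cond = np.array([np.mean(following == a) for a in range(3)]) * 100
print(cond)
```

Swapping `agent` for the user column, or `== 0` for the other actions, covers the remaining cases.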

*************************************************************************************************

**Weeks 6-7**. More on Pandas (II) | Spring Break

**Presentation #4** (10-20 minutes per person): Thursday, March 14

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join
- Jacob: 3.8 Aggregation and Grouping

**Reading 6**: Report due: Thursday Feb. 28

- Read Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Reading 7**: Report due: Thursday March 14

- Review Chapter 2 on NumPy arrays and Sections 3.4-3.8 of *Python Data Science Handbook* on Pandas. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Lab #5 (Analysis of information gain from rock-paper-scissors transcripts using NumPy)**: Thursday, March 14

**Rock-paper-scissors transcripts**: As in Lab 4, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic data analysis described below.

**Information gain**: We want to determine how much each of the agent's actions in the previous two matches and each of the user's actions in the previous two matches may affect the agent's action in the current match. Create a Jupyter notebook, Lab5.ipynb. In the notebook, for the data collected in these two transcripts, determine the information gain of knowing (i) the agent's action two matches ago, (ii) the agent's action in the previous match, (iii) the user's action two matches ago, and (iv) the user's action in the previous match, respectively, for predicting the agent's action in the current match. The results can inform us whether and how much it may help us to predict the action of the agent if we know the agent's and the user's actions in the previous two matches. **Note that you can accomplish this task conveniently based on what you learned and did for Lab 3 and Lab 4.**

**Sample outputs and a partial script**: Here (updated 5:30 pm, Feb. 28) is the zip file, in which you can find a partial script for Lab 5 and the sample outputs you should see.

**Submission of your work**: Upload your Lab5.ipynb (to show your code and results) under Canvas.
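Combining Lab 3's information-gain computation with Lab 4's slicing looks roughly like this (a sketch with a simulated action column instead of a real transcript, so the gain here will be close to zero):

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Stand-in for the agent-action column of a loaded transcript
# (0=rock, 1=paper, 2=scissors); a real run would use np.loadtxt.
rng = np.random.default_rng(0)
agent = rng.integers(0, 3, size=100)

# Contingency table: rows = previous action, columns = current action
table = np.zeros((3, 3))
np.add.at(table, (agent[:-1], agent[1:]), 1)

total = table.sum()
h_current = entropy(table.sum(axis=0))                 # entropy, history ignored
h_given_prev = sum(row.sum() / total * entropy(row)
                   for row in table if row.sum() > 0)  # expected entropy given history
gain = h_current - h_given_prev
print("information gain:", gain)
```

Shifting by two positions instead of one gives the "two matches ago" variants; using the user's column for the rows gives the user-based variants.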

*************************************************************************************************

**Weeks 8-9**. Review | Test 1 | Missions Conference

**Test #1 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using NumPy)**: Thursday, March 28

**Rock-paper-scissors transcripts**: As in Lab 4 and Lab 5, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic Naïve Bayes data analysis described below. **See the specification here. To get a concrete picture of the tasks, please carefully examine the sample partial script and the sample outputs shown here.**

**Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test1.ipynb (to show your code and results) under Canvas.

**Review**: Naïve Bayes classification

- To do well on Test 1, revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes classification. Revisit Homework 2 too.

*************************************************************************************************

**Week 10**. More on Pandas (III)

**Presentation #4** continued (10-20 minutes per person): Tuesday, April 2

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data (finished)
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join (finished)
- Jacob: 3.8 Aggregation and Grouping

**Reading 10**: Thursday, April 4

- Carefully read or review the following sections of *Python Data Science Handbook*: (i) Sections 3.1-3.3 on the fundamentals of Pandas, (ii) Section 3.8 on Aggregation and Grouping, and (iii) Section 3.9 on Pivot Tables. These sections are important for working on Lab 6.
- Download and play with the Jupyter notebook here (updated 04/04) to see how you may (i) import a Rock-Paper-Scissors transcript to create a Pandas dataframe and (ii) conveniently collect Naïve Bayes statistics using **either** *groupby* plus *unstack*, **or** *pivot_table*, **or** *crosstab*.
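All three routes produce the same conditional count table; a toy illustration (the frame and column names are made up, not the notebook's):

```python
import pandas as pd

# Toy stand-in for an imported transcript: the agent's previous and
# current actions, encoded as R / P / S.
df = pd.DataFrame({"prev_agent": ["R", "R", "P", "S", "R", "P"],
                   "agent":      ["P", "P", "S", "R", "S", "P"]})

t1 = df.groupby(["prev_agent", "agent"]).size().unstack(fill_value=0)
t2 = df.pivot_table(index="prev_agent", columns="agent",
                    aggfunc="size", fill_value=0)
t3 = pd.crosstab(df["prev_agent"], df["agent"])

print(t3)   # all three tables hold identical counts
```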

*************************************************************************************************

**Week 11**. More on Pandas (IV) | Supervised Learning: Linear Models

**Reading 11**: Thursday, April 11

- **(i)** Review Sections 3.1-3.3 and 3.8-3.9 of *Python Data Science Handbook*. **(ii)** *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account; also see the ppts of Chapter 4 here).

**Lab 6** (Naïve Bayes classification using Pandas): Thursday, April 11

**Goal**: Redo **Homework #2** (Naïve Bayes classification) using Pandas to collect the Naïve Bayes statistics from the data set in this csv file. This time you are required to create a Jupyter notebook that uses Pandas to read the csv data file into a Pandas dataframe and automate the steps you went through in **Homework #2** for collecting statistics, to (i) build a Naïve Bayes model using the dataframe, *groupby*, and *unstack* and (ii) conduct Naïve Bayes classification.

**Tasks to complete**: Please carefully examine the partial script and the sample output here (v2, updated 04/03) to understand the framework. Then fill in the code to complete 3 tasks (defining 3 functions and testing them) to accomplish the goal above.

- Send in your Jupyter notebook through Biola Canvas.
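A minimal sketch of the groupby/unstack route on a made-up two-column frame (the function and column names here are illustrative; the lab's partial script defines its own interface):

```python
import pandas as pd

# Tiny stand-in for the homework's csv data
df = pd.DataFrame({"outlook": ["sunny", "sunny", "overcast",
                               "rainy", "rainy", "overcast"],
                   "play":    ["no", "no", "yes", "yes", "no", "yes"]})

def nb_tables(df, cls="play"):
    """Class prior and per-attribute conditional probability tables."""
    prior = df[cls].value_counts(normalize=True)
    tables = {}
    for col in df.columns.drop(cls):
        counts = df.groupby([cls, col]).size().unstack(fill_value=0)
        tables[col] = counts.div(counts.sum(axis=1), axis=0)
    return prior, tables

def classify(row, prior, tables):
    """Pick the class maximizing prior * product of conditionals."""
    scores = prior.copy()
    for col, val in row.items():
        for c in scores.index:
            scores[c] *= tables[col].loc[c].get(val, 0.0)
    return scores.idxmax()

prior, tables = nb_tables(df)
print(classify({"outlook": "sunny"}, prior, tables))   # → no
```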

*************************************************************************************************

**Week 12**. Matplotlib | Supervised Learning: Linear Models

**Presentation #4** continued (10-20 minutes per person): Tuesday, April 16

- Topics: Sections 4.1-4.5 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 4.1 Simple Line Plots
- Trystan:
- Josiah: 4.3 Visualizing Errors
- Stephen: 4.4 Density and Contour Plots
- Jacob: 4.5 Histograms, Binnings, and Density

**Reading 12**: Thursday, April 18

- **(i)** Read the first 5 sections (up to the section on *Visualizing Errors*) in Chapter 4 of *Python Data Science Handbook* on Matplotlib. **(ii)** *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account; also see the ppts of Chapter 4 here).

*************************************************************************************************

**Week 13**. Machine Learning Using Scikit-learn (I)

**Reading 13**: Thursday, April 25

- Read the first 4 sections in Chapter 5 on machine learning using scikit-learn in *Python Data Science Handbook* by Jake VanderPlas.
- Examine the linear regression demo program here.

**Test #2 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using Pandas)**: Thursday, April 25

**Purpose**: We are going to implement functions for Naïve Bayes classification using Pandas dataframe objects as the main data structures.

**Tasks**: **Please carefully read the sample partial script and examine the sample outputs in the zip file here.** Please complete the implementation of the functions specified in the tasks and test your implementation to see whether you get the same results as the sample outputs.

**Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test2.ipynb (to show your code and results) under Canvas.

**Homework #3** (Linear Regression): Thursday, April 25

- Purpose: Linear regression.
- Note: You may adapt the linear regression demo program here to verify whether you got the right coefficients for this homework.
- Submit your work through Biola Canvas.
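To sanity-check coefficients the way the note suggests, scikit-learn's `LinearRegression` recovers them directly (synthetic data below; the homework's own data and coefficients come from the assignment, not from here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data with known coefficients: y = 2*x1 - x2 + 3
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # recovers ≈ [2, -1] and 3
```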

*************************************************************************************************

**Week 14**. Machine Learning Using Scikit-learn (II)

**Reading 14**: Thursday, May 2

- Read Sections 5-6 and 8 (Naïve Bayes, Linear Regression, Decision Trees) in Chapter 5 on machine learning using scikit-learn in *Python Data Science Handbook* by Jake VanderPlas.

**Test #3**: due: Thursday, May 9

- Problem set: See here (updated version posted 17:20, April 30).

- **Overview: Rock-Paper-Scissors and classification algorithms using scikit-learn**. In the previous tests, you implemented functions for Naïve Bayes classification based on NumPy and Pandas. You also applied these functions to Rock-Paper-Scissors transcripts to learn to predict the behavior of a specific agent based on either (i) the actions of the two sides in the previous two matches or (ii) the actions of the two sides in the previous match only. In this exam, you'll be given the transcripts of 3 different agents to play with. Your task is to load the agent transcripts as Pandas *dataframe* objects and use scikit-learn to apply Naïve Bayes (see Chapter 5 of *Python Data Science Handbook*) to analyze the behavior of the agents. Applying Naïve Bayes using scikit-learn, you should (i) learn a classifier for each agent and (ii) get an empirical estimate of each classifier's prediction accuracy based on its accuracy on the testing data.
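The workflow described above — fit on part of the data, estimate accuracy on the rest — looks roughly like this in scikit-learn (the feature layout, the toy cycling agent, and the choice of `CategoricalNB` for integer-coded actions are all assumptions; the Handbook chapter demonstrates `GaussianNB`):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

# Hypothetical features: both sides' actions in the previous match
# (0=rock, 1=paper, 2=scissors); target: the agent's current action.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 2))
y = (X[:, 0] + 1) % 3       # a toy agent that cycles on its own last action

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
clf = CategoricalNB().fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```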

- **Overview: Linear regression and the kernel trick using scikit-learn**. You need to use scikit-learn to apply linear regression (covered in Chapter 5 of *Python Data Science Handbook*) to conduct some basic learning tasks.
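For the kernel-trick part, one common scikit-learn pattern is linear regression on polynomial basis features via a pipeline (a sketch only; the test's actual tasks and data come from the problem set):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A quadratic target that a plain linear fit cannot capture
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = x.ravel() ** 2

# Linear regression on degree-2 polynomial features fits it exactly
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))   # ≈ [4.]
```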

- **Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Jupyter notebooks (to show your code and results) under Canvas.

*************************************************************************************************

**Links to online resources**

- About Jupyter Notebook
- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository)

*************************************************************************************************
