
Instructor: Dr. Shieu-Hong Lin

Class: TR 1:30-2:45 pm at Lim 41

Office Hours: Lim 137
MW 1:30-3:30 pm
TTh 3:00-5:00 pm
(Reserving a slot by email in advance is encouraged.)

**Submission of all your work**: go to Biola Canvas.

**Your grades**: see them under Biola Canvas.

*************************************************************************************************

**Week 1**. Overview of the Landscape of (i) Machine Learning, (ii) the WEKA toolkit, and (iii) the
SciPy ecosystem for Data Science

**Lab #1 (WEKA)**: Report due: Thursday Jan. 24

**Exploration**: Install WEKA. Download and unzip this zip file to get the zoo data set and the Iris data set as two files in ARFF format. Open and examine the data files as text files using a text editor such as Notepad. Browse Chapter 2 of *Data Mining: Practical Machine Learning Tools and Techniques* (full-text contents of the 3rd ed. available online through the Biola Library account) to understand the ARFF data file format used by WEKA.

**Experiment**: (i) Run WEKA and use the **Explorer** in WEKA. Under Explorer, open the zoo data set. Select the classifier J48 (under the tree section of the classifier menu) and apply it to learn a decision tree from the data set. Copy and paste the decision tree you got from WEKA into a Word document. (ii) Do the same with the Iris data set to learn a decision tree. Copy and paste the decision tree you got from WEKA into the Word document.

**Submission**: Upload the Word document through Biola Canvas to show your findings.
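For orientation before you open the files: an ARFF file pairs a plain-text header declaring the relation and its attributes with comma-separated data rows. A minimal sketch in the style of the Iris data set (the exact attribute names in the downloaded file may differ):

```text
% Minimal ARFF sketch -- illustrative, not the full Iris file
@relation iris

@attribute sepallength numeric
@attribute sepalwidth  numeric
@attribute petallength numeric
@attribute petalwidth  numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
```

Lines starting with `%` are comments; nominal attributes list their allowed values in braces, and the class attribute is conventionally declared last.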

**Reading 1**: Report due: Thursday Jan. 24

- Download and install Anaconda on your own computer so that you can play with the Jupyter notebooks in the reading assignment.
- Read Chapter 1 of *Introduction to Machine Learning with Python* (available as an e-book through your Biola Library account). Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse Chapter 1 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse this Jupyter Notebook tutorial.
- Browse this general introduction to the key tools in the SciPy ecosystem.

**Submission**: Submit your reading report through Biola Canvas.

**Thoughts about the Project: Rock-Paper-Scissors as an Example**

**Sample agents #1 and #2**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program will ask you to play 100 matches against the agent and will yield a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder.

**A simple learning project**: Apply machine learning techniques to build predictive models for predicting the actions of the agent based on the transcripts of previous interactions.

**Presentation in Biola CBR or elsewhere**: A paper based on student work from CSCI 480 on Mining the Social Web in 2017.

*************************************************************************************************

**Week 2**. Intro to Python and SciPy (I) **|** Machine Learning: Decision Trees

**Presentation #1** (10-20 minutes each person): Tuesday Jan. 29

- Topics: Python tutorials from *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Jacob: Basic Python Semantics (Variables and Objects), Basic Python Semantics (Operators), Built-In Types, Control Flow
- Josiah:
- Stephen: Functions, Iterators
- Trystan: List Comprehensions, Generators

**Lab #2 (Entropy, information gain, and NumPy basics)**:

**Exploration**: Play with the Jupyter notebook Lab2.ipynb to see how you may calculate the entropy of a given distribution stored in a NumPy array.

**Experiment**: Examine and expand the Jupyter notebook Lab2.ipynb to do the following: (i) Calculate the entropy and the information gain shown in Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* regarding the original toy weather data set in the book for determining the root of the decision tree. (ii) Instead of the original toy weather data set, see this csv file for a modified weather data set. Do the same as above to determine the entropy and the information gain for the modified weather data set and the root of the decision tree.

**Submission**: Upload your expanded Lab2.ipynb through Biola Canvas to show your work.
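For reference, the entropy of a distribution p is H(p) = -Σ pᵢ log₂ pᵢ. A minimal NumPy sketch of the calculation (the function name is illustrative; the notebook's own code may differ):

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a distribution given as an array of counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()     # normalize counts to probabilities
    p = p[p > 0]                  # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

# Class counts (9 "yes" days, 5 "no" days) of the book's toy weather data set:
print(entropy([9, 5]))            # ≈ 0.940 bits
```

A uniform two-way split gives exactly 1 bit, and a pure distribution gives 0, so values for two classes always land in [0, 1].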

**Reading 2**: Report due: Thursday Jan. 31

- Python tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Read Sections 2.1-2.2 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* (full-text contents of the 3rd ed. available online through the Biola Library account) on the concepts of entropy and information gain and their application in the construction of decision trees for machine learning.

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 3**. NumPy (I) **|** Machine Learning: Naïve Bayes

**Presentation #2** (10-20 minutes each person): Tuesday Feb. 5

- Topics: Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.1-2.2 Basics of NumPy Arrays
- Trystan: 2.3 Computation on NumPy Arrays: Universal Functions
- Stephen: 2.4 Aggregations: Min, Max, and Everything In Between
- Josiah:
- Jacob: 2.6 Comparisons, Masks, and Boolean Logic

**Homework #1** (Decision tree induction based on entropy and information gain): Thursday Feb. 7

- Note: If there is a tie in entropy reduction (i.e. information gain), break the tie arbitrarily.
- Purpose: Concepts about entropy and decision tree induction.
- Send in your work through Biola Canvas.
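Picking the root attribute comes down to computing the information gain of each candidate and taking the maximum. A hedged sketch under the book's toy weather data set (only two of the four attributes shown for brevity; `max` breaks ties by first occurrence, which is one acceptable arbitrary rule):

```python
import numpy as np

def entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(table):
    """Information gain of an attribute from a 2-D contingency table:
    rows = attribute values, columns = class counts (yes, no)."""
    table = np.asarray(table, dtype=float)
    total = table.sum()
    before = entropy(table.sum(axis=0))   # entropy of the class distribution
    after = sum((row.sum() / total) * entropy(row) for row in table)
    return before - after

# Contingency tables from the book's toy weather data set (class = play yes/no):
tables = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],  # rows: sunny, overcast, rainy
    "Windy":   [[6, 2], [3, 3]],          # rows: false, true
}
gains = {a: info_gain(t) for a, t in tables.items()}
root = max(gains, key=gains.get)          # ties broken arbitrarily (first max)
print(gains, "->", root)                  # Outlook wins with gain ≈ 0.247
```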

**Reading 3**: Report due: Thursday Feb. 7

- Read Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full-text contents of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 4**. Numpy (II) and
Pandas (I)

**Presentation #3** (10-20 minutes each person): Tuesday Feb. 12

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 4**: Report due: Thursday Feb. 14

- Read Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.6 of *Data Mining: Practical Machine Learning Tools and Techniques* on linear models (full-text contents of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

**Lab #3 (Finding information gain given the distribution information stored in a 2-dimensional NumPy array)**: Thursday Feb. 14

**Task**: Examine the Jupyter notebook Lab3.ipynb and add your own code to complete the definitions of the *infoGain* function in [4] and the *infoGain2* function in [5] such that they can calculate the information gain of a selected attribute (such as "Outlook" in the weather data set) given a 2-dimensional NumPy array recording the distributions of the class attribute in the context of the selected attribute. For *infoGain2*, you should allow the use of an optional keyword argument *more* to determine whether additional information is printed.

**Submission**: Upload your expanded Lab3.ipynb through Biola Canvas to show your work.
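As a hedged sketch of the shape such a function might take (the notebook's actual cells may differ; here *more* simply toggles printing of the intermediate quantities):

```python
import numpy as np

def entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def infoGain2(table, more=False):
    """Information gain of an attribute given a 2-D NumPy array where
    table[i, j] = count of class j among records with attribute value i."""
    table = np.asarray(table, dtype=float)
    total = table.sum()
    before = entropy(table.sum(axis=0))            # class entropy before the split
    weights = table.sum(axis=1) / total            # fraction of records per value
    after = np.sum([w * entropy(row) for w, row in zip(weights, table)])
    if more:                                       # optional extra output
        print("entropy before split:", before)
        print("weighted entropy after split:", after)
    return before - after

# "Outlook" in the toy weather data set: rows sunny/overcast/rainy, columns yes/no
print(infoGain2(np.array([[2, 3], [4, 0], [3, 2]]), more=True))   # ≈ 0.247
```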

*************************************************************************************************

**Week 5**. More on NumPy (II) and Pandas (I)

**Presentation #3** continued (10-20 minutes each person): Tuesday Feb. 19

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 5**: Report due: Thursday Feb. 21

- Revisit Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* one more time. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Homework #2** (Naïve Bayes classification): Thursday Feb. 21

- Purpose: Naïve Bayes classification.
- You may want to revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full-text contents of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes.
- Send in your work through Biola Canvas.

**Lab #4 (Analysis of rock-paper-scissors transcripts using NumPy)**: Thursday Feb. 28

**Rock-paper-scissors transcripts**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program will ask you to play 100 matches against the agent and will yield a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder. Here is a zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For this lab, please conduct basic data analysis on the data collected in these two transcripts.

**Loading data into a NumPy array**: Create a Jupyter notebook, Lab4.ipynb. In the notebook, for each transcript in the zip file do the following: (i) use *numpy.loadtxt* to load the data into a 2-D NumPy array (for example, use *numpy.loadtxt("RPS_transcript1.txt", delimiter=',', usecols=range(0,3), dtype=np.int32)* to load the data from RPS_transcript1.txt; see more details in the documentation); (ii) determine the percentage of time **the agent** played rock, paper, and scissors respectively; (iii)-(v) determine the same percentages restricted to the matches where, in the previous match, **the agent** played rock, paper, and scissors respectively; (vi)-(viii) determine the same percentages restricted to the matches where, in the previous match, **the user** played rock, paper, and scissors respectively.

**Sample outputs**: For your testing purposes, here are the sample outputs about the results you should see.

**Hints**: There is an elegant way to make this work using Boolean masks and slicing of NumPy arrays.

**Submission**: Upload your Lab4.ipynb (to show your code and results) through Biola Canvas.
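The Boolean-mask-and-slicing hint can be sketched as follows. The column layout and move coding below are assumptions for illustration (check your own transcript): one row per match, column 0 = agent's move, column 1 = user's move, coded 0 = rock, 1 = paper, 2 = scissors.

```python
import numpy as np

# Toy stand-in for an array loaded with numpy.loadtxt from a transcript.
data = np.array([[0, 1], [1, 1], [0, 2], [2, 0], [0, 0], [1, 2]])
agent = data[:, 0]

# (ii) overall percentages of the agent's moves
overall = np.array([(agent == m).mean() for m in range(3)]) * 100

# (iii) agent's move percentages given the agent played rock in the previous match:
# slice off the last/first rows so "previous" aligns with "current", then mask.
prev_agent, cur_agent = agent[:-1], agent[1:]
mask = prev_agent == 0                      # previous agent move was rock
after_rock = np.array([(cur_agent[mask] == m).mean() for m in range(3)]) * 100

print(overall, after_rock)
```

The same two lines (a shifted slice plus a Boolean mask) cover cases (iv)-(viii) by changing which column and which previous move the mask tests.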

*************************************************************************************************

**Weeks 6-7**.
More on Pandas (II) | Spring Break

**Presentation #4** continued (10-20 minutes each person): Thursday March 14

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join
- Jacob: 3.8 Aggregation and Grouping

**Reading 6**: Report due: Thursday Feb. 28

- Read Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Reading 7**: Report due: Thursday March 14

- Review Chapter 2 on NumPy arrays and Sections 3.4-3.8 of *Python Data Science Handbook* on Pandas. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Lab #5 (Analysis of information gain from rock-paper-scissors transcripts using NumPy)**: Thursday March 14

**Rock-paper-scissors transcripts**: As in Lab 4, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the data analysis described below.

**Information gain**: We want to determine how much each of the agent's actions in the previous two matches and each of the user's actions in the previous two matches may affect the agent's action in the current match. Create a Jupyter notebook, Lab5.ipynb. In the notebook, for the data collected in these two transcripts, determine the information gain of knowing (i) the agent's action two matches ago, (ii) the agent's action in the previous match, (iii) the user's action two matches ago, and (iv) the user's action in the previous match, respectively, for predicting the agent's action in the current match. The results can inform us whether and how much it helps to predict the action of the agent if we know the agent's and the user's actions in the previous two matches. Note that you can accomplish this task conveniently based on what you learned and did for Lab 3 and Lab 4.

**Sample outputs and a partial script**: here (updated 5:30 pm, Feb. 28) is the zip file, in which you can find a partial script for Lab 5 and the sample outputs about the results you should see.

**Submission**: Upload your Lab5.ipynb (to show your code and results) through Biola Canvas.
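Combining Lab 3 and Lab 4, one way to compute the gain of a lagged move is to build a contingency table from the shifted column and reuse the entropy computation. A sketch under an assumed coding (0 = rock, 1 = paper, 2 = scissors; the toy array below stands in for a real transcript):

```python
import numpy as np

def entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain_lagged(prev, cur, k=3):
    """Information gain of a lagged move (prev) for predicting the current
    move (cur), moves coded 0..k-1. Builds a k-by-k contingency table and
    applies the usual information-gain computation, as in Lab 3."""
    table = np.zeros((k, k))
    np.add.at(table, (prev, cur), 1)     # table[i, j] = count of (prev=i, cur=j)
    before = entropy(table.sum(axis=0))
    weights = table.sum(axis=1) / table.sum()
    after = np.sum([w * entropy(row) for w, row in zip(weights, table) if w > 0])
    return before - after

# A toy, perfectly cyclic agent: its previous move fully determines the next,
# so the information gain is high (equal to the class entropy).
agent = np.array([0, 1, 2, 0, 1, 2, 0, 1])
print(info_gain_lagged(agent[:-1], agent[1:]))
```

For "two matches ago", shift by two instead of one: `info_gain_lagged(agent[:-2], agent[2:])`.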

*************************************************************************************************

**Weeks 8-9**.
More on Pandas (III) | Test 1 | Missions Conference

**Presentation #5** (10-20 minutes each person): TBA

**Test #1 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using NumPy)**: Thursday March 28

**Rock-paper-scissors transcripts**: As in Lab 4 and Lab 5, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic Naïve Bayes data analysis described in the specification here. To get a concrete picture of the tasks, please carefully examine a sample partial script and the sample outputs shown here.

**Submission**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test1.ipynb (to show your code and results) through Biola Canvas.

**Review**: Naïve Bayes classification

- To do well on Test 1, revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full-text contents of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes classification. Revisit Homework 2 as well.

*************************************************************************************************

**Links to online resources**

- About Jupyter Notebook
- Python tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository)

*************************************************************************************************
