
Instructor: Dr. Shieu-Hong Lin

Class: TR 1:30-2:45 pm at Lim 41

Office Hours: Lim 137, MW 1:30-3:30 pm and TTh 3:00-5:00 pm (reserving a slot by email in advance is encouraged).

**Submission of all your work**: go to Biola Canvas. **Your grades**: see them under Biola Canvas.

*************************************************************************************************

**Week 1**. Overview of the Landscape of (i) Machine Learning, (ii) the WEKA toolkit, and (iii) the
SciPy ecosystem for Data Science

**Lab #1: WEKA**: Report due: Thursday Jan. 24

**Exploration**: Install WEKA. Download and unzip this zip file to get the zoo data set and the Iris data set as two files in ARFF format. Open and examine the data files as text files using a text editor such as Notepad. Browse Chapter 2 of *Data Mining: Practical Machine Learning Tools and Techniques* (full-text contents of the 3rd ed. are available online through the Biola Library account) to understand the ARFF data file format used by WEKA.

**Experiment**: (i) Run WEKA and use the **Explorer** in WEKA. Under Explorer, open the zoo data set. Select the classifier J48 (under the tree section of the classifier menu) and apply it to learn a decision tree from the data set. Copy and paste the decision tree you got from WEKA into a Word document. (ii) Do the same with the Iris data set to learn a decision tree. Copy and paste the decision tree you got from WEKA into the Word document.

**Submission**: Upload the Word document through Biola Canvas to show your findings.
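For orientation before opening the zoo and Iris files: an ARFF file is a plain-text table consisting of a header that declares the relation and its attributes, followed by comma-separated data rows. A minimal made-up sketch (not the actual zoo or Iris file):

```
@relation weather-mini

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,FALSE,no
overcast,TRUE,yes
rainy,TRUE,no
```

Each `@attribute` line lists the legal values of a nominal attribute, and each row under `@data` gives one instance in attribute order.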

**Reading 1**: Report due: Thursday Jan. 24

- Download and install Anaconda on your own computer so that you can play with the Jupyter notebooks in the reading assignment.
- Read Chapter 1 of *Introduction to Machine Learning with Python* (available as an e-book through your Biola Library account). Play with the corresponding notebook (link to notebooks on GitHub).
- Browse Chapter 1 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebook (link to notebooks on GitHub).
- Browse this Jupyter Notebook tutorial.
- Browse this general introduction to the key tools in the SciPy ecosystem.

**Submission**: Submit your reading report through Biola Canvas.

**Thoughts about the Project: Rock-Paper-Scissors as an Example**

**Sample agents #1 and #2**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program will require you to play 100 matches with the agent and will yield a transcript file *RPS_transcript.txt* in the same folder recording the outcomes of these 100 matches.

**A simple learning project**: Apply machine learning techniques to build predictive models for predicting the actions of an agent based on the transcripts of previous interactions.

**Presentation in Biola CBR or elsewhere**: A paper based on student work from CSCI 480 on Mining the Social Web in 2017.

*************************************************************************************************

**Week 2**. Intro to Python and SciPy (I) **|** Machine Learning: Decision Trees

**Presentation #1** (10-20 minutes per person): Tuesday Jan. 29

- Topics: Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Jacob: Basic Python Semantics (Variables and Objects), Basic Python Semantics (Operators), Built-In Types, Control Flow
- Josiah:
- Stephen: Functions, Iterators
- Trystan: List Comprehensions, Generators

**Lab #2 (Entropy, information gain, and numpy basics)**:

**Exploration**: Play with the Jupyter notebook Lab2.ipynb to see how you may calculate the entropy of a given distribution in a numpy array.

**Experiment**: Examine and expand the Jupyter notebook Lab2.ipynb to do the following: (i) Calculate the entropy and the information gain shown in Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* regarding the original toy weather data set in the book for determining the root of the decision tree. (ii) Instead of the original toy weather data set, see this csv file for a modified weather data set. Do the same as above to determine the entropy and the information gain for the modified weather data set and the root of its decision tree.

**Submission**: Upload your expanded Lab2.ipynb through Biola Canvas to show your work.
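As a warm-up for the lab, the entropy calculation can be sketched in a few lines of numpy (a minimal illustration only; the actual Lab2.ipynb cells may be organized differently):

```python
import numpy as np

def entropy(dist):
    """Entropy (in bits) of a distribution given as counts or probabilities."""
    p = np.asarray(dist, dtype=float)
    p = p / p.sum()          # normalize counts to probabilities
    p = p[p > 0]             # 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

# The "play" attribute of the toy weather data: 9 yes, 5 no
print(entropy([9, 5]))       # ≈ 0.940 bits
```

Passing counts directly (rather than pre-normalized probabilities) keeps the function usable on the raw class counts you will read off the weather data set.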

**Reading 2**: Report due: Thursday Jan. 31

- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Read Sections 2.1-2.2 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks (link to notebooks on GitHub).
- Read Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* (full-text contents of the 3rd ed. are available online through the Biola Library account) on the concepts of entropy and information gain and their application in the construction of decision trees for machine learning.

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 3**. NumPy (I) **|** Machine Learning: Naïve Bayes

**Presentation #2** (10-20 minutes per person): Tuesday Feb. 5

- Topics: Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks (link to GitHub).
- Michael: 2.1-2.2 Basics of NumPy Arrays
- Trystan: 2.3 Computation on NumPy Arrays: Universal Functions
- Stephen: 2.4 Aggregations: Min, Max, and Everything In Between
- Josiah:
- Jacob: 2.6 Comparisons, Masks, and Boolean Logic

**Homework #1** (Decision tree induction based on entropy and information gain): Thursday Feb. 7

- Note: If there is a tie in entropy reduction (i.e., information gain), break the tie arbitrarily.
- Purpose: concepts of entropy and decision tree induction.
- Send in your work through Biola Canvas.

**Reading 3**: Report due: Thursday Feb. 7

- Read Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks (link to notebooks on GitHub).
- Read Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full-text contents of the 3rd ed. are available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 4**. NumPy (II) and Pandas (I)

**Presentation #3** (10-20 minutes per person): Tuesday Feb. 12

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks (link to GitHub).
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 4**: Report due: Thursday Feb. 14

- Read Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks (link to notebooks on GitHub).
- Read Section 4.6 of *Data Mining: Practical Machine Learning Tools and Techniques* on linear models (full-text contents of the 3rd ed. are available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

**Lab #3 (Finding the information gain given the distribution information stored in a 2-dimensional numpy array)**: Thursday Feb. 14

**Task**: Examine the Jupyter notebook Lab3.ipynb and add your own code to complete the definitions of the *infoGain* function in [4] and the *infoGain2* function in [5] so that they can calculate the information gain of a selected attribute (such as "Outlook" in the weather dataset) given a 2-dimensional numpy array recording the distributions of the class attribute in the context of the selected attribute. For *infoGain2*, you should allow the use of an optional keyword argument *more* to determine whether additional information is printed.

**Submission**: Upload your expanded Lab3.ipynb through Biola Canvas to show your work.
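One possible shape for such a function is sketched below. This is an illustration under the lab's stated conventions (rows of the 2-D array are attribute values, columns are class counts); the actual cells in Lab3.ipynb may expect a different signature or layout:

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a distribution given as counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def infoGain2(table, more=False):
    """Information gain of an attribute from a 2-D array where
    table[i, j] = number of instances with attribute value i and class j."""
    table = np.asarray(table, dtype=float)
    total = table.sum()
    before = entropy(table.sum(axis=0))        # class distribution before the split
    weights = table.sum(axis=1) / total        # fraction of instances per value
    after = np.sum([w * entropy(row) for w, row in zip(weights, table)])
    if more:                                   # optional extra printout
        print(f"entropy before split: {before:.3f}, weighted after: {after:.3f}")
    return before - after

# "Outlook" in the toy weather data: rows sunny/overcast/rainy, columns yes/no
outlook = np.array([[2, 3], [4, 0], [3, 2]])
print(infoGain2(outlook, more=True))           # ≈ 0.247 bits
```

The 0.247 figure matches the information gain for Outlook computed in the book's weather example, which makes this a convenient sanity check for your own implementation.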

*************************************************************************************************

**Week 5**. More on NumPy (II) and Pandas (I)

**Presentation #3** continued (10-20 minutes per person): Tuesday Feb. 19

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks (link to GitHub).
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 5**: Report due: Thursday Feb. 21

- Revisit Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* one more time. Play with the corresponding notebooks (link to notebooks on GitHub).

**Submit** your reading report through Biola Canvas.

**Homework #2** (Naïve Bayes classification): Thursday Feb. 21

- Purpose: Naïve Bayes classification.
- You may want to revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full-text contents of the 3rd ed. are available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes.
- Send in your work through Biola Canvas.
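As a quick self-check before the homework, the Naïve Bayes arithmetic can be carried out by hand. The sketch below works through what I believe is the new-day example from Section 4.2 of the book (outlook = sunny, temperature = cool, humidity = high, windy = true), with conditional counts taken from the toy weather data; verify the counts against the text before relying on them:

```python
# P(class | evidence) is proportional to P(class) * product of
# P(attribute value | class) over the observed attribute values.
# Class counts in the weather data: 9 yes, 5 no.
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # sunny, cool, high, true given yes
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # sunny, cool, high, true given no

total = p_yes + p_no                             # normalize to true probabilities
print(f"P(yes) = {p_yes/total:.3f}, P(no) = {p_no/total:.3f}")  # ≈ 0.205 vs 0.795
```

The same pattern, one conditional probability per attribute multiplied by the class prior, is all the homework's classification questions require.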

**Lab #4 (Analysis of rock-paper-scissors transcripts using numpy)**: Thursday Feb. 21

**Rock-paper-scissors transcripts**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program will require you to play 100 matches with the agent and will yield a transcript file *RPS_transcript.txt* in the same folder recording the outcomes of these 100 matches. Here is a zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For this lab, please conduct basic data analysis on the data collected in these two transcripts.

**Loading data into a numpy array**: Create a Jupyter notebook, Lab4.ipynb. In the notebook, for each transcript in the zip file do the following: (i) use numpy.loadtxt (example, documentation) to load the data into a 2-D numpy array; (ii) determine the percentage of time **the agent** played rock, paper, and scissors respectively; (iii)-(v) determine the same percentages restricted to the matches in which **the agent** played rock, paper, or scissors, respectively, in the previous match; (vi)-(viii) determine the same percentages restricted to the matches in which **the user** played rock, paper, or scissors, respectively, in the previous match.

**Submission of your work**: Upload your Lab4.ipynb (to show your code and results) under Canvas.
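A possible starting point for the analysis is sketched below. Note the assumptions: the column layout and 0/1/2 move coding are made up for illustration (check the real transcript format), and a small synthetic array stands in for calling np.loadtxt on the actual RPS_transcript.txt:

```python
import numpy as np

# In the lab, load the real file with: data = np.loadtxt("RPS_transcript.txt", dtype=int)
# Assumed layout (verify against the transcript!): one match per row,
# columns = (agent move, user move), coded 0 = rock, 1 = paper, 2 = scissors.
data = np.array([[0, 1], [1, 1], [0, 2], [2, 0], [0, 0]])
agent, user = data[:, 0], data[:, 1]

def move_percentages(moves):
    """Percentages of rock, paper, scissors in a 1-D array of coded moves."""
    return np.array([np.mean(moves == m) * 100 for m in (0, 1, 2)])

print(move_percentages(agent))        # (ii): the agent's overall percentages

# (iii)-(v): the agent's move conditioned on the agent's previous move;
# for (vi)-(viii), condition on user[:-1] instead.
for prev in (0, 1, 2):
    mask = agent[:-1] == prev         # matches whose predecessor had that move
    if mask.any():
        print(prev, move_percentages(agent[1:][mask]))
```

Aligning `agent[:-1]` (previous match) with `agent[1:]` (current match) via a boolean mask is the key indexing trick; it covers all six conditional questions with one loop.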

*************************************************************************************************

**Links to online resources**

- About Jupyter Notebook
- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository)

*************************************************************************************************
