
Instructor: Dr. Shieu-Hong Lin

Class: TR 1:30-2:45 pm at Lim 41

Office Hours: Lim 137, MW 1:30-3:30 pm and T/Th 3:00-5:00 pm

**Submission of all your work**: go to Biola Canvas

**Your grades**: see them under Biola Canvas

*************************************************************************************************

**Week 1**. Overview of the Landscape of (i) Machine Learning, (ii) the WEKA toolkit, and (iii) the SciPy ecosystem for Data Science

**Lab #1: WEKA**: Report due: Thursday, Jan. 24

**Exploration**: Install WEKA. Download and unzip this zip file to get the zoo data set and the Iris data set as two files in ARFF format. Open and examine the data files as text files using a text editor such as Notepad. Browse Chapter 2 of *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account) to understand the ARFF data file format used by WEKA.

**Experiment**: (i) Run WEKA and use the **Explorer**. In the Explorer, open the zoo data set. Select the classifier J48 (under the trees section of the classifier menu) and apply it to learn a decision tree from the data set. Copy and paste the decision tree you got from WEKA into a Word document. (ii) Do the same with the Iris data set to learn a decision tree, and copy and paste that tree into the same Word document.

**Submission**: Upload the Word document through Biola Canvas to show your findings.
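For orientation, an ARFF file is plain text: a header declaring the relation and its attributes, then a `@data` section of comma-separated rows. A minimal sketch in the style of the Iris data set (abridged; the real file has 150 rows):

```
@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1, 3.5, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
```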

**Reading 1**: Report due: Thursday Jan. 24

- Download and install Anaconda on your own computer so that you can play with the Jupyter notebooks in the reading assignment.
- Read Chapter 1 of *Introduction to Machine Learning with Python* (available as an e-book through your Biola Library account). Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse Chapter 1 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebook. (Link to notebooks on GitHub.)
- Browse this Jupyter Notebook tutorial.
- Browse this general introduction to the key tools in the SciPy ecosystem.
**Submission**: Submit your reading report through Biola Canvas.

**Thoughts about the Project: Rock-Paper-Scissors as an example**

**Sample agents #1 and #2**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program asks you to play 100 matches against the agent and writes a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder.

**A simple learning project**: Apply machine learning techniques to build predictive models that predict the agent's actions based on the transcripts of previous interactions.

**Presentation in Biola CBR or elsewhere**: A paper based on student work from CSCI 480 on Mining the Social Web in 2017.

*************************************************************************************************

**Week 2**. Intro to Python and SciPy (I) **|** Machine Learning: Decision Trees

**Presentation #1** (10-20 minutes per person): Tuesday, Jan. 29

- Topics: Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Jacob: Basic Python Semantics (Variables and Objects), Basic Python Semantics (Operators), Built-In Types, Control Flow
- Josiah:
- Stephen: Functions, Iterators
- Trystan: List Comprehensions, Generators

**Lab #2 (Entropy, information gain, and numpy basics)**:

**Exploration**: Play with the Jupyter notebook Lab2.ipynb to see how you may calculate the entropy of a given distribution in a NumPy array.
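For a preview of what the notebook does, here is a minimal sketch (the function name `entropy` and the sample counts are illustrative, not taken from Lab2.ipynb):

```python
import numpy as np

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts or probabilities."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()          # normalize to probabilities
    p = p[p > 0]             # drop zeros so log2 stays finite
    return -np.sum(p * np.log2(p))

# 9 "yes" and 5 "no" instances, as in the book's toy weather data
print(entropy([9, 5]))       # ≈ 0.940 bits
```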

**Experiment**: Examine and expand the Jupyter notebook Lab2.ipynb to do the following: (i) Calculate the entropy and the information gain shown in Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* regarding the original toy weather data set in the book for determining the root of the decision tree. (ii) Instead of the original toy weather data set, see this csv file for a modified weather data set. Repeat the steps above to determine the entropy and the information gain for the modified weather data set and the root of its decision tree.

**Submission**: Upload your expanded Lab2.ipynb through Biola Canvas to show your work.

**Reading 2**: Report due: Thursday Jan. 31

- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository).
- Read Sections 2.1-2.2 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.3 of *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account) on the concepts of entropy and information gain and their application in the construction of decision trees for machine learning.

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 3**. NumPy (I) **|** Machine Learning: Naïve Bayes

**Presentation #2** (10-20 minutes per person): Tuesday, Feb. 5

- Topics: Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.1-2.2 Basics of NumPy Arrays
- Trystan: 2.3 Computation on NumPy Arrays: Universal Functions
- Stephen: 2.4 Aggregations: Min, Max, and Everything In Between
- Josiah:
- Jacob: 2.6 Comparisons, Masks, and Boolean Logic

**Homework #1** (Decision tree induction based on entropy and information gain): Thursday, Feb. 7

- Note: If there is a tie in entropy reduction (i.e., information gain), break the tie arbitrarily.
- Purpose: Concepts of entropy and decision tree induction.
- Send in your work through Biola Canvas.
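On the tie-breaking note above: if you keep the candidate attributes' information gains in an array, `np.argmax` already breaks ties by taking the first maximum, which is one acceptable "arbitrary" choice (the gain values below are made up for illustration):

```python
import numpy as np

# Hypothetical information gains for four candidate attributes;
# attributes 0 and 2 tie for the maximum.
gains = np.array([0.247, 0.029, 0.247, 0.152])

best = np.argmax(gains)      # returns the first of the tied maxima
print(best)                  # → 0
```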

**Reading 3**: Report due: Thursday Feb. 7

- Read Sections 2.1-2.6 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

*************************************************************************************************

**Week 4**. NumPy (II) and Pandas (I)

**Presentation #3** (10-20 minutes per person): Tuesday, Feb. 12

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 4**: Report due: Thursday Feb. 14

- Read Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)
- Read Section 4.6 of *Data Mining: Practical Machine Learning Tools and Techniques* on linear models (full text of the 3rd ed. available online through the Biola Library account).

**Submit** your reading report through Biola Canvas.

**Lab #3 (Finding the information gain given the distribution information stored in a 2-dimensional NumPy array)**: Thursday, Feb. 14

**Task**: Examine the Jupyter notebook Lab3.ipynb and add your own code to complete the definitions of the *infoGain* function in [4] and the *infoGain2* function in [5] so that they calculate the information gain of a selected attribute (such as "Outlook" in the weather data set) given a 2-dimensional NumPy array recording the distributions of the class attribute in the context of the selected attribute. For *infoGain2*, you should allow the use of an optional keyword argument *more* to determine whether additional information is printed.

**Submission**: Upload your expanded Lab3.ipynb through Biola Canvas to show your work.
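One possible shape for such a function (a sketch under the lab's conventions, not the official solution; the row/column orientation of the table is an assumption):

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(table, more=False):
    """Information gain of an attribute from a 2-D contingency table
    (rows = attribute values, columns = class values)."""
    total = table.sum()
    before = entropy(table.sum(axis=0))            # class entropy, attribute ignored
    after = sum(row.sum() / total * entropy(row)   # expected entropy after the split
                for row in table if row.sum() > 0)
    if more:
        print("entropy before split:", before)
        print("weighted entropy after split:", after)
    return before - after

# "Outlook" vs. "play" in the toy weather data: sunny [2,3], overcast [4,0], rainy [3,2]
outlook = np.array([[2.0, 3.0], [4.0, 0.0], [3.0, 2.0]])
print(info_gain(outlook, more=True))               # ≈ 0.247
```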

*************************************************************************************************

**Week 5**. More on NumPy (II) and Pandas (I)

**Presentation #3** continued (10-20 minutes per person): Tuesday, Feb. 19

- Topics: Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 2.7 Fancy Indexing
- Josiah:
- Trystan: 3.1 Introducing Pandas Objects
- Stephen: 3.2 Data Indexing and Selection
- Jacob: 3.3 Operating on Data in Pandas

**Reading 5**: Report due: Thursday Feb. 21

- Revisit Sections 2.7-2.8 and 3.1-3.3 of *Python Data Science Handbook* one more time. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Homework #2** (Naïve Bayes classification): Thursday, Feb. 21

- Purpose: Naïve Bayes classification.
- You may want to revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes.
- Send in your work through Biola Canvas.

**Lab #4 (Analysis of rock-paper-scissors transcripts using NumPy)**: Thursday, Feb. 28

**Rock-paper-scissors transcripts**: Download, unzip, and **run** rock-paper-scissors Agent #1 and Agent #2 a couple of times. Each time, the program asks you to play 100 matches against the agent and writes a transcript file *RPS_transcript.txt* recording the outcomes of these 100 matches in the same folder. Here is a zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For this lab, please conduct basic data analysis on the data collected in these two transcripts.

**Loading data into a NumPy array**: Create a Jupyter notebook, Lab4.ipynb. In the notebook, for each transcript in the zip file do the following: (i) use *numpy.loadtxt* to load the data into a 2-D NumPy array (for example, use *numpy.loadtxt("RPS_transcript1.txt", delimiter=',', usecols=range(0,3), dtype=np.int32)* to load the data from RPS_transcript1.txt; see more details in the documentation); (ii) determine the percentage of time **the agent** played rock, paper, and scissors respectively; (iii)-(v) determine the same percentages conditioned on **the agent** having played rock, paper, or scissors respectively in the previous match; (vi)-(viii) determine the same percentages conditioned on **the user** having played rock, paper, or scissors respectively in the previous match.

**Sample outputs**: For testing purposes, here are the sample outputs you should see.

**Hints**: There is an elegant way to make it work using Boolean masks and slicing of NumPy arrays.

**Submission of your work**: Upload your Lab4.ipynb (to show your code and results) under Canvas.
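The Boolean-mask hint can be sketched as follows, using a tiny made-up array in place of a real transcript (the column layout — agent action, user action, outcome, coded 0/1/2 — is an assumption; check it against the actual files):

```python
import numpy as np

# Stand-in for np.loadtxt("RPS_transcript1.txt", delimiter=',',
# usecols=range(0, 3), dtype=np.int32): rows are matches, column 0 is
# the agent's action (0=rock, 1=paper, 2=scissors).
data = np.array([[0, 1, 2],
                 [2, 2, 0],
                 [0, 1, 2],
                 [1, 0, 1],
                 [0, 2, 1]])
agent = data[:, 0]

# (ii) overall percentage of rock / paper / scissors for the agent
overall = np.array([np.mean(agent == a) for a in range(3)]) * 100
print(overall)

# (iii) the same percentages, conditioned on the agent having played
# rock in the previous match: mask the "previous" rows, slice forward.
prev_rock = agent[:-1] == 0           # matches whose predecessor was rock
following = agent[1:][prev_rock]      # the agent's action right after those
cond = np.array([np.mean(following == a) for a in range(3)]) * 100
print(cond)
```

Swapping `agent` for the user column, or `== 0` for the other actions, covers the remaining cases.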

*************************************************************************************************

**Weeks 6-7**. More on Pandas (II) | Spring Break

**Presentation #4** (10-20 minutes per person): Thursday, March 14

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join
- Jacob: 3.8 Aggregation and Grouping

**Reading 6**: Report due: Thursday Feb. 28

- Read Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Reading 7**: Report due: Thursday March 14

- Review Chapter 2 on NumPy arrays and Sections 3.4-3.8 of *Python Data Science Handbook* on Pandas. (Link to notebooks on GitHub.)

**Submit** your reading report through Biola Canvas.

**Lab #5 (Analysis of information gain from rock-paper-scissors transcripts using NumPy)**: Thursday, March 14

**Rock-paper-scissors transcripts**: As in Lab 4, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic data analysis described below.

**Information gain**: We want to determine how much each of the agent's actions in the previous two matches and each of the user's actions in the previous two matches may affect the agent's action in the current match. Create a Jupyter notebook, Lab5.ipynb. In the notebook, for the data collected in these two transcripts, determine the information gain of knowing (i) the agent's action two matches ago, (ii) the agent's action in the previous match, (iii) the user's action two matches ago, and (iv) the user's action in the previous match, respectively, for predicting the agent's action in the current match. The results can inform us whether and how much it may help us to predict the action of the agent if we know the agent's and the user's actions in the previous two matches. **Note that you can accomplish this task conveniently based on what you learned and did for Lab 3 and Lab 4.**

**Sample outputs and a partial script**: Here (updated 5:30 pm, Feb. 28) is the zip file, in which you can find a partial script for Lab 5 and the sample outputs you should see.

**Submission of your work**: Upload your Lab5.ipynb (to show your code and results) under Canvas.
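Combining Lab 3's information-gain computation with Lab 4's slicing looks roughly like this (a sketch with a simulated action column instead of a real transcript, so the gain here will be close to zero):

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Stand-in for the agent-action column of a loaded transcript
# (0=rock, 1=paper, 2=scissors); a real run would use np.loadtxt.
rng = np.random.default_rng(0)
agent = rng.integers(0, 3, size=100)

# Contingency table: rows = previous action, columns = current action
table = np.zeros((3, 3))
np.add.at(table, (agent[:-1], agent[1:]), 1)

total = table.sum()
h_current = entropy(table.sum(axis=0))                 # entropy, history ignored
h_given_prev = sum(row.sum() / total * entropy(row)
                   for row in table if row.sum() > 0)  # expected entropy given history
gain = h_current - h_given_prev
print("information gain:", gain)
```

Shifting by two positions instead of one gives the "two matches ago" variants; using the user's column for the rows gives the user-based variants.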

*************************************************************************************************

**Weeks 8-9**. Review | Test 1 | Missions Conference

**Test #1 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using NumPy)**: Thursday, March 28

**Rock-paper-scissors transcripts**: As in Lab 4 and Lab 5, examine the same zip file containing two sample transcripts generated from playing with Agent #1 and Agent #2 respectively. For each transcript in the zip file, use *numpy.loadtxt* to load the data into a 2-D NumPy array and conduct the basic Naïve Bayes data analysis described below. **See the specification here. To get a concrete picture of the tasks, please carefully examine the sample partial script and the sample outputs shown here.**

**Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test1.ipynb (to show your code and results) under Canvas.

**Review**: Naïve Bayes classification

- To do well on Test 1, revisit Section 4.2 of *Data Mining: Practical Machine Learning Tools and Techniques* on Naïve Bayes (full text of the 3rd ed. available online through the Biola Library account) to make sure you fully understand the concept of Naïve Bayes classification. Revisit Homework 2 too.

*************************************************************************************************

**Week 10**. More on Pandas (III)

**Presentation #4** continued (10-20 minutes per person): Tuesday, April 2

- Topics: Sections 3.4-3.8 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 3.4 Handling Missing Data (finished)
- Josiah:
- Trystan: 3.6 Combining Datasets: Concat and Append
- Stephen: 3.7 Combining Datasets: Merge and Join (finished)
- Jacob: 3.8 Aggregation and Grouping

**Reading 10**: Thursday, April 4

- Carefully read or review the following sections of *Python Data Science Handbook*: (i) Sections 3.1-3.3 on the fundamentals of Pandas, (ii) Section 3.8 on Aggregation and Grouping, and (iii) Section 3.9 on Pivot Tables. These sections are important for working on Lab 6.
- Download and play with the Jupyter notebook here (updated 04/04) to see how you may (i) import a Rock-Paper-Scissors transcript to create a Pandas dataframe and (ii) conveniently collect Naïve Bayes statistics using **either** *groupby* plus *unstack*, **or** *pivot_table*, **or** *crosstab*.
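All three routes produce the same conditional count table; a toy illustration (the frame and column names are made up, not the notebook's):

```python
import pandas as pd

# Toy stand-in for an imported transcript: the agent's previous and
# current actions, encoded as R / P / S.
df = pd.DataFrame({"prev_agent": ["R", "R", "P", "S", "R", "P"],
                   "agent":      ["P", "P", "S", "R", "S", "P"]})

t1 = df.groupby(["prev_agent", "agent"]).size().unstack(fill_value=0)
t2 = df.pivot_table(index="prev_agent", columns="agent",
                    aggfunc="size", fill_value=0)
t3 = pd.crosstab(df["prev_agent"], df["agent"])

print(t3)   # all three tables hold identical counts
```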

*************************************************************************************************

**Week 11**. More on Pandas (IV) | Supervised Learning: Linear Models

**Reading 11**: Thursday, April 11

- **(i)** Review Sections 3.1-3.3 and 3.8-3.9 of *Python Data Science Handbook*. **(ii)** *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account; also see the ppts of Chapter 4 here).

**Lab 6** (Naïve Bayes classification using Pandas): Thursday, April 11

**Goal**: Redo **Homework #2** (Naïve Bayes classification) using Pandas to collect the Naïve Bayes statistics from the data set in this csv file. This time you are required to create a Jupyter notebook that uses Pandas to read the csv data file into a Pandas dataframe and automate the steps you went through in **Homework #2** for collecting statistics, to (i) build a Naïve Bayes model using the dataframe, *groupby*, and *unstack* and (ii) conduct Naïve Bayes classification.

**Tasks to complete**: Please carefully examine the partial script and the sample output here (v2, updated 04/03) to understand the framework. Then fill in the code to complete 3 tasks (defining 3 functions and testing them) to accomplish the goal above.

- Send in your Jupyter notebook through Biola Canvas.
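A minimal sketch of the groupby/unstack route on a made-up two-column frame (the function and column names here are illustrative; the lab's partial script defines its own interface):

```python
import pandas as pd

# Tiny stand-in for the homework's csv data
df = pd.DataFrame({"outlook": ["sunny", "sunny", "overcast",
                               "rainy", "rainy", "overcast"],
                   "play":    ["no", "no", "yes", "yes", "no", "yes"]})

def nb_tables(df, cls="play"):
    """Class prior and per-attribute conditional probability tables."""
    prior = df[cls].value_counts(normalize=True)
    tables = {}
    for col in df.columns.drop(cls):
        counts = df.groupby([cls, col]).size().unstack(fill_value=0)
        tables[col] = counts.div(counts.sum(axis=1), axis=0)
    return prior, tables

def classify(row, prior, tables):
    """Pick the class maximizing prior * product of conditionals."""
    scores = prior.copy()
    for col, val in row.items():
        for c in scores.index:
            scores[c] *= tables[col].loc[c].get(val, 0.0)
    return scores.idxmax()

prior, tables = nb_tables(df)
print(classify({"outlook": "sunny"}, prior, tables))   # → no
```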

*************************************************************************************************

**Week 12**. Matplotlib | Supervised Learning: Linear Models

**Presentation #4** continued (10-20 minutes per person): Tuesday, April 16

- Topics: Sections 4.1-4.5 of *Python Data Science Handbook* by Jake VanderPlas. Play with the corresponding notebooks. (Link to GitHub.)
- Michael: 4.1 Simple Line Plots
- Trystan:
- Josiah: 4.3 Visualizing Errors
- Stephen: 4.4 Density and Contour Plots
- Jacob: 4.5 Histograms, Binnings, and Density

**Reading 12**: Thursday, April 18

- **(i)** Read the first 5 sections (up to the section on *Visualizing Errors*) in Chapter 4 of *Python Data Science Handbook* on Matplotlib. **(ii)** *Data Mining: Practical Machine Learning Tools and Techniques* (full text of the 3rd ed. available online through the Biola Library account; also see the ppts of Chapter 4 here).

*************************************************************************************************

**Week 13**. Machine Learning Using Scikit-learn (I)

**Reading 13**: Thursday, April 25

- Read the first 4 sections in Chapter 5 on machine learning using scikit-learn in *Python Data Science Handbook* by Jake VanderPlas.
- Examine the linear regression demo program here.

**Test #2 (Develop and use Naïve Bayes classifiers based on rock-paper-scissors transcripts using Pandas)**: Thursday, April 25

**Purpose**: We are going to implement functions for Naïve Bayes classification using Pandas dataframe objects as the main data structures.

**Tasks**: **Please carefully read the sample partial script and examine the sample outputs in the zip file here.** Please complete the implementation of the functions specified in the tasks and test your implementation to see whether you get the same results as the sample outputs.

**Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Test2.ipynb (to show your code and results) under Canvas.

**Homework #3** (Linear Regression): Thursday, April 25

- Purpose: Linear regression.
- Note: You may adapt the linear regression demo program here to verify whether you got the right coefficients for this homework.
- Submit your work through Biola Canvas.
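To sanity-check coefficients the way the note suggests, scikit-learn's `LinearRegression` recovers them directly (synthetic data below; the homework's own data and coefficients come from the assignment, not from here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data with known coefficients: y = 2*x1 - x2 + 3
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # recovers ≈ [2, -1] and 3
```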

*************************************************************************************************

**Week 14**. Machine Learning Using Scikit-learn (II)

**Reading 14**: Thursday, May 2

- Read Sections 5-6 and 8 (Naïve Bayes, Linear Regression, Decision Trees) in Chapter 5 on machine learning using scikit-learn in *Python Data Science Handbook* by Jake VanderPlas.

**Test #3**: due: Thursday, May 9

- Problem set: See here (updated version posted 17:20, April 30).

- **Overview: Rock-Paper-Scissors and classification algorithms using scikit-learn**. In the previous tests, you implemented functions for Naïve Bayes classification based on NumPy and Pandas. You also applied these functions to Rock-Paper-Scissors transcripts to learn to predict the behavior of a specific agent based on either (i) the actions of the two sides in the previous two matches or (ii) the actions of the two sides in the previous match only. In this exam, you'll be given the transcripts of 3 different agents to play with. Your task is to load the agent transcripts as Pandas *dataframe* objects and use scikit-learn to apply Naïve Bayes (see Chapter 5 of *Python Data Science Handbook*) to analyze the behavior of the agents. Applying Naïve Bayes using scikit-learn, you should (i) learn a classifier for each agent and (ii) get an empirical estimate of each classifier's prediction accuracy based on its accuracy on the testing data.
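The workflow described above — fit on part of the data, estimate accuracy on the rest — looks roughly like this in scikit-learn (the feature layout, the toy cycling agent, and the choice of `CategoricalNB` for integer-coded actions are all assumptions; the Handbook chapter demonstrates `GaussianNB`):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

# Hypothetical features: both sides' actions in the previous match
# (0=rock, 1=paper, 2=scissors); target: the agent's current action.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 2))
y = (X[:, 0] + 1) % 3       # a toy agent that cycles on its own last action

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
clf = CategoricalNB().fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```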

- **Overview: Linear regression and the kernel trick using scikit-learn**. You need to use scikit-learn to apply linear regression (covered in Chapter 5 of *Python Data Science Handbook*) to conduct some basic learning tasks.
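For the kernel-trick part, one common scikit-learn pattern is linear regression on polynomial basis features via a pipeline (a sketch only; the test's actual tasks and data come from the problem set):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A quadratic target that a plain linear fit cannot capture
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = x.ravel() ** 2

# Linear regression on degree-2 polynomial features fits it exactly
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))   # ≈ [4.]
```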

- **Submission of your work**: Fill out this self-evaluation report. Upload the self-evaluation report and your Jupyter notebooks (to show your code and results) under Canvas.

*************************************************************************************************

**Links to online resources**

- About Jupyter Notebook
- Python Tutorials: *A Whirlwind Tour of Python* by Jake VanderPlas (GitHub repository)

*************************************************************************************************
