Decision Tree Induction Based on Entropy and Information Gain
Table 1 below is a weather data set with 5 attributes (Outlook, Temp, Humidity, Windy, and Play) and 14 records. It is also available as a csv file here. The last attribute, Play, is the class attribute. Our goal is to learn how to classify any new record for which the values of the first 4 attributes are known into one of the two classes, Play=Yes or Play=No. In other words, we want to predict whether Play=Yes or Play=No from the values of the first 4 attributes of the record.
In Section 4.3 of Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition), you can find the step-by-step development of a decision tree for the weather data set in Table 1. Note that in Lab 2 you went through the computation of information gain needed to determine the root of the decision tree.
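For reference, the two quantities involved can be written as below. These are the standard definitions; the symbols S (a set of records), A (an attribute), S_v (the records of S where A takes the value v) and p_c (the fraction of S in class c) are notation introduced here, not taken from the handout. The last line works the entropy of the class column of the original weather data, which has 9 yes and 5 no records.

```latex
\begin{align*}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\
H([9, 5]) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}
\end{align*}
```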
Table 1: The original weather data set in the book
Outlook | Temp | Humidity | Windy | Play (Class)
sunny | hot | high | FALSE | no
sunny | hot | high | TRUE | no
overcast | hot | high | FALSE | yes
rainy | mild | high | FALSE | yes
rainy | cool | normal | FALSE | yes
rainy | cool | normal | TRUE | no
overcast | cool | normal | TRUE | yes
sunny | mild | high | FALSE | no
sunny | cool | normal | FALSE | yes
rainy | mild | normal | FALSE | yes
sunny | mild | normal | TRUE | yes
overcast | mild | high | TRUE | yes
overcast | hot | normal | FALSE | yes
rainy | mild | high | TRUE | no
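As a reminder of the root computation from Lab 2, here is a minimal sketch that computes the information gain of each attribute on the original data in Table 1. It uses only the Python standard library rather than NumPy, and the names `table_1`, `entropy` and `information_gain` are illustrative choices, not the Lab 2 scripts.

```python
from collections import Counter
from math import log2

# Columns of Table 1 (original weather data), listed in record order.
table_1 = {
    "Outlook": "sunny sunny overcast rainy rainy rainy overcast sunny sunny rainy sunny overcast overcast rainy".split(),
    "Temp": "hot hot hot mild cool cool cool mild cool mild mild mild hot mild".split(),
    "Humidity": "high high high high normal normal normal high normal normal normal high normal high".split(),
    "Windy": "FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE".split(),
    "Play": "no no yes yes yes no yes no yes yes yes yes yes no".split(),
}


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(attribute_values, class_values):
    """Class entropy before the split minus the weighted entropy after it."""
    n = len(class_values)
    before = entropy(list(Counter(class_values).values()))
    after = 0.0
    for value in sorted(set(attribute_values)):
        subset = [c for a, c in zip(attribute_values, class_values) if a == value]
        after += len(subset) / n * entropy(list(Counter(subset).values()))
    return before - after


play = table_1["Play"]
gains = {name: information_gain(column, play)
         for name, column in table_1.items() if name != "Play"}
for name, gain in gains.items():
    print(f"gain({name}) = {gain:.3f}")
# gain(Outlook) = 0.247
# gain(Temp) = 0.029
# gain(Humidity) = 0.152
# gain(Windy) = 0.048
```

Outlook has the largest gain, so it becomes the root for the original data, which agrees with the tree developed in the book.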
Table 2 is the revised weather data set for this homework. It is available in this csv file too.
Outlook | Temp | Humidity | Windy | Play (Class)
Sunny | hot | high | FALSE | no
Sunny | hot | high | TRUE | yes
Overcast | hot | high | FALSE | no
Rainy | mild | high | FALSE | no
Rainy | cool | normal | FALSE | yes
Rainy | cool | normal | TRUE | no
Overcast | cool | normal | TRUE | no
Sunny | mild | high | FALSE | no
Sunny | cool | normal | FALSE | yes
Rainy | mild | normal | FALSE | yes
Sunny | mild | normal | TRUE | no
Overcast | mild | high | TRUE | yes
Overcast | hot | normal | FALSE | yes
Rainy | mild | high | TRUE | yes
Exercise.
(i) Follow the same procedure demonstrated in Section 4.3 of Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition) step by step to develop a decision tree based on the revised weather data set in Table 2 (also in this csv file), using information gain as the measure for picking attributes for classification. You need to show the steps and the values of the information gains computed along the different branches. You may want to reuse or revise some of the NumPy scripts you wrote for Lab 2 to help you determine the information gains along the different branches during decision-tree construction.
(ii) Show your decision tree and how your tree would classify the following case.
Outlook | Temp | Humidity | Windy | Play (Class)
Sunny | Cool | High | True | ???
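The procedure in both parts can be rehearsed on the original data of Table 1 before applying it to Table 2. The sketch below is one possible way to do that: it picks the attribute with the largest information gain at each node, recurses into each branch, and then walks the resulting tree to classify a record. The function names (`build_tree`, `classify`) and the tuple-and-dict tree representation are assumptions made for illustration, and the code uses the standard library instead of NumPy.

```python
from collections import Counter
from math import log2

# Table 1 (ORIGINAL data, not the revised Table 2), stored column-wise.
COLUMNS = {
    "Outlook": "sunny sunny overcast rainy rainy rainy overcast sunny sunny rainy sunny overcast overcast rainy".split(),
    "Temp": "hot hot hot mild cool cool cool mild cool mild mild mild hot mild".split(),
    "Humidity": "high high high high normal normal normal high normal normal normal high normal high".split(),
    "Windy": "FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE".split(),
    "Play": "no no yes yes yes no yes no yes yes yes yes yes no".split(),
}
TARGET = "Play"
# One dict per record, e.g. {"Outlook": "sunny", ..., "Play": "no"}.
RECORDS = [dict(zip(COLUMNS, row)) for row in zip(*COLUMNS.values())]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute):
    """Gain from splitting `records` on `attribute`."""
    remainder = 0.0
    for value in sorted({r[attribute] for r in records}):
        subset = [r[TARGET] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy([r[TARGET] for r in records]) - remainder


def build_tree(records, attributes):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    labels = [r[TARGET] for r in records]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # pure node or majority vote
    best = max(attributes, key=lambda a: information_gain(records, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)


def classify(tree, record):
    """Follow the branches matching `record` until a leaf label is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree


# Gains along the Outlook = sunny branch of Table 1.
sunny = [r for r in RECORDS if r["Outlook"] == "sunny"]
sunny_gains = {a: round(information_gain(sunny, a), 3)
               for a in ("Temp", "Humidity", "Windy")}
print(sunny_gains)  # {'Temp': 0.571, 'Humidity': 0.971, 'Windy': 0.02}

tree = build_tree(RECORDS, ["Outlook", "Temp", "Humidity", "Windy"])
# The case from part (ii), written with Table 1's spelling of the values.
case = {"Outlook": "sunny", "Temp": "cool", "Humidity": "high", "Windy": "TRUE"}
prediction = classify(tree, case)
print(prediction)  # no
```

On Table 1 this yields Outlook at the root, Humidity under sunny, Windy under rainy, and a pure yes leaf under overcast, so the case above is classified as no. The tree and the classification for the revised data in Table 2 may differ and still need to be worked out with the intermediate gains shown.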