Decision Tree Induction Based on Entropy
and Information Gain
Table 1 below is a weather data set with 5 attributes (Outlook, Temp, Humidity, Windy, and Play) and 14 records. It is available as a csv file here too. The last attribute, Play, is the class attribute, and our goal is to find a way to classify any new record, where the values of the first 4 attributes are known, into one of the two classes Play=Yes or Play=No. In other words, we want to learn a way to predict whether Play=Yes or Play=No based on the values of the first 4 attributes of the record.
In Section 4.3 of Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition), you can find the step-by-step development of a decision tree for the weather data set in Table 1. Note that in Lab 2 you went through the computation of the information gains needed to determine the root of the decision tree.
Table 1: The original weather data set in the book

Outlook   Temp  Humidity  Windy  Play (Class)
sunny     hot   high      FALSE  no
sunny     hot   high      TRUE   no
overcast  hot   high      FALSE  yes
rainy     mild  high      FALSE  yes
rainy     cool  normal    FALSE  yes
rainy     cool  normal    TRUE   no
overcast  cool  normal    TRUE   yes
sunny     mild  high      FALSE  no
sunny     cool  normal    FALSE  yes
rainy     mild  normal    FALSE  yes
sunny     mild  normal    TRUE   yes
overcast  mild  high      TRUE   yes
overcast  hot   normal    FALSE  yes
rainy     mild  high      TRUE   no
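For reference, the entropy of the class attribute in Table 1 follows directly from its class counts (9 records with Play=yes, 5 with Play=no). A minimal sketch of that computation:

```python
import math

# Class distribution of Table 1: 9 records with Play=yes, 5 with Play=no.
p_yes, p_no = 9 / 14, 5 / 14

# Shannon entropy (base 2) of the class attribute, in bits.
H = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(H, 3))  # about 0.940 bits
```

This is the starting value from which each attribute's information gain is subtracted when choosing the root split.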
Table 2 is the revised weather data set for this homework. It is available in this csv file too.
Outlook   Temp  Humidity  Windy  Play (Class)
Sunny     hot   high      FALSE  no
Sunny     hot   high      TRUE   yes
Overcast  hot   high      FALSE  no
Rainy     mild  high      FALSE  no
Rainy     cool  normal    FALSE  yes
Rainy     cool  normal    TRUE   no
Overcast  cool  normal    TRUE   no
Sunny     mild  high      FALSE  no
Sunny     cool  normal    FALSE  yes
Rainy     mild  normal    FALSE  yes
Sunny     mild  normal    TRUE   no
Overcast  mild  high      TRUE   yes
Overcast  hot   normal    FALSE  yes
Rainy     mild  high      TRUE   yes
Exercise.
(i) Follow the same procedure demonstrated in Section 4.3 of Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition), step by step, to develop a decision tree based on the revised weather data set in Table 2 (also in this csv file), using information gain as the measure to pick attributes for classification. You do need to show the steps and the values of the information gains computed along the different branches. You may want to reuse/revise some of the NumPy scripts you wrote for Lab 2 to help you determine the values of the information gains along different branches in the process of decision-tree construction.
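As a starting point, the entropy and information-gain computations from Lab 2 could be sketched in NumPy roughly as follows. The function names and the hard-coded Outlook column below are illustrative only, not part of the assignment; the remaining attribute columns from Table 2 would be handled the same way.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """Entropy reduction obtained by splitting `labels` on `attribute`."""
    total = entropy(labels)
    values, counts = np.unique(attribute, return_counts=True)
    weighted = sum(
        (c / len(labels)) * entropy(labels[attribute == v])
        for v, c in zip(values, counts)
    )
    return total - weighted

# Play column of Table 2, in record order.
play = np.array(["no", "yes", "no", "no", "yes", "no", "no",
                 "no", "yes", "yes", "no", "yes", "yes", "yes"])

# Outlook column of Table 2, in record order.
outlook = np.array(["Sunny", "Sunny", "Overcast", "Rainy", "Rainy",
                    "Rainy", "Overcast", "Sunny", "Sunny", "Rainy",
                    "Sunny", "Overcast", "Overcast", "Rainy"])

print(information_gain(outlook, play))
```

Repeating the call for Temp, Humidity, and Windy, and then recursing on the subset of records under each branch, reproduces the gain values you need to report at every split.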
(ii) Show your decision tree and how your tree would classify the following case.
Outlook  Temp  Humidity  Windy  Play (Class)
Sunny    Cool  High      True   ???