Decision Tree Induction Based on Entropy and Information Gain

 

Table 1 below is a weather data set with 5 attributes (Outlook, Temp, Humidity, Windy, and Play) and 14 records. It is also available as a CSV file here. The last attribute, Play, is the class attribute. Our goal is to learn how to classify any new record, for which the values of the first 4 attributes are known, into one of the two classes Play=Yes or Play=No. In other words, we want to predict whether Play=Yes or Play=No from the values of the first 4 attributes of the record.

 

Section 4.3 of Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition) develops a decision tree step by step for the weather data set in Table 1. Note that in Lab 2 you went through the computation of the information gain needed to determine the root of the decision tree.
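For reference, the two quantities used throughout can be written as below. The symbols S (a set of records), A (an attribute), S_v (the records in S whose value of A is v), and p_c (the fraction of records in S belonging to class c) are notation chosen for this note, so check them against the notation you used in Lab 2.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

For the original weather data, which has 9 yes and 5 no records, this gives H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14), or about 0.940 bits.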

 

 

Table 1: The original weather data set in the book

 

Outlook    Temp   Humidity   Windy   Play (Class)
sunny      hot    high       FALSE   no
sunny      hot    high       TRUE    no
overcast   hot    high       FALSE   yes
rainy      mild   high       FALSE   yes
rainy      cool   normal     FALSE   yes
rainy      cool   normal     TRUE    no
overcast   cool   normal     TRUE    yes
sunny      mild   high       FALSE   no
sunny      cool   normal     FALSE   yes
rainy      mild   normal     FALSE   yes
sunny      mild   normal     TRUE    yes
overcast   mild   high       TRUE    yes
overcast   hot    normal     FALSE   yes
rainy      mild   high       TRUE    no
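As a reminder of the root-selection step for the original data in Table 1, here is a minimal NumPy sketch. The names ROWS, ATTRS, entropy, and info_gain are chosen for this illustration and are not taken from the Lab 2 scripts; the records are typed in by hand from Table 1 rather than read from the CSV file.

```python
import numpy as np

# Table 1 (original weather data); columns: Outlook, Temp, Humidity, Windy, Play
ROWS = [
    ("sunny",    "hot",  "high",   "FALSE", "no"),
    ("sunny",    "hot",  "high",   "TRUE",  "no"),
    ("overcast", "hot",  "high",   "FALSE", "yes"),
    ("rainy",    "mild", "high",   "FALSE", "yes"),
    ("rainy",    "cool", "normal", "FALSE", "yes"),
    ("rainy",    "cool", "normal", "TRUE",  "no"),
    ("overcast", "cool", "normal", "TRUE",  "yes"),
    ("sunny",    "mild", "high",   "FALSE", "no"),
    ("sunny",    "cool", "normal", "FALSE", "yes"),
    ("rainy",    "mild", "normal", "FALSE", "yes"),
    ("sunny",    "mild", "normal", "TRUE",  "yes"),
    ("overcast", "mild", "high",   "TRUE",  "yes"),
    ("overcast", "hot",  "normal", "FALSE", "yes"),
    ("rainy",    "mild", "high",   "TRUE",  "no"),
]
ATTRS = ["Outlook", "Temp", "Humidity", "Windy"]
data = np.array(ROWS)


def entropy(labels):
    """Entropy, in bits, of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def info_gain(rows, col):
    """Information gain of splitting `rows` on column `col` (class is the last column)."""
    labels = rows[:, -1]
    remainder = 0.0
    for value in np.unique(rows[:, col]):
        mask = rows[:, col] == value
        # Weight each branch's entropy by the fraction of records it receives.
        remainder += mask.mean() * entropy(labels[mask])
    return entropy(labels) - remainder


gains = {name: info_gain(data, i) for i, name in enumerate(ATTRS)}
for name, gain in gains.items():
    print(f"{name:9s} {gain:.3f}")
```

Run on Table 1, this should give gains of about 0.247 for Outlook, 0.029 for Temp, 0.152 for Humidity, and 0.048 for Windy, so Outlook is chosen as the root, in agreement with the book.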

 

 

 

 

Table 2 is the revised weather data set for this homework. It is available in this csv file too. 

 

Outlook    Temp   Humidity   Windy   Play (Class)
Sunny      hot    high       FALSE   no
Sunny      hot    high       TRUE    yes
Overcast   hot    high       FALSE   no
Rainy      mild   high       FALSE   no
Rainy      cool   normal     FALSE   yes
Rainy      cool   normal     TRUE    no
Overcast   cool   normal     TRUE    no
Sunny      mild   high       FALSE   no
Sunny      cool   normal     FALSE   yes
Rainy      mild   normal     FALSE   yes
Sunny      mild   normal     TRUE    no
Overcast   mild   high       TRUE    yes
Overcast   hot    normal     FALSE   yes
Rainy      mild   high       TRUE    yes

 

 

 

Exercise. 

 

(i) Follow, step by step, the procedure demonstrated in Section 4.3 of Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition) to develop a decision tree for the revised weather data set in Table 2 (also in this CSV file), using information gain as the measure for picking the attribute to split on. You must show the steps and the values of the information gains computed along the different branches. You may want to reuse or revise some of the NumPy scripts you wrote for Lab 2 to help you compute the information gains along different branches during decision-tree construction.
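Computing gains "along a branch" means restricting the records to those that match the attribute values already fixed on the path from the root, then recomputing the gains of the remaining attributes on that subset. The sketch below illustrates this on the original data of Table 1 (not Table 2), for the Outlook=sunny branch. CSV_TEXT is an inline stand-in for the CSV file, and its header names and column order are assumptions about the file layout; entropy, info_gain, and branch_gains are names chosen for this illustration.

```python
import io
import numpy as np

# Inline stand-in for the Table 1 CSV file (assumed layout: header row, class last).
CSV_TEXT = """Outlook,Temp,Humidity,Windy,Play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""

table = np.loadtxt(io.StringIO(CSV_TEXT), dtype=str, delimiter=",")
header, data = table[0], table[1:]


def entropy(labels):
    """Entropy, in bits, of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def info_gain(rows, col):
    """Information gain of splitting `rows` on column `col` (class is the last column)."""
    labels = rows[:, -1]
    remainder = sum(
        (rows[:, col] == v).mean() * entropy(labels[rows[:, col] == v])
        for v in np.unique(rows[:, col])
    )
    return entropy(labels) - remainder


def branch_gains(rows, header, fixed):
    """Gains of the not-yet-used attributes among the records matching `fixed`.

    `fixed` maps attribute name -> value along the branch, e.g. {"Outlook": "sunny"}.
    """
    names = [str(h) for h in header[:-1]]
    mask = np.ones(len(rows), dtype=bool)
    for name, value in fixed.items():
        mask &= rows[:, names.index(name)] == value
    subset = rows[mask]
    return {n: info_gain(subset, i) for i, n in enumerate(names) if n not in fixed}


sunny_gains = branch_gains(data, header, {"Outlook": "sunny"})
for name, gain in sunny_gains.items():
    print(f"{name:9s} {gain:.3f}")
```

For the sunny branch of Table 1 this should reproduce the values worked out in the book: about 0.571 for Temp, 0.971 for Humidity, and 0.020 for Windy. Passing a dictionary with more than one entry restricts the records to a deeper branch.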

 

(ii) Show your decision tree and how your tree would classify the following case.

Outlook    Temp   Humidity   Windy   Play (Class)
Sunny      Cool   High       True    ???
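Classifying a record with a finished tree means starting at the root, following the branch that matches the record's value for the attribute tested at each node, and stopping at a leaf. The sketch below shows the mechanics using the tree the book arrives at for the original data in Table 1; it is not the tree for Table 2, which you must build yourself. The nested-tuple representation and the names TREE and classify are choices made for this illustration.

```python
# Tree for the ORIGINAL data of Table 1 (not the revised data of Table 2).
# An internal node is (attribute, {value: subtree}); a leaf is a class label.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"false": "yes", "true": "no"}),
})


def classify(tree, record):
    """Walk from the root to a leaf and return the class label found there."""
    # Lower-case names and values so that "Sunny"/"sunny" and "True"/"TRUE" match.
    record = {key.lower(): str(value).lower() for key, value in record.items()}
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree


case = {"Outlook": "Sunny", "Temp": "Cool", "Humidity": "High", "Windy": "True"}
print(classify(TREE, case))
```

Note the case normalization: the attribute values are capitalized differently in Table 1, in Table 2, and in the case above, so a plain string comparison would fail to match them.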