Friday, 27 February 2015

Review 2.2: Modeling with Decision Trees - Build a Decision Tree

This note will talk about how to build a decision tree. Before starting, the concept of information gain should be introduced. Information gain measures the difference between the entropy of the parent set and the weighted average of the entropies of its two child sets. It can be calculated with the following formula.

IG = entropy(parent) - (n1/n) * entropy(child1) - (n2/n) * entropy(child2)

where n1 and n2 are the sizes of the two child sets and n = n1 + n2 is the size of the parent set.

The information gain of a split should be as large as possible, because the goal of a decision tree is to make the child sets as pure as they can be. Hence, a good split removes a large amount of entropy.
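As a concrete illustration, here is a minimal sketch of how entropy and information gain could be computed for a two-way split. The function names and the toy rows are my own choices for illustration; the class label is assumed to sit in the last column of each row.

from math import log2
from collections import Counter

def entropy(rows):
    # Shannon entropy of the class labels (assumed to be in the last column)
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(parent, child1, child2):
    # Parent entropy minus the size-weighted average of the child entropies
    p = len(child1) / len(parent)
    return entropy(parent) - p * entropy(child1) - (1 - p) * entropy(child2)

# A perfect split: each child set is completely pure, so the gain
# equals the full parent entropy of 1.0
parent = [['sunny', 'yes'], ['rainy', 'no'], ['sunny', 'yes'], ['rainy', 'no']]
left = [r for r in parent if r[0] == 'sunny']   # both labelled 'yes'
right = [r for r in parent if r[0] == 'rainy']  # both labelled 'no'
print(information_gain(parent, left, right))    # 1.0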

The code below shows how to build a decision tree recursively.
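A minimal, self-contained sketch of such a recursive builder, reusing the entropy and information_gain helpers from above; the DecisionNode class and the divide_set helper are names chosen here for illustration.

from collections import Counter

class DecisionNode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # column index tested at this node
        self.value = value      # value the column is compared against
        self.results = results  # class counts; set only on leaf nodes
        self.tb = tb            # branch followed when the test is true
        self.fb = fb            # branch followed when the test is false

def divide_set(rows, column, value):
    # Split rows into two sets according to the value in the given column
    if isinstance(value, (int, float)):
        test = lambda row: row[column] >= value
    else:
        test = lambda row: row[column] == value
    set1 = [row for row in rows if test(row)]
    set2 = [row for row in rows if not test(row)]
    return set1, set2

def build_tree(rows):
    if len(rows) == 0:
        return DecisionNode()
    best_gain, best_criteria, best_sets = 0.0, None, None
    column_count = len(rows[0]) - 1  # the last column holds the class label
    for col in range(column_count):
        for value in set(row[col] for row in rows):
            set1, set2 = divide_set(rows, col, value)
            if not set1 or not set2:
                continue
            gain = information_gain(rows, set1, set2)
            if gain > best_gain:
                best_gain = gain
                best_criteria = (col, value)
                best_sets = (set1, set2)
    if best_gain > 0:
        # A useful split was found: recurse on the two child sets
        return DecisionNode(col=best_criteria[0], value=best_criteria[1],
                            tb=build_tree(best_sets[0]),
                            fb=build_tree(best_sets[1]))
    # No split reduces entropy: create a leaf holding the class counts
    return DecisionNode(results=Counter(row[-1] for row in rows))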


The code first computes the entropy of the full set of rows, then iterates over each column and splits the data on every value that appears in that column. For each candidate split it calculates the entropy of the two resulting sets and derives the information gain. It then keeps the best split, i.e. the one with the largest information gain, and recurses on the two child sets. The recursion stops when the largest achievable information gain is zero.
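For example, built on a small made-up dataset (the rows below are hypothetical, with the class label in the last column), the recursion stops as soon as no split yields a positive gain:

my_data = [
    ['slashdot',  'USA',    'yes', 18, 'None'],
    ['google',    'France', 'yes', 23, 'Premium'],
    ['digg',      'USA',    'yes', 24, 'Basic'],
    ['kiwitobes', 'France', 'yes', 23, 'Basic'],
    ['google',    'UK',     'no',  21, 'Premium'],
]
tree = build_tree(my_data)  # returns the root DecisionNode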

Here is the code to print a decision tree:
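A minimal sketch of such a printer, matching the DecisionNode class from the builder sketch above (print_tree is again an illustrative name):

def print_tree(node, indent=''):
    # Leaf node: print the class counts collected there
    if node.results is not None:
        print(indent + str(dict(node.results)))
    else:
        # Internal node: print the test, then both branches
        print(indent + 'column %d : %s?' % (node.col, node.value))
        print(indent + 'T->')
        print_tree(node.tb, indent + '  ')
        print(indent + 'F->')
        print_tree(node.fb, indent + '  ')

print_tree(tree)  # tree built from my_data above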

and the result:

This is the same result shown in graphical form, to make it clearer.
