Friday, 27 February 2015

Review 2.1: Modeling with Decision Trees - Overview and How to compute the amount of disorder of a set

The algorithm for building decision trees introduced in part 5 is called CART (Classification and Regression Trees). Decision trees have two advantages. First, the models they produce are easy to understand and interpret. Second, they need less computation power than other classification methods such as Bayesian classifiers and neural networks. However, a disadvantage of decision trees is that they are not well suited to classification problems with a large number of categories.

I work at a web hosting company, and by coincidence an interesting problem from that domain is introduced in this part: imagine that a web hosting company provides two different types of service, a basic service and a premium service. For marketing purposes, the company wants to provide a number of trial accounts to potential customers. To reduce the cost, it needs to choose the users who are most likely to buy its services.

To minimize the annoyance to users, the following information from the server logs can be used:
Referrer     Location       Read FAQ   Pages viewed   Service chosen
Slashdot     USA            Yes        18             None
Google       France         Yes        23             Premium
Digg         USA            Yes        24             Basic
Kiwitobes    France         Yes        23             Basic
Google       UK             No         21             Premium
(direct)     New Zealand    No         12             None
(direct)     UK             No         21             Basic
Google       USA            No         24             Premium
Slashdot     France         Yes        19             None
Digg         USA            No         18             None
Google       UK             No         18             None
Kiwitobes    UK             No         19             None
Digg         New Zealand    Yes        12             Basic
Google       UK             Yes        18             Basic
Kiwitobes    France         Yes        19             Basic

We can use a List to store the data. For example:
val data: List[List[Any]] = List(
      List("Slashdot", "USA", "Yes", 18, "None"),
      List("Google", "France", "Yes", 23, "Premium"),
      // ...... the remaining rows from the table above
)

To decide which variable (column) best separates the outcomes, we first need a function that divides the data according to a variable (column), as sketched below.
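Here is a minimal sketch of such a function in Scala. The name divideSet and the List[List[Any]] row representation are my own choices for illustration: string columns are split by an equality test, numeric columns by a >= test.

def divideSet(rows: List[List[Any]], column: Int, value: Any): (List[List[Any]], List[List[Any]]) =
  value match {
    // numeric column: rows with row(column) >= value go into the first group
    case v: Int => rows.partition(row => row(column).asInstanceOf[Int] >= v)
    // categorical column: rows that match exactly go into the first group
    case v      => rows.partition(row => row(column) == v)
  }

// e.g. split on the Read FAQ column (index 2):
val (faqYes, faqNo) = divideSet(data, 2, "Yes")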

Here is an example: splitting on the Read FAQ column (as in the usage line above) gives these outcomes:
Yes: None, Premium, Basic, Basic, None, Basic, Basic, Basic
No : Premium, None, Basic, Premium, None, None, None

Looking at this result, we find that the outcomes are almost randomly distributed across the two groups. So how do we choose the best variable to split on? Two methods are introduced here.

The first one is Gini impurity: the probability that a randomly chosen item from the set would be labeled incorrectly if it were labeled at random according to the distribution of outcomes in the set. An example may help. For a set with two outcomes occurring with probabilities p and 1 - p, the impurity grows with the product p * (1 - p):
0 * 1.0 = 0
0.1 * 0.9 = 0.09
0.2 * 0.8 = 0.16
0.3 * 0.7 = 0.21
0.4 * 0.6 = 0.24
0.5 * 0.5 = 0.25
So the higher the Gini impurity, the worse the split: a 50/50 mixture is the worst case.
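A minimal sketch of this in Scala, assuming as before that the outcome is the last element of each row (the name giniImpurity is mine). For a two-outcome set it returns 1 - p*p - (1-p)*(1-p) = 2 * p * (1 - p), i.e. twice the products listed above:

def giniImpurity(rows: List[List[Any]]): Double = {
  val total = rows.size.toDouble
  // probability of each distinct outcome (the last column)
  val probs = rows.groupBy(_.last).values.map(_.size / total)
  // probability that two independent random draws disagree
  1.0 - probs.map(p => p * p).sum
}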

The second one is entropy. In information theory, it indicates how mixed a set is:
p(i) = frequency(outcome i) = count(outcome i) / count(total rows)
Entropy = -(sum of p(i) * log2(p(i)) over all outcomes)
Like Gini impurity, the more mixed up the set is, the higher its entropy. If the outcomes are all the same (for example, the hosting company is lucky and every user buys the premium service), the entropy is -(1 * log2(1)) = 0, since outcomes with probability 0 contribute nothing to the sum. Our goal in dividing the data into two new groups is to reduce the entropy.
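Under the same assumptions as the Gini sketch, an entropy sketch in Scala (log base 2):

def entropy(rows: List[List[Any]]): Double = {
  val total = rows.size.toDouble
  val probs = rows.groupBy(_.last).values.map(_.size / total)
  // -sum of p(i) * log2(p(i)); a pure set gives 0
  -probs.map(p => p * math.log(p) / math.log(2)).sum
}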

Code:
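Putting the sketches above together on the sample data (the comments give approximate values for the 15 rows in the table: 6 None, 3 Premium, 6 Basic):

val (faqYes, faqNo) = divideSet(data, 2, "Yes")

println(giniImpurity(data))  // about 0.64
println(entropy(data))       // about 1.52
println(entropy(faqYes))     // about 1.30
println(entropy(faqNo))      // about 1.38

The weighted average entropy of the two groups (about 1.34) is lower than the entropy of the whole set, so splitting on Read FAQ does reduce the disorder, though only slightly.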