I work at a web hosting company, and by coincidence an interesting problem comes up in this part: imagine that a web hosting provider offers two types of service, a basic service and a premium service. For marketing purposes, the company wants to give out a number of trial accounts to potential customers. To reduce cost, it needs to pick the users who are most likely to buy one of the services.

To minimize annoyance for users, only the following information, taken from the server logs, is used.

| Referrer  | Location    | Read FAQ | Pages viewed | Service chosen |
|-----------|-------------|----------|--------------|----------------|
| Slashdot  | USA         | Yes      | 18           |                |
| Google    | France      | Yes      | 23           |                |
| Digg      | USA         | Yes      | 24           |                |
| Kiwitobes | France      | Yes      | 23           |                |
| Google    | UK          | No       | 21           |                |
| (direct)  | New Zealand | No       | 12           |                |
| (direct)  | UK          | No       | 21           |                |
| Google    | USA         | No       | 24           |                |
| Slashdot  | France      | Yes      | 19           |                |
| Digg      | USA         | No       | 18           |                |
| Google    | UK          | No       | 18           | None           |
| Kiwitobes | UK          | No       | 19           |                |
| Digg      | New Zealand | Yes      | 12           |                |
| Google    | UK          | Yes      | 18           |                |
| Kiwitobes | France      | Yes      | 19           |                |

We can store the data in a List of Lists. For example:

```scala
val data = List(
  List("Slashdot", "USA", "Yes", 18, "None"),
  List("Google", "France", "Yes", 23, "Premium")
  // ...
)
```

To decide which variable (column) best separates the outcomes, we need a function that divides the data according to a variable (column).
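A minimal sketch of such a split function, assuming the `List[List[Any]]` rows above, with a numeric comparison for number columns and equality otherwise (the name `divideSet` and the `pivot` argument are my own):

```scala
// Split rows into two groups on one column: rows whose value matches the
// pivot (>= for numbers, == otherwise) go into the first group.
def divideSet(rows: List[List[Any]], column: Int,
              pivot: Any): (List[List[Any]], List[List[Any]]) =
  rows.partition { row =>
    (row(column), pivot) match {
      case (v: Int, p: Int) => v >= p
      case (v, p)           => v == p
    }
  }

// Rows taken from the table above (outcome included only where it is known):
val sample = List(
  List("Slashdot", "USA", "Yes", 18, "None"),
  List("Google", "France", "Yes", 23, "Premium"),
  List("Google", "UK", "No", 18, "None")
)

// Split on column 2 (Read FAQ): the first group read the FAQ, the second did not.
val (readFaq, noFaq) = divideSet(sample, 2, "Yes")
```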

Here is an example of the outcomes, split on the **Read FAQ** column:

Yes: None, Premium, Basic, Basic, None, Basic, Basic

No: Premium, None, Basic, Premium, None, None, None

Looking at this result, the outcomes appear almost randomly distributed. So how do we choose the best variable? Two methods are introduced here.
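The per-value outcome lists above can be reproduced by grouping the outcome column (the last element of each row) by the Read FAQ column (index 2); a sketch over the three rows whose outcomes appear in this post:

```scala
val rows = List(
  List("Slashdot", "USA", "Yes", 18, "None"),
  List("Google", "France", "Yes", 23, "Premium"),
  List("Google", "UK", "No", 18, "None")
)

// Map each Read FAQ value ("Yes"/"No") to the list of outcomes observed with it.
val outcomesByFaq: Map[Any, List[Any]] =
  rows.groupBy(_(2)).map { case (k, g) => (k, g.map(_.last)) }
```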

The first is Gini impurity: the chance that a randomly chosen item would be labeled incorrectly if it were labeled at random according to the distribution of outcomes in the set. An example may help build the intuition; for two outcomes with probabilities p and 1 − p, the product p × (1 − p) looks like this:

0 * 1.0 = 0

0.1 * 0.9 = 0.09

0.2 * 0.8 = 0.16

0.3 * 0.7 = 0.21

0.4 * 0.6 = 0.24

0.5 * 0.5 = 0.25

So the higher the Gini impurity, the more mixed the set is, and the worse the split.
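A Gini impurity function can be sketched as follows, assuming rows shaped like the List example above with the outcome in the last position. The function name is mine, and I use the common 1 − Σ p(i)² form, which for two outcomes equals 2p(1 − p), i.e. twice the single products listed above, but ranks splits the same way:

```scala
// Gini impurity of the outcome column (last element of each row):
// the probability that two randomly drawn rows have different outcomes.
def giniImpurity(rows: List[List[Any]]): Double = {
  val total = rows.length.toDouble
  val counts = rows.groupBy(_.last).values.map(_.size)
  1.0 - counts.map { c => val p = c / total; p * p }.sum
}

val mixed = List(List("Basic"), List("Premium"))   // 50/50: maximally impure
val pure  = List(List("Premium"), List("Premium")) // all the same: impurity 0
```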

The second is entropy. In information theory, entropy measures how mixed a set is. For each outcome i:

p(i) = frequency(outcome i) = count(outcome i) / count(total rows)

Entropy = -sum of p(i) × log2(p(i)) over all outcomes

As with Gini impurity, the more mixed up the set is, the higher its entropy. If the outcomes are all the same (for example, the hosting company is lucky and every user buys the premium service), the entropy is -(1 × log2(1)) = 0, taking 0 × log2(0) = 0 by convention. Our goal in dividing the data into two new groups is to reduce the entropy.

Code:
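A sketch of the entropy function under the same row representation (base-2 logarithm via `math.log`, since Scala's `math` package has no `log2`; the names are my own):

```scala
// Entropy of the outcome column: -sum over outcomes of p(i) * log2(p(i)).
def entropy(rows: List[List[Any]]): Double = {
  val total = rows.length.toDouble
  rows.groupBy(_.last).values.map { group =>
    val p = group.size / total
    -p * (math.log(p) / math.log(2))
  }.sum
}

val mixed = List(List("Basic"), List("Premium"))   // maximally mixed: 1 bit
val pure  = List(List("Premium"), List("Premium")) // all the same: entropy 0
```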
