## Saturday, 28 February 2015

### Review 5.6: Learn from User Clicks - Back Propagation


Here is the interesting part. The following picture shows how the back propagation algorithm flows (the calculation in the picture is meaningless; it is only there to illustrate the idea of back propagation, so ignore the numbers). The basic idea of back propagation is that the strength of the links between nodes of different layers changes according to users' feedback (the difference between the expected output and the real output).

Here is how the back propagation algorithm is implemented:
```scala
def backPropagation(targets: List[Double], howMuchToChange: Double = 0.5) {
  val outDelta = {
    for (i ← 0 until linkids.length) yield {
      // The difference between the expected and the real output
      val error = targets(i) - ao(i)

      // Scale the error by the slope of tanh at the node's output:
      // when the output is already close to 1, the links feeding it are
      // strong on average, so their strength should be changed more carefully
      dtanh(ao(i)) * error
    }
  }.toList

  val hiddenDelta = {
    for (i ← 0 until hiddenids.length) yield {
      val error = (for (j ← 0 until linkids.length) yield so(i)(j) * outDelta(j)).toList.reduceLeft(_ + _)
      dtanh(ah(i)) * error
    }
  }.toList

  // Update hidden -> output link strengths
  for (i ← 0 until hiddenids.length) {
    for (j ← 0 until linkids.length) {
      val change = outDelta(j) * ah(i)
      so(i)(j) += change * howMuchToChange
    }
  }

  // Update input -> hidden link strengths
  for (i ← 0 until wordids.length) {
    for (j ← 0 until hiddenids.length) {
      val change = hiddenDelta(j) * ai(i)
      si(i)(j) += change * howMuchToChange
    }
  }
}

// The method header below was lost in the original post;
// it is reconstructed from how train is called in main()
def train(wordids: List[Int], linkids: List[Int], selectedLink: Int) {
  setupNetwork(wordids, linkids)
  feedForword()

  // Expect 1.0 for the link the user clicked, 0.0 for the others
  val targets = {
    for (i ← 0 until linkids.length) yield {
      if (linkids(i) == selectedLink) 1.0
      else 0.0
    }
  }.toList
  backPropagation(targets)
  // Save to database
  saveNetwork()
}

// Slope of tanh, expressed in terms of the output y = tanh(x)
// When y approaches 1, the slope approaches 0
// When y approaches 0, the slope approaches 1 (45 degrees)
def dtanh(y: Double) = 1 - y * y
```

The code performs the following steps.

For each node in the output layer:
1. Calculate the difference between the node's current output and the expected value.
2. Use dtanh to determine how much the node's total input has to change (the output delta). Why dtanh? When a link is strong, its strength needs to be changed more carefully; the curve of dtanh does exactly that.
3. Change the strength of every incoming link in proportion to the sending node's activation and this delta, scaled by the learning rate.

For each node in the hidden layer:
1. Calculate the error of the hidden node by summing, over every output node, the strength of the link to it multiplied by that node's output delta.
2. Use dtanh to determine how much the node's total input has to change.
3. Change the strength of every incoming link in proportion to the sending node's activation and this delta, scaled by the learning rate.

Note that before running back propagation, it is necessary to run feed forward so that the output of each node is known.
```scala
def main(args: Array[String]) {
  createTables()

  // word ids
  val players = 101
  val football = 102
  val NBA = 103

  // link (url) ids
  val NBAPlayerIntroduction = 201
  val FootballSkills = 202
  val VideoGames = 203

  for (i ← 0 until 30) {
    train(List(players, NBA), List(NBAPlayerIntroduction, FootballSkills, VideoGames), NBAPlayerIntroduction)
    train(List(football, NBA), List(NBAPlayerIntroduction, FootballSkills, VideoGames), FootballSkills)
    train(List(players), List(NBAPlayerIntroduction, FootballSkills, VideoGames), VideoGames)
  }

  setupNetwork(List(players, NBA), List(NBAPlayerIntroduction, FootballSkills, VideoGames))
  feedForword()
  println(ao)

  setupNetwork(List(football, NBA), List(NBAPlayerIntroduction, FootballSkills, VideoGames))
  feedForword()
  println(ao)

  setupNetwork(List(NBA), List(NBAPlayerIntroduction, FootballSkills, VideoGames))
  feedForword()
  println(ao)
}
```

One piece of magic in the neural network algorithm, especially its hidden layer, is that it can make a reasonable guess even for an input query it has never seen before.

The query "NBA" by itself was never seen during training, yet the network gives the "NBAPlayerIntroduction" link a much better score than the "FootballSkills" link, even though during training "NBA" appeared just as often together with "football" as with "players". The neural network has not only learned which links are associated with which queries, it has also learned which words are important in a particular query. (The reason: the query "players" on its own was always trained to produce "VideoGames", so the network treats "players" as contributing less to the "NBAPlayerIntroduction" link, and "NBA" hence becomes the more important word.)

### Review 5.5: Learn from User Clicks - Feeding Forward


We can also use users' feedback in the form of their behaviour, e.g. which link a user clicks after the ranked results are displayed. A classical way to do this is an artificial neural network.

In neural networks, a sigmoid-shaped function such as the hyperbolic tangent (tanh) is commonly used to decide how a node should respond to its input. There are two reasons to use it. First, the result lies in [-1, 1], so it is easy to rank. Second, the output changes more and more slowly as the input gets larger. This is just what we want, because we cannot assume that every user will click on an answer that is appropriate for everyone.
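
A quick sketch of why tanh fits: its output is bounded and it responds less and less to larger inputs.

```scala
// tanh squashes any input into (-1, 1) and saturates for large inputs,
// so a pile of extra clicks cannot push a node's output without bound
def activation(sum: Double): Double = java.lang.Math.tanh(sum)

println(activation(0.5))  // moderate input, moderate response
println(activation(5.0))  // large input, output already close to 1
println(activation(50.0)) // ten times the input, output barely changes
```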

After all this, let us go deep into how the feed forward algorithm works. Here is the sample code:
```scala
def createTables() {
  val sqllist = List("create table if not exists hiddennode(create_key)",
    "create table if not exists wordhidden(fromid,toid, strength)",
    "create table if not exists hiddenlink(fromid,toid, strength)")
  execTran(sqllist)
}

// The method header below was lost in the original post;
// it is reconstructed from how setupNetwork is called elsewhere
def setupNetwork(wordids: List[Int], linkids: List[Int]) {
  this.wordids = wordids
  this.linkids = linkids
  this.hiddenids = getAllHiddenIds()

  ai = wordids.map(x ⇒ 1.0)
  ah = hiddenids.map(x ⇒ 1.0)

  si = {
    for (i ← 0 until wordids.length) yield {
      for (j ← 0 until hiddenids.length)
        yield getStrength(wordids(i), hiddenids(j), 0)
    }.toList
  }.toList
  so = {
    for (i ← 0 until hiddenids.length) yield {
      // the inner yield was lost in the original post; by symmetry
      // with si, it should read the hidden -> link strengths
      for (j ← 0 until linkids.length)
        yield getStrength(hiddenids(i), linkids(j), 1)
    }.toList
  }.toList
}

def feedForword() = {
  this.ah = {
    for (i ← 0 until hiddenids.length) yield {
      var sum = 0.0
      for (j ← 0 until wordids.length) {
        sum += ai(j) * si(j)(i)
      }
      java.lang.Math.tanh(sum)
    }
  }.toList

  this.ao = {
    for (i ← 0 until linkids.length) yield {
      var sum = 0.0
      for (j ← 0 until hiddenids.length) {
        sum += ah(j) * so(j)(i)
      }
      java.lang.Math.tanh(sum)
    }
  }.toList
}
```

The algorithm works by looping over all hidden layer nodes and adding together the outputs from the input layer multiplied by the strengths of the links. The output of each hidden node is the tanh function of the sum of its inputs, which is passed on to the output layer. The output layer does the same thing: it multiplies the outputs from the previous layer by the strengths of their links, then applies the tanh function to get the final output.

Here is an example of running the code:

Because the network has not yet been trained, it gives the same rank to all urls.

### Review 5.4: Using Inbound Links - PageRank Algorithm

The PageRank algorithm was invented by Larry Page of Google. It calculates the probability that someone randomly clicking on links will arrive at a certain page. A user will always stop clicking after a while; to capture this, the PageRank algorithm uses a damping factor, which says there is an 85% chance that a user will continue clicking on one of the links on the current page.

The picture above shows how the PageRank algorithm works: a page's PageRank value is calculated from the PageRank values of the pages linking to it and the proportion of each linking page's links that point to it.

The PageRank value of page A is then: 0.15 + 0.85 * (0.4/4 + 0.5/5 + 0.2/1) = 0.15 + 0.85 * (0.1 + 0.1 + 0.2) = 0.49

One problem is how to know the PageRank of pages B, C and D in the first place. A common method is to initialize the PageRank value of all pages to 1.0, then recalculate the values page by page for several iterations, e.g. 20 iterations.

Code snippet:

```scala
// formula: PR(A) = 0.15 + 0.85 * (PR(P1)/Links(P1) + ... + PR(Pn)/Links(Pn))
// the initial pagerank of every page is 1.0
// 0.85 is the damping factor (the probability that a person continues clicking),
// and 0.15 is the minimum value, which is (1 - 0.85)
// The more iterations, the more accurate the pagerank; 20 iterations is enough
def initDataAndCaculatePageRank(iternum: Int = 20) {
  initDataForPageRank()
  caculatePageRank(iternum)
}

def initDataForPageRank() = {
  val sqllist = List("drop table if exists pagerank",
    "create table pagerank (urlid primary key, score)",
    "insert into pagerank select rowid,1.0 from urllist")
  execTran(sqllist)
}

def caculatePageRank(iternum: Int) = {
  for (i ← 0 until iternum)
    getAllUrlID.foreach(caculatePageRankFor(_))
}

def caculatePageRankFor(urlid: String) = {
  var result = 0.15
  val dampfactor = 0.85

  val fromIdList = query("SELECT fromid FROM link WHERE toid=" + urlid) { rs ⇒
    bmap(rs.next) {
      rs.getString("fromid")
    }
  }

  fromIdList.foreach(
    fromid ⇒ {
      val score = query("SELECT score FROM pagerank WHERE urlid=" + fromid)(rs ⇒ rs.getDouble("score"))
      // the number of outbound links of the linking page; this count was
      // referenced but never computed in the original snippet
      val links = query("SELECT count(*) FROM link WHERE fromid=" + fromid)(rs ⇒ rs.getDouble(1))
      result += dampfactor * (score / links)
    }
  )
  execTran(List("UPDATE pagerank SET score=" + result + " WHERE urlid=" + urlid))
}

def getPageRankScore(rows: List[List[String]]) = {
  // scorelist was not built in the original snippet; presumably it is
  // the pagerank score of each row's urlid
  val scorelist = rows.map(row ⇒ getPageRankScoreFor(row(0)))
  normalize(scorelist)
}

def getPageRankScoreFor(urlid: String) = {
  query("SELECT score FROM pagerank WHERE urlid=" + urlid) { rs ⇒
    rs.getDouble(1)
  }
}
```

### Review 5.3: Overview of Content-Based Ranking Algorithms

Example result of the query in the previous note (w0 is for the word "c", w1 for "programming", w2 for "language"):

| urlid | w0.location | w1.location | w2.location |
|-------|-------------|-------------|-------------|
| 1     | 3           | 4           | 5           |
| 1     | 3           | 4           | 400         |

Before the birth of Google, most search engines mainly used content-based ranking algorithms and were able to give useful results. Here is a list of typical content-based ranking methods:
- Word frequency
- Word location in document
- Word distance

1. Word frequency: count how many times each of the words the user searched for appears in the webpage of a url; this count is the url's score. Calculate the scores for all urls and sort the urls by score.

2. Word location in document: if a page is relevant to a search term, the term will usually appear closer to the top of the page. For the same urlid there may be different combinations of word locations; for each combination we can sum its locations and take the smallest sum as the url's score.

3. Word distance: if the different search words appear close together in a page, the page tends to be more relevant to the search term. E.g. "c programming language is a structured high-level ..." is more relevant than "C > A+B is a formula; in the python programming language, we can express it as ..." when the user searches for "c programming language", because the distance between the words in the first one is only two.

We may want to combine all three methods to get better results. A way to do this is to normalize the scores to the range [0, 1] and give each method a weight. More about normalization:
http://hi.baidu.com/idontknow1987/blog/item/f94bac518fac8d698535248a.html
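
To make the idea concrete, here is a minimal sketch of such a combination (the function names and the `smallerIsBetter` flag are mine, not from the book's code):

```scala
// Normalize a url -> score map into [0, 1].
// For methods where a smaller raw score is better (word location, word
// distance), smallerIsBetter = true maps the best score to 1.0
def normalize(scores: Map[String, Double], smallerIsBetter: Boolean = false): Map[String, Double] = {
  val vsmall = 0.00001 // avoid division by zero
  if (smallerIsBetter) {
    val minScore = scores.values.min
    scores.map { case (url, s) ⇒ (url, minScore / math.max(vsmall, s)) }
  } else {
    val maxScore = math.max(scores.values.max, vsmall)
    scores.map { case (url, s) ⇒ (url, s / maxScore) }
  }
}

// Combine several normalized score maps, each with its own weight
def weightedScore(weighted: List[(Double, Map[String, Double])]): Map[String, Double] = {
  val urls = weighted.head._2.keys
  urls.map { url ⇒
    (url, weighted.map { case (w, scores) ⇒ w * scores(url) }.sum)
  }.toMap
}
```

For example, `weightedScore(List((1.0, freqScores), (1.5, locationScores)))` would give word location 1.5 times the influence of word frequency.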

Because I want to focus more on other page ranking algorithms, I will skip the implementation of the algorithms above.

### Review 5.2: Perform a Query in Search Engine Database


After the database for the search engine has been constructed, SQL queries are used to get the results users are looking for.

To make this easy to understand, here is an example. Suppose a user searches for "c programming language". We can guess that this user wants to browse webpages about programming in C, so the matching webpages should contain all three words. Hence we can use an SQL statement like:

```sql
SELECT w0.urlid, w0.location, w1.location, w2.location
FROM wordlocation AS w0, wordlocation AS w1, wordlocation AS w2
WHERE w0.urlid = w1.urlid
AND w1.urlid = w2.urlid
AND w0.wordid = ("c"'s rowid in table wordlist)
AND w1.wordid = ("programming"'s rowid in table wordlist)
AND w2.wordid = ("language"'s rowid in table wordlist)
```

### Review 5.1: A Simple Database Schema for Search Engine


There is already a set of ranking algorithms for search engines. Before starting to study them, I first need to prepare a database for testing. The schema of this database is shown in the following picture.

The first table, 'urllist', stores all the urls that have been indexed by the crawler. The second table, 'wordlist', has all the words extracted from those indexed pages. The third table, 'wordlocation', indicates whether the webpage of a url contains a word, and at what location the word appears in the page. The remaining two tables describe links between webpages: the 'link' table stores pairs of URL IDs, indicating a link from one webpage to another, and the 'linkwords' table uses wordid and linkid columns to store which words are actually used in a link.

### Review 4.4: Document filtering - Fisher Method

The Fisher method is another method that gives very accurate results for document filtering, particularly spam filtering. It calculates the probability of a category for each feature in a document, then combines these probabilities to test whether the document should be classified into that category.

Unlike naive Bayes, which uses Pr(feature | category) to calculate Pr(document | category) and then derives Pr(category | document) from it, the Fisher method calculates Pr(category | feature) first:

After we have Pr(category | feature) for every feature, we can calculate the Fisher probability with the following procedure: multiply all the probabilities together, take the natural log (math.log in Python), and multiply the result by -2.

Finally, we can specify a lower bound for each category, and use the fisherprob of the different categories to classify items.
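
A minimal sketch of that combining step (the helper names are mine; `invChi2` mirrors the inverse chi-square helper the book uses for this method):

```scala
// Fisher's combination: chi = -2 * ln(p1 * p2 * ... * pn),
// fed into the inverse chi-square function with 2n degrees of freedom
def fisherProb(probs: List[Double]): Double = {
  val chi = -2.0 * math.log(probs.reduceLeft(_ * _))
  invChi2(chi, probs.length * 2)
}

// Inverse chi-square function (for an even number of degrees of freedom)
def invChi2(chi: Double, df: Int): Double = {
  val m = chi / 2.0
  var term = math.exp(-m)
  var sum = term
  for (i ← 1 until df / 2) {
    term *= m / i
    sum += term
  }
  math.min(sum, 1.0)
}
```

Features whose Pr(category | feature) values are consistently low drive the combined score toward 0, while neutral 0.5 values leave it in the middle.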

Note that the features in a document are not actually independent, so neither naive Bayes nor the Fisher method gives a real probability for a category; the Fisher method is simply the better estimate.

### Review 4.3: Document filtering - Use Naive Bayes

Now we know how to calculate the probability of a feature given a category. We can use this probability to calculate the probability of an item.

Here is the formula (please note that we assume the occurrences of different features in an item are independent):

Prob(Item | category) = Prob(feature1 | category) * Prob(feature2 | category) *...*Prob(featureN | category)

The formula alone is not enough to decide which category an item should belong to, because what we really want to know is the probability of a category given an item, i.e. Prob(Category | Item).

Actually, it is easy to get this using Bayes' theorem:
Prob(Category | Item) = Prob(Item | Category) * Prob(Category) / Prob(Item)

How the formula is deduced:
Prob(AB) = Prob(A|B)*Prob(B) = Prob(B|A) * Prob(A)
=>
Prob(A|B) = Prob(B|A) * Prob(A) / Prob(B)

In the above formula, Prob(Item) is useless for the comparison because it is the same for all categories.

Here is the code:

```scala
def naivebayes(item: String, cate: String) = {
  // Prob(Item | Category) * Prob(Category), with Prob(Category)
  // estimated as this category's share of all counted features
  getDocProb(item, cate) * countTotalFeaturesInCate(cate) / countTotalFeatures()
}

def getDocProb(item: String, cate: String) = {
  getFeatures(item).map(getWeightedProb(_, cate)).reduceLeft(_ * _)
}
```

### Review 4.2: Document filtering - Calculate probabilities and Make a reasonable guess

Document filtering: Calculate probabilities and Make a reasonable guess

First, we will talk about using conditional probabilities to classify an item. Please note that we assume a feature appears at most once in an item, so the count of a feature in a category will not be larger than the number of items in the category. We can therefore calculate the probability that a feature occurs in a category with the following formula:

Probability = featureCountInCategory / itemCountInThisCategory

Next, let us talk about the reason to make a reasonable guess. Take a look at the following example:
train("Nobody owns the water.","good")
train("the quick rabbit jumps fences","good")
train("make quick money at the online casino","bad")
train("the quick brown fox jumps","good")

The feature "money" appears only once in all the training samples, and the item containing it was trained as "bad". But in the real world, money is not directly linked to bad things. So, when we have little information about a feature, we need to make a reasonable guess:
WeightedProbability = (weight * assumedProbability + totalCountOfTheFeature * probabilityOfTheFeature) / (weight + totalCountOfTheFeature)
[As a matter of experience, the weight can be 1.0 and the assumed probability 0.5.]
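
Plugging the "money" example into the formula shows the pull toward the assumption (a small sketch; the standalone helper below is mine, the real method appears in the code that follows):

```scala
// weighted probability = (weight * assumed + count * observed) / (weight + count)
def weightedProb(basicProb: Double, total: Double,
                 weight: Double = 1.0, assumeProb: Double = 0.5): Double =
  (weight * assumeProb + basicProb * total) / (weight + total)

// "money": observed probability 1.0 in "bad", but seen only once
println(weightedProb(1.0, 1.0))  // 0.75, pulled well toward the 0.5 assumption

// after ten sightings the evidence dominates the assumption
println(weightedProb(1.0, 10.0)) // about 0.95
```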

Here is the code for calculating the probabilities and making a reasonable guess; how the guess affects the result is explained later:

```scala
def train(item: String, cate: String) {
  getFeatures(item).foreach(incf(_, cate))
  incc(cate)
}

def getWeightedProb(feature: String, cate: String, weight: Float = 1.0f, assumeProb: Float = 0.5f) = {
  val basicProb = this.getFeatureProb(feature, cate)
  // how many times this feature has been seen, over all categories
  val total = this.fc(feature).values.sum

  (weight * assumeProb + basicProb * total) / (total + weight)
}

def getFeatureProb(feature: String, cate: String) = countItemInCate(cate) match {
  case 0 ⇒ 0f
  case _ ⇒ countFeatureInCate(feature, cate).toFloat / countItemInCate(cate).toFloat
}
```

A set of training samples:

The result:

The result shows how the reasonable guess affects the outcome: as the information about a feature grows, the result is pulled further away from the assumed probability and toward the feature's real probability.

### Review 4.1: Document filtering - Overview and How to train a classifier

Document filtering usually means using a classifier to sort a set of documents into different categories. The best-known example of document filtering is the elimination of spam. Before starting to study document filtering, two important classifier concepts should be introduced. The first is the item, the object to be classified; in document filtering, the items are documents or document titles. The second is the feature, anything that can be determined as being present or absent in an item; in document filtering, the features are the words in the documents.

Unlike NMF, document filtering is a supervised method, so before classifying documents we need to train a classifier. Here is the code to train one; the result is in a format like this: {"girl":[good:1,bad:0], "boy":[good:0,bad:100]}

```scala
import java.util.regex._

object DF {
  var fc = Map[String, scala.collection.mutable.Map[String, Int]]()
  var cc = Map[String, Int]()

  def main(arg: Array[String]) {
    train("the quick brown fox jumps over the lazy dog", "good")
    train("make quick money in the online casino", "bad")
    println(fc)
  }

  def train(item: String, cate: String) {
    getFeatures(item).foreach(incf(_, cate))
    incc(cate)
  }

  def getFeatures(item: String) = {
    val pattern = Pattern.compile("\\W")
    pattern.split(item.trim).toList
  }

  // Increase the feature count for a category
  def incf(feature: String, cate: String) {
    if (!fc.isDefinedAt(feature)) {
      fc += ((feature, scala.collection.mutable.Map[String, Int]()))
    }

    if (!fc(feature).isDefinedAt(cate)) {
      fc(feature) += ((cate, 0))
    }
    fc(feature)(cate) += 1
  }

  // Increase the category count
  def incc(cate: String) {
    if (!cc.isDefinedAt(cate)) {
      cc += ((cate, 0))
    }
    cc(cate) += 1
  }
}
```

### Review 3.2: Find Independent Features - Use of NMF (Non-Negative Matrix Factorization)

Here are two examples of using NMF. Because NMF is an unsupervised learning technique, it acts a little bit like clustering.

The first example assigns themes to articles from different news websites. We can download the articles first and then construct a matrix in the following format:
|5,0,4|
|3,4,5|
|1,2,9|
The rows are article titles and the columns are different words, so the matrix holds the count of each word in each article. We can then use NMF to factorize this matrix into a weight matrix and a feature matrix: the weight matrix relates article titles to features, and the feature matrix relates features to words. From these two matrices we can find the relationship between word counts and articles, which is what we usually call themes.

The second example uses NMF to mine information from the stock market. Stock market data can be downloaded from any financial site; the rows can represent dates, the columns different stocks, and the data can be closing volumes. Once the matrix is factorized, just as in the first example, we can find relationships between trading volume and dates. For instance, Google's trading volume increased a lot on 20 Jan 06, because Google had announced it would not give information about its search engine usage to the government.

### Review 3.1: Find Independent Features - Understand Non-Negative Matrix Factorization


Before we start, let us look at something that seems meaningless. Say there is a target number 5, but you guess the number 10. How far is your guess from the target? We can use 5/10 to measure this distance: it is 0.5. And the original target number is [your guess] * [the distance] = 10 * 0.5 = 5.

For a single number this idea is ridiculous, but it is exactly how Non-Negative Matrix Factorization works. Non-Negative Matrix Factorization (NMF) is an unsupervised learning algorithm. Given a matrix M, NMF factorizes M into a weight matrix W multiplied by a feature matrix H. E.g.:
| 29 , 29 |
| 43 , 33 |   can be factorized to
| 15 , 25 |

| 5, 2 |    | 3,5|
| 5, 4 | *  | 7,2|
| 5, 0 |

Here are the steps the algorithm uses to get W and H:
1. Generate two random matrices as W and H. The column count of W and the row count of H equal the feature count (specified by yourself, depending on how many features you want). The row count of W is the same as M's, and the column count of H is the same as M's.
2. Let Hn = W.transpose * M and Hd = W.transpose * (W * H). The next H is Array(H) * (Array(Hn) / Array(Hd)), where * and / are element-wise. (Here you can take Hn as the original thing, Hd as the guessed thing, and the next H as [the guess] * [the distance].)
3. Let Wn = M * H.transpose and Wd = W * H * H.transpose. The next W is Array(W) * (Array(Wn) / Array(Wd)), again element-wise.
4. Replace H and W with the next H and the next W.
5. Repeat steps 2-4 until the difference between W * H and M is small enough.

Source code of the NMF algorithm - NMF.scala:
```scala
import Jama._
import java.util.Random

object NMF {
  def main(args: Array[String]) {
    val wa = Array(Array(1., 2., 3.), Array(4., 5., 6.))
    val wb = Array(Array(2., 4.), Array(2., 5.), Array(4., 7.))
    val data = (new Matrix(wa)).times(new Matrix(wb))
    val result = factorize(data)
    display(result._1.getArray)
    println()
    display(result._2.getArray)
  }

  implicit def array2SuperArray(arg: Array[Array[Double]]) = new SuperArray(arg)

  def factorize(data: Matrix,
                fc: Int = 2,
                maxItr: Int = 500): (Matrix, Matrix) = {
    val rc = data.getRowDimension
    val cc = data.getColumnDimension

    var w = randomMatrix(rc, fc)
    var h = randomMatrix(fc, cc)

    // The loop body below was garbled in the original post; it is rebuilt
    // from the surviving fragments, following the update steps above
    for (i ← 0 until maxItr) {
      print("iteration:" + i + "  ")
      println(distance(w.times(h).getArray, data.getArray))
      if (distance(w.times(h).getArray, data.getArray) < 0.001)
        return (w, h)

      // update the feature matrix h
      val hn = w.transpose.times(data)
      val hd = w.transpose.times(w).times(h)
      h = new Matrix(h.getArray * (hn.getArray / hd.getArray))

      // update the weight matrix w
      val wn = data.times(h.transpose)
      val wd = w.times(h).times(h.transpose)
      w = new Matrix(w.getArray * (wn.getArray / wd.getArray))
    }
    (w, h)
  }

  def randomMatrix(rowCount: Int, colCount: Int) = {
    val rand: Random = new Random(System.currentTimeMillis)
    val result = for (i ← 0 until rowCount) yield
      (for (j ← 0 until colCount) yield rand.nextDouble).toArray
    new Matrix(result.toArray)
  }

  // Euclidean distance between two matrices of the same shape
  def distance(wh: Array[Array[Double]], original: Array[Array[Double]]) = {
    var sum = 0.0
    for (i ← 0 until wh.length; j ← 0 until wh(i).length)
      sum += Math.pow(wh(i)(j) - original(i)(j), 2)
    Math.sqrt(sum)
  }

  def display(arg: Array[Array[Double]]) {
    arg.foreach(x ⇒ { x.foreach(a ⇒ print(a + " ")); println() })
  }

  // element-wise * and / on 2-D arrays, used by the update rules
  class SuperArray(arr: Array[Array[Double]]) {
    def *(that: Array[Array[Double]]) =
      (for (i ← 0 until arr.length) yield
        (for (j ← 0 until arr(i).length) yield arr(i)(j) * that(i)(j)).toArray).toArray
    def /(that: Array[Array[Double]]) =
      (for (i ← 0 until arr.length) yield
        (for (j ← 0 until arr(i).length) yield arr(i)(j) / that(i)(j)).toArray).toArray
  }
}
```

Here is a screenshot of running the source code above. The last four lines are the weight matrix (the first two of them) and the feature matrix (the last two).

Reference: http://book.csdn.net/bookfiles/880/10088027646.shtml

### Review 2.6: Modeling with Decision Trees - Deal with Numerical Outcomes

It is not a good idea to treat every number as a different category when the outcomes of a dataset are numerical. A better way is to use variance as the criterion when finding the best split.

The formula for variance is:
S^2 = [(X1 - X̄)^2 + (X2 - X̄)^2 + ... + (Xn - X̄)^2] / n, where X̄ is the mean of X1...Xn.

When building a decision tree with variance as the scoring function, a good split puts the higher values on one side and the lower values on the other, which reduces the overall variance of the branches.
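
A minimal sketch of this scoring function (the helper name is mine):

```scala
// Variance of the numerical outcomes in a set; when scoring a split,
// lower variance inside each branch means the branch is more uniform
def variance(rows: List[Double]): Double = {
  if (rows.isEmpty) 0.0
  else {
    val mean = rows.sum / rows.length
    rows.map(x ⇒ math.pow(x - mean, 2)).sum / rows.length
  }
}
```

Splitting the FAQ-page counts [12, 12, 18, 19, 21, 23, 24] into a low half and a high half gives each branch a far smaller variance than the whole set.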

### Review 2.5: Modeling with Decision Trees - Deal with Missing Values

Another great feature of decision trees is that they can handle missing values easily. For instance, what should we do if a customer's location cannot be determined from his IP address? We can check both branches of a node whenever an item cannot provide the information required by that node. Here is the code:

The difference between the 'mdclassify' and 'classify' methods is that 'mdclassify' can follow both branches and combine the results when the required information is missing.
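
Since the snippet itself did not survive in this post, here is a minimal sketch of the idea with a simplified node type (the names and structure are mine, not the book's exact code):

```scala
// Simplified tree node: a leaf carries result counts; an internal node
// tests whether column `colIndex` of the item equals `split`
case class Node(colIndex: Int = -1, split: String = "",
                left: Option[Node] = None, right: Option[Node] = None,
                result: Option[Map[String, Double]] = None)

def mdClassify(item: List[Option[String]], tree: Node): Map[String, Double] = {
  tree.result match {
    case Some(r) ⇒ r
    case None ⇒
      item(tree.colIndex) match {
        case Some(v) ⇒ // value present: follow one branch as usual
          if (v == tree.split) mdClassify(item, tree.left.get)
          else mdClassify(item, tree.right.get)
        case None ⇒
          // value missing: follow both branches and merge the counts,
          // weighting each branch by how many training rows ended up in it
          val l = mdClassify(item, tree.left.get)
          val r = mdClassify(item, tree.right.get)
          val lw = l.values.sum / (l.values.sum + r.values.sum)
          val rw = 1.0 - lw
          (l.keySet ++ r.keySet).map { k ⇒
            (k, l.getOrElse(k, 0.0) * lw + r.getOrElse(k, 0.0) * rw)
          }.toMap
      }
  }
}
```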

### Review 2.4: Modeling with Decision Trees- Pruning

Pruning is a method for solving the overfitting problem. If a tree creates branches that decrease the entropy only slightly, it may give answers that are more certain than they really should be. An alternative (but flawed) way to avoid overfitting is to stop during tree creation: when the decrease in entropy is less than a given threshold, do not create the branch. The drawback is that although a split may not reduce the entropy much by itself, it may enable subsequent splits that decrease it greatly.

Pruning does not have this drawback. It creates the entire tree first, and then, starting from the bottom of the tree, checks whether merging a pair of branches would increase the entropy by less than a given threshold.

The code of pruning a decision tree:

```scala
def prune(tree: DTreeNode, threshold: Float): Unit = {
  if (tree.left != None)
    prune(tree.left.get, threshold)

  if (tree.right != None)
    prune(tree.right.get, threshold)

  if (tree.left != None && tree.left.get.result != None &&
      tree.right != None && tree.right.get.result != None) {
    // expand the result counts back into flat lists of outcomes
    // (0 until x._2 here: "0 to x._2" would produce one copy too many)
    val left = tree.left.get.result.get.toList.map(x ⇒ (for (i ← 0 until x._2) yield List(x._1)).toList).flatten
    val right = tree.right.get.result.get.toList.map(x ⇒ (for (i ← 0 until x._2) yield List(x._1)).toList).flatten
    // toDouble avoids integer division
    val p1 = left.length.toDouble / (left.length + right.length)
    val p2 = 1.0 - p1
    println(entropy(left ++ right) + ":" + entropy(left) * p1 + ":" + entropy(right) * p2)
    val diff = entropy(left ++ right) - (entropy(left) * p1 + entropy(right) * p2)
    if (diff <= threshold) {
      // merge: turn this node into a leaf holding the combined counts
      tree.colIndex = None
      tree.split = None
      tree.left = None
      tree.right = None
      tree.result = Some(countResult(left ++ right))
    }
  }
}
```
An example:

### Review 2.3: Modeling with Decision Trees - Classify using a Decision Tree

Here is the code for classifying with a decision tree:

```scala
def classify(item: List[String], dtree: DTreeNode): Option[Map[String, Int]] = {
  if (dtree.result != None)
    return dtree.result
  if (dtree.colIndex == None)
    return None

  val (set1, set2) = divide(List(item), dtree.colIndex.get, dtree.split)
  if (set1.length != 0) {
    if (dtree.left == None)
      return None
    return classify(item, dtree.left.get)
  }

  if (set2.length != 0) {
    if (dtree.right == None)
      return None
    return classify(item, dtree.right.get)
  }
  return None
}
```

The result of classifying the item [google, UK, yes, 21] using the constructed decision tree:

## Friday, 27 February 2015

### Review 2.2: Modeling with Decision Trees - Build a Decision Tree


This note talks about how to build a decision tree. Before starting, the concept of information gain should be introduced. Information gain measures the difference between the entropy of the parent set and the weighted entropy of its two divided sets. It can be calculated with the following formula:

IG = entropy(fset) - [p1 * entropy(dset1) + p2 * entropy(dset2)]

where p1 and p2 are the fractions of the parent set that fall into each divided set.

The information gain should be as large as possible, because the goal of a decision tree is to make the child sets as pure as possible; a good split removes a large amount of entropy.

The code below shows how to build a decision tree recursively.

The code first gets the entropy of the rows, then iterates over each column and splits the data on each value of that column. It calculates the entropy of the split sets to get the information gain for that split, then picks the best split, the one with the largest information gain. The steps repeat until the largest information gain is zero.
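
The scoring used in these steps can be sketched like this (a minimal version; the helper names are mine):

```scala
// Entropy of a list of outcomes: -sum over labels of p * log2(p)
def entropy(outcomes: List[String]): Double = {
  val counts = outcomes.groupBy(identity).map { case (k, v) ⇒ (k, v.length) }
  counts.values.map { c ⇒
    val p = c.toDouble / outcomes.length
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of splitting `rows` into `set1` and `set2`,
// weighting each child's entropy by its share of the rows
def infoGain(rows: List[String], set1: List[String], set2: List[String]): Double = {
  val p = set1.length.toDouble / rows.length
  entropy(rows) - p * entropy(set1) - (1 - p) * entropy(set2)
}
```

A perfect split of ["Basic", "Basic", "None", "None"] into pure halves has an information gain of 1 bit; a split that leaves both children equally mixed gains nothing.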

Here is the code to print a decision tree:

and the result:

Here is the same result shown in graphical form to make it clearer.

### Review 2.1: Modeling with Decision Trees - Overview and How to compute the amount of disorder of a set

The algorithm used to build decision trees in this part is called CART (Classification and Regression Trees). Decision trees have two advantages. First, the models they produce are easy to understand and interpret. Second, they need less computational power than other classification methods such as bayesian classifiers and neural networks. Their disadvantage is that they are not well suited to classification problems with a large number of categories.

I work in a web hosting company, and by coincidence an interesting problem is introduced in this part: imagine that a web host provides two different types of service, a basic service and a premium service. For marketing purposes, the company wants to give a number of trial accounts to potential customers; in order to reduce cost, it needs to choose the users who are most likely to buy its services.

To minimize the annoyance for users, the following information from the server logs can be used.
| Referrer | Location | Read FAQ | Pages viewed | Service chosen |
|----------|----------|----------|--------------|----------------|
| Slashdot | USA | Yes | 18 | None |
| Digg | USA | Yes | 24 | Basic |
| Kiwitobes | France | Yes | 23 | Basic |
| (direct) | New Zealand | No | 12 | None |
| (direct) | UK | No | 21 | Basic |
| Slashdot | France | Yes | 19 | None |
| Digg | USA | No | 18 | None |
| Kiwitobes | UK | No | 19 | None |
| Digg | New Zealand | Yes | 12 | Basic |
| Kiwitobes | France | Yes | 19 | Basic |

We can use a List to store the data. For example:
```
val data = List(
  List("Slashdot", "USA", "Yes", 18, "None"),
  ......
)
```

To decide which variable (column) best separates the outcomes, we first need a function that divides the data according to a variable (column).
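A possible shape for such a function is sketched below; the name `divideSet` and the exact signature are my assumptions, not the original code. Numeric columns split on >=, everything else on equality:

```scala
//Split rows into two sets based on the value in one column:
//numbers compare with >=, all other values with equality
def divideSet(rows: List[List[Any]], col: Int, value: Any): (List[List[Any]], List[List[Any]]) =
  value match {
    case v: Int => rows.partition(r => r(col).asInstanceOf[Int] >= v)
    case _      => rows.partition(r => r(col) == value)
  }

val data = List(
  List("Slashdot", "USA", "Yes", 18, "None"),
  List("Digg", "USA", "Yes", 24, "Basic"),
  List("(direct)", "New Zealand", "No", 12, "None")
)
//Divide on the "Read FAQ" column (index 2)
val (readFaq, didNotRead) = divideSet(data, 2, "Yes")
```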

Here is one example of the outcomes when splitting on the Read FAQ column:
Yes: None, Premium, Basic, Basic, None, Basic, Basic.

Observing the result, we find that the outcomes are almost randomly distributed. So how do we choose the best variable? Two methods are introduced here.

The first one is Gini impurity: the probability that a randomly chosen item would be labeled incorrectly if it were labeled at random according to the distribution of outcomes in the set. An example with two outcomes may help; each line is the product p * (1 - p):
0.0 * 1.0 = 0
0.1 * 0.9 = 0.09
0.2 * 0.8 = 0.16
0.3 * 0.7 = 0.21
0.4 * 0.6 = 0.24
0.5 * 0.5 = 0.25
So, the higher the Gini impurity, the worse the split: it peaks when the outcomes are most evenly mixed.
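A sketch of the computation, summing p(i) * p(j) over every pair of different outcomes; the name `giniImpurity` is my own. Note that each pair is counted in both directions, so a 50/50 split scores 0.5, twice the one-sided product in the table above:

```scala
//Probability that a randomly chosen row would be labeled incorrectly
//if it were labeled at random according to the outcome distribution
def giniImpurity(rows: List[List[Any]]): Double = {
  val total = rows.length.toDouble
  val probs = rows.groupBy(_.last).map { case (k, v) => (k, v.length / total) }
  //Sum p(i) * p(j) over all ordered pairs of different outcomes
  (for ((k1, p1) <- probs.toList; (k2, p2) <- probs.toList if k1 != k2)
    yield p1 * p2).sum
}

val pure = List(List("Premium"), List("Premium"), List("Premium"))
val mixed = List(List("Basic"), List("None"), List("Basic"), List("None"))
```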

The second one is entropy. In information theory, it indicates how mixed a set is.
p(i) = frequency(outcome) = count(outcome) / count(total rows)
Entropy = -sum of p(i) * log2(p(i)) over all outcomes
Like Gini impurity, the more mixed up the set is, the higher its entropy. If the outcomes are all the same (for example, the hosting company is lucky and every user buys the premium service), the entropy is -(1 * log(1)) = 0. Our goal in dividing the data into two new groups is to reduce the entropy.

Code:
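The original listing is not preserved here; a minimal implementation matching the formula above might look like this (log base 2, names assumed):

```scala
//Entropy of the outcome column: -sum of p(i) * log2(p(i))
def entropy(rows: List[List[Any]]): Double = {
  val total = rows.length.toDouble
  val probs = rows.groupBy(_.last).values.map(_.length / total)
  probs.map(p => -p * math.log(p) / math.log(2)).sum
}

val allSame = List(List("Premium"), List("Premium"))
val halfHalf = List(List("Basic"), List("None"))
```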

### Review 1.4: Give your program the ability to evolve

Give your program the ability to evolve.

After studying program trees, mutation, and crossover (summarized in study notes 1, 2, and 3), everything needed to let a program evolve by itself is ready. The rough logic is: generate a random population, rank it with a given ranking function, and select some elites to breed the next generation. Repeat these steps until one program (or a set of programs) satisfies a given condition.

Here is a sample code to evolve a program:

```
def evolve(paramCount:Int, popSize:Int = 500, genSize:Int = 500, probMut:Double = 0.1,
           probCross:Double = 0.6, probNew:Double = 0.2, probExp:Double = 0.7) = {
  //Generate the initial random population
  val ancestors = (for ( i ← 0 until popSize ) yield makeRandomTree(paramCount)).toList
  val last = breed(ancestors, paramCount, popSize, genSize, probMut, probCross, probNew, probExp)
  last.displayTree()
  last
}

def breed(ancestors:List[N], paramCount:Int, popSize:Int, genSize:Int,
          probMut:Double, probCross:Double, probNew:Double, probExp:Double):N = {
  var population = ancestors
  for ( i ← 0 until genSize ) {
    val rankedPrograms = rankPrograms(population)

    //Display the score of this generation; the first program is the best one
    println(rankedPrograms(0)._2)

    //We have the program we want when the score reaches 0
    if ( rankedPrograms(0)._2 == 0 )
      return rankedPrograms(0)._1

    population = breedOneGeneration(rankedPrograms, paramCount, popSize, probMut, probCross, probNew, probExp)
  }
  population(0)
}

def breedOneGeneration(rankedPrograms:List[(N,Int)], paramCount:Int, popSize:Int,
                       probMut:Double, probCross:Double, probNew:Double, probExp:Double) = {
  //Always keep the top 2 programs
  var newgen = List(rankedPrograms(0)._1, rankedPrograms(1)._1)
  while ( newgen.length < popSize ) {
    if ( seed.nextDouble > probNew ) {
      //Breed a child from two randomly chosen elites
      val eliteFather = rankedPrograms(eliteIndex(probExp))._1
      val eliteMother = rankedPrograms(eliteIndex(probExp))._1
      val child = mutate(crossover(eliteFather, eliteMother, probCross), paramCount, probMut)
      newgen = newgen ::: List(child)
    }
    else {
      //We need "aliens" to keep the population diverse; this helps avoid the local maxima problem
      newgen = newgen ::: List(makeRandomTree(paramCount))
    }
  }
  newgen
}
```

In the function breedOneGeneration, there are two branches. The first covers the probability that a member of the new generation is bred from the old generation. The other covers the probability that a member is generated randomly. The reason we need the second branch is that if every member were bred from the older generation, new members would tend to be extremely homogeneous. That leads to a problem called local maxima: a state that is good, but not quite good enough. So we need to maintain diversity in each generation.

The function eliteIndex selects a random but fairly good solution from the ranked programs. We can use Math.log to achieve this, e.g. (Math.log(randomDouble) / Math.log(probExp)).toInt. If we set probExp to 0.7 and the population size to 500, the probability of getting a small index is much larger than the probability of getting a big one. This is easy to see when you look at the logarithm curve below.
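A sketch of eliteIndex along those lines; the clamp to the population size is my addition, since the log can occasionally exceed it:

```scala
val seed = new scala.util.Random()

//Take the log of a uniform random number: small (better-ranked) indices
//are exponentially more likely than large ones
def eliteIndex(probExp: Double, popSize: Int = 500): Int = {
  val i = (math.log(seed.nextDouble()) / math.log(probExp)).toInt
  math.min(i, popSize - 1)
}
```

With probExp = 0.7, index 0 is chosen with probability about 0.3, index 1 with about 0.21, and so on, each index roughly 0.7 times as likely as the previous one.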

### Review 1.3: Understanding crossover

Understanding crossover

Different from mutation, which generates a child tree from a single tree, crossover uses two parents to generate the child tree. It involves taking two successful program trees and combining them to create a new program tree, usually by replacing a branch of one with a branch of the other. The following picture shows an example:

To keep the crossover function, which takes two program trees as inputs, simple, it is recommended to traverse both trees at the same depth. When a randomly selected threshold is reached, we return a copy of the first tree with one of its branches replaced by a branch from the second tree. Here is an example:

```def crossover(father:N, mother:N, prob:Double = 0.2, top:Boolean = true):N = {
  //Never replace the root itself (top == true); otherwise, with the given
  //probability, replace this subtree with a branch from the mother
  if ( seed.nextDouble < prob && !top )
    mother
  else
    crossoverChildren(father, mother, prob)
}

def crossoverChildren(father:N, mother:N, prob:Double):N = {
  var result = father
  //Only inner nodes have branches that can be swapped
  if ( father.isInstanceOf[Node] && mother.isInstanceOf[Node] )
    result.asInstanceOf[Node].children =
      father.asInstanceOf[Node].children.map(crossover(_, choose(mother.asInstanceOf[Node].children), prob, false))
  result
}

//Choose a child node randomly
def choose(children:List[N]):N = {
  children(seed.nextInt(children.length))
}
```

### Review 1.2: Understanding mutation

Mutation and crossover are the main techniques used in genetic programming. In genetic programming, we usually generate a set of programs represented as program trees, but we cannot expect these random programs to give correct solutions to the problems we want to solve. Hence, we use the mutation and crossover techniques to make those programs evolve.

First, let us take a look at mutation. Mutation means something changes by itself. The following code shows a method to mutate a program tree:
```def mutate(tree:N, paramCount:Int, prob:Double = 0.1):N = {
  //With the given probability, replace the whole subtree with a random one
  if ( seed.nextDouble < prob )
    makeRandomTree(paramCount)
  else
    mutateChildren(tree, paramCount)
}

def mutateChildren(parent:N, paramCount:Int) = {
  var result = parent
  //Only inner nodes have children to mutate
  if ( parent.isInstanceOf[Node] )
    result.asInstanceOf[Node].children = parent.asInstanceOf[Node].children.map(mutate(_, paramCount))
  result
}
```

The code mutates the program tree recursively, changing its nodes one by one in a top-down direction. It first addresses the parent node: if the randomly generated float is less than the given probability (0.1 by default), the parent node mutates by calling makeRandomTree (a method that generates a random program tree, using random choices and probabilities). Otherwise, it turns to the parent's child nodes and applies the mutate method to them.
The following picture shows one kind of mutation.
From the picture we can see the whole tree was changed, but this is just one possibility. The parent node may stay the same and only a child node changes, or nothing may change at all.