Saturday, 28 February 2015

Review 5.6: Learn from User Clicks - Back Propagation

Learn from User Clicks: Back Propagation

Here is what is interesting. The following picture shows how the back propagation algorithm works (the calculation in the picture is meaningless; it is only there to illustrate the idea, so ignore the numbers). The basic idea of back propagation is that the strengths of the links between nodes in different layers change according to users' feedback (the difference between the expected output and the real output).

Here is how the back propagation algorithm is implemented:
def backPropagation( targets:List[Double], howMuchToChange:Double = 0.5 ) {
  val outDelta = {
    for ( i ← 0 until linkids.length ) yield {
      //The difference between the expected and the real output
      val error = targets(i) - ao(i)

      //The error is scaled by the slope of tanh at the node's output: when the
      //output is already large (the curve is flat there), the strengths between
      //the hidden layer and the output node are changed more carefully
      dtanh(ao(i)) * error
    }
  }.toList

  val hiddenDelta = {
    for ( i ← 0 until hiddenids.length ) yield {
      val error = (for ( j ← 0 until linkids.length ) yield so(i)(j) * outDelta(j)).toList.reduceLeft( _ + _ )
      dtanh(ah(i)) * error
    }
  }.toList

  //Update hidden -> output strengths
  for ( i ← 0 until hiddenids.length ) {
    for ( j ← 0 until linkids.length ) {
      val change = outDelta(j) * ah(i)
      so(i)(j) += change * howMuchToChange
    }
  }

  //Update input -> hidden strengths
  for ( i ← 0 until wordids.length ) {
    for ( j ← 0 until hiddenids.length ) {
      val change = hiddenDelta(j) * ai(i)
      si(i)(j) += change * howMuchToChange
    }
  }
}

//The first lines of this method were lost when pasting; the signature is
//reconstructed from the calls in main() below
def train( wordids:List[Int], linkids:List[Int], selectedUrl:Int ) {
  setupNetwork(wordids, linkids)
  feedForword()

  //Expected output: 1.0 for the link the user actually clicked, 0.0 for the rest
  //(the if-branch was missing in the original snippet; reconstructed here)
  val targets = {
    for ( i ← 0 until linkids.length ) yield {
      if ( linkids(i) == selectedUrl ) 1.0 else 0.0
    }
  }.toList

  backPropagation(targets)
  //Save to database
  saveNetwork()
}

//Slope of tanh at output y
//When y approaches 1, the slope approaches 0
//When y approaches 0, the slope approaches 1 (45 degrees)
def dtanh( y:Double ) = 1 - y * y

The code performs the following steps:

For each node in the output layer:
1. Calculate the difference between the node's current output and its expected value.
2. Use dtanh to determine how much the node's total input has to change (the delta output). Why dtanh? When a node's output is already large, we want to change the strengths feeding it more carefully; looking at the curve of dtanh, it does exactly that.
3. Change the strength of every incoming link in proportion to the delta, the sending node's output, and the learning rate.

For each node in the hidden layer:
1. Calculate the error of the hidden node by summing, over every output node, the strength of the link to it multiplied by that node's delta output.
2. Use dtanh to determine how much the node's total input has to change.
3. Change the strength of every incoming link in proportion to the delta, the sending node's output, and the learning rate.
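The steps above can be sketched end to end on a toy network with one input word, one hidden node, and one output link. All the numbers and strengths here are made up purely for illustration; this is not the database-backed code.

```scala
// A single back-propagation step on a toy 1-1-1 network
def dtanh(y: Double) = 1 - y * y

var si = 0.5      // input -> hidden strength
var so = 0.5      // hidden -> output strength
val ai = 1.0      // input activation

// Feed forward first, so every node's output is known
val ah = math.tanh(ai * si)   // hidden node output
val ao = math.tanh(ah * so)   // final output

// The user clicked this link, so its target is 1.0
val target = 1.0
val outDelta = dtanh(ao) * (target - ao)
val hiddenDelta = dtanh(ah) * (so * outDelta)

// Update both strengths with learning rate 0.5
so += 0.5 * outDelta * ah
si += 0.5 * hiddenDelta * ai

// After the update, the network's output has moved toward the target
println(math.tanh(math.tanh(ai * si) * so))
```

Running the feed-forward pass again with the updated strengths gives an output closer to 1.0 than before, which is exactly the "learning" effect the section describes.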

Note: before running back propagation, it is necessary to run feed forward so that the output of each node is known.
def main(args:Array[String]) {
  createTables()

  val players = 101
  val football = 102
  val NBA = 103

  val NBAPlayerIntroduction = 201
  val FootballSkills = 202
  val VideoGames = 203

  for ( i ← 0 until 30 ) {
    train(List(players,NBA),List(NBAPlayerIntroduction,FootballSkills,VideoGames),NBAPlayerIntroduction)
    train(List(football,NBA),List(NBAPlayerIntroduction,FootballSkills,VideoGames),FootballSkills)
    train(List(players),List(NBAPlayerIntroduction,FootballSkills,VideoGames),VideoGames)
  }

  setupNetwork(List(players,NBA),List(NBAPlayerIntroduction,FootballSkills,VideoGames))
  feedForword()
  println(ao)

  setupNetwork(List(football,NBA),List(NBAPlayerIntroduction,FootballSkills,VideoGames))
  feedForword()
  println(ao)

  setupNetwork(List(NBA),List(NBAPlayerIntroduction,FootballSkills,VideoGames))
  feedForword()
  println(ao)
}

One of the magic things about the neural network algorithm, especially its hidden layer, is that it can make a reasonable guess even for an input query it has never seen before.

The query "NBA" by itself was never seen during training. Yet the network gives the "NBAPlayerIntroduction" link a much better score than the "FootballSkills" link, even though "NBA" was associated just as often with "FootballSkills" as with "NBAPlayerIntroduction". The neural network has not only learned which links are associated with which queries, it has also learned which words are important in a particular query. (The reason for this phenomenon: the query word "players" was also trained toward the "VideoGames" output, so the network learns that "players" contributes less to the "NBAPlayerIntroduction" link, and "NBA" is hence the more important word.)

Review 5.5: Learn from User Clicks - Feeding Forward

Learn from User Clicks: Feeding Forward

We can also use users' feedback in the form of their behavior, e.g. which link a user clicks after the ranked results are displayed. A classical way to do this is an artificial neural network.

In neural networks, a sigmoid-shaped function such as the hyperbolic tangent (tanh) is commonly used to decide how a node should respond to its input. There are two reasons to use it. First, the result lies in (-1, 1), so it is easy to rank. Second, the output changes more and more slowly as the input gets larger. This is just what we want, because we cannot assume each user will click on an answer that is appropriate for everyone, so no single input should swing a node's response to an extreme.
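This saturating behavior can be seen directly by evaluating tanh at a few points (a quick illustrative check, not part of the search engine code):

```scala
// tanh is bounded in (-1, 1) and flattens out as the input grows,
// so a single extreme input cannot dominate a node's response
val inputs = List(0.0, 0.5, 1.0, 2.0, 4.0, 8.0)
inputs.foreach(x => println(f"tanh($x) = ${math.tanh(x)}%.4f"))
```

The step from input 0.0 to 0.5 moves the output by more than 0.4, while the step from 4.0 to 8.0 barely moves it at all.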

Now, let us go deeper into how the feed-forward algorithm works. Here is the sample code:
def createTables() {
  val sqllist = List("create table if not exists hiddennode(create_key)",
                     "create table if not exists wordhidden(fromid,toid, strength)",
                     "create table if not exists hiddenlink(fromid,toid, strength)")
  execTran(sqllist)
}

//The first lines of this method were lost when pasting; the signature is
//reconstructed from the calls in main()
def setupNetwork( wordids:List[Int], linkids:List[Int] ) {
  this.wordids = wordids
  this.linkids = linkids
  this.hiddenids = getAllHiddenIds()

  ai = wordids.map(x ⇒ 1.0)
  ah = hiddenids.map(x ⇒ 1.0)

  si = { for ( i ← 0 until wordids.length )
    yield {
      for ( j ← 0 until hiddenids.length )
        yield getStrength(wordids(i),hiddenids(j),0)
    }.toList
  }.toList

  //The yield line was missing here in the original snippet; by symmetry with si,
  //it should read the hidden -> link strengths
  so = { for ( i ← 0 until hiddenids.length )
    yield {
      for ( j ← 0 until linkids.length )
        yield getStrength(hiddenids(i),linkids(j),1)
    }.toList
  }.toList
}

def feedForword() = {
  this.ah =
    { for ( i ← 0 until hiddenids.length ) yield {
      var sum = 0.0
      for ( j ← 0 until wordids.length ) {
        sum += ai(j) * si(j)(i)
      }
      java.lang.Math.tanh(sum)
    }}.toList

  this.ao =
    { for ( i ← 0 until linkids.length ) yield {
      var sum = 0.0
      for ( j ← 0 until hiddenids.length ) {
        sum += ah(j) * so(j)(i)
      }
      java.lang.Math.tanh(sum)
    }}.toList
}

The algorithm loops over all hidden layer nodes and, for each, adds together all the outputs from the input layer multiplied by the strengths of the links. The output of each hidden node is the tanh of this sum, which is passed on to the output layer. The output layer does the same thing: it multiplies the outputs of the previous layer by the strengths of their links, and then applies tanh to get the final output.
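The same two loops can be run standalone on a tiny in-memory network to see the data flow. The strengths below are invented numbers for illustration, not values read from the database:

```scala
// A toy feed-forward pass: 2 input words, 2 hidden nodes, 2 output links
val ai = List(1.0, 1.0)                          // input activations
val si = List(List(0.2, 0.4), List(0.1, 0.3))    // word -> hidden strengths
val so = List(List(0.5, 0.1), List(0.2, 0.6))    // hidden -> link strengths

// Hidden outputs: tanh of the weighted sum of the inputs
val ah = (0 until 2).map(i =>
  math.tanh((0 until 2).map(j => ai(j) * si(j)(i)).sum)).toList

// The output layer does the same with the hidden outputs
val ao = (0 until 2).map(i =>
  math.tanh((0 until 2).map(j => ah(j) * so(j)(i)).sum)).toList

println(ao)
```

Because every output is a tanh value, each entry of ao is guaranteed to lie in (-1, 1), which is what makes the outputs directly usable as ranking scores.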

Here is an example run of the code:

Because the network has not been trained yet, it gives the same rank to all urls.

Review 5.4: Using Inbound Links - PageRank Algorithm

The PageRank algorithm was invented by Larry Page of Google. It calculates the probability that someone randomly clicking on links will arrive at a certain page. A user will eventually stop clicking, so to capture this, the PageRank algorithm uses a damping factor to indicate that there is an 85% chance that the user will continue clicking on the links at that page.

The above pic shows how the PageRank algorithm works: it uses the other pages' PageRank values, and the proportion of each page's links that point to the page in question, to calculate that page's PageRank value.

Then the PageRank value of page A is: 0.15 + 0.85 * (0.4/4 + 0.5/5 + 0.2/1) = 0.15 + 0.85 * (0.1 + 0.1 + 0.2) = 0.49

A problem is how we know the PageRank values of pages B, C and D in the first place. The method is to initialize the PageRank value of every page to 1.0, and then recalculate the values page by page over several iterations, e.g. 20 iterations.
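The whole iteration can be sketched in memory on a made-up three-page graph; the database-backed code below does the same thing, but one page at a time against the pagerank table:

```scala
// Minimal in-memory PageRank: links(p) = the pages that p links to
val links = Map(
  "A" -> List("B"),
  "B" -> List("A", "C"),
  "C" -> List("A")
)
val damping = 0.85
var pr = links.keys.map(_ -> 1.0).toMap   // every page starts at 1.0

for (_ <- 0 until 20) {
  pr = pr.map { case (page, _) =>
    // Each linking page contributes its rank divided by its number of links
    val incoming = links.collect {
      case (from, tos) if tos.contains(page) => pr(from) / tos.size
    }.sum
    page -> (0.15 + damping * incoming)
  }
}
println(pr)
```

After 20 iterations the values settle down; A, which is linked to by both B and C, ends up with the highest rank, and every page keeps at least the minimum value 0.15.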

Code snippet:

//The initial pagerank for every page is 1.0
//0.85 is the damping factor (the probability a person keeps clicking from a page); 0.15 is the minimum value, which is (1 - 0.85)
//The more iterations, the more accurate the pagerank, but 20 iterations is enough
def initDataAndCaculatePageRank(iternum:Int = 20) {
  initDataForPageRank()
  caculatePageRank(iternum)
}

def initDataForPageRank() = {
  val sqllist = List("drop table if exists pagerank",
                     "create table pagerank (urlid primary key, score)",
                     "insert into pagerank select rowid,1.0 from urllist")
  execTran(sqllist)
}

def caculatePageRank(iternum:Int) = {
  for ( i ← 0 until iternum )
    getAllUrlID.foreach(caculatePageRankFor(_))
}

def caculatePageRankFor(urlid:String) = {
  var result = 0.15
  val dampfactor = 0.85

  val fromIdList = query("SELECT fromid FROM link WHERE toid="+urlid) { rs ⇒
    bmap(rs.next) {
      rs.getString("fromid")
    }
  }

  fromIdList.foreach (
    fromid ⇒ {
      val score = query("SELECT score FROM pagerank WHERE urlid="+fromid)(rs ⇒ rs.getDouble("score"))
      //Total number of links on the linking page; the original snippet used an
      //undefined `links` here, so this lookup is reconstructed
      val linkCount = query("SELECT count(*) FROM link WHERE fromid="+fromid)(rs ⇒ rs.getDouble(1))
      result += dampfactor * (score / linkCount)
    }
  )
  execTran(List("UPDATE pagerank SET score="+result+" WHERE urlid="+urlid))
}

def getPageRankScore(rows:List[List[String]]) = {
  //The original snippet does not show how scorelist is built; presumably it
  //looks up the pagerank score for the urlid in each row
  val scorelist = rows.map(row ⇒ getPageRankScoreFor(row(0)))
  normalize(scorelist)
}

def getPageRankScoreFor(urlid:String) = {
  query("SELECT score FROM pagerank WHERE urlid="+urlid){ rs ⇒
    rs.getDouble(1)
  }
}

Review 5.3: Overview of Content-Based Ranking Algorithms

Overview of Content-Based Ranking Algorithms
Example result of the query in the previous note (w0 is for the word "c", w1 is for the word "programming", w2 is for the word "language"):

-----------------------------------------------------
urlid | w0.location | w1.location | w2.location |
-----------------------------------------------------
  1   |      3      |      4      |      5      |
-----------------------------------------------------
  1   |      3      |      4      |     400     |
-----------------------------------------------------

Before the birth of Google, most search engines mainly used content-based ranking algorithms and were able to give useful results. Here is a list of typical content-based ranking methods:
- Word frequency
- Word location in document
- Word distance

1. Word frequency: Count how many times each of the words the user searched for appears in the webpage of a url; this count is the score of the url. Calculate the scores for all urls, and sort the urls by their scores.

2. Word location in document: If a page is relevant to a search term, the term will tend to appear closer to the top of the page. For the same urlid, there may be different combinations of word locations. For each combination of a url, we can sum its locations, and take the smallest sum as the url's score.

3. Word distance: If the distance between the different query words in a page is shorter, the page tends to be more relevant to the search term. e.g. "c programming language is a structured high-level ..." is more relevant than "C > A+B is a formula; in the python programming language, we can express it as ..." when the user searches for "c programming language", because the total distance in the first one is only two.
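The distance score is easy to compute from one row of the location query in Review 5.2. A minimal sketch (distanceScore is a name I made up for this illustration):

```scala
// Sum of the gaps between adjacent query-word locations in one row;
// a smaller total gap means the words sit closer together on the page
def distanceScore(locations: List[Int]): Int =
  locations.sliding(2).map { case List(a, b) => math.abs(b - a) }.sum

// "c programming language" appearing as three consecutive words
println(distanceScore(List(3, 4, 5)))    // prints 2
// the same words scattered across the page
println(distanceScore(List(3, 4, 400)))  // prints 397
```

The two rows are exactly the two result rows from the example table above, so the first row would win under this method.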

We may want to combine all three methods in order to get a better result. A way to do this is to normalize the scores into the range [0, 1] and give each method a weight. More info about normalization:
http://hi.baidu.com/idontknow1987/blog/item/f94bac518fac8d698535248a.html
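A minimal sketch of such a normalization plus a weighted combination. The shape follows the usual "normalize scores, small may be better" idea; the weights and scores here are made-up numbers:

```scala
// Rescale urlid -> score maps into [0, 1]. For location and distance scores,
// smaller is better, so the best (smallest) score maps to 1.0.
def normalize(scores: Map[Int, Double], smallIsBetter: Boolean): Map[Int, Double] =
  if (smallIsBetter) {
    val vsmall = 0.00001                  // avoid division by zero
    val minScore = scores.values.min
    scores.map { case (u, s) => u -> minScore / math.max(vsmall, s) }
  } else {
    val maxScore = math.max(scores.values.max, 0.00001)
    scores.map { case (u, s) => u -> s / maxScore }
  }

val freq = normalize(Map(1 -> 10.0, 2 -> 5.0), smallIsBetter = false)
val dist = normalize(Map(1 -> 2.0, 2 -> 397.0), smallIsBetter = true)

// Combine with made-up weights: 1.0 for frequency, 0.5 for distance
val combined = freq.keys.map(u => u -> (1.0 * freq(u) + 0.5 * dist(u))).toMap
println(combined)
```

Once every method produces scores in [0, 1], the weights alone decide how much each method influences the final ranking.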

Because I want to focus more on other page ranking algorithms, I will skip the implementation of the above methods.

Review 5.2: Perform a Query in Search Engine Database

Perform a Query in Search Engine Database

After the database for the search engine has been constructed, SQL queries are used to get the results users intend to get.

To make this easy to understand, an example is shown below. Suppose a user searches for "c programming language". We can guess this user wants to browse webpages introducing how to program in C, so the matching webpages should contain all three words. Hence, we can use a SQL statement like:

SELECT w0.urlid, w0.location , w1.location , w2.location
FROM wordlocation as w0, wordlocation as w1, wordlocation as w2
WHERE w0.urlid = w1.urlid
AND w1.urlid = w2.urlid
AND w0.wordid = ("C"'s rowid in table wordlist)
AND w1.wordid = ("programming"'s rowid in table wordlist)
AND w2.wordid = ("language"'s rowid in table wordlist)
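For an arbitrary number of query words, the same statement can be generated programmatically. A sketch (buildQuery and the word ids are my own illustration; the rowids are inlined instead of the placeholders above):

```scala
// Build the self-join query for a list of word rowids: one wordlocation
// alias per word, joined on urlid, filtered by each word's rowid
def buildQuery(wordids: List[Int]): String = {
  val n = wordids.length
  val fields = "w0.urlid" + (0 until n).map(i => s", w$i.location").mkString
  val tables = (0 until n).map(i => s"wordlocation as w$i").mkString(", ")
  val joins = (1 until n).map(i => s"w${i - 1}.urlid = w$i.urlid")
  val wordClauses = (0 until n).map(i => s"w$i.wordid = ${wordids(i)}")
  s"SELECT $fields FROM $tables WHERE " + (joins ++ wordClauses).mkString(" AND ")
}

println(buildQuery(List(11, 12, 13)))
```

With three word ids this produces exactly the shape of the hand-written statement above, one alias per query word.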

Review 5.1: A Simple Database Schema for Search Engine

A Simple Database Schema for Search Engine

There are already a set of ranking algorithms for search engines. Before starting to study them, I firstly need to prepare a database for the purpose of testing. The schema for this database is as the following pic.

The first table is 'urllist', which stores all the urls that have been indexed by the crawler. The second table (wordlist) holds all the words extracted from those indexed pages. The third table (wordlocation) records whether the webpage of a url contains a word, and at which location in the page the word appears. The remaining two tables describe links between webpages: the table (link) stores pairs of URL IDs, indicating a link from one webpage to another, and the table (linkwords) uses wordid and linkid columns to store which words are actually used in a link.
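The five tables can be written out as create statements. A sketch assuming SQLite's dynamic typing, in the same style as the create statements used elsewhere in the code; the exact column names are my reading of the description above:

```scala
// One create statement per table in the schema described above
val schema = List(
  "create table urllist(url)",                          // all crawled urls
  "create table wordlist(word)",                        // every word seen on any page
  "create table wordlocation(urlid, wordid, location)", // where each word occurs in each page
  "create table link(fromid integer, toid integer)",    // a link from one page to another
  "create table linkwords(wordid, linkid)"              // the words used inside a link
)
schema.foreach(println)
```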

Review 4.4: Document filtering - Fisher Method

The Fisher method is another way to get very accurate results for document filtering, particularly spam filtering. It calculates the probability of a category for each feature in a document, then combines these probabilities to test whether the document should be classified into that category.

Unlike naive bayes, which uses Pr(feature | category) to calculate Pr(doc | category) and then derives Pr(category | doc) from it, the Fisher method calculates Pr(category | feature) first:

After we have Pr(category | feature) for each feature, we can calculate the Fisher probability with the following procedure: multiply all the probabilities together, take the natural log (math.log in Python), and multiply the result by -2.
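That combination step can be sketched directly. The full fisherprob additionally feeds the -2 * ln value through an inverse chi-square function to turn it back into a probability, so that step is included here too; all the per-feature probabilities are made-up numbers:

```scala
// Probability of seeing a chi-square value at least this large with df
// degrees of freedom (the series expansion commonly used for even df)
def invchi2(chi: Double, df: Int): Double = {
  val m = chi / 2.0
  var term = math.exp(-m)
  var sum = term
  for (i <- 1 until df / 2) {
    term *= m / i
    sum += term
  }
  math.min(sum, 1.0)
}

// Multiply the Pr(category | feature) values, take ln, multiply by -2,
// then convert back to a probability with 2 * (number of features) dof
def fisherProb(probs: List[Double]): Double =
  invchi2(-2.0 * math.log(probs.product), probs.length * 2)

// Features that individually point to the category give a high fisherprob
println(fisherProb(List(0.9, 0.8, 0.9)))
// Features that point away from it give a low one
println(fisherProb(List(0.1, 0.2, 0.1)))
```

The resulting fisherprob is what gets compared against the per-category lower bound when classifying an item.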

Finally, we can specify a lower bound for each category, and use the fisherprob of the different categories to classify items.

Actually, because the features in a document are not independent, neither naive bayes nor the Fisher method gives the real probability of a category. But the Fisher method is the better estimate.