Saturday, 28 February 2015

Review 5.3: Overview of Content-Based Ranking algorithms

Example result of the query from the previous note (w0 is the word "c", w1 is the word "programming", w2 is the word "language"):

urlid | w0.location | w1.location | w2.location
1     | 3           | 4           | 5
1     | 3           | 4           | 400

Before the birth of Google, most search engines mainly used content-based ranking algorithms and were able to give useful results. Here is a list of typical content-based ranking methods:
    - Word frequency
    - Word location in document
    - Word distance

1. Word frequency: Count how many times the words the user searched for appear in the page at a url. That count is the url's score. Calculate the score for every url, then sort the urls by score, highest first.
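
A minimal sketch of the frequency score, assuming the query result is a list of tuples shaped like the table above (urlid followed by one location per word). The function name `frequency_scores` is my own, and counting distinct (word, location) pairs is just one way to read "how many times the words appear":

```python
from collections import defaultdict

def frequency_scores(rows):
    """Count distinct word occurrences per url.

    rows: iterable of (urlid, w0.location, w1.location, ...).
    The same location can repeat across rows, so each
    (word index, location) pair is counted only once.
    """
    seen = defaultdict(set)
    for urlid, *locations in rows:
        for word_index, location in enumerate(locations):
            seen[urlid].add((word_index, location))
    return {urlid: len(pairs) for urlid, pairs in seen.items()}

# The two example rows from the table above.
rows = [(1, 3, 4, 5), (1, 3, 4, 400)]
scores = frequency_scores(rows)   # {1: 4}

# Higher is better, so sort in descending order of score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

For the example rows, "c" and "programming" each appear once and "language" appears twice, so url 1 gets a score of 4.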

2. Word location in document: If a page is relevant to a search term, the term tends to appear close to the top of the page. For the same urlid there may be several combinations of word locations. For each combination we can sum its locations, and take the smallest sum as the url's score. Here a lower score is better.
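
A rough sketch of the location score under the same assumption about the row layout (urlid followed by one location per word). `location_scores` is a name I made up for illustration:

```python
def location_scores(rows):
    """Smallest sum of word locations per url (lower is better).

    rows: iterable of (urlid, w0.location, w1.location, ...).
    """
    best = {}
    for urlid, *locations in rows:
        total = sum(locations)
        # Keep only the smallest sum seen so far for this url.
        if urlid not in best or total < best[urlid]:
            best[urlid] = total
    return best

rows = [(1, 3, 4, 5), (1, 3, 4, 400)]
scores = location_scores(rows)   # {1: 12}
```

The first row sums to 3 + 4 + 5 = 12 and the second to 407, so url 1 keeps 12.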

3. Word distance: The closer together the different query words appear in a page, the more relevant the page tends to be to the search. e.g. "c programming language is a structured high-level ..." is more relevant than "C > A+B is a formula, in python programming language, we can express it as... " when the user searches for "c programming language", because the distance in the first one is only two.
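
One possible sketch of the distance score, again assuming rows of (urlid, location per word). Here I take the distance of a row to be the sum of the gaps between neighbouring query words, which gives two for the "c programming language" example; `distance_scores` is a hypothetical name:

```python
def distance_scores(rows):
    """Smallest total gap between consecutive query words per url.

    rows: iterable of (urlid, w0.location, w1.location, ...).
    Lower is better. A single-word query always scores 0.
    """
    best = {}
    for urlid, *locations in rows:
        # Sum of |w1 - w0| + |w2 - w1| + ... for this combination.
        gap = sum(abs(b - a) for a, b in zip(locations, locations[1:]))
        best[urlid] = min(gap, best.get(urlid, gap))
    return best

rows = [(1, 3, 4, 5), (1, 3, 4, 400)]
scores = distance_scores(rows)   # {1: 2}
```

The row (3, 4, 5) has gaps 1 + 1 = 2, while (3, 4, 400) has 1 + 396 = 397, so the url is scored by its tightest combination.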

We may want to combine all three methods to get better results. One way to do this is to normalize the scores to the range [0, 1] and give each method a weight. More info about normalization:
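
A small sketch of how normalizing and weighting could look. The function names, the raw scores and the weights below are all made up for illustration, and dividing by the best score is only one of several ways to map scores into [0, 1]:

```python
def normalize(scores, small_is_better=False):
    """Scale scores into [0, 1] so that the best url gets 1.0."""
    if not scores:
        return {}
    floor = 0.00001  # guards against division by zero
    if small_is_better:
        lowest = max(floor, min(scores.values()))
        return {u: lowest / max(floor, s) for u, s in scores.items()}
    highest = max(floor, max(scores.values()))
    return {u: s / highest for u, s in scores.items()}

def combine(weighted_scores):
    """Weighted sum of several normalized score dicts."""
    total = {}
    for weight, scores in weighted_scores:
        for urlid, score in scores.items():
            total[urlid] = total.get(urlid, 0.0) + weight * score
    return total

# Hypothetical raw scores for two urls.
frequency = {1: 4, 2: 2}   # higher is better
distance = {1: 2, 2: 8}    # lower is better

final = combine([
    (1.0, normalize(frequency)),                       # {1: 1.0, 2: 0.5}
    (2.0, normalize(distance, small_is_better=True)),  # {1: 1.0, 2: 0.25}
])
# final == {1: 3.0, 2: 1.0}
```

Normalizing first matters because the raw scores live on different scales and point in different directions: a big frequency is good, but a big distance is bad.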

Because I want to focus more on other page ranking algorithms, I will skip the implementation of the above algorithms.
