Saturday, 28 February 2015

Review 4.3: Document filtering - Use Naive Bayes

Now we know how to calculate the probability of a feature given a category. We can use this probability to calculate the probability of an item given that category.

Here is the formula (please note that we assume the occurrences of different features in an item are independent):

Prob(Item | category) = Prob(feature1 | category) * Prob(feature2 | category) * ... * Prob(featureN | category)
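To make the product concrete, here is a minimal sketch with made-up numbers. The object name, the probability table and the neutral default of 0.5 for unknown features are all assumptions for illustration, not part of the code from this series.

```scala
object ItemProbSketch {
  // Hypothetical values of Prob(feature | category) for a single category.
  val featureProbs: Map[String, Double] = Map(
    "cheap"  -> 0.5,
    "pills"  -> 0.4,
    "online" -> 0.25
  )

  // Prob(Item | category) as the product of the per-feature probabilities.
  // Features missing from the table get a neutral 0.5 (an assumption made here).
  def itemProb(features: Seq[String]): Double =
    features.map(f => featureProbs.getOrElse(f, 0.5)).product

  def main(args: Array[String]): Unit =
    // 0.5 * 0.4 * 0.25, roughly 0.05
    println(itemProb(Seq("cheap", "pills", "online")))
}
```

Because every factor is at most 1, the product gets smaller as the item gets longer. That is fine here, since we only compare the values across categories for the same item.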

This formula alone is not enough to decide which category an item belongs to, because what we really want to know is the probability of a category given the item, i.e. Prob(category | Item).

Actually, it is easy to get this using Bayes' theorem:
Prob(Category | Item) = Prob(Item | Category) * Prob(Category) / Prob(Item)

How the formula is derived:
  Prob(AB) = Prob(A|B) * Prob(B) = Prob(B|A) * Prob(A)
=>
  Prob(A|B) = Prob(B|A) * Prob(A) / Prob(B)
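A quick numeric check of the derivation, as a sketch. The counts below are invented: A stands for "the document is in some category" and B for "the document contains some word".

```scala
object BayesCheck {
  // Hypothetical counts over 100 documents.
  val total   = 100.0
  val countA  = 40.0 // documents in category A
  val countB  = 25.0 // documents containing word B
  val countAB = 10.0 // documents in A that also contain B

  val pA       = countA / total
  val pB       = countB / total
  val pBGivenA = countAB / countA

  // Prob(A|B) read directly from the counts ...
  val pAGivenB = countAB / countB
  // ... and the same value obtained through Prob(B|A) * Prob(A) / Prob(B).
  val viaBayes = pBGivenA * pA / pB

  def main(args: Array[String]): Unit =
    println(s"direct = $pAGivenB, via Bayes = $viaBayes")
}
```

Both routes should agree up to floating-point rounding (about 0.4 with these counts).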

In the above formula, Prob(Item) can be ignored when comparing categories, because it is the same for every category.

Here is the code:



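// Unnormalized Prob(cate | item): Prob(item | cate) * Prob(cate).
// Prob(item) is left out because it is the same for every category.
def naivebayes(item: String, cate: String): Double = {
  getDocProb(item, cate) * countTotalFeaturesInCate(cate) / countTotalFeatures()
}

// Prob(item | cate): product of the weighted probabilities of the item's features.
// product returns 1.0 for an item with no features, where reduceLeft would throw.
def getDocProb(item: String, cate: String): Double = {
  getFeatures(item).map(getWeightedProb(_, cate)).product
}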
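The score by itself does not pick a category yet. One possible way to use it, as a self-contained sketch: compute the score for every category and take the largest. The names ClassifySketch, score and classify, the probability tables and the priors are all hypothetical stand-ins for the helpers used above, and the 0.5 default for unseen features is an assumption.

```scala
object ClassifySketch {
  // Hypothetical Prob(feature | category) tables.
  val featureProbs: Map[String, Map[String, Double]] = Map(
    "good" -> Map("quick" -> 0.5, "money" -> 0.1),
    "bad"  -> Map("quick" -> 0.2, "money" -> 0.6)
  )

  // Hypothetical Prob(category) values.
  val priors: Map[String, Double] = Map("good" -> 0.6, "bad" -> 0.4)

  // Unnormalized score: Prob(Item | category) * Prob(category).
  // Features missing from a table get a neutral 0.5 (an assumption made here).
  def score(features: Seq[String], cate: String): Double =
    features.map(f => featureProbs(cate).getOrElse(f, 0.5)).product * priors(cate)

  // Pick the category with the highest score. Dividing every score by
  // Prob(Item) would not change which one is the largest.
  def classify(features: Seq[String]): String =
    priors.keys.maxBy(cate => score(features, cate))

  def main(args: Array[String]): Unit = {
    println(classify(Seq("quick")))
    println(classify(Seq("money")))
  }
}
```

With these numbers, "quick" scores 0.5 * 0.6 for good against 0.2 * 0.4 for bad, so it lands in good, while "money" lands in bad.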
