Here is the formula (note that we assume

**the occurrence of different features in an item is independent**):

**Prob(Item | category) = Prob(feature1 | category) * Prob(feature2 | category) * ... * Prob(featureN | category)**

This formula alone is not enough to decide which category an item should belong to, because what we really want to know is the reverse conditional: the probability of a category given an item, i.e. Prob(category | Item).
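The product formula above can be sketched in a few lines. The feature probabilities here are made-up numbers purely for illustration:

```scala
// A minimal sketch of the independence assumption: Prob(Item | category)
// is the product of the per-feature conditional probabilities.
object IndependenceSketch {
  // hypothetical Prob(feature | category) values for one category
  val featureProbs = Map("money" -> 0.2, "casino" -> 0.1, "win" -> 0.5)

  // Prob(Item | category) = Prob(f1 | category) * ... * Prob(fN | category)
  def itemProb(features: Seq[String]): Double =
    features.map(featureProbs).product

  def main(args: Array[String]): Unit =
    println(itemProb(Seq("money", "casino", "win"))) // 0.2 * 0.1 * 0.5 = 0.01
}
```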


**Prob(category | Item)**

Actually, it is easy to get this using **Bayes' theorem**:

**Prob(Category | Item) = Prob(Item | Category) * Prob(Category) / Prob(Item)**

Here is how the formula is deduced:

**Prob(AB) = Prob(A|B) * Prob(B) = Prob(B|A) * Prob(A)**

=>

**Prob(A|B) = Prob(B|A) * Prob(A) / Prob(B)**

In the formula above, Prob(Item) is **useless for comparison**, because it is the same for every category.
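A quick sketch with made-up numbers shows why dropping Prob(Item) is safe: it divides every category's score by the same constant, so the ranking of categories is unchanged.

```scala
// Hypothetical Prob(Item | Category) and Prob(Category) values; the exact
// numbers are illustrative assumptions, not taken from real data.
object DropEvidenceSketch {
  val likelihood = Map("spam" -> 0.02, "ham" -> 0.001) // Prob(Item | Category)
  val prior      = Map("spam" -> 0.3,  "ham" -> 0.7)   // Prob(Category)

  // unnormalized score: Prob(Item | Category) * Prob(Category)
  def score(cate: String): Double = likelihood(cate) * prior(cate)

  def main(args: Array[String]): Unit = {
    // spam: 0.02 * 0.3 = 0.006; ham: 0.001 * 0.7 = 0.0007 -> spam wins,
    // whether or not both scores are divided by Prob(Item)
    println(List("spam", "ham").maxBy(score))
  }
}
```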

Here is the code:

```scala
def naivebayes(item: String, cate: String): Double =
  // Prob(Item | Category) * Prob(Category); toDouble guards against integer division
  getDocProb(item, cate) * countTotalFeaturesInCate(cate).toDouble / countTotalFeatures()

def getDocProb(item: String, cate: String): Double =
  // Prob(Item | Category) as the product of the per-feature probabilities
  getFeatures(item).map(getWeightedProb(_, cate)).reduceLeft(_ * _)
```
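The snippet above depends on helpers (`getFeatures`, `getWeightedProb`, `countTotalFeaturesInCate`, `countTotalFeatures`) that are not shown here. Below is a self-contained sketch with stand-in implementations: the names follow the snippet, but the training data, the whitespace tokenizer, and the add-one smoothing are all illustrative assumptions.

```scala
object NaiveBayesSketch {
  // toy training data: (document, category) pairs - an assumption for illustration
  val training = List(
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda today", "ham")
  )

  // stand-in tokenizer: lowercase and split on whitespace
  def getFeatures(item: String): List[String] =
    item.toLowerCase.split("\\s+").toList

  // how many training documents of a category contain the feature
  def featureCountInCate(f: String, cate: String): Int =
    training.count { case (doc, c) => c == cate && getFeatures(doc).contains(f) }

  // number of training documents in a category
  def countTotalFeaturesInCate(cate: String): Int =
    training.count(_._2 == cate)

  def countTotalFeatures(): Int = training.size

  // Prob(feature | category), smoothed so unseen features don't zero the product
  def getWeightedProb(f: String, cate: String): Double =
    (featureCountInCate(f, cate) + 1.0) / (countTotalFeaturesInCate(cate) + 2.0)

  def getDocProb(item: String, cate: String): Double =
    getFeatures(item).map(getWeightedProb(_, cate)).reduceLeft(_ * _)

  // unnormalized Prob(Category | Item): Prob(Item | Category) * Prob(Category)
  def naivebayes(item: String, cate: String): Double =
    getDocProb(item, cate) * countTotalFeaturesInCate(cate).toDouble / countTotalFeatures()

  def main(args: Array[String]): Unit =
    println(List("spam", "ham").maxBy(c => naivebayes("free money", c)))
}
```

With this toy data, "free money" scores higher under "spam" than "ham", so the comparison picks "spam".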
