Saturday, 28 February 2015

Review 4.4: Document filtering - Fisher Method

The Fisher method is another approach that gives very accurate results for document filtering, particularly spam filtering. It calculates the probability of a category for each feature in a document, then combines these probabilities to test whether the document should be classified into that category.

Naive Bayes uses Pr(feature | category) to calculate Pr(doc | category) and then derives Pr(category | doc) from it. The Fisher method instead calculates Pr(category | feature) for each feature first.
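One common way to get Pr(category | feature) is to normalise Pr(feature | category) over all categories, which implicitly assumes each category is equally likely beforehand. The sketch below shows that calculation only; the function name category_prob and the dictionary layout are made up for illustration and are not taken from any particular library.

```python
def category_prob(feature_probs, category):
    """Estimate Pr(category | feature).

    feature_probs maps each category name to Pr(feature | category)
    for one fixed feature (hypothetical input layout).
    Assumes all categories are equally likely a priori.
    """
    total = sum(feature_probs.values())
    if total == 0:
        # The feature was never seen in any category.
        return 0.0
    return feature_probs[category] / total


# A word that appears in 60% of spam documents and 20% of ham documents.
probs = {"spam": 0.6, "ham": 0.2}
print(category_prob(probs, "spam"))
```

With these made-up numbers the word points toward spam three times as strongly as toward ham.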

Once we have Pr(category | feature) for every feature, we can calculate the Fisher probability as follows: multiply all the probabilities together, take the natural log (math.log in Python), and multiply the result by -2.
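The procedure above can be sketched in a few lines. In the usual formulation of Fisher's method, the resulting score is then compared against a chi-squared distribution with twice as many degrees of freedom as there are features, which turns it into a value between 0 and 1; that last step is included here as an assumption about how the score is used. The names fisher_score, inv_chi2 and fisher_prob are illustrative. Summing the logs is mathematically the same as taking the log of the product and avoids underflow on long documents.

```python
import math


def fisher_score(probs):
    """-2 * ln(p1 * p2 * ... * pn), computed as a sum of logs.

    Every probability must be greater than 0.
    """
    return -2.0 * sum(math.log(p) for p in probs)


def inv_chi2(chi, df):
    """Chance that a chi-squared variable with an even number of
    degrees of freedom df is at least chi."""
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the result above 1.
    return min(total, 1.0)


def fisher_prob(probs):
    """Combine per-feature Pr(category | feature) values."""
    return inv_chi2(fisher_score(probs), 2 * len(probs))


print(fisher_prob([0.9, 0.9]))  # features that agree strongly
print(fisher_prob([0.1, 0.1]))  # features that point elsewhere
```

Features that consistently favour the category give a value close to 1, while features that point away from it give a value close to 0.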

Finally, we can specify a lower bound for each category and classify an item by comparing its Fisher probability for each category against that category's bound.
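A minimal sketch of that classification step, assuming the Fisher probabilities have already been computed for each category. The function name classify, the minimums dictionary and the default argument are hypothetical choices for this example.

```python
def classify(fisher_probs, minimums, default=None):
    """Pick the category with the highest Fisher probability
    among those that exceed their own lower bound.

    fisher_probs: category -> Fisher probability for this item
    minimums:     category -> lower bound (0.0 if not listed)
    """
    best, best_score = default, 0.0
    for category, score in fisher_probs.items():
        if score > minimums.get(category, 0.0) and score > best_score:
            best, best_score = category, score
    return best


# A strict bound for spam, a lenient one for ham.
minimums = {"spam": 0.8, "ham": 0.2}
print(classify({"spam": 0.7, "ham": 0.4}, minimums, default="unknown"))
```

Setting a high bound for spam and a low one for ham makes the filter cautious about marking a legitimate message as spam, since a spam score of 0.7 is not enough here and the item falls through to ham.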

Because the features in a document are not independent of one another, neither naive Bayes nor the Fisher method gives the true probability of a category. The Fisher method, however, tends to give the better estimate.
