Saturday, 28 February 2015

Review 4.1: Document filtering - Overview and How to train a classifier

Document filtering generally means using a classifier to sort a set of documents into different categories. The best-known example of document filtering is spam elimination. Before studying document filtering, two important concepts about classifiers should be introduced. The first is the item, which is the object to be classified. In document filtering, the items are documents or document titles. The second is the feature, which is anything that can be determined to be either present or absent in an item. In document filtering, the features are the words in the documents.
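
To make the item/feature distinction concrete, here is a minimal sketch. It assumes a feature is simply a distinct lowercase word, and the name `features` is a hypothetical helper used only for illustration:

```scala
object FeatureSketch {
  // Hypothetical helper: the set of distinct lowercase words present in an item.
  // Assumption: a "word" is any run of word characters (letters, digits, underscore).
  def features(item: String): Set[String] =
    item.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  def main(args: Array[String]): Unit = {
    val item = "Make quick money, quick!"   // the item to be classified
    val present = features(item)            // the features found in that item
    println(present.contains("money"))      // true: the feature is present
    println(present.contains("casino"))     // false: the feature is absent
  }
}
```

Using a set means each feature is recorded only as present or absent, no matter how often the word occurs in the item.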

Unlike NMF, document filtering uses supervised methods, so before classifying documents we need to train a classifier on examples whose categories are already known. Here is the code to train a classifier. The result should be in a format like this: {"girl":[good:1,bad:0], "boy":[good:0,bad:100] }, that is, for each feature, the number of times it has been seen in each category.



import java.util.regex._

object DF {
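 // fc: feature -> (category -> number of times the feature was seen in that category)
 var fc = Map[String,scala.collection.mutable.Map[String,Int]]()
 // cc: category -> number of items trained in that category
 var cc = Map[String,Int]()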
 
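 def main(arg:Array[String]): Unit = {
  train("the quick brown fox jumps over the lazy dog","good")
  train("make quick money in the online casino","bad")
  println(fc)
 }
 
 // Count every feature of the item under the given category,
 // then count the item itself for that category
 def train(item:String, cate:String): Unit = {
  getFeatures(item).foreach(incf(_,cate))
  incc(cate)
 }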
 
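 // Split the item on runs of non-word characters; each remaining token is a feature
 def getFeatures(item:String): List[String] = {
  val pattern = Pattern.compile("\\W+")
  // Drop empty tokens, e.g. when the item starts with punctuation
  pattern.split(item.trim).toList.filter(_.nonEmpty)
 }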
 
 // Increase the count of a feature in a category
 def incf(feature:String, cate:String): Unit = {
  if(!fc.isDefinedAt(feature)) {
   fc += ((feature, scala.collection.mutable.Map[String,Int]()))
  }
  val counts = fc(feature)
  counts(cate) = counts.getOrElse(cate, 0) + 1
 }
 
 // Increase the count of items seen in a category
 def incc(cate:String): Unit = {
  // cc is an immutable Map held in a var, so rebind it to an updated copy
  cc = cc.updated(cate, cc.getOrElse(cate, 0) + 1)
 }
}
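
To see what training on the two sample sentences produces, here is a self-contained sketch that builds the same two tables without mutable state. It is one possible arrangement rather than the code above: the names `tokens`, `train` and `count` are hypothetical, and, like the listing above, it counts every occurrence of a word (so "the" is counted twice in the first sentence).

```scala
object CountsSketch {
  // feature -> (category -> count)
  type FeatureCounts = Map[String, Map[String, Int]]

  // Hypothetical tokenizer: split on runs of non-word characters
  def tokens(item: String): List[String] =
    item.trim.split("\\W+").toList.filter(_.nonEmpty)

  // Fold over (item, category) pairs and return (feature counts, category counts)
  def train(samples: Seq[(String, String)]): (FeatureCounts, Map[String, Int]) = {
    val empty: (FeatureCounts, Map[String, Int]) = (Map.empty, Map.empty)
    samples.foldLeft(empty) { case ((fc, cc), (item, cate)) =>
      val fc2 = tokens(item).foldLeft(fc) { (acc, f) =>
        val perCat = acc.getOrElse(f, Map.empty[String, Int])
        acc.updated(f, perCat.updated(cate, perCat.getOrElse(cate, 0) + 1))
      }
      (fc2, cc.updated(cate, cc.getOrElse(cate, 0) + 1))
    }
  }

  // Times a feature was seen in a category; 0 if the pair was never seen
  def count(fc: FeatureCounts, feature: String, cate: String): Int =
    fc.get(feature).flatMap(_.get(cate)).getOrElse(0)

  def main(args: Array[String]): Unit = {
    val (fc, cc) = train(Seq(
      "the quick brown fox jumps over the lazy dog" -> "good",
      "make quick money in the online casino" -> "bad"))
    println(count(fc, "the", "good"))    // 2
    println(count(fc, "quick", "bad"))   // 1
    println(count(fc, "casino", "good")) // 0
    println(cc("good"))                  // 1
  }
}
```

Note that a category a feature has never appeared in is simply missing from its inner map, so a lookup helper such as `count` has to supply the zero.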
