Saturday, 28 February 2015

Review 2.4: Modeling with Decision Trees- Pruning

Pruning is a method used to solve the overfit problem. If a tree creates branches that decrease the entropy slightly, it may give a an answer which is being more certain than it really is. Another defective method which can solve overfit problem is that, in tree creation procedure, when the decreased entropy is less than a given entropy, we do not create the branch. The drawback of this method is,  although the entropy is not reduced a lot by this split, it may decreased greatly by subsequent splits.

Pruning do not have the drawback described above. It creates the entire tree at first. so we can start to check wheather merging branches would increase the entropy less than a given threshold from the bottom of the tree. 

The code of pruning a decision tree:

 
 def prune(tree:DTreeNode , threshold:Float ):Unit = {
  if (tree.left != None ) 
   prune(tree.left.get, threshold)
  
  if (tree.right != None ) 
   prune(tree.right.get, threshold)
  
  if (tree.left != None && tree.left.get.result!=None && 
  tree.right != None && tree.right.get.result !=None ) {
   val left = tree.left.get.result.get.toList.map(x=> (for(i←0 to x._2) yield List(x._1)).toList ).flatten
   val right = tree.right.get.result.get.toList.map( x=> (for(i←0 to x._2) yield List(x._1)).toList ).flatten
   val p1 =  left.length / (left.length + right.length)
   val p2 = 1.0 - p1
   println(entropy(left++right)+":" + entropy(left)*p1+":" + entropy(right)*p2)
   val diff = entropy(left++right) - (entropy(left)*p1 + entropy(right)*p2)
   if ( diff ≤ threshold ) {
    tree.colIndex = None
    tree.split = None
    tree.left = None
    tree.right = None
    tree.result = Some(countResult(left ++ right))
   }
  } 
 }
A example pruningresult.png

No comments:

Post a Comment