Authors: Aris Anagnostopoulos, Andrei Broder, Kunal Punera
Publish Date: 2007/09/13
Volume: 16, Issue: 2, Pages: 129-154
Abstract
Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the “best” short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques that are found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of those techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that on our setup the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
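As a rough illustration of the idea in the abstract, the following minimal sketch classifies documents by running a short weighted query against an inverted index, forcing at least one negatively correlated term into the query. Everything here is an assumption made for illustration: the toy corpus, the function names, and especially the scoring rule (document-frequency rate in the positive class minus the rate in the negative class), which is a simple stand-in and not the feature-selection measure used in the paper.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def select_query_terms(index, positives, negatives, k, min_negative=1):
    """Pick k weighted terms for the class query.

    Weight = fraction of positive docs containing the term minus the
    fraction of negative docs containing it (hypothetical stand-in score).
    At least `min_negative` negatively weighted terms are forced in.
    """
    scores = {}
    for term, postings in index.items():
        pos_rate = len(postings & positives) / len(positives)
        neg_rate = len(postings & negatives) / len(negatives)
        scores[term] = pos_rate - neg_rate
    # Rank by absolute weight; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-abs(scores[t]), t))
    chosen = [t for t in ranked if scores[t] < 0][:min_negative]
    for term in ranked:
        if len(chosen) >= k:
            break
        if term not in chosen:
            chosen.append(term)
    return {t: scores[t] for t in chosen}


def run_query(index, query, threshold=0.0):
    """Score documents by reading only the postings of the query terms."""
    totals = defaultdict(float)
    for term, weight in query.items():
        for doc_id in index.get(term, ()):
            totals[doc_id] += weight
    return {doc_id for doc_id, score in totals.items() if score > threshold}


# Toy corpus (invented): documents 1-3 are the target class.
docs = {
    1: "football match goal team",
    2: "team wins football league",
    3: "goal scored in football final",
    4: "stock market shares fall",
    5: "market rally lifts shares",
    6: "team of analysts watch market",
}
positives, negatives = {1, 2, 3}, {4, 5, 6}

index = build_inverted_index(docs)
query = select_query_terms(index, positives, negatives, k=4)
predicted = run_query(index, query)
print(sorted(query.items()), predicted)
```

Note that `run_query` never iterates over the documents themselves; it only walks the posting lists of the selected terms, which is the access pattern the abstract assumes. Documents that match no query term are implicitly assigned to the negative class.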
Keywords: