Journal Title
Title of Journal: Data Min Knowl Disc
Abbreviation: Data Mining and Knowledge Discovery
Publisher
Kluwer Academic Publishers-Plenum Publishers
Authors: Olivier de Vel
Publish Date: 2006/05/26
Volume: 13, Issue: 3, Pages: 309-334
Abstract
In this paper we report an investigation into the learning of semi-structured document categorization. We automatically discover low-level, short-range byte data structure patterns from a document data stream by extracting all byte sub-sequences within a sliding window to form an augmented (or "bounded-length") string spectrum feature map, and by using a modified suffix trie data structure, called the coloured generalized suffix tree (CGST), to efficiently store and manipulate the feature map. Using the CGST, we are able to efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms to categorize the data streams, namely the SVM and Naive Bayes (NB) classifiers. Experiments have provided good classification performance results on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
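To make the feature map concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that "bounded-length" means counting every contiguous byte substring of length 1 up to a maximum window size, and it stores the counts in a plain dictionary rather than in the CGST described in the abstract. The function names `bounded_spectrum` and `spectrum_kernel` and the parameter `max_len` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def bounded_spectrum(data: bytes, max_len: int) -> Counter:
    """Count every contiguous byte substring of length 1..max_len.

    Assumption: this is one reading of the "bounded-length" spectrum;
    a fixed-length spectrum would keep only substrings of length max_len.
    """
    counts = Counter()
    for start in range(len(data)):
        for length in range(1, max_len + 1):
            if start + length > len(data):
                break  # the window would run past the end of the stream
            counts[data[start:start + length]] += 1
    return counts


def spectrum_kernel(a: bytes, b: bytes, max_len: int) -> int:
    """Inner product of the two bounded-length spectrum feature maps.

    A dictionary stands in for the CGST here, so this sketch does not
    reproduce the efficiency the abstract attributes to that structure.
    """
    fa = bounded_spectrum(a, max_len)
    fb = bounded_spectrum(b, max_len)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller feature map
    return sum(count * fb[sub] for sub, count in fa.items())
```

For example, `spectrum_kernel(b"abab", b"abab", 2)` evaluates to 13: the substrings `a`, `b`, and `ab` each occur twice and `ba` occurs once, giving 4 + 4 + 4 + 1. Two streams with no bytes in common yield a kernel value of 0.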