Paper Search Console


Journal Title

Title of Journal: Data Mining and Knowledge Discovery

Abbreviation: Data Min Knowl Disc


Publisher

Kluwer Academic Publishers-Plenum Publishers


DOI

10.1002/mame.200800080


ISSN

1573-756X


Learning Semi-Structured Document Categorization U

Authors: Olivier de Vel
Publish Date: 2006/05/26
Volume: 13, Issue: 3, Pages: 309-334

Abstract

In this paper we report an investigation into the learning of semi-structured document categorization. We automatically discover low-level, short-range byte data structure patterns from a document data stream by extracting all byte subsequences within a sliding window to form an augmented (or bounded-length) string spectrum feature map, and by using a modified suffix trie data structure, called the coloured generalized suffix tree (CGST), to efficiently store and manipulate the feature map. Using the CGST, we are able to efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms to categorize the data streams, namely the SVM and Naive Bayes (NB) classifiers. Experiments have provided good classification performance results on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
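
The bounded-length spectrum idea in the abstract can be illustrated with a minimal sketch. It is not the paper's CGST-based method: it assumes a plain dictionary of counts in place of the suffix tree, and an unweighted inner product as the kernel. The function names `bounded_spectrum` and `spectrum_kernel` and the parameter `max_len` are hypothetical, chosen for illustration only.

```python
from collections import Counter


def bounded_spectrum(data: bytes, max_len: int) -> Counter:
    """Count every contiguous byte subsequence of length 1..max_len.

    Assumption: a naive Counter stands in for the paper's CGST.
    """
    counts = Counter()
    n = len(data)
    for start in range(n):
        # The sliding window covers at most max_len bytes from `start`.
        for length in range(1, min(max_len, n - start) + 1):
            counts[data[start:start + length]] += 1
    return counts


def spectrum_kernel(a: bytes, b: bytes, max_len: int) -> int:
    """Unweighted inner product of the two bounded-length feature maps."""
    fa = bounded_spectrum(a, max_len)
    fb = bounded_spectrum(b, max_len)
    # Iterate over the smaller map; Counter returns 0 for missing keys.
    if len(fa) > len(fb):
        fa, fb = fb, fa
    return sum(count * fb[sub] for sub, count in fa.items())
```

A fixed-length spectrum would count only subsequences of exactly `max_len` bytes. The bounded-length variant also counts all shorter ones, which is the distinction the abstract draws between the two kernels.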

