Authors: Anna Telaar, Dirk Repsilber, Gerd Nürnberg
Publish Date: 2012/01/19
Volume: 28, Issue: 1, Pages: 67-106
Abstract
RNA-sample pooling is sometimes inevitable, but should be avoided in classification tasks like biomarker studies. Our simulation framework investigates a two-class classification study based on gene expression profiles to point out how strongly the outcomes of single-sample designs differ from those of pooling designs. The results show how the effects of pooling depend on pool size, discriminating pattern, number of informative features and the statistical learning method used (support vector machines with linear and radial kernel, random forest (RF), linear discriminant analysis, powered partial least squares discriminant analysis (PPLS-DA) and partial least squares discriminant analysis (PLS-DA)). As a measure for the pooling effect, we consider prediction error (PE) and the coincidence of important feature sets for classification based on PLS-DA, PPLS-DA and RF. In general, PPLS-DA and PLS-DA show constant PE with increasing pool size and low PE for patterns for which the convex hull of one class is not a cover of the other class. The coincidence of important feature sets is larger for PLS-DA and PPLS-DA than it is for RF. RF shows the best results for patterns in which the convex hull of one class is a cover of the other class, but these depend strongly on the pool size. We complete the PE results with experimental data which we pool artificially. The PE of PPLS-DA and PLS-DA are again least influenced by pooling and are low. Additionally, we show under which assumption the PLS-DA loading weights, as a measure for importance of features regarding classification, are equal for the different designs.
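The artificial pooling mentioned above can be illustrated with a minimal sketch. It assumes that a pool is the arithmetic mean of the expression profiles of randomly chosen samples from the same class, and that leftover samples are dropped; the abstract does not specify either choice. The function name `pool_samples`, the toy data and all parameter values are hypothetical and not taken from the study.

```python
import numpy as np

def pool_samples(X, y, pool_size, rng):
    """Average `pool_size` same-class profiles into one pooled profile.

    Assumption: pooling is modelled as the arithmetic mean of the
    member profiles; samples that do not fill a complete pool are dropped.
    """
    pooled_X, pooled_y = [], []
    for label in np.unique(y):
        # Randomly order the samples of this class before forming pools.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_pools = len(idx) // pool_size
        for p in range(n_pools):
            members = idx[p * pool_size:(p + 1) * pool_size]
            pooled_X.append(X[members].mean(axis=0))
            pooled_y.append(label)
    return np.vstack(pooled_X), np.array(pooled_y)

# Toy two-class data set (hypothetical sizes): 12 samples per class,
# 50 features, the first 5 of which carry a mean shift in class 1.
rng = np.random.default_rng(0)
n_per_class, n_features, n_informative = 12, 50, 5
X = rng.normal(size=(2 * n_per_class, n_features))
y = np.repeat([0, 1], n_per_class)
X[y == 1, :n_informative] += 1.0

X_pool, y_pool = pool_samples(X, y, pool_size=3, rng=rng)
print(X_pool.shape)  # (8, 50): four pools per class
```

Under this assumption, pooling reduces the number of training profiles by the pool size while leaving the class means unchanged, which is one way to see why methods may react differently to increasing pool size.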
Keywords: