- A critical assessment of feature selection methods for biomarker discovery in clinical proteomics
- Molecular & Cellular Proteomics
- Volume 12, Issue 1
- Faculty of Science (FNWI)
Faculty of Medicine (AMC-UvA)
- Swammerdam Institute for Life Sciences (SILS)
In this paper, we compare the performance of six feature selection methods for LC-MS-based proteomics and metabolomics biomarker discovery: the t-test, the Mann-Whitney-Wilcoxon test (mww-test), Nearest Shrunken Centroid (NSC), linear Support Vector Machine - Recursive Feature Elimination (SVM-RFE), Principal Component Discriminant Analysis (PCDA) and Partial Least Squares Discriminant Analysis (PLSDA). The comparison uses human urine and porcine cerebrospinal fluid samples that were spiked with a range of peptides at different concentration levels. The ideal feature selection method should select the complete list of discriminating features that are related to the spiked peptides without selecting unrelated features. While many studies have to rely on classification error to judge the reliability of the selected biomarker candidates, we assessed the accuracy of selection directly against the list of spiked peptides. The feature selection methods were applied to data sets that differ in sample size and in the extent of sample class separation, which is determined by the concentration level of the spiked compounds. For each feature selection method and data set, the performance in selecting a set of features related to the spiked compounds was assessed using the harmonic mean of the recall and the precision (f-score) and the geometric mean of the recall and the true negative rate (g-score). We conclude that the univariate t-test and the mww-test with multiple testing correction are not applicable to data sets with small sample size (n=6), but that their performance improves markedly with increasing sample size, up to a point (n>12) where they outperform the other methods. PCDA and PLSDA select small feature sets with high precision but miss many true positive features related to the spiked peptides. NSC strikes a reasonable compromise between recall and precision for all data sets, independent of spiking level and number of samples. Linear SVM-RFE performs poorly at selecting features related to the spiked compounds, even though its classification error is relatively low.
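The two performance measures named in the abstract can be illustrated with a minimal Python sketch. It assumes that a selected feature counts as a true positive when it belongs to the set of features related to the spiked peptides. The function name `selection_scores` and the toy feature indices are hypothetical and do not come from the paper.

```python
from math import sqrt


def selection_scores(selected, spiked, all_features):
    """Return the f-score and g-score of a feature selection result.

    selected:     features chosen by the selection method
    spiked:       features known to be related to spiked peptides (ground truth)
    all_features: every feature in the data set
    All names and the set-based matching are illustrative assumptions.
    """
    selected, spiked, all_features = set(selected), set(spiked), set(all_features)

    tp = len(selected & spiked)                  # spiked features that were selected
    fp = len(selected - spiked)                  # unrelated features that were selected
    fn = len(spiked - selected)                  # spiked features that were missed
    tn = len(all_features - selected - spiked)   # unrelated features correctly left out

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    true_negative_rate = tn / (tn + fp) if (tn + fp) else 0.0

    # f-score: harmonic mean of recall and precision
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0

    # g-score: geometric mean of recall and true negative rate
    g_score = sqrt(recall * true_negative_rate)

    return {"f_score": f_score, "g_score": g_score}


# Toy example (made-up numbers): 100 features, 10 of them spiked,
# and a method that selects 8 spiked features plus 2 unrelated ones.
all_features = range(100)
spiked = range(10)
selected = list(range(8)) + [50, 51]
scores = selection_scores(selected, spiked, all_features)
```

In this toy case recall and precision are both 0.8, so the f-score is 0.8. The g-score is higher because the true negative rate is dominated by the many unrelated features that were correctly left unselected.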