Statistical data processing in clinical proteomics

S. Smit


Authors	S. Smit
Supervisors	A.K. Smilde, C.G. de Koster
Cosupervisors	H.C.J. Hoefsloot
Award date	22-09-2009
Number of pages	116
Organisations	Faculty of Science (FNWI) - Swammerdam Institute for Life Sciences (SILS)
Abstract	The subject of this thesis is the analysis of data in clinical proteomics studies aimed at the discovery of biomarkers. The data sets produced in proteomics studies are huge, characterized by a small number of samples in which many proteins and peptides are measured. The studies described in this thesis compare different patient groups (recovering vs. relapsing patients) or a group of patients with a group of healthy controls. The size of the data and the size of the differences between the groups call for special data analysis strategies. Chapter 2 is a review of data analysis strategies for the discovery of biomarkers in clinical proteomics. A wealth of classification and feature extraction methods exists and in this chapter the most commonly applied methods are discussed. Due to the complex nature of the data and the high dimensionality it is easy to find differences between groups. However, these differences are possibly just chance results. The goal is to develop classifiers and/or biomarkers that can be used to classify new samples. Therefore, methods to test the validity of the results are part of a good data analysis strategy. A modular framework that fits most of the strategies described in the literature is presented. In this framework feature selection, classification, biomarker discovery and statistical validation are regarded as separate modules in the analysis of proteomics data. A strategy can be built from a combination of these modules in many ways, to suit the data analysis problem at hand. While it is possible to choose from the feature selection, classification and biomarker discovery modules to form a good working classifier, the validation modules are an integral part of the strategy. Which methods are used to execute a specific module is a matter of choice which depends in part on the structure of the data and in part on the preferences and expertise of the data analyst. 
In Chapter 3 we present a strategy for the statistical validation of discrimination models in proteomics studies. It is illustrated on data from a proteomics study of Gaucher disease, a lysosomal storage disorder. Gaucher disease is chosen as a case study because it is known to cause dramatic changes in the blood of patients. Samples from patients and healthy controls are measured with mass spectrometry and compared with Principal Component Discriminant Analysis (PCDA). The strategy combines permutation tests with single and double cross validation. The permutation test is part of the strategy to rule out the possibility of a chance result, by testing the classification method on randomized data. From the permutation test a p-value is obtained by comparing the performance of the classifier on the real data to its performance on randomized data. In the single cross validation the best PCDA model is selected, based on its generalizability towards new samples. In some studies the reported sensitivity and specificity of a method are based on the single cross validation error. This error is biased, since the cross validation error is also the criterion that drives the model selection; model construction and model evaluation are interwoven. In a permutation test this bias is uncovered because the average cross validation error of many permutations will be very different from the expected 50% (for two classes of equal size). An unbiased prediction error is obtained by validating the entire model selection procedure, which in our strategy leads to double cross validation. The permutation test confirms that the double cross validation is an independent estimation of the performance. The double cross validated sensitivity in the Gaucher vs. control problem is 89% and the specificity is 90%.
Fabry disease is a lysosomal storage disorder for which currently no blood biomarker is available. In Chapter 4 we compare serum protein profiles of controls and Fabry patients, an approach that allowed classification of patients suffering from Gaucher disease in Chapter 3. Classification of Fabry patients and controls using PCDA results in high error rates, also after variable selection. With Support Vector Machines (SVM), the prediction error is lower. The permutation test shows that the classification result is significant, but the misclassification rate is still 16%. It might be argued that the procedure used for protein profiling is not sensitive enough to detect early manifestations of Fabry disease. However, concomitant with misclassification of Fabry patients as being normal, some control subjects are classified as Fabry patients. Strikingly, all three unaffected relatives of Fabry patients (R1, R2 and R3) that were tested were classified as patients, using either SVM or PCDA. This suggests that the discrimination may not be primarily based on the underlying disorder but rather on other characteristics shared by families. This illustrates the importance of using very closely matched control subjects in these types of studies.
In Chapter 2 we discussed many classification methods. One of the choices to be made in a proteomics study comparing two classes of patients is the choice of a classification method. In Chapter 5 we apply several classification methods to one clinical proteomics data set, the Gaucher disease data from Chapter 3. The strategy developed in Chapter 3 is now used as a protocol for choosing among different statistical classification methods and obtaining figures of merit of their performance. The methods considered are PCDA, Penalized Logistic Regression (PLR), LogitBoost (LB), Principal Discriminant Variates (PDV), Nearest Shrunken Centroids (NSC), and SVM. In the extended cross validation study PCDA, PLR and SVM performed equally well and PDV was almost as good. LB and NSC performed worse than the other four methods. Using a proper classification method, 82-90% of the subjects were correctly classified.
Chapter 6 introduces an approach tailored to classify paired data. The approach is demonstrated on a cervical cancer proteomics data set. Squamous cell carcinoma antigen (SCC-ag) concentration in serum correlates with the stage of disease, the effect of treatment, and the development of disease, but it has poor predictive value. This study was initiated to find additional cervical cancer markers. Samples were obtained from cervical cancer patients at the time of diagnosis (case samples) and again on average 6 to 12 months after treatment, when all patients appear to have recovered (control samples). Measuring the same patients after treatment as controls has an advantage over measuring a separate set of healthy individuals, since the biological variation in the data is reduced, increasing the chance of finding patterns related to disease rather than differences between individuals. The resulting data has a paired structure and a strategy for analysing paired data is proposed. This strategy is compared to an unpaired strategy in four patient groups: one group of patients that relapse some time after the control sample is taken and three groups of recovering patients. In the relapsing patient group the performance is the same for both methods, while in the three groups with recovering patients classification performance improves using the paired analysis approach.
In Chapter 7 we revisit the question of selecting a suitable classification method. The four patient groups from the cervical cancer study in Chapter 6 are considered together, with the objective to find differences between recovering and relapsing patients. SVM and PCDA, two methods that in the previous chapters proved to be good classifiers of clinical proteomics data, are unable to obtain a good classification in this problem. The reason for this is the position of the classes: they are not disjoint (they overlap). Because the within-class covariances are very different, Soft Independent Modelling of Class Analogy (SIMCA) is able to distinguish between the classes, using the residuals from the classes' PCA models. The difference between PCDA and SIMCA, two seemingly similar methods, can be seen in the metrics they use. Although they can be expressed in a similar fashion, different aspects of the data are stressed, resulting in very different performances. This example shows how choosing an appropriate classification method can improve classification performance.
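The validation strategy summarised for Chapter 3 (model selection in an inner cross validation, error estimation in an outer cross validation, and a permutation test around the whole procedure) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: a simple nearest-centroid classifier stands in for PCDA, the tuned meta-parameter is the number of selected features `k`, the data are synthetic, and the function names (`make_data`, `double_cv_error`, `permutation_p_value`) are hypothetical. Folds are not stratified, which is adequate only for balanced classes as used here.

```python
import random

def centroid(rows):
    # Column-wise mean of a list of equal-length feature vectors.
    return [sum(col) / len(col) for col in zip(*rows)]

def fit(X, y, k):
    # Stand-in classifier (NOT PCDA): nearest centroid on the k features
    # with the largest absolute difference between the class means.
    c0 = centroid([x for x, t in zip(X, y) if t == 0])
    c1 = centroid([x for x, t in zip(X, y) if t == 1])
    feats = sorted(range(len(c0)), key=lambda j: -abs(c0[j] - c1[j]))[:k]
    return feats, c0, c1

def predict(model, x):
    feats, c0, c1 = model
    d0 = sum((x[j] - c0[j]) ** 2 for j in feats)
    d1 = sum((x[j] - c1[j]) ** 2 for j in feats)
    return 0 if d0 <= d1 else 1

def folds(n, n_folds):
    # Interleaved, unstratified folds over sample indices 0..n-1.
    idx = list(range(n))
    return [idx[i::n_folds] for i in range(n_folds)]

def cv_error(X, y, k, n_folds):
    # Single cross validation error for one fixed value of k.
    errors = 0
    for test in folds(len(X), n_folds):
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train], k)
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(X)

def double_cv_error(X, y, ks=(1, 2, 5), outer=5, inner=4):
    # Outer loop estimates the error; inner loop selects k on training data only,
    # so the outer test samples never influence model selection.
    errors = 0
    for test in folds(len(X), outer):
        train = [i for i in range(len(X)) if i not in test]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        best_k = min(ks, key=lambda k: cv_error(Xtr, ytr, k, inner))
        model = fit(Xtr, ytr, best_k)
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(X)

def permutation_p_value(X, y, n_perm=50, seed=0):
    # Repeat the entire double cross validation on label-permuted data and
    # compare the observed error with the permutation errors.
    rng = random.Random(seed)
    observed = double_cv_error(X, y)
    perm_errors = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        perm_errors.append(double_cv_error(X, y_perm))
    p = (1 + sum(e <= observed for e in perm_errors)) / (n_perm + 1)
    return observed, perm_errors, p

def make_data(n_per_class=20, n_features=20, n_informative=3, shift=3.0, seed=1):
    # Synthetic two-class data: Gaussian noise, with the first few features
    # shifted in class 1. Purely illustrative, not proteomics data.
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += shift * label
            X.append(row)
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_data()
    observed, perm_errors, p = permutation_p_value(X, y)
    print("double CV error:", observed)
    print("mean permutation error:", sum(perm_errors) / len(perm_errors))
    print("permutation p-value:", p)
```

The point of the design is that `best_k` is chosen inside each outer training set, so the outer error is not the criterion that drove the selection. Reporting the inner `cv_error` of the selected `k` instead would reproduce the biased single cross validation estimate described above.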
Document type	PhD thesis
Note	Research conducted at: Universiteit van Amsterdam
Language	English
Downloads	Cover; Title pages; Contents; Chapter 1: Introduction; Chapter 2: Statistical data processing in clinical proteomics; Chapter 3: Assessing the statistical validity of proteomics based biomarkers; Chapter 4: Limited value of serum protein profiling for discrimination of patients suffering from Fabry disease; Chapter 6: Optimal use of paired proteomics data; Chapter 7: Enhancing classification performance: covariance matters; Outlook; Bibliography; Publications; Summary; Samenvatting; Dankwoord

UvA-DARE

Digital Academic Repository
