Statistical methods for biomarker discovery in proteomics

University dissertation from Stockholm : Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Abstract: Surface-Enhanced Laser Desorption and Ionization (SELDI) is a promising proteomic technique for discovering biomarkers. However, the pre-processing of the raw data is still problematic. Integrating transcriptomic and proteomic data may enhance the search for biomarkers, but the current data integration approach results in the loss of large amounts of data. In this thesis, we made improvements to the peak detection step in SELDI by developing the Annotated Regions of Significance (ARS) method. It uses a multi-spectral signal detection method, 'Region of Significance' (RS), to identify regions with potential biomarkers. RS had better operating characteristics (OC) than existing methods in identifying peaks. Using lung cell line data, at 80% sensitivity, the False Discovery Rates (FDRs) of existing methods were around 25% to 50%, compared to around 8% for RS. ARS extracts a peak template from all spectra in the peak region via Principal Component Analysis (PCA) and fits the template to the spectra. A refinement was made to the estimation of the amplitude via a mixture model. Using patient samples from a clinical study, we showed that ARS detected more peaks and gave more accurate peak quantifications than the standard method. We implemented ARS as an R package, ProSpect, and also developed a graphical user interface, ProSpectGUI. Motivated by the performance of ARS in SELDI, we extended ARS to MALDI data with isotopic resolution. The extended ARS utilizes the isotopic pattern to filter out peaks which do not adhere to the expected isotopic pattern. Using the spike-in data, we validated the use of the log-transformed intensities for ARS in MALDI. Compared to the standard method, extended ARS generally had better specificity and was better in quantifying the peaks. At low FDR, extended ARS had higher sensitivity than the standard method.
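The template step described above (a PCA-derived peak template fitted to each spectrum) can be illustrated with a minimal sketch. This is not the ProSpect implementation: the function name `fit_peak_template`, the use of an SVD of the uncentred spectra matrix to obtain the first principal component, the least-squares amplitude fit, and the simulated Gaussian peak are all assumptions made for illustration, and the mixture-model refinement mentioned in the abstract is not reproduced.

```python
import numpy as np


def fit_peak_template(spectra):
    """Estimate a shared peak shape and one amplitude per spectrum.

    spectra: array of shape (n_spectra, n_points) holding intensities
    inside a single peak region (hypothetical input layout).
    """
    S = np.asarray(spectra, dtype=float)
    # The first right singular vector of the (uncentred) matrix is the
    # dominant shape shared by the spectra; assumed here as the template.
    _, _, vt = np.linalg.svd(S, full_matrices=False)
    template = vt[0]
    # Singular vectors have arbitrary sign; make the template positive.
    if template.sum() < 0:
        template = -template
    # Scale the template so that its maximum is 1.
    template = template / template.max()
    # Least-squares amplitude of the template in each spectrum.
    amplitudes = S @ template / (template @ template)
    return template, amplitudes


# Simulated peak region: one Gaussian shape with a different height per spectrum.
rng = np.random.default_rng(1)
grid = np.linspace(-3.0, 3.0, 61)
true_shape = np.exp(-0.5 * grid**2)
true_amplitudes = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
spectra = np.outer(true_amplitudes, true_shape) + 0.01 * rng.normal(size=(5, 61))

template, amplitudes = fit_peak_template(spectra)
```

In this sketch the amplitudes play the role of peak quantifications; with the low simulated noise they should lie close to `true_amplitudes`.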
We also contributed to the integration of proteomic and transcriptomic information from the same samples by investigating the use of Maximum Covariance Analysis (MCA). The estimates of the gene and protein pattern-pairs from MCA were consistent and biologically congruent, compared to Generalized Singular Value Decomposition (gSVD). Therefore MCA has the potential to enhance biomarker discovery and our understanding of the interplay between genes and proteins.
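As an illustration of the MCA idea mentioned above, the following minimal sketch computes pattern-pairs as the singular vectors of the cross-covariance matrix between column-centred gene and protein data measured on the same samples. The function name `mca`, the data layout (samples in rows), and the simulated data are assumptions for illustration and are not taken from the thesis.

```python
import numpy as np


def mca(X, Y, n_pairs=1):
    """Maximum Covariance Analysis via SVD of the cross-covariance matrix.

    X: (n_samples, n_genes), Y: (n_samples, n_proteins), with rows in the
    same sample order (hypothetical input layout).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have the same number of samples")
    # Centre each variable across samples.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Cross-covariance between genes (rows) and proteins (columns).
    C = Xc.T @ Yc / (X.shape[0] - 1)
    # Each singular vector pair maximises the covariance between the
    # projected gene scores and the projected protein scores.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :n_pairs], s[:n_pairs], Vt[:n_pairs].T


# Simulated data: one latent factor drives two genes and two proteins.
rng = np.random.default_rng(0)
n = 40
latent = rng.normal(size=n)
gene_loadings = np.array([1.0, 0.8, 0.0, 0.0, 0.0])
protein_loadings = np.array([0.0, 1.0, 0.0, 0.5])
genes = np.outer(latent, gene_loadings) + 0.1 * rng.normal(size=(n, 5))
proteins = np.outer(latent, protein_loadings) + 0.1 * rng.normal(size=(n, 4))

gene_patterns, covariances, protein_patterns = mca(genes, proteins)
```

The leading singular value equals the sample covariance between the gene scores and protein scores obtained by projecting the centred data onto the first pattern-pair.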
