Ligand-based Methods for Data Management and Modelling

University dissertation from Uppsala : Acta Universitatis Upsaliensis

Abstract: Drug discovery is a complicated and expensive process, with costs in the billion-dollar range. One way of making the drug development process more efficient is better information handling, modelling and visualisation. The majority of today's drugs are small molecules, which interact with drug targets to cause an effect. Since the 1980s, large numbers of compounds have been systematically tested by robots in so-called high-throughput screening. Ligand-based drug discovery is based on modelling drug molecules. In the field known as Quantitative Structure–Activity Relationship (QSAR), molecules are described by molecular descriptors, which are used for building mathematical models. Based on these models, molecular properties can be predicted, and the molecular descriptors can be used to compare molecules, e.g., for similarity. Bioclipse is a workbench for the life sciences which provides ligand-based tools through a point-and-click interface. The aims of this thesis were to research and develop new or improved ligand-based methods and open source software, and to work towards making these tools available to users through the Bioclipse workbench. To this end, a series of molecular signature studies was done and various Bioclipse plugins were developed.

An introduction to the field is provided in the thesis summary, which is followed by five research papers. Paper I describes the Bioclipse 2 software and the Bioclipse scripting language. Paper II describes Brunn, a laboratory information system for supporting work with dose-response studies on microtiter plates. Paper III presents the creation of a molecular fingerprint based on the molecular signature descriptor; the new fingerprints are evaluated for target prediction and found to perform on par with industry-standard commercial molecular fingerprints. Paper IV explores the effect of different parameter choices when using the signature fingerprint together with support vector machines (SVM) with the radial basis function (RBF) kernel, and identifies reasonable default values. Paper V studies the performance of SVM-based QSAR on large datasets with the molecular signature descriptor; a QSAR model based on 1.2 million substances is created and made available from the Bioclipse workbench.
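
As a rough illustration of two ideas mentioned in the abstract, comparing molecules through their descriptors and the RBF kernel used with SVMs, the following minimal Python sketch may help. It is not taken from the thesis or from Bioclipse. The feature strings are hypothetical placeholders rather than real signature descriptor output, the function names are invented for this example, and the Tanimoto coefficient is assumed here only because it is a common similarity measure for fingerprints.

```python
import math


def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    if not features_a and not features_b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(features_a & features_b)
    return shared / (len(features_a) + len(features_b) - shared)


def rbf_kernel(counts_x, counts_y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for sparse count vectors held in dicts."""
    keys = set(counts_x) | set(counts_y)
    squared_distance = sum(
        (counts_x.get(key, 0) - counts_y.get(key, 0)) ** 2 for key in keys
    )
    return math.exp(-gamma * squared_distance)


# Hypothetical substructure features for two molecules (illustrative only).
mol_a = {"C(C)(O)", "O(C)", "C(C)"}
mol_b = {"C(C)(O)", "C(C)", "N(C)"}
similarity = tanimoto(mol_a, mol_b)  # 2 shared features out of 4 distinct ones

# Hypothetical feature counts, as a descriptor might record occurrences.
counts_a = {"C(C)": 2, "O(C)": 1}
counts_b = {"C(C)": 1, "N(C)": 1}
kernel_value = rbf_kernel(counts_a, counts_b, gamma=0.1)

print(similarity, kernel_value)
```

In this sketch `gamma` controls how quickly the kernel value falls off with distance. It is one example of the kind of parameter whose choice affects an SVM with an RBF kernel, and the value 0.1 is arbitrary.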