Managing Applications and Data in Distributed Computing Infrastructures

University dissertation from Uppsala : Acta Universitatis Upsaliensis

Abstract: During the last decades the demand for large-scale computational and storage resources in science has increased dramatically. New computational infrastructures enable scientists to enter a new mode of science, e-science, which complements traditional theory and experiments. E-science is inherently interdisciplinary, involving researchers from several disciplines, and also opens up for large-scale collaborative efforts where physically distributed groups of scientists share software tools and data to make scientific progress. Within the field of e-science, new challenges are emerging in managing large-scale distributed computing efforts and distributed data sets. Different models, e.g. grids and clouds, have been introduced over the years, but new solutions built on these models are needed to enable easy and flexible use of distributed computing infrastructures by application scientists.In the first part of the thesis, application execution environments are studied. The goal is to hide technical details of the underlying distributed computing infrastructure and expose secure and user-friendly environments to the end users. First, a general-purpose solution using portal technology is described, enabling transparent and easy usage of a variety of grid systems. Then a problem-solving environment for genetic analysis is presented. Here the statistical software R is used as a workflow engine, enhanced with grid-enabled routines for performing the computationally demanding parts of the analysis. Finally, the issue of resource allocation in grid system is briefly studied and certain modifications in the distributed resource-brokering model for the ARC middleware are proposed.The second part of the thesis presents solutions for managing and analyzing scientific data using distributed storage resources. First, a new reliable and secure file-oriented distributed storage system, Chelonia, is presented. The architectural design of the system is described and implementation issues are considered. Also, the stability and scalable performance of Chelonia is verified using several test scenarios. Then, tools for providing an efficient and easy-to-use platform for data analysis built on Chelonia are presented. Here, a database driven approach is explored. An extended architecture where Chelonia is combined with the Web-Service MEDiator (WSMED) system is implemented, providing web service tools to query data without any further programming. This approach is then developed further and Chelonia is combined with SciSPARQL, a query language that extends SPARQL to queries over numeric scientific data. This results in a system that is capable of interactive analysis of distributed data sets. Writing customized modules in Java, Python or C can fulfill advanced application-specific analysis requirements. The viability of the approach is demonstrated by applying the system to data produced by URDME, a computational environment in systems biology and results for sample queries expressed in SciSPARQL are presented.Finally, the use of an open source storage cloud, Openstack – SWIFT, for analysis of data from CERN experiments is considered. Here, a pilot implementation for the ROOT data analysis framework is presented together with a performance evaluation.