A Uniform Query Processing Approach for Integrating Data from Heterogeneous Resources

University dissertation from Chalmers University of Technology

Abstract: Scientists who need to explore several different databases in their research can find it difficult and tedious to extract and combine information from various heterogeneous data sources manually. This is a particular problem for researchers in the life sciences, since technical advances in the last decade have resulted in a dramatic increase in the quantity and variety of data. Many databases of interest are developed independently by different research groups, and the database administrators often want to keep their databases autonomous so that they can develop and maintain them without being constrained by other database sources. Therefore, there is a need for software solutions to the problem of data integration that facilitate combining up-to-date data from autonomous, heterogeneous databases located at different sites. A system for data integration from heterogeneous (relational and RDF/S), autonomous and distributed data sources has been designed and implemented in this work. The main aim in the design and implementation of the system has been to make large parts of query and result processing independent of the kinds of data resources that are being used. The queries are held in a resource-independent form through large parts of the query processing. We refer to this as uniform query and result processing. The user poses queries, called global queries, against an integrated view of the underlying data resources. The integrated view does not reveal the structure of the underlying data sources. A global query is rewritten by using rules that describe the mapping from concepts in the integrated view to concepts in the data sources. The rewritten query is then split into sub-queries that each relate to one of the data sources. Wrappers translate the sub-queries into the query languages of the component databases, send them to the component databases and then retrieve the results.
Several small example federations have been implemented to test the system, one of which is a federation of biological databases. We have focused on incorporating data in relational databases and RDF Schema data, since these are widely used and are becoming increasingly popular for managing data collections. An outcome of this work is a functioning prototype system that applies a uniform query and result processing approach, and has a modular system design that is easy to use as a starting point for modifications and extensions.
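
The query processing steps described in the abstract (rewriting a global query with mapping rules, splitting it into one sub-query per data source, and translating each sub-query in a wrapper) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the mapping rules, the source names, the concept names and the use of SQL and SPARQL as target languages are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping rules: each concept in the integrated view maps to
# (data source name, concept in that source). All names are illustrative.
MAPPING_RULES = {
    "Protein": [("relational_db", "protein_table")],
    "Annotation": [("rdf_store", "ex:Annotation")],
}


def rewrite(global_query):
    """Replace each global concept with the source concepts it maps to."""
    rewritten = []
    for concept in global_query:
        rewritten.extend(MAPPING_RULES[concept])
    return rewritten


def split(rewritten):
    """Group the rewritten concepts into one sub-query per data source."""
    sub_queries = defaultdict(list)
    for source, local_concept in rewritten:
        sub_queries[source].append(local_concept)
    return dict(sub_queries)


def sql_wrapper(concepts):
    """Translate a sub-query into SQL strings (assumed target language)."""
    return [f"SELECT * FROM {c}" for c in concepts]


def sparql_wrapper(concepts):
    """Translate a sub-query into SPARQL strings (assumed target language)."""
    return [f"SELECT ?s WHERE {{ ?s a {c} }}" for c in concepts]


# Only the wrappers know which query language a source speaks.
WRAPPERS = {"relational_db": sql_wrapper, "rdf_store": sparql_wrapper}


def plan(global_query):
    """Run rewrite, split and translation; return source-specific queries."""
    sub_queries = split(rewrite(global_query))
    return {src: WRAPPERS[src](concepts) for src, concepts in sub_queries.items()}


if __name__ == "__main__":
    for source, queries in plan(["Protein", "Annotation"]).items():
        print(source, queries)
```

In this sketch, `rewrite` and `split` never inspect the kind of data source they are handling, which is a simplified picture of the resource-independent processing the abstract describes. A real system would also have to send the queries and combine the retrieved results, which is omitted here.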
