Query Processing over Heterogeneous Federations of Graph Data

Abstract: Graph data offers a natural and intuitive way to represent complex relationships in various real-world phenomena, such as social networks, e-commerce platforms, and biological networks. The way we interact with information, technology and even society has been changed by graph data, especially after Google started in 2012 to develop the so-called Knowledge Graph. A federation of Knowledge Graphs allows users to perform queries that span across multiple Knowledge Graphs, enabling them to discover relationships and insights that would not be apparent within a single isolated graph and to understand complex knowledge by considering information from different domains or sources. However, retrieving information from such a federation also comes with challenges that must be addressed. Motivated by issues related to retrieving information from federations of Knowledge Graphs, in this thesis, we focus on Knowledge Graphs represented in the Resource Description Framework (RDF) and two forms of heterogeneity: the heterogeneity in terms of data access interfaces, and the heterogeneity of vocabulary used in the schema of RDF data sources. Our research deals with these complexities by designing query planning and optimization to bridge the gap between different graph data sources.

In this thesis, we first focus on federations that are heterogeneous in terms of data access interfaces. In particular, we establish a formal framework for defining and representing query plans over heterogeneous federations of graph data. We introduce a data model that captures the notion of a heterogeneous federation of RDF data sources. Based on this model, we define a language, called FedQPL, that can be used to describe logical query plans formally. More precisely, this language can be applied both to define query planning and optimization approaches in a more precise manner and to represent the logical plans in a query engine.
Thereafter, we provide an extensive set of rewriting rules together with a cost model for optimization. A comprehensive experimental evaluation shows that the query plan selected using our cost model requires less data to be transferred compared to the baseline approach.

Then, this thesis addresses the heterogeneity of vocabularies used in the schema of RDF data sources by extending FedQPL with vocabulary awareness. To this end, we first define what the expected result of a query in a vocabulary-aware setting is; then, we introduce two new query plan operators to translate solutions from a local to the global vocabulary and vice versa; and finally, we introduce an algorithm that produces correct, vocabulary-aware query plans. To identify the overhead of considering vocabulary mappings during query processing, we evaluate our approach in federations with different vocabulary mapping scenarios. Our experiments show that there is no overhead in planning time when considering vocabulary mappings; however, it takes slightly longer to execute the queries than in a baseline scenario with materialized mapped data. In addition, we also provide a set of rewriting rules specific to vocabulary-aware FedQPL expressions, which can be used as query rewriting rules for query optimization under various conditions. Experimental evaluations support the hypothesis that rewriting rules can significantly improve query processing performance while decreasing the amount of extra work introduced by considering vocabulary mappings.

Furthermore, we explore possibilities of integrating other types of graph data sources (specifically GraphQL) into the federation. To better understand the different implementation techniques of GraphQL, we design a GraphQL performance benchmark to thoroughly evaluate and compare the performance of approaches to creating GraphQL servers, as a preparation for future integration of Knowledge Graphs that can be accessed via GraphQL APIs into our federation.
