Modern GIR Systems: Framework, Retrieval Model and Indexing Techniques

Abstract: Geographic information is one of the most important and most common types of information in human society. It is estimated that more than 70% of all information in the world has some kind of geographic feature. In the era of information explosion, information retrieval (IR) tools, such as search engines, are the main tools people use to quickly find the information they need. Because of the importance of geographic information, recent efforts have either extended traditional IR to support spatial queries or built geographic information retrieval (GIR) systems on a brand-new architecture from the ground up, as in the SPIRIT project. To some degree, these existing GIR systems can meet users’ information search needs with a spatial filter, especially when users are looking for information about something within a relatively large extent.

Despite their advantages over conventional IR systems in processing geographical information and queries, modern GIR systems still face challenges, including a proper representation and extraction of geographical information within documents, a better information retrieval model for both thematic and geographical information, a fast indexing mechanism for rapid search within documents by thematic and geographical hints, and even a new system architecture.

The objective of this licentiate research is to provide solutions to some of these problems in order to build a better modern GIR system in the future.
The following aspects have been investigated in the thesis: a generic conceptual framework and related key technologies for a modern GIR system; a new information retrieval model and algorithm for measuring the relevance scores between documents and queries in GIR; and a new, improved indexing technique that indexes documents both geographically and thematically for faster query processing in modern GIR.

The proposed conceptual framework for modern GIR includes three modules: (1) the user interface module, (2) the information extractor, storage and indexer module and (3) the query processing and information retrieval module. Two knowledge bases, Gazetteer and Thesaurus, play an important role in the proposed framework. A digital-map-based user interface is proposed for the input of user information search needs and the representation of retrieval results. Key techniques required to implement a modern GIR system using the proposed framework are a proper representation of document and query information, a better geographical information extractor, an innovative information retrieval model and relevance ranking algorithm, and a combined indexing mechanism for both geographical and thematic information.

The new information retrieval model is based on a Spatial Bayesian Network consisting of the place names appearing in a single document and the spatial relationships between them. The new model assesses the geographical relevance between a GIR document and a query by the geographical importance and adjacency of the document geo-footprint with respect to the geographical scope of the user’s query.

Regarding the indexing mechanism for modern GIR systems, a Keyword-Spatial Hybrid Index (KSHI) is proposed for the single and overall geo-footprint model, in which each document has only one geo-footprint to retrieve from.
A Keyword-Spatial Dual Index (KSDI) is shown to be more appropriate for a GIR system that allows multiple geo-footprints within a single document.

In addition to theoretical analysis, experiments have been carried out to evaluate the efficiency of the proposed information retrieval model and indices. Both the theoretical analysis and the experimental results show the potential of the proposed solutions and techniques.
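
To make the idea of indexing documents both thematically and geographically more concrete, the following is a minimal sketch of a generic two-index arrangement: an inverted keyword index kept alongside a separate spatial index, with the two result sets intersected at query time. It is not the KSDI or KSHI design of the thesis. The class name `DualIndex`, the uniform grid, the cell size and the use of bounding boxes as geo-footprints are all assumptions made purely for illustration.

```python
from collections import defaultdict
from math import floor

# Illustrative assumptions only: a uniform grid with 1-degree cells and
# bounding boxes (min_x, min_y, max_x, max_y) standing in for geo-footprints.
CELL = 1.0


def cells(box):
    """Yield every grid cell (ix, iy) touched by a bounding box."""
    min_x, min_y, max_x, max_y = box
    for ix in range(floor(min_x / CELL), floor(max_x / CELL) + 1):
        for iy in range(floor(min_y / CELL), floor(max_y / CELL) + 1):
            yield ix, iy


def intersects(a, b):
    """Return True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class DualIndex:
    """Hypothetical pairing of a keyword index with a grid spatial index."""

    def __init__(self):
        self.keyword_index = defaultdict(set)  # term -> document ids
        self.spatial_index = defaultdict(set)  # grid cell -> footprint keys
        self.footprints = {}                   # (doc id, footprint no.) -> box

    def add(self, doc_id, text, boxes):
        """Index a document's terms and each of its geo-footprints."""
        for term in text.lower().split():
            self.keyword_index[term].add(doc_id)
        for n, box in enumerate(boxes):
            self.footprints[(doc_id, n)] = box
            for cell in cells(box):
                self.spatial_index[cell].add((doc_id, n))

    def search(self, terms, query_box):
        """Return ids of documents matching all terms and the query box."""
        term_sets = [self.keyword_index.get(t.lower(), set()) for t in terms]
        thematic = set.intersection(*term_sets) if term_sets else set()
        # Collect candidate footprints from the grid, then check them exactly.
        candidates = set()
        for cell in cells(query_box):
            candidates |= self.spatial_index.get(cell, set())
        spatial = {key[0] for key in candidates
                   if intersects(self.footprints[key], query_box)}
        return sorted(thematic & spatial)


index = DualIndex()
index.add("d1", "hotels near the lake", [(18.0, 59.0, 18.2, 59.4)])
index.add("d2", "hotels and museums",
          [(11.9, 57.6, 12.1, 57.8), (18.0, 59.3, 18.1, 59.4)])
index.add("d3", "museums guide", [(18.0, 59.3, 18.1, 59.4)])
print(index.search(["hotels"], (17.5, 59.0, 18.5, 59.5)))  # ['d1', 'd2']
```

In this sketch a document with several geo-footprints, such as `d2`, is registered once per footprint in the spatial index, so it can be retrieved through any of them. A real system would likely replace the grid with a more selective spatial structure and add relevance ranking, neither of which is attempted here.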