The effects of indexing strategy-query term combination on retrieval effectiveness in a Swedish full text database

University dissertation from Göteborg University

Abstract: This thesis deals with Swedish full text retrieval and the problem of morphological variation of query terms in the document database. The study is an information retrieval experiment with a test collection. Since no Swedish test collection was available, such a collection was constructed. It consists of a document database containing 161,336 news articles, and 52 topics with four-graded (0, 1, 2, 3) relevance assessments.

The effects of indexing strategy-query term combination on retrieval effectiveness were studied. Three of the five tested combinations involved indexing strategies that used conflation, in the form of normalization. Further, two of these three combinations used indexing strategies that employed compound splitting. Normalization and compound splitting were performed by SWETWOL, a morphological analyzer for the Swedish language. A fourth combination attempted to group related terms by right hand truncation of query terms. A search expert performed the truncation. The four combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database.

Two situations were examined in the evaluation: the binary relevance situation and the multiple degree relevance situation. With regard to the binary relevance situation, where the three (positive) relevance degrees (1, 2, 3) were merged into one, and where precision was used as evaluation measure, the four alternative combinations outperformed the baseline. The best performing combination was the combination that used truncation. This combination performed better than or equal to a median precision value for 41 of the 52 topics.
One reason for the relatively good performance of the truncation combination was the capacity of its queries to retrieve different parts of speech.

In the multiple degree relevance situation, where the three (positive) relevance degrees were retained, retrieval effectiveness was taken to be the accumulated gain the user receives by examining the retrieval result up to given positions. The evaluation measure used was nDCG (normalized cumulated gain with discount). This measure credits retrieval methods that (1) rank highly relevant documents higher than less relevant ones, and (2) rank relevant (of any degree) documents high. With respect to (2), nDCG involves a discount component: the relevance score of a relevant (of any degree) document is discounted, and the discount grows the further down the ranked list of retrieved documents the document appears.

In the multiple degree relevance situation, the five combinations were evaluated under four different user scenarios, where each scenario simulated a certain user type. Again, the four alternative combinations outperformed the baseline, for each user scenario. The truncation combination had the best performance under each user scenario. This outcome agreed with the performance result in the binary relevance situation. However, there were also differences between the two relevance situations. For 25 percent of the topics and with regard to one of the four user scenarios, the set of best performing combinations in the binary relevance situation was disjoint from the set of best performing combinations in the multiple degree relevance situation.
The user scenario in question was such that almost all importance was placed on highly relevant documents, and the discount was sharp.

The main conclusion of the thesis is that normalization and right hand truncation (performed by a search expert) enhanced retrieval effectiveness in comparison to the baseline, irrespective of which of the two relevance situations we consider. Further, the three indexing strategy-query term combinations based on normalization were almost as good as the combination that involves truncation. This holds for both relevance situations.
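
The nDCG measure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula, so the sketch assumes the commonly used form in which the gain at rank i is divided by log_b(i) once i reaches the logarithm base b, and is left undiscounted before that. The function names, the gain mappings and the base values are hypothetical and are not taken from the thesis; in this reading, a user scenario would correspond to one choice of gain mapping and base.

```python
import math


def dcg(gains, base=2.0):
    """Cumulated gain with discount at each rank (assumed log-based discount)."""
    total = 0.0
    cumulated = []
    for rank, gain in enumerate(gains, start=1):
        # No discount while rank < base; afterwards divide by log_base(rank).
        discount = max(1.0, math.log(rank, base))
        total += gain / discount
        cumulated.append(total)
    return cumulated


def ndcg(ranked_grades, judged_grades, gain_map, base=2.0):
    """Normalize the DCG of a ranking by the DCG of an ideal ranking.

    ranked_grades: relevance grades (0..3) of the retrieved documents, in rank order.
    judged_grades: grades of all judged documents for the topic (any order).
    gain_map: hypothetical mapping from grade to gain value.
    """
    gains = [gain_map[g] for g in ranked_grades]
    ideal = sorted((gain_map[g] for g in judged_grades), reverse=True)
    # Pad or cut the ideal vector to the length of the evaluated ranking.
    ideal = (ideal + [0] * len(gains))[: len(gains)]
    actual_dcg = dcg(gains, base)
    ideal_dcg = dcg(ideal, base)
    return [a / b if b > 0 else 0.0 for a, b in zip(actual_dcg, ideal_dcg)]


# Two illustrative (assumed) scenarios: gains equal to the grades, and a
# scenario where nearly all weight falls on highly relevant documents.
LINEAR_GAINS = {0: 0, 1: 1, 2: 2, 3: 3}
SHARP_GAINS = {0: 0, 1: 1, 2: 10, 3: 100}

if __name__ == "__main__":
    ranking = [3, 2, 0, 1]
    judged = [3, 2, 1, 0, 0]
    print(ndcg(ranking, judged, LINEAR_GAINS, base=2.0))
    print(ndcg(ranking, judged, SHARP_GAINS, base=2.0))
```

A smaller base makes the discount set in earlier and grow faster, which is one way to model a user who examines only the top of the result list.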