Mining Web Logs to Improve User Experience in Web Search
Abstract: The World Wide Web continues to grow in size and diversity, which makes it increasingly hard for users to find valuable information: documents are heterogeneous in form and content, little is known about their reliability and prestige, and there is a great deal of redundancy. Search engines usually look for documents that contain the specific keywords or phrases that users state as queries. Millions of pages may contain those keywords, and they may relate to a variety of different topics. Traditional retrieval strategies yield increasingly poor results because of a dramatic increase in ballast in the results, so search engine users increasingly experience information overload. With these difficulties in mind, there is a large ongoing research effort whose goal is to deliver appropriate information to users; this is what is meant by improving users' Web search experience. The aim of this thesis is to design, test and analyze different approaches to characterizing the search behavior of users and improving the search process in the context of Web search. There are three main aspects to focus on when tackling this problem through Web log mining: user recommendations to improve the search process, automatic detection of user information needs, and modeling of user information needs. To improve Web search through user recommendations, a suite of simple and efficient algorithms tailored to mixture models is presented. Tests are carried out on a broad range of data generated according to a spectrum of subclasses of mixture models, and on real data collected from the log of a Hungarian news portal and from the log of the Chilean Web search engine TodoCL; the resulting performance is shown to be of high quality.
Other application areas where mixture models are used also benefit from these results, including dating services, e-commerce, virtual collaborative communities, Internet service providers, and the analysis of gene expression data in bioinformatics. The main contribution in user behavior characterization is a complete study of all major learning approaches applied to the automatic detection of user intent in Web search. Three machine learning techniques for mining user intent are analyzed: fully supervised, semi-supervised and unsupervised. In this context, the semi-supervised approach shows significant improvements over the supervised approach, previously considered the best, for mining user intent and interests. The study is also of more general interest in exploring the true potential of these learning techniques in large-scale settings such as the Web, in terms of both scalability and accuracy. In the user intent modeling context, the contribution is a new categorization of users' intentions from the point of view of facets, with the aim of improving on previous classification schemes. Initially, a set of queries was manually labeled with the new faceted classification scheme to find relationships between the facets, to aid the manual labeling process, and to understand users' intentions. The distribution of the queries within the facets shows that the facets are relevant, since each produces a division of the query space that allows a better understanding of user needs.