Combining Machine Learning and Natural Language Processing for Automatic Keyword Extraction

University dissertation from Kista : Institutionen för data- och systemvetenskap (tills m KTH)

Abstract: Automatic keyword extraction is the task of automatically selecting a small set of terms describing the content of a single document. That a keyword is extracted means that it is present verbatim in the document to which it is assigned. This dissertation discusses the development of an algorithm for automatic keyword extraction, and presents a number of experiments, in which the performance of the algorithm is incrementally improved.The approach taken is that of supervised machine learning, that is, prediction models are constructed from documents with known keywords. Before any learning can take place, the data must be pre-processed and represented. In the presented research, two problems concerning the representation for keyword extraction are tackled. Since a keyword may consist of more than one token, the first problem concerns where a keyword begins and ends in a running text, that is, how a candidate term is defined. In this dissertation, three term selection approaches are defined and evaluated. The first approach extracts all uni-, bi-, and trigrams, the second approach extracts all noun phrase chunks, while the third approach extracts all terms matching any of a number of empirically defined part-of-speech patterns.Since the majority of the extracted candidate terms are not keywords, the second problem concerns how these terms can be limited, to only keep those that are appropriate as keywords. In the presented research, four features for filtering the candidate terms are investigated. These are term frequency, inverse document frequency, relative position of the first occurrence, and the part-of-speech tag or tags assigned to the candidate term.The research presented in this dissertation is linguistically oriented in the sense that the output from natural language processing tools is a considerable factor both for the pre-processing of the data, as well as for the performance of the prediction models. Of the three term selection approaches, the best individual performance ? as measured by keywords previously assigned by professional indexers ? is achieved by the noun phrase chunk approach. The part-of-speech tag feature dramatically improves the performance of the models, independently of which term selection approach is applied. The highest performance is, however, achieved when the predictions of all three models are combined.

  This dissertation MIGHT be available in PDF-format. Check this page to see if it is available for download.