Intelligent semi-structured information extraction : a user-driven approach to information extraction

University dissertation from Linköping : Linköpings universitet

Abstract: The number of domains and tasks where information extraction tools can be used needs to be increased. One way to reach this goal is to design user-driven information extraction systems where non-expert users are able to adapt them to new domains and tasks. It is difficult to design general extraction systems that do not require expert skills or a large amount of work from the user. Therefore, it is difficult to increase the number of domains and tasks. A possible alternative is to design user-driven systems, which solve that problem by letting a large number of non-expert users adapt the systems themselves. To accomplish this goal, the systems need to become more intelligent and able to learn to extract with as little given information as possible.

The type of information extraction system that is in focus for this thesis is semi-structured information extraction. The term semi-structured refers to documents that not only contain natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of not only the link structure but also the structural information within each such document, user-driven extraction systems with high performance can be built.

There are two different approaches presented in this thesis to solve the user-driven extraction problem. The first takes a machine learning approach and tries to solve the problem using a modified $Q(\lambda)$ reinforcement learning algorithm. A problem with the first approach was that it was difficult to handle extraction from the hidden Web. Since the hidden Web is about 500 times larger than the visible Web, it would be very useful to be able to extract information from that part of the Web as well. The second approach is called the hidden observation approach and tries to also solve the problem of extracting from the hidden Web. The goal is to have a user-driven information extraction system that is also able to handle the hidden Web. The second approach uses a large part of the system developed for the first approach, but the additional information that is silently obtained from the user presents other problems and possibilities.

An agent-oriented system was designed to evaluate the approaches presented in this thesis. A set of experiments was conducted and the results indicate that a user-driven information extraction system is possible and no longer just a concept. However, additional work and research is necessary before a fully-fledged user-driven system can be designed.
