Multilingual Acquisition of Large Scale Knowledge Resources
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that attempts to automatically process human language. Current NLP systems seem to have reached an upper bound with existing resources and techniques. There is a broad consensus in the research community that systems need to integrate larger amounts of semantic and world knowledge in order to improve the quality of current results. Nevertheless, building adequate semantic resources is a very difficult and still open research problem. Many efforts have been devoted to building knowledge repositories in the past decades, producing a wide range of knowledge bases that offer different levels of granularity or address different aspects of knowledge representation. Among them, Princeton WordNet (WN) is by far the most widely used semantic resource in the NLP area.
The main goal of the research presented in this thesis is to devise new methods and tools to automatically create new semantic relations between WordNet senses, that is, to accurately increase by automatic means the knowledge represented in WordNet. The proposed process uses the current content of WordNet as the minimal knowledge base required to start a cyclic acquisition approach. First, the acquisition stage collects from corpora the relevant terms associated with each WordNet sense. Second, the identification stage uses the knowledge present in WordNet to establish the appropriate sense of each of these terms, yielding large amounts of new semantic relations among WordNet senses. In particular, our research focuses on devising new methods and tools for:
1. Acquiring relevant words from general or domain corpora for a specific WordNet word sense.
2. Identifying the implicit word senses of the acquired relevant words with respect to an existing knowledge base (in particular, WordNet).
3. Empirically evaluating the quality of the resulting new semantic relations in a controlled multilingual evaluation framework.
Thus, our research goals cover the automatic acquisition, identification, integration, and evaluation of large amounts of semantic relations among WordNet senses captured from general or domain-specific corpora. The resulting knowledge net, or KnowNet (KN), should therefore be an extensible, large, accurate, and useful knowledge base derived automatically from text collections. Furthermore, because this knowledge is represented at a semantic level, we also expect that new semantic knowledge acquired from text in one language can be useful in other languages.
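As an illustration only, the acquisition and identification stages described above can be sketched on toy data. Everything in this sketch is an assumption made for the example: the miniature sense inventory `SENSES`, the candidate table `CANDIDATES`, the frequency-ratio score used to rank terms, and the word-overlap heuristic used to pick a sense. None of them is taken from the thesis, and real WordNet data would replace the hand-written dictionaries.

```python
from collections import Counter

# Hypothetical miniature sense inventory standing in for WordNet.
# Each sense identifier maps to a set of words that describe it.
SENSES = {
    "bank#1": {"money", "deposit", "loan", "financial", "institution"},
    "bank#2": {"river", "slope", "land", "water"},
    "loan#1": {"money", "lend", "interest", "bank"},
    "interest#1": {"money", "loan", "rate", "charge"},
    "interest#2": {"curiosity", "attention", "hobby"},
}

# Hypothetical lookup from a word form to its candidate senses.
CANDIDATES = {
    "loan": ["loan#1"],
    "interest": ["interest#1", "interest#2"],
}

# Toy corpora: texts assumed to be about "bank#1", plus background text.
SENSE_DOCS = [
    "the bank approved the loan",
    "the loan carries interest",
    "interest on the loan",
]
BACKGROUND_DOCS = [
    "the river bank",
    "the water near the bank",
    "the hobby",
]


def acquire_terms(sense_docs, background_docs, top_n=2):
    """Stage 1 (assumed scoring): rank words that are over-represented
    in the sense-specific texts relative to the background texts."""
    sense_counts = Counter(w for d in sense_docs for w in d.split())
    bg_counts = Counter(w for d in background_docs for w in d.split())
    scores = {w: c / (1 + bg_counts[w]) for w, c in sense_counts.items()}
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


def identify_sense(term, source_sense):
    """Stage 2 (assumed heuristic): choose the candidate sense whose
    descriptive words overlap most with those of the source sense."""
    candidates = CANDIDATES.get(term, [])
    if not candidates:
        return None  # term is not covered by the toy inventory
    context = SENSES[source_sense]
    return max(candidates, key=lambda s: len(SENSES[s] & context))


def build_relations(source_sense, sense_docs, background_docs):
    """Combine both stages into (source sense, target sense) pairs."""
    relations = []
    for term in acquire_terms(sense_docs, background_docs):
        target = identify_sense(term, source_sense)
        if target is not None:
            relations.append((source_sense, target))
    return relations


if __name__ == "__main__":
    print(build_relations("bank#1", SENSE_DOCS, BACKGROUND_DOCS))
    # prints [('bank#1', 'loan#1'), ('bank#1', 'interest#1')]
```

In this toy run, the ambiguous word "interest" is linked to `interest#1` rather than `interest#2` because only the former shares descriptive words with `bank#1`. Feeding the new relations back into the inventory and repeating the two stages would mirror the cyclic character of the process, under the same assumptions.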