Diab, M.T..pdf (1.27 MB)

Word Sense Disambiguation (WSD) is the process of resolving the meaning of a word unambiguously in a given natural language context. Within the scope of this thesis, it is the process of marking text with explicit sense labels. What constitutes a sense is a subject of great debate. An appealing perspective aims to define senses in terms of their multilingual correspondences, an idea explored by several researchers, including Dyvik (1998), Ide (1999), Resnik & Yarowsky (1999), and Chugur, Gonzalo & Verdejo (2002), but to date it has not been given any practical demonstration. This thesis is an empirical validation of these ideas of characterizing word meaning using cross-linguistic correspondences. The idea is that word meaning or word sense is quantifiable to the extent that it is uniquely translated in some language or set of languages. Consequently, we address the problem of WSD from a multilingual perspective; we expand the notion of context to encompass multilingual evidence. We devise a new approach to resolving word sense ambiguity in natural language, using a source of information that has not previously been exploited on a large scale for WSD. The core of the work builds on exploiting word correspondences across languages for sense distinction. In essence, it is a practical and functional implementation of a basic idea common to research on defining word meanings in cross-linguistic terms. We devise an algorithm, SALAAM (Sense Assignment Leveraging Alignment And Multilinguality), that empirically investigates the feasibility and validity of utilizing translations for WSD. SALAAM is an unsupervised approach for word sense tagging of large amounts of text given a parallel corpus (texts in translation) and a sense inventory for one of the languages in the corpus. Using SALAAM, we obtain large amounts of sense-annotated data in both languages of the parallel corpus simultaneously.
The quality of the tagging is rigorously evaluated for both languages of the corpora. The automatically tagged data that SALAAM produces without supervision is further utilized to bootstrap a supervised learning WSD system, in essence combining supervised and unsupervised approaches in an intelligent way to alleviate the resource acquisition bottleneck for supervised methods. Essentially, SALAAM is extended as an unsupervised approach for WSD within a learning framework; for many of the words disambiguated, SALAAM coupled with the machine learning system rivals the performance of a canonical supervised WSD system that relies on human-tagged data for training. Recognizing the fundamental role of similarity in SALAAM, we investigate different dimensions of semantic similarity as it applies to verbs, since they are relatively more complex than nouns, which are the focus of the previous evaluations. We design a human judgment experiment to obtain human ratings of verbs’ semantic similarity. The obtained human ratings serve as a reference point for comparing different automated similarity measures that crucially rely on various sources of information. Finally, a cognitively salient model integrating human judgments in SALAAM is proposed as a means of improving its performance on sense disambiguation for verbs in particular and other word types in general.
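
As a rough illustration of the translation-based idea summarized above, the following Python sketch groups source-language words that share a translation and then picks, for each word, the sense most similar to the senses of the other words in its group. It is a toy under stated assumptions, not the SALAAM implementation: the sense inventory, the sense labels, the gloss-overlap similarity, and the function names (`group_by_translation`, `pick_sense`, `tag`) are all hypothetical, and the abstract does not specify which similarity measure or alignment procedure the thesis uses.

```python
from collections import defaultdict

# Hypothetical toy sense inventory (not from the thesis):
# word -> {sense label: set of gloss keywords}
SENSE_INVENTORY = {
    "bank": {
        "bank#finance": {"money", "institution", "deposit"},
        "bank#river": {"river", "slope", "land", "shore"},
    },
    "shore": {"shore#land": {"land", "sea", "river", "edge"}},
    "coast": {"coast#land": {"land", "sea", "edge"}},
}


def group_by_translation(aligned_pairs):
    """Collect source words that are aligned to the same target word."""
    groups = defaultdict(set)
    for source_word, target_word in aligned_pairs:
        groups[target_word].add(source_word)
    return groups


def overlap(gloss_a, gloss_b):
    """Assumed similarity measure: Jaccard overlap of gloss keywords."""
    union = gloss_a | gloss_b
    return len(gloss_a & gloss_b) / len(union) if union else 0.0


def pick_sense(word, group, inventory):
    """Return the sense of `word` most similar to the other group members.

    Returns None when the group offers no supporting evidence.
    """
    best_sense, best_score = None, 0.0
    for sense, gloss in sorted(inventory.get(word, {}).items()):
        score = 0.0
        for other in group:
            other_senses = inventory.get(other, {})
            if other == word or not other_senses:
                continue
            # Credit the closest sense of each other word in the group.
            score += max(overlap(gloss, g) for g in other_senses.values())
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def tag(aligned_pairs, inventory):
    """Map each (source word, target word) pair to a sense label or None."""
    tags = {}
    for target_word, group in group_by_translation(aligned_pairs).items():
        for word in group:
            tags[(word, target_word)] = pick_sense(word, group, inventory)
    return tags
```

For example, given the hypothetical English-French pairs `[("bank", "rive"), ("shore", "rive"), ("coast", "rive"), ("bank", "banque")]`, the occurrence of "bank" aligned with "rive" receives the label `bank#river`, because that sense overlaps with the senses of "shore" and "coast". The occurrence aligned with "banque" stays untagged, since no other word in this toy data shares that translation. Because each label is keyed to a word pair, it can in principle be carried over to the target-language word as well.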