Inducing Semantic Frames from Lexical Resources

The fact that the same propositional content can be expressed in multiple ways is often referred to as the paraphrase problem. This phenomenon creates challenges for applications such as information retrieval, information extraction, text summarization, and machine translation: natural language understanding needs to recognize what remains constant across paraphrases, while natural language generation needs the ability to express content in various ways.

Frame semantics is a theory of language understanding that addresses the paraphrase problem by providing slot-and-filler templates to represent frequently occurring, structured experiences. This dissertation introduces SemFrame, a system that induces semantic frames automatically from lexical resources (WordNet and the Longman Dictionary of Contemporary English [LDOCE]). Prior to SemFrame, semantic frames had been developed only by hand.
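
To make the slot-and-filler idea concrete, the following is a minimal sketch of how a frame might be represented as a data structure. The `Frame` class, the `COMMERCE` frame, and its slot names are hypothetical illustrations and are not taken from SemFrame or FrameNet; the point is only that two paraphrases can map to the same filled template.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Frame:
    """A slot-and-filler template: a frame name, the verbs that evoke it, and its slots."""
    name: str
    evoking_verbs: Set[str] = field(default_factory=set)
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def fill(self, **fillers: str) -> "Frame":
        """Return a copy of this frame with the named slots filled."""
        unknown = set(fillers) - set(self.slots)
        if unknown:
            raise KeyError(f"unknown slots: {sorted(unknown)}")
        return Frame(self.name, set(self.evoking_verbs), {**self.slots, **fillers})


# Hypothetical frame, for illustration only.
commerce = Frame(
    name="COMMERCE",
    evoking_verbs={"buy", "sell"},
    slots={"buyer": None, "seller": None, "goods": None},
)

# "Kim bought a car from Lee." and "Lee sold a car to Kim." are paraphrases:
# both fill the same slots with the same fillers.
from_buy = commerce.fill(buyer="Kim", seller="Lee", goods="a car")
from_sell = commerce.fill(seller="Lee", goods="a car", buyer="Kim")
same_content = from_buy == from_sell
```

Under this representation, recognizing "what remains constant across paraphrases" reduces to comparing filled frames, while generation can choose any of the evoking verbs to express the same content.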

In SemFrame, frames are first identified by enumerating groups of verb senses that evoke a common frame. This is done by combining evidence about pairs of semantically related verbs, drawing on sources such as LDOCE's subject field codes, words used in LDOCE definitions and WordNet glosses, and WordNet's array of semantic relationships. The pairs are then gathered into larger groupings, each deemed to correspond to a semantic frame. Nouns associated with the verbs evoking a frame are then analyzed against WordNet's semantic network to identify nodes corresponding to frame slots.
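
The pair-then-group step can be illustrated with a small sketch. The abstract does not specify how evidence is weighted or how pairs are merged, so everything here is an assumption: each evidence source contributes one vote per verb-sense pair, pairs with enough votes are kept, and groups are formed as connected components using a union-find structure. The function names and the sense labels (such as `buy#1`) are hypothetical.

```python
from collections import defaultdict


def combine_evidence(evidence_by_source):
    """Count how many evidence sources propose each unordered verb-sense pair."""
    scores = defaultdict(int)
    for pairs in evidence_by_source.values():
        # Normalize pair order so (a, b) and (b, a) count once per source.
        for pair in {tuple(sorted(p)) for p in pairs}:
            scores[pair] += 1
    return dict(scores)


def group_pairs(pair_scores, threshold):
    """Merge pairs scoring >= threshold into groups (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for sense in parent:
        groups[find(sense)].add(sense)
    return sorted(sorted(group) for group in groups.values())


# Toy evidence, invented for illustration.
evidence = {
    "subject_codes": {("buy#1", "sell#1"), ("buy#1", "pay#1")},
    "definition_words": {("sell#1", "buy#1"), ("run#1", "jog#1")},
    "wordnet_relations": {("run#1", "jog#1"), ("pay#1", "sell#1")},
}
scores = combine_evidence(evidence)
strict_groups = group_pairs(scores, threshold=2)
loose_groups = group_pairs(scores, threshold=1)
```

In this toy setup, raising the threshold trades coverage for tighter groups: at `threshold=2` the sense `pay#1` is left out because no two sources agree on any pair containing it, while at `threshold=1` it joins the `buy#1`/`sell#1` group.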

SemFrame is evaluated in two ways: (1) Compared against the handcrafted FrameNet, SemFrame achieves its best recall-precision balance with 83.2% recall (based on SemFrame's coverage of FrameNet frames) and 73.8% precision (based on SemFrame verbs' semantic relatedness to other frame-evoking verbs). A WordNet-hierarchy-based lower bound achieves 52.8% recall and 46.6% precision. (2) A frame-semantic-enhanced version of Hearst's TextTiling algorithm, applied to detecting boundaries between consecutive documents, improves upon the non-enhanced TextTiling algorithm at statistically significant levels. (Previous enhancement of the text segmentation algorithm with thesaural relationships had degraded performance.)
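
The two measures in evaluation (1) can be sketched as follows. The abstract states only that recall reflects coverage of FrameNet frames and that precision reflects the relatedness of SemFrame verbs to other frame-evoking verbs; the matching criteria used below (a gold frame counts as covered if it shares at least `min_shared` verbs with some induced frame, and an induced verb counts as related if it co-occurs in a gold frame with another verb of its induced frame) are simplifying assumptions, and the frames and verbs are toy data.

```python
def frame_recall(gold_frames, induced_frames, min_shared=1):
    """Fraction of gold frames sharing >= min_shared verbs with some induced frame."""
    if not gold_frames:
        return 0.0
    covered = sum(
        1
        for gold in gold_frames.values()
        if any(len(gold & induced) >= min_shared for induced in induced_frames.values())
    )
    return covered / len(gold_frames)


def verb_precision(gold_frames, induced_frames):
    """Fraction of induced verbs that share a gold frame with a co-grouped verb."""
    total = related = 0
    for verbs in induced_frames.values():
        for verb in verbs:
            total += 1
            others = verbs - {verb}
            if any(verb in gold and gold & others for gold in gold_frames.values()):
                related += 1
    return related / total if total else 0.0


# Toy data, invented for illustration; not FrameNet or SemFrame output.
gold = {
    "Commerce": {"buy", "sell", "pay"},
    "Motion": {"run", "walk"},
    "Cooking": {"bake", "fry"},
}
induced = {
    "f1": {"buy", "sell", "lend"},
    "f2": {"run", "walk"},
}
recall = frame_recall(gold, induced)       # 2 of 3 gold frames are covered
precision = verb_precision(gold, induced)  # 4 of 5 induced verbs are related
```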