Multifaceted Geotagging for Streaming News
Lieberman, Michael David
MetadataShow full item record
News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called <italic>geotagging</italic>, and involves first finding in the document all textual references to geographic locations, known as <italic>toponyms</italic>, and second, assigning the correct lat/long values to each toponym, steps which are termed <italic>toponym recognition</italic> and <italic>toponym resolution</italic>, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging. To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques. First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning–based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed <italic>comma groups</italic>, whose toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. Understanding the local places known by an audience, termed its <italic>local lexicon</italic>, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed <italic>adaptive context</italic>, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing and two newly-created, large corpora of streaming news, show great performance gains over several competing prominent geotagging methods.