Automated Structural and Spatial Comprehension of Data Tables

Thumbnail Image


Publication or External Link





Data tables on the Web hold large quantities of information, but are difficult to search, browse, and merge using existing systems. This dissertation presents a collection of techniques for extracting, processing, and querying tables that contain geographic data, by harnessing the coherence of table structures for retrieval tasks. Data tables, including spreadsheets, HTML tables, and those found in rich document formats, are the standard way of communicating structured data for typical computer users. Notably, geographic tables (i.e., those containing names of locations) constitute a large fraction of publicly-available data tables and are ripe for exposure to Internet users who are increasingly comfortable interacting with geographic data using web-based maps. Of particular interest is the creation of a large repository of geographic data tables that would enable novel queries such as "find vacation itineraries geographically similar to mine" for use in trip planning or "find demographic datasets that cover regions X, Y, and Z" for sociological research.

In support of these goals, this dissertation identifies several methods for using the structure and context of data tables to improve the interpretation of the contents, even in the presence of ambiguity. First, a method for identifying functional components of data tables is presented, capitalizing on techniques for sequence labeling that are used in natural language processing. Next, a novel automated method for converting place references to physical latitude/longitude values, a process known as geotagging, is applied to tables with high accuracy. A classification procedure for identifying a specific class of geographic table, the travel itinerary, is also described, which borrows inspiration from optimization techniques for the traveling salesman problem (TSP). Finally, methods for querying spatially similar tables are introduced and several mechanisms for visualizing and interacting with the extracted geographic data are explored.