Automated Structural and Spatial Comprehension of Data Tables

dc.contributor.advisor: Samet, Hanan (en_US)
dc.contributor.author: Adelfio, Marco David (en_US)
dc.contributor.department: Computer Science (en_US)
dc.contributor.publisher: Digital Repository at the University of Maryland (en_US)
dc.contributor.publisher: University of Maryland (College Park, Md.) (en_US)
dc.date.accessioned: 2015-06-25T05:32:53Z
dc.date.available: 2015-06-25T05:32:53Z
dc.date.issued: 2014 (en_US)
dc.description.abstract: Data tables on the Web hold large quantities of information, but are difficult to search, browse, and merge using existing systems. This dissertation presents a collection of techniques for extracting, processing, and querying tables that contain geographic data, by harnessing the coherence of table structures for retrieval tasks. Data tables, including spreadsheets, HTML tables, and those found in rich document formats, are the standard way of communicating structured data for typical computer users. Notably, geographic tables (i.e., those containing names of locations) constitute a large fraction of publicly-available data tables and are ripe for exposure to Internet users who are increasingly comfortable interacting with geographic data using web-based maps. Of particular interest is the creation of a large repository of geographic data tables that would enable novel queries such as "find vacation itineraries geographically similar to mine" for use in trip planning or "find demographic datasets that cover regions X, Y, and Z" for sociological research. In support of these goals, this dissertation identifies several methods for using the structure and context of data tables to improve the interpretation of the contents, even in the presence of ambiguity. First, a method for identifying functional components of data tables is presented, capitalizing on techniques for sequence labeling that are used in natural language processing. Next, a novel automated method for converting place references to physical latitude/longitude values, a process known as geotagging, is applied to tables with high accuracy. A classification procedure for identifying a specific class of geographic table, the travel itinerary, is also described, which borrows inspiration from optimization techniques for the traveling salesman problem (TSP). Finally, methods for querying spatially similar tables are introduced and several mechanisms for visualizing and interacting with the extracted geographic data are explored. (en_US)
dc.identifier: https://doi.org/10.13016/M21W4F
dc.identifier.uri: http://hdl.handle.net/1903/16410
dc.language.iso: en (en_US)
dc.subject.pqcontrolled: Computer science (en_US)
dc.subject.pquncontrolled: Geographic data tables (en_US)
dc.subject.pquncontrolled: Geotagging (en_US)
dc.subject.pquncontrolled: Information extraction (en_US)
dc.subject.pquncontrolled: Itinerary recognition (en_US)
dc.subject.pquncontrolled: Similarity search (en_US)
dc.subject.pquncontrolled: Spreadsheets (en_US)
dc.title: Automated Structural and Spatial Comprehension of Data Tables (en_US)
dc.type: Dissertation (en_US)

Files

Original bundle
Name: Adelfio_umd_0117E_15882.pdf
Size: 11.29 MB
Format: Adobe Portable Document Format