Identifying Graphs from Noisy Observational Data

dc.contributor.advisorGetoor, Liseen_US
dc.contributor.authorNamata Jr., Galile Mark Supapoen_US
dc.contributor.departmentComputer Scienceen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.date.accessioned2012-10-11T05:39:16Z
dc.date.available2012-10-11T05:39:16Z
dc.date.issued2012en_US
dc.description.abstractThere is a growing amount of data describing networks -- examples include social networks, communication networks, and biological networks. As the amount of available data increases, so does our interest in analyzing the properties and characteristics of these networks. However, in most cases the data is noisy, incomplete, and the result of passively acquired observational data; naively analyzing these networks without taking these errors into account can result in inaccurate and misleading conclusions. In my dissertation, I study the tasks of entity resolution, link prediction, and collective classification to address these deficiencies. I describe these tasks in detail and discuss my own work on each of these tasks. For entity resolution, I develop a method for resolving the identities of name mentions in email communications. For link prediction, I develop a method for inferring subordinate-manager relationships between individuals in an email communication network. For collective classification, I propose an adaptive active surveying method to address node labeling in a query-driven setting on network data. In many real-world settings, however, these deficiencies are not found in isolation and all need to be addressed to infer the desired complete and accurate network. Furthermore, because of the dependencies typically found in these tasks, the tasks are inherently inter-related and must be performed jointly. I define the general problem of graph identification which simultaneously performs these tasks; removing the noise and missing values in the observed input network and inferring the complete and accurate output network. I present a novel approach to graph identification using a collection of Coupled Collective Classifiers, C3, which, in addition to capturing the variety of features typically used for each task, can capture the intra- and inter-dependencies required to correctly infer nodes, edges, and labels in the output network. I discuss variants of C3 using different learning and inference paradigms and show the superior performance of C3, in terms of both prediction quality and runtime performance, over various previous approaches. I then conclude by presenting the Graph Alignment, Identification, and Analysis (GAIA) open-source software library which not only provides an implementation of C3 but also algorithms for various tasks in network data such as entity resolution, link prediction, collective classification, clustering, active learning, data generation, and analysis.en_US
dc.identifier.urihttp://hdl.handle.net/1903/13137
dc.subject.pqcontrolledComputer scienceen_US
dc.subject.pquncontrolledActive Learningen_US
dc.subject.pquncontrolledCollective Classificationen_US
dc.subject.pquncontrolledEntity Resolutionen_US
dc.subject.pquncontrolledGraph Identificationen_US
dc.subject.pquncontrolledGraphs and Networksen_US
dc.subject.pquncontrolledLink Predictionen_US
dc.titleIdentifying Graphs from Noisy Observational Dataen_US
dc.typeDissertationen_US

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
NamataJr_umd_0117E_13331.pdf
Size:
1.51 MB
Format:
Adobe Portable Document Format