University of Maryland DRUM  


Please use this identifier to cite or link to this item: http://hdl.handle.net/1903/4241

Title: Collective Entity Resolution In Relational Data
Authors: Bhattacharya, Indrajit
Advisors: Getoor, Lise
Department/Program: Computer Science
Type: Dissertation
Sponsors: Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Subjects: Computer Science
Keywords: entity resolution; data integration; clustering; relational; collective
Issue Date: 11-Dec-2006
Abstract: Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people, but different people can go by the same name, and different observed names may refer to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities. Traditional entity resolution approaches consider approximate matches between attributes of individual references, but attribute matching alone does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as 'collective entity resolution'. First, I propose a relational clustering algorithm that iteratively discovers entities by clustering references while taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model, based on Gibbs sampling, that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines on multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution and propose an adaptive algorithm that uses collective resolution to answer queries by recursively exploring and resolving related references. This enables resolution at query time while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.
URI: http://hdl.handle.net/1903/4241
Appears in Collections: UMD Theses and Dissertations; Computer Science Theses and Dissertations

Files in This Item:

File: umi-umd-4070.pdf
Size: 761.21 kB
Format: Adobe PDF
No. of Downloads: 2379

All items in DRUM are protected by copyright, with all rights reserved.

DRUM is brought to you by the University of Maryland Libraries
University of Maryland, College Park, MD 20742-7011 (301)314-1328.