Browsing by Author "Bhattacharya, Indrajit"

Item Collective Entity Resolution In Relational Data (2006-12-11)
Bhattacharya, Indrajit; Getoor, Lise; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name, and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities. Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as "collective entity resolution". First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references, taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs Sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines in multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution.
I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query-time, while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.

Item Entity Resolution In Graphs (2005-10-27)
Bhattacharya, Indrajit; Getoor, Lise
The goal of entity resolution is to reconcile data references corresponding to the same real-world entity. Here we introduce the problem of entity resolution in graphs, where the nodes are the references in the data and the hyper-edges represent the relations that are observed to hold between the references. The goal then is to reconstruct a "cleaned" entity graph that captures the relations among the true underlying entities from the reference graph. This is an important first step in any graph mining process; mining an unresolved graph will be inefficient and result in inaccurate conclusions. We also motivate collective entity resolution in graphs where references sharing hyper-edges are resolved jointly, as opposed to independent pairwise resolution of the references. We illustrate the problem of graph-based entity resolution in bibliographic datasets. We discuss several interesting issues such as multiple entity types, local and global resolution and different kinds of graph-based evidence. We formulate the graph-based entity resolution problem as an unsupervised clustering task, where each cluster represents references that map to the same entity, and the similarity measure between two clusters incorporates the similarity of the references' attributes and, more interestingly, the similarity between their relations. We explore two different measures of relational similarity.
One approach, which we call "edge detail similarity", explicitly compares the individual edges that each cluster participates in, but is expensive to compute. A less computationally intensive alternative is measuring "neighborhood similarity", which only compares the multi-set of neighboring clusters for each cluster. We perform an extensive empirical evaluation of the two relational similarity measures for author resolution using co-author relations in two real bibliographic datasets. We show that both similarity measures improve performance over unsupervised algorithms that consider only reference attributes. We also describe an efficient implementation and show that these algorithms scale gracefully with increasing size of the data.

Item A Latent Dirichlet Model for Unsupervised Entity Resolution (2005-08-19)
Bhattacharya, Indrajit; Getoor, Lise
In this paper, we address the problem of entity resolution, where given many references to underlying objects, the task is to predict which references correspond to the same object. We propose a probabilistic model for collective entity resolution. Our approach differs from other recently proposed entity resolution approaches in that it (a) is unsupervised, (b) is generative and (c) introduces a hidden "group" variable to capture collections of entities which are commonly observed together. The entity resolution decisions are not considered on an independent pairwise basis, but instead decisions are made collectively. We focus on how the use of relational links among the references can be exploited. We show how we can use Gibbs Sampling to infer the collaboration groups and the entities jointly from the observed co-author relationships among entity references and how this improves entity resolution performance. We demonstrate the utility of our approach on two real-world bibliographic datasets.
In addition, we present preliminary results on characterizing conditions under which collaborative information is useful.

Item Similarity Searching in Peer-to-Peer Databases (2004-02-25)
Bhattacharya, Indrajit; Kashyap, Srinivas R.; Parthasarathy, Srinivasan
We consider the problem of handling "similarity queries" in peer-to-peer databases. Given a query for a data object, we propose an indexing and searching mechanism which returns the set of objects in the database that are semantically related to the query. Our schemes can be implemented on a variety of structured overlays such as CAN, CHORD, Pastry, and Tapestry. We provide analytical and experimental evaluation of our schemes in terms of the search accuracy, search cost, and load balancing. Our analytical guarantees perfectly predict the experimentally observed trends for the search accuracy.