Entity Resolution In Graphs

Bhattacharya, Indrajit; Getoor, Lise

Entity Resolution In Graphs

Files

4758.pdf (482.5 KB)

No. of downloads: 6479

Date

2005-10-27

Authors

Bhattacharya, Indrajit

Getoor, Lise

Abstract

The goal of entity resolution is to reconcile data references corresponding to the same real world entity. Here we introduce the problem of entity resolution in graphs, where the nodes are the references in the data and the hyper-edges represent the relations that are observed to hold between the references. The goal then is to reconstruct a cleaned' entity graph that captures the relations among the true underlying entities from the reference graph. This is an important first step in any graph mining process; mining an unresolved graph will be inefficient and result in inaccurate conclusions. We also motivate collective entity resolution in graphs where references sharing hyper-edges are resolved jointly, as opposed to independent pair-wise resolution of the references. We illustrate the problem of graph-based entity resolution in bibliographic datasets. We discuss several interesting issues such as multiple entity types, local and global resolution and different kinds of graph-based evidence. We formulate the graph-based entity resolution problem as an unsupervised clustering task, where each cluster represents references that map to the same entity, and the similarity measure between two clusters incorporates the similarity of the references attributes and, more interestingly, the similarity between their relations. We explore two different measures of relational similarity. One approach, which we call edge detail similarity', explicitly compares the individual edges that each cluster participates in, but is expensive to compute. A less computationally intensive alternative is measuring `neighborhood similarity', which only compares the multi-set of neighboring clusters for each cluster. We perform an extensive empirical evaluation of the two relational similarity measures for author resolution using co-author relations in two real bibliographic datasets. We show that both similarity measures improve performance over unsupervised algorithms that consider only reference attributes. We also describe an efficient implementation and show that these algorithms scale gracefully with increasing size of the data.

URI (handle)

http://hdl.handle.net/1903/4021

Collections

Technical Reports of the Computer Science Department

Full item page