Entity Resolution In Graphs

Loading...
Thumbnail Image

Files

4758.pdf (482.5 KB)
No. of downloads: 6394

Publication or External Link

Date

2005-10-27

Advisor

Citation

DRUM DOI

Abstract

The goal of entity resolution is to reconcile data references corresponding to the same real world entity. Here we introduce the problem of entity resolution in graphs, where the nodes are the references in the data and the hyper-edges represent the relations that are observed to hold between the references. The goal then is to reconstruct a cleaned' entity graph that captures the relations among the true underlying entities from the reference graph. This is an important first step in any graph mining process; mining an unresolved graph will be inefficient and result in inaccurate conclusions. We also motivate collective entity resolution in graphs where references sharing hyper-edges are resolved jointly, as opposed to independent pair-wise resolution of the references. We illustrate the problem of graph-based entity resolution in bibliographic datasets. We discuss several interesting issues such as multiple entity types, local and global resolution and different kinds of graph-based evidence. We formulate the graph-based entity resolution problem as an unsupervised clustering task, where each cluster represents references that map to the same entity, and the similarity measure between two clusters incorporates the similarity of the references attributes and, more interestingly, the similarity between their relations. We explore two different measures of relational similarity. One approach, which we call edge detail similarity', explicitly compares the individual edges that each cluster participates in, but is expensive to compute. A less computationally intensive alternative is measuring `neighborhood similarity', which only compares the multi-set of neighboring clusters for each cluster. We perform an extensive empirical evaluation of the two relational similarity measures for author resolution using co-author relations in two real bibliographic datasets. We show that both similarity measures improve performance over unsupervised algorithms that consider only reference attributes. We also describe an efficient implementation and show that these algorithms scale gracefully with increasing size of the data.

Notes

Rights