Entity Resolution In Graphs
Files
Publication or External Link
Date
Authors
Advisor
Citation
DRUM DOI
Abstract
The goal of entity resolution is to reconcile data references
corresponding to the same real world entity. Here we introduce the
problem of entity resolution in graphs, where the nodes are the
references in the data and the hyper-edges represent the relations
that are observed to hold between the references. The goal then is to
reconstruct a cleaned' entity graph that captures the relations among the true underlying entities from the reference graph. This is an important first step in any graph mining process; mining an unresolved graph will be inefficient and result in inaccurate conclusions. We also motivate collective entity resolution in graphs where references sharing hyper-edges are resolved jointly, as opposed to independent pair-wise resolution of the references. We illustrate the problem of graph-based entity resolution in bibliographic datasets. We discuss several interesting issues such as multiple entity types, local and global resolution and different kinds of graph-based evidence. We formulate the graph-based entity resolution problem as an unsupervised clustering task, where each cluster represents references that map to the same entity, and the similarity measure between two clusters incorporates the similarity of the references attributes and, more interestingly, the similarity between their relations. We explore two different measures of relational similarity. One approach, which we call
edge detail similarity', explicitly compares the individual
edges that each cluster participates in, but is expensive to
compute. A less computationally intensive alternative is measuring
`neighborhood similarity', which only compares the multi-set of
neighboring clusters for each cluster. We perform an extensive
empirical evaluation of the two relational similarity measures for
author resolution using co-author relations in two real bibliographic
datasets. We show that both similarity measures improve performance
over unsupervised algorithms that consider only reference
attributes. We also describe an efficient implementation and show that
these algorithms scale gracefully with increasing size of the data.