Browsing by Author "Getoor, Lise"
Item Collective Classification in Network Data (2008-02-13) Sen, Prithviraj; Namata, Galileo; Bilgic, Mustafa; Getoor, Lise; Gallagher, Brian; Eliassi-Rad, Tina
Numerous real-world applications produce networked data such as web data (hypertext documents connected via hyperlinks) and communication networks (people connected via communication links). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such data. In this report, we attempt to provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.

Item Entity Resolution In Graphs (2005-10-27) Bhattacharya, Indrajit; Getoor, Lise
The goal of entity resolution is to reconcile data references corresponding to the same real world entity. Here we introduce the problem of entity resolution in graphs, where the nodes are the references in the data and the hyper-edges represent the relations that are observed to hold between the references. The goal then is to reconstruct a `cleaned' entity graph that captures the relations among the true underlying entities from the reference graph. This is an important first step in any graph mining process; mining an unresolved graph will be inefficient and result in inaccurate conclusions. We also motivate collective entity resolution in graphs where references sharing hyper-edges are resolved jointly, as opposed to independent pair-wise resolution of the references. We illustrate the problem of graph-based entity resolution in bibliographic datasets. We discuss several interesting issues such as multiple entity types, local and global resolution and different kinds of graph-based evidence.
We formulate the graph-based entity resolution problem as an unsupervised clustering task, where each cluster represents references that map to the same entity, and the similarity measure between two clusters incorporates the similarity of the references' attributes and, more interestingly, the similarity between their relations. We explore two different measures of relational similarity. One approach, which we call `edge detail similarity', explicitly compares the individual edges that each cluster participates in, but is expensive to compute. A less computationally intensive alternative is measuring `neighborhood similarity', which only compares the multi-set of neighboring clusters for each cluster. We perform an extensive empirical evaluation of the two relational similarity measures for author resolution using co-author relations in two real bibliographic datasets. We show that both similarity measures improve performance over unsupervised algorithms that consider only reference attributes. We also describe an efficient implementation and show that these algorithms scale gracefully with increasing size of the data.

Item Features generated for computational splice-site prediction correspond to functional elements (Springer Nature, 2007-10-24) Dogan, Rezarta Islamaj; Getoor, Lise; Wilbur, W John; Mount, Stephen M
Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals.
We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.

Item How friendship links and group memberships affect the privacy of individuals in social networks (2008-07-01) Zheleva, Elena; Getoor, Lise
In order to address privacy concerns, many social media websites allow users to hide their personal profiles from the public. In this work, we show how an adversary can exploit a social network with a mixture of public and private user profiles to predict the private attributes of users. We map this problem to a relational classification problem and we propose a simple yet powerful model that uses group features and group memberships of users to perform multi-value classification. We compare its efficacy against several other classification approaches. Our results show that even in the case when there is an option for making profile attributes private, if links and group affiliations are known, users' privacy in social networks may be compromised. On a dataset from a well-known social-media website, we could easily recover the sensitive attributes for half of the private-profile users with a high accuracy when as much as half of the profiles are private. To the best of our knowledge, this is the first work that uses link-based and group-based classification to study privacy implications in social networks.
We conclude with a discussion of our findings and the broader applicability of our proposed model.

Item Indirect two-sided relative ranking: a robust similarity measure for gene expression data (2010-03-17) Licamele, Louis; Getoor, Lise
Background: There is a large amount of gene expression data that exists in the public domain. This data has been generated under a variety of experimental conditions. Unfortunately, these experimental variations have generally prevented researchers from accurately comparing and combining this wealth of data, which still hides many novel insights. Results: In this paper we present a new method, which we refer to as indirect two-sided relative ranking, for comparing gene expression profiles that is robust to variations in experimental conditions. This method extends the current best approach, which is based on comparing the correlations of the up and down regulated genes, by introducing a comparison based on the correlations in rankings across the entire database. Because our method is robust to experimental variations, it allows a greater variety of gene expression data to be combined, which, as we show, leads to richer scientific discoveries. Conclusions: We demonstrate the benefit of our proposed indirect method on several datasets. We first evaluate the ability of the indirect method to retrieve compounds with similar therapeutic effects across known experimental barriers, namely vehicle and batch effects, on two independent datasets (one private and one public). We show that our indirect method is able to significantly improve upon the previous state-of-the-art method with a substantial improvement in recall at rank 10 of 97.03% and 49.44%, on each dataset, respectively. Next, we demonstrate that our indirect method results in improved accuracy for classification in several additional datasets.
These datasets demonstrate the use of our indirect method for classifying cancer subtypes, predicting drug sensitivity/resistance, and classifying (related) cell types. Even in the absence of a known (i.e., labeled) experimental barrier, the improvement of the indirect method in each of these datasets is statistically significant.

Item A Latent Dirichlet Model for Unsupervised Entity Resolution (2005-08-19) Bhattacharya, Indrajit; Getoor, Lise
In this paper, we address the problem of entity resolution, where given many references to underlying objects, the task is to predict which references correspond to the same object. We propose a probabilistic model for collective entity resolution. Our approach differs from other recently proposed entity resolution approaches in that it is a) unsupervised, b) generative and c) introduces a hidden `group' variable to capture collections of entities which are commonly observed together. The entity resolution decisions are not considered on an independent pairwise basis, but instead decisions are made collectively. We focus on how the use of relational links among the references can be exploited. We show how we can use Gibbs Sampling to infer the collaboration groups and the entities jointly from the observed co-author relationships among entity references and how this improves entity resolution performance. We demonstrate the utility of our approach on two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which collaborative information is useful.

Item Link-based Classification (2007-02-19) Sen, Prithviraj; Getoor, Lise
Over the past few years, a number of approximate inference algorithms for networked data have been put forth. We empirically compare the performance of three of the popular algorithms: loopy belief propagation, mean field relaxation labeling and iterative classification.
We rate each algorithm in terms of its robustness to noise, both in attribute values and correlations across links. We also compare them across varying types of correlations across links.

Item Predicting Protein-Protein Interactions Using Relational Features (2007-01-07) Licamele, Louis; Getoor, Lise
Proteins play a fundamental role in every process within the cell. Understanding how proteins interact, and the functional units they are part of, is important to furthering our knowledge of the entire biological process. There has been a growing amount of work, both experimental and computational, on determining the protein-protein interaction network. Recently researchers have had success looking at this as a relational learning problem. In this work, we further this investigation, proposing several novel relational features for predicting protein-protein interaction. These features can be used in any classifier. Our approach allows large and complex networks to be analyzed and is an alternative to using more expensive relational methods. We show that we are able to get an accuracy of 81.7% when predicting new links from noisy high throughput data.

Item Social Capital in Friendship-Event Networks (2006-09-27) Licamele, Louis; Getoor, Lise
In this paper, we examine a particular form of social network which we call a friendship-event network. A friendship-event network captures both the friendship relationship among a set of actors, and also the organizer and participation relationships of actors in a series of events. Within these networks, we formulate the notion of social capital based on the actor-organizer friendship relationship and the notion of benefit, based on event participation. We investigate appropriate definitions for the social capital of both a single actor and a collection of actors.
We ground these definitions in a real-world example of academic collaboration networks, where the actors are researchers, the friendships are collaborations, the events are conferences, the organizers are program committee members and the participants are conference authors. We show that our definitions of capital and benefit capture interesting qualitative properties of event series. In addition, we show that social capital is a better publication predictor than publication history.

Item To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles (2008-10-30) Zheleva, Elena; Getoor, Lise
In order to address privacy concerns, many social media websites allow users to hide their personal profiles from the public. In this work, we show how an adversary can exploit an online social network with a mixture of public and private user profiles to predict the private attributes of users. We map this problem to a relational classification problem and we propose practical models that use friendship and group membership information (which is often not hidden) to infer sensitive attributes. The key novel idea is that in addition to friendship links, groups can be carriers of significant information. We show that on several well-known social media sites, we can easily and accurately recover the information of private-profile users. To the best of our knowledge, this is the first work that uses link-based and group-based classification to study privacy implications in social networks with mixed public and private user profiles.
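
Illustrative note on the last abstract above: the idea that group memberships can reveal hidden attributes may be easier to see with a toy example. The Python sketch below is not the authors' model. It assumes a simple majority vote over the public-profile members of each group a private user belongs to, and the function name, user names, group names and attribute values are all hypothetical.

```python
from collections import Counter, defaultdict


def predict_private_attributes(groups, public_attrs):
    """Guess the hidden attribute of each private-profile user.

    groups:       dict mapping user -> set of group ids (assumed visible for all users)
    public_attrs: dict mapping user -> attribute value (public profiles only)

    Assumption for illustration: a plain majority vote over the public
    members of a user's groups, not the model from the abstract.
    """
    # Tally the attribute values of public members, per group.
    group_votes = defaultdict(Counter)
    for user, value in public_attrs.items():
        for group in groups.get(user, ()):
            group_votes[group][value] += 1

    predictions = {}
    for user, user_groups in groups.items():
        if user in public_attrs:
            continue  # attribute already visible, nothing to infer
        tally = Counter()
        for group in user_groups:
            tally.update(group_votes.get(group, Counter()))
        if tally:  # skip users whose groups have no public members
            predictions[user] = tally.most_common(1)[0][0]
    return predictions


# Hypothetical toy data: three public profiles, three private ones.
groups = {
    "ann": {"chess", "hiking"},
    "bob": {"chess"},
    "cat": {"hiking", "opera"},
    "dan": {"chess", "hiking"},
    "eve": {"opera"},
    "fay": {"knitting"},
}
public_attrs = {"ann": "A", "bob": "A", "cat": "B"}

predictions = predict_private_attributes(groups, public_attrs)
print(predictions)
```

In this toy data, "dan" shares groups mostly with users whose public value is "A", "eve" shares a group only with "cat", and "fay" belongs to a group with no public members, so no guess is made for that user.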