Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 10 of 272
  • Item
    COMPUTING APPROXIMATE CUSTOMIZED RANKING
    (2009) Wu, Yao; Raschid, Louiqa; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    As the amount of information grows and as users become more sophisticated, ranking techniques become important building blocks for meeting user needs when answering queries. PageRank is one of the most successful link-based ranking methods; it iteratively computes importance scores for web pages based on the importance scores of the pages that link to them. Due to its success, PageRank has been applied in a number of applications that require customization. We address the scalability challenges for two types of customized ranking. The first challenge is to compute the ranking of a subgraph. Various Web applications, such as focused crawlers and localized search engines, focus on identifying a subgraph. The second challenge is to compute online personalized ranking. Personalized search improves the quality of search results for each user. The user's needs are represented by a personalized set of pages or by personalized link importance in an entity relationship graph. This requires an efficient online computation. To solve the subgraph ranking problem efficiently, we estimate the ranking scores for a subgraph. We propose a framework of an exact solution (IdealRank) and an approximate solution (ApproxRank) for computing ranking on a subgraph. Both IdealRank and ApproxRank represent the set of external pages with an external node $\Lambda$ and modify the PageRank-style transition matrix with respect to $\Lambda$. The IdealRank algorithm assumes that the scores of external pages are known. We prove that the IdealRank scores for pages in the subgraph converge to the true PageRank scores. Since the PageRank-style scores of external pages are typically not available, we propose the ApproxRank algorithm to estimate scores for the subgraph. We analyze the $L_1$ distance between the IdealRank and ApproxRank scores of the subgraph and show that it is within a constant factor of the $L_1$ distance of the external pages. We demonstrate with real and synthetic data that ApproxRank provides a good approximation to PageRank for a variety of subgraphs. We then consider online personalization using ObjectRank, an authority-flow-based ranking for entity relationship graphs. We formalize the concept of an aggregate surfer on a data graph; the surfer's behavior is controlled by multiple personalized rankings. We prove a linearity theorem over these rankings that can be used as a tool to scale this type of personalization. Our DataApprox solution uses a repository of rankings precomputed for a given set of link weight assignments. We define DataApprox as an optimization problem: it selects a subset of the precomputed rankings from the repository and produces a weighted combination of these rankings. We analyze the $L_1$ distance between the DataApprox scores and the true authority flow ranking scores and show that DataApprox admits a theoretical error bound. Our experiments on the DBLP data graph show that DataApprox performs well in practice and allows fast and accurate personalized authority flow ranking.
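    As a rough illustration of the subgraph-ranking idea (not the dissertation's exact IdealRank/ApproxRank formulation), the sketch below collapses all pages outside the subgraph into a single aggregate node standing in for $\Lambda$ and runs a standard PageRank-style power iteration on the augmented graph; the toy graph, damping factor, and convergence test are illustrative assumptions.

```python
# Illustrative sketch (not the dissertation's exact ApproxRank formulation):
# rank a subgraph by collapsing all external pages into one aggregate node
# and running a PageRank-style power iteration on the augmented graph.

import numpy as np

def subgraph_pagerank(edges, subgraph_nodes, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank restricted to `subgraph_nodes`.

    Edges leaving or entering the subgraph are rerouted through a single
    aggregate external node, a simplified stand-in for the Lambda node
    described in the abstract.
    """
    LAMBDA = "__external__"
    nodes = list(subgraph_nodes) + [LAMBDA]
    index = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)

    # Map every endpoint outside the subgraph onto the aggregate node.
    def remap(v):
        return v if v in subgraph_nodes else LAMBDA

    out_links = {i: [] for i in range(n)}
    for src, dst in edges:
        s, d = index[remap(src)], index[remap(dst)]
        if s != d:
            out_links[s].append(d)

    # Column-stochastic transition matrix (dangling nodes jump uniformly).
    M = np.zeros((n, n))
    for s, dests in out_links.items():
        if dests:
            for d in dests:
                M[d, s] += 1.0 / len(dests)
        else:
            M[:, s] = 1.0 / n

    # Standard power iteration with uniform teleportation.
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = damping * M @ r + (1 - damping) / n
        if np.abs(r_new - r).sum() < tol:   # L1 convergence test
            r = r_new
            break
        r = r_new
    return {node: r[index[node]] for node in subgraph_nodes}

# Toy example: a 4-page subgraph with links to and from two external pages.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"),
         ("d", "x"), ("x", "a"), ("y", "b")]
print(subgraph_pagerank(edges, {"a", "b", "c", "d"}))
```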
  • Item
    Lexical Features for Statistical Machine Translation
    (2009) Devlin, Jacob; Dorr, Bonnie; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In modern phrasal and hierarchical statistical machine translation systems, two major features model translation: rule translation probabilities and lexical smoothing scores. The rule translation probabilities are computed as maximum likelihood estimates (MLEs) of an entire source (or target) phrase translating to a target (or source) phrase. The lexical smoothing scores are also likelihood estimates of a source (target) phrase translating to a target (source) phrase, but they are computed using independent word-to-word translation probabilities. Intuitively, it would seem that the lexical smoothing score is a less powerful estimate of translation likelihood due to this independence assumption, but I present the somewhat surprising result that lexical smoothing is far more important to the quality of a state-of-the-art hierarchical SMT system than rule translation probabilities. I posit that this is due to a fundamental data sparsity problem: the average word-to-word translation is seen many more times than the average phrase-to-phrase translation, so the word-to-word translation probabilities (or lexical probabilities) are far better estimated. Motivated by this result, I present a number of novel methods for modifying the lexical probabilities to improve the quality of our MT output. First, I examine two methods of lexical probability biasing, where for each test document a set of secondary lexical probabilities is extracted and interpolated with the primary lexical probability distribution. Biasing each document with the probabilities extracted from its own first-pass decoding output provides a small but consistent gain of about 0.4 BLEU. Second, I contextualize the lexical probabilities by factoring in additional information such as the previous or next word. The key to the success of this context-dependent lexical smoothing is a backoff model, where our "trust" of a context-dependent probability estimate is directly proportional to how many times it was seen in the training data. In this way, I avoid the estimation problem seen in translation rules, where the amount of context is high but the probability estimation is inaccurate. When using the surrounding words as context, this feature provides a gain of about 0.6 BLEU on Arabic and Chinese. Finally, I describe several types of discriminatively trained lexical features, along with a new optimization procedure called Expected-BLEU optimization. This new optimization procedure is able to robustly estimate weights for thousands of decoding features, which can in effect discriminatively optimize a set of lexical probabilities to maximize BLEU. I also describe two other discriminative feature types, one of which is the part-of-speech analogue to lexical probabilities, and the other of which estimates training corpus weights based on lexical translations. The discriminative features produce a gain of 0.8 BLEU on Arabic and 0.4 BLEU on Chinese.
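    As a rough, hedged illustration of two ideas mentioned above, the sketch below shows a Model-1-style lexical smoothing score built from independent word-to-word probabilities, and a simple linear interpolation that biases a primary probability table with a document-specific secondary table; the tables, the interpolation weight, and the averaging form of the score are illustrative assumptions, not the exact features used in the dissertation.

```python
# Illustrative sketch of two ideas from the abstract, not the exact system:
# (1) a lexical smoothing score built from independent word-to-word
#     probabilities, and (2) biasing those probabilities by interpolating
#     a primary table with a document-specific secondary table.

def lexical_smoothing_score(src_words, tgt_words, p_w2w):
    """Model-1-style lexical weighting: for each target word, average the
    word-to-word probabilities over the source words, then multiply."""
    score = 1.0
    for t in tgt_words:
        avg = sum(p_w2w.get((s, t), 1e-9) for s in src_words) / len(src_words)
        score *= avg
    return score

def interpolate(primary, secondary, lam=0.9):
    """Linear interpolation of two word-to-word probability tables
    (the per-document 'biasing' step described in the abstract)."""
    keys = set(primary) | set(secondary)
    return {k: lam * primary.get(k, 0.0) + (1 - lam) * secondary.get(k, 0.0)
            for k in keys}

# Toy tables: P(target | source) entries for a two-word phrase pair.
primary = {("gato", "cat"): 0.7, ("gato", "feline"): 0.1,
           ("negro", "black"): 0.8}
secondary = {("gato", "cat"): 0.9, ("negro", "black"): 0.6}

biased = interpolate(primary, secondary, lam=0.9)
print(lexical_smoothing_score(["gato", "negro"], ["black", "cat"], biased))
```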
  • Item
    Sequential Search With Ordinal Ranks and Cardinal Values: An Infinite Discounted Secretary Problem
    (2009) Palley, Asa Benjamin; Cramton, Peter; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    We consider an extension of the classical secretary problem in which a decision maker observes only the relative ranks of a sequence of up to N applicants, whose true values are i.i.d. U[0,1] random variables. Applicants arrive according to a homogeneous Poisson process, and the decision maker seeks to maximize the expected time-discounted value of the applicant whom she ultimately selects. This provides a straightforward and natural objective while retaining the structure of limited information based on relative ranks. We derive the optimal policy for this sequential search problem and show that the solution converges as N goes to infinity. We compare these results with a closely related full-information problem in order to quantify these informational limitations.
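    A small Monte Carlo sketch of the setting described above, under stated assumptions: i.i.d. U[0,1] values, Poisson arrivals, exponential discounting, and a decision maker who only observes whether each applicant is the best seen so far. The warm-up stopping rule is purely illustrative and is not the optimal policy derived in the thesis.

```python
# Monte Carlo sketch of the setting described in the abstract: i.i.d. U[0,1]
# applicant values, Poisson arrivals, exponential discounting, and a decision
# maker who sees only relative ranks. The stopping rule below (accept the
# first applicant who is best-so-far after a fixed warm-up) is a simple
# illustrative policy, not the optimal policy derived in the thesis.

import random
import math

def simulate(n_applicants=50, arrival_rate=1.0, discount=0.05,
             warmup=10, trials=20000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        t = 0.0
        best_value = -1.0
        chosen = 0.0   # if no applicant is ever accepted, the payoff is zero
        for i in range(n_applicants):
            t += rng.expovariate(arrival_rate)      # Poisson inter-arrival time
            value = rng.random()                    # true value, never observed
            is_best_so_far = value > best_value     # all the ranks reveal
            best_value = max(best_value, value)
            if i >= warmup and is_best_so_far:
                chosen = value * math.exp(-discount * t)  # discounted payoff
                break
        total += chosen
    return total / trials

print("mean discounted value:", round(simulate(), 4))
```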
  • Item
    Combinatorial Problems in Online Advertising
    (2009) Malekian, Azarakhsh; Khuller, Samir; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Electronic commerce, or eCommerce, refers to the process of buying and selling goods and services over the Internet. In fact, the Internet has completely transformed traditional media-based advertising, so much so that billions of dollars of advertising revenue are now flowing to search companies such as Microsoft, Yahoo!, and Google. In addition, the new advertising landscape has opened up the advertising industry to all players, big and small. However, this transformation has led to a host of new problems faced by the search companies as they make decisions about how much to charge for advertisements, whose ads to display to users, and how to maximize their revenue. In this thesis we focus on an entire suite of problems motivated by the central question of "Which advertisement to display to which user?". Targeted advertising happens when a user enters a relevant search query. The ads are usually displayed on the sides of the search results page. Internet advertising also takes place by displaying ads on the side of webpages with relevant content. While large advertisers (e.g., Coca-Cola) pursue brand recognition through advertising, small advertisers are happy with the instant revenue that results from a user following their ad and performing a desired action (e.g., making a purchase). Therefore, small advertisers are often happy to get any ad slot related to their ad, while large advertisers prefer contracts that guarantee that their ads will be delivered to a sufficient number of desired users. We first focus on two problems that come up in the context of small advertisers. The first problem we consider deals with the allocation of ads to slots, taking into account the fact that users enter search queries over a period of time and, as a result, the slots become available gradually. We use a greedy method for allocation and show that the online ad allocation problem with a fixed distribution of queries over time can be modeled as maximizing a continuous non-decreasing submodular sequence function, for which we can guarantee a solution within a factor of at least (1 - 1/e) of the optimal. The second problem we consider is the query rewriting problem in the context of keyword advertising. This problem can be posed as a family of graph covering problems to maximize profit. We obtain constant-factor approximation algorithms for these covering problems under two sets of constraints and a realistic notion of ad benefit. We perform experiments on real data and show that our algorithms are capable of outperforming a competitive baseline algorithm in terms of the benefit due to rewrites. We next consider two problems related to premium customers, who need guaranteed delivery of a large number of ads for the purpose of brand recognition and would require signing a contract. In this context, we consider the allocation problem with the objective of maximizing either revenue or fairness. The problems considered in this thesis address just a few of the current challenges in eCommerce and Internet advertising. There are many interesting new problems arising in this field as the technology evolves and online connectivity through interactive media and the Internet becomes ubiquitous. We believe that this is one of the areas that will continue to receive greater attention from researchers in the near future.
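    The sketch below is a simplified, hedged stand-in for the first small-advertiser problem: ad slots arrive with queries over time and are allocated greedily, here by giving each slot to an eligible advertiser with the most remaining budget. The bid sets, budgets, and greedy rule are illustrative assumptions, not the submodular formulation analyzed in the thesis.

```python
# Illustrative greedy allocation in the spirit of the first problem described
# above: queries (and their ad slots) arrive over time, and each slot is
# greedily given to the eligible advertiser with the largest remaining budget.
# This is a simplified stand-in for the submodular formulation in the thesis.

def greedy_allocate(arriving_queries, bids, budgets):
    """bids[advertiser] = set of keywords the advertiser bids on.
    budgets[advertiser] = remaining budget (max number of ads to show).
    Returns a list of (query, advertiser) assignments."""
    remaining = dict(budgets)
    assignment = []
    for query in arriving_queries:
        eligible = [a for a, kws in bids.items()
                    if query in kws and remaining[a] > 0]
        if not eligible:
            continue
        # Greedy rule: prefer the advertiser with the most budget left,
        # which tends to keep many advertisers available for future queries.
        chosen = max(eligible, key=lambda a: remaining[a])
        remaining[chosen] -= 1
        assignment.append((query, chosen))
    return assignment

bids = {"A": {"shoes", "boots"}, "B": {"shoes"}, "C": {"boots", "socks"}}
budgets = {"A": 2, "B": 1, "C": 1}
queries = ["shoes", "shoes", "boots", "socks", "boots"]
print(greedy_allocate(queries, bids, budgets))
```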
  • Item
    Algorithmic issues in visual object recognition
    (2009) Hussein, Mohamed Elsayed Ahmed; Davis, Larry; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis is divided into two parts covering two aspects of research in the area of visual object recognition. Part I is about human detection in still images. Human detection is a challenging computer vision task due to the wide variability in human visual appearances and body poses. In this part, we present several enhancements to human detection algorithms. First, we present an extension to the integral images framework that allows constant-time computation of non-uniformly weighted summations over rectangular regions using a bundle of integral images. Such a computational element is commonly used in constructing gradient-based feature descriptors, which are the most successful in shape-based human detection. Second, we introduce deformable features as an alternative to the conventional static features used in classifiers based on boosted ensembles. Deformable features can enhance the accuracy of human detection by adapting to pose changes that can be described as translations of body features. Third, we present a comprehensive evaluation framework for cascade-based human detectors. The presented framework facilitates comparison between cascade-based detection algorithms, provides a confidence measure for the results, and defines a practical evaluation scenario. Part II explores the possibilities of enhancing the speed of core algorithms used in visual object recognition using the computing capabilities of Graphics Processing Units (GPUs). First, we present an implementation of Graph Cut on GPUs, which achieves up to a 4x speedup compared to a CPU implementation. The Graph Cut algorithm has many applications related to visual object recognition, such as segmentation and 3D point matching. Second, we present an efficient sparse approximation of kernel matrices for GPUs that can significantly speed up kernel-based learning algorithms, which are widely used in object detection and recognition. We present an implementation of the Affinity Propagation clustering algorithm based on this representation, which is about 6 times faster than another GPU implementation based on a conventional sparse matrix representation.
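    For readers unfamiliar with the computational primitive the first contribution extends, the sketch below shows the standard integral image and its constant-time rectangle sums; the bundle-of-integral-images extension for non-uniform weights is not reproduced here, and the example array is illustrative only.

```python
# Sketch of the standard integral-image primitive the abstract builds on:
# after one pass to build the cumulative table, the sum over any rectangle
# is computed in constant time from four corner lookups. The thesis extends
# this to non-uniformly weighted sums via a bundle of integral images; that
# extension is not reproduced here.

import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, padded with a zero border so
    rectangle queries need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using four lookups."""
    return (ii[bottom, right] - ii[top, right]
            - ii[bottom, left] + ii[top, left])

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 4) == img[1:3, 1:4].sum()
print(rect_sum(ii, 1, 1, 3, 4))
```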
  • Item
    Combining Static and Dynamic Typing in Ruby
    (2009) Furr, Michael; Foster, Jeffrey S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Many popular scripting languages, such as Ruby, Python, and Perl, are dynamically typed. Dynamic typing provides many advantages, such as terse, flexible code and the ability to use highly dynamic language constructs, such as an eval method that evaluates a string as program text. However, these dynamic features have traditionally obstructed static analyses, leaving the programmer without the benefits of static typing, including early error detection and the documentation provided by type annotations. In this dissertation, we present Diamondback Ruby (DRuby), a tool that blends static and dynamic typing for Ruby. DRuby provides a type language that is rich enough to precisely type Ruby code, without unneeded complexity. DRuby uses static type inference to automatically discover type errors in Ruby programs and provides a type annotation language that serves as verified documentation of a method's behavior. When necessary, these annotations can be checked dynamically using runtime contracts. This allows statically and dynamically checked code to safely coexist, and any runtime errors are properly blamed on dynamic code. To handle dynamic features such as eval, DRuby includes a novel dynamic analysis and transformation that gathers per-application profiles of dynamic feature usage via a program's test suite. Based on these profiles, DRuby transforms the program before applying its type inference algorithm, enforcing type safety for dynamic constructs. By leveraging a program's test suite, our technique gives the programmer an easy-to-understand trade-off: the more dynamic features covered by their tests, the more static checking is achieved. We evaluated DRuby on a benchmark suite of sample Ruby programs. We found that our profile-guided analysis and type inference algorithms worked well, discovering several previously unknown type errors. Furthermore, our results give us insight into the kind of Ruby code that programmers "want" to write but that is not easily amenable to traditional static typing. This dissertation shows that it is possible to effectively integrate static typing into Ruby without losing the feel of a dynamic language.
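    DRuby's annotations and contracts are specific to Ruby; purely for illustration, the Python sketch below shows the general idea of a declared signature that is checked dynamically at runtime, so mistyped calls from unchecked code are caught and reported. The decorator and its behavior are assumptions made for this sketch, not DRuby's actual interface.

```python
# DRuby itself targets Ruby; this Python sketch only illustrates the general
# idea the abstract describes: a declared method signature that is checked
# dynamically at runtime, so unchecked call sites that pass values of the
# wrong type are caught (and blamed) when they execute.

import functools

def contract(*arg_types, returns=None):
    """Wrap a function so its arguments and return value are checked
    against a declared signature at call time."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            for i, (val, ty) in enumerate(zip(args, arg_types)):
                if not isinstance(val, ty):
                    raise TypeError(
                        f"{fn.__name__}: argument {i} expected "
                        f"{ty.__name__}, got {type(val).__name__}")
            result = fn(*args)
            if returns is not None and not isinstance(result, returns):
                raise TypeError(
                    f"{fn.__name__}: expected return type {returns.__name__}")
            return result
        return wrapper
    return decorate

@contract(str, int, returns=str)
def repeat(text, times):
    return text * times

print(repeat("ab", 3))     # ok
try:
    repeat("ab", "3")      # wrong type, caught by the runtime contract
except TypeError as e:
    print("contract violation:", e)
```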
  • Item
    The Multivariate Variance Gamma Process and Its Applications in Multi-asset Option Pricing
    (2009) Wang, Jun; Madan, Dilip B; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Dependence modeling plays a critical role in pricing and hedging multi-asset derivatives and in managing the risks of a portfolio of assets. With the emergence of structured products, there has been considerable interest in using multivariate Levy processes to model the joint dynamics of multiple financial assets. The traditional multidimensional extension assumes a common time change for every marginal process, which implies a limited dependence structure and similar kurtosis for each marginal. In this thesis, we introduce a new multivariate variance gamma process which allows arbitrary marginal variance gamma (VG) processes with a flexible dependence structure. Compared with other multivariate Levy processes recently proposed in the literature, this model has several advantages when applied to financial modeling. First, the multivariate process built from any marginal VG processes is easy to simulate and estimate. Second, it has a closed-form joint characteristic function, which greatly simplifies the computation involved in pricing multi-asset options. Last, the construction can be applied to other time-changed Levy processes such as the normal inverse Gaussian (NIG) process. To test whether the multivariate variance gamma model fits the joint distribution of financial returns, we compare its performance in explaining portfolio returns with that of other popular models, and we also develop Fast Fourier Transform (FFT)-based methods for pricing multi-asset options such as exchange options, basket options, and cross-currency foreign exchange options.
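    As background, the sketch below simulates the standard one-dimensional variance gamma construction (Brownian motion with drift evaluated at a gamma time change), which the thesis's multivariate model generalizes; the parameter values are illustrative assumptions, and the multivariate construction itself is not reproduced.

```python
# Sketch of the standard one-dimensional variance gamma construction the
# thesis builds on: Brownian motion with drift theta and volatility sigma,
# evaluated at a gamma time change with variance rate nu. The multivariate
# construction with flexible dependence introduced in the thesis is not
# reproduced here; parameter values are illustrative only.

import numpy as np

def simulate_vg_path(theta=-0.1, sigma=0.2, nu=0.3,
                     T=1.0, n_steps=252, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # Gamma time-change increments with mean dt and variance nu * dt
    # (gamma shape dt/nu, scale nu).
    dG = rng.gamma(dt / nu, nu, size=n_steps)
    # Brownian motion with drift, evaluated on the gamma clock.
    dX = theta * dG + sigma * np.sqrt(dG) * rng.standard_normal(n_steps)
    return np.concatenate([[0.0], np.cumsum(dX)])

path = simulate_vg_path()
print("X(T) =", round(path[-1], 4))
```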
  • Item
    Cooperative Particle Swarm Optimization for Combinatorial Problems
    (2009) Lapizco Encinas, Grecia del Carmen; Reggia, James A; Kingsford, Carl L; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A particularly successful line of research in numerical optimization is the well-known computational paradigm of particle swarm optimization (PSO). In the PSO framework, candidate solutions are represented as particles that have a position and a velocity in a multidimensional search space. The direct representation of a candidate solution as a point that flies through hyperspace (i.e., Rn) seems to strongly predispose the PSO toward continuous optimization. However, while some attempts have been made towards developing PSO algorithms for combinatorial problems, these techniques usually encode candidate solutions as permutations instead of points in search space and rely on additional local search algorithms. In this dissertation, I present extensions to PSO that, by incorporating a cooperative strategy, allow the PSO to solve combinatorial problems. The central hypothesis is that by allowing a set of particles, rather than a single particle, to represent a candidate solution, combinatorial problems can be solved by collectively constructing solutions. The cooperative strategy partitions the problem into components, where each component is optimized by an individual particle. Particles move in continuous space and communicate through a feedback mechanism. This feedback mechanism guides them in assessing their individual contributions to the overall solution. Three new PSO-based algorithms are proposed. Shared-space CCPSO and multi-space CCPSO provide two new cooperative strategies to split the combinatorial problem, and both models are tested on proven NP-hard problems. Multimodal CCPSO extends these combinatorial PSO algorithms to efficiently sample the search space in problems with multiple global optima. Shared-space CCPSO was evaluated on an abductive problem-solving task: the construction of a parsimonious set of independent hypotheses in diagnostic problems with direct causal links between disorders and manifestations. Multi-space CCPSO was used to solve a protein structure prediction subproblem, side-chain packing. Both models are evaluated against provably optimal solutions, and the results show that both proposed PSO algorithms are able to find optimal or near-optimal solutions. The exploratory ability of multimodal CCPSO is assessed by evaluating both the quality and the diversity of the solutions obtained in a protein sequence design problem, a highly multimodal problem. These results provide evidence that the extended PSO algorithms are capable of dealing with combinatorial problems without having to hybridize the PSO with other local search techniques or sacrifice the concept of particles moving through a continuous search space.
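    For context, the sketch below shows the standard continuous PSO velocity and position update that the cooperative variants build on, applied to a toy minimization problem; the coefficients and the test function are illustrative assumptions, and this is a single-swarm baseline, not CCPSO itself.

```python
# Sketch of the standard continuous PSO update that the cooperative variants
# in the abstract build on: each particle keeps a position and a velocity and
# is pulled toward its personal best and the swarm's global best. This is a
# single-swarm minimizer for a toy function, not CCPSO itself.

import random

def pso_minimize(f, dim=2, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

sphere = lambda x: sum(v * v for v in x)   # toy objective with optimum at 0
print(pso_minimize(sphere))
```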
  • Item
    The Lattice Project: A Multi-model Grid Computing System
    (2009) Bazinet, Adam Lee; Cummings, Michael P; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis presents The Lattice Project, a system that combines multiple models of Grid computing. Grid computing is a paradigm for leveraging multiple distributed computational resources to solve fundamental scientific problems that require large amounts of computation. The system combines the traditional Service model of Grid computing with the Desktop model of Grid computing, and is thus capable of utilizing diverse resources such as institutional desktop computers, dedicated computing clusters, and machines volunteered by the general public to advance science. The production Grid system includes a fully-featured user interface, support for a large number of popular scientific applications, a robust Grid-level scheduler, and novel enhancements such as a Grid-wide file caching scheme. A substantial amount of scientific research has already been completed using The Lattice Project.
  • Item
    Using Internet Geometry to Improve End-to-End Communication Performance
    (2009) Lumezanu, Cristian; Spring, Neil; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The Internet has been designed as a best-effort communication medium between its users, providing connectivity but optimizing little else. It does not guarantee good paths between two users: packets may take longer or more congested routes than necessary, they may be delayed by slow reaction to failures, and there may even be no path between users. To obtain better paths, users can form routing overlay networks, which improve the performance of packet delivery by forwarding packets along links in self-constructed graphs. Routing overlays delegate the task of selecting paths to users, who can choose among a diversity of routes that are more reliable, less loaded, shorter, or have higher bandwidth than those chosen by the underlying infrastructure. Although they offer improved communication performance, existing routing overlay networks are neither scalable nor fair: the cost of measuring and computing path performance metrics between participants is high (which limits the number of participants), and they lack robustness to misbehavior and selfishness (which could discourage the participation of nodes that are more likely to offer than to receive service). In this dissertation, I focus on finding low-latency paths using routing overlay networks. I support the following thesis: it is possible to make end-to-end communication between Internet users simultaneously faster, scalable, and fair by relying solely on inherent properties of the Internet latency space. To prove this thesis, I take two complementary approaches. First, I perform an extensive measurement study in which I analyze, using real latency data sets, properties of the Internet latency space: the existence of triangle inequality violations (TIVs) (which expose detour paths: "indirect" one-hop paths that have lower round-trip latency than the "direct" default paths), the interaction between TIVs and network coordinate systems (which leads to scalable detour discovery), and the presence of mutual advantage (which makes fairness possible). Then, using the results of the measurement study, I design and build PeerWise, the first routing overlay network that reduces end-to-end latency between its participants and is both scalable and fair. I evaluate PeerWise using simulation and through a wide-area deployment on the PlanetLab testbed.
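    A small sketch of the triangle inequality violation (TIV) check described above: given a matrix of pairwise round-trip latencies, it reports every pair whose best one-hop detour beats the direct path. The latency values are a toy example, not the measurement data or the detour-discovery mechanism used in the dissertation.

```python
# Sketch of the triangle inequality violation (TIV) check described above:
# given pairwise round-trip latencies, a detour a -> c -> b is useful when
# it beats the direct path a -> b. The matrix here is a toy example, not
# the measurement data used in the dissertation.

def find_detours(latency):
    """latency[a][b] = round-trip time between nodes a and b (symmetric).
    Returns (a, b, via, saving) for every pair whose best one-hop detour
    is faster than the direct path."""
    nodes = list(latency)
    detours = []
    for a in nodes:
        for b in nodes:
            if a >= b:                      # consider each unordered pair once
                continue
            best_via, best_rtt = None, latency[a][b]
            for c in nodes:
                if c in (a, b):
                    continue
                via_rtt = latency[a][c] + latency[c][b]
                if via_rtt < best_rtt:
                    best_via, best_rtt = c, via_rtt
            if best_via is not None:
                detours.append((a, b, best_via, latency[a][b] - best_rtt))
    return detours

# Toy latency matrix (ms) with one TIV: A<->B is slow, but A->C->B is fast.
latency = {
    "A": {"A": 0, "B": 120, "C": 30},
    "B": {"A": 120, "B": 0, "C": 25},
    "C": {"A": 30, "B": 25, "C": 0},
}
print(find_detours(latency))
```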