Technical Reports from UMIACS

Permanent URI for this collection: http://hdl.handle.net/1903/7

Search Results

Now showing 1 - 10 of 31
  • Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage
    (2008-04-07) Song, Sangchul; JaJa, Joseph
    We address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web contents at regular time intervals and archive the contents after each crawl, regardless of whether the contents have changed between consecutive crawls. Our goal is to store newly crawled web contents only when they differ from those of the previous crawl, while ensuring accurate and quick retrieval of archived contents based on arbitrary temporal queries over the archived time period. In this paper, we develop a scheme that stores unique temporal web contents in containers following the widely used ARC/WARC format, and that provides quick access to the archived contents for arbitrary temporal queries. A novel component of our scheme is the use of a new indexing structure based on the concept of persistent or multi-version data structures. Our scheme can be shown to be asymptotically optimal in both storage utilization and insert/retrieval time. We illustrate the performance of our method on two very different data sets from the Stanford WebBase project, the first reflecting very dynamic web contents and the second relatively static web contents. The experimental results clearly illustrate the substantial storage savings achieved by eliminating duplicate contents detected between consecutive crawls, as well as the speed at which our method can find the archived contents specified through arbitrary temporal queries.
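The store-only-on-change rule and the temporal lookup described above can be sketched as follows. This is a minimal illustration, not the paper's API: the class and method names are invented here, and the per-URL index is a plain sorted timestamp list rather than the persistent multi-version structure the paper builds.

```python
import hashlib
from bisect import bisect_right

class TemporalArchive:
    """Sketch: a new version of a URL is stored only when its content
    hash differs from the previously stored version (illustrative names)."""

    def __init__(self):
        self._times = {}  # url -> sorted list of crawl timestamps stored
        self._blobs = {}  # url -> parallel list of (digest, content)

    def ingest(self, url, timestamp, content):
        """Store this crawl only if the content changed; return True if stored."""
        digest = hashlib.sha256(content).hexdigest()
        times = self._times.setdefault(url, [])
        blobs = self._blobs.setdefault(url, [])
        if blobs and blobs[-1][0] == digest:
            return False  # unchanged since the previous crawl: skip storage
        times.append(timestamp)
        blobs.append((digest, content))
        return True

    def query(self, url, timestamp):
        """Temporal query: the latest stored version with crawl time <= timestamp."""
        times = self._times.get(url, [])
        i = bisect_right(times, timestamp) - 1
        return None if i < 0 else self._blobs[url][i][1]
```

A query at any time between two stored crawls resolves, via binary search, to the version that was live at that instant.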
  • Web Archiving: Organizing Web Objects into Web Containers to Optimize Access
    (2007-10-09) Song, Sangchul; JaJa, Joseph
    The web is becoming the preferred medium for communicating and storing information pertaining to almost any human activity. However, it is an ephemeral medium whose contents are constantly changing, resulting in a permanent loss of part of our cultural and scientific heritage on a regular basis. Archiving important web contents is a very challenging technical problem due to the web's tremendous scale, complex structure, extremely dynamic nature, and rich, heterogeneous, and deep contents. In this paper, we consider the problem of archiving a linked set of web objects into web containers in such a way as to minimize the number of containers accessed during a typical browsing session. We develop a method that makes use of the notion of PageRank and optimized graph partitioning to enable faster browsing of archived web contents. We include simulation results that illustrate the performance of our scheme and compare it to the common scheme currently used to organize web objects into web containers.
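As a minimal sketch of the rank ingredient, here is plain power-iteration PageRank over an adjacency-list graph. The partitioning step and any container-size constraints from the paper are omitted; this only shows how per-object importance scores could be computed before partitioning.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [out-neighbors]}.
    Assumes every edge target is itself a key of the dict."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share...
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in graph.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share  # ...plus shares from in-links
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank
```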
  • Techniques to Audit and Certify the Long Term Integrity of Digital Archives
    (2007-08-14) Song, Sangchul; JaJa, Joseph
    A large portion of the government, business, cultural, and scientific digital data being created today needs to be archived and preserved for future use over periods ranging from a few years to decades and sometimes centuries. A fundamental requirement for a long term archive is to set up mechanisms that will ensure the authenticity of the holdings of the archive. In this paper, we develop a new methodology to address the integrity of long term archives using rigorous cryptographic techniques. Our approach involves the generation of a small-size integrity token for each digital object to be archived, and some cryptographic summary information based on all the objects handled within a dynamic time period. We present a framework that enables the continuous auditing of the holdings of the archive, as well as auditing upon access, depending on the policy set by the archive. Moreover, an independent auditor will be able to verify the integrity of every version of an archived digital object as well as link the current version to the original form of the object when it was ingested into the archive. Using this approach, a prototype system called ACE (Auditing Control Environment) has been built and tested. ACE is scalable and cost-effective, and is completely independent of the archive's underlying architecture.
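The token-and-summary idea can be illustrated with a simplified sketch: a SHA-256 hash as the per-object token, and a Merkle-style pairwise combination of all tokens handled in one time round as the summary. This is an assumption-laden stand-in for illustration; ACE's actual token format and witness-publication mechanism are not reproduced here.

```python
import hashlib

def token(data):
    # Per-object integrity token: a cryptographic digest of the object.
    return hashlib.sha256(data).digest()

def round_summary(tokens):
    """Combine all tokens from one time round into a single Merkle-style
    summary hash by repeated pairwise hashing (order-dependent)."""
    level = list(tokens)
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else b"")
            nxt.append(hashlib.sha256(pair).digest())
        level = nxt
    return level[0]

def audit(data, expected_token):
    # Auditing step: recompute the token and compare with the stored one.
    return token(data) == expected_token
```

Anchoring each round's summary outside the archive is what lets a third party verify objects without trusting the archive itself.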
  • ACE: A Novel Software Platform to Ensure the Integrity of Long Term Archives
    (2007-01-31) Song, Sangchul; JaJa, Joseph
    We develop a new methodology to address the integrity of long term archives using rigorous cryptographic techniques. A prototype system called ACE (Auditing Control Environment) was designed and developed based on this methodology. ACE creates a small-size integrity token for each digital object and some cryptographic summary information based on all the objects handled within a dynamic time period. ACE continuously audits the contents of the various objects according to the policy set by the archive, and provides mechanisms for an independent third-party auditor to certify the integrity of any object. In fact, our approach will allow an independent auditor to verify the integrity of every version of an archived digital object as well as link the current version to the original form of the object when it was ingested into the archive. We show that ACE is very cost effective and scalable while making no assumptions about the archive architecture. We include in this paper some preliminary results on the validation and performance of ACE on a large image collection.
  • A Novel Information-Aware Octree for the Visualization of Large Scale Time-Varying Data
    (2006-04-20T16:32:32Z) Kim, Jusub; JaJa, Joseph
    Large scale scientific simulations are increasingly generating very large data sets that present substantial challenges to current visualization systems. In this paper, we develop a new scalable and efficient scheme for the visual exploration of 4-D isosurfaces of time-varying data by rendering the 3-D isosurfaces obtained through an arbitrary axis-parallel hyperplane cut. The new scheme is based on: (i) a new 4-D hierarchical indexing structure, called Information-Aware Octree; (ii) a controllable delayed fetching technique; and (iii) an optimized data layout. Together, these techniques enable efficient and scalable out-of-core visualization of large scale time-varying data sets. We introduce an entropy-based dimension integration technique by which the relative resolutions of the spatial and temporal dimensions are established, and use this information to design a compact 4-D hierarchical indexing structure. We also present scalable and efficient techniques for out-of-core rendering. Compared with previous algorithms for constructing 4-D isosurfaces, our scheme is substantially faster and requires much less memory. Compared to the Temporal Branch-On-Need octree (T-BON), which can only handle a subset of our queries, our indexing structure is an order of magnitude smaller and is at least as effective in dealing with the queries that the T-BON can handle. We have tested our scheme on two large time-varying data sets and obtained very good performance for a wide range of isosurface extraction queries using an order of magnitude smaller indexing structures than previous techniques. In particular, we can generate isosurfaces at intermediate time steps very quickly.
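An "information-aware" subdivision needs a per-block information measure. A minimal sketch of such a measure is the Shannon entropy of a block's samples after quantization; the equal-width binning below is an illustrative choice, not necessarily the paper's, and the octree machinery itself is omitted.

```python
import math
from collections import Counter

def entropy(values, bins=16):
    """Shannon entropy (bits) of scalar samples after quantizing into
    `bins` equal-width bins. Low entropy = nearly constant block, a
    candidate for coarser representation; high entropy = keep detail."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0  # constant block carries no information
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```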
  • Efficient Isosurface Extraction for Large Scale Time-Varying Data Using the Persistent Hyperoctree (PHOT)
    (2006-01-13T20:31:09Z) Shi, Qingmin; JaJa, Joseph
    We introduce the Persistent HyperOcTree (PHOT) to handle the 4D isocontouring problem for large scale time-varying data sets. This novel data structure is provably space efficient and optimal in retrieving active cells. More importantly, the set of active cells for any possible isovalue is already organized in a Compact Hyperoctree, which enables very efficient slicing of the isocontour along spatial and temporal dimensions. Experimental results based on the very large Richtmyer-Meshkov instability data set demonstrate the effectiveness of our approach. This technique can also be used for other isosurfacing schemes such as view-dependent isosurfacing and ray-tracing, which will benefit from the inherent hierarchical structure associated with the active cells.
  • Optimal and Near-Optimal Algorithms for Generalized Intersection Reporting on Pointer Machines
    (2003-11-25) Shi, Qingmin; JaJa, Joseph
    We develop efficient algorithms for a number of generalized intersection reporting problems, including orthogonal and general segment intersection, 2-D range searching, rectangular point enclosure, and rectangle intersection search. Our results for orthogonal and general segment intersection, 3-sided 2-D range searching, and rectangular point enclosure problems match the lower bounds for their corresponding standard versions under the pointer machine model. Our results for the remaining problems improve upon the best known previous algorithms. (UMIACS-TR-2003-110)
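To fix the problem statement, here is a brute-force baseline for one of the listed problems, orthogonal segment intersection: report every crossing (horizontal, vertical) pair. This O(nm) sketch makes no attempt at the pointer-machine-optimal bounds that are the paper's contribution; the tuple encodings are an assumption made here.

```python
def orthogonal_intersections(horizontals, verticals):
    """Brute-force orthogonal segment intersection reporting.
    horizontals: list of (y, x1, x2); verticals: list of (x, y1, y2).
    Returns each crossing pair (h, v)."""
    out = []
    for h in horizontals:
        y, x1, x2 = h
        for v in verticals:
            x, y1, y2 = v
            # A horizontal and a vertical segment cross iff each
            # segment's fixed coordinate lies inside the other's range.
            if x1 <= x <= x2 and y1 <= y <= y2:
                out.append((h, v))
    return out
```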
  • Recovery of a Digital Image Collection Through the SDSC/UMD/NARA Prototype Persistent Archive
    (2003-11-25) Smorul, Mike; JaJa, Joseph; McCall, Fritz; Brown, Susan Fitch; Moore, Reagan; Marciano, Richard; Chen, Sheau-Yen; Lopez, Rick; Chadduck, Robert
    The San Diego Supercomputer Center (SDSC), the University of Maryland, and the National Archives and Records Administration (NARA) are collaborating on building a pilot persistent archive using and extending data grid and digital library technologies. The current prototype consists of node servers at SDSC, University of Maryland, and NARA, connected through the Storage Request Broker (SRB) data grid middleware, and currently holds several terabytes of NARA-selected collections. In particular, a historically important image collection that was on the verge of becoming inaccessible was fully restored and ingested into our pilot system. In this report, we describe the methodology behind our approach to fully restore this image collection and the process used to ingest it into the prototype persistent archive. (UMIACS-TR-2003-105)
  • Space-Efficient and Fast Algorithms for Multidimensional Dominance Reporting and Range Counting
    (2003-11-25) Shi, Qingmin; JaJa, Joseph; Mortensen, Christian
    We present linear-space sublogarithmic algorithms for handling the {\em three-dimensional dominance reporting problem} and the {\em two-dimensional range counting problem}. Under the RAM model as described in~[M.~L. Fredman and D.~E. Willard. ``Surpassing the information theoretic bound with fusion trees'', {\em Journal of Computer and System Sciences}, 47:424--436, 1993], our algorithms achieve $O(\log n/\log\log n+f)$ query time for 3-D dominance reporting, where $f$ is the number of points reported, and $O(\log n/\log\log n)$ query time for the 2-D range counting case. We extend these results to any constant dimension $d$ achieving $O(n(\log n/\log\log n)^{d-3})$-space and $O((\log n/\log\log n)^{d-2}+f)$-query time for the reporting case and $O(n(\log n/\log\log n)^{d-2})$-space and $O((\log n/\log\log n)^{d-1})$ query time for the counting case. (UMIACS-TR-2003-101)
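As a point of reference for the 2-D range counting problem, here is a simple correct baseline: sort by x, binary-search the x-range, then scan it for matching y's. It answers a query in O(log n + k) for the x-slice of size k, far from the paper's O(log n / log log n) bound; the class name is illustrative.

```python
from bisect import bisect_left, bisect_right

class RangeCounter2D:
    """Baseline 2-D range counting: count points inside the axis-aligned
    box [x1, x2] x [y1, y2]. Correct but not sublogarithmic."""

    def __init__(self, points):
        self.points = sorted(points)          # sorted by x, then y
        self.xs = [p[0] for p in self.points]  # x-keys for binary search

    def count(self, x1, x2, y1, y2):
        lo = bisect_left(self.xs, x1)   # first point with x >= x1
        hi = bisect_right(self.xs, x2)  # one past last point with x <= x2
        return sum(1 for _, y in self.points[lo:hi] if y1 <= y <= y2)
```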
  • An O(n)-Space O(log n/log log n + f)-Query Time Algorithm for 3-D Dominance Reporting
    (2003-08-01) Shi, Qingmin; JaJa, Joseph
    We present a linear-space algorithm for handling the {\em three-dimensional dominance reporting problem}: given a set $S$ of $n$ three-dimensional points, design a data structure for $S$ so that the points in $S$ which dominate a given query point can be reported quickly. Under the variation of the RAM model introduced by Fredman and Willard~\cite{Fredman94}, our algorithm achieves $O(\log n/\log\log n+f)$ query time, where $f$ is the number of points reported. Extensions to higher dimensions are also reported. (UMIACS-TR-2003-77)
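The dominance reporting problem itself is easy to state as a brute-force baseline: report every point of $S$ whose coordinates are all at least the query's. This O(n)-per-query sketch only pins down the problem; the paper's contribution is answering it in $O(\log n/\log\log n+f)$ time with linear space.

```python
def dominators(points, q):
    """Report all points p in `points` that dominate query q,
    i.e. p[i] >= q[i] in every coordinate (brute-force baseline)."""
    return [p for p in points if all(pi >= qi for pi, qi in zip(p, q))]
```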