Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage

dc.contributor.author: Song, Sangchul
dc.contributor.author: JaJa, Joseph
dc.date.accessioned: 2008-04-08T19:02:12Z
dc.date.available: 2008-04-08T19:02:12Z
dc.date.issued: 2008-04-07
dc.description.abstract: We address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web contents at regular time intervals and archive the contents after each crawl, regardless of whether the contents have changed between consecutive crawls. Our goal is to store newly crawled web contents only when they differ from the previous crawl, while ensuring accurate and quick retrieval of archived contents based on arbitrary temporal queries over the archived time period. In this paper, we develop a scheme that stores unique temporal web contents in containers following the widely used ARC/WARC format, and that provides quick access to the archived contents for arbitrary temporal queries. A novel component of our scheme is the use of a new indexing structure based on the concept of persistent or multi-version data structures. Our scheme can be shown to be asymptotically optimal both in storage utilization and insert/retrieval time. We illustrate the performance of our method on two very different data sets from the Stanford WebBase project, the first reflecting very dynamic web contents and the second relatively static web contents. The experimental results clearly illustrate the substantial storage savings achieved by eliminating duplicate contents detected between consecutive crawls, as well as the speed at which our method can find the archived contents specified through arbitrary temporal queries.
dc.format.extent: 152082 bytes
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/1903/7569
dc.language.iso: en_US
dc.relation.ispartofseries: UMIACS
dc.relation.ispartofseries: UMIACS-TR-2008-08
dc.title: Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage
dc.type: Technical Report
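
The abstract describes two ideas: store a crawled page only when it differs from the previously archived version, and answer queries of the form "what did this URL look like at time t". The Python sketch below illustrates those two ideas only. It is not the indexing structure of the report: it swaps in a plain per-URL sorted list searched with `bisect` in place of the persistent (multi-version) structure, keeps everything in memory rather than in ARC/WARC containers, and assumes crawls for a given URL arrive in increasing time order. The class and method names (`TemporalArchive`, `insert`, `query`) are hypothetical.

```python
import bisect
import hashlib


class TemporalArchive:
    """Toy archive: keep a new version of a URL only when its content changed."""

    def __init__(self):
        # Per-URL parallel lists: crawl times (ascending) and the stored bodies.
        self._times = {}
        self._bodies = {}
        # Digest of the most recently stored body for each URL.
        self._digests = {}

    def insert(self, url, crawl_time, content):
        """Record a crawl result; return True if a new version was stored.

        Assumption: crawl_time is increasing for each URL.
        """
        digest = hashlib.sha1(content).hexdigest()
        if self._digests.get(url) == digest:
            return False  # unchanged since the last stored version, skip it
        self._times.setdefault(url, []).append(crawl_time)
        self._bodies.setdefault(url, []).append(content)
        self._digests[url] = digest
        return True

    def query(self, url, when):
        """Return the body that was current at time `when`, or None."""
        times = self._times.get(url, [])
        # Index of the first stored version strictly later than `when`.
        i = bisect.bisect_right(times, when)
        if i == 0:
            return None  # nothing archived at or before `when`
        return self._bodies[url][i - 1]
```

For example, inserting the same bytes for a URL at times 1 and 2 stores a single version, and a later query for time 2 returns that version, because the page is treated as unchanged between the two crawls.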

Files

Original bundle
Name:
temporal-web-archiving-final-umiacs-tr-2008-08.pdf
Size:
148.52 KB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.81 KB
Format:
Item-specific license agreed upon to submission