HIGH PERFORMANCE XPATH EVALUATION IN XML STREAMS
Chawathe, Sudarshan S
MetadataShow full item record
This thesis presents methods for efficiently evaluating structural queries over tree-structured data streams. A data stream usually consists of a sequence of items that arrive in an order determined by the source. An application that uses such data cannot revisit an earlier item in the stream unless it buffers the item itself. Naive buffering methods are not practical due to the high throughput and indefinite length of data streams. Compared with the flat, relational-like data model for data streams that has received recent attention, processing a tree-structured XML data stream poses additional challenges, since a data item cannot, in general, be interpreted without taking structural information into account. In this thesis, we focus on the evaluation of XPath queries on streaming XML. As a W3C standard, XPath has become a core XML technology not only as a standalone query language but also as the foundation of XQuery and XSLT. Features such as subqueries and reverse axes make XPath a powerful query language but they also complicate XPath query processing. We present our work on XSQ, a streaming XPath query engine. Our methods are based on a novel segment-based evaluation scheme. XSQ uses very little memory and is able to process unbounded and unsegmented streaming data because it does not build a DOM tree in memory. It also provides high throughput by only processing the relevant portions of the data and low response time by returning results as early as possible. XSQ is the first streaming system to support complex XPath features such as multiple predicates, closure axes, aggregations, reverse axes, and subqueries. We also describe our work on XPaSS, an XPath-based publish-subscribe system that simultaneously evaluates a large number of XPath queries over XML streams. Unlike other similar systems that filter pre-segmented documents as results, XPaSS returns only the precisely delineated data specified by a user query. It uses a segment-sharing scheme instead of prefix- and suffix-sharing that are commonly used. In our experiments, XPaSS supports up to one million XPath subscriptions using a modest PC-class server, with a throughput comparable to that of the simpler filtering systems.