CS227 - Advanced Topics In Database Management
Spring 2004

Stream Data Management

Streaming for Dummies (final version)

Introduction

Data sources are proliferating.  Data is being produced by cell phones, PDA's, sensors and other small devices that are embedded in the environment..  Real-time data is also being produced by information utilities such as stock tickers or news feeds.  All of these data sources produce data and push it into the computing fabric.  Push-based data sources are unaware of how their data will be used.  As the data moves through the computing infrastructure, it will be acted on in intelligent ways.  This is backwards from the way standard computing models treat data (e.g., traditional databases, RPC).  The data sequences that are produced in this way are called streams.

 

Recently a new area has emerged that has been called Stream Data Management (SDM).  The goal of this work is to produce software technology similar to that of Database Management Systems (DBMS) for the stream world.  Users will register queries with the system.  These queries will execute forever producing results on the basis of the data that arrives in the streams.

The area of data management provides intelligent access to data using specialized techniques and application-level knowledge to organize the use of scarce resources (e.g., storage).  Modern database management systems have successfully shown the value of such an approach. This course will focus on issues related to the goal of providing similar capability to the world of stream-based data.  Our hypothesis is that, while many of the fundamental concepts are the same, the approach must necessarily be radically different.

This course will discuss component technologies that will contribute to this vision of stream data management.  We will examine relevant research papers that will give us insight into the basic issues and approaches that might be needed in this brave new world. We will pay special attention to system building issues by looking in some detail at several competing university systems.

Topics

While the field of stream data management is very new and speculative, there has been work on topics that could be relevant to our vision. Furthermore, several new research prototypes are emerging.   In this course, we will investigate these topics one at a time through readings and class discussions.  A schedule for the semester follows.

 

Date

Topic

Presenters

Feb 2

Introduction to Course

Aurora (Brown)

sbz

Feb 9

Stream (Stanford), Telegraph (Berkeley)

alex (alexr), mark (msh)

Feb 16

Query Languages for Streams

jenny (jenine), john (jwicks), derek (dss)

Feb 23

Long Weekend

     - No Class -

 

Mar 1

Query Processing for Streams I

john (jwicks),  phil (pgs), tori (tori)

March 8

Query Processing for Streams II

peter (pgs), alex (alexr), alex (alexz)

March 15

QoS, Scheduling, and Load Shedding

mark (msh), tori (tori)

March 22

Distributed Processing

derek (dss), peter (pgs)

March 29

Spring Recess

    – No Class

 

April 5

The Competition

phil (pgs), jenny (jenine)

April 12

Writing - outline

 

April 19

Writing - early section drafts

 

April 26

Writing - integration of sections

 

May 3

Writing - rewiting

 

May 10

Finishing touches

 

 

Mechanics

This is a seminar course that is based on current research papers.  There is no textbook.  The intent is to explore the brand new area of stream data management (SDM) by looking at current papers on this and related topics.  We will attempt to build a picture of what SDM might look like in the future given these contributing results.

 

Assuming that the class is a manageable size, it would be wonderful if we could produce something of lasting value.  I propose that that could be modest summary of issues and approaches to building stream data management systems.  Such a summary could take the form of a tutorial paper, say 50-70 pages in length.  This paper will be written by the whole class.

 

The first 2/3 of the semester will be devoted to making us all "experts" in the field.  We will do this through readings and discussions of current research papers.  The last 1/3 of the semester will be devoted to the writing process.  During the writing phase, we will use class time to exchange ideas about content, critique and reformulate drafts and do whatever else it takes to make a very high-quality tutorial paper.  One of our major goals will be to make it accessible to anyone with a basic understanding of computer science.

Rather than break up the discussions with two classes a week, we will instead hold one long class.  We will need to keep things lively in order to avoid the potential boredom that 2.5 hours could produce.  Thus, for the paper-reading part of the course, we will adopt the following class methodology.

The Presenters will be responsible for providing a summary of the area based on the readings.  This can largely be derived from the assigned readings, but you are encouraged to go beyond our syllabus to discover other interesting work.  Remember that the last thing in the world that we are looking for is a linear presentation of the sections in the papers.  Part of the message should be a description of how you think that the topic at hand relates to stream data management. 

The rest of you are not off the hook.  You are expected to actively participate in the debate.  Also, in order to ensure that you read the papers and think about the issues before coming to class, everyone who is not a presenter or a discussant will write up two examples.  Each example should illustrate something about what you consider to be a key part of the technology that we are studying.  An example can illustrate any of the approaches of the papers.  Ideally, you will be able to work each example out in more than one of the systems studied.

The discussion of the papers will take the first part of the class.

The second part of the class will be for the previous week's discussion leaders to come to class with a synopsis of the examples that the class members produced.  We are looking for examples that would be good fodder for our tutorial. .

What am I going to be graded on?

The grade for the course will based on the following things.

1. Your presentations
2. Your weekly examples
3.
Your contribution to the overall class writing effort
4. Your class citizenship during the discussions (You should have something to say.)


 

There will be no exams.


There are no textbooks to buy.  Most all of the readings are available on line.

 

Readings

 

 

Aurora (Brown)

 

Daniel Abadi, Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik, Aurora: A New Model and Architecture for Data Stream Management, VLDB Journal, Vol. 12, No. 2, August, 2003. (pdf)

 

Stan Zdonik, Michael Stonebraker, Mitch Cherniack, Ugur Cetintemel, Magdalena Balazinska and Hari Balakrishnan, The Aurora and Medusa Projects, IEEE Data Engineering Bulletin, Vol. 26, No. 1, March, 2003. (Postscript)
 

 

Stream (Stanford)

The STREAM Group. STREAM: The Stanford Stream Data Manager (short overview paper)
IEEE Data Engineering Bulletin, Vol. 26 No. 1, March 2003

 

Telegraph (Berkeley)

 

Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. To appear, CIDR 2003. [PDF]

Sailesh Krishnamurthy, Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Samuel R. Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. TelegraphCQ: An Architectural Status Report. IEEE Data Engineering Bulletin, Vol 26(1), March 2003. [PS]

 

Query Languages for Streams

Seshadri, P., Livny, M. and Ramakrishnan, R., Sequence Query Processing. in ACM Conference on the Management of Data (SIGMOD), (1994), 430-441.

A. Arasu, S. Babu and J. Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution
Technical Report, Oct. 2003

Wang, H. and Zaniolo, C. ATLaS: A Native Extension of SQL for Data Mining and Stream Computations, UCLA CS Dept., 2002.

 

Query Processing for Streams I


Tucker, P., Maier, D., Sheard, T. and Fegaras, L. Punctuating Continuous Streams, Oregon Graduate Institute, Portland, OR, 2002.

  U. Srivastava, S. Babu, and J. Widom. Monitoring Stream Properties for Continuous Query Processing (short paper)
In Proc. of the 2003 Workshop on Management and Processing of Data Streams (MPDS 2003), June 2003

 

Kang, J., Naughton, J.F. and Viglas, S.D., Evaluating Window Joins Over Unbounded Streams. in.

 

Query Processing for Streams II

A. Arasu, B. Babcock, S. Babu, J. McAlister, and J. Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams
To appear in ACM Transactions on Database Systems, March 2004.

  S. Babu, U. Srivastava, and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams
Technical Report, November 2002

Vijayshankar Raman, Amol Deshpande, and Joseph M. Hellerstein. Using State Modules for Adaptive Query Processing. ICDE 2003 (pdf).

 

QoS, Scheduling, and Load Shedding

B. Babcock, S. Babu, M. Datar, R. Motwani, and D. Thomas. Operator Scheduling in Data Stream Systems
Technical Report, Oct. 2003

B. Babcock, M. Datar, and R. Motwani. Load Shedding Techniques for Data Stream Systems (short paper)
In Proc. of the 2003 Workshop on Management and Processing of Data Streams (MPDS 2003), June 2003

Nesime Tatbul, Ugur Cetintemel, Stan Zdonik, Mitch Cherniack and Michael Stonebraker, <b>Load Shedding in a Data Stream Manager</b>, Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), September, 2003. (pdf)


Don Carney, Ugur Cetintemel, Alex Rasin, Stan Zdonik, Mitch Cherniack and Michael Stonebraker, Operator Scheduling in a Data Stream Environment</b>, Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), September, 2003. (pdf)

R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System
In Proc. of the 2003 Conf. on Innovative Data Systems Research (CIDR), January 2003

Vijayshankar Raman and Joseph M. Hellerstein. Partial Results for Online Query Processing. ACM SIGMOD Conference, Madison, WI, June 2002. [PDF]

Distributed Processing

Mitch Cherniack, Hari Balakrishnan, Don Carney, Ugur Cetintemel, Ying Xing and Stan Zdonik, Scalable Distributed Stream Processing, Proceedings of the Conference for Innovative Database Research (CIDR), January, 2003, Asilomar, CA. (pdf)

C. Olston, J. Jiang, and J. Widom. Adaptive Filters for Continuous Queries over Distributed Data Streams
In Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003), June, 2003

B. Babcock and C. Olston. Distributed Top-K Monitoring
In Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003), June 2003

Shah, M.A., Hellerstein, J.M., Chandrasekaran, S. and Franklin, M.J., Flux: An Adaptive Partitioning Operator for Continuous Query Systems. in International Conference on Data Engineering (ICDE), (Bangalore, India, 2003).

 

Applications and the Competition

S. Babu, L. Subramanian, and J. Widom. A Data Stream Management System for Network Traffic Management
In Proc. of the Workshop on Network-Related Data Management (NRDM 2001), May 2001

Joseph M. Hellerstein. From Database to Dataflow: New Directions in IT. Medical Records Institute Health IT Advisory Report 3(6) (2002). [ PDF]

Oracle 9i Stream