CS227 - Advanced Topics In Database Management
Spring 2003

Stream Data Management

Introduction

In the future, computers that do your work might not be visible, nor will you be aware that they exist. Computing is beginning to move away from the desktop and into the daily fabric of our lives through cell phones, PDA's, and small devices that are embedded in the environment..  Performing computations on such an infrastructure has often been termed pervasive or ubiquitous computing.  While these technologies hold great promise, the promise will never be realized without reliable and timely access to relevant data.

 

Often relevant data is collected by small devices and sent on into the computing infrastructure for further processing.  The collection devices are unaware of how the data will be used; however, in order to ensure that higher-level entities are aware of its existence, they will push the data forward without any specific requests.  This is backwards from the way many computing models treat data (e.g., traditional databases, RPC).  The data sequences that are produced in this way are called streams.

 

Recently a new area has emerged that has been called Stream Data Management (SDM).  The goal of this work is to produce software technology similar to that of Database Management Systems (DBMS) for the stream world.  Users will register queries with the system.  These queries will execute forever producing results on the basis of the data that arrives in the streams.

The area of data management provides intelligent access to data using specialized techniques and application-level knowledge to organize the use of scarce resources (e.g., storage).  Modern database management systems have successfully shown the value of such an approach. This course will focus on issues related to the goal of providing similar capability to the world of stream-based and pervasive computing.  Our hypothesis is that, while many of the fundamental concepts are the same, the approach must necessarily be radically different.

This course will discuss component technologies that will contribute to this vision of stream data management.  We will examine relevant research papers that will give us insight into the basic issues and approaches that might be needed in this brave new world.

Topics

While the field of pervasive data management is very new and speculative, there has been work on topics that could be relevant to our vision.  In this course, we will investigate these topics one at a time through readings and class discussions.  A schedule for the semester follows.

 

Date

Topic

Presenters

Discussants

Papers

Jan 27

Introduction to Course

 



 

Feb 3

Pervasive Data Management

cw
nwoodbri

fwood 

[30] [11] [9]

Feb 10

East Coast Systems

arsinger
zkaplowi 

 cw
nwoodbri

[5] [18] [27]

Feb 17

Presidents Day

     - No Class

 

 

 

Feb 24

West Coast Systems

olga
dyao 

arsinger
zkaplowi 

[16] [17] [7]

March 3

Triggers and Active Databases

 fwood
skim
lucia

cw
nwoodbri 

[6] [12] [21]

March 10

Continuous Queries and Publish/Subscribe

cw
nwoodbri 

olga
dyao 

[26] [31] [1]

March 17

Query Languages for Streams

arsinger
zkaplowi 

olga
dyao 

[2] [22] [29]

March 24

Spring Recess

    – No Class

 

 

 

March 31

Query Processing for Streams

olga
dyao

arsinger
zkaplowi 

[28] [14] [3]

April 7

Sensor Databases and Peer-to-Peer Systems

 fwood
skim
lucia

 nwoodbri

[15] [25] [4]

April 14

QoS and Load Shedding

olga
dyao 

lucia 

[10] [24]

April 21

Scheduling and Realtime Databases

 fwood
skim
lucia

 cw

[19] [20]

April 28

High-Availability and Load Sharing

 arsinger
zkaplowi

 skim

[8] [23]

May 5

The Competition

 tba

 tba

[13]

 

Mechanics

This is a seminar course that is based on current research papers.  There is no textbook.  The intent is to explore the brand new area of stream data management (SDM) by looking at current papers on this and related topics.  We will attempt to build a picture of what SDM might look like in the future given these contributing results.

Rather than break up the discussions with two classes a week, we will instead hold one long class.  We will need to keep things lively in order to avoid the potential boredom that 2.5 hours could produce.  Thus, we will adopt the following class methodology.

One team will be responsible for presenting a summary of the area based on the readings.  This can largely be derived from the assigned readings, but you are encouraged to go beyond our syllabus to discover other interesting work.  Remember that the last thing in the world that we are looking for is a linear presentation of the sections in the papers.  Part of the message should be a description of how you think that the topic at hand relates to stream data management.  This team will try to present the area in the best possible light.  You guys are the cheerleaders for the approach.

Another team will be assigned the job of being the discussants.  Discussants will present a short rebuttal to the presenters’ talk.  They will also come to class prepared with questions, counterexamples, and a generally crabby attitude toward the work.  With any luck, this will set up a debate-like atmosphere in which we can argue about the pros and cons of the basic technologies.

The rest of you are not off the hook.  You are expected to actively participate in the debate.  Also, in order to ensure that you read the papers and think about the issues before coming to class, everyone who is not a presenter or a discussant will write a brief position paper which captures your own thoughts about the readings.  My guess is that these will need to be about 2 pages in length, but you may use whatever you feel is adequate.

The discussion of the papers will take the first half of the class.

The second half of the class will be a group exercise to produce a class tutorial on the topic.  We will not be able to create prose in class, but we can produce a detailed outline.  The presenters and the discussants will lead this exercise.  They will deputize someone to take control of the keyboard as we argue about the content of our tutorial.  The presenters and the discussants will then turn this outline into prose off-line.  There is no real page requirement, but my guess is that to do this right, you will need at least 10 pages.

What am I going to be graded on?

The grade for the course will based on the following things.

1. Your presentations
2. Your performance as a discussant
3. Your tutorials
4. Your position papers
5. Your class citizenship during the discussions (You should have something to say.)
6. A final project

There will be no exams.


There are no textbooks to buy.  Most all of the readings are available on line.

 

Readings

 

1.         Altinel, M. and Franklin, M.J. Efficient Filtering of XML Documents for Selective Dissemination of Information. The VLDB Journal. 53-64.

2.         Arasu, A., Babu, S. and Widom, J. An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations, 2002.

3.         Babu, S. and Widom, J. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams, Stanford University, Palo Alto, CA, 2002.

4.         Bonnet, P., Gehrke, J.E. and Seshadri, P., Towards Sensor Database Systems. in International Conference on Mobile Data Management, (Hong Kong, 2001).

5.         Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N. and Zdonik, S., Monitoring Streams - A New Class of Data Management Applications. in International Conference on Very Large Databases (VLDB), (Hong Kong, 2002).

6.         Ceri, S., Cochrane, R.J. and Widom, J., Practical Applications of Triggers and Constraints: Successes and Lingering Issues. in International Conference on Very Large Databases (VLDB), (Cairo, Egypt, 2000), 285-296.

7.         Chandrasekaran, S. and Franklin, M., Streaming Queries over Streaming Data. in International Conference on Very Large Databases (VLDB), (Hong Kong, 2002).

8.         Cherniack, M., Balakrishnan, H., Carney, D., Cetintemel, U., Xing, Y. and Zdonik, S., Scalable Distributed Stream Processing. in Conference on Innovative Data Systems Research (CIDR), (Monterey, CA, 2003).

9.         Cherniack, M., Galvez, E.F., Franklin, M.J. and Zdonik, S., Profile-Driven Cache Management. in International Conference on Data Engineering (ICDE), (Bangalore, India, 2003).

10.        Floyd, S. and Jacobson, V. Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM Transactions on Networking, 1 (4).

11.        Franklin, M.J. (ed.), Challenges in Ubiquitous Data Management. Springer-Verlag, 2001.

12.        Hanson, E.N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J.B. and Vernon, A., Scalable Trigger Processing. in International Conference on Data Engineering (ICDE), (1999), 266-275.

13.        Jacobs, D., Distributed Computing with BEA WebLogic Server. in Conference on Innovative Data Systems Research (CIDR), (Monterey, CA, 2003), 292-302.

14.        Kang, J., Naughton, J.F. and Viglas, S.D., Evaluating Window Joins Over Unbounded Streams. in.

15.        Madden, S., Franklin, M.J., Hellerstein, J.M. and Hong, W., TAG: A Tiny Aggregation Service for ad hoc Sensor Networks. in OSDI, (Boston, MA, 2002).

16.        Madden, S., Shah, M., Hellerstein, J. and Raman, V., Continuously Adaptive Continuous Queries over Streams. in ACM Conference on the Management of Data (SIGMOD), (Madison, WI, 2002).

17.        Motwani, R., Widom, J., Arasu, A., Babcock, B., Babu, S., Datar, M., Manku, G., Olston, C., Rosenstein, J. and Varma, R., Query Processing, Resource Management, and Approximation and in a Data Stream Management System. in Conference on Innovative Data Systems Research (CIDR), (Pacific Grove, CA, 2003), 245-256.

18.        Naughton, J., DeWitt, D., Maier, D. and , ,. The Niagara Internet Query System, University of Wisconsin, 2002.

19.        Pang, H., Carey, M. and Livny, M. Multiclass Query Scheduling in Real-Time Database Systems. IEEE Transactions on Knowledge and Data Engineering, 7 (4). 533-551.

20.        Ramamritham, K. Real-Time Databases. Distributed and Parallel Databases, 1 (2). 199-226.

21.        Schreier, U., Pirahesh, H., Agrawal, R. and Mohan, C., Alert: An Architecture for Transforming a Passive DBMS into an Active DBMS. in International Conference on Very Large Databases (VLDB), (1991), 469-478.

22.        Seshadri, P., Livny, M. and Ramakrishnan, R., Sequence Query Processing. in ACM Conference on the Management of Data (SIGMOD), (1994), 430-441.

23.        Shah, M.A., Hellerstein, J.M., Chandrasekaran, S. and Franklin, M.J., Flux: An Adaptive Partitioning Operator for Continuous Query Systems. in International Conference on Data Engineering (ICDE), (Bangalore, India, 2003).

24.        Shenker, S. Fundamental Design Issues for the Future Internet. IEEE Journal on Selected Areas in Communications, 13 (7). 1176-1188.

25.        Stoica, I., Morris, R., Karger, D., Kaashoek, F. and Balakrishnan, H., Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. in ACM Conference on Data Communication (SIGCOMM), (San Diego, CA, 2001).

26.        Terry, D.B., Goldberg, D., Nichols, D. and Oki, B.M., Continuous Queries over Append-Only Databases. in ACM Conference on the Management of Data (SIGMOD), (1992), 321-330.

27.        Tucker, P., Maier, D., Sheard, T. and Fegaras, L. Punctuating Continuous Streams, Oregon Graduate Institute, Portland, OR, 2002.

28.        Viglas, S. and Naughton, J.F., Rate-based query optimization for streaming information sources. in ACM Conference on the Management of Data (SIGMOD), (Madison, WI, 2002), 37-48.

29.        Wang, H. and Zaniolo, C. ATLaS: A Native Extension of SQL for Data Mining and Stream Computations, UCLA CS Dept., 2002.

30.        Weiser, M. The Computer for the 21st Century. Scientific American. 94-110.

31.        Yalamanchi, A., Srinivasan, J. and Gawlick, D., Managing Expressions as Data in Relational Database Systems. in Conference on Innovative Data Systems Research (CIDR), (Moonterey, CA, 2003), 303-313.