CS227 - Advanced Topics In Database Management
Spring 2008

Large-Scale Data Access

Introduction

Data is proliferating at a rapid rate.  While storage devices are getting bigger, the appetite of many applications for data is growing even faster.  Walmart keeps all of its point-of-sale data on-line for three years.  This amounts to about 20 Tbytes of information.  Scientific applications (e.g., biology, astrophysics) collect instrument data that can exceed several petabytes a year.  Storing and querying this much data in a meaningful and efficient way is an enormous technical challenge.

The database field has responded by developing a set of tools called data warehouses (DW).  In the commercial arena, data warehouses typically contain a copy of the data that has been captured in a transactional environment.  In fact, the data warehouse is often the place in which multiple heterogeneous data sources are integrated into a single, consistent point of access.  The DW is also the data source for large-scale data mining.  Data mining is an interesting area, but we will not pursue it in this course.

The DW is the base on which decision support systems are built.  In other words, the database presents a model of the world that should be available online for answering a decision maker’s questions.  Despite the size of the DW, decision makers demand rapid answers to their queries.  A data warehouse is designed to meet the needs of what have become known as On-Line Analytic Processing (OLAP) workloads. An OLAP workload consists largely of queries (reads) and insertions (appending new records). This is in contrast to what have been called On-Line Transaction Processing (OLTP) workloads. OLTP is characterized by a workload containing lots of simple updates and queries. Think ATMs. Not too surprisingly, the techniques for managing OLTP and OLAP data are different.
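To make the OLTP/OLAP distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names, and values are invented for illustration only; a real warehouse would be vastly larger and would not run on SQLite. The point is just the shape of the two workloads: a short update that touches one record by key versus a read-only aggregate that scans many records.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0)")

# Warehouse-style load: new records are appended, never updated in place.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("boston", "milk", 3.0),
        ("boston", "bread", 2.5),
        ("providence", "milk", 3.5),
    ],
)


def withdraw(conn, account_id, amount):
    """OLTP-style work: a short transaction that updates one row by key."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )


def revenue_by_store(conn):
    """OLAP-style work: a read-only aggregate over the whole sales table."""
    cursor = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    )
    return dict(cursor.fetchall())
```

Calling withdraw(conn, 1, 40.0) changes a single account row, while revenue_by_store(conn) has to read every sales record to produce its per-store totals.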

This course will discuss technologies that contribute to the field of data warehousing.  Even though data warehouses are becoming more mature, we will look at this subject from a research perspective; however, in order to be relevant, we will pay attention to the industrial state-of-the-art.  We will examine relevant research papers that will give us insight into the basic issues and approaches that could be baked into any data warehouse system or product. We will pay special attention to system building issues. Whenever possible, we will also attempt to discover the current product offerings that relate to what we are discussing.

This term we will also emphasize the topic of automated database design. For any given set of logical relations, there are many ways to actually store that data. Physical database design is concerned with making that choice. Typically this choice is made by DBAs. As the complexity level of the database increases (e.g., as we try to run our warehouse on a computing cluster), this process quickly becomes unmanageable. We need a tool that either does the physical design for us or acts as an advisor to the DBA to help him or her make the best choice. We will spend several weeks discussing the state of the art of automated database design.
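As a toy illustration of what such an advisor does, the sketch below enumerates candidate sets of single-column indexes and keeps the set that minimizes the estimated cost of a workload. Everything here is an assumption made for the example: the table size, the workload, and the cost model (an indexed query touches only the matching rows, an unindexed query scans the whole table). It is not the method of any particular paper or product.

```python
from itertools import combinations

TABLE_ROWS = 1_000_000  # hypothetical table size

# Hypothetical workload: (column filtered on, fraction of rows matched, times run).
WORKLOAD = [
    ("store", 0.01, 50),
    ("date", 0.001, 30),
    ("item", 0.05, 5),
]


def query_cost(column, selectivity, indexed):
    """Rows touched under the toy cost model: index lookup vs. full scan."""
    if column in indexed:
        return TABLE_ROWS * selectivity
    return TABLE_ROWS


def workload_cost(indexed):
    """Total estimated cost of the workload for a given set of indexed columns."""
    return sum(freq * query_cost(col, sel, indexed) for col, sel, freq in WORKLOAD)


def recommend_indexes(max_indexes):
    """Try every set of at most max_indexes columns and return the cheapest."""
    columns = sorted({col for col, _, _ in WORKLOAD})
    candidates = [
        frozenset(combo)
        for k in range(max_indexes + 1)
        for combo in combinations(columns, k)
    ]
    return min(candidates, key=workload_cost)
```

With a budget of two indexes, recommend_indexes(2) picks the store and date columns, since those queries run most often. Exhaustive enumeration grows exponentially with the number of candidate structures, which is one reason the design problem quickly becomes unmanageable by hand.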

Topics

A rich literature in large-scale data access has been produced over the last decade.  We will read representative papers from this literature that cover some of the more central topics.  A schedule for the semester follows.

 

Date        Topic                            Presenters

Jan 28      Introduction to Course           sbz
Feb 4       Field trip – NE Database Day     MIT
Feb 11      Data Warehousing (general)       sbz
Feb 18      Long Weekend - No Class
Feb 25      Column Stores - Vertica          Shilpa Lawande
March 3     Indices / Bitmaps                David / Alex
March 10    Compression                      Jennie / Matt / Zhenyuan / Bo
March 17    Query Processing                 Jonathan / Andy / Josh / Sidra
March 24    Spring Recess - No Class
March 31    Automatic Database Design 1      Hideaki / Rob / Bo
April 7     Automatic Database Design 2      Hideaki / Jennie / Alex
April 14    Automatic Database Design 3      Nathan / Zhenyuan / Matt
April 21    Automatic Database Design 4      Rob / Andy / Nathan / Jonathan
April 28    Other Approaches                 David / Josh
May 5       Final Project Presentations      all

 

Mechanics

This is a seminar course that is based on current research papers.  There is no textbook per se, but we will use the book The Data Warehouse Toolkit, Second Edition, by Ralph Kimball and Margy Ross (published by John Wiley & Sons).  We will use it at the beginning of the course to get us all thinking about how to design a warehouse, and as a reference throughout the semester to supply us with real-world example applications.  Beyond that, the intent is to explore the area of large-scale data access by looking at current research papers related to data warehousing.  We will attempt to build a picture of the kinds of techniques that might be necessary to address the problems of the coming data avalanche.

 

Rather than break up the discussions with two classes a week, we will instead hold one long class.  We will need to keep things lively in order to avoid the potential boredom that 2.5 hours could produce.  Thus, we will adopt the following class methodology.

 

One team will be responsible for presenting a summary of the area based on the readings.  This can largely be derived from the assigned readings, but you are encouraged to go beyond our syllabus to discover other interesting work.  Remember that the last thing in the world that we are looking for is a linear presentation of the sections in the papers.  Part of the message should be a description of how you think that the topic at hand relates to data warehouses.  This team will try to present the assigned paper in the best possible light.  You guys are the cheerleaders for the approach.

 

The syllabus usually provides multiple papers for the topic at hand.  The topic is the important thing.  You should definitely not present all the details from all the papers.  A suggestion for the lead paper is given in the syllabus.  You might want to concentrate most of your efforts on that paper.  The other papers are there to provide background.  It would be useful for you to let the class know (by e-mail) a few days before your class which papers you will be discussing so that the class can make sure that they have read the same ones in detail.

 

We will also develop a set of slides that will serve as a record of what we did this semester.  This will be compiled from “the best of” each class presentation.  Each team of presenters should construct a revised slide deck based on the experience from the class.  The more examples in these slides the better.  The idea is to use feedback from the class presentation to produce a shorter, better slide deck.

 

The rest of you are not off the hook.  You are expected to actively participate in the debate.  Also, in order to ensure that you read the papers and think about the issues before coming to class, everyone who is not a presenter will write a brief position paper that captures his or her own thoughts about the readings.  My guess is that these will need to be about 2 pages in length, but you may use whatever length you feel is adequate.

 

What am I going to be graded on?

 

The grade for the course will be based on the following things.

 

            1. Your presentations
            2. Your summary slides
            3. Your weekly position papers
            4. Your class citizenship during the discussions (You should have something to say.)
            5. A final project

 

There will be no exams. Almost all of the readings are available online.

 

Presentation Guidelines

 

While every topic and therefore every presentation will be different, there are some general things that we can expect out of each presentation. Use this as a guide. It might be that this does not fit your topic, but it is likely that most of it will. The subtopics are examples of the kinds of things that might be relevant to ask. Invent your own questions. Don't feel that you must blindly use these.

 

            1. Summary of paper
                  Explanation in easy-to-understand terms - cut through the formality
                  Plenty of examples

            2. Critique
                  Is this a real problem?
                  When will the solution work and when will it not?

            3. Industry
                  What do products do?
                        DBMSs
                        Other products

            4. Comparisons
                  With other papers
                  With DB products
                  With other approaches

            5. Future
                  How might one broaden or extend the work?
                  Ideas for new research

 

Teams

 

Here's the current project handout

 

Team

Jonathan and Nathan

Zhenyuan and Bo

Hideaki and David

Andy

Sidra and Jennie

Rob and Matt

Josh and Ahmad

 

Readings

 

 The list of readings is available here.