CS227 - Advanced Topics In Database Management
Spring 2005

Large-scale Data Access

Introduction

Data is proliferating at a rapid rate.  While storage devices are getting bigger, many applications' appetite for data is growing even faster.  Walmart keeps all of its point-of-sale data on-line for three years, which amounts to about 20 Tbytes of information.  Scientific applications (e.g., biology, astrophysics) collect instrument data that can exceed several petabytes a year.  Using this much data in a meaningful and efficient way is an enormous technical challenge.

The database field has responded by developing a set of tools called data warehouses (DW).  In the commercial arena, data warehouses typically contain a copy of the data that is captured in a transactional environment.  In fact, the data warehouse is often the place in which multiple heterogeneous data sources are integrated into a single, consistent point of access.  The DW is also the data source for large-scale data mining.  Data mining is an interesting area, but we will not pursue it in this course.

The DW is the base on which decision support systems are built.  In other words, the database presents a model of the world that should be available online for answering a decision maker’s questions.  Despite the size of the DW, decision makers demand rapid answers to their queries.  Since the DW can contain huge amounts of data, the techniques for managing it are different from those used in a smaller operational store.

This course will discuss component technologies that contribute to the field of data warehousing.  Even though data warehouses are becoming more mature, we will look at this subject from a research perspective.  The research literature has decomposed the broader problem into smaller, bite-size pieces that can be studied in isolation.  We will examine relevant research papers that will give us insight into the basic issues and approaches that could be baked into any data warehouse system or product.  We will pay special attention to system-building issues.

Topics

A rich literature in large-scale data access has been produced over the last decade.  We will read representative papers from this literature that cover some of the more central topics.  A schedule for the semester follows.

Date      Topic                               Presenters            Discussants
Jan 31    Introduction to Course              sbz
Feb 7     Data Warehousing (general)          Chris, David          Will
Feb 14    Column Stores                       Alex, Tingjian        John
Feb 21    Long Weekend (no class)
Feb 28    Storage Systems                     Andrew, Bill, Chris   Doug
March 7   Materialized Views                  Anjali, Doug, John    Tingjian
March 14  Managing Updates                    Tingjian, Alex        Anjali
March 21  Indexing                            Alex, Josh, Andrew    Chris
March 28  Spring Recess (no class)
April 4   Query Evaluation                    Will, Anjali          Josh
April 11  Compression                         John, Salil           Bill
April 18  Automatic Database Design           Doug, Will, Bill      Andrew
April 25  Lineage                             Salil, David, Josh    Alex
May 2     Prepare Final Projects (no class)
May 9     Final Project Presentations

Mechanics

This is a seminar course based on current research papers.  There is no textbook.  The intent is to explore the area of large-scale data access by looking at current papers related to data warehousing.  We will attempt to build a picture of the kinds of techniques that might be necessary to address the problems of the coming data avalanche.

Rather than break up the discussions with two classes a week, we will instead hold one long class.  We will need to keep things lively in order to avoid the potential boredom that 2.5 hours could produce.  Thus, we will adopt the following class format.

One team will be responsible for presenting a summary of the area.  This can largely be derived from the assigned readings, but you are encouraged to go beyond our syllabus to discover other interesting work.  Remember that the last thing in the world that we are looking for is a linear walk through the sections of the papers.  Part of the message should be a description of how you think the topic at hand relates to data warehouses.  This team will try to present the area in the best possible light.  You guys are the cheerleaders for the approach.

The syllabus usually provides multiple papers for the topic at hand.  The topic is the important thing.  You should definitely not present all the details from all the papers.  A suggestion for the lead paper is given in the syllabus.  You might want to concentrate most of your efforts on that paper.  The other papers are there to provide background.  It would be useful for you to let the class know (by e-mail) a few days before your class which papers you will be discussing, so that everyone can read those papers in detail.

Another person will be assigned the job of being the discussant.  Discussants will present a short rebuttal to the presenters’ talk.  They will also come to class prepared with questions, counterexamples, and a generally crabby attitude toward the work.  With any luck, this will set up a debate-like atmosphere in which we can argue about the pros and cons of the basic technologies.  The discussant’s job is to find where the work falls short.  This can be in its technical approach or in its presentation.  If the discussants can bring up a few questions that were not adequately addressed in the paper(s), then we can use that as grist for our discussion.

We will also try to develop a set of slides that would be suitable (say) for teaching a one-day course in data warehousing.  This will be compiled from “the best of” each class presentation.  Each team of presenters should construct a 10-15 slide summary of their area.  If you need more slides, that’s OK too, as long as the additional slides add something.  The more examples in these slides the better.  The idea is to use feedback from the class presentation to produce a shorter, better slide deck.

The rest of you are not off the hook.  You are expected to actively participate in the debate.  Also, in order to ensure that you read the papers and think about the issues before coming to class, everyone who is not a presenter or a discussant will write a brief position paper that captures their own thoughts about the readings.  My guess is that these will need to be about 2 pages in length, but you may use whatever length you feel is adequate.

What am I going to be graded on?

The grade for the course will be based on the following things.

1. Your presentations
2. Your performance as a discussant
3. Your summary slides
4. Your weekly position papers
5. Your class citizenship during the discussions (You should have something to say.)
6. A final project

There will be no exams.  There are no textbooks to buy.  Almost all of the readings are available online.

Readings

The list of readings is available here.