CS227 - Advanced Topics in Database Management
Spring 2005
Large-scale Data Access
Data is proliferating at a rapid rate. While storage devices are getting bigger, the appetite of many applications for data is growing even faster. Walmart keeps all of its point-of-sale data online for three years. This amounts to about 20 Tbytes of information. Scientific applications (e.g., biology, astrophysics) collect instrument data that can exceed several petabytes a year. Using this much data in a meaningful and efficient way is an enormous technical challenge.
The database field has responded by developing a set of tools called data warehouses (DW). In the commercial arena, data warehouses typically contain a copy of the data that is captured in a transactional environment. In fact, the data warehouse often is the place in which multiple heterogeneous data sources are integrated into a consistent single point-of-access. The DW is also the data source for large-scale data mining. This is an interesting area, but we will not pursue it in this course.
The DW is the base on which decision support systems are built. In other words, the database presents a model of the world that should be available online for answering a decision maker’s questions. Despite the size of the DW, decision makers demand rapid answers to their queries. Since the DW can contain huge amounts of data, the techniques for managing it differ from those used in a smaller operational store.
This course will discuss component technologies that contribute to the field of data warehousing. Even though data warehouses are becoming more mature, we will look at this subject from a research perspective. The research literature has decomposed the broader problem into smaller, bite-size pieces that can be studied in isolation. We will examine relevant research papers that will give us insight into the basic issues and approaches that could be baked into any data warehouse system or product. We will pay special attention to system-building issues.
A rich literature in large-scale data access has been produced over the last decade. We will read representative papers from this literature that cover some of the more central topics. A schedule for the semester follows.
| Date | Topic | Presenters | Discussants |
|---|---|---|---|
| Jan 31 | Introduction to Course | sbz | |
| Feb 7 | Data Warehousing (general) | Chris, David | Will |
| Feb 14 | Column Stores | Alex, Tingjian | John |
| Feb 21 | Long Weekend - No Class - | | |
| Feb 28 | Storage Systems | Andrew, Bill, Chris | Doug |
| March 7 | Materialized Views | Anjali, Doug, John | Tingjian |
| March 14 | Managing Updates | Tingjian, Alex | Anjali |
| March 21 | Indexing | Alex, Josh, Andrew | Chris |
| March 28 | Spring Recess - No Class | | |
| April 4 | Query Evaluation | Will, Anjali | Josh |
| April 11 | Compression | John, Salil | Bill |
| April 18 | Automatic Database Design | Doug, Will, Bill | Andrew |
| April 25 | Lineage | Salil, David, Josh | Alex |
| May 2 | Prepare Final Projects (no class) | | |
| May 9 | Final Project Presentations | | |
Mechanics
This is a seminar course that
is based on current research papers. There is no textbook.
The intent is to explore the area of large-scale data access by looking at
current papers related to data warehousing. We will attempt to build a
picture of the kinds of techniques that might be
necessary to address the problems of the coming data avalanche.
Rather than break up the
discussions with two classes a week, we will instead hold one long class.
We will need to keep things lively in order to avoid the potential boredom that
2.5 hours could produce. Thus, we will adopt the following class
methodology.
One team will be responsible
for presenting a summary of the area based on the readings. This can
largely be derived from the assigned readings, but you are encouraged to go
beyond our syllabus to discover other interesting work. Remember that the
last thing in the world that we are looking for is a linear presentation of the
sections in the papers. Part of the message should be a description of
how you think that the topic at hand relates to data warehouses. This
team will try to present the area in the best possible light. You guys
are the cheerleaders for the approach.
The syllabus usually provides
multiple papers for the topic at hand. The topic is the important
thing. You should definitely not present all the details from all the
papers. A suggestion for the lead paper is given in the syllabus.
You might want to concentrate most of your efforts on that paper. The
other papers are there to provide background. It would be useful for you
to let the class know (by e-mail) a few days before your class which papers you
will be discussing so that the class can make sure that they have read the same
ones in detail.
Another person will be
assigned the job of being the discussant. Discussants will present a short
rebuttal to the presenters’ talk. They will also come to class prepared
with questions, counterexamples, and a generally crabby attitude toward the
work. With any luck, this will set up a debate-like atmosphere in which
we can argue about the pros and cons of the basic technologies. The
discussant’s job is to find where the work falls short. This can be in its
technical approach or in its presentation. If the discussants can bring
up a few questions that were not adequately addressed in the paper(s), then we
can use that as grist for our discussion.
We will also try to develop a
set of slides that would be suitable for teaching, say, a one-day course in
data warehousing. This will be compiled
from “the best of” each class presentation. Each team of presenters should construct a
10-15 slide summary of their area (more slides are OK as long as they
add something). The
more examples in these slides, the better. The idea is to use feedback from the class
presentation to produce a shorter, better slide deck.
The rest of you are not off
the hook. You are expected to actively participate in the debate.
Also, in order to ensure that you read the papers and think about the issues
before coming to class, everyone who is not a presenter or a discussant will
write a brief position paper that captures their own thoughts about the
readings. My guess is that these will need to be about 2 pages in length,
but you may use whatever length you feel is adequate.
What am I going to be graded on?
The grade for the course will
be based on the following things.
1. Your presentations
2. Your performance as a discussant
3. Your summary slides
4. Your weekly position papers
5. Your class citizenship during the discussions (You should have something to say.)
6. A final project
There will be no exams. There
are no textbooks to buy. Almost all of the readings are available online.
The list of readings is available here.