CS227 - Advanced Topics in Database Management
Spring 2008
Large-Scale Data Access
Data is proliferating at a rapid rate. While storage devices are getting bigger, many applications' appetite for data is growing even faster. Walmart keeps all of its point-of-sale data online for three years, which amounts to about 20 Tbytes of information. Scientific applications (e.g., biology, astrophysics) collect data from instruments at rates that can exceed several petabytes a year. Storing and querying this much data in a meaningful and efficient way is an enormous technical challenge.
The database field has responded by developing a set of tools called data warehouses (DW). In the commercial arena, data warehouses typically contain a copy of the data that has been captured in a transactional environment. In fact, the data warehouse often is the place in which multiple heterogeneous data sources are integrated into a consistent single point-of-access. The DW is also the data source for large-scale data mining. This is an interesting area, but we will not pursue it in this course.
The DW is the base on which decision support systems are built. In other words, the database presents a model of the world that should be available online for answering a decision maker’s questions. Despite the size of the DW, decision makers demand rapid answers to their queries. A data warehouse is designed to meet the needs of what have become known as On-Line Analytic Processing (OLAP) workloads. An OLAP workload consists largely of queries (reads) and insertions (appending new records). This is in contrast to what have been called On-Line Transaction Processing (OLTP) workloads, which contain lots of simple updates and queries. Think ATMs. Not too surprisingly, the techniques for managing OLTP and OLAP data are different.
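As a rough illustration of the OLTP/OLAP distinction, here is a minimal sketch using Python's built-in sqlite3 module. The table names (accounts, sales), the column names, and the helper functions are all hypothetical and chosen only for this example; a real warehouse would not run on an in-memory SQLite database.

```python
import sqlite3

# In-memory database holding one OLTP-style table and one OLAP-style table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE sales (store TEXT, day TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 250)])


def withdraw(conn, account_id, amount):
    """OLTP-style work: a short transaction that updates a single row."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def append_sales(conn, rows):
    """OLAP-style load: new records are appended, existing ones are not updated."""
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def revenue_by_store(conn):
    """OLAP-style query: a read that scans and aggregates many rows."""
    return conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
```

The ATM-like withdraw call touches one row, while revenue_by_store reads the whole sales table; that difference in access pattern is one reason the two kinds of workload tend to be managed differently.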
This course will discuss technologies that contribute to the field of data warehousing. Even though data warehouses are becoming more mature, we will look at this subject from a research perspective; however, in order to be relevant, we will pay attention to the industrial state of the art. We will examine relevant research papers that will give us insight into the basic issues and approaches that could be baked into any data warehouse system or product. We will pay special attention to system-building issues. Whenever possible, we will also attempt to discover the current product offerings that relate to what we are discussing.
This term we will also emphasize the topic of automated database design. For any given set of logical relations, there are many ways to actually store that data. Physical database design is concerned with making that choice. Typically this choice is made by DBAs. As the complexity level of the database increases (e.g., as we try to run our warehouse on a computing cluster), this process quickly becomes unmanageable. We need a tool that either does the physical design for us or acts as an advisor to the DBA to help him or her make the best choice. We will spend several weeks discussing the state of the art of automated database design.
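To make the idea of a design advisor concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the workload, the table size, the cost model (an index lookup touches about 1% of the rows, a scan touches all of them), and the exhaustive search. It is not the method of any particular paper or product.

```python
from itertools import combinations

# Hypothetical workload: (column filtered on, queries per day).
WORKLOAD = [("store", 50), ("day", 30), ("product", 5)]
TABLE_ROWS = 1_000_000  # assumed size of the single table being queried


def query_cost(column, indexed_columns):
    """Toy cost model: rows touched by one query that filters on `column`."""
    if column in indexed_columns:
        return TABLE_ROWS // 100  # assumed: an index narrows the read to ~1%
    return TABLE_ROWS  # no index: full scan


def workload_cost(indexed_columns):
    """Total rows touched per day by the whole workload under a given design."""
    return sum(freq * query_cost(col, indexed_columns) for col, freq in WORKLOAD)


def best_design(candidates, max_indexes):
    """Try every set of at most `max_indexes` indexes and keep the cheapest."""
    best = frozenset()
    best_cost = workload_cost(best)
    for k in range(1, max_indexes + 1):
        for combo in combinations(candidates, k):
            design = frozenset(combo)
            cost = workload_cost(design)
            if cost < best_cost:
                best, best_cost = design, cost
    return best, best_cost
```

Calling best_design(["store", "day", "product"], 2) picks the two most frequently filtered columns under this cost model. Exhaustive enumeration is only workable for a handful of candidates; the number of possible designs grows exponentially with the number of candidates, which is part of why the manual process becomes unmanageable.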
A rich literature in large-scale data access has been produced over the last decade. We will read representative papers from this literature that cover some of the more central topics. A schedule for the semester follows.
| Date | Topic | Presenters |
| Jan 28 | Introduction to Course | sbz |
| Feb 4 | Field trip – NE Database Day | MIT |
| Feb 11 | Data Warehousing (general) | sbz |
| Feb 18 | Long Weekend - No Class - | |
| Feb 25 | Column Stores - Vertica | Shilpa Lawande |
| March 3 | Indices / Bitmaps | David / Alex |
| March 10 | Compression | Jennie / Matt / Zhenyuan / Bo |
| March 17 | Query Processing | Jonathan / Andy / Josh / Sidra |
| March 24 | Spring Recess – No Class | |
| March 31 | Automatic Database Design 1 | Hideaki / Rob / Bo |
| April 7 | Automatic Database Design 2 | Hideaki / Jennie / Alex |
| April 14 | Automatic Database Design 3 | Nathan / Zhenyuan / Matt |
| April 21 | Automatic Database Design 4 | Rob / Andy / Nathan / Jonathan |
| April 28 | Other Approaches | David / Josh |
| May 5 | Final Project Presentations | all |
Mechanics
This is a seminar course that
is based on current research papers. There is no textbook per se, but we will
use the book The Data Warehouse Toolkit, Second Edition by Ralph
Kimball and Margy Ross (published by John Wiley & Sons) at the beginning of
the course to get us all to think about how to design a warehouse and as a
reference throughout the semester to supply us with real-world example
applications. Beyond that, the intent is to explore the area of
large-scale data access by looking at current research papers related to data
warehousing. We will attempt to build a picture of the kinds of techniques
that might be necessary to address the problems
of the coming data avalanche.
Rather than break up the
discussions with two classes a week, we will instead hold one long class.
We will need to keep things lively in order to avoid the potential boredom that
2.5 hours could produce. Thus, we will adopt the following class
methodology.
One team will be responsible
for presenting a summary of the area based on the readings. This can
largely be derived from the assigned readings, but you are encouraged to go
beyond our syllabus to discover other interesting work. Remember that the
last thing in the world that we are looking for is a linear presentation of the
sections in the papers. Part of the message should be a description of
how you think that the topic at hand relates to data warehouses. This
team will try to present the assigned paper in the best possible light.
You guys are the cheerleaders for the approach.
The syllabus usually provides
multiple papers for the topic at hand. The topic is the important
thing. You should definitely not present all the details from all the
papers. A suggestion for the lead paper is given in the syllabus.
You might want to concentrate most of your efforts on that paper. The
other papers are there to provide background. It would be useful for you
to let the class know (by e-mail) a few days before your class which papers you
will be discussing so that the class can make sure that they have read the same
ones in detail.
We will also develop a set of
slides that will serve as a record of what we did this semester. This
will be compiled from “the best of” each class presentation. Each team of
presenters should construct a revised slide deck based on the experience from
the class. The more examples in these slides the
better. The idea is to use feedback from the class presentation to
produce a shorter, better slide deck.
The rest of you are not off
the hook. You are expected to actively participate in the debate.
Also, in order to ensure that you read the papers and think about the issues
before coming to class, everyone who is not a presenter will write a brief
position paper that captures his or her own thoughts about the readings. My
guess is that these will need to be about 2 pages in length, but you may use
whatever length you feel is adequate.
What am I going to be graded on?
The grade for the course will
be based on the following things.
1. Your presentations
2. Your summary slides
3. Your weekly position papers
4. Your class citizenship during the discussions (You should have something to say.)
5. A final project
There will be no exams. Almost
all of the readings are available online.
Presentation Guidelines
While every topic, and therefore every presentation, will be different, there are some general things that we can expect out of each presentation. Use this as a guide. It might be that this does not fit your topic, but it is likely that most of it will. The subtopics are examples of the kinds of things that might be relevant to ask. Invent your own questions. Don't feel that you must blindly use these.
1. Summary of paper
   Explanation in easy-to-understand terms - cut through the formality
   Plenty of examples
2. Critique
   Is this a real problem?
   When will the solution work and when will it not?
3. Industry
   What do products do?
      DBMSs
      Other products
4. Comparisons
   with other papers
   with DB products
   with other approaches
5. Future
   How might one broaden or extend the work?
   Ideas for new research
| Team |
| Jonathan and Nathan |
| Zhenyuan and Bo |
| Hideaki and David |
| Andy |
| Sidra and Jennie |
| Rob and Matt |
| Josh and Ahmad |
The list of readings is available here.