ACM Computing Surveys
28A(4), December 1996,
http://www.acm.org/surveys/1996/Formatting/. Copyright ©
1996 by the Association for Computing Machinery, Inc. See the permissions statement below.
A Role for Database Research in the Database Industry
David Lomet
Microsoft Research, Database Group
One Microsoft Way, Redmond, WA 98052
lomet@microsoft.com, http://www.research.microsoft.com/research/db/lomet
(206)703-1853
Abstract:
We argue that research in the database industry is not primarily a matter
of discovering new problems. Rather, the main role of database research
is to formulate the basic abstractions, provide the elegant and general
algorithms, and characterize performance.
Categories and Subject Descriptors:
General Terms:
Additional Key Words and Phrases: database research
Publication Information
- Citation
- Lomet, D., 1996.
A Role for Database Research in the Database Industry
ACM
- Submission date
- November 15, 1996
- Acceptance date
- November 15, 1996
Research in the Database Industry
A database industry would be alive and well in the US and elsewhere,
even if researchers had never entered the database arena. This area,
first and foremost, copes with the business data processing problem.
Businesses were and are willing to spend money on this. Hence, the
existence of the industry is no accident, and certainly did not require
researchers to identify the problem.
In addition, a good bit of the database technology would have evolved
without research input. Something close to the notion of a transaction
existed in IMS around 1970. Data models, both hierarchical and network
(looking very much like extended relational and OO navigation models)
already existed by the early 1970s without research input. Tree indexing
and hashing were in use in a similar time frame. We need to understand
how research contributed to the evolution of database technology so that
we can understand the role it might play in the future.
The research contribution, in my view, consisted of providing two
fundamental abstractions, transactions and the relational model.
Working
with these abstractions, one could enormously expand the scope of the
algorithm solution space, hence improving functionality, performance,
and indeed many desirable attributes of database systems. These abstractions
gave database users models with which they could cope. And researchers
leapt into the database technology enterprise by exploiting these
abstractions with technical solutions that were both general and
elegant.
With transactions, it was concurrency control, recovery, and
availability
techniques that resulted. With the relational model, it was
normalization, query processing, optimization, data independence,
indexing and storage organizations.
The point here is that industry identified the problems and provided the
early impetus. Researchers came along later and provided the clean
abstractions and the elegant solutions. These are what enable database
technology to be readily transmitted to new practitioners and to become
solid engineering, not just arcane craft. This has served our field
well,
giving researchers important problems to ponder, and has returned to
industry elegant abstractions, algorithms, and understanding. This is
likely to continue to be the model, and I believe it should be.
Areas of Current Industry Interest
So what are the areas on which industry has just begun to focus? I
think it is these areas that cry out, or should cry out, to researchers
as the golden opportunities. I am unconvinced that the research
community is likely to anticipate the next area with a huge impact. So
the areas that I cite below (only some of the possible areas) have
largely had some preliminary industrial exploration already. But
fundamental understandings are in short supply, as are elegant and
generalizable algorithms.
- "Transactions" Everywhere
We need to generalize our notion of transaction and apply it in a much
wider context. This has been an ongoing research activity, and I think
it should continue. Transactions have a role to play in workflow systems,
in business data processing over the Internet, and in efforts to improve
reliability and availability more generally. None of this is new, but
the game is not over. Most systems do not deal with applications, most
workflow is not reliable, and high-availability techniques tend to be
isolated methods. Deeper understanding would translate into systems with
better performance, greater generality, and higher availability.
- Query Processing and Optimization
We have just begun to deal with complex queries over enormous volumes of
data. This will require parallelism to succeed. It will require
effective indexing. It will require a much better handle on how to
estimate query costs, and how to transform query expressions, especially
over diverse data types, including extended types. We have only begun
to
support materialized views and to understand how to optimize in the
context of multiple queries. Decision support systems, data warehousing,
on-line analytic processing, and multimedia data are all areas
impacted by this technology.
- Information Discovery and Integration
Somewhere between converting monetary units and fully solving the
artificial intelligence problem, there surely exist techniques for making
more sense of data. Each success permits us to transform data into
something closer to information (the data mining problem) and to bring
together the information from diverse data sources. Whether these
techniques are simply a collection of ad hoc engineering efforts or ones
that have real generality has yet to be established. But this is an
area
with enormous potential leverage for the next deep insight. This is an
area where "information modelling" should play an important role.
- Distribution and the Web
No list of research areas would be complete without mentioning the Web.
It will transform our lives and the nature of our research. I do not
regard it as a new area, however, but rather as the coming of age of
wide-scale distribution. This will stress our abilities in all the
preceding areas.
The new fundamentals here are long response time, autonomy, and
security.
For long response time, we need to understand when and how to cache
information and how to (in)validate it. To cope with autonomy requires
cooperative, non-intrusive protocols that web sites will want to sign up
for. Security for distributed systems has long been a "black hole," but
the Web makes its solution more pressing.
Industry is moving fast in all these areas, but research should be able
to play its customary role of understanding and generalizing, and hence
providing the foundation upon which superior technology can and will be
built.
Permission to make digital
or hard copies of part or all of this work for personal or classroom
use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for
components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from
Publications Dept, ACM Inc., fax +1 (212) 869-0481, or
permissions@acm.org.