New Focal Points for Research in Database Systems

Henry F. Korth

Bell Laboratories, Room 2T-214
Lucent Technologies Inc.
700 Mountain Avenue
Murray Hill, NJ 07974 USA
hfk@bell-labs.com
http://cm.bell-labs.com/who/hfk

1 Introduction

There is a broad consensus that emerging applications of database technology and concepts have generated new research domains that are both of scientific interest and substantial economic value. To a large extent, the recent ``Lagunita-II'' report (Database Research: Achievements and Opportunities into the 21st Century, Silberschatz, Stonebraker, and Ullman, eds.) addresses these issues. Beyond the technical issues, there is the necessary matter of explaining to non-technical decision-makers in government, industry, and elsewhere the possibilities opened up by database systems research -- both prospective research and research already completed.

This latter concern leads me to structure my remarks not along technical subdisciplines, but rather into two broad areas: commercial information systems and consumer information systems. The former address the needs of ``back-office'' information systems, including the emerging areas of data mining, data warehousing, decision support, and real-time billing, as well as the large-scale multimedia processing required by video servers or web servers. The latter address the personal needs of users -- consumers and office workers -- and include such areas as workflow, pen-based data, and mobile computing.

2 Commercial Information Systems

Many of the historical achievements of database research (transaction processing, the relational model, etc.) have supported commercial applications. Traditional applications in the commercial domain continue to need research that provides incremental advancement. The key need, however, is in new and qualitatively different applications that address deeper business needs.

2.1 Dealing with Uncertainty and Insecurity

Managing Imperfect Data:

Historically, databases have been large (by contemporary standards) collections of well-structured, well-defined data. Much effort (both in research and in operational support) has gone into ensuring data consistency.

Increasingly, databases must deal with inherently imperfect data. Because the data is not always centrally owned but instead belongs to autonomous organizations or individuals (e.g., the web), consistency, uniformity, etc. are not achievable. Indeed, some organizations may deliberately work against such goals -- and not necessarily for malicious purposes (e.g., information brokers differentiating themselves in the market). Many emerging applications have as their goal enabling people to perform tasks better, though not necessarily perfectly. Data mining applications, web search engines, multi-resolution information systems, data warehouse applications, etc. are examples of such domains.

For these reasons, issues in the management of imperfect, or approximately consistent, data are of great importance. This shows up again for consumer information systems, discussed below. There is an initial base of work in approximate notions of consistency arising from work in CAD databases, sagas, and multidatabases, but results in those areas only begin to address current needs.

Security and Acceptable Risks:

Aside from early work on inferences from statistical queries, database security work focuses on guarantees of access control. The broader information system domain of electronic commerce requires a tradeoff between allegedly perfect security and ``good enough'' security. A review of current best practices in traditional commerce illustrates the point. It is remarkably easy to obtain credit-card numbers, checking-account numbers, etc. -- just consider all the people to whom we give our credit card or a check as part of ordinary business -- yet there is a great hesitancy on the part of the general public to put this data on the net. Although this reluctance is being overcome by guarantees by vendors or network providers, there is a need for further study of economic models of database security in which both costs and fraud exposure can be quantified and compared.

In some sense, this is analogous to the remarks above about consistency -- a need for approximation rather than absolutes.
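
To make the idea of comparing costs against fraud exposure concrete, the following is a minimal sketch of one possible expected-cost model. It is an illustration only: the class SecurityOption, the functions expected_annual_total and cheapest, and all of the numbers are hypothetical assumptions, not figures or methods taken from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityOption:
    """A hypothetical security measure with assumed cost and fraud figures."""
    name: str
    annual_cost: float        # assumed cost of deploying and operating the measure
    fraud_probability: float  # assumed chance that any one transaction is fraudulent
    loss_per_fraud: float     # assumed average loss per fraudulent transaction


def expected_annual_total(option, transactions_per_year):
    """Operating cost plus expected fraud loss for one year."""
    expected_fraud_loss = (
        transactions_per_year * option.fraud_probability * option.loss_per_fraud
    )
    return option.annual_cost + expected_fraud_loss


def cheapest(options, transactions_per_year):
    """Return the option with the lowest expected annual total."""
    return min(options, key=lambda o: expected_annual_total(o, transactions_per_year))


# Illustrative numbers only.
OPTIONS = [
    SecurityOption("basic", annual_cost=10_000, fraud_probability=0.001, loss_per_fraud=100),
    SecurityOption("strong", annual_cost=200_000, fraud_probability=0.0001, loss_per_fraud=100),
]

for volume in (1_000_000, 10_000_000):
    print(volume, cheapest(OPTIONS, volume).name)
```

Under these assumed figures, the weaker option has the lower expected total at one million transactions per year, while the stronger option wins at ten million. The point of the sketch is only that ``good enough'' depends on quantities that can be estimated and compared.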

2.2 Throughput versus Response Time

Whereas performance has traditionally meant transactions per second, most new applications have a stringent responsiveness requirement. Any application that interfaces with a user will benefit from response times under a second -- a time period that is, in human terms, a good approximation of instantaneous. In other cases, the issue may be one of predictable response times, even if they do not appear instantaneous to the user.

Furthermore, the data must be timely. Many interactive applications present data to users who are at least partially aware of the real world being modeled by the data in the database. User-perceived inconsistencies due to a lack of timeliness are at best a source of irritation, and at worst, a source of serious dissatisfaction. Consider, for example, an airline customer trying to reserve a free flight with frequent flyer miles. The customer knows that there are enough miles in the account, courtesy of a recent flight. If that flight is not yet entered into the frequent-flyer database, the airline cannot grant the request -- and the customer ends the interaction unhappy. As the above example illustrates, responsive systems provide a basis through which companies can compete in quality customer service.

Time constraints also arise in systems with deadlines. It may be necessary to complete a task by a specific time for the task to be meaningful (examples include air traffic control, telephone call processing, and certain emergency and health-care applications). A conflicting target of low-cost systems makes it necessary to build software that can meet timing deadlines without requiring excessively high levels of reserve hardware capacity.

These observations lead to the conclusion that the management of real-world time constraints in a variety of forms will be an increasingly important research issue in information systems.

3 Personal Information Systems

The emergence of a huge market in personal information systems for home and/or office has been predicted for a long time and has never yet met expectations. First there was the ``office of the future;'' most recently there was the ``personal digital assistant'' popularly identified with the Apple Newton.

Despite the unfulfilled predictions of the past, it is clear that personal information management is gaining widespread application. Business people routinely travel with computers outside the office. Home financial systems, the Web, personal email accounts, etc. are helping to create personal information bases for a significant fraction of the general public.

Historically, the database research community has focused on commercial rather than personal information management. There are several open problems in this domain that are well suited for consideration by the database research community. A few of these are discussed below.

3.1 Data Consistency and Information Reconciliation

Personal data is by definition owned by an individual. It may consist of read-only data (as in a phone directory or a cached web page) or personally owned data that should be updatable without restriction. Both types of data may become inconsistent. Read-only personal data may be updated at a remote site, while updates generated by the individual user may result in data stored elsewhere becoming out of date.

The above appears to be a straightforward case of replicated data. However, two factors make the problem more challenging. First, access to data must be allowed even if the system is not sure that the data is current (examples abound -- disconnected mobile computers are one class of example). Second, updates must be propagated to copies and divergent versions integrated with as little human involvement as possible.

These two factors lead to challenging research problems. The mobile computing community has begun to address the issues of data management in distributed systems that include often-disconnected mobile computers. The problem of reconciling conflicting versions of data has been considered for special cases. A general solution would require a relatively sophisticated model of the types of data in the database and the set of operations thereon (thus the object-oriented model may be the right starting point).
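
As an illustration of one such special case, the following minimal sketch reconciles two divergent copies of a flat record against their last common version, field by field, and asks for human involvement only when both copies changed the same field differently. The function reconcile, the sample records, and the convention that None means ``field absent'' are all assumptions made for this example. It deliberately ignores the richer type and operation semantics that a general solution would need.

```python
def reconcile(base, local, remote):
    """Three-way, field-level merge of two copies that diverged from `base`.

    Returns (merged_record, conflicting_fields). Assumes records are flat
    dicts and that None is never a legitimate field value.
    """
    merged, conflicts = {}, []
    for field in sorted(set(base) | set(local) | set(remote)):
        b, l, r = base.get(field), local.get(field), remote.get(field)
        if l == r:        # both copies agree (or neither changed the field)
            value = l
        elif l == b:      # only the remote copy changed the field
            value = r
        elif r == b:      # only the local copy changed the field
            value = l
        else:             # both changed it differently: needs a human decision
            conflicts.append(field)
            value = l     # keep the local value until the conflict is resolved
        if value is not None:
            merged[field] = value
    return merged, conflicts


base = {"phone": "555-0100", "city": "Murray Hill"}
local = {"phone": "555-0199", "city": "Murray Hill"}   # user edited the phone
remote = {"phone": "555-0100", "city": "Summit"}       # remote site edited the city

print(reconcile(base, local, remote))
```

Here the two edits touch different fields, so they merge without human involvement. If both copies had changed the phone number to different values, the field would instead be reported in the conflict list.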

Interestingly, both commercial and personal applications require dealing with inconsistency and uncertainty. Are these really just one problem? While I believe there are distinctions, the issue as a whole clearly merits further discussion.

3.2 Ink as a Datatype and Pen-Based Queries

Personal computing environments are likely to make significant use of the pen as an input device. The advantages of the pen from a packaging and human-factors standpoint are such that there will be continued demand for better solutions to the problem of managing handwritten information. It is my view that improved recognition software (or new alphabets more suited to recognition such as Graffiti or Unistrokes) is not a complete answer. The richness of handwriting is lost if only that handwriting that corresponds to ASCII strings is acceptable as input.

There has been considerable research in maintaining handwritten data within the database and using similarity-matching techniques to retrieve data. While some good results have been achieved already, there is clearly room for improvement. Even with improvements in the accuracy of similarity matching, there will always be some degree of error arising from the inherent variability in human handwriting. Thus, data retrievals over handwritten data must necessarily be approximate. Joins over handwritten attributes present a particularly interesting challenge since ``unlikely'' tuples in the join need to be dropped from the result, but the definition of ``unlikely'' is only a probabilistic notion.
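
A minimal sketch of such an approximate join follows. It is a stand-in, not a real ink matcher: each handwritten value is assumed to be encoded as a string of stroke-direction codes (a hypothetical encoding), similarity is computed with Python's general-purpose difflib.SequenceMatcher, and ``unlikely'' is approximated by a fixed score threshold rather than a true probability. The function names and sample data are illustrative assumptions.

```python
from difflib import SequenceMatcher


def ink_similarity(a, b):
    """Similarity in [0, 1] between two stroke-code strings (stand-in metric)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def approximate_join(left, right, threshold):
    """Nested-loop join keeping only pairs whose similarity reaches `threshold`.

    `left` and `right` are lists of (key, stroke_codes). The result is a list
    of (left_key, right_key, score), best matches first.
    """
    result = []
    for left_key, left_ink in left:
        for right_key, right_ink in right:
            score = ink_similarity(left_ink, right_ink)
            if score >= threshold:   # drop "unlikely" tuples
                result.append((left_key, right_key, score))
    result.sort(key=lambda row: row[2], reverse=True)
    return result


# Hypothetical stroke encodings: U, D, L, R are pen-movement directions.
notes = [("note1", "UURDDLLU"), ("note2", "RRRDDD")]
memos = [("memo1", "UURDDLLU"), ("memo2", "UURDDLRU"), ("memo3", "LLLUUU")]

for row in approximate_join(notes, memos, threshold=0.8):
    print(row)
```

Raising or lowering the threshold trades missed matches against spurious ones, which is exactly where the definition of ``unlikely'' enters. A real system would presumably also need an index to avoid comparing every pair.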

Here again, the theme of managing uncertainty in data arises, though in a different setting than those discussed in earlier sections.

Similar issues exist for voice interfaces and spatial information.

4 Conclusion

The management of uncertain or not-fully-consistent data is a critical need in a variety of domains, only a few of which were mentioned here. This is a difficult and under-studied problem that deserves further research. We in the database research community need to address these and other issues, not only in the context of terabyte-sized commercial systems but also in the context of the megabyte-sized problems of personal information systems. The latter, when considered from the standpoint of millions and perhaps billions of users, is indeed a large-scale database problem representing a truly challenging focal point for future research.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.
