ACM Computing Surveys 28A(4), December 1996, http://www.acm.org/surveys/1996/LohmanComplex/. Copyright © 1996 by the Association for Computing Machinery, Inc. See the permissions statement below.
Products have focussed primarily on the traditional applications: transactions and complex queries of small, uniform records, composed primarily of character and numeric business data. Performance of these applications has been accelerated by improvements in both hardware and the DBMS software itself. The progress can be seen in benchmarks standardized by the Transaction Processing Performance Council: the TPC-C benchmark for transactions now boasts tens of thousands of transactions per minute for under $200 per transaction per minute, and the TPC-D benchmark for complex, decision-support queries, new last year, is likely to see the same progress. Most of the TPC records are held by arrays of inexpensive processors and disks, configured either as shared-memory Symmetric Multi-Processors or as shared-nothing Massively-Parallel Processors. Parallelism is not only here, it's here to stay! Standardization has also facilitated the interconnection of DBMSs from different vendors through so-called middleware offered by all the major vendors. Connectivity to applications such as spreadsheets and text applications has also improved the usability of DBMSs, as have front-end GUIs that prompt users to formulate queries in a more natural way than SQL. Object-oriented DBMSs broadened the data types available to applications, but these products have remained a niche market compared to the relational juggernaut. Only in the last few years have object-relational features appeared in relational products, and even so customers seem hesitant to exploit these features until the SQL3 proposal becomes a standard.
Despite this progress, DBMSs seem very much in their infancy, having taken only the very simplest of initial steps toward storing the world's data. The structure of nested types in object-relational systems makes optimization a nightmare. Continuous types like voice, image, and video are even less likely to have methods to do anything more than store and retrieve them, much less be optimized. Predicates on such data are usually applied to the human-derived attributes that characterize that data (e.g., content), rather than the data itself. But what can we expect? Even humans don't have a consistent and precise vocabulary for characterizing the contents of an image or a sound. Optimizers will be increasingly challenged by the horribly complex queries generated by GUIs and query generators, which seem to favor LIKE predicates (because they support wildcards) and Boolean combinations of nested subqueries. User-defined functions, now available in object-relational systems such as IBM's DB2 for Common Servers, must be added to the list of expensive operations (alongside aggregation, joins, and sorting) whose correct sequencing significantly affects query execution time and hence must be modeled by today's query optimizers. And as servers add all these features and run on massive arrays of parallel processors, managing them well and tuning their performance only gets harder to do manually, much less automate.
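To make the sequencing concern concrete, here is a minimal sketch of why the order of expensive predicates matters. It is an illustration only, not a description of how DB2 or any other product models user-defined functions. The predicate names, per-row costs, and selectivities are invented for the example, and the predicates are assumed to be independent. Under that assumption, a well-known heuristic is to apply predicates in ascending order of rank, (selectivity - 1) / cost, so that cheap and highly selective predicates run before expensive ones.

```python
from itertools import permutations


def expected_cost(preds):
    """Expected per-row cost of applying predicates in the given order.

    Each predicate is (name, cost_per_row, selectivity). A predicate is
    only evaluated on rows that survived all earlier predicates.
    """
    total, surviving = 0.0, 1.0
    for _name, cost, selectivity in preds:
        total += surviving * cost
        surviving *= selectivity
    return total


def order_by_rank(preds):
    """Sort by rank = (selectivity - 1) / cost, ascending.

    Assumes independent predicates; real optimizers must also cope with
    correlation, joins, and uncertain cost estimates.
    """
    return sorted(preds, key=lambda p: (p[2] - 1.0) / p[1])


# Hypothetical predicates: (name, cost per row, fraction of rows kept).
PREDICATES = [
    ("image_udf", 100.0, 0.9),        # expensive user-defined function
    ("like_wildcard", 5.0, 0.1),      # LIKE with wildcards
    ("cheap_comparison", 1.0, 0.5),   # simple column comparison
]

if __name__ == "__main__":
    best = order_by_rank(PREDICATES)
    print([name for name, _, _ in best], expected_cost(best))
    # Brute-force check over all orderings, feasible only for tiny inputs.
    print(min(expected_cost(p) for p in permutations(PREDICATES)))
```

With these made-up numbers, running the user-defined function first costs more than ten times as much per row as running it last, which is the kind of difference an optimizer's cost model has to capture.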
The future poses significant new challenges to DBMSs on many fronts. All of them are "up the food chain", but have consequences for database servers as well.
Now that transactions are commonplace and fast, corporations have amassed huge databases of them that represent a vast, untapped source of information about business and consumer trends that companies are understandably eager to analyze and exploit. This has fueled interest in data mining, on-line analytical processing (OLAP), and data warehousing, all techniques seeking to organize data logically and to distill information from the data. These techniques, in turn, have revived interest in efficient ways to process enormous databases with batch techniques, specialized algorithms for determining associations, and access methods such as multi-dimensional indexes for spatial and fuzzy searching. Visualizing the mined information with graphics tools, and even navigating through it with virtual reality, will be the next step toward making this information easily grasped, but will further increase the demands upon the DBMS to store and retrieve the supporting graphical information efficiently.
Scientific and geographic applications still aren't supported well by traditional DBMSs. Besides not supporting the obvious data types (e.g., arrays) and methods (e.g., distance) needed by these applications, traditional DBMSs are based upon the fundamental assumption that the schema is fixed and the data varies, through transactions. But scientific data remains fixed -- a scientist adds to but never changes data! -- and the schema may change as the scientist tests various hypotheses to explain the data. If you think about it, data mining and OLAP are doing something quite similar! Current systems are very weak at modifying the underlying data model, or permitting rows of a table to have varying schemas, while keeping the data fixed. This turns all our concepts about DBMSs inside out! A notable exception is Lotus Notes, which effectively allows each row to have a different schema.
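The inversion described above, fixed data with a varying schema, can be sketched in a few lines. This is a toy illustration under stated assumptions, not a model of Lotus Notes or of any DBMS: the class name AppendOnlyStore, the field names, and the two "hypothesis" schemas are all hypothetical. Rows are only ever appended, each row may carry different fields, and a schema is just a set of column definitions applied when the data is read.

```python
class AppendOnlyStore:
    """Rows may carry different fields; existing rows are never modified."""

    def __init__(self):
        self._rows = []

    def append(self, **fields):
        # Store a copy so later changes by the caller cannot alter the row.
        self._rows.append(dict(fields))

    def view(self, schema):
        """Apply a schema (column name -> function of the raw row) at read time."""
        return [
            {column: fn(row) for column, fn in schema.items()}
            for row in self._rows
        ]


store = AppendOnlyStore()
store.append(site="A", temp_c=20.0)
store.append(site="B", temp_c=19.0, humidity=0.40)
store.append(site="C", pressure_hpa=1013)  # no temperature recorded

# Hypothesis 1: look at the observations as Fahrenheit temperatures.
schema_v1 = {
    "site": lambda r: r["site"],
    "temp_f": lambda r: r["temp_c"] * 9 / 5 + 32 if "temp_c" in r else None,
}

# Hypothesis 2: same data, different schema; nothing stored is rewritten.
schema_v2 = {
    "site": lambda r: r["site"],
    "has_humidity": lambda r: "humidity" in r,
}

if __name__ == "__main__":
    print(store.view(schema_v1))
    print(store.view(schema_v2))
```

The point of the sketch is only that changing the schema here is cheap and leaves the stored observations untouched, whereas a conventional fixed-schema table would require altering the table definition and possibly rewriting rows.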
Integration of applications such as computer-aided design and manufacturing gives rise to workflow applications that require long and nested transactions, better integration of graphical data types, real-time control, and input from automated sensors. There has been research on these topics for many years, but little of that research has found its way into the mainstream DBMS products.
Lastly, integrating DBMSs with the World-Wide Web will significantly challenge our ability to organize and search data, much if not most of it stored in a super-heterogeneous tangle of loosely-organized files rather than databases. Our search tools must become much more sophisticated and adaptable to the varying formats they find. Two projects that are addressing this problem are the TSIMMIS project at Stanford University [Papakonstantinou 1995] and the Garlic project at IBM Almaden Research Center [Carey 1995]. Fluid schemas, and likely no schemas at all, will challenge our ability even to understand the semantics of a query, much less to search the database efficiently. Mobile and sometimes disconnected operation poses all sorts of interesting problems for reliability, replication, and consistency with disconnected databases. As the user's system becomes increasingly "thin" to keep it cheap and highly portable, more layers of intermediate servers are likely to create a hierarchy of servers, not just the simple client-server model of today. This in turn compounds the problem of what to cache where, and how to keep all the cached copies consistent while minimizing the time to access the most frequently referenced data.
In contrast to the relative uniformity, simplicity, and standardization of the past, the future promises a wealth of heterogeneity and complexity, as more diverse applications come together and need to store and find the data that they require. Will current systems be able to adapt to all this additional complexity? Surely our existing meta-data must expand tremendously to deal with all this heterogeneity. We have just begun...
[Carey 1995] Carey, M., Haas, L., Schwarz, P., et al., "Towards Heterogeneous Multimedia Information Systems", Proc. of the Intl. Workshop on Research Issues in Data Engineering, March 1995.
[Papakonstantinou 1995] Papakonstantinou, Y., Garcia-Molina, H. and Widom, J., "Object Exchange Across Heterogeneous Information Sources", Proc. of the IEEE Intl. Conference on Data Engineering, 1995, pp. 251-260.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.
lohman@almaden.ibm.com