WHAT IS A DATA WAREHOUSE

by W H Inmon

Copyright © 1995 Prism Solutions, Inc.

 

Data warehouse is the center of the architecture for information systems in the 1990s.

Data warehouse supports informational processing by providing a solid platform of integrated, historical data from which to do analysis.

Data warehouse provides the facility for integration in a world of unintegrated application systems.

Data warehouse is achieved in an evolutionary, step-at-a-time fashion.

Data warehouse organizes and stores the data needed for informational, analytical processing over a long historical time perspective.

There is indeed a world of promise in building and maintaining a data warehouse.

 

What then is a data warehouse?

 

A data warehouse is a:

•subject-oriented, •integrated, •time-variant, •nonvolatile

collection of data in support of management's decision making process, as shown in Figure 1.

 

 

 

Figure 1: What is a database warehouse?

The data entering the data warehouse comes from the operational environment in almost every case.

The data warehouse is always a physically separate store of data transformed from the application data found in the operational environment.

 

This bookish definition of a data warehouse deserves a full explanation because there are some important issues and subtleties underlying the characteristics of a warehouse.

 

 

 

Orientation

 

The first feature of the data warehouse is that it is oriented around the major subjects of the enterprise. The data-driven, subject orientation is in contrast to the more classical process/functional orientation of applications, which most older operational systems are organized around. Figure 2 shows the contrast

between the two types of orientations.

 

 

 

 

Figure 2: The data warehouse has a strong subject orientation

 

 

The operational world is designed around applications and functions such as loans, savings, bank card and trust for a financial institution.

 

The data warehouse world is organized around major subjects such as customer, vendor, product and activity.

The alignment around subject areas affects the design and implementation of the data found in the data warehouse.

Most prominently, the major subject areas influence the most important part of the key structure.

 

The application world is concerned both with database design and process design.

The data warehouse world focuses on data modeling and database design exclusively.

Process design (in its classical form) is not part of the data warehouse environment.

 

The differences between process/function application orientation and subject orientation show up as a difference in the content of data at the detailed level as well. Data warehouse data excludes data that will not be used for DSS processing, while operational application-oriented data contains data to satisfy immediate functional/processing requirements that may or may not be of use to the DSS analyst.

 

Another important way in which the application-oriented operational data differs from data warehouse data is in the relationships of data. Operational data maintains an ongoing relationship between two or more tables based on a business rule that is in effect. Data warehouse data spans a spectrum of time and the relationships found in the data warehouse are many. Many business rules (and correspondingly, many data relationships) are represented in the data warehouse between two or more tables.

 

From no other perspective than that of the fundamental difference between a functional/process application orientation and a subject orientation, there is a major difference between operational systems and data and the data warehouse.

 

 

 

Integration

Easily the most important aspect of the data warehouse environment is that data found within the data warehouse is integrated. ALWAYS. WITH NO EXCEPTIONS. The very essence of the data warehouse environment is that data contained within the boundaries of the warehouse is integrated.

 

The integration shows up in many different ways - in consistent naming conventions, in consistent measurement of variables, in consistent encoding structures, in consistent physical attributes of data, and so forth.

 

Contrast the integration found within the data warehouse with the lack of integration found in the applications environment, and the differences are stark, as shown by Figure 3.

 

 

 

 

 

Figure 3

Figure 3: When data is moved to the data warehouse from the application-oriented operational environment, the data is integrated before entering the warehouse

 

 

Over the years the different applications designers have made numerous individual decisions as to how an application should be built. The style and the individualized design decisions of the application designer show up in a hundred ways.

In differences in encoding.

In differences in key structures.

In differences in physical characteristics.

In differences in naming conventions, and so forth.

The collective ability of many application designers to create inconsistent applications is legendary. Figure 3 shows some of the most important differences in the ways applications are designed.

 

Encoding - application designers have chosen to encode the field GENDER in different ways. One designer represents GENDER as an "M" and an "F." Another application designer represents GENDER as a "1" and a "0." Another application designer represents GENDER as an "x" and a "y." And yet another application designer represents GENDER as "male" and "female." It doesn't matter much how GENDER arrives in the data warehouse. "M" and "F" are probably as good as any representation. What matters is that whatever source GENDER comes from, it must arrive in the data warehouse in a consistent integrated state. Therefore when GENDER is loaded into the data warehouse from an application where it has been represented in other than an "M" and "F" format, the data must be converted to the data warehouse format.

 

Measurement of attributes - application designers have chosen to measure pipeline in a variety of ways over the years. One designer stores pipeline data in centimeters. Another application designer stores pipeline data in terms of inches. Another application designer stores pipeline data in million cubic feet per second. And another designer stores pipeline information in terms of yards. Whatever the source, when the pipeline information arrives in the data warehouse it needs to be measured the same way.

 

As shown in Figure 3, the issues of integration affect almost every aspect of design - the physical characteristics of data, the dilemma of having more than one source of data, the issue of inconsistent naming standards, inconsistent date formats, and so forth.

 

Whatever the design issue, the result is the same - the data needs to be stored in the data warehouse in a singular, globally-acceptable fashion even when the underlying operational systems store the data differently.

 

When the DSS analyst looks at the data warehouse, the focus of the analyst should be on using the data that is in the warehouse, rather than on wondering about the credibility or consistency of the data.

 

 

 

 

 

 

 

 

 

Time Variancy

 

All data in the data warehouse is accurate as of some moment in time. This basic characteristic of data in the warehouse is very different from data found in the operational environment. In the operational environment data is accurate as of the moment of access. In other words, in the operational environment when you access a unit of data, you expect that it will reflect accurate values as of the moment of access.

 

Because data in the data warehouse is accurate as of some moment in time (i.e., not "right now"), data found in the warehouse is said to be "time variant." Figure 4 shows the time variancy of data warehouse data.

 

The time variancy of data warehouse data shows up in several ways. The simplest way is that data warehouse data represents data over a long time horizon - from five to ten years. The time horizon represented for the operational environment is much shorter - from the current values of today up to sixty to ninety days. Applications that must perform well and must be available for transaction processing must carry the minimum amount of data if they are to have any degree of flexibility at all. Therefore operational applications have a short time horizon, as a matter of sound application design.

 

 

 

 

 

 

Figure 4

 

 

The second way that time variancy shows up in the data warehouse is in the key structure. Every key structure in the data warehouse contains - implicitly or explicitly - an element of time, such as day, week, month, etc. The element of time is almost always at the bottom of the concatenated key found in the data warehouse. On occasions, the element of time will exist implicitly, such as the case where an entire file is duplicated at the end of the month, or the quarter.

 

The third way that time variancy appears is that data warehouse data, once correctly recorded, cannot be updated. Data warehouse data is, for all practical purposes, a long series of snapshots. Of course if the snapshot of data has been taken incorrectly, then snapshots can be changed. But assuming that snapshots are made properly, they are not altered once made. In some cases it may be unethical or even illegal for the snapshots in the data warehouse to be altered. Operational data, being accurate as of the moment of access, can be updated as the need arises.

 

 

Nonvolatile

 

The fourth defining characteristic of the data warehouse is that it is nonvolatile. Figure 5 illustrates this aspect of the data warehouse.

 

 

 

 

 

Figure 5

 

Figure 5 shows that updates - inserts, deletes, and changes - are done regularly to the operational environment on a record-by-record basis. But the basic manipulation of data that occurs in the data warehouse is much simpler. There are only two kinds of operations that occur in the data warehouse - the initial loading of data, and the access of data. There is no update of data (in the general sense of update) in the data warehouse as a normal part of processing.

 

There are some very powerful consequences of this basic difference between operational processing and data warehouse processing.

At the design level, the need to be cautious of the update anomaly is no factor in the data warehouse, since update of data is not done.

This means that at the physical level of design, liberties can be taken to optimize the access of data, particularly in dealing with the issues of normalization and physical denormalization. Another consequence of the simplicity of data warehouse operation is in the underlying technology used to run the data warehouse environment.

Having to support record-by-record update in an on-line mode (as is often the case with operational processing) requires the technology to have a very complex foundation underneath a facade of simplicity.

The technology supporting backup and recovery, transaction and data integrity, and the detection and remedy of deadlock is quite complex.

And unnecessary for data warehouse processing.

 

The characteristics of a data warehouse - subject orientation of design, integration of data within the data warehouse, time variancy, and simplicity of data management - all lead to an environment that is VERY, VERY different from the classical operational environment.

 

The source of nearly all data warehouse data is the operational environment. It is a temptation to think that there is massive redundancy of data between the two environments. Indeed, the first impression many people draw is that of great redundancy of data between the operational environment and the data warehouse environment. Such an understanding is superficial and demonstrates a lack of understanding as to what is occurring in the data warehouse. In fact there is a MINIMUM of redundancy of data between the operational environment and the data warehouse environment.

 

Consider the following:

•data is filtered as it passes from the operational environment to the data warehouse environment. Much data never passes out of the operational environment. Only that data that is needed for DSS processing finds its way into the data warehouse environment. •the time horizon of data is very different from one environment to the next. Data in the operational environment is very fresh. Data in the warehouse is much older. From the perspective of time horizons alone, there is very little overlap between the operational and the data warehouse environments. •the data warehouse contains summary data that is never found in the operational environment. •data undergoes a fundamental transformation as it passes into the data warehouse. Figure 3 illustrates that most data is significantly altered upon being selected for and moving into the data warehouse. Said another way, most data is physically and radically altered as it moves into the warehouse. It is not the same data that resides in the operational environment from the standpoint of integration.

 

 

 

In light of these factors, data redundancy between the two environments is a rare occurrence, resulting in less than 1-% redundancy between the two environments.


Email: jmo@asaptt.com