Data Warehouses: Your Company's Future in the "Stars"

December 1, 2003

Applied Clinical Trials

Applied Clinical Trials, Applied Clinical Trials-12-01-2003,

Somewhere in a closed conference room, the IT wizards are plotting the data strategy of your company or institution. They are wrestling with the vast amount of raw data that is being collected and the challenges of making that data available as information to help in making decisions.

Somewhere in a closed conference room, the IT wizards are plotting the data strategy of your company or institution. They are wrestling with the vast amount of raw data that is being collected and the challenges of making that data available as information to help in making decisions. After all, what is the point of collecting all of that data if you aren't going to use it?

Clinical development organizations are expert at taking the raw data from a clinical trial and turning it into highly refined data listings and data tables, which are ideally used to support a drug's approval by the regulatory authorities. That is, after all, one of the main functions of a pharmaceutical company. However, the greater value of clinical data can only be unleashed by combining it together with other data for analysis and exploration. The hundreds of clinical trials, involving tens of thousands of patients, contain a treasure trove of information about disease and wellness, and the effects of pharmacological therapy. Now, if only this data could be easily aggregated and analyzed.

Paul Bleicher

The data strategy being plotted by your IT staff hopefully goes far beyond mere clinical data. There is a vast amount of data collected in many disparate systems that, when linked together, can be invaluable in understanding and planning the entire process of drug development, not to mention making the process more efficient, increasing quality, and decreasing cost.

This data comes from clinical data systems, financial systems, clinical trial management systems, randomization systems, project planning systems, laboratory information systems, and even human resource systems. When taken as a whole, the data from these systems can be used to answer questions about the quality, efficiency, and cost of the clinical development process.

The information can be used to support strategic and tactical changes to business operations, to monitor the effects of current or changing processes, and to identify new ways of performing clinical development and clinical trials.

Our data strategists will eventually leave their closed room and present their strategy to us. A barrel-shaped drawing labeled "Data Warehouse" might be in the middle of a complicated diagram. This, they will tell us, will be the way that we can combine all of that data from all of the different systems. Through the data warehouse we will be able to extract the aforementioned gold mine of information-the analytic and summary data that helps us make and assess the tactics and strategies of our company and its processes.

It is a grand vision, but will it work? There is no good answer to this question beyond "it depends." Certainly, though, its success is likelier if the end users of the data warehouse understand the purpose and limitations of such a project, provide knowledgeable input over the two to three years of its design and construction, and continue the input during the ongoing process of revision. To participate in this process, it is helpful to understand a little bit about databases in general and database design in particular.

From flat...

Most of us are unaware of the many databases we interact with on a daily basis. Everything from buying a product in a retail store to reading a Web page or our email involves hundreds of database transactions and database reporting. The marvel of modern computing is that the databases are so isolated from the end users' experiences that we're typically not even aware of their existence.

When those of us who are not in a data management role speak of a database, we often envision a big spreadsheet or table. The address list for a party, for example, can be created where each row is a person and each column contains one of their attributes, such as name, street address, town, zip code, and phone number. This type of database, known as a "flat file" database, is very valuable and useful for specific purposes. Databases of this sort can contain thousands of rows and many columns. In skilled hands, very useful analytic pivot tables and reports can be produced from flat file databases, which can often provide important business information.

There are, unfortunately, many limitations of flat file databases. First, they very quickly get too large to manipulate practically. Searching, sorting, or reporting a large flat file database can take a very long time. When the data being collected are more complex, the flat file database becomes untenable. For example, if you were the manager of a large shopping center, you might want to keep a database of stores, including the category of store, the name, phone number, and address. It is easy to capture this in a simple row and column database-each row would represent a store, and columns would contain name, address, phone, and type of store. Unfortunately, you would be re-using a particular store category for several different stores (e.g. bookstore), increasing the potential for errors (like a "bookstove").

Now, imagine that you also wanted to track employees and information about sales events. Do you add a new column for each employee, labeled Employee1, Employee2, and so on? Obviously, this would be very unwieldy. Some stores have many, many employees; others have only a few. There would be many blanks in the table. In addition, such a database would offer no way to track employee information such as addresses and birthdates.

Another approach might be to make a repeat row for each employee. This would create many rows for some companies. If you now wanted to collect information on sales events, you would have to have a separate row for each of these, for each employee. The number of rows would rapidly expand to an unreasonable number. In addition, this approach would create a lot of repeated information. Again, the same issue of repeated typing of locations or companies comes up, and the possibility of mistyping, errors, and other data corruption.

...To relational

The solution for this dilemma is known as the relational database. In essence, a relational database is a series of database tables designed to maximize the efficiency of the database for a particular task.

Dr. E.F. Codd first developed the relational data model in 1969 in an IBM research report. He continued working on the development of relational databases until the mid-'80s, when he published a list of 12 rules that concisely define an ideal relational database. These rules have provided a guideline for the design of all relational database systems ever since. There are certainly many other types of transactional database models in use today, but the vast majority of them are relational.

To understand the relational database, consider our new store database, which might have separate flat file databases called tables for Store, Employee, Sale, and Category. In this relational database, each table has a unique name or "primary key" that defines each entry (for example, Store1, Store2, Store3). A table can be "related" to another table by linking to the other table's primary key as a column-these links are called "foreign keys." For example, the Employee table would have primary keys of Employee1, Employee2, Employee3, with names and addresses for each listing. The Employee table would also have a column for StoreID as the foreign key, which would correspond to the store that employs them. Since one store has many employees, but each employee only works at one store, the two tables are related by the "one to many" relationship. This description is highly simplified, but describes the essence of a relational database.

Data entry screens may match the particular tables in the database, making data entry very efficient. A modern database management system (DBMS) can also use roll-back functions that ensure that a multifield data entry transaction is entirely completed. Without this, partial data entry from a particular transaction can lead to corrupted databases and lost data.

Data retrieval from a transaction-oriented relational database can be complex, involving the careful "joining" of two or more tables. Fortunately, there is a common language (alas, with many dialects) that is used for adding, retrieving and manipulating data in databases. The language is known as SQL, for Structured Query Language. Learn SQL and you can get data from pretty much any professional database.

The power of the relational database is many-fold. First, a properly designed database should only require the entry of particular data one time. This prevents errors from redundant data entry. In addition, the relational database is set up to maintain referential integrity. If a store is deleted from the database, all the linked tables that reference that store can also be removed. Finally, the relational database is designed to perform efficiently as part of the day-to-day operations of a business, enabling data entry and data reporting.

As mentioned, Dr. Codd described a series of rules for "normalization" that are used to optimize databases. The rules formalize the process of eliminating redundancy in data, and ensure the presence of primary keys and dependency of all data on these keys. A database that is normalized to the "third normal form (3NF)" is considered to have a good design. 3NF databases typically have little or no repeated data.

For example, in our database many stores share the same address. Therefore, all the stores at the same address would have the same "AddressID" entry. A separate Address table would contain a single instance of each address. In this way, the address wouldn't be repeated over and over again in the Store table. Practical considerations sometimes force deviations from this, but most good transactional databases have a 3NF form.

Once we have all the data entered into a relational database, how do we get the data out? SQL queries with proper table joins will generate views of the entered data, or even nicely formatted and summarized reviews of the dataset. Most operational databases have a variety of prewritten and often flexible reports that allow users to view data.

However, consider a theoretical database of this sort with data on every shop in every mall in the country, or even the world. Imagine this database also contains tables with the sales and expenses for each store, for each day, for years.

The size of such a database would be staggering. A query that summarizes the quarterly sales and expenses for each of the different store categories, and allows viewing of the data for each of the different regions of the country and world, could require an immense amount of computing power to search through the tables, find and summarize the data, and report it in summary form. The relational database is superb for complex transactional data processes, but is quite limited in complex data reporting and analysis.

Data Warehouses

Sophisticated reporting and analysis is where the data warehouse comes in. The data warehouses (and the smaller, more targeted data marts) are database designs, created to answer questions about business processes and results using massive amounts of data.

Many people think of data warehouses as a "repository" for all of your business data, available for browsing or query development, but this is misleading and typically incorrect. Rather, it is best to think of a data warehouse as a database built to answer specific and frequently asked questions about business processes and results. The data warehouse typically involves the collection of data from many disparate transactional databases involved in the conduct of business.

From the beginning, the data warehouse is designed with a very significant understanding of the questions that it is going to be asked. The table structure is typically constructed in a distinct "star" form, or data cube. A complex query in a relational database might require many table joins and hence very poor performance.

The star schema allows this query to be accomplished efficiently and quickly. This is because the tables in the data warehouse are purposefully "denormalized," and usually contain much repeated data, violating Dr. Codd's rules.

An example might be useful. In our theoretical, international database of shopping malls we will construct a data mart, which is a small, purpose-built data warehouse, for answering our question about sales and expenses at different store categories summarized by region of the world. The center of our "star" is a "fact table" which represents the business process that is being measured. The fact table contains a primary key (fact1, fact2, fact3), the core data facts, and secondary keys for a series of surrounding "dimension" tables, thus the star design. The core data facts of daily sales or units sold are typically numeric values that can be manipulated arithmetically (summed, multiplied, and divided).

In our example, the fact table might contain a FactID, a quarterly sales amount, a quarterly expense amount, and secondary IDs that serve as links to a store table, and a category table, and others. While the center of the star, the fact table, is normalized, each of the dimension tables are not.

A dimension table for Store, for example, might include store name, postal code, city, state, region, country, and continent. This table would have many repeat values in the columns; the state may be repeated in hundreds or thousands of rows. The denormalized form of the dimension tables is purposely created to allow rapid and efficient SQL queries. A nontechnical user can quickly see each attribute of a dimension table can be used to "sum up" the data. In our store table, we can look at quarterly sales and expenses by store, postal code, country, region, state or city. No complex joining of tables needs to happen-the design of the database is created to support this.

Similarly, a "category" dimension table may contain category (dress store, bookstore), product (hardcover, paperback, blouse), and other factors. Again, sales can be summed by product, category, and other factors.

Several important things are worth noting about data warehouses. First, the data are drawn from the transactional databases of a company. They typically don't have to be "real time" so they are often uploaded, in bulk, on a regular daily or weekly basis. The work of cleaning, aggregating, and manipulating the data for uploading to a data warehouse can be significant.

To use a simple example, the combining of two databases that use different terminologies for male and female (M and F versus 1 and 2) can require modification of all the records in one of the databases. When this must be done for many different fields, it can become a very large task. In some cases the adjustments can be automated, but they always requires a lot of thought and planning.

Since the data are moved from an existing database, they don't typically have to be updated individually, can have repeating identical fields, and don't have to have "referential integrity" built in. A data warehouse is NOT a data repository-it is a special purpose analysis database designed for efficient reporting.

Nevertheless, the types of queries and reports that can be derived from a data warehouse depend a great deal on the planning and design of the database before it is built or populated. The structure of the tables determines the types of reports that can be generated, and so must be designed to contain the appropriate data. For efficiency, this data is very often aggregated-monthly updates in a data warehouse may be combined from individual data transactions before it gets sent. Alternately, a data warehouse may have fact tables for weekly, monthly, and quarterly facts, each calculated from a single daily fact table. In either case, the aggregated dimension table makes it possible to get rapid access to data without excessive processing or database queries.

If implemented and managed well, a data warehouse can provide very important information for smarter decisions from concise understanding of business processes. In clinical data, a data mart for a particular question about a particular therapeutic area may use the wealth of data available to better understand a drug or drug class. However, those who plan to make use of this data must understand that the data may not be as clean and ordered as is typical in a clinical database.

If you are asked to participate in a data warehouse project, make sure you understand what you are getting into. The process is going to take several years to create and implement. The project will certainly expand beyond your initial expectations, and will involve a substantial amount of data cleaning and quality control of data sources.

Finally, your end users need to be properly trained and motivated to use the data warehouse. It is quite common for data "consumers" to be uncomfortable with digging into a data warehouse.

Without knowledgeable and tenacious data consumers, the data warehouse won't be nearly as valuable. However, with carefully defined goals and a committed end-user community, it can provide invaluable information and knowledge from a previously untapped resource-your data.

Related Content:

News