V

Exam essay: Data warehousing

January 5, 2009

In this article, I discuss data warehouses: what they are, how they compare to operational databases, and how they are designed, implemented and maintained.

Exam essay: Data warehousing

By: Chris Malek

Jan 05 2009

tags:

Category: Articles, Exam Essays

1 Comment »

Data warehousing: what is it, why do we want to do it, how is it done?  How do data warehouses compare with operational databases? What do we want to consider when doing so, and what are our options in terms of implementation?

In this article, I discuss data warehouses: what they are, how they compare to operational databases, and how they are designed, implemented and maintained.

An operational database is one which is used by the enterprise to run its day to day operations.  They are created to support fast transaction processing, with frequent updates.  Speed is key to operational databases.  They typically are used by clerical staff, and are on the order of megabytes of data to gigabytes.  Database consistency is very important to operational databases, and consistency checks and constraints are rigidly enforced.  They contain the most current set of data applicable to running enterprise operations.  These are our sales and inventory databases.

A data warehouse differs from this in many ways.  They are used by management for making decisions, watching trends, and running reports.  They are typically used offline, have few users and are very large: gigabytes to terabytes.  They contain historical data, are read only, and are added to but rarely or never updated (the rows in the database are not changed, I mean).  The data in the data warehouse is time sensitive – each row is the warehouse is timestamped so that trending of data versus time can be done.   The kinds of queries that are run against data warehouses are complex, containing many WHERE, JOIN and UNION clauses.   These are decision support databases that are used to make strategic decisions about the enterprise.

Organizations make data warehouses in order to gain insight into trends, anomalies and exceptions in enterprise data that affect the business strategically.  This kind of analysis and reporting is called OLAP: on line analytical processing.   Management uses OLAP tools on the data warehouse to pull reports and make decisions.   This would not be possible to do with an operational data store, since the operational data store contains data that is only true at the current time.  For instance, an operational database which describes inventory would only describe what is in stock currently, not what was in stock last week.  Data warehouses would contain snapshots of the operational data for many instances of time, and so would show inventory both as of now and as of last week.

At this point, I want to distinguish between two terms: data warehouse and data mart.  A data warehouse is used to store all the data that an enterprise would need to make decisions about any part of itself.  A data mart contains only a part of the data that the organization uses.   Data marts can be either implemented as a subset or building block for a data warehouse, or as an independent entity.

There are two design aspects to consider when we think of data warehousing: how do we get data from the operational databases into the data warehouse, and how do we design the data warehouse itself.

Getting data from the operational database to the data warehouse occurs via a process called ETL: extraction, translation and loading.  Data are extracted periodically from the operational database into a temporary holding area, they are transformed – aggregated, massaged, timestamped, converted from the data definition of the operational database to that of the data warehouse – and then are loaded into the data warehouse.  This loading typically happens daily, and can involve many operational databases.   It is a time consuming operation, and optimizing ETL is an ongoing research topic.

There are two issues to consider when designing data warehouses: enterprise issues and technical issues.   From an enterprise perspective, implementing a data warehouse is an immense project that can take months or years to complete.  There are two models for doing this: the Bruce Kimball bottom-up method and the Bill Inmann top-down method.  In the bottom-up method, the organization starts by identifying a business need for trending or reporting and builds a data mart to satisfy that one need.  This can be accomplished quickly. Over time, more data marts are created and more complicated questions can begin to be asked which span data marts.  Eventually all the relevant enterprise data is implemented as data marts within the data warehouse.  The thing that makes this work is that there must be an organizational standards body that defines and enforces a common data model as data marts are created and integrated.  What I mean by this is that different operational databases may use different data definitions of the same kind of object: a person for example.   The role of the standards body is to define a common person data definition, and cause (during ETL) data from the various operational databases to be transformed from their native definition to the common one. This is important because it makes reporting far simpler and prevents data duplication throughout the warehouse.

In the top-down method, we identify all the core data we need from the organization first (without identifying relevant business purposes for that data yet) and implement that as a bottom layer of normalized tables.  We then build data marts on top of that layer.   In the bottom-up approach, we define the common data definition with an iterative approach, while in the top-down approach we define the common data definition all at once before we start.  The first method may cause us to have to rework the warehouse as we go along, while the latter incurs substantial startup cost.

When writing of technical considerations when designing data warehouses, I mean two things: logical design, and physical design.   Data warehouses can huge huge amounts of data, and queries on that data can be complex, with complicated WHERE clauses, many JOINs, UNIONs, and aggregations and computations involved.  The challenge in data warehousing is in designing the database such that such queries will finish in a reasonable amount of time.  Ideally, this is less than 10-15 seconds when doing interactive OLAP queries (the time it takes for someone to lose their concentration and flow), longer for canned reports (minutes, though, not hours).   At the logical level, the industry is split into two camps: star schemas and snowflake schemas.  In both designs, we organize a set of tables around a central table, called the fact table.  The fact table contains the things we want to run queries on, and the surrounding tables (called dimensions), bound to the fact table through foreign key constraints, contain data which describes the rows in the fact table.   In the star schema design, the dimension tables are denormalized to reduce the number of JOINs necessary in queries on the fact table, while in the snowflake schema the dimension tables are normalized to reduce data duplication and allow reuse of those tables with other fact tables.

At a physical level, data warehouses tend to be heavily indexed and partitioned to put the most used data on faster storage.   There are other options available as well.  Multi-dimensional databases attempt to speed queries which must aggregate or derive data in the fact table by storing pre-calculated aggregate shadow tables in the database which the query optimizer will use when appropriate.  For instance, a common set of aggregates are per-time aggregates, pre-calculating per day, per week, per month, per quarter and per year sums of a fact table.   One can implement this within a relational database, in which case it is called a ROLAP database.  There are also special products which implement more effectively these multi-dimensional functionalities by storing data in non-relational form: these are called MOLAP databases.

When we talk about the value of IT as being that of providing value now and opening new options for value in the future, data warehousing is a perfect example of that, and it is clear from the press and research around data warehousing that both the business and research communities understand this.   Data warehouses are typically designed with specific questions in mind, but as data grows, the warehouse gains value because there are new questions that can be asked if only the enterprise is perceptive enough to see them.  Those questions and their answers can lead to new opportunities for creating competitive advantage.  Amazon is a perfect example of this: they have gathered over the last decade a tremendous amounts of data on buying habits of their customers, the trustworthiness of different used book vendors and the popularity of various books and items, and have continually leveraged that data to emphasize their brand (raising barriers to entry in their market), increase switching costs for their users (because having books recommended to you is helpful, but you have to work to let Amazon know what you like by) thus reducing the threat of substitution.

There is a large amount of research into data warehousing in the database community on several fronts: data mining (how do we use our data warehouse effectively), recommender systems (really big now), data warehouse design and construction (such as what Kimball and Inmann do), query optimization, multi-dimensional databases and their implementation.

One Response to “Exam essay: Data warehousing”

  1. My self-critique:

    This is too long, and it took me too long to write, probably well over an hour (I had to write it in two sittings, though). But I don’t know what I could cut out. A lot of the problem was that the knowledge was slow to come into my memory, so I had to write a while to get it to come. In that case, it’s good I wrote it.

    I should emphasize that OLAP is about ad-hoc retrieval. I think I confuse the issue by saying that data marts are designed around business goals, specifically around specific business questions or types of questions. Perhaps that’s how you start, but eventually you will need to support ad-hoc data retrieval and analysis. I need to include that.

    The loss of attention time is actually 5 seconds, not 10 – 15 seconds. From HCI research. I didn’t include the evolution of data warehousing within an enterprise, but I hit everything else pretty much according to my notes.

    If I want to take the Amazon example of leveraging data in data warehouses to create future business options further, I can also talk about changing the nature of competition. Anyone who competes with Amazon has to compete at least in price and in recommender quality, and with used book vendor/buyer support and population (rating system). You can talk about changing the nature of supplier relationships in the context again of suppliers because being able to funnel people looking for particular things to similar things that a supplier might also carry is a big plus for the supplier and encourages them to be more dependent on Amazon.

    I did throw in some general themes of research this time; I’m assuming that this is enough. Simply due to space considerations, I don’t know how much more detail I could go into.

Leave a Reply