Gartner says that data warehouse is the next big thing. Oracle is big into this space. It is something that many customers are interested in. It begs the questions, what the heck is data warehouse and why should you care?
The key words that are usually associated with data warehouse are consolidation, business intelligence, single server, data mining, and analytics. What I can gather is that data warehousing is a mechanism for gathering key metrics about the business on one server so that you can analyze trends, changes, or requirements of your business. Isn’t this what your database does for you anyway? Well, not really. You use the database to store HR data for payroll and benefits. You use a different instance to store inventory and accounts reveivable. Yet another for manufacturing or shipping. Some of these are transactional in nature with entries happening every minute. Others are batch operations to generate paychecks every week, two weeks, or once a month.
Unfortunately, most companies have different departments that manage this data. Accounting and finance deal with things money related. Manufacturing deals with inventory, catalogs, and part lists. HR deals with people in the company. Sales deals with customer information and retail web sites. Data warehouse tries to integrate these departments and aggregate much of the data so that an executive can look at things like marketing defects and customer satisfication issues as well as inventory turns and the cost of shipping and storing parts. If our company, for example, generates plastic parts for our product and that product is starting to have a higher rate of returns and customer complaints, it might be of benefit to look at our suppliers for plastic as well as our die machine that molds the plastic parts. If the die machine is old or does not work properly with a new additive from our supplier, we need to find this out. The inventory and parts list repository will not show this.
My understanding is that you can currently create a star schema that crosses many tables in one database instance. This allows you to create materialized views that cross department boundaries. Most modern databases can cache and optimize these views. When data is updated in your inventory system, theoretically it is copies to your data warehouse and the materialized view that is in your data warehouse is located is updated. This does not negatively impact your inventory system other than copying the data to a remote system. I dont’ truly understand the need for data cubes and how they apply to this technology. It is something that I need to do more research on.
Data mining is another concept that comes up as part of a warehouse. It is typically said in conjunction with modeling and predictive analytics. Regression analysis and linear models appear to be important as well as deviation from predictions. I remember some of this from control systems but I am sure that the technology on keeping a pendilum swinging upside down is different than that of keeping inventory minimized and product quality maximized.
ETL is another feature that seems to be important. This piece typically consists of a point to point connection to a data stream and data translation in some way. If something changes in your mainframe, it is copied into your warehouse and converted from the language that it is stored into a format that works in a database. It also has the ability to convert data from one representation to another. For example, sales in Europe might be stored in Euros. If we want to aggregate data into a warehouse we might want to standardize on US Dollars as it is copied.
The biggest problems that I have seen for building a data warehouse are not technical but political. The key issues are around funding, data ownership, and data access. No one wants to fund something for the whole company out of their budget. Since IT is typically a cost center and not a revenue center, it is difficult for IT to being this type of initiative. What usually happens is that you get a mini-warehouse between two departments or data shared in a spreadsheet that is not available to anyone else. Some times you do see an analytics tool that runs on top of a manufacturing system but only runs at night since the system is tuned for high rate of transactions during the day and not batch operations. Mixing the two has traditionally caused problems with one or both functions.
What I need to figure out moving forward with this is:
1) how do cubes relate to data warehousing and what value does it bring.
2) what impact is there on copying data to a data warhouse from my existing system
3) is there a good example that shows a tangible benefit to building a warehouse over building integration between two or three systems
4) who is the end user and a typical use case for a warehouse? Is it just used by the higher level managers or by mid tier managers to run their operations?