what the heck is a data warehouse

Gartner says that data warehouse is the next big thing. Oracle is big into this space. It is something that many customers are interested in. It begs the questions, what the heck is data warehouse and why should you care?

The key words that are usually associated with data warehouse are consolidation, business intelligence, single server, data mining, and analytics. What I can gather is that data warehousing is a mechanism for gathering key metrics about the business on one server so that you can analyze trends, changes, or requirements of your business. Isn’t this what your database does for you anyway? Well, not really. You use the database to store HR data for payroll and benefits. You use a different instance to store inventory and accounts reveivable. Yet another for manufacturing or shipping. Some of these are transactional in nature with entries happening every minute. Others are batch operations to generate paychecks every week, two weeks, or once a month.

Unfortunately, most companies have different departments that manage this data. Accounting and finance deal with things money related. Manufacturing deals with inventory, catalogs, and part lists. HR deals with people in the company. Sales deals with customer information and retail web sites. Data warehouse tries to integrate these departments and aggregate much of the data so that an executive can look at things like marketing defects and customer satisfication issues as well as inventory turns and the cost of shipping and storing parts. If our company, for example, generates plastic parts for our product and that product is starting to have a higher rate of returns and customer complaints, it might be of benefit to look at our suppliers for plastic as well as our die machine that molds the plastic parts. If the die machine is old or does not work properly with a new additive from our supplier, we need to find this out. The inventory and parts list repository will not show this.

My understanding is that you can currently create a star schema that crosses many tables in one database instance. This allows you to create materialized views that cross department boundaries. Most modern databases can cache and optimize these views. When data is updated in your inventory system, theoretically it is copies to your data warehouse and the materialized view that is in your data warehouse is located is updated. This does not negatively impact your inventory system other than copying the data to a remote system. I dont’ truly understand the need for data cubes and how they apply to this technology. It is something that I need to do more research on.

Data mining is another concept that comes up as part of a warehouse. It is typically said in conjunction with modeling and predictive analytics. Regression analysis and linear models appear to be important as well as deviation from predictions. I remember some of this from control systems but I am sure that the technology on keeping a pendilum swinging upside down is different than that of keeping inventory minimized and product quality maximized.

ETL is another feature that seems to be important. This piece typically consists of a point to point connection to a data stream and data translation in some way. If something changes in your mainframe, it is copied into your warehouse and converted from the language that it is stored into a format that works in a database. It also has the ability to convert data from one representation to another. For example, sales in Europe might be stored in Euros. If we want to aggregate data into a warehouse we might want to standardize on US Dollars as it is copied.

The biggest problems that I have seen for building a data warehouse are not technical but political. The key issues are around funding, data ownership, and data access. No one wants to fund something for the whole company out of their budget. Since IT is typically a cost center and not a revenue center, it is difficult for IT to being this type of initiative. What usually happens is that you get a mini-warehouse between two departments or data shared in a spreadsheet that is not available to anyone else. Some times you do see an analytics tool that runs on top of a manufacturing system but only runs at night since the system is tuned for high rate of transactions during the day and not batch operations. Mixing the two has traditionally caused problems with one or both functions.

What I need to figure out moving forward with this is:
1) how do cubes relate to data warehousing and what value does it bring.
2) what impact is there on copying data to a data warhouse from my existing system
3) is there a good example that shows a tangible benefit to building a warehouse over building integration between two or three systems
4) who is the end user and a typical use case for a warehouse? Is it just used by the higher level managers or by mid tier managers to run their operations?

deciding what to do

my wife and I are fortunate enough to come upon some money. Unfortunately, the money came by the death of one of our family members. We both realize that we should do something in the spirit of the person that left us the money and don’t want to blow it on a new car, a lavish vacation, or a big screen tv. What we want to do is invest the money in something that won’t loose value. Given the amount of our net worth in the stock market, we don’t want to put more into the market. The next option is property. This brings up the question of what do you do with your resources to improve your position. It got me thinking that this isn’t much different from a CIO of a company. Assume that you are part of a company that is publically traded. The stock rises by 2x over a year from the given market conditions, condition of the company, and great job that you and the rest of the executive team are doing. The board has opted to issue some new stock thus giving the company a influx of capital. The CEO has agreed that this money should be split across all divisions since an acquisition is not something that anyone wants to do. In doing so, the CEO asks for a plan on what to spend the money on. I realize that this isn’t common place. Most IT departments have to fight for any new dollars and if they are given any extra money it is to launch a specific project.

Given that my family has a blank sheet of paper, what property do we want to invest in? Pay for our existing house? Buy some rental property and become landlords? Buy some beach or lakefront property? Buy a timeshare in another country? It begs the question where do you want to be in a year and what incremental cost will you get with this new property. What cost is there of a new property.

Paying for our existing house has some positives. Given Texas law, once we own the house, it can’t be taken from us unless we start running guns or drugs across the border. The biggest disadvantage is that it effects the way that we file federal taxes. We might not be able to itemize but have to take standard deductions. This isn’t a big deal but it will increase out taxes by a few hundred dollars a year. On the positive side we won’t have to pay the thousands in interest that we currently pay. We also have a very low interest rate so if we are getting 6% yield in the market, we would be loosing money because our house value does not go up 6%.

Buying a rental property and managing it is another alternative. This will allow us to supplement our income assuming that we can rent the property for less than the mortgage and insurance. We can probably do this but there is risk. We will establish a relationship with someone we don’t know and won’t necessarily respect our investment like we do. We were renters once. I remember this one duplex in college that I totally trashed. Ok, I didn’t trash it, my new puppy did. I strongly believe in karma so becoming a slum lord is not something that I want to deal with because I don’t want to be on the other end of that nightmare.

Buying a lake front, beach front, or river front property is another alternative. The advantages of this are that it would give us a place to get away once a month and a place for the family to gather even after we all graduate. I see my oldest son about five or six times a year. If we had a common gathering place I can see this going up to ten times a year. As my other kids get older the same will be true. They are already talking about a place to build a tree house, building a canoe or rowing boat together and going down the river together. My wife is talking about leaving town on Thursday and coming back late Monday during the summers. We both have flexible schedules and can work remotely during the summer. The drawback is that someone will have to mow the lawn, water the plants, and make sure the property is secure. This also means another cable or satelite subscription for internet access remotely. It means another water, electric, and garbage bill. It means an extra tank of gas per month for driving to and from the new location. It means a boat or jet ski to play with on the lake. It means a new set of appliances for the new house. It means a net set of beds, couches, kitchen tables, you get the picture.

What about a condo or time share? We could easily do something like this. It gives us all the benefits of a lake or beach house without having to do yard work. Unfortunately it is something that will probably be rented to other people and we won’t have exclusive access to the property. We will need to plan out trips to our property and pay a management service to clean the property. True this will give us some additional income but it will also give us restrictions on how we use our new investment. We will also have to share the income with a management company. This isn’t a bad idea, just something that we need to consider. We also need to make sure we get a property in an area that we want. The monthly maintenance fee and management property fee will be additional expenses. We will also have to buy new furniture but not necessarily new appliances. This option will not have room for a tree house or an area where we can jointly build a boat or deck.

I guess I will need to ponder this a little more. I’m glad I’m not a CIO. I dont’ think I want to look at big issues like this on a daily basis. I tend to overanalyze things.

why use sharepoint

I was watching a Phil Lee podcast available through PodTechNet. The discussion was about Microsoft SharePoint Collaboration Server. The discussion made me start thinking.

The legacy behind the SharePoint product is that it started as a Site Server and migrated to SharePoint 1.0. Microsoft then merged Front Page into Windows SharePoint Services which morfed into SharePoint Portal Server 2003. The idea is to take the typical tools that everyone uses (Word, Excel, and PowerPoint) and enable users to share these applicaitons as a service collaboratively. This isn’t quite Documentum, FileNet, or the other ECM processes. It is more of a collaboration tool and not a document control mechanism. It does not have the business process integration like a traditional ECM product.

The SharePoint Server 2007 does incorporate the Content Management 2002 server and Office 2007. This is considered to be the money maker for Microsoft. They key to the code is splitting the services into different components. It consists of Collaboration. a Portal, a Search Engine, Content Management Services, Business Analytics, and Business Process Automation. The service layer under these services create things like information lifecycle management of documents, security and access controls, and other things to provide government compliance and data retention regulations.

The concept here is that if a user is a historical Office Suite user, there is an opportunity to upgrade everything to collaborative services and corporate standards with the new Suite and SharePoint combined. These services are also starting to include some basic templates for back office applicaitons like inventory, basic accounting, and HR. These templates are not robust and will probably only work for mom and pop shops.

The demo that was given showed a contracts folder that allows a group of people to create documents with Excel, share these documents through files, or make them available offline through Outlook. The tool also has a discussion group mechanism that allows for a newsgroup style discussion. This service is available as is News offline but also works similar to an instant messenger chat session if you are connected.

The demo showed an example of filing an expense report through a template. This integrates a Forms server on the SharePoint server similar to the Oracle Forms server integrates with an Applicaiton Server. The expense tool showed basic forms.

The way that file sharing is done has changed slightly. Rather than looking at a collection of Powerpoints being dropped into a directory, users can subscribe to a shared channel to view presentations generated from a group that generates collections of presentations. This is a little different from a web page service because it allows for people to be notified when something drops into a channel.

SharePoint is becoming pervasive because many companies own Office 2007 and other related technologies. They typically have at least one SQL Server license, and there isn’t really a good collaboration tool other than email and IM. SharePoint gives you a tool to easily expand use of other tools that users are familiar with.

Some of the drawbacks are that it is easy to cause islands of information and problems when integration of different sharepoint instances merge. The storage structure, topology, and data organization becomes an issue. There really isn’t a sync or offline way of attaching to this. It is also a Windows only solution. There is also a big step in going to Office 2007. Once you go, everyone will have to be able to read the new file storage format. This is an all or nothing step.

This product does have good momentum. It does integrate a variety of tools that people are familiar with.

– end of training video –

In my opinion, this looks like a good tool. I am not a fan of Microsoft products because they typically only work on Windows but that does not seem to be a problem for most corporations. I don’t think it integrates very well into documents that need to be scanned or documents that are not Windows format. The search engine and security mechanisms are not as granular or robust as they need to be. Fortunately for Oracle, we do have some good extension products that we got with the Stellant application that allow us to integrate the SharePoint data and other products like Documentum together. It seems like every customer that we talk to has an ECM solution and it needs to integrate into SharePoint. Many companies are treating it like a web version of Outlook that does collaboration and joint scheduling. Not a bad concept but many of these companies are moving forward with initiatives in different business units to isolate sharing between departments. This is cauing integration and standards problems within their company when they try to aggregate into one SharePoint site.


distributed and high availability

I have had a discussion at a number of customers of how to expand their existing footprint. They are regional companies that have merged or acquired another company. The new organization is typically an international company. The question comes up how to let the users from Singapore, Sydney, or Scotland see the financials, inventory, and hr data without taking forever to perform standard operations.

The first method that I always recommend is to take inventory of what they have and how it is configured and accessed. If you have central servers in a major metropolitan area in the US, make sure that you have high speed access to your intranet. This means something larger than a T1 connection for most companies. Once you have the high speed connection, make sure that you have two data centers. It does not matter where the remote data center is, just make sure that it isn’t in the same building or server room. If you have a site failure because the power company goes down or a storm floods the city, put your alternate data center in another city. If you are looking at supporting operations in Europe, Asia, or China, why not put an alternate data center there? Typical reasons to stop people from putting data centers there are legal, logistical, and political. Some countries do have restrictions on what data can be exported from the country. France, for example, does not let HR data leave the country. Employee data for French citizens must stay in France. If you have a large data center in Asia, you might want to put your data center in cities like Singapore because they have advanced and reliable network infrastructure. Putting it in China using the publically available network might lead to problems or random outages. The same is true for many companies in Africa.

Once you have alternate sites available, make sure that you have two network connections into and out of each site. If one network goes down, the second network connection provides backup. Many companies in the US only have one internet connection because maintaining two high speed connections can be expensive. Many of the services allow you to pay for an all you can eat service where you only pay for the bandwidth that you use. This is an excellent idea but it does not allow for predictable cost. If you generate an event that causes a significant amount of traffic, like a new product release or a new marketing event, your network cost can go through the roof. Many companies pay for a high speed fast pipe like an OC3 or OC12 and supplement this with a variable network connection. The variable network connection is the standby failover line and is only used in event of failure or substantial congestion on the high speed fast pipe.

Once you have the network plumbing working, you need to look at latency between the sites. The latency needs to be guaranteed less than 300ms, typically less than 150ms for the Oracle suite of applications to properly work. Typical network speeds around the world on an uncongested network is on the order of 85ms. Network latency inside a data center is typically on the order of 10ms. If you can fall into the 30 to 50ms range, things will work properly and you typically will not notice that you are at a different data center. When latency jumps to 100ms plus, users will start to complain of slow applicaitons. Realize for each web page interaction there are four to five round trips on the network to transfer the web page. This means that a half a second per web page before it starts to paint. This delay is noticable from the user.

The discussions that I have had recently are how do I split my financials and HR data between two sites. If I have operations in Europe and operations in the US, can I have the European data in London and my US data in Houston? The answer is yes and no. Yes, I can have my Oracle Apps running in both locations. Yes, I can have a database that stores the Europe data on a database server in London. Yes, I can have a database that stores US data in Houston. The big problem is how do I run two financial systems and merge these repositories? How do I convert between Euros and Dollars? How do I report expenses in multiple currencies and convert this data on a time basis based on when I exchange money with my bank? How do I adhere to the Tax laws in the US and the Tax laws in the UK? There is more to consider than just can I store my data in multiple locations.

You can have Oracle Apps running in two different sites and have two different databases in the two sites as well. The number two is not special, this could be twenty or twelve just as easy. The difficulty comes about in merging this data into one corporate view. If, for example, I want to look for a part in our inventory, I want to look first locally then internationally. The search can be done through Oracle Apps (PeopleSoft, Seibel, JD Edwards, Hyperion, etc.) with a regional restriction or with a global search. The regional restriction will look locally. The global search will look in multiple tables or data partitions. You can replicate the European data to the US using DataGuard and the US data to Europe the same way. This will give you two database tables at each site with one table being the primary and the second being a logical or physical standby as well. The standby data can be accessed in read-only mode and changes can be pushed across using a streams interface to the primary. The drawback to this is that 9i and 10g restrict database availability for databases that are opened with read only. The redo apply stalls until the database is closed. The 11g database allows you to have the database open and have the redo apply processed to this standby repository.

The distribution of applicaiton internationally is a complex problem. There are many alternatives or options to make it work. There are many problems and issues that you need to think about beyond simple data synchronization. This is a topic that I will probably visit a few more times.