Backing up a database to the cloud

Up to this point we have talked about the building blocks of the cloud. Today we are going to look at the real economics of using some of the cloud services that we have been examining. We have covered moving compute and storage to the cloud; let's dig into the reasons why someone would put storage in the cloud.

Storage is one of those funny things that everyone asks for. Think of uses for storage. You save emails that come in every day. If you host your email system in your corporation, you have to consider how many emails someone can keep. You have to consider how long you keep files associated with email. At Oracle we have just over 100,000 employees and limit everyone to 2 GB for email. This means that we need 200 TB to store email. If we increase this to 20 GB it grows to 2 PB. At $3K/TB we are looking at $600K capex to handle email messages. If we grow this to 2 PB we are looking at $6M for storage. This is getting into real money. Associated with this storage is a 10% annual support cost ($60K opex) as well as part of a full time employee to replace defective disks, tune and feed the storage system, and allocate disks and partitions not only for our storage but for other projects, at a cost of $80K payroll annually. If we use a four year depreciation, our email boxes cost us ($150K capex + $60K opex + $80K opex) $290K per year, or about $2.90 per user, just to store the email. If we expand the email limit to 20 GB, almost everything grows by a factor of 10 (we don't need 10x the storage admins), so the email boxes cost us roughly $22 per user annually. Pile on top of this the home directories that people want to save attachments into and this number explodes. We typically do want to give everyone 20 GB for a home directory since this stores documents associated with the operation of the company. We want people storing these documents on a network share and not on a disk on their laptop. Storing data on a laptop opens up security and data protection discussions as well as questions about access to the data if the laptop fails. Putting it on a shared home directory allows the company to back up the files as well as define protection mechanisms for sensitive data. We have basically justified roughly $25 per user per year for email and home directories by allocating 22 GB to each employee.

The biggest problem is not user data, it is corporate data. Databases typically run anywhere from 400 GB to 40 TB. There is a database for human resources, payroll, customer service, purchase orders, general ledger, inventory, transportation management, manufacturing... the list goes on. Backing up this data as it changes becomes an issue. Fortunately, programs like E-Business Suite, PeopleSoft, and JD Edwards aggregate all of these business functions into a small number of database instances, so we don't have tens or hundreds of databases to back up. Some companies do roll out multiple databases for each project to collect and store data, but these are typically low cost, low function databases like MySQL, PostgreSQL, MongoDB, and SQL Server. Corporate data that large numbers of people use typically lives in an Oracle database, DB2, or SQL Server. Backing up this data is critical to a corporation. Database backups are typically done nightly to make sure that you can recover from disk or server failures. Fortunately, you don't need to back up all 400 GB every night; you can do incremental backups and copy only the data blocks that have changed since the previous night. Companies typically reserve late night on the weekend for a full backup because users are typically not working and few if any people are hitting the database at 2am on a Sunday morning. The database can be taken offline for a couple of hours to back up 400 GB, or a live backup can be taken with little risk since almost no one is on the system at that time. A typical server with SCSI or SAS disks can reasonably sustain 2 GB/second of read throughput, so reading 400 GB takes about 200 seconds. Unfortunately, writing is typically about half that speed, so the backup should reasonably take around 400 seconds, or about 7 minutes. If your database is 4 TB, multiply by 10 and it takes just over an hour to back up everything. Typically you also want to copy this data to another data center or to tape, and the effective write speed gets cut in half again. The 7 minutes becomes 15 minutes. The hour becomes two hours.
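To make that concrete, a minimal RMAN sketch of the weekend-full, weeknight-incremental cycle looks something like this. This is generic RMAN with no cloud storage involved yet; the schedule and options will vary from shop to shop.

    # Sunday ~2am: level 0 (full) backup while few users are on the system
    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG;

    # Weeknights: level 1 incremental copies only the blocks changed since the last backup
    RMAN> BACKUP INCREMENTAL LEVEL 1 DATABASE PLUS ARCHIVELOG;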

When we talk about backing up database data, there are two schools of thought. The database data is contained in table extent files. You can back up your data by replicating your file system or by using database tools to back up your data. Years ago these files were kept on raw disk partitions; few people do this anymore and table extent files are kept on file systems. Replicating data from a raw partition is difficult, so most people used tools like RMAN to back up database files on raw partitions. Database vendors have figured out how to optimize reads and writes to disk despite the file system structures that operating system vendors created. File system vendors have figured out how to optimize backup and recovery to survive disk failures. Terms like mirroring, triple mirroring, RAID, and logical volume management come up when you talk about protecting data in a file system. Other terms like snap mirrors and off-site cloning sneak into the conversation as well. Earlier when we talked about $3K/TB we were really talking about $1K/TB of raw disk, but we triple mirror the disks and thus triple the cost of usable storage. This makes sense when we go down to Best Buy or Fry's and look at a 1 TB USB disk for $100. We could purchase this, but the 2 GB/second transfer rate suddenly drops to something like 200 MB/second; we pay more for a higher speed communication bridge to the disk drive. We could drop the cost of storage to $100/TB, but the 7 minute backup and recovery time suddenly grows to 70 minutes. This becomes a cost vs recovery time discussion, and it is an important one to have. At home, recovering your family photos from a dead desktop computer can take hours. For a medical practice, waiting hours to recover patient records impacts how the doctors engage with patients. Waiting hours on a ticket sales or stock trading web site becomes millions of dollars lost as people go to your competitors to transact business.

Vendors like EMC and NetApp talk about cloning or snap mirroring disks to another data center. This technology works for things like email and home directories but does not work well for databases. The database writes to multiple files, sometimes at the same time. If you partition your data, the database might be moving data from one file to another as the data ages. We might have high speed SSDs for current data and lower cost, higher latency disks for data more than 30 days old. If we start a clone of our SSDs during a data move, the recent data gets copied to our mirror at another site. The database might then finish re-partitioning the data before the disk management software starts copying the lower speed disks. We suddenly have a data consistency problem. The disk management software and the database software don't talk to each other and don't tell each other that data is being moved between file systems. Data that was on the high speed SSDs is now out of sequence with the low speed disks at our backup site. If we have a disk failure on our primary site, restoring data from our secondary site will leave the database inconsistent and the recovery will fail. The only way to solve this problem is to schedule disk clones while the database is shut down. Unfortunately, many IT departments select a disk cloning solution because it is the best solution for mirroring home directories, email servers, and virtualization servers. Database servers have a slightly different backup requirement and require a different way of doing things.

The recommended way of backing up a database is to use archive tools like RMAN or commercially available products like Commvault or Legato. The commercial products provide a common backup process that knows how virtualization servers and databases like to be backed up. They allow you to back up SQL Server and an Oracle database with the same user interface and process. Behind the scenes these tools talk to RMAN and the SQL Server backup utilities but present a uniform user interface to schedule and manage backups and restores.

Enough rambling about disks and backups. Let's start talking about how to use cloud storage for our disk replication. Today we are going to talk about database backup to the cloud. The idea behind our use of the cloud is pure economics. We would like to reduce our cost of storage from $3K/TB of capex to $400/TB/year and get rid of the capex cost altogether. The real problem with purchasing storage for our data center is that we can't buy it 10 TB at a time each month, even though that is roughly what we are consuming for backups. What we are forced to do is look 36 months ahead and purchase 400 TB of disk to handle the monthly data consumption, or start deleting data after a period. For things like Census data and medical records the retention period is decades, not months. For some applications, we can delete data after 12 months. If we are copying incremental database backups, we can delete the incrementals once we do a full backup. In our 10 TB a month example we have to purchase $1.2M in storage today knowing that we will only consume 10 TB this month and 10 TB next month. Using cloud storage we can pay roughly $330 this month, $660 next month, and grow this amount by about $330/month until we get to the 400 TB that we will consume in 36 months. If we guesstimated low, we will need to purchase more storage again in two years. If we guesstimated high, we will have overpurchased storage and spent more than $3K/TB for what we are actually using. Using cloud storage lets us consume storage at $400/TB/year. If we guess wrong with metered storage there is no penalty. If we are using non-metered storage, we might purchase a little too much, but we only have to look forward 12 months rather than 36. It is typically easier to guess a year ahead than three years ahead.

Just to clarify, we are not talking about moving all of our backup data to the cloud at once. What we are talking about is doing daily incremental backups to high speed disk attached to the database. After a few days we do a full backup to lower cost network storage. After a few weeks we copy these backups to the cloud. In the diagram below we keep backups on high speed disk for five days, on low speed disk for 21 days, and in the cloud beyond that, with data moving to tape in the cloud beyond 180 days. The cost benefit comes from moving data that we probably won't read onto lower cost storage. Using $400/TB/year cloud storage saves us $2600/TB of capex relative to the $3K/TB disk purchase. Using $12/TB/year tape saves an additional $388/TB/year of opex.
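As a rough sketch of how this tiering could be expressed in RMAN, assuming the nightly backups land on local disk and an sbt channel (configured later in this post) points at cloud storage. The retention window below simply mirrors the diagram; your numbers will differ.

    # Keep three weeks of backups restorable per the retention policy
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 21 DAYS;

    # Periodically copy the backup sets sitting on local disk out to the sbt (cloud) tier
    RMAN> BACKUP DEVICE TYPE SBT BACKUPSET ALL;

    # Remove local disk copies that have aged past the retention policy
    RMAN> DELETE OBSOLETE DEVICE TYPE DISK;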

The way that we get this storage tiering is to modify the RMAN backup libraries and move the tape interface from an on-site tape unit to disk or tape in the cloud. The library module can be downloaded from the Oracle Technology Network. More information on this service can be found in the Oracle documentation and the backup whitepaper. You can also watch videos that describe this service.

The economics behind this can be seen in a TCO analysis that we did. In this example we look at moving 30 TB of backup from on-premise disk to cloud backup. The resulting 4 year savings is $120K. This does not take into account tangential savings but only looks at physical cost savings of not purchasing 30 TB of disk.

Let's walk through what is needed to make this work. First we have to download the library module that takes RMAN read and write commands and translates them into REST API calls. This library exists for the Oracle Storage Cloud Service as well as Amazon S3. The key benefit of the Oracle Storage Cloud is that you get encryption and parallelism as part of the service. With Amazon S3 you need to pay for additional parallel channels at $1500/channel as well as encryption in the database at $10K per processor license. The Oracle Storage Cloud provides this as part of the $33/TB/month database backup bundle.

Once we download the module, we need to install it with a java command. Note that this is where we tie the Oracle home and SID to the cloud credentials. The credentials are stored in an Oracle wallet along with the encryption keys used to encrypt the backups.
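A hedged example of what that install looks like; the identity domain, user, password, and directories here are placeholders, and the exact flags may vary by module version, so check the readme that ships with the download.

    $ java -jar opc_install.jar \
        -serviceName Storage -identityDomain myidentitydomain \
        -opcId 'backup.admin@example.com' -opcPass '<password>' \
        -walletDir $ORACLE_HOME/dbs/opc_wallet \
        -libDir $ORACLE_HOME/lib

    # The installer stores the cloud credentials in the wallet directory, downloads
    # the libopc library into libDir, and writes a config file under $ORACLE_HOME/dbs
    # that the RMAN channel configuration references in the next step.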

Now that we have the cloud library installed, we need to define a tape (SBT) interface for RMAN and link the library into the process. When we read and write to tape we are actually reading and writing to cloud storage.
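With the library in place, the RMAN configuration might look something like the following sketch. The library path, config file path, and SID are whatever the installer produced on your system, not literal values to copy.

    RMAN> CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' PARMS 'SBT_LIBRARY=/u01/app/oracle/lib/libopc.so, SBT_PARMS=(OPC_PFILE=/u01/app/oracle/dbs/opcORCL.ora)';
    RMAN> CONFIGURE DEFAULT DEVICE TYPE TO SBT_TAPE;

    # The cloud backup service expects encrypted backups, so turn on RMAN encryption as well
    RMAN> CONFIGURE ENCRYPTION FOR DATABASE ON;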

Once we have everything configured, we use RMAN, Commvault, or Legato just as we have for years. Accessing the tape unit is really accessing cloud storage.
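In other words, with the sbt channel configured as the default device, the nightly job the DBA (or the backup product behind the scenes) runs does not change; a sketch:

    # Nightly incremental now goes straight to cloud storage through the sbt channel
    RMAN> BACKUP INCREMENTAL LEVEL 1 DATABASE PLUS ARCHIVELOG;

    # Restores read back through the same "tape" interface
    RMAN> RESTORE DATABASE;
    RMAN> RECOVER DATABASE;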

In summary, this is our first use case for the cloud. We are offsetting the cost of on-premise storage and reducing the cost of our database backups. A good rule of thumb is that we can drop the cost of backups from $3K/TB capex plus $300/TB/year in support to $400/TB/year. Once we have everything downloaded and installed, nothing looks or feels different from what we have been doing for years. When you start looking at purchasing more disks because you are running out of space, look at moving your backups from local disk to the cloud.