THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Argenis Fernandez

DBA Best Practices - A Blog Series: Episode 1 - Backups

 

This blog post is part of the DBA Best Practices series, on which various topics of concern for daily database operations are discussed. Your feedback and comments are very much welcome, so please drop by the comments section and be sure to leave your thoughts on the subject.

Morning Coffee

When I was a DBA, the first thing I did when I sat down at my desk at work was checking that all backups had completed successfully. It really was more of a ritual, since I had a dual system in place to check for backup completion: 1) the scheduled agent jobs to back up the databases were set to alert the NOC in failure, and 2) I had a script run from a central server every so often to check for any backup failures.

Why the redundancy, you might ask. Well, for one I was once bitten by the fact that database mail doesn't work 100% of the time. Potential causes for failure include issues on the SMTP box that relays your server email, firewall problems, DNS issues, etc. And so to be sure that my backups completed fine, I needed to rely on a mechanism other than having the servers do the taking - I needed to interrogate the servers and ask each one if an issue had occurred. This is why I had a script run every so often.

Some of you might have monitoring tools in place like Microsoft System Center Operations Manager (SCOM) or similar 3rd party products that would track all these things for you. But at that moment, we had no resort but to write our own Powershell scripts to do it.

Now it goes without saying that if you don't have backups in place, you might as well find another career. Your most sacred job as a DBA is to protect the data from a disaster, and only properly safeguarded backups can offer you peace of mind here.

"But, we have a cluster...we don't need backups"

Sadly I've heard this line more than I would have liked to. You need to understand that a cluster is comprised of shared storage, and that is precisely your single point of failure. A cluster will protect you from an issue at the Operating System level, and also under an outage of any SQL-related service or dependent devices. But it will most definitely NOT protect you against corruption, nor will it protect you against somebody deleting data from a table - accidentally or otherwise.

Backup, fine. How often do I take a backup?

The answer to this is something you will hear frequently when working with databases: it depends.

What does it depend on? For one, you need to understand how much data your business is willing to lose. This is what's called Recovery Point Objective, or RPO. If you don't know how much data your business is willing to lose, you need to have an honest and realistic conversation about data loss expectations with your customers, internal or external. From my experience, their first answer to the question "how much data loss can you withstand?" will be "zero". In that case, you will need to explain how zero data loss is very difficult and very costly to achieve, even in today's computing environments.

Do you want to go ahead and take full backups of all your databases every hour, or even every day? Probably not, because of the impact that taking a full backup can have on a system. That's what differential and transaction log backups are for.

Have I answered the question of how often to take a backup? No, and I did that on purpose. You need to think about how much time you have to recover from any event that requires you to restore your databases. This is what's called Recovery Time Objective. Again, if you go ask your customer how long of an outage they can withstand, at first you will get a completely unrealistic number - and that will be your starting point for discussing a solution that is cost effective.

The point that I'm trying to get across is that you need to have a plan. This plan needs to be practiced, and tested. Like a football playbook, you need to rehearse the moves you'll perform when the time comes. How often is up to you, and the objective is that you feel better about yourself and the steps you need to follow when emergency strikes.

A backup is nothing more than an untested restore

Backups are files. Files are prone to corruption. Put those two together and realize how you feel about those backups sitting on that network drive. When was the last time you restored any of those?

Restoring your backups on another box - that, by the way, doesn't have to match the specs of your production server - will give you two things: 1) peace of mind, because now you know that your backups are good and 2) a place to offload your consistency checks with DBCC CHECKDB or any of the other DBCC commands like CHECKTABLE or CHECKCATALOG. This is a great strategy for VLDBs that cannot withstand the additional load created by the consistency checks.

If you choose to offload your consistency checks to another server though, be sure to run DBCC CHECKDB WITH PHYSICALONLY on the production server, and if you're using SQL Server 2008 R2 SP1 CU4 and above, be sure to enable traceflags 2562 and/or 2549, which will speed up the PHYSICALONLY checks further - you can read more about this enhancement here.

Back to the "How Often" question for a second. If you have the disk, and the network latency, and the system resources to do so, why not backup the transaction log often? As in, every 5 minutes, or even less than that? There's not much downside to doing it, as you will have to clear the log with a backup sooner than later, lest you risk running out space on your tlog, or even your drive. The one drawback to this approach is that you will have more files to deal with at restore time, and processing each file will add a bit of extra time to the entire process. But it might be worth that time knowing that you minimized the amount of data lost. Again, test your plan to make sure that it matches your particular needs.

Where to back up to? Network share? Locally? SAN volume?

This is another topic where everybody has a favorite choice. So, I'll stick to mentioning what I like to do and what I consider to be the best practice in this regard. I like to backup to a SAN volume, i.e., a drive that actually lives in the SAN, and can be easily attached to another server in a pinch, saving you valuable time - you wouldn't need to restore files on the network (slow) or pull out drives out a dead server (been there, done that, it’s also slow!).

The key is to have a copy of those backup files made quickly, and, if at all possible, to a remote target on a different datacenter - or even the cloud. There are plenty of solutions out there that can help you put such a solution together. That right there is the first step towards a practical Disaster Recovery plan. But there's much more to DR, and that's material for a different blog post in this series.

Published Thursday, November 15, 2012 7:42 PM by Argenis

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Luke Jian said:

One other thing I like to ask users is what are the SLA (Service Level Agreement) for the different apps that live on the server. A lot of time the RPO/RTO that the respective business owners will give you is far from the SLA and when things go south the pressure will be much higher because they just found out that the agreed RTO is nowhere near the SLA and clients will not be happy.

If you know about it ahead of time you might take preventive steps ahead of time or yield higher priority to those DBs in case of a disaster.

November 15, 2012 10:02 PM
 

Cindy said:

I attended the pre-con and took away a lot of great info. When you were talking about checking backups I was thinking yup, check, got that covered. However I recently had an issue where my scripts to check for database backups did not work. The system databases on this server had been moved but the subsystem dlls had not been updated. The SQL agent looked like it was running, but all the jobs were going into a suspended/failed state and the agent wasn't working properly and neither was my check/notification job. I realized it wasn't working when Redgate SQL monitor started to complain about overdue backups. After updating the path of the subsystem dlls things started working again. Fortunately this was a test server, and I can make sure the same situation doesn't happen in prod when those system databases get moved. With that said, I guess I have some re-work to do on my check failed job script.

November 15, 2012 11:26 PM
 

Anup Warrier said:

Nice write-up,looking forward for this series !

November 16, 2012 1:47 PM
 

martinz said:

Nice start to your series Argenis - I attended the pre-con that you and Robert did at PASS 2012 as well and there were a lot of good take aways - hopefully you'll be capturing some of those in this series too - Cheers.

November 19, 2012 4:31 PM

Leave a Comment

(required) 
(required) 
Submit
Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement