Thursday, March 2, 2017

Creating a Disaster Recovery Plan

        So your boss asked for a copy of your DR plan.  Once you've wiped that deer-in-the-headlights look off your face, you realize "We've got database backups" isn't exactly a plan.  You'll need to define what a disaster could be, document the business impact and identify your limitations.  So where do you start? Well, that's the easy part. 

Define what's important... before a disaster.

    There are lots of questions to ask before you have a disaster, and they're even more important to ask before you build your disaster recovery plan.  It's essentially the WHO, WHAT & WHEN of your plan.  How long can you be down, and how much data can you lose?  Define what data is important.  Who is responsible for declaring a disaster? Who is responsible for doing the actual recovery work? Even knowing where you store your plan is something to think about now.

Who are my stakeholders?

    In a single statement, they're the people that can answer all of the questions we're asking here.  This starts with C-level execs, since they'll be the ones that have to answer to the board after a disaster.  They're also the ones that have to pay for it.  Next, identify the people that will be affected by any data loss or a system outage.  Who can't do their job? Go through each application and ask yourself "Who cares?" and then follow up with them.  Finally, talk to the people that will have to implement the DR plan: System admins, networking & security, Operations, Storage & Infrastructure, DBA team, etc.

RPO & RTO

These are terms you see thrown around in sessions, in meetings and on Twitter.  The official definitions are a little wordy & not as helpful as you might think.



    Your Recovery Time Objective is a way of defining how long you can be down.  Or, put another way, how long until you have to be back up.  So find out how your business defines RTO and then build toward that.  Of course, this can vary based on the nature of your disaster.  Did you lose a database, an instance, or did someone kick the storage out from under everything? Did you have to fail over to your DR site?  Were you the victim of a DDoS attack, either directly or indirectly?
    Your Recovery Point Objective is easily summed up by asking "How much data can you lose?"  For some systems, like payments and healthcare, that answer will be zero.  The approach and expense for those systems are greatly different from others.  The infrastructure, build and design will vary by a large degree when you're allowed to lose milliseconds of data versus up to 15 minutes. For slow-changing systems, could you restore on Tuesday from a Sunday full backup and Monday's differential? In some instances, rebuilding is faster and easier than restoring, so be sure to explore that as an option.  
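
    To make that Tuesday scenario concrete, here's a minimal sketch of the restore sequence.  The database, file names and paths are placeholders, and your environment may have log backups to roll forward as well:

    -- Sunday's full, Monday's differential, then any log backups taken since.
    RESTORE DATABASE SalesDB FROM DISK = N'X:\Backups\SalesDB_Sunday_FULL.bak' WITH NORECOVERY, REPLACE;
    RESTORE DATABASE SalesDB FROM DISK = N'X:\Backups\SalesDB_Monday_DIFF.bak' WITH NORECOVERY;
    RESTORE LOG SalesDB FROM DISK = N'X:\Backups\SalesDB_Log_0815.trn' WITH NORECOVERY;
    RESTORE DATABASE SalesDB WITH RECOVERY;   -- bring it back online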

    While execs can help you define RTO & RPO, they aren't the ones who have to make it happen.  If you can't meet their requirements, be honest. You also need to be aware of contractual obligations to external clients.  While these are out of your direct control, you should encourage sales & legal to work with you while they're prepping a contract.  If they're promising 5 9's of uptime but you can only deliver 3 9's, they need to know.  Use phrases like "claw back", "refund of fees" or "violation of terms".  That should get their attention. 
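
    If you want to put numbers on that gap for them, the math is simple enough to run right in a query window (it's just the 525,600 minutes in a year multiplied out):

    -- Back-of-the-napkin math: allowed downtime per year for a given uptime promise.
    SELECT '99.9% (three 9''s)'  AS promise, 525600 * (1 - 0.999)   AS minutes_down_per_year   -- ~525 minutes, almost 9 hours
    UNION ALL
    SELECT '99.999% (five 9''s)' AS promise, 525600 * (1 - 0.99999) AS minutes_down_per_year;  -- ~5 minutes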

Building a Backup Strategy

    Start with the basics.  Now that you know how much data you can lose and how long you've got to get things back up and running, set about making that happen.  Start documenting your backup processes, then put them into place across the enterprise. Make sure all of your servers are following the rules you've established.  If you can lose 15 minutes' worth of data and take 2 hours to come back online, then set the schedule to match.  Make sure you know how to restore the tail end of a log. Here, Tim Radney [T|B] shows us how in an older blog post that's timeless.
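
    Something along these lines is what that 15-minute RPO implies.  Names and paths below are placeholders; the first two backups would normally run as SQL Agent jobs, and the tail-log backup is the one you take the moment disaster strikes:

    -- Nightly full backup (scheduled via a SQL Agent job)
    BACKUP DATABASE SalesDB TO DISK = N'X:\Backups\SalesDB_FULL.bak' WITH CHECKSUM, COMPRESSION;

    -- Log backup every 15 minutes (another SQL Agent job) to meet the 15-minute RPO
    BACKUP LOG SalesDB TO DISK = N'X:\Backups\SalesDB_LOG.trn' WITH CHECKSUM, COMPRESSION;

    -- At disaster time: back up the tail of the log, even if the data files are damaged
    BACKUP LOG SalesDB TO DISK = N'X:\Backups\SalesDB_TAIL.trn' WITH NO_TRUNCATE, NORECOVERY;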

    We're not just talking about database backups.  If you use it, you'll need it.  Defining items to back up other than databases means an end-to-end examination of your business.  Plan on having to recover things like Active Directory, application configs, development source code, external files, encryption keys and passwords.  Script out some things ahead of time, like SQL Agent jobs, a create-logins script, database restores, service accounts, etc.  
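
    For the create-logins piece, something like this stripped-down take on Microsoft's sp_help_revlogin approach will do.  Treat it as a sketch (the filter on ## logins is illustrative) and test the output before you trust it:

    -- Generate CREATE LOGIN statements for SQL logins, preserving password hashes and SIDs,
    -- so they can be replayed at the DR site.
    SELECT 'CREATE LOGIN ' + QUOTENAME(sp.name)
         + ' WITH PASSWORD = '
         + CONVERT(varchar(max), CONVERT(varbinary(max), LOGINPROPERTY(sp.name, 'PasswordHash')), 1)
         + ' HASHED, SID = ' + CONVERT(varchar(max), sp.sid, 1) + ';'
    FROM sys.server_principals AS sp
    WHERE sp.type = 'S'                -- SQL logins only
      AND sp.name NOT LIKE '##%';      -- skip the system certificate-mapped logins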

Building a Recovery Strategy

    Any DBA is only as good as their last restore. That means you should be practicing restores regularly.  You'll want to establish recovery baselines.  During a disaster, the longer something takes, the more likely you are to panic. Make sure you know ahead of time everything you'll need to do for a restore and how long you expect it to take.  Make sure you have copies of database backups locally and a copy at your DR site.  Of course, the biggest payoff to practicing restores is knowing that your backups work.  
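
    A practice run can be as simple as verifying the backup and restoring it under a scratch name on a test box, timing it as you go.  The logical file names and paths below are assumptions; check yours with RESTORE FILELISTONLY first:

    -- Sanity-check the backup file
    RESTORE VERIFYONLY FROM DISK = N'X:\Backups\SalesDB_FULL.bak' WITH CHECKSUM;

    -- Practice restore to a scratch copy; note how long it takes for your baseline
    RESTORE DATABASE SalesDB_DRTest
    FROM DISK = N'X:\Backups\SalesDB_FULL.bak'
    WITH MOVE 'SalesDB'     TO N'T:\Scratch\SalesDB_DRTest.mdf',
         MOVE 'SalesDB_log' TO N'T:\Scratch\SalesDB_DRTest_log.ldf',
         STATS = 5,       -- progress messages every 5%
         RECOVERY;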

Test your plan. 
  
    Some sage advice from Allan Hirt: "If you don't test your D/R plan, you don't have a plan.  You have a document."  Document your DR plan.  Practice.  Automate. Adjust. Document. Practice again.  It's the only way to really be sure.  That being said, if your business can't support a full failover test, consider a tabletop test instead.  Once your plan is documented, have a meeting with all of the people responsible for recovery.  Go through each and every step of your DR plan.  Make sure everyone agrees it should work.  Make sure everyone understands their role.  Make sure they have all the pieces they need to put the plan into action.

    Don't just practice for the big disasters either.  Practice for those smaller disasters & disruptions: a DDoS attack, your ISP going down, a data center power outage, natural and unnatural disasters.  While these aren't directly a DBA problem, if they cause you to fail over to your secondary data center, they just became your problem.  Then there are the smaller but potentially devastating events: forgetting a WHERE clause in production, drive failure, storage corruption or a malicious insider.  A pissed-off employee can cause real issues by deleting or modifying data that is vital to your company.  Be prepared to do an object-level restore or a side-by-side restore to recover data that may have been compromised. 
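
    The side-by-side approach looks something like this: restore a pre-incident copy of the database under a new name, then pull the damaged rows back across.  The table, key and file names here are purely illustrative:

    -- Restore a copy of the database as it looked before the bad statement ran
    RESTORE DATABASE SalesDB_Recover
    FROM DISK = N'X:\Backups\SalesDB_FULL.bak'
    WITH MOVE 'SalesDB'     TO N'T:\Scratch\SalesDB_Recover.mdf',
         MOVE 'SalesDB_log' TO N'T:\Scratch\SalesDB_Recover_log.ldf',
         RECOVERY;

    -- Put back the rows that went missing from the live table
    -- (SET IDENTITY_INSERT ON first if OrderID is an identity column)
    INSERT INTO SalesDB.dbo.Orders
    SELECT r.*
    FROM SalesDB_Recover.dbo.Orders AS r
    WHERE NOT EXISTS (SELECT 1 FROM SalesDB.dbo.Orders AS o WHERE o.OrderID = r.OrderID);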

Building it out. 

    Let's be honest.  Most companies can't afford to build & maintain a hot standby environment equal to their current production.  If yours can, good for you.  Feel free to skip this section.  But if yours is like many companies, DR is currently sharing real estate with staging, QA or UAT.  It's not a hot site, or it barely has the processing power you'd need to run your business.  

    Identify your wants versus your needs.  Lay out what you WANT your DR site to look like and how you want it to function. Identify how you're going to keep it up to date.  Then lay out what you NEED to have for your disaster recovery site.  I suspect the final product will fall somewhere between your wants and your needs.  

    Don't forget hardware, licensing and maintenance.  Plan for enough storage space for the live databases and backups. You'll need enough web servers to run your applications.  Don't forget to factor in the time & the people required to build this out and maintain it on a monthly basis.  You'll have to patch all those servers & keep versions aligned. 
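
    For a first pass at the storage number, a query like this against sys.master_files gives you the current footprint of the live databases; backups, growth and non-database files come on top of that:

    -- Rough capacity planning: size of each database's data and log files today
    SELECT DB_NAME(database_id)      AS database_name,
           SUM(size) * 8 / 1024.0    AS size_mb      -- size is stored in 8 KB pages
    FROM sys.master_files
    GROUP BY database_id
    ORDER BY size_mb DESC;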


How much will it cost you to build?  Probably less than a disaster will cost you in lost revenue, client trust and public relations.