A disaster recovery plan, or DRP, is a documented process that lays out specific procedures to follow when an organization experiences a disaster (often involving data loss). It’s designed to minimize data loss and business disruption and, most importantly, to get an organization back on its feet as quickly as possible.
An IT Disaster Recovery Plan is an important component of a larger business continuity plan (BCP). In this article, we’ll define what a disaster recovery plan is, why it’s important, and what elements it should encompass. Even if you already have some policies and procedures in place, it’s essential to:
With ever-changing technology, evolving cyber risks, and employee turnover, developing and maintaining a DRP must never be a “set it and forget it” exercise.
For a quick and simple guide to developing a disaster recovery plan template, review our Disaster Recovery Checklist. For a deeper understanding, dive in below.
Imagine yourself in these scenarios:
All of these examples are true stories of data disaster, and all could have been mitigated by a thorough disaster recovery plan.
A successful disaster recovery plan will help you:
Let’s review some of the most common types of disasters you’ll want to cover in your disaster recovery plan.
Natural disasters can include highly localized events like a lightning strike causing a fire in your data center, larger disasters like city-wide blackouts caused by storms, or widespread catastrophes like hurricanes or wildfires.
Make sure when you develop your DRP, you’re thinking about the full range of natural disasters from the smallest to the largest, what systems they could affect, and what resources may or may not be available to you during a time of crisis.
Also keep in mind that when we think of the word “disaster,” what often comes to mind is a natural disaster. While you should diligently prepare for natural disasters, your disaster recovery plan should also encompass man-made disasters like political unrest and energy shortages, as well as potential public health disasters like epidemics and sudden environmental hazards.
Cybercrime is on the rise. Until 2022, human error was the largest cause of data loss; now, for the first time, cyberattacks have overtaken it. Here are some common attack vectors that can give access to hackers and lead to data loss:
When malicious parties gain access to your data using these and other tactics, they can do any combination of the following:
Hardware failure is one of the top causes of data loss, and it can create a huge disruption when you least expect it. Endpoints, on-site servers, and external drives are all potential points of hardware failure. Hard drives are among the most fragile parts of computers, and there are numerous ways they can be damaged or simply fail. And even cloud storage solutions with multiple layers of protection aren’t completely immune from hardware failure.
Any organization is vulnerable to data loss due to hardware failure, but small businesses are especially likely to suffer from this as they typically house servers on-premises rather than in a managed data center, and they’re less likely to back up their files regularly (and test those backups).
Let’s face it, nobody’s perfect, and anyone who’s ever forgotten to save their work regularly knows that unique feeling of terror right after an application crashes. As frustrating as it is to lose an afternoon’s worth of work on a big presentation, the consequences of human error are not limited to data on a single device. According to a study by Stanford University, around 88% of all data breaches are caused by employee error.
Having clear policies, keeping current on employee training, and automating as many processes as possible are all ways to help cut down on the probability of human error.
Some examples of human error include:
There are many different ways to slice and dice the stages of a disaster recovery plan. Here, we’ll break it down into five stages: Preparation, Assessment, Restoration, Recovery, and Lessons Learned.
Conduct a risk analysis. Preparing for a natural disaster will look different based on your geographical location. Maybe you’re located somewhere that tends to get hit with rolling blackouts, like California during fire season. Or you could have facilities in the path of hurricanes on the Atlantic coast, or along a fault line.
When it comes to human-caused disasters, the likelihood of various incidents depends in part on your industry and your user profile. For example, if you work in the manufacturing or healthcare industries, you should be aware that they’re the top two industries targeted by ransomware. And if your users are less tech-savvy, they’re more likely to fall victim to a phishing attack.
Determine potential points of failure. Assess your current state. Are your authentication protocols up to date? Are your physical failovers – like backup power generators or alternate networking equipment – in good working order? Are your files actively being backed up and have you recently tested restoring them? Are your partners staying up to date on their security certifications?
Identify a response team. Different types of disasters will require different disaster response team members. Make sure each person you’ve identified knows their role and be sure to designate a backup in case there’s employee turnover or someone’s on vacation when disaster strikes.
Document everything. And be sure everyone on the team knows where to find the documentation. In addition to documenting your disaster recovery processes themselves, also document things like technical specs, insurance policies, emergency contact information, and relevant government or community resources.
Practice, practice, practice. Disasters are a matter of when, not if. Think how horrified you’d be if a whitewater rafting guide brought you down a new river without doing a test run. It’s the same with disaster planning. With practice, you’ll find hidden obstacles ahead of time, and be able to respond quickly and competently when the time comes.
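One piece of practice that is easy to automate is the restore test mentioned above. The following is a minimal sketch, not a definitive implementation: it assumes you have restored a backup into a scratch directory and want to confirm that every file from the source exists there with identical contents. The function names `sha256_of` and `verify_restore` are hypothetical and chosen for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """List files that are missing from, or differ in, the restored copy."""
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"changed: {relative.as_posix()}")
    return problems
```

An empty list means the restored copy matches the source. In a real drill the source may have changed since the backup ran, so you would more likely compare against checksums recorded at backup time; this sketch skips that step for brevity.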
Declare the event. The first step in assessing a disaster is to declare the event and notify leadership and your response team. Determine your chain of command based on the type of incident and the team you’ve previously identified. Share necessary information with employees, customers, and any relevant authorities.
Keep in mind that how you communicate is just as important as what you communicate. As a team, decide upon necessary audiences (customers, prospects, employees, authorities) and draft communications to be sent as rapidly as possible. Calm, clear, correct communication can be the difference between successful containment and a PR calamity.
Assess current state. Is the disaster ongoing? What can be done now to mitigate further loss, and what is currently out of your control? When dealing with a natural disaster, physical safety should be your true north.
Take inventory. What’s good, what’s lost, what’s potentially recoverable, and what’s destroyed? Take stock of your physical assets like facilities, servers, and products, as well as your digital ones like customer-facing websites, financial databases, and files on users’ computers.
Get back up and running. Here’s where all your preparation pays off. At this point, you know what you need to do and can immediately begin executing your plan. At this stage of your plan, time is of the essence. ITIC’s Global Server Hardware Security Survey in 2022 found that the average hourly cost of downtime is more than $300,000 – and 44% of medium and large businesses report that an hour of downtime could cost their businesses over $1 million.
Activate your failovers. Depending on your needs and your recovery point objectives and recovery time objectives, you may have full redundancy in some of your systems, or you may have to spin up alternate hardware or set up alternate physical sites.
Keep lines of communication open. Make sure to keep updating your customers, clients, employees, and/or authorities as you work to restore services. In your initial communication with stakeholders, define an update frequency and stick to that cadence even if just to say “We’re still working on it.”
Confirm everything is working. Now that the crisis has passed, you’ll want to methodically check all your systems to make sure everything is working properly. This is where you can rely on the documentation you had at the outset.
Recover lost data, if possible. Once your operations are restored, attempt to recover any lost data not already addressed. Depending on your data retention policies and RPO decisions, you may lose varying amounts of data. If you’ve used a 3-2-1 backup strategy (three copies of your data, on two different types of media, with one copy offsite), you should have at least one other copy of data from which to restore, even if a large-scale disaster (or terrible coincidence) were to take out more than one copy of your important data at the same time.
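Whether a given set of copies actually satisfies the 3-2-1 rule (three copies of the data, on two different media types, with at least one copy offsite) is simple to check mechanically. The sketch below is illustrative only: the `BackupCopy` type, the `check_3_2_1` function, and the media labels are assumptions made for this example, and the list of copies is taken to include the primary copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label for the copy
    media: str     # e.g. "local-disk", "tape", "cloud"
    offsite: bool  # stored away from the primary site?


def check_3_2_1(copies: list[BackupCopy]) -> list[str]:
    """Return the 3-2-1 requirements that the given copies fail to meet."""
    gaps = []
    if len(copies) < 3:
        gaps.append("fewer than 3 copies of the data")
    if len({copy.media for copy in copies}) < 2:
        gaps.append("fewer than 2 different media types")
    if not any(copy.offsite for copy in copies):
        gaps.append("no offsite copy")
    return gaps


copies = [
    BackupCopy("primary", "local-disk", offsite=False),
    BackupCopy("office-tape", "tape", offsite=False),
    BackupCopy("cloud-backup", "cloud", offsite=True),
]
print(check_3_2_1(copies) or "3-2-1 satisfied")
```

An empty result means all three conditions hold. It does not prove that any copy is restorable, which is why restore tests still matter.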
Conduct a debrief. Get together with your disaster recovery team and discuss what went well, what went wrong, and/or what unexpected issues you encountered. Identify gaps in the initial preparation AND execution of your plan. It is important at this point to conduct this exercise in the model of a blameless post-mortem. Things broke. Mistakes were made. Assigning blame to team members is unhelpful to future success.
Integrate learnings into your disaster recovery plan. There will inevitably be something you wished you’d thought of earlier. This is your chance to document everything you’ve learned and update your DRP so you can improve your disaster response next time around.
Like the Scouts’ motto goes: “Be Prepared.” In so many areas of life, preparation is key to both peace of mind and avoiding or minimizing bad outcomes. Disaster preparedness that safeguards your essential business data is no different. We briefly outlined some of the major benefits already, but let’s dive into a few in more depth.
Recovery time objective (RTO) refers to how quickly data must be made available after an outage without significantly impacting the organization. A short RTO is essential for operations that are business-critical or time-sensitive – like customer-facing websites, or files that were being used by employees at the time of the outage. You can set a longer RTO for things that are less critical, which allows you to turn your immediate focus and resources towards the most urgent operations.
Recovery point objective (RPO), on the other hand, refers to the maximum allowable amount of data that an organization believes it can lose without crippling the business. Defining an RPO necessitates that the organization accept two facts:
The first step in defining an RPO is to classify your data and understand where it’s stored and whether it’s being backed up. From there, you can negotiate as a business over costs, risks, and impact.
For example, if you’re running tape backups of an important transactional database once a day, you would lose up to a day’s worth of data when the primary system experiences an outage. Is that acceptable? Is there an opportunity to add additional online redundancy to that system and is it worth the cost (in time, money or both) to mitigate that risk? All of those considerations must be taken into account for business data at every level of your classification schema.
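The daily-backup example can be turned into a quick check across your data classes. This is a rough sketch under a simplifying assumption: with periodic backups, the worst-case loss equals one full backup interval (an outage just before the next run), ignoring how long the backup itself takes. The system names, intervals, and RPO values are hypothetical.

```python
from datetime import timedelta

# Hypothetical data classes: (backup interval, agreed RPO)
SYSTEMS = {
    "transactional-db": (timedelta(days=1), timedelta(hours=1)),
    "file-shares": (timedelta(hours=4), timedelta(hours=8)),
}


def rpo_gaps(systems: dict) -> dict:
    """Return {name: shortfall} for systems whose worst-case loss exceeds the RPO."""
    gaps = {}
    for name, (interval, rpo) in systems.items():
        # Worst case: the outage hits immediately before the next scheduled backup.
        worst_case_loss = interval
        if worst_case_loss > rpo:
            gaps[name] = worst_case_loss - rpo
    return gaps


for name, shortfall in rpo_gaps(SYSTEMS).items():
    print(f"{name}: backups fall short of the RPO by {shortfall}")
```

Here the daily backup of the transactional database misses a one-hour RPO by 23 hours, which is exactly the kind of gap the business needs to either fund or formally accept.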
As you construct your plan, you’ll likely need to make tradeoffs on RTO and RPO, as you may not have the resources for layers of redundancy and continuous backups on everything. Thinking strategically ahead of time ensures that the business is aware of its exposure in the event of an incident, which makes it much easier to recover in a timely manner.
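One rough way to reason about these tradeoffs is to estimate what an outage would cost if recovery took the full RTO, then rank systems by that exposure. The sketch below is only an illustration: the `System` type, the system names, the per-system hourly costs, and the RTO values are all made up for the example and should be replaced with your own estimates.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    hourly_downtime_cost: float  # hypothetical estimate, in dollars
    rto_hours: float             # agreed recovery time objective


def rank_by_exposure(systems: list[System]) -> list[tuple[str, float]]:
    """Sort systems by the cost of an outage lasting the full RTO, highest first."""
    exposure = [(s.name, s.hourly_downtime_cost * s.rto_hours) for s in systems]
    return sorted(exposure, key=lambda item: item[1], reverse=True)


systems = [
    System("customer-website", 50_000, 1),
    System("finance-database", 20_000, 4),
    System("internal-wiki", 500, 24),
]

for name, cost in rank_by_exposure(systems):
    print(f"{name}: ${cost:,.0f} exposure at full RTO")
```

In this made-up example the finance database carries the largest exposure despite a lower hourly cost, because its RTO is longer. That is a hint that shortening its RTO may be worth more than adding further redundancy to the website.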
Having a clear understanding and alignment on your organization’s risk tolerance is a critical foundation to disaster recovery planning. Once you have your RTOs and RPOs defined, you’ll use your disaster recovery plan to identify concrete tactics to meet your recovery point and recovery time objectives. A good disaster recovery plan can even uncover ways to exceed those objectives and further minimize risk.
There are countless examples of customers jumping ship and stock prices plummeting after a data breach. It can take years to repair a brand’s tarnished reputation. According to a 2019 survey by PingIdentity, 81% of people would stop engaging with a brand online following a breach, and only 14% of respondents would readily sign up for and use an application or service following a breach.
The good news is that your disaster recovery plan can mitigate these dismal outcomes. By demonstrating and communicating to your customers and the public that you’re on top of the situation, your organization retains trust with your market. When faced with a data disaster, this can mean the difference between a public relations nightmare and simply a bad day.
During the Preparation stage of your disaster recovery plan, you can define ways to build a foundation of trust with your customers and the public. Some of these may include:
You can also include protocols that help to preserve trust during the Restoration stage of your DRP:
Implementing initiatives to gain and keep customers’ trust is an important and sometimes overlooked part of a DRP, and it helps preserve your organization’s reputation. This leads to better customer retention and fewer financial losses when there’s a crisis. In the eyes of external stakeholders, it is often less about whether an organization experiences a data-loss incident and more about how it responds when it does. Having a plan in place beforehand will help ensure your organization rises to the challenge.
In a well-known case of a mishandled data breach, the CSO of a popular ride-sharing app covered up a data breach and instead paid a $100,000 ransom to restore the stolen data. Not only did this executive’s action result in their termination, but they were also later convicted of obstruction of justice for the attempt to cover up the incident. This is not a good outcome for anyone, and it could have been prevented with a disaster recovery plan for ransomware.
Legal liability isn’t just limited to individuals. If a company is found negligent in its handling of customer data, it will find itself vulnerable to lawsuits and/or regulatory penalties. Using a disaster recovery plan, you can do your due diligence and show that when data loss does occur, it’s not due to negligence and there is a plan in place to minimize the impact and address shortcomings. This will save your organization time and headaches.
Because this section talks about legal liability we want to make it clear that none of this amounts to official legal advice. Laws and regulations vary by industry and situation. There are people who have devoted their entire professional careers to this pursuit. Consult with a lawyer if you want more specifics on how to protect yourself and your business from potential liability.
One last thing we should say about disaster recovery planning: it doesn’t have to be overly complicated to still be worth doing. In fact, if after reading this you feel intimidated, we have unfortunately done you a disservice.
If you do nothing else after reading this article, take some time to review what policies you currently have in place. Do they make sense? Do you know where all your data lives? Is it backed up? Do the relevant stakeholders understand their roles? Shore up what you currently have and then make a plan to expand. If disaster befalls you, you’ll be glad you were better prepared.
Looking for more guidance? Review our guide on how to create your own disaster recovery plan.
Learn more about how CrashPlan is built to protect your data and help you bounce back from disasters.