...and then it might be too late.
An update from Delta CEO Ed Bastian: pic.twitter.com/udNN0kzbKs (Delta, @Delta, August 8, 2016)
Recently, Delta Air Lines suffered a weeklong outage that, if you take it on its face, ticks just about every box on a security person's disaster recovery planning scenario.
Delta has given multiple interviews on what happened. Although details are still being pieced together, essentially the company had a power issue, and when it tried to switch to backup systems, those systems failed. I managed a data center for several years, and this is what keeps data center managers up at night: the what-if-it-doesn't-come-back-up scenario.
The outage cost Delta millions of dollars in recovery effort, including vouchers given to inconvenienced customers. It severely affected travelers through hundreds of cancelled flights, and Delta likely suffered some reputational damage.
Hindsight being 20/20, this scenario - with good risk management - can be avoided, or at the very least its impact can be reduced, but that takes awareness, resources and buy-in from top management. Ed Bastian, the CEO of Delta, has taken personal responsibility for the failure. We can expect a significant internal review of the disaster preparedness scenarios their IT teams are involved in.
Business continuity and disaster recovery are also a security professional's concern, because they bear on the CIA triad's most demanding principle, Availability. Availability means that information is available to users when they need it. It can take many different forms, depending on the business context. Delta, as a company that operates 24x7x365, relies on availability, and any impact on it also affects the timing of fleet operations. Even minor disruptions cause cascades that severely and adversely affect the business. This sensitivity to availability is one reason it took so long for Delta to fully recover: the longer its systems stayed down, the more the problem was amplified.
"The system moved to backup power but not all the servers were connected to it," Bastian told the WSJ.
Documentation is key to all disaster planning. Your disaster recovery (DR) plan has to spell out what will and will not be part of your backup system. It is very expensive to maintain a full replica of your systems, so your DR plan might account for only a partial recovery. The business risk of a partial recovery must be documented and communicated so everyone understands what will happen in a disaster scenario. Bastian commented, "We did not believe, by any means, that we had this kind of vulnerability."
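One way to make that documentation checkable is to record each system, whether it is on backup power, and what it depends on, and then derive which business functions would be disrupted. The following is only a minimal sketch: every system name, dependency and business function in it is invented for illustration and does not describe Delta's environment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class System:
    name: str
    on_backup_power: bool
    depends_on: list[str] = field(default_factory=list)


def survives_power_loss(name, systems, _seen=None):
    """True only if the system and everything it depends on has backup power."""
    seen = _seen if _seen is not None else set()
    if name in seen:  # already checked (also guards against dependency cycles)
        return True
    seen.add(name)
    system = systems[name]
    if not system.on_backup_power:
        return False
    return all(survives_power_loss(dep, systems, seen) for dep in system.depends_on)


def disrupted_functions(functions, systems):
    """Business functions with at least one required system that would go down."""
    return sorted(
        function
        for function, required in functions.items()
        if not all(survives_power_loss(s, systems) for s in required)
    )


# Hypothetical inventory: names and dependencies are assumptions for this example.
SYSTEMS = {
    s.name: s
    for s in [
        System("core-switch", True),
        System("edge-router", False),
        System("storage-array", True, ["core-switch"]),
        System("booking-db", True, ["storage-array", "core-switch"]),
        System("checkin-app", False, ["booking-db"]),
        System("crew-scheduler", True, ["edge-router"]),
    ]
}

# Hypothetical mapping of business functions to the systems they need.
FUNCTIONS = {
    "reservations": ["booking-db"],
    "airport check-in": ["checkin-app"],
    "crew scheduling": ["crew-scheduler"],
}

if __name__ == "__main__":
    for function in disrupted_functions(FUNCTIONS, SYSTEMS):
        print(f"DISRUPTED on power loss: {function}")
```

In this made-up inventory, crew scheduling is flagged even though its own server is on backup power, because the router it depends on is not. That is the kind of gap a system-by-system list tends to hide and a dependency map tends to expose.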
Use this incident, and the four-day Southwest Airlines outage a few weeks earlier, as a prompt to review your business continuity and disaster recovery plans, or to create them if you have none in place. Test your recovery plans at regular intervals, using both tabletop walkthroughs and actual recovery exercises, and make upper management aware of the outcomes. Use the results to drive improvements in planning for availability, so you can avoid or reduce the impact of a disaster scenario. And always remember to revisit and update your plan at regular intervals, especially at the conclusion of a test, to ensure its information stays up to date and relevant.
At a minimum, your disaster plan should include:
- Paper copies of everything relevant to the disaster plan; online resources will likely be disrupted, even if you have highly available systems or cloud-based ones
- Contact information of all relevant stakeholders in a disaster; C-levels, technicians, business people, customers; anyone who would need to be part of the recovery scenarios; include physical addresses of sites and phone numbers of required resources
- A list of required vendors your organization needs to operate in the event of a disaster scenario; include contact information for those vendors
- A map of all systems which would function on backup power; include all the networking devices between the systems (switches, routers, storage); map the systems to business functions so you can see visually which functions would be disrupted and which would be operational
- Maps of physical locations that are relevant to disaster recovery
- Forms which are critical to business operations, such as supply order forms, injury reports, expense tracking, etc.
- Disaster declaration procedures, and communication procedures (who to contact when, who is in charge of media relations, etc.)
- Checklists and runbooks for operations processes, so that the distractions of a disaster do not disrupt running operations and routine tasks do not depend on memory or specific skills
This is just a short list, and it does not go into the specifics of disaster planning, but it's a good start and validation point. Once you have checked off this list, start to look at recovery time objectives (RTO), recovery point objectives (RPO), and true business continuity processes (processes which allow business to continue uninterrupted, even during an outage). A host of online resources and third-party providers are available to help.
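As a rough illustration of how RTO and RPO can be checked against the results of a recovery test: the RTO bounds how long a service may stay down, and the RPO bounds how much recent data, measured in time, may be lost. The sketch below compares drill measurements against both targets. The function name, timestamps and targets are all hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def check_objectives(outage_start, service_restored, last_good_backup, rto, rpo):
    """Compare measured downtime and data-loss window against RTO and RPO targets."""
    downtime = service_restored - outage_start
    # Data written after the last good backup and before the outage is at risk.
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical drill results and targets, for illustration only.
result = check_objectives(
    outage_start=datetime(2016, 8, 8, 2, 30),
    service_restored=datetime(2016, 8, 8, 8, 30),
    last_good_backup=datetime(2016, 8, 8, 2, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)

if __name__ == "__main__":
    for key, value in result.items():
        print(f"{key}: {value}")
```

With these invented numbers the drill meets the RPO (30 minutes of data at risk against a one-hour target) but misses the RTO (six hours of downtime against a four-hour target), which is the sort of concrete finding worth putting in front of upper management.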
According to the WSJ article, “It's not clear the priorities in our investment have been in the right place,” Mr. Bastian said. “It has caused us to ask a lot of questions which candidly we don't have a lot of answers for.” Upper management can be reluctant to put money into disaster recovery, seeing it more as an insurance policy, which it partially is. Testing isn't just a way to vet your plans; it also ensures that priorities get positioned correctly. If testing shows that not all systems would be recoverable, then investment can be justified.
Disaster planning requires time and some resources to accomplish. This is an investment in the future, and any time invested now will be offset by the reduction in recovery time later, and hopefully, the lessened impact on your business operations.