Disaster recovery lessons from an island struck by a hurricane
- 26 January, 2021 19:35
(A hurricane devastated an island that held two data centers controlling mission-critical systems for an American biotech company. They flew a backup expert with four decades of experience to the island on a corporate jet to save the day. This is the story of the challenges he faced and how he overcame them. He spoke on the condition of anonymity, so we call him Ron, the island Atlantis, his employer Initech, and we don’t name the vendors and service providers involved.)
Initech had two data centers on Atlantis with a combined 400TB of data running on approximately 200 virtual and physical machines. The backup system was based on a leading traditional backup software vendor, and it backed up to a target deduplication disk system. Each data center backed up to its own local deduplication system and then replicated its backups to the disk system in the other data center. This meant that each data center had an entire copy of all of Initech’s backups on Atlantis, so even if one data center were destroyed the company would still have all its data.
Initech also occasionally copied these backups to tape and stored the tapes on Atlantis to provide an air gap. The tapes could have been stored on the mainland but weren’t. Fortunately they survived the disaster, although they could easily have been destroyed. Initech had considered using the cloud for disaster recovery but found it impractical due to bandwidth limitations on Atlantis.
When the hurricane struck, Initech began looking for someone to spearhead the recovery process on the ground. Due to the level of destruction, they knew they needed someone that could handle command-level recovery. There were only a few people with that skill level at Initech, and one of them was Ron. They put him on a private jet and flew him to Atlantis.
There he found an incredible level of general destruction. Specific to Initech, one data center was flooded, which took out the bottom row of servers in every rack but left the servers higher up in the racks untouched. The recovery plan was to move the servers that were still working to the dry data center and recover everything there.
While the overall plan of transferring the servers from one place to another succeeded, Ron said that haste did result in some servers being inappropriately handled. This meant it was harder to reassemble them on the other end of the move. (Note to self: Be nice to servers when moving them.)
The biggest hurdle Ron had to overcome was that the Internet connection between Atlantis and the mainland was temporarily disabled due to the hurricane, which created a major problem. Initech had made the unfortunate decision of relying on the mainland for things like Active Directory, instead of having a separate Active Directory setup on Atlantis. Any AD queries had to go directly to the mainland, which was now unreachable, so they couldn’t log in to the systems they needed to use in order to start the recovery.
They tried a number of options, starting with satellite-based Internet. While this gave them some connectivity, they found they were maxing out their daily bandwidth allotment, after which the satellite ISP would throttle down their connection. They also tried a microwave connection to another ISP. This was a multi-step microwave relay, so the loss of power in any of the buildings in the relay could cause another temporary outage. It turns out it's really hard to have a stable network connection when the infrastructure upon which that network connection relies--buildings and power--aren't stable.
The actual restore turned out to be the easy part. It certainly wasn't quick by any standards, but it did work. The entire process of restoring one data center to another one took a little over two weeks. Considering the state of Atlantis, that's actually pretty impressive.
The backup software they were using was backing up VMware at the hypervisor level, so restoring the 200-plus VMs was relatively simple. Restoring the few physical servers that required a bare-metal recovery turned out to be a little bit more challenging. If you've never performed a bare-metal recovery on dissimilar hardware, suffice it to say it can be difficult. Windows is pretty forgiving, but sometimes things just don't work, and you are required to manually perform many extra steps. Such recoveries were the hardest part of the restoration.
Lessons from a disaster
The first lesson from this disaster is one of the most profound: as important as backup and recovery systems are, they might not pose the most difficult challenges in a disaster recovery. Getting a place to recover and a network to use can prove much more difficult. Mind you, this is not a reason to slack off on your backup design. If anything, it’s a reason to make sure that at least the backups work when nothing else does.
Local accounts that don’t rely on Active Directory would be a good start. Services such as Active Directory that are necessary to start a recovery should have at least a locally cached copy of the service that works without an Internet connection. A completely separate instance of such a service would be much more resilient.
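One way to catch this kind of dependency before a disaster is to inventory which sites host each service a recovery needs. The Python sketch below is illustrative only: the service names, site names, and the `missing_local_dependencies` helper are hypothetical and do not describe Initech's actual inventory or tooling.

```python
def missing_local_dependencies(services, recovery_site):
    """Return the services that have no instance at the recovery site."""
    return sorted(
        name for name, sites in services.items() if recovery_site not in sites
    )


# Hypothetical inventory: service name -> set of sites hosting an instance.
inventory = {
    "active-directory": {"mainland"},
    "dns": {"mainland", "atlantis"},
    "backup-server": {"atlantis"},
}

# Services a recovery on the island would need but could not reach locally.
gaps = missing_local_dependencies(inventory, "atlantis")
print(gaps)
```

With this made-up inventory, the only gap reported for the island is the directory service, which is the kind of finding that would prompt a local instance or local accounts.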
Rehearse large-scale recoveries as best you can, and make sure you know how to do them without a GUI. Being able to log in to the servers via SSH and run restores on the command line is more power-efficient and flexible. As foreign as that seems to many people, a command-line recovery is often the only way to move forward. On Atlantis, electric service was at a premium, so using it to power monitors wasn’t really an option.
Extra hardware can be extra helpful. One problem in disaster recovery is that as soon as you recover your systems, they need to be backed up. But in a recovery like this, you don't necessarily have a lot of extra hardware sitting around to be used for backups. The hardware you do have is working very hard to restore other systems, so you do not want to task it with the job of backing up the systems that you just restored. The cloud could be helpful here, but that wasn’t an option in this case.
You need to plan for how you’re going to back up your servers during and after the disaster recovery, while your primary backup system is busy doing the restore. Initech solved this with its tape library. Prior to the disaster, Initech used tape to get a copy of their backups to a safe location off-site. The primary disk system was being used to its full capacity to perform the restore, so they needed something to perform the day-to-day backup of the newly restored servers. They disabled the off-site tape-copy process and temporarily directed their production backups to the tape library that had previously only been used to create an off-site copy. One great thing about tape is that it has virtually unlimited capacity as long as you have enough extra tapes sitting around. It’s also a lot less expensive to have a lot of extra tape sitting around than it is to have a lot of extra disk sitting around. Given the capacity of Initech’s data center, having enough tape to handle backups for a few weeks would cost less than $1000. The lesson, though, is that you need to plan for how you’re going to do backups while you’re doing a major restore.
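The tape arithmetic can be sketched roughly as follows. Every input here is a placeholder assumption, not a figure from Initech: the daily backup volume, the number of days, the 12 TB cartridge size (the native capacity of an LTO-8 tape, chosen only as an example), and the price per cartridge. Substitute your own numbers.

```python
import math


def tapes_needed(daily_backup_tb, days, tape_capacity_tb):
    """Number of cartridges required to hold `days` of daily backups."""
    total_tb = daily_backup_tb * days
    return math.ceil(total_tb / tape_capacity_tb)


def tape_cost(daily_backup_tb, days, tape_capacity_tb, price_per_tape):
    """Total media cost for the same period."""
    return tapes_needed(daily_backup_tb, days, tape_capacity_tb) * price_per_tape


# Placeholder inputs (assumptions): 5 TB/day for 21 days,
# 12 TB cartridges at 60 dollars each.
count = tapes_needed(5, 21, 12)
cost = tape_cost(5, 21, 12, 60)
print(count, cost)
```

The point of running the numbers ahead of time is to know how many spare cartridges to keep on the shelf before the primary disk system is tied up with restores.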
Automatic backup inclusion is the way to go. All modern backup software packages have the ability to back up all VMs and all drives on those VMs, but not everybody uses this feature. Initech – like a lot of companies – tried to save some money by only including certain filesystems in its backup. This meant they missed a number of important filesystems because they had not been manually selected. Lesson: Use your backup software's ability to automatically back up everything. If you know something is complete garbage you can manually exclude it. But manual exclusions are way safer than the manual inclusion design that Initech chose for some of their systems.
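The difference between the two designs can be shown with a small Python sketch. It is not how any particular backup product implements selection; the function names, filesystem paths, and lists are all hypothetical.

```python
from fnmatch import fnmatchcase


def select_by_inclusion(filesystems, include_patterns):
    """Manual inclusion: back up only what someone remembered to list."""
    return [
        fs for fs in filesystems
        if any(fnmatchcase(fs, p) for p in include_patterns)
    ]


def select_by_exclusion(filesystems, exclude_patterns):
    """Automatic inclusion: back up everything except explicit exclusions."""
    return [
        fs for fs in filesystems
        if not any(fnmatchcase(fs, p) for p in exclude_patterns)
    ]


# Hypothetical server: "/data/lab" was added after the lists were written.
discovered = ["/", "/home", "/var/lib/db", "/scratch", "/data/lab"]
include_list = ["/", "/home", "/var/lib/db"]
exclude_list = ["/scratch"]

included = select_by_inclusion(discovered, include_list)
auto = select_by_exclusion(discovered, exclude_list)

# Filesystems that only the exclusion-based design protects.
missed = sorted(set(auto) - set(included))
print(missed)
```

In this example the newly added filesystem is silently skipped by the inclusion list but picked up by the exclusion-based selection, which is why a forgotten exclusion costs some storage while a forgotten inclusion costs data.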
You need to figure out where your recovery people are going to sleep! In a major disaster there are no hotel rooms, so plan in advance and make sure you have on-site capabilities to house, bathe, and feed your IT people who will be living in that building for quite a while. Ron was told to bring his sleeping bag, but there should be brand new sleeping bags, inflatable mattresses, and toiletries available on-premises. In addition, look into emergency food rations. Initech was able to feed Ron and his colleagues, but it certainly wasn’t easy. Buying and maintaining these supplies is a small price to pay for keeping your recovery crew rested and fed.
DR tests that only test a piece of the disaster are completely inadequate to simulate what a real disaster will be like. It’s hard to test a full disaster recovery, but had Initech actually done such a test, it could have identified some inaccurate assumptions about an actual recovery. The more you test the more you know.
Finally, testing performance is not a predictor of actual performance. Even if you perform a full DR test, the real thing is going to be different. This is especially true if you're dealing with a natural disaster that floods your data center, sets it on fire, or even blows it to smithereens. You can do your best to try to account for all of these scenarios, but in the end what you also need are people who can react to the unexpected on the ground. In this case, Initech sent a seasoned veteran who turned out to be exactly the right person for the situation. He and the other IT people rolled with the punches and found a way to recover. Even with all the modern IT systems that are available, people are still your best asset.
Food for thought
A few questions to consider as you plan disaster recovery:
- Are there faulty assumptions in your backup design?
- Have you looked into alternate communications systems in case your main connection is taken out?
- Do you know where you would house a bunch of IT people who need to be very close to your data center?
- How confident are you of your ability to succeed in such a disaster?
If you don’t have good answers to these questions, maybe a few Zoom sessions are in order.