Monday, January 6, 2014: Not since the blizzard of ‘78 had a day started in this way. After receiving close to 18” of snow on Sunday, temperatures plunged, with wind chills of 40 and 50 below. My phone rang early that morning. It was our CEO. We needed to have an executive conference call. The mayor had declared a travel ban. We needed to shut down all of our locations (some 80 locations). We needed to let employees know. Very quickly we initiated our communication plan.
Within minutes of settling in on the couch to watch more of the morning news coverage of the Blizzard of ‘14, my phone rang again. This time it was our Senior Director of IT Operations. His words sent a chill down my spine, even though I was warm and cozy inside. “There is a major power outage in downtown Indianapolis. Our headquarters is without power. All of our servers have shut down, we are dead in the water.”
“Well, at least we are shut down today,” I thought.
“IPL says it will be 48 hours or longer until we have power. The team is on a Google Hangout right now discussing options.”
I jumped on the Hangout and got a quick recap of the situation. Everything at corporate is down (including our server room). UPS batteries are depleted and the servers have shut down. IPL says 48 to 72 hours. Three options: 1) declare a disaster and begin recovery to our warm site on the west side; 2) rent a generator, have it delivered and installed, and power the server room; or 3) wait. Knowing our history of disaster recovery testing, I advised the team to explore the feasibility of option 2, while I instituted Phase I of our Business Continuity Plan by sending notice to the Executive Team and firing up a conference bridge (the Executives weren’t quite Google Hangout savvy).
With the Hangout still live in my home office, I explained the status to our CEO, COO and CFO. A disaster recovery would realistically take 24 hours AND it was one-directional. Once live at the warm site, it would take weeks or months of planning to “come back” to corporate. The generator option was a possibility, but we didn’t know yet. Or, we wait. The line was silent. None of those options were appealing. I quickly pointed out the good news: email was still working! (My boss, the CEO, loves it when I point out the obvious, especially when it underscores what a great decision it was to move to Google and the cloud. It’s kind of like telling your wife “I told you so”! Goes over VERY well!)
Putting the conference bridge on hold, I jumped back to the Hangout just in time to hear someone ask, “Did anyone check the power at the Warm Site?” GREAT question, since our warm site was eight miles due west of headquarters and the storm had rolled in from the west. I could hear the click, click, clack, click of Daniel checking the status. “Scratch option one. The Warm Site is dead too.”
With that, I went back to the conference bridge. Since doing nothing seemed like a CLM (career limiting move), I informed them the Warm Site was down, and that we were executing option 2.
Fast forward to July. It was now about 130 degrees warmer. We were hosting visitors from another Goodwill organization in our board room. Down the hall, we were conducting a Disaster Recovery Test with our Mission Partners (that’s what we call our users...I hate that word...so we call them Mission Partners).
Let me repeat that in case you missed it. The CIO was sitting in a conference room talking with three or four executives from another Goodwill organization, while his team was conducting a disaster recovery test, complete with Mission Partner testing.
OK, if you are still not seeing the nuance, let me give you some background. In 2009, we went live with our Business Continuity and Disaster Recovery Plan, including our warm site. Our investment was about nine months’ work and $500,000. That fall, we conducted a Recovery Test; our users (because that’s what we called them then) all gathered at the warm site to test their systems. Everything passed.
In 2010, we had a new CIO (me!), a new systems engineer, and a couple of other new staff members. It was closing in on time to do our annual Business Continuity Test (including a mock scenario). Our systems engineer reviewed the documentation, spent some time at the warm site, and then came into my office. “I don’t know how they did it. They had to fake it! It had to be smoke and mirrors. There is no WAY they recovered the systems! I need two months to prepare.”
GREAT: new CIO, successful test last year, and we need two months to prepare this year. THAT is going to go over well. Guess I will have to use another punch on the New Guy Card! Two months later, we conducted another successful test.
October 2011, time for another test. I called the engineer into my office. “Are we ready?” I asked.
“Well, we’ve made some changes to the environment that have not been replicated to the DR site, you see, we’ve been busy. I need a couple months to get ready.”
With steam coming out of my ears, I let him know we needed to be ready, we needed to document, and we needed to keep the environments in sync (shame on me, I thought we were doing all of that!).
A couple months later, we conducted the test. While it was declared successful, there were some bumps. At our lessons learned meeting, the team was, well, they were whining about not having enough time. After listening, I asked, “If we had a disaster today, would we be ready?” Again, after several minutes of this and that, I asked, “If we had a disaster today, would we be ready?” After about a minute of this and that, I interrupted, “I am declaring a disaster. This is a test and only a test. However, we are implementing our recovery NOW!”
After looking at me for several minutes and then realizing I was serious, the team headed out to the warm site to recover our systems...again.
It was now fall of 2012. I was sick of the words “Disaster Recovery Test”, yet it was that time again. We had a new systems engineer, the prior one leaving earlier in the year. I stopped by his desk to ask about our preparedness for our disaster recovery test. “I’ve been looking at it. I don’t know how it has ever worked. They must have faked it. It had to be smoke and mirrors. I need two months.” Given he was now The New Guy, I let him punch his New Guy Card and gave him the two months. The test was successful.
Now do you see it? The CIO was sitting in a conference room talking with three or four executives from another Goodwill organization, while his team was conducting a disaster recovery test, complete with Mission Partner testing. After that history? How could DR testing be a non-event?
It started early in 2013. I was in my office with John Qualls of Bluelock and Steve Bullington of TWTelecom (Level 3). They were describing a new product and service from Bluelock and a partnership with TWTelecom. Bluelock was touting RaaS, or DRaaS if you will: Disaster Recovery as a Service, paired with TWTelecom’s new Dynamic Capacity bandwidth. What?!!? You mean I can get rid of the warm site? Replace it with the elastic capacity of DR in the cloud? Combined with a team of professionals to manage it all? Leveraging bandwidth that can be dynamic based on our needs? All for less than I was spending today to depreciate the warm site investment? No more smoke and mirrors? No more two months to prepare? Seemed like a no-brainer! Where do I sign up?
As luck would have it, our initial investment would be fully depreciated in the 3rd quarter of 2013. We were faced with a forklift upgrade to replace our servers and SAN at the warm site. The ROI was overwhelming. Due to competing priorities, we slated this project to start in mid-December so it would be complete by early in 2014. (If only I had a crystal ball!)
The project itself was pretty straightforward: establish the connectivity between the sites, install the Zerto agents on our servers, replicate the data, and test! Easy-peasy! We did experience some challenges (shocked to hear that, aren’t you?). The biggest were visibility into our own environment, the initial seeding of the replication, and design hangover.
The visibility issue really could be summed up by “we didn’t know what we didn’t know” about our own environment. Over time there had been a lot of cooks in the kitchen. We had a lot of servers we weren’t quite sure about, combined with terabytes of data we weren’t sure of either; it took a lot of research to straighten the spaghetti (see how I did that? cooks in the kitchen...straighten the spaghetti...oh, nevermind, back to the story).
The next challenge was the initial seed. Even though we knew the amount of data that had to replicate, and we sized the pipe accordingly, it was still taking an inordinate amount of time to create the first replication. Leveraging the Dynamic Capacity feature, we tripled the size of the pipe. It still took longer than anticipated; our own infrastructure became the limiting factor.
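To put that seeding challenge in perspective, here is a back-of-the-envelope sketch of best-case seed times. The 15 TB figure comes from our environment; the link speeds are purely illustrative assumptions, not our actual circuit sizes, and the math assumes sustained line-rate transfer with no protocol overhead or infrastructure bottlenecks (optimistic, as we learned):

```python
# Back-of-the-envelope estimate of initial replication ("seed") time.
# Assumes the link runs at full line rate the entire time -- in practice,
# our own infrastructure became the limiting factor well before the pipe did.

DATA_TB = 15  # roughly the amount of data we had to replicate

def seed_days(data_tb: float, link_mbps: float) -> float:
    """Days to push data_tb terabytes over a link_mbps link at line rate."""
    bits = data_tb * 1e12 * 8           # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)  # bits / (bits per second)
    return seconds / 86400              # seconds -> days

# Illustrative pipe sizes: a base circuit, the same circuit tripled via
# dynamic bandwidth, and a full gigabit.
for mbps in (100, 300, 1000):
    print(f"{mbps:>5} Mbps -> {seed_days(DATA_TB, mbps):.1f} days")
```

Even tripling the pipe only shrinks a two-week seed to under five days at theoretical line rate, which is why the first replication dominates a project timeline far more than the steady-state change traffic that follows it.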
The final challenge, one I like to call “design hangover”, was all about how to provide an environment to our Mission Partners in which they could adequately test their applications. After whiteboarding option after option, none of which really provided a great answer, I asked a couple of questions. “So, it sounds like we are jumping through huge hoops to give a window to our Mission Partners, what happens in a real disaster? Do we have to go through all this?”
“No, because prod won’t exist. You don’t have to worry about duplicate addressing, you don’t have to worry about changing IPs, you just see the recovered data. Look, I can show you right now, I can log into our portal at Bluelock and show you our data from my laptop.”
“So? We are going through all this, so our Mission Partners can go to the Business Continuity Site and test their applications? If it were a real disaster, they can go to the site and see their applications, no problem? What if we just let them come to this conference room, access their applications through your laptop and test? Would that be a valid test?”
“Well, yeah...we thought they had to test from the BC site.” (Translated, that means “because we always did it that way.”) I offered to raise it with the rest of the executive team, but I thought they would much rather have their teams walk down the hall to a conference room to test than drive across town and test.
Sure enough, they were all for it!
If only we had been done before the blizzard of 2014!! Our results were phenomenal. First, we had true network segregation between our DR environment and production. Second, our Recovery Time Objective (RTO) was under two hours!! (Disclaimer: our SLA was actually four hours on some servers and eight hours on others, but the whole thing finished in under two hours. 100 VMs; 15 TB of data.) Third, our Recovery Point Objective (RPO) was THIRTY SECONDS! Yes, an RTO of two hours and an RPO of 30 seconds of data loss. Fourth, our system architect and our system admin did absolutely nothing! Our CFO called Bluelock...gave them his code...and hung up the phone. Two hours later, our System Architect’s phone rang. “Your recovery instance is ready to test.” BOOM! That’s it! I’ve been around long enough to know there is no such thing as a silver bullet in IT, but this was pretty damn close.
Oh, and one more benefit? The response time of the applications during the test, using our recovered instance sitting in Las Vegas, was FASTER than the response time of production sitting 30 feet away in our server room!
Another C-F Project off the books! We now spend longer dreaming up the scenario for the re-enactment than we do preparing for and executing the DR test. So, what have our system architect and system admin done with their extra time? How about spending time in Retail to understand the business needs and designing solutions for a queuing system to speed up the checkout lines, or designing the in-store digital displays for mission messaging throughout the store, or redesigning the power delivery to the POS systems to provide extra run-time for less money, or designing SIP trunking for our VOIP system to provide call tree capabilities...or...or…
And what of that Blizzard of ‘14? We were lucky! Power was restored shortly after noon on the first day (thank YOU IPL!), before the generator was even connected. We dodged that bullet and now we are armed with a silver bullet!
Next month, we will explore a project that did more to move us toward being a Value-add revenue generating partner than just about any other: Amplify Your Value: Just Another Spoke on the Wheel.
The series, “Amplify Your Value” explores our five year plan to move from an ad hoc reactionary IT department to a Value-add revenue generating partner. #AmplifyYourValue
Author’s note: In the interest of full transparency, to paraphrase the old Remington Shaver commercial from the ‘70s: “I liked it so much, I joined the company.” In October of this year, I will leave Goodwill to join Bluelock as the EVP of Product and Service Development. My vision is to help other companies experience the impact Goodwill has felt through this partnership.
We could not have made this journey without the support of several partners, including, but not limited to: Bluelock, Level 3 (TWTelecom), Lifeline Data Centers, Netfor, Zerto and CDW. (mentions of partner companies should be considered my personal endorsement based on our experience and on our projects and should NOT be considered an endorsement by my company or its affiliates).
Jeffrey Ton is the SVP and Chief Information Officer for Goodwill Industries of Central Indiana, providing vision and leadership in the continued development and implementation of the enterprise-wide information technology and marketing portfolios, including applications, information & data management, infrastructure, security and telecommunications.
Find him on LinkedIn.
Follow him on Twitter (@jtonindy)
Add him to your circles on Google+
Check out more of his posts on Intel's IT Peer Network
Read more from Jeff on Rivers of Thought