When the lights go out

The power went out here in Austin this afternoon. Not in the office, mind you, only in the data center. The root cause isn't all that important or interesting: some maintenance didn't go as planned, so the DC was dark while street power was unaffected.

The impact, however, is a great illustration of the differences between the server-huggers and the grid-enabled. The former group -- who think it is important to know which server belongs to them and where it's located at all times -- were unable to work for several hours. Their jobs crashed with the servers, and their data was unavailable until the local fileservers came back online. They were standing around in the hallways or leaving for the day. The grid users, on the other hand, had already set themselves up to use shared computing resources in at least one other site, sometimes as many as two or three. While they lost some local state and running jobs, they could go home and log in through VPN, or wait a few minutes until the network infrastructure was back online (networks are almost always first in the power-up sequence). Jobs could be resubmitted and schedules could still be met.

While we often talk about the cost savings and performance improvements of grid computing, we shouldn't overlook the resulting business continuity benefits. If you have deployed a grid, does your BC plan allow a single-site outage to be absorbed by the remaining capacity? If you're considering grid deployment, is business continuity a factor in your decision?
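
For anyone who wants to put a number on that first question, here is a minimal sketch of the check in Python. It assumes capacity and peak demand can be expressed in one common unit (CPU slots, say); the function name, the site names other than Austin, and every figure are hypothetical and only there for illustration.

```python
def survives_single_site_outage(site_capacity, peak_demand):
    """For each site, report whether the remaining sites could still
    carry peak_demand if that one site went dark."""
    total = sum(site_capacity.values())
    return {
        site: (total - capacity) >= peak_demand
        for site, capacity in site_capacity.items()
    }


# Hypothetical capacities in CPU slots; not real figures.
sites = {"austin": 400, "site_b": 300, "site_c": 300}
result = survives_single_site_outage(sites, peak_demand=650)

for site, ok in result.items():
    status = "absorbed" if ok else "NOT absorbed"
    print(f"Loss of {site}: {status}")
```

With these made-up numbers, losing either smaller site can be absorbed, but losing the largest one cannot. A real plan would also have to account for where the data lives and how quickly jobs can be resubmitted, which this sketch ignores.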