Recently, a colleague and I spoke to a group of IT administrators in Washington, DC. We left our car in a self-park lot where the attendants had everyone leave their keys in their cars instead of keeping them on a valet "key board". They seemed to be depending on reasonably honest customers (we were in a secure area past a government checkpoint) and their own memories to ensure no cars were "lost". We returned to find that the attendants had completely rearranged the vehicles. Since ours was a rental, we had trouble describing it and therefore trouble finding it. (By this point you're probably thinking that I've posted to the wrong board or that Intel pays me by the word, but bear with me.)
It took a rather lengthy iterative search, but we eventually found the car. As we walked, my colleague and I joked that this was "parking lot virtualization": our vehicle had been moved from one slot to another to better fulfill the changing needs of the parking environment over time. This struck a chord with us, since we had just been discussing some of the challenges of virtualization.
In the data center, most virtualization suites allow an administrator to manually move a workload from one host to another. This is a very powerful concept - instead of having to negotiate for a 3:00 a.m. Sunday maintenance window to do preventative hardware maintenance, we can move all of the workloads to another physical machine, perform maintenance during normal working hours, and eventually move the workloads back to their original location. We can also migrate workloads from a less powerful machine to a newer one, either to improve performance or to retire old hardware.
Combine this capability with the ability to host multiple workloads on a single piece of hardware, and the data center can quickly become very complex. Without a robust database to map each workload to its physical machine (and vice versa) and an automated mechanism to update these mappings after a move, we can easily lose track of our services. These mappings are needed in order to answer questions like "host/rack/row/room x went down - what services need to be restarted?"
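To make the mapping problem concrete, here is a minimal sketch of a two-way workload-to-host map that is updated on every move and can answer the "what went down?" question. The class name, method names, and host/rack labels are all made up for illustration; a real deployment would keep this in a CMDB or in the virtualization suite's own inventory rather than in memory.

```python
from collections import defaultdict


class PlacementMap:
    """Illustrative two-way map between workloads and physical hosts."""

    def __init__(self):
        self.host_of = {}                     # workload -> host
        self.workloads_on = defaultdict(set)  # host -> set of workloads
        self.location_of = {}                 # host -> (room, row, rack)

    def add_host(self, host, room, row, rack):
        self.location_of[host] = (room, row, rack)

    def place(self, workload, host):
        """Record a placement or a move, keeping both directions in sync."""
        if host not in self.location_of:
            raise KeyError(f"unknown host: {host}")
        old = self.host_of.get(workload)
        if old is not None:
            self.workloads_on[old].discard(workload)
        self.host_of[workload] = host
        self.workloads_on[host].add(workload)

    def affected_by(self, room=None, row=None, rack=None, host=None):
        """Return the workloads on hosts that match every field given."""
        hit = set()
        for h, (h_room, h_row, h_rack) in self.location_of.items():
            if ((host is None or h == host)
                    and (room is None or h_room == room)
                    and (row is None or h_row == row)
                    and (rack is None or h_rack == rack)):
                hit |= self.workloads_on.get(h, set())
        return hit


# Hypothetical inventory: two hosts in neighboring racks.
pm = PlacementMap()
pm.add_host("host-a", room="r1", row="1", rack="7")
pm.add_host("host-b", room="r1", row="1", rack="8")
pm.place("mail", "host-a")
pm.place("web", "host-a")
pm.place("web", "host-b")                 # "web" is migrated to host-b
print(sorted(pm.affected_by(rack="7")))   # ['mail']
```

The point of routing every move through `place` is that the mapping cannot drift from reality as long as nothing moves a workload behind its back, which is exactly the discipline the parking lot lacked.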
My colleague noted that ITIL has mature, well-defined mechanisms to deal with many of these types of events. Change orders, maintenance escalations, and configuration databases were all designed with these business processes in mind, albeit at a much slower (and more manual) pace. It would defeat much of the benefit of virtualization if one had to obtain a signed piece of paper, get email approval, or file a trouble ticket in order to offload a workload in response to a failed CPU fan. Instead, we should use policy to anticipate and enact these types of responses. The discipline and rigor of change management are critical within the virtualized data center, but they must be directly encapsulated by our tools in order to be effective. In essence, the CMDB needs to be dynamically updated in order to maintain fidelity to the data center's logical state at any given instant.
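As a rough sketch of what "encapsulating change management in the tools" could look like, the snippet below applies a pre-approved policy to a hardware event and writes the change record in the same step. Everything here is an assumption for illustration: the policy table, the event names, and the plain dictionary standing in for the CMDB. It does not call any real virtualization or ITIL tool, and a real handler would trigger the suite's own migration mechanism where the comment indicates.

```python
from datetime import datetime, timezone

# Hypothetical policy table: hardware event -> pre-approved response.
POLICIES = {
    "cpu_fan_failed": "evacuate",
    "ecc_error_corrected": "log_only",
}


def handle_event(event, host, cmdb, spare_host, change_log):
    """Apply the pre-approved policy for an event and record the change.

    cmdb maps workload name -> host name and stands in for a real CMDB.
    """
    action = POLICIES.get(event)
    if action is None:
        # No pre-approved policy exists: fall back to the manual process.
        return "escalate"
    moved = []
    if action == "evacuate":
        for workload, current in cmdb.items():
            if current == host:
                # A real tool would start the migration here; the sketch
                # only updates the record so the two never drift apart.
                cmdb[workload] = spare_host
                moved.append(workload)
    change_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "host": host,
        "action": action,
        "moved": sorted(moved),
    })
    return action


cmdb = {"mail": "host-a", "web": "host-b"}
log = []
result = handle_event("cpu_fan_failed", "host-a", cmdb, "host-c", log)
print(result, cmdb)  # evacuate {'mail': 'host-c', 'web': 'host-b'}
```

The approval still exists in this model; it has simply moved from the moment of the incident to the moment the policy was written, and the audit trail is produced as a side effect of the response rather than as a separate manual step.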
For those of you who have deployed virtual machines in large-scale production, what techniques have been most successful for managing the chaos of moving services and images? Are you using a glue layer for your legacy CMDB and other management tools, or are you finding it easier to throw them out and depend on the tools provided by your virtualization stack?