Technical Limiters for Pervasive Virtualization

In my introduction I talked about the main aspects of the Intel IT Enterprise Private Cloud, and I would like to take a deeper dive into one of those key aspects: how we have approached driving pervasive virtualization.  We chose this path primarily because we have lots of existing legacy applications, and we needed a method to bring them into our overall automation scope and help resolve some core business challenges we were experiencing (lack of agility and low utilization).

When we started this journey about a year ago, only around 12% of our server OS instances were running on virtual hardware.  Clearly, given our intent to achieve pervasive virtualization, we needed to change that number quickly.

We took a number of paths to make sure we could make immediate gains in our virtual-to-physical ratio.  First, we got in front of new capacity demands, and we cast our net wide: we took control of physical server purchases and made our virtual platform the default for server purchasers, we scrubbed everyone’s capital plans looking for opportunities to get in front of their purchases, and we analyzed IT projects to identify who was going to need capacity and engaged them early.  All of these methods helped us steer new purchases toward virtual machines instead of new physical servers.  They also helped us figure out what the real barriers were keeping us from 100% virtual.

Based on the data we collected and some additional analysis, we created our list of Technical Limiters to running 100% of our OS instances on virtual machines.  We use this list to track our engineering work, drive discussions with our suppliers, measure what we still need to solve, and determine which physical servers are optimal candidates for virtualization.

Some of the top limiters we are dealing with are:

1.)    Virtualizing our big Tier 1 systems:

a.       We use load balancing pretty extensively for our web heads, with a mixture of software and hardware load balancers depending on the app.  Plenty of people are doing this today; however, we needed to make the solution reproducible for our operations team so it could become the norm.

b.      Many of our applications use clustering for application-level failover.  We are putting the final touches on our implementation; getting this working and designed for operations was not trivial.

c.       Many of our big Tier 1 servers have a significant number of LUNs (Logical Unit Numbers) on each host, and with the 256-LUN limit per host we had a scaling problem on the clusters.  This required us to design smaller clusters for these apps and to look at more scale-out options for the applications.

2.)    Virtualization of externally facing apps:

a.       We use a combination of security methods to secure our DMZ (demilitarized zone) environment for externally facing applications.  We recently completed the engineering and rollout of these methods and are now virtualizing our externally facing environments on multi-tenant infrastructure.

3.)    Data Classification Controls:

a.       Due to concerns about verifying the integrity of the hypervisor and the potential for an unsecured guest to be an optimum attack surface, we are engineering a solution that allows us to run mixed multi-tenant clusters for our higher-priority servers while minimizing the risk to other guests and to the host itself.

4.)    Mega Virtual Machines:

a.       Most of our VMs (virtual machines) fall into one of our three standard units (Small, Medium, and Large), and we rarely roll out VMs with more than 8GB of memory or 4 vCPUs (virtual CPUs).  However, to cover the rest of the environment we have to deploy much larger VMs, and some software just keeps asking for more and more memory in a scale-up fashion versus scaling out.  We are now analyzing how best to handle 48GB+ VMs while still keeping a well-functioning cluster (from an operational perspective).
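Two of the sizing constraints above, the 256-LUN-per-host ceiling and the standard VM units versus "mega" VMs, can be sketched as simple checks.  This is a minimal illustration, not our actual tooling; the tier memory/vCPU values below are hypothetical (the text only gives the roughly 8GB / 4 vCPU upper bound for standard units), and the function names are made up for the example.

```python
# Hypothetical VM size tiers as (max memory GB, max vCPUs);
# only the 8GB / 4 vCPU upper bound comes from the text.
TIERS = {"Small": (2, 2), "Medium": (4, 4), "Large": (8, 4)}

LUN_LIMIT_PER_HOST = 256  # platform limit cited above


def classify_vm(mem_gb: int, vcpus: int) -> str:
    """Bucket a VM request into a standard unit, or flag it as 'Mega'."""
    for name, (max_mem, max_vcpu) in TIERS.items():
        if mem_gb <= max_mem and vcpus <= max_vcpu:
            return name
    return "Mega"  # 48GB+ requests land here and need special planning


def cluster_fits_luns(vm_lun_counts: list[int]) -> bool:
    """Check a candidate cluster's total distinct LUN count against the
    per-host ceiling. Every host must see every LUN so guests can fail
    over anywhere, which is what forces smaller clusters for LUN-heavy
    Tier 1 apps."""
    return sum(vm_lun_counts) <= LUN_LIMIT_PER_HOST
```

For example, a candidate cluster whose VMs together need 290 LUNs fails the check and has to be split, which is the "design smaller clusters" outcome described above.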

We then take each of these limiters (this is just part of the list) and figure out how to detect whether a physical server is affected by it, which tells us how much of our environment is unblocked when we fix that limiter.  This method has kept us very data driven and systematic.  Previously there were lots of open-ended opinions and FUD (fear, uncertainty, and doubt); by using this method and sharing the details extensively internally, we have made some pretty big leaps in making virtual the first choice for our application owners.  All servers that are considered not limited are then fed into our operational Virtual Factory, which runs the process of analysis, scheduling, migration, testing, and end of life (EOL) of the hardware.  That is another interesting topic I will write about in the future.
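The limiter-tagging method above can be sketched in a few lines: tag each physical server with the limiters that apply to it, then measure which servers are already unblocked and how many each fix would release.  This is a hypothetical illustration; the limiter names and server inventory are invented for the example.

```python
# Illustrative inventory: each server maps to the set of limiters
# (from the list above) that currently block its virtualization.
servers = {
    "web01": {"dmz"},
    "db01": {"many_luns", "mega_vm"},
    "app01": set(),             # no limiters: ready for the Virtual Factory
    "erp01": {"mega_vm"},
}


def unblocked(fleet: dict) -> list:
    """Servers with no remaining limiters; migration candidates."""
    return [name for name, limiters in fleet.items() if not limiters]


def impact(fleet: dict, limiter: str) -> int:
    """How many servers become unblocked if this limiter alone is
    resolved, i.e. it is their only remaining limiter."""
    return sum(1 for limiters in fleet.values() if limiters == {limiter})
```

Ranking limiters by `impact` is one simple way to stay data driven about which engineering fix releases the most of the fleet, which is the spirit of the approach described above.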

What are your main limiters and how are you dealing with them?