Capacity Planning for IaaS

This is the last post in a series of articles about capacity planning for cloud. I started with Capacity Planning for SaaS and Capacity Planning for SaaS Part 2, and in the next one I discussed PaaS. These topics are not co-dependent, i.e. you don’t need an IaaS in place to have a PaaS, or a PaaS to have SaaS. However, thinking in layers makes it easier to understand the issues that can be avoided lower in the cloud infrastructure stack.

Usually, the biggest concern for an IaaS solution is how to architect the infrastructure to provide the capacity and performance flexibility required for a multi-tenant environment. IaaS is basically storage, network, processor and memory wrapped in a service offering, so what I want to discuss are some underlying trends in each of these components.


The base of the IaaS stack is virtualization. Most of the challenges of a purely physical approach, such as underutilization of server resources, difficulty in protecting server availability, and dealing with disaster recovery, can be alleviated with virtualization. However, the biggest remaining challenge is storage management, due to the complexity of hypervisor resource management and the shared storage model.

In an IaaS solution, usually there are two approaches to the design of the storage solution: scale-up and scale-out. The decision about which to adopt will affect the overall cost, performance, availability and scalability of the entire solution.

Although the topology decision comes down to a combination of functionality, price, TCO and skills, the biggest differences between the scale-out and scale-up topologies are summarized in the following table:



| | Scale-out | Scale-up |
|---|---|---|
| Hardware scaling | Add commodity devices | Add faster, larger devices |
| Hardware limits | Scale beyond device limits | Scale up to the device limit |
| Availability, resiliency | Usually more | Usually less |
| Storage management complexity | More resources to manage; software required | Fewer resources to manage |

Scaling up an existing system often results in simpler storage management than a scale-out approach, as the complexity of the underlying environment is reduced, or at least well understood. However, as you scale up a system, performance may suffer because of the increasing density of shared resources in this topology. In contrast, with a scale-out topology, performance tends to increase with the number of nodes, since more CPU, memory, spindles and network interfaces are added with each node.
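To make the trade-off concrete, here is a toy model with purely illustrative numbers of my own: a scale-up array adds disks behind a shared controller pair that eventually becomes the bottleneck, while a scale-out cluster adds whole nodes, each bringing its own CPU, cache, spindles and network ports.

```python
# Toy throughput model: scale-up vs. scale-out (illustrative numbers only).
CONTROLLER_CEILING_MBPS = 4000   # assumed ceiling of the shared scale-up controllers
DISK_MBPS = 150                  # assumed per-disk streaming throughput
NODE_MBPS = 1500                 # assumed per-node throughput in a scale-out cluster

def scale_up(disks: int) -> int:
    """Aggregate throughput is capped by the shared controllers."""
    return min(disks * DISK_MBPS, CONTROLLER_CEILING_MBPS)

def scale_out(nodes: int) -> int:
    """Each node adds CPU, memory, spindles and NICs, so throughput grows with node count."""
    return nodes * NODE_MBPS

for n in (4, 8, 16, 32):
    print(f"{n:>2} disks/nodes -> scale-up: {scale_up(n):>5} MB/s, "
          f"scale-out: {scale_out(n):>6} MB/s")
```

The absolute numbers are meaningless; the point is the shape of the curves: scale-up flattens at the controller ceiling, while scale-out keeps growing with each node added.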

If you are planning a local private cloud on a greenfield site, the scale-up approach can serve you very well because of its management simplicity. If you are designing a very large public cloud and want the benefit of growing in small increments with virtually no upper limit, you should consider the scale-out approach.

Storage is a key component in cloud computing, and there are various options depending on the workload, as shown below:


There isn’t a “one solution fits all” in a cloud environment. The architecture should be built to decouple virtual machines from the physical layer, with a virtual storage topology that allows any virtual machine to connect to any storage in the network. This is a requirement for a well-designed IaaS.


The consolidation factor in a virtual environment can reach the point where a single physical host runs 15-25+ virtual machines and generates as much network traffic as a top-of-rack switch. Usually, at least 8x GbE interfaces are required to handle the VMs’ network traffic plus the hypervisor management traffic, and besides the Ethernet interfaces it is not uncommon to use 2x HBA interfaces for storage connectivity. Considering Data Center optimization best practices for increasing rack density and assuming 10 servers per rack, that adds up to 120 cables per rack just for the servers (i.e. 2x power cables + 8x GbE cables + 2x FC cables per server).

Managing a high-density server environment adds a lot of connectivity complexity. Unified fabric is a key technology for IaaS: a unified networking design with 10GbE can reduce the cable count from 10 to 2 per server while delivering 25% more throughput. At the same time, it provides the flexibility to dynamically allocate bandwidth to VMs and to balance storage and Ethernet traffic through SLA policies.
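The numbers above are easy to verify with a back-of-the-envelope calculation. A minimal sketch follows; the 10 servers per rack and the 2x 10GbE converged ports come from the text, while the 4Gb/s FC HBA speed is my assumption:

```python
# Back-of-the-envelope cabling and bandwidth comparison for one rack.
# Assumptions of mine (not stated in the post): 4 Gb/s FC HBAs, and unified
# fabric implemented as 2x 10GbE converged ports per server.

SERVERS_PER_RACK = 10

def per_server(power, gbe_1g, fc_hba, converged_10g):
    """Return (data cables, total cables, data bandwidth in Gb/s) per server."""
    data_cables = gbe_1g + fc_hba + converged_10g
    total_cables = power + data_cables
    bandwidth_gbps = gbe_1g * 1 + fc_hba * 4 + converged_10g * 10
    return data_cables, total_cables, bandwidth_gbps

# Traditional design: 2x power + 8x 1GbE + 2x FC HBAs per server
trad = per_server(power=2, gbe_1g=8, fc_hba=2, converged_10g=0)
# Unified fabric: 2x power + 2x 10GbE carrying both LAN and storage traffic
unified = per_server(power=2, gbe_1g=0, fc_hba=0, converged_10g=2)

for name, (data, total, bw) in (("Traditional", trad), ("Unified fabric", unified)):
    print(f"{name:<15} {data} data cables/server, "
          f"{total * SERVERS_PER_RACK} cables/rack, {bw} Gb/s of data bandwidth")

print(f"Throughput gain: {unified[2] / trad[2] - 1:.0%}")  # ~25% with these assumptions
```

With these assumptions the traditional design needs 120 cables per rack and 16Gb/s of data bandwidth per server, while the unified design drops to 2 data cables per server and 20Gb/s, which is where the 25% figure comes from.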

In order to deal with availability and improve flexibility, the best practice is to configure both interfaces for use by each VM and connect each interface to a different 10GbE switch. The following picture illustrates this configuration:


Personally, I don’t see much reason to keep the 1GbE LOM just for manageability. 10GbE has enough bandwidth and reliability that you don’t need to burn a 10GbE switch port, or place a second top-of-rack switch, just for 1GbE management traffic.

Unified Networking definitely makes capacity planning much easier!


The choice of physical servers should be the result of a collection of factors: the hypervisor licensing model, the expected VM templates, hardware capabilities, network and storage architecture, Data Center facilities and budget constraints.

To illustrate this, I made some assumptions about a fictional environment of 1,000 servers, where I expect 80% of the VMs to have 1 vCPU and 3GB of memory, 15% to have 2 vCPUs and 8GB, and only 5% to have 4 vCPUs and 16GB, for an average of 4.4GB per VM.
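As a quick sanity check on that average, here is a minimal sketch; the VM mix comes from the paragraph above and the rest is plain arithmetic:

```python
# VM mix from the post: (share of VMs, vCPUs, memory in GB)
vm_mix = [
    (0.80, 1, 3),
    (0.15, 2, 8),
    (0.05, 4, 16),
]

avg_mem_gb = sum(share * mem for share, _, mem in vm_mix)
avg_vcpus = sum(share * vcpus for share, vcpus, _ in vm_mix)

print(f"Average memory per VM: {avg_mem_gb:.1f} GB")  # 4.4 GB
print(f"Average vCPUs per VM:  {avg_vcpus:.2f}")      # 1.30
```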

For this scenario I used a rack server configuration from the Dell website and adopted the VMware vSphere 5.0 Enterprise Plus SKU with the new licensing model, which gives the following spreadsheet based on the amount of memory installed. For this exercise, I assumed that the total amount of pMemory is equal to vMemory, i.e. 100% memory allocation with no overcommitment.

For 2-socket servers:


And for 4-socket servers:


Now, plotting these two tables together, we can see that a 2-socket server is the best choice for this particular environment:
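The original spreadsheets and chart are not reproduced here, but the shape of the calculation behind them can be sketched as follows. The host memory configurations (192GB for the 2-socket box, 512GB for the 4-socket box) and the 96GB vRAM entitlement per Enterprise Plus license are illustrative assumptions of mine, not figures from the original tables:

```python
import math

TOTAL_VMS = 1000
AVG_VMEM_GB = 4.4            # from the VM mix above
VRAM_PER_LICENSE_GB = 96     # assumed vRAM entitlement per Enterprise Plus license

# Hypothetical host configurations -- the real exercise used Dell rack servers
hosts = {
    "2-socket": {"sockets": 2, "pmem_gb": 192},
    "4-socket": {"sockets": 4, "pmem_gb": 512},
}

# pMemory == vMemory, 100% allocation, no overcommit
total_vmem_gb = TOTAL_VMS * AVG_VMEM_GB   # 4,400 GB

for name, cfg in hosts.items():
    n_hosts = math.ceil(total_vmem_gb / cfg["pmem_gb"])
    vms_per_host = cfg["pmem_gb"] / AVG_VMEM_GB
    # vSphere 5.0 model: one license per socket, and the pooled vRAM
    # entitlement must also cover the total configured vRAM.
    licenses = max(n_hosts * cfg["sockets"],
                   math.ceil(total_vmem_gb / VRAM_PER_LICENSE_GB))
    print(f"{name}: {n_hosts} hosts, ~{vms_per_host:.0f} VMs/host, {licenses} licenses")
```

With the vRAM-pool model, the license count is driven by whichever is larger: the total number of sockets or the total vRAM divided by the per-license entitlement. That is why denser 4-socket hosts do not automatically reduce the licensing cost under this model.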


There isn’t a “one size fits all” for IaaS. The flexibility that virtualization gives us to allocate computational resources does not mean we always make good decisions; it means a bad decision can be remediated. Even so, those decisions have a profound impact on the TCO of the solution.