I have been so busy with my normal work that I haven’t had time to share what we are doing in a while. So on a recent flight home from Washington, D.C., where we just held our first IT@Intel Cloud Summit, I thought I would spend a little time sharing where we are and where we are going.
First of all, 2010 was a busy year for all of us working on introducing the cloud to the Office and Enterprise environment at Intel. We took on some tough challenges and pulled most of them off. Here is a recap…
1.) Pervasive Virtualization – our Cloud foundation is moving forward fast. We went from 18% of our environment virtualized at the end of 2009 to beating our goal of 37% by the end of 2010, and we are now at around 45%. We are starting to hit some of the tougher workloads, but we continue to move at a rapid pace here.
2.) Elastic Capacity and Measured Services – we made some pretty great strides in ensuring all of our cloud components have instrumentation, and in getting that data into our data layer so we can consume it. Our Ops team is now starting to use the massive amount of data (from guests, to hosts, to storage) to look in aggregate at what is happening in our Cloud, as well as to dig into the specifics where we are exceeding thresholds. We also run our massive DB, which handles an ETL of around 40M records a day, on a VM, just to make sure we walk the talk.
3.) End-to-End Service Monitoring – we made a decision to tightly couple our Cloud work with our move to a true ITIL Service Management environment. This isn’t a simple task, and we have a lot more work to do here. But I think most of the peers I talk to in the industry agree that ITIL with Cloud is a great way to combine the discipline of an Enterprise IT shop with the dynamic nature of on-demand capacity. We have completed end-to-end service monitoring for a few entire services, and we are going to make this the norm as we continue through 2011, eventually creating the service models automatically when self-service happens.
4.) On-Demand Self-Service – we took an extremely manual environment and made it automated. We didn’t do it in a pristine greenfield environment; we did it across our entire Office and Enterprise environment. This means that across basically all of our data centers and all of our virtual infrastructure, we can serve out infrastructure services on demand to entitled users. We set a goal of under 3 hours, and we are doing a pretty good job of hitting it consistently. This year we are going after the last piece of the environment, which is our DMZ and secure enclaves, and our teams are busy working through the business process automation as well as new connectors to automate some very laborious manual tasks.
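To make the "look in aggregate, then dig into the specifics" idea from item 2 a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the counter names, the threshold values, and the sample records are made up and do not reflect our actual data layer or tooling.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical thresholds (percent utilization); real limits would come
# from testing your own environment.
THRESHOLDS = {"cpu_pct": 85.0, "mem_pct": 90.0}


def summarize(samples):
    """Return (mean value per counter, list of samples over their threshold)."""
    by_counter = defaultdict(list)
    for sample in samples:
        by_counter[sample["counter"]].append(sample["value"])

    # Aggregate view: one average per counter across every source.
    aggregate = {name: mean(values) for name, values in by_counter.items()}

    # Specific view: the individual samples that exceed a threshold.
    breaches = [
        sample
        for sample in samples
        if sample["counter"] in THRESHOLDS
        and sample["value"] > THRESHOLDS[sample["counter"]]
    ]
    return aggregate, breaches


# Made-up sample records standing in for guest/host/storage telemetry.
samples = [
    {"source": "vm-01", "counter": "cpu_pct", "value": 40.0},
    {"source": "vm-02", "counter": "cpu_pct", "value": 95.0},
    {"source": "vm-01", "counter": "mem_pct", "value": 60.0},
    {"source": "vm-02", "counter": "mem_pct", "value": 70.0},
]

aggregate, breaches = summarize(samples)
for breach in breaches:
    print(breach["source"], breach["counter"], breach["value"])
```

The average tells you how the Cloud is doing as a whole, while the breach list points at the one guest that needs attention even though the average looks healthy.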
Now, nothing any of us does in IT is simple, and everything has challenges… Here are a few retrospective points I would like to share:
1.) Know your workloads – with the data we are pulling from all of our OS instances, we can see what the workloads are doing to the most important components (CPU, memory, network, storage, and I/O). In fact, we have so much data that sometimes it is tough to find the right data. However, with this data you can pick the top 2-3 counters per component and make sure you are optimizing the OS instance as it moves to the multi-tenant environment. I like to think of what we are doing as moving families out of the suburbs and into extremely efficient, leased high-rise apartments. Because we control the city, we can make these decisions, but as we do this we need to be careful to make sure we have enough square footage to let the family thrive. If we give them too little space, or we don’t allow them to cool their apartment, we could end up with angry tenants. Also, no one wants a rock band living next door, so we have to make sure those noisy neighbors keep the noise down, or give them a room away from the rest of the tenants.
2.) Know your environment thresholds – most IT shops work in silos, and many of the silos make decisions on their specific component that may not take the entire IT ecosystem into account. This can be as simple as how large a subnet range is, or how many spindles are provisioned to handle a handful of DB VMs. In my Design background, we would go in and break our infrastructure as a practice (of course not while we were using it), and we would then understand specifically how and why we were able to break it, and set a threshold. This threshold also serves as a challenge: how do you take on a 2x or even 10x goal to lift the threshold as you take on more business, and as the business grows? If you don’t know how to break your environment, then when/if it does break you will be struggling to figure out how to get it back to normal.
3.) Don’t underestimate the cultural shift required to move from a manual environment to an automated environment – our factories and our design environment work extremely well due to the large investments we make in automation. This isn’t the case for most traditional IT shops I talk to, and neither was it for ours. We made huge strides in bringing automation to this environment, but we still have a long way to go. This isn’t just a technical challenge either: you need to help your organization and workers understand that just because we are automating their work, it doesn’t mean they are going away. When I started at Intel, one of the most valuable pieces of advice I got was to always seek to engineer myself out of a job. This didn’t mean I was getting laid off; it meant that I could then apply my skills to a higher-level task. We are constantly short on headcount in IT, especially those of us that are a cost center and not a profit center. However, there is no shortage of valuable work we can do in IT to improve the business services and make evolutionary changes to help the bottom line and the top line. Also, make automation a part of everyone’s job… a script with good documentation in it is always better than documentation with a pointer to a script.
4.) Many years of manual environments mean that automation will hit walls – when someone takes a document and uses it to set up something in one data center, it is almost a given that someone in another data center is going to follow that doc slightly differently. Configuration drift leads to some tough challenges, and automation will quickly find these problems and point them out to you, usually with a big red X. Fortunately, we phased in the automation, so we were able to see a lot of the problems before we turned on self-service globally. Now that we have self-service, we see configuration and performance issues almost immediately.
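For readers who want a feel for how automation surfaces the configuration drift described in item 4, here is a small Python sketch. It is only an illustration under stated assumptions: the setting names, values, and data center labels are hypothetical, and it is not the tooling we actually use.

```python
def find_drift(baseline, sites):
    """Map each site to the settings that differ from the baseline.

    Each difference is recorded as (expected, actual); a setting missing
    on one side shows up as None.
    """
    drift = {}
    for site, config in sites.items():
        diffs = {}
        # Check every setting that appears in either the baseline or the site.
        for key in sorted(set(baseline) | set(config)):
            expected = baseline.get(key)
            actual = config.get(key)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            drift[site] = diffs
    return drift


# Hypothetical baseline taken from the setup document, plus two made-up
# data centers that were configured by hand from that same document.
baseline = {"ntp_server": "ntp.example.com", "vlan": 120, "mtu": 1500}
sites = {
    "dc-a": {"ntp_server": "ntp.example.com", "vlan": 120, "mtu": 1500},
    "dc-b": {"ntp_server": "ntp.example.com", "vlan": 121},
}

drift = find_drift(baseline, sites)
for site, diffs in drift.items():
    for key, (expected, actual) in diffs.items():
        print(f"X {site}: {key} expected {expected!r}, found {actual!r}")
```

A check like this is cheap to run before self-service is switched on for a site, which is roughly the benefit we got from phasing the automation in.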
I am about to land at home in Portland, and the captain just said it is cloudy with a chance of rain… We still have a long path ahead of us as we continue to enable new businesses at Intel and support the rapid growth of existing ones. The last year of work took us a big leap forward, and I am excited about the coming year.
Where are you with your cloud efforts? How have you handled the challenges, and were yours similar or different?
Until next time,
Intel IT Cloud Engineering Lead