China Mobile and Intel: Lessons Learned from a Large Scale Deployment

By Margaret LaBrecque, Data Center Community Development at Intel

I had the pleasure this summer of hearing Jonathan Bryce, Executive Director of the OpenStack Foundation, speak at several OpenStack community events. One of the things he identified as critical for accelerating OpenStack adoption was the need for companies with large-scale OpenStack deployment experience to share their learnings. The OpenStack User Survey has identified the lack of documented real-world learnings and best practices, which would address limitations in OpenStack scale and stability and help users advance more quickly up the learning curve, as a key barrier to broader adoption. Hence, I think Jonathan will be pleased when he sees the groundbreaking study published by China Mobile and Intel, a first-of-its-kind deep dive into performance tuning of the nova scheduler.

A Collaboration to Drive Performance

The China Mobile telecommunications network has more than 800 million subscribers, three million base stations, 100 data centers and 200,000 racks of servers. It’s a little mind-boggling, to say the least, which is why it has been such an amazing experience for engineers from Intel in China to work so closely with China Mobile on its ongoing deployment of large OpenStack clusters. China Mobile and Intel have collaborated from the start of these deployments, initiating the China 1K Node Tiger Team to manage the large-scale deployment process and to develop best practices. Thanks to this collaboration, Intel engineers were able to join their China Mobile counterparts on-site to help collect and analyze runtime data with the goal of improving launch latency and error rate for the nova scheduler under a high volume of requests. (The nova scheduler has been a source of significant performance challenges in OpenStack, which motivated the teams’ focus.)

But it wasn’t enough to just perform a careful, component-level analysis of a 1,000-node cluster; the China Mobile and Intel teams also wanted to share insights from their analysis with the OpenStack community. Because the tests were performed on a production cluster running the latest OpenStack release candidate (Newton RC2), the results suggest best practices for operating at scale. The teams used lightweight instrumentation along with techniques for obtaining more precise time-synchronization to enable tracking of every VM launch request in the nova scheduler. By analyzing and classifying each failure, the teams were able to identify three major issues, which were then addressed through configuration changes.
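
To give a feel for this kind of per-request analysis, here is a minimal Python sketch. It is not the teams' actual tooling. The event format, the stage names ("received", "scheduled", "failed") and the failure reason are all hypothetical, and the timestamps are assumed to be already synchronized across hosts. It simply pairs events by request to compute launch latency, error rate and a count of failures by reason.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaunchEvent:
    """One instrumented event for a VM launch request (hypothetical format)."""
    request_id: str
    stage: str                    # assumed values: "received", "scheduled", "failed"
    timestamp: float              # seconds; assumed synchronized across hosts
    reason: Optional[str] = None  # failure classification, if any


def summarize(events):
    """Return (latency per request, failure counts by reason, error rate)."""
    started = {}    # request_id -> time the request was received
    latencies = {}  # request_id -> seconds from received to scheduled
    failed = {}     # request_id -> failure reason (last one seen)

    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.stage == "received":
            started[ev.request_id] = ev.timestamp
        elif ev.request_id not in started:
            continue  # ignore events for requests we never saw arrive
        elif ev.stage == "scheduled":
            latencies[ev.request_id] = ev.timestamp - started[ev.request_id]
        elif ev.stage == "failed":
            failed[ev.request_id] = ev.reason or "unknown"

    failure_counts = Counter(failed.values())
    error_rate = len(failed) / len(started) if started else 0.0
    return latencies, failure_counts, error_rate


# Illustrative data only; not taken from the study.
events = [
    LaunchEvent("r1", "received", 0.0),
    LaunchEvent("r2", "received", 0.25),
    LaunchEvent("r3", "received", 0.5),
    LaunchEvent("r3", "scheduled", 1.0),
    LaunchEvent("r1", "scheduled", 1.5),
    LaunchEvent("r2", "failed", 2.0, reason="no_valid_host"),
]
latencies, failure_counts, error_rate = summarize(events)
```

Grouping failures by reason is what makes a handful of dominant issues stand out, and that in turn points to which settings are worth tuning.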

Finding Confidence in Data Driven Results

These learnings should increase confidence in the Newton release for any organization considering a large-scale OpenStack deployment. I believe the China Mobile and Intel teams have, by example, demonstrated the type of analysis and tuning that should be performed for every new release. This analysis not only results in a more stable and higher-performing China Mobile OpenStack cloud, but also gives the broader community highly actionable insights into scale bottlenecks and areas of future work.

In recognition of the significance of this deployment and the benefits of sharing these learnings with the OpenStack community, China Mobile was selected as the Telecom finalist and overall winner of the OpenStack Foundation SuperUser award at this week’s Barcelona Summit.

At Intel, we are committed to launching tens of thousands of clouds as part of our Cloud for All initiative. Collaborations such as this one, which accelerate cloud deployments, are at the heart of this initiative. We look forward to continued collaboration with China Mobile and to moving OpenStack forward for the enterprise.