Methods for Seamlessly Migrating to a Different Hadoop Version

One of the techniques that Intel IT had to learn in order to gain US$351 million in value from analytics was how to migrate between Hadoop versions. Migrating to a different Hadoop version, whether from an older version or from another distribution, raises a number of questions for any organization that is thinking about attempting it. Is the migration worth the effort? Will production software break? How can we make the transition with minimal impact on users? Intel's IT organization faced all of these questions and challenges when it migrated from Intel's own custom version of Hadoop, known as IDH, to Cloudera's Hadoop (CDH). In this white paper, Intel IT's Hadoop Engineering team describes its methodology and how it has seamlessly migrated a live production cluster through three different version changes.

The team made a feature-by-feature comparison of Intel's IDH and Cloudera's CDH and determined that moving to Cloudera's Hadoop distribution had significant advantages. Once the decision was made to migrate, the team outlined three major concerns:

  • Coping with Hadoop variations
  • Understanding the scope of the changes
  • Completing migration in a timely manner

The first concern is about the need to understand how to properly configure the new version. The second is about the effects of the changes: making sure that application developers and internal customers, and their code that uses the cluster, would be minimally affected. The last concern expresses the need to make any migration quick and, at best, transparent to live users.


Intel developed what it considers to be six best practices for migration:

  1. Find Differences with a Comparative Evaluation in a Sandbox Environment
  2. Define Our Strategy for the New Implementation
  3. Upgrade the Hadoop Version
  4. Split the Hardware Environment
  5. Create a Preproduction-to-Production Pipeline
  6. Rebalance the Data

The first practice deals with the first and second concerns listed above. The evaluation identified differences between the IDH and CDH environments without disrupting production environments. Other practices, such as creating a preproduction-to-production pipeline, were designed to deal with the last concern: migrating quickly and with minimum impact. Intel divided each version's instances between servers in the same rack. This leveraged the high-speed network within and between racks to move data to the new version with only one transfer.
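As an illustration of the comparative evaluation described above, one simple starting point in a sandbox is to diff the configuration properties of the two distributions. The sketch below is not Intel's tooling; it only assumes the standard Hadoop `<configuration>/<property>` XML layout. The function names, the sample values, and the `example.*` property names are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(old, new):
    """Report properties present on only one side or with different values."""
    return {
        "only_in_old": sorted(set(old) - set(new)),
        "only_in_new": sorted(set(new) - set(old)),
        "changed": {
            key: (old[key], new[key])
            for key in sorted(set(old) & set(new))
            if old[key] != new[key]
        },
    }


# Hypothetical sample settings, for illustration only.
OLD_CONF = """<configuration>
  <property><name>dfs.blocksize</name><value>67108864</value></property>
  <property><name>example.old.setting</name><value>true</value></property>
</configuration>"""

NEW_CONF = """<configuration>
  <property><name>dfs.blocksize</name><value>134217728</value></property>
  <property><name>example.new.setting</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    report = diff_properties(load_properties(OLD_CONF), load_properties(NEW_CONF))
    for section, content in report.items():
        print(section, content)
```

In practice the two inputs would be read from the sandbox clusters' configuration files rather than from inline strings, and a similar comparison could be applied to library versions or job outputs.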

Using these methods, Intel IT's Hadoop team completed the full migration from IDH to CDH in five weeks. Only one piece of production code needed to be changed, and that was because it called a library that was deprecated between Hadoop versions. Since the initial migration, this methodology has also been used for two version upgrades with no customer impact. Some of the team's initial concerns about security have been mitigated by Cloudera's implementation of Apache Sentry. Look for another white paper on that subject later this year.

Jeff Sedayao

About Jeff Sedayao

Jeff Sedayao is the domain lead for security in Intel's IT@Intel group. He has been an engineer, enterprise architect, and researcher focusing on distributed systems, in particular cloud computing, big data, and security. Jeff has worked in operational, engineering, and architectural roles in Intel's Information Technology group, done research and development in Intel Labs, and performed technical analysis and intellectual property development for a variety of business groups at Intel.