There are always plenty of new terms looking to join the data center vernacular. You’re familiar with many that have taken root over the past several years: virtualization, cloud, consolidation, mission critical... the list goes on and on. A hot term you’ve probably heard recently is Big Data. There’s a lot of buzz around this one, as evidenced by the many companies introducing products designed to meet the challenges of Big Data. In this post, I’m going to take a look at Big Data and see what its implications could be for data center networks.
First things first: what is Big Data? The specifics may vary depending on whom you ask, but the basic idea is consistent: Big Data means large sets of data that are difficult to manage (analyze, search, etc.) using traditional methods. Of course, there’s more to it than that, but that’s a decent enough answer to make you sound semi-intelligent at a cocktail party. Unless it’s a Big Data cocktail party.
A logical follow-up question is “why is Big Data so...big?” The main cause for these massive data sets is the explosive growth in unstructured data over the past several years. Digital images, audio and video files, e-mail messages, Word or PowerPoint files – they’re all examples of unstructured data, and they’re all increasing at a dizzying rate. Need a first-hand example? Think about your home PC. How many more digital photos, MP3s, and video files are on your hard drive compared to a few years ago? Tons, right? Now imagine that growth on an enterprise scale, where thousands of employees are each saving gigabytes’ worth of presentations, spreadsheets, e-mails, images, and other files. That’s a lot of data, and it’s easy to see how searching, visualizing, and otherwise analyzing it can be difficult.
[Structured data, for those of you wondering, is data organized in an identifiable structure. Examples include databases or data within a spreadsheet, where information is grouped into columns and rows.]
So what to do? There’s no shortage of solutions billed as the answer to the Big Data problem. Let’s take a look at one that’s getting a lot of attention these days: Hadoop.
Hadoop is an open source software platform used for distributed processing of vast amounts of data. A Hadoop deployment divides files and applications into smaller pieces and distributes them across compute nodes in the Hadoop cluster. This distribution of files and applications makes it easier and faster to process the data, because multiple processors are working in parallel on common tasks.
Let’s take a quick look at how it works.
Hadoop comprises two major software components: the Hadoop Distributed File System (HDFS) and the MapReduce engine.
- HDFS runs across the cluster and stores portions of larger files on various nodes in the cluster. It also provides redundancy and can speed up reads by keeping copies of each piece of a file on other nodes in the cluster (three copies by default).
- The MapReduce engine divides applications into small fragments, which are then run on nodes in the cluster. The MapReduce engine attempts to place each application fragment on the node that contains the data it needs, or at least as close to that node as possible, reducing network traffic.
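To make the two components a little more concrete, here is a minimal Python sketch of the idea, not Hadoop’s actual code or API. It cuts a "file" into blocks, places two copies of each block on different nodes, schedules each map task on a node that already holds its block, and then runs a word count. The node names, the tiny block size, the round-robin placement, and every function name are assumptions made for illustration; real HDFS splits files by bytes rather than on word boundaries and uses much larger blocks.

```python
from collections import Counter, defaultdict

BLOCK_SIZE = 16      # characters per block; tiny on purpose (assumption for the demo)
REPLICATION = 2      # copies kept per block; kept small so the output stays short
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def split_into_blocks(text, block_size=BLOCK_SIZE):
    """Cut a file into word-aligned blocks of roughly block_size characters."""
    blocks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > block_size:
            blocks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        blocks.append(current)
    return blocks


def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }


def schedule_map_tasks(placement):
    """Run each map task on the least-loaded node that already holds its block."""
    load = Counter()
    schedule = {}
    for block_id, holders in placement.items():
        chosen = min(holders, key=lambda node: load[node])
        schedule[block_id] = chosen
        load[chosen] += 1
    return schedule


def map_word_count(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def reduce_word_count(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(text):
    """Split, place, schedule, map, and reduce, all in one process."""
    blocks = split_into_blocks(text)
    placement = place_blocks(blocks)
    schedule = schedule_map_tasks(placement)
    pairs = []
    for block_id in schedule:
        # On a real cluster this map call would run on schedule[block_id],
        # next to the data, instead of in this single process.
        pairs.extend(map_word_count(blocks[block_id]))
    return reduce_word_count(pairs), placement, schedule


if __name__ == "__main__":
    counts, placement, schedule = run_job(
        "big data is big and data keeps getting bigger"
    )
    print(counts)
    print(placement)
    print(schedule)
```

Because every map task lands on a node that already stores its block, no block has to cross the network before it is processed; only the much smaller (word, count) pairs would need to travel to the reduce step.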
So who’s using Hadoop today and why? You’ve heard of the big ones – Yahoo!, Facebook, Amazon, Netflix, eBay. The common thread? Massive amounts of data that need to be searched, grouped, presented, or otherwise analyzed. Hadoop allows organizations to handle these tasks at lower costs and on easily scalable clusters. Many of these companies have built custom applications that run on top of HDFS to meet their specific needs, and there’s a growing ecosystem of vendors selling applications, utilities, and modified file systems for Hadoop. If you’re a Hadoop fan, the future looks bright.
What are the network implications of Hadoop and other distributed systems looking to tackle Big Data? Ethernet is used in many server clusters today, and we think it will continue to grow in these types of deployments, as Ethernet's ubiquity makes it easy to connect these environments without using specialized cluster fabric devices. The same network adapters, switches, and cabling that are being used for data center servers can be used for distributed system clusters, simplifying equipment needs and management. And while Hadoop is designed to run on commodity servers, hardware components, including Ethernet adapters, can make a difference in performance. Dr. Dhabaleswar Panda and his colleagues at Ohio State University have published a research paper in which they demonstrate that 10GbE makes a big difference when combined with an SSD in an unmodified Hadoop environment. Results such as these have infrastructure equipment vendors taking notice. In its recent Data Center Fabric announcement, for example, Cisco introduced a new switch fabric extender aimed squarely at Big Data environments. You can expect to see more of this as distributed system deployments continue to grow.
Big Data isn’t going away. It’s going to keep getting bigger. We’ll see more products, both hardware and software, designed to make your Big Data experience easier, more manageable, and more efficient. Many will use a distributed model like Hadoop, so network infrastructure will be a critical consideration.
So, here are some questions for you, dear reader: Have you deployed a Hadoop cluster or are you planning to do so soon? What network considerations did you take into account as you planned your cluster?
We’d love to hear your thoughts.
Follow us on Twitter: @IntelEthernet