Intel is creating a marketing analytics program that will help optimize the outreach of media campaigns. We’ve been exploring various optimizations on Apache Hadoop* that can speed the analysis of the unstructured data (about 20 TB) associated with these campaigns. The business wants quick insights (don’t we all?).
Often, developers make some simple design choices at the outset of a project but fail to revise their design when data management, processing, and consumption needs change. The result is a design that looks simple and economical but incurs substantial overhead at runtime.
As our marketing analytics program has evolved, we’ve done some experimentation and have identified several ways to improve Hadoop processing times. In my role as platform support solutions design engineer, I’d say the primary recommendation is to minimize how much data is read from the disk. I hope that sharing these ideas will help other developers bootstrap their own optimizations more quickly.
Use columnar files
Raw data, such as web logs, is usually stored as text. Each line nevertheless contains several columns of data, such as client IP, HTTP code, cookie information, and so on. Presumably, most analytics do not rely on all the columns simultaneously. Scanning the text is therefore inefficient, because every column is read and the unused ones are discarded, whereas a columnar format lets a query read only the columns it needs. We recommend using columnar storage formats (like Parquet*) for derived and intermediate tables.
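To make the difference concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Parquet itself, and the field names and helper functions are hypothetical. It counts how many values each layout has to touch to answer one question: how many requests returned an error code?

```python
# Toy comparison of a row-oriented and a column-oriented layout.
# This is an illustration only; it does not read or write Parquet files.
rows = [
    {"client_ip": "10.0.0.1", "http_code": 200, "cookie": "a=1"},
    {"client_ip": "10.0.0.2", "http_code": 404, "cookie": "b=2"},
    {"client_ip": "10.0.0.3", "http_code": 200, "cookie": "c=3"},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def count_errors_row_store(records):
    """Row layout: every field of every row is read, then most are discarded."""
    values_read = 0
    errors = 0
    for record in records:
        values_read += len(record)
        if record["http_code"] >= 400:
            errors += 1
    return errors, values_read


def count_errors_column_store(columns):
    """Column layout: only the http_code column is read."""
    codes = columns["http_code"]
    errors = sum(1 for code in codes if code >= 400)
    return errors, len(codes)


columns = to_columnar(rows)
row_result = count_errors_row_store(rows)
col_result = count_errors_column_store(columns)
print(row_result, col_result)
```

Both layouts find the same single error, but the row layout touches nine values and the column layout touches three. On a wide log table the gap grows with the number of unused columns.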
Use data partitioning
It is really important to consider how your tables are designed. If a year’s worth of logs is stored in one giant table but only the warm data (from the past month) is used for analysis, partitioning the table by date can probably speed up analysis. Partitioning avoids reading unnecessary data at the beginning of the analysis, instead of reading it and discarding it at the end. Of course, partitioning also helps you manage, update, and retain data in smaller, bounded units than one giant table.
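The effect of partition pruning can be sketched with a dictionary keyed by month. This is a simplified in-memory analogy, not how Hive stores partitions on disk, and the record layout and function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(records):
    """Group records into buckets keyed by (year, month)."""
    partitions = defaultdict(list)
    for record in records:
        key = (record["day"].year, record["day"].month)
        partitions[key].append(record)
    return partitions


def scan_unpartitioned(records, year, month):
    """One giant table: every row is read, and non-matching rows are discarded."""
    hits = [r for r in records if (r["day"].year, r["day"].month) == (year, month)]
    return hits, len(records)


def scan_partitioned(partitions, year, month):
    """Partitioned table: only the requested partition is read."""
    hits = partitions.get((year, month), [])
    return hits, len(hits)


# One hypothetical log record per month of 2014.
logs = [{"day": date(2014, m, 15), "http_code": 200} for m in range(1, 13)]
partitions = partition_by_month(logs)

full_hits, full_rows_read = scan_unpartitioned(logs, 2014, 12)
part_hits, part_rows_read = scan_partitioned(partitions, 2014, 12)
print(full_rows_read, part_rows_read)
```

Both scans return the same December record, but the unpartitioned scan reads all 12 rows while the partitioned scan reads only 1.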
Use efficient map-joins
Pay attention to the Apache Hive* auto-join conversion. Where applicable, Hive will try to resolve the dimensional tables up front and leave it to the tasktrackers to perform a hash-join against the bulkier fact tables. But the default size threshold below which Hive qualifies a table for an automatic map-join is extremely conservative (25 MB). Study your dimensional table sizes and the JVM heap settings so that a map-join can actually occur. A map-join avoids the reduce stage, which means you avoid unnecessary shuffle and intermediate disk write operations. Also study the placement (left or right) of the dimensional table in a join operation so Hive knows it must stream the fact (large) table from disk. Dimensional (small) tables should typically be placed on the left, which leaves the fact table on the right as the one that is streamed.
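The idea behind a map-join can be sketched in a few lines of Python: load the small dimensional table into an in-memory hash table, then stream the large fact table past it one row at a time, with no shuffle or reduce step. This is only an analogy for what Hive does inside each map task. The table contents and the `map_side_join` helper are hypothetical. In Hive itself, the settings to look at are, as far as I know, `hive.auto.convert.join` and `hive.mapjoin.smalltable.filesize`; check the documentation for your version.

```python
def map_side_join(dim_rows, fact_rows, key):
    """Inner-join a small table against a streamed large table.

    The small (dimensional) table is held in memory as a hash table.
    The large (fact) table is consumed one row at a time and never
    needs to be sorted, shuffled, or held in memory as a whole.
    """
    lookup = {row[key]: row for row in dim_rows}
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:
            # Merge the matching dimension attributes into the fact row.
            yield {**dim, **fact}


# Small dimensional table: fits comfortably in memory.
campaigns = [
    {"campaign_id": 1, "name": "spring"},
    {"campaign_id": 2, "name": "fall"},
]

# Large fact table: passed as an iterator to mimic streaming from disk.
clicks = [
    {"campaign_id": 1, "clicks": 10},
    {"campaign_id": 2, "clicks": 5},
    {"campaign_id": 3, "clicks": 7},  # no matching campaign, dropped by the inner join
]

joined = list(map_side_join(campaigns, iter(clicks), "campaign_id"))
print(joined)
```

The memory cost is proportional to the small table only, which is why the dimensional table sizes and the JVM heap settings decide whether this strategy is available.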
Avoid NULL reduce “stragglers”
Missing values are inherent to data. Substantial numbers of them can lead to MapReduce “stragglers” during join and cogrouping operations because Hadoop dutifully places them all into one <key, list<values>> bucket. If acceptable to your application logic, reinitialize each missing value to a random unique value to distribute the N NULLs into N buckets instead of just one giant NULL bucket. In the new randomized design, all reducers share the workload instead of one straggler being overloaded.
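Here is a minimal sketch of the randomization, assuming a toy hash partitioner and four reducers. The `reducer_for` function is a stand-in for Hadoop's partitioning, not its actual implementation, and the `NULL_` prefix is an assumption chosen so that placeholders cannot collide with real keys.

```python
import uuid
import zlib
from collections import Counter

NUM_REDUCERS = 4  # hypothetical cluster setting for this illustration


def reducer_for(key):
    """Toy stand-in for a hash partitioner: the same key always maps to the same reducer."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_REDUCERS


def salt_nulls(keys):
    """Replace each missing key with its own random, unique placeholder."""
    return [k if k is not None else "NULL_" + uuid.uuid4().hex for k in keys]


def reducer_load(keys):
    """Count how many records each reducer would receive."""
    return Counter(reducer_for(k) for k in keys)


# Three real keys plus 1000 records whose join key is missing.
keys = ["user_a", "user_b", "user_c"] + [None] * 1000

skewed = reducer_load(keys)
balanced = reducer_load(salt_nulls(keys))
print("heaviest reducer before:", max(skewed.values()))
print("heaviest reducer after: ", max(balanced.values()))
```

Before the change, one reducer receives all 1000 missing-key records. Afterward they spread roughly evenly across the four reducers. Because each placeholder is unique, those records will not match anything in a join, so make sure downstream logic still treats them as missing.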
Big data is not academic; it is hands-down applied knowledge. The above tips may accelerate analysis for other developers and provide some technical rationale for paying close attention to the underlying data structures and the compute patterns of big data.
If you’d like some sample code that illustrates these techniques, or you want to find out more about our marketing analytics program, feel free to contact me. I’d also be interested in hearing about the techniques you’re using to optimize in-memory processing.
- Narasimha Edala