Data Cleansing, Data Lakes, and Business Intelligence at IT@Intel


My previous posts on how big data and business intelligence can add value have talked about Intel IT groups other than my own.  With this blog, I discuss big data, business intelligence, and data lakes in the context of the IT@Intel group.  When we recently transitioned a key internal operational tool to a SaaS-based solution, we encountered many of the data issues that I have posted about.  I'll talk first about the data management needs of a group like IT@Intel, go over the issues we encountered, and then explain why using our marketing data lake has tremendous appeal.

What kind of information does IT@Intel need to manage?  Content managers like me need to keep track of the white papers and other collateral that we produce.  "Keeping track" includes noting each project's status, the locations (URLs) of white papers, slideshares, and other IT@Intel collateral, and the Subject Matter Experts (SMEs) whose content we produce. Moving to a new tool required that we take an extract of our old tool's data for import into the new one.  For our purposes, it seemed quick and simple: dump the data to a CSV file and upload that file.

That last part ended up being neither straightforward nor easy.  Like the efforts to implement a data wall and to integrate data for demand forecasting, data cleansing took significant effort.  While we record the URLs of our content, these can change over time as the website is periodically re-architected and less-viewed content gets archived.  Data fields changed over time as well, with some being added and others becoming obsolete.  In still other cases, data was in the wrong format. To facilitate the data transfer, I spent significant time writing scripts that validated records, eliminated obsolete fields, and fixed incorrectly formatted data.
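To give a feel for what that scripting looked like, here is a minimal sketch of the kind of cleanup involved. The field names, date formats, and sample rows are purely illustrative assumptions, not our actual tool's schema:

```python
import csv
import io
import re

# Hypothetical extract: these columns and values are illustrative only,
# not the real IT@Intel tool schema.
RAW_EXTRACT = """title,status,url,legacy_code,published
Cloud Security Whitepaper,Published,http://intel.com/content/a1,XX12,2014/03/02
Data Center Brief,published,intel.com/content/b7,,03-15-2015
"""

KEEP_FIELDS = ["title", "status", "url", "published"]  # obsolete columns dropped


def normalize_date(value):
    """Coerce the two date formats seen in this sample to ISO 8601."""
    m = re.match(r"(\d{4})/(\d{2})/(\d{2})$", value)
    if m:
        return "-".join(m.groups())
    m = re.match(r"(\d{2})-(\d{2})-(\d{4})$", value)
    if m:
        return f"{m.group(3)}-{m.group(1)}-{m.group(2)}"
    return value


def clean_rows(raw_csv):
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Keep only current fields, then normalize each one.
        cleaned = {k: row[k].strip() for k in KEEP_FIELDS}
        cleaned["status"] = cleaned["status"].capitalize()
        if not cleaned["url"].startswith("http"):
            cleaned["url"] = "http://" + cleaned["url"]
        cleaned["published"] = normalize_date(cleaned["published"])
        rows.append(cleaned)
    return rows


cleaned = clean_rows(RAW_EXTRACT)
print(cleaned[1]["published"])  # 2015-03-15
```

Even in a toy like this, most of the code is validation and normalization rather than anything analytical, which matches my experience with the real extract.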


This work was critical for our tool transition, but it did not give us any new insights.  We want to integrate our internal data with usage data gleaned from other sources.  This would enable us to see which subject areas are most in demand and which SMEs generate the most-viewed content.  Ideally, we would like to be able to see how much of our content directly led to a sale, and we would like to analyze and visualize this data ourselves without involving other IT groups.  The Intel Analytics Hub (IAH), the data lake built for Intel's marketing efforts, would seem to be the perfect place to integrate and analyze this data.  Having all related data in one place for access, integration, and analysis would make insight generation much easier than doing the integration and ETL ourselves.  While storing important IT@Intel data in a SaaS solution might seem problematic for integration, the data lake has been designed to easily import that kind of data, and much of the data in the IAH already comes from external sources.
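As a rough sketch of the kind of integration we have in mind, the fragment below joins hypothetical internal content records with equally hypothetical usage data to rank SMEs by total views. None of these records, field names, or numbers come from the IAH; they are assumptions for illustration:

```python
from collections import defaultdict

# Illustrative data only: field names and values are assumptions,
# not actual IT@Intel or IAH records.
internal = [
    {"url": "http://intel.com/content/a1", "sme": "A. Expert", "topic": "Security"},
    {"url": "http://intel.com/content/b7", "sme": "B. Expert", "topic": "Cloud"},
]
usage = [
    {"url": "http://intel.com/content/a1", "views": 5400},
    {"url": "http://intel.com/content/b7", "views": 12100},
]

# Join the two sources on URL, the shared key.
views_by_url = {u["url"]: u["views"] for u in usage}

# Aggregate views per SME.
views_by_sme = defaultdict(int)
for doc in internal:
    views_by_sme[doc["sme"]] += views_by_url.get(doc["url"], 0)

top_sme = max(views_by_sme, key=views_by_sme.get)
print(top_sme)  # B. Expert
```

The point of a data lake is that this join happens where the data already lives, rather than in ad hoc scripts like this one stitched together after multiple extracts.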

Doing this data cleansing work gave me a much better appreciation of the work that our Big Data and Business Intelligence staff do every day.  Frameworks like Hadoop and Spark get much publicity for their power, but before these tools can be used, the unglamorous work of data validation and cleansing must be done.  An old saying in IT is "Garbage In, Garbage Out," and it holds true no matter how powerful a tool you use.

Jeff Sedayao

About Jeff Sedayao

Jeff Sedayao is the domain lead for security in Intel's IT@Intel group. He has been an engineer, enterprise architect, and researcher focusing on distributed systems, particularly cloud computing, big data, and security. Jeff has worked in operational, engineering, and architectural roles in Intel's Information Technology group, done research and development in Intel Labs, and performed technical analysis and intellectual property development for a variety of business groups at Intel.