Many business and IT leaders are focused on developing comprehensive data strategies that enable data-driven decision making. A 2016 IDG Enterprise survey found that 53% of companies were implementing or planning to implement data-driven projects within the next 12 months, specifically projects undertaken with the goal of generating greater value from existing data [1]. With the growing importance of AI and advanced analytics today, it seems a safe assumption that this number has only increased over time.
The concept of building a data strategy is such a hot topic that top-tier universities are creating executive-level courses on the subject [2], while industry observers are predicting that by 2020, 90% of Fortune 500 companies will have a chief data officer (CDO) or equivalent position [3].
Yet despite all of this momentum, the concept of a data strategy remains new to many organizations. They haven’t thought about it in the past, so it is uncharted territory, or maybe even an unknown-unknown. With that thought in mind, in this post, I will walk through some key considerations for building a robust data strategy.
Why is a robust data strategy important? A data strategy is a business-driven initiative in which technology plays an important supporting role. You always start with a set of business objectives, and having the right data when you need it translates into business advantages.
The Big Picture
A well-thought-out data strategy will have components specific to one’s own organization and application area. There are, however, important commonalities to any approach. Some of the more important ones include methods for data acquisition, data persistence, feature identification and extraction, analytics, and visualization, three of which I will discuss here.
When I give talks about the data science solutions my team develops, I often reference a diagram describing how many data scientists organize the information flow through their experiments. A good data strategy needs to be informed by these concepts—your choices will either facilitate or hinder how your analysts are able to extract insights from your data!
Data Acquisition and Persistence
Before outlining a data strategy, one needs to enumerate all the sources of data that will be important to the organization. In some businesses, these could be real-time transactions, while in others these could be free-text user feedback or log files from climate control systems. While there are countless potential sources of data, the important point is to identify all of the data that will play into the organization’s strategy at the outset. The goal is to avoid time-consuming additional steps further along in the process.
In one project I worked on when I was but a wee data scientist, we needed to obtain free-text data from scientific publications and merge the documents with metadata extracted from a second source. The data extraction process was reasonably time-consuming, so we had to do this as a batch operation and store the data to disk. After we finished merging our data sources, I realized I had forgotten to include a data source we were going to need for annotating some of the scientific concepts in our document corpus. Because we had to do a separate merge step, our experimental workflow took a great deal more time, necessitating many avoidable late hours at the office. The big lesson here: Proactively thinking through all the data that will be important to your organization is a reliable way to save some headaches down the road.
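To make that lesson concrete, here is a minimal sketch of how one might declare every required source up front and refuse to start an expensive merge until all of them are present. The source names (`documents`, `metadata`, `annotations`) and the `merge_sources` helper are hypothetical, chosen only to mirror the anecdote above; this is not the code from that project.

```python
# Hypothetical source names for illustration; they mirror the anecdote,
# not any real project layout.
REQUIRED_SOURCES = ("documents", "metadata", "annotations")


def merge_sources(sources):
    """Join per-document records from every required source in one pass.

    `sources` maps a source name to a dict keyed by document ID.
    Raises ValueError if a required source is missing, so the gap is
    caught before the batch merge starts rather than after it finishes.
    """
    missing = [name for name in REQUIRED_SOURCES if name not in sources]
    if missing:
        raise ValueError(f"missing required sources: {missing}")

    merged = {}
    for doc_id, text in sources["documents"].items():
        merged[doc_id] = {
            "text": text,
            # .get() yields None when a source has no record for this document.
            "metadata": sources["metadata"].get(doc_id),
            "annotations": sources["annotations"].get(doc_id),
        }
    return merged


sources = {
    "documents": {"doc1": "Protein folding in yeast."},
    "metadata": {"doc1": {"year": 2015}},
    "annotations": {"doc1": ["protein folding"]},
}
merged = merge_sources(sources)
```

The check is deliberately blunt: a missing source fails immediately, which is much cheaper than discovering the omission after a long batch job.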
Once you have thought through data acquisition, it’s easier to make decisions about how (or if) these data will persist and be shared over time. To this end, there have never been more options for how one might want to keep data around. Your choices here should be informed by a few factors, including the data types in question, the speed at which new data points arrive (e.g., is it a static data set or real-time transactional data?), whether your storage needs to be optimized for reading or writing data, and which internal groups are likely to need access. In all likelihood, your organization’s solution will involve a combination of several of these data persistence options.
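One lightweight way to keep these factors visible is to record them per data source in a small inventory. The sketch below is only an illustration: the `DataSource` fields, the example entries, and the helper functions are assumptions of mine, and it recommends no particular storage technology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One entry in a hypothetical data-source inventory."""
    name: str
    data_type: str      # e.g. "free text", "transactions", "log files"
    arrival: str        # "static" or "real-time"
    optimized_for: str  # "read" or "write"
    consumers: tuple    # internal groups that need access


# Illustrative entries only; a real inventory would come from your organization.
inventory = [
    DataSource("user_feedback", "free text", "static", "read",
               ("analytics", "product")),
    DataSource("transactions", "transactions", "real-time", "write",
               ("finance", "analytics")),
    DataSource("climate_logs", "log files", "real-time", "write",
               ("facilities",)),
]


def sources_for(group, inventory):
    """Return the names of the sources a given internal group needs."""
    return [s.name for s in inventory if group in s.consumers]


def group_by_profile(inventory):
    """Group source names by (arrival, optimized_for) to show which
    sources share similar persistence requirements."""
    profiles = {}
    for s in inventory:
        profiles.setdefault((s.arrival, s.optimized_for), []).append(s.name)
    return profiles


analytics_sources = sources_for("analytics", inventory)
profiles = group_by_profile(inventory)
```

Grouping sources by profile makes it easier to see why a single organization tends to end up with several persistence options rather than one.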
Your choices are also likely to change in big versus small data situations. How do you know if you have big data? If it won’t fit in a standard-size grocery bag, you may have big data. In all seriousness though, my rule of thumb is that once infrastructure (i.e., the grocery bag) is a central part of your data persistence solution, you are effectively dealing with big data. There are many resources that outline the advantages and disadvantages of the available options. These days, many downstream feature extraction and analytical methods have libraries for transacting with the more popular options, so it’s best to base your decision on expected data types, optimizations, and data volume.
Feature Identification and Extraction
In data science, a “feature” is the information a machine learning algorithm will use during the training stage for a predictive model, as well as what it will use to make a prediction regarding a previously-unseen data point. In the case of text classification, features could be the individual words in a document; in financial analytics, a feature might be the price of a stock on a particular day.
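As a small illustration of the text classification case, the sketch below turns a document into word-count features using only the Python standard library. The `word_features` helper and its simple tokenization rule are assumptions for this example; real projects usually rely on an established library for this step.

```python
import re
from collections import Counter


def word_features(document):
    """Map each lowercase word in `document` to the number of times it appears."""
    # Simplistic tokenizer for illustration: runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)


features = word_features("The data strategy shapes the data pipeline.")
# features["data"] is 2; a word that never appears, such as "model", counts as 0.
```

A model would see the same kind of dictionary both during training and when scoring a previously unseen document.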
Most data strategies would do well to steer away from micromanaging how the analysts will approach this step of their work. However, there are organization-level decisions that can be made that will facilitate efficiency and creativity here. The most important approach, in my mind, is fostering an environment that encourages developers to draw from, and contribute to, the open source community. This is essential.
Many of the most effective and common methods for feature extraction and data processing are well-understood, and excellent approaches have been implemented in the open source community (e.g., in Python*, R*, or Spark*). In many situations, analysts will get the most mileage out of trying one of these methods. In a research setting, they may be able to try out custom methods that are effective in a particular application domain. It will benefit both employee morale and your organization’s reputation if they are encouraged to contribute these discoveries back to the open source community.
Again, I think it’s key for an organization-level data strategy to avoid micromanagement of the algorithm choices analysts make in performing predictive analytics, but I would still argue that there are analytical considerations that should be included in a robust data strategy. Overseeing data governance (the management of the availability, usability, integrity, and security of your organization’s data) is a central part of the CDO’s role, and analytics is where a lot of this can break down or reveal holes in your strategy. Even if your strategy leverages NoSQL databases, if the relationships between data points are poorly understood or not documented, it’s possible that the analysts could be missing important connections, or even prevented from accessing certain data altogether.
To take a step back, a data strategy should include identification of software tools that your organization will rely upon. Intel can help here. Intel has led or contributed actively to the development of a wide range of platforms, libraries, and programming languages that provide ready-to-use resources for data analytics initiatives.
To help with analytical steps and some aspects of feature identification and extraction, you can leverage the Intel® Math Kernel Library (Intel® MKL), Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and the Intel® Data Analytics Acceleration Library (Intel® DAAL), as well as BigDL and the Intel® Distribution for Python*.
- Intel® MKL arms you with highly optimized, threaded, and vectorized functions to increase performance on Intel processors.
- Intel® MKL-DNN provides performance enhancements for accelerating deep learning frameworks on Intel architecture.
- Intel® DAAL delivers highly tuned functions for deep learning, classical machine learning, and data analytics performance.
- BigDL simplifies the development of deep learning applications for use as standard Spark programs.
- The Intel® Distribution for Python* accelerates Python application performance on Intel platforms.
Ready for a deeper dive? Our “Tame the Data Deluge” whitepaper is a great place to get started. For some real-life examples of the way organizations are using data science to make better decisions in less time, visit the Intel Advanced Analytics site.