By Aaron Taylor, Senior Analytics Software Engineer, Innovation Pathfinding Architecture Group (IPAG), Data Analytics & Machine Learning, Intel Corporation
Analyzing Big Data requires big computers, and high-performance computing (HPC) is increasingly being pressed into service for the job. However, HPC systems are complex beasts, often having thousands to tens of thousands of computing nodes, each with associated processor, memory, storage, and fabric resources. Keeping all the moving pieces running smoothly while balancing resource tradeoffs between performance and energy is a mammoth job.
Imagine the data traffic management job involved in collecting telemetry data from hundreds of thousands of processor, memory, and networking components every 30 milliseconds. In systems this complex, compute node component failures every minute are not uncommon.
To stay ahead of failures, data center managers need automated monitoring and management tools capable of collecting, transmitting, analyzing, and acting on torrents of system health data in real time. There are simply no tools available today that can do this across the fabric, memory, and processor resources for an entire cluster.
A new approach to telemetry analytics
We in Intel Data Analytics and Machine Learning Pathfinding have come up with a new approach for managing Big Data analytics systems, called Data Center Telemetry Analytics (DCTA). It uses hierarchical telemetry analytics and distributed compression to move primary analytics close to the source of the raw telemetry data, doing the initial analysis there and then sending only summarized results to a central DCTA system for analysis.
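The division of labor can be sketched in a few lines of Python. Everything below is illustrative, not the actual DCTA interfaces: the function names, the summary statistics chosen, and the two-node sample data are all assumptions.

```python
import statistics

def summarize_node(samples, node_id):
    """Node-side agent (hypothetical): reduce a window of raw telemetry
    samples to a small statistical summary instead of shipping every reading."""
    return {
        "node": node_id,
        "count": len(samples),
        "mean": statistics.fmean(samples),
        "stdev": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
    }

def central_aggregate(summaries):
    """Central DCTA side (hypothetical): combine per-node summaries into a
    cluster-wide view without ever seeing the raw samples."""
    total = sum(s["count"] for s in summaries)
    cluster_mean = sum(s["mean"] * s["count"] for s in summaries) / total
    return {"nodes": len(summaries), "samples": total, "mean": cluster_mean}

# Two nodes, each with a window of raw temperature readings (made-up values).
raw = {"node-0": [61.2, 60.8, 62.0, 61.5], "node-1": [70.1, 70.4, 69.8, 70.0]}
summaries = [summarize_node(v, k) for k, v in raw.items()]
print(central_aggregate(summaries))
```

Only the six-field summary crosses the fabric per window; the raw samples never leave the node.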
Over time, with enough health monitoring data in hand, you can use machine learning to build predictive fault models that characterize the response of the entire HPC system, not just individual nodes. And you don't have to store reams of raw telemetry data, because the algorithms learn what they need from incoming data, get smarter from it, then discard the data.
Our tests have demonstrated that DCTA lets data center operators engage in accurate predictive capacity planning; automate root-cause determination and resolution of IT issues; monitor compute-intensive jobs over time to assess performance trends; balance performance with energy constraints; proactively recommend processor, memory, and fabric upgrades or downgrades; predict system or component failures; and detect and respond to cyber intrusions within the data center.
The key: hierarchical data analytics
Key to the success of DCTA, and to using HPC to analyze Big Data in general, is hierarchical data analytics. With this technique, raw telemetry data is collected at each node and compressed using digital signal processing (DSP), statistical and stochastic methods, and machine learning techniques, while still preserving its context. That context improves over time as more data is analyzed and new features are derived, yielding more information about what's happening on each node.
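As a toy illustration of DSP-style compression that preserves context, the sketch below keeps only the strongest frequency components of a synthetic telemetry trace. The naive pure-Python DFT, the signal shape, the window size, and the coefficient count are all assumptions for illustration, not DCTA's actual algorithms.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2); fine for a small sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def compress(x, keep):
    """Store only the `keep` strongest frequency components as
    (index, coefficient) pairs; everything else is discarded."""
    X = dft(x)
    strongest = sorted(range(len(X)), key=lambda k: abs(X[k]), reverse=True)[:keep]
    return [(k, X[k]) for k in strongest], len(x)

def decompress(pairs, n):
    X = [0j] * n
    for k, c in pairs:
        X[k] = c
    return idft(X)

# Synthetic telemetry trace: a constant offset plus one slow oscillation.
n = 64
signal = [50 + 5 * math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
pairs, length = compress(signal, keep=3)
restored = decompress(pairs, length)
err = max(abs(a - b) for a, b in zip(signal, restored))
print(len(pairs), "coefficients instead of", n, "samples; max error", err)
```

Because this trace concentrates its energy in three frequency bins, 3 stored coefficients reconstruct all 64 samples almost exactly; real telemetry would trade some reconstruction error for the compression ratio.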
With enough information gathered over time, machine learning clustering and classification algorithms can characterize the system response at each node, and enable predictive fault detection and automated resource management to improve cluster resiliency and energy efficiency.
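One way to picture the clustering step is a minimal k-means pass over per-node feature vectors. The features, the node values, and the choice of k below are hypothetical; DCTA's actual algorithms are not described at this level in the text.

```python
import math

def kmeans(points, k=2, iters=20):
    """Minimal k-means sketch: group per-node feature vectors
    (e.g., [mean temperature, error rate]) into k behavioral clusters."""
    centers = points[:k]  # naive init: first k points as seeds
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

# Hypothetical per-node features: [mean temperature, correctable-error rate].
nodes = [[60, 0.1], [61, 0.2], [62, 0.1],   # behaving normally
         [85, 4.0], [88, 5.5]]              # running hot and erroring
centers, groups = kmeans(nodes)
print(sorted(len(g) for g in groups))  # → [2, 3]
```

The two hot, error-prone nodes land in their own cluster; in a predictive-fault setting, membership in (or drift toward) such a cluster is the signal that triggers proactive action.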
The ability to compact large amounts of raw data into a summary form greatly reduces the overhead of processing and transmitting enormous volumes of telemetry data across a data center fabric, which helps balance performance and energy consumption. The ability to tame telemetry data at its source essentially cuts system management down to size.
Consider the ripple effect: using DSP, initial raw telemetry data is compressed, which eliminates the need to store pure raw values. Over time, more context about system behavior is derived through the analysis of higher-level system features (e.g., statistical features). Data about these higher-level features can also be compressed using DSP techniques.
As the context is further built out over time, machine learning algorithms characterize the system responses at each level, yielding a small amount of data to store. There is no need to store the information-level features, as the localized system response has already been characterized.
With data thus shrunk, data center managers realize massive storage savings and transmit far less data across the fabric. That lets them characterize the response of the entire cluster more effectively while greatly reducing fabric latency, a major bottleneck in HPC and cloud computing.
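To make "massive" concrete, here is a back-of-envelope comparison. Every figure (node count, sensors per node, sample size, summary size and rate) is assumed for illustration; only the 30 ms sampling interval comes from the article.

```python
# Back-of-envelope: raw telemetry volume vs. summarized volume for one day.
# All figures below are assumptions for illustration.
nodes = 10_000
sensors_per_node = 100
sample_bytes = 8
samples_per_sec = 1 / 0.030          # one reading every 30 ms (from the article)
seconds_per_day = 86_400

raw_per_day = (nodes * sensors_per_node * sample_bytes
               * samples_per_sec * seconds_per_day)

summary_bytes = 64                   # assumed compact summary record
summaries_per_sec = 1                # assumed: one summary per node per second
summarized_per_day = nodes * summary_bytes * summaries_per_sec * seconds_per_day

print(f"raw: {raw_per_day / 1e12:.1f} TB/day, "
      f"summarized: {summarized_per_day / 1e9:.2f} GB/day")
print(f"reduction: {raw_per_day / summarized_per_day:.0f}x")
```

Under these assumptions, roughly 23 TB/day of raw telemetry collapses to about 55 GB/day of summaries, a reduction of more than two orders of magnitude before any further compression of the summaries themselves.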
In summary, compute-intensive data compression techniques (e.g., DSP and machine learning) can be applied hierarchically at each data source to greatly reduce storage requirements and the latency of transmitting data across the fabric. At the same time, system context is preserved and deepened over time, greatly improving resiliency and predictive capabilities.
Capabilities like these are key to cost-effectively meeting increasingly intensive compute and analytics requirements. We have developed working prototypes and demonstrated their effectiveness in Intel Data Analytics and Machine Learning Pathfinding and are working hard to bring DCTA to life.