The search for unknown unknowns has the greatest potential for new discoveries in large data collections. This is sometimes called unsupervised discovery. It is considered to be the purest form of data mining. Specifically, this refers to exploring unlabeled datasets with unsupervised machine learning algorithms (e.g., Clustering, Association Mining, Link Analysis, PCA, ICA, Outlier Detection) without the preconceived bias of class labels or the limited investigation of previously known correlations. The authors of an article on high-dimensional data expressed it this way: “unsupervised exploratory analysis plays an important role in the study of large, high-dimensional datasets that arise in a variety of applications.”
In broad terms, unsupervised discovery comes in four flavors: class discovery, novelty discovery, association discovery, and correlation discovery. Class discovery is finding new classes and subclasses of objects, events, and behaviors in multi-dimensional data. Novelty discovery is sometimes called outlier detection or anomaly detection, but I prefer to call it “Surprise Discovery”: finding the novel, unexpected, and interesting data points and patterns (the unknown unknowns) in your dataset that lie outside the bounds of your expectations (i.e., Surprise!). Association discovery is finding unusual or improbable co-occurring associations and linkages among the items in your dataset. For a vivid discussion of association discovery in big data, see my TEDx video presentation on “Big Data, Small World.” We will focus here on the fourth variety of unsupervised discovery (correlation discovery) through a discussion of the tools and techniques of Exploratory Data Analysis.
Big Data Variety Takes Center Stage
With all of the discussion about Big Data these days, there is frequent reference to the 3 V’s that represent the top big data challenges: Volume, Velocity, and Variety. These 3 V’s generally refer to the size of the dataset (Volume), the rate at which data is flowing into (or out of) your systems (Velocity), and the complexity (dimensionality) of the data (Variety). Most practitioners agree that big data volume is indeed huge, but that is not necessarily big data’s biggest challenge, at least not in terms of data storage capacities, which are also growing rapidly and keeping pace with data volume. The velocity of big data is also a very big challenge, though primarily for applications and use cases that specifically demand near-real-time analysis and response to dynamic data streams. However, most will agree that the variety (complexity) of the data, unlike its volume and velocity, is truly big data’s biggest mega-challenge at all scales and in most applications. Consequently, any dataset (whether large or small) that has hundreds or thousands (or more) dimensions per data item is difficult to explore, mine, and interpret. So, when you find a data tool that helps in the analysis of high-dimensional data, you stop and take a look. I did that recently with the AutoDiscovery tool from Butler Scientifics.
The Perfect Couple: Small Data and Exploratory Data Analysis (EDA)
First, note that this tool is not explicitly for big data, though it is certainly useful for small subsets of big data: that is, small data! The focus is therefore on scientific discovery from small data. This is the style of data science that nearly every scientist needs to carry out on a routine basis, since data from daily experiments are rarely in the rarefied realm of big data, but modern scientific instruments often do generate large numbers of measured parameters per data object. AutoDiscovery enables the discovery, exploration, and visualization of correlations in high-dimensional data from such experiments (i.e., Exploratory Data Analysis, or EDA).
The Top 10 Features of an EDA Tool
One of the most sensible characteristics of AutoDiscovery is that it does not try to be the “one tool” for all possible statistical analyses. There are other statistical software packages that already do that, and there is no need to compete with giants like R, SAS, or SPSS. Consequently, AutoDiscovery aims to satisfy a very particular scientific discovery requirement: correlation discovery in the high-dimensional parameter spaces of complex (high-variety) data. It is a complement to those other (more comprehensive) statistical packages, not a competitor.
Correlation discovery alone may seem relatively simple and thus a specialized tool for it seems unnecessary. However, several proprietary features within AutoDiscovery can more than justify its use. The top 10 features of AutoDiscovery for exploring complex relationships in data for scientific discovery are:
(1) simplified, visual integration of data from multiple sources (including “primary key” discovery across multiple data tables);
(2) the streamlined, easy-to-use EDA visual environment for data selection, filtering, and exploration;
(3) rapid discovery of interesting findings that can confirm (or deny) initial hypotheses, inform further experimentation and experimental design, and generate multiple additional testable hypotheses;
(4) automatic search for significant correlations across the full set of pairwise parameter combinations in your dataset;
(5) automatic search for significant correlations between virtual parameters (i.e., the ratios of the original input parameters);
(6) quantitative assessment and evaluation of the value of each finding;
(7) automatic sorting of results, including deprecation of weak and insignificant correlations, placing them lower in the output listings, though still searchable if wanted;
(8) optional correlation analyses within multiple sub-segments of each parameter’s range of possible values (thereby enabling discovery of changes in the parameter correlations across these limited ranges of the data values, which is a reality often observed in complex scientific experiments);
(9) visual tools that present the linked network of the most significant pair-wise correlations among scientific parameters; and
(10) correlation analysis outputs (tables, visualizations, and the ability to export the correlation tables) that enable efficient and effective browsing, exploration, and navigation of causal connections (and the causal direction) in correlated data items.
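To make features (4), (5), and (7) concrete, here is a minimal Python sketch of what an automatic pairwise correlation search might look like. It is only an illustration under my own assumptions, not AutoDiscovery's proprietary algorithm: the function names, the Pearson coefficient as the score, the 0.8 threshold, and the toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / sqrt(sxx * syy)


def add_ratio_columns(table):
    """Return a copy of the table extended with 'a/b' virtual parameters."""
    extended = dict(table)
    for a, b in combinations(table, 2):
        if all(v != 0 for v in table[b]):  # skip ratios that would divide by zero
            extended[f"{a}/{b}"] = [p / q for p, q in zip(table[a], table[b])]
    return extended


def rank_correlations(table, threshold=0.8):
    """Score every column pair and sort by |r|, strongest first.

    Weak pairs are kept but sink to the bottom of the listing.
    The threshold is an arbitrary illustrative cut-off.
    """
    findings = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        findings.append((a, b, r, abs(r) >= threshold))
    findings.sort(key=lambda f: abs(f[2]), reverse=True)
    return findings


# Hypothetical toy measurements: volume tracks mass, noise does not.
data = {
    "mass": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "volume": [2.0, 4.1, 5.9, 8.2, 9.9, 12.1],
    "noise": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}

findings = rank_correlations(data)
for a, b, r, strong in rank_correlations(add_ratio_columns(data)):
    print(f"{a:>14} vs {b:<14} r={r:+.3f} {'strong' if strong else 'weak'}")
```

Note that a ratio column is built from its own components, so any correlation between a ratio and one of its inputs needs a skeptical look before it counts as a finding. The number of pairs also grows quadratically with the number of parameters, which is one reason an automated search is attractive for high-variety data.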
Exploratory and Confirmatory Data Analyses
For scientists, the use of EDA for initial exploratory studies is crucial in the early stages of an experiment. Together, exploratory and confirmatory analyses enable discovery, hypothesis testing, and refinement of scientific hypotheses. More detailed analysis would follow from initial discoveries of interesting and significant parameter correlations within complex high-dimensional data. An article was recently published in Nature on “Statistical Errors – p Values, the Gold Standard of Statistical Validity, Are Not as Reliable as Many Scientists Assume” (by Regina Nuzzo, Nature, 506, 150-152, 2014). In this article, Columbia University statistician Andrew Gelman states that instead of doing multiple separate small studies, “researchers would first do small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings.” In other words, a disciplined scientific methodology that includes both exploratory and confirmatory analyses can be documented within an open science framework (e.g., https://osf.io) to demonstrate repeatability and reproducibility in scientific experiments. This would break down the walls of “black box” software that hide the complex analyses that are being applied to complex data. The ability of the scientist and her/his peers to reproduce an experiment’s rationale as well as its results will yield greater transparency in scientific research. AutoDiscovery is a tool that can further the Open Science cause.
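One simple way to picture the two-stage idea is a hold-out split: search freely for candidate correlations on one half of the observations, then re-test only those candidates on the other half. The Python sketch below is my own stand-in for that workflow, not the procedure described in the Nature article and not part of AutoDiscovery. The function names, the 0.5 threshold, the 50/50 split, and the synthetic data are all assumptions made for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def strong_pairs(table, rows, threshold, pairs=None):
    """Column pairs whose |r| meets the threshold on the given row indices."""
    if pairs is None:
        pairs = combinations(table, 2)  # exploratory mode: try every pair
    found = set()
    for a, b in pairs:
        xa = [table[a][i] for i in rows]
        xb = [table[b][i] for i in rows]
        if abs(pearson(xa, xb)) >= threshold:
            found.add((a, b))
    return found


def explore_then_confirm(table, threshold=0.5, seed=0):
    """Find candidates on one half of the rows, re-test them on the other."""
    n = len(next(iter(table.values())))
    rows = list(range(n))
    random.Random(seed).shuffle(rows)
    explore_rows, confirm_rows = rows[: n // 2], rows[n // 2:]
    # Exploratory stage: cast a wide net, tolerate false alarms.
    candidates = strong_pairs(table, explore_rows, threshold)
    # Confirmatory stage: only the pre-selected pairs, only unseen rows.
    confirmed = strong_pairs(table, confirm_rows, threshold, pairs=candidates)
    return candidates, confirmed


# Synthetic data: y depends on x, z is unrelated noise.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(40)]
table = {
    "x": x,
    "y": [2 * v + rng.gauss(0, 0.5) for v in x],
    "z": [rng.gauss(0, 1) for _ in x],
}

candidates, confirmed = explore_then_confirm(table)
print("candidates:", sorted(candidates))
print("confirmed: ", sorted(confirmed))
```

The point of the design is that the confirmatory rows never influence which pairs get tested, so a pair that survives both stages is less likely to be a false alarm from the wide exploratory search.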
Four Benefits of EDA from Early Discoveries in Complex Data
AutoDiscovery objectively discovers interesting findings in the early stages of research. This provides four additional benefits to the scientist in the EDA stage of research: (a) informs improvements in the experimental design; (b) validates and substantiates a priori hypotheses; (c) generates multiple new testable hypotheses; and (d) reveals promising “hot spots” in the data that require deeper statistical analysis. The latter capability is quite exciting: “Interestingness” Discovery, i.e., finding the unexpected, unusual, “interesting” regions and features within your data’s multi-dimensional parameter space! Especially with complex data, the combination of these capabilities empowers the data scientist to tell the “data story” in the full dimensionality of the dataset, not just in a few limited 2-D or 3-D projections. Consequently, AutoDiscovery is an objective, quantifiable feature-discovery tool that presents the most interesting correlations to end-users for efficient and effective EDA: efficient in the sense that automatic discovery of the most interesting data correlations for deeper analysis avoids lots of useless searches and manual manipulations of the data collection; and effective in the sense that novel discoveries (beyond known correlations and expected relationships) are made possible.
Three Types of Data Relationships: Monotonic, Non-monotonic, Multi-valued
The discovery of more complex relationships (e.g., multi-valued or non-monotonic data patterns) in multi-dimensional data requires specialized tools and transformations that are currently beyond the scope of AutoDiscovery (or of any other readily accessible tool), though discovery of these types of patterns may be enabled in future releases of EDA tools. An example of a multi-valued data relationship is the S-shaped 2D surface embedded in a 3D space (shown here). Discovery of such hypersurfaces requires special manifold learning algorithms (such as locally linear embedding) that are not available in off-the-shelf EDA packages. An example of a non-monotonic data relationship is revealed in the solution to the “island of games” puzzle (the problem statement is here; and the solution is revealed here). Monotonic relationships typically underlie cause-effect studies in science, and that's why EDA software (such as AutoDiscovery) currently targets the discovery of those types of data relationships.
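The difference between monotonic and non-monotonic relationships shows up clearly in standard correlation coefficients. The Python sketch below compares the Pearson (linear) and Spearman (rank) coefficients on two toy relationships that I chose for illustration: an exponential curve (monotonic but nonlinear) and a parabola over a symmetric range (non-monotonic). None of this is taken from AutoDiscovery; the helper names and data are hypothetical.

```python
from math import exp, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson coefficient of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


xs = list(range(-5, 6))
exp_y = [exp(v) for v in xs]   # monotonic, strongly nonlinear
sq_y = [v * v for v in xs]     # non-monotonic: falls, then rises

print("exp(x): pearson=%.3f spearman=%.3f" % (pearson(xs, exp_y), spearman(xs, exp_y)))
print("x**2:   pearson=%.3f spearman=%.3f" % (pearson(xs, sq_y), spearman(xs, sq_y)))
```

The rank coefficient recovers the exponential relationship perfectly while the linear coefficient understates it, yet both coefficients are essentially zero for the parabola even though y is completely determined by x. That is the sense in which non-monotonic patterns call for different tools than a pairwise correlation search.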
Case Study and Getting Started
The Butler Scientifics website reports a case study in which neuroscientists in the Laboratory of Adult Neurogenesis at Cajal Institute (CSIC, Madrid) used AutoDiscovery to discover correlations between neuron properties and behavior patterns, and the effects of stress and anxiety on learning and memory capacity. They describe the results this way: “AutoDiscovery took less than 2 hours to find out not only all the correlations that the group had identified during their 8-weeks intensive work but also several key correlations that, with a further confirmatory phase, confirmed their original hypothesis.” That is precisely the type of efficiency amplifier that I can use in my research, and I believe that other scientists will experience similar accelerations of their discovery science.
Read more about AutoDiscovery, download a free trial, request a demo, and begin discovering the most interesting features in your ocean of complex data today at http://www.butlerscientifics.com/. A new release (ver. 2.0) of AutoDiscovery is now available for all scientists and data explorers to begin exploring the complex relationships within their data for scientific discovery. Butler Scientifics offers user support and a variety of licensing terms (for individuals, or small research teams, or entire research institutions).
Dr. Kirk Borne is a Data Scientist and Professor of Astrophysics and Computational Science in the George Mason University School of Physics, Astronomy, and Computational Sciences. He received his B.S. degree in physics from LSU and his Ph.D. in astronomy from the California Institute of Technology. He has been at Mason since 2003, where he teaches graduate and undergraduate courses in Data Science and advises many doctoral dissertation students in Data Science research projects. He focuses on achieving big discoveries from big data and promotes the use of data-centric experiences with big data in the STEM education pipeline at all levels. He promotes the "Borne Ultimatum" -- data literacy for all!
Follow Kirk on Twitter at @KirkDBorne.
Find links to his blogs, courses taught, and publication listings at http://kirkborne.net/