Detecting Fraud Using Big Data Analytics with IDH

Palanivelu Balasubramanian (Bala), Business Development Manager, Big Data, Intel.

Bala has more than 25 years of experience in the information management (IM) and analytics domain. Over the years as a consultant, he has built an excellent track record in influencing customers and architecting solutions for Fortune 100 customers in the IM space (big data, BI, data warehousing, and more). He has held leadership roles in various capacities supporting sales and delivery organizations. Prior to joining Intel, he was the Practice Principal (FSI) within the Information Management & Analytics division at HP. He joined HP with the acquisition of Knightsbridge Solutions.

1. Introduction

Industry research shows that financial institutions lose billions of dollars to fraudulent activities. The impact of fraud is not limited to money; it also damages customer relationships and the reputation and goodwill of the institution. As the influence of technology increases, fraudsters find creative ways to manipulate systems to their advantage. Common fraud schemes include money laundering, forgery, identity theft, fraudulent claims, insider trading, credit card fraud, mortgage fraud, wire transfer fraud, and cyber-attacks. Preventing and detecting fraud has always been one of the biggest challenges in the financial services industry. As client interactions become more complex, instantaneous, and data-intensive, banks have to adapt by deploying smarter ways to prevent fraud, enforce governance measures, and reduce risks.

Staying ahead of fraudsters is the key to preventing fraud, and analytics can aid in both the detection and prevention process. The first step is to learn from history to prevent similar future events. Understanding the fraud history, the patterns of fraud, the situations that trigger fraud, and customer behavior patterns, along with knowing your customer/employee and applying sentiment analysis, are some of the key analysis steps institutions need to follow. The next step is to define the rules and models that detect fraud and to build alert mechanisms that monitor automatically on an ongoing basis; a minimal sketch of such a rule-based check follows. Using machine learning methods to predict future incidents is also a key detection step. Securing the data and the data access is equally important to effectively combat fraud.
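To make the rules-and-alerts step concrete, here is a minimal sketch of a rule-based transaction check in Java. The Transaction record, its fields, and the amount/velocity thresholds are hypothetical illustrations (not part of any Hadoop or IDH API); in practice, rules like these would be derived from the historical analysis described above.

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal rule-based fraud check; rules and thresholds are illustrative only. */
    public class RuleBasedFraudCheck {

        /** Hypothetical transaction record. */
        static class Transaction {
            final String accountId;
            final double amount;
            final String country;       // country where the transaction originated
            final int txCountLastHour;  // velocity: transactions on this account in the last hour

            Transaction(String accountId, double amount, String country, int txCountLastHour) {
                this.accountId = accountId;
                this.amount = amount;
                this.country = country;
                this.txCountLastHour = txCountLastHour;
            }
        }

        // Illustrative thresholds; real values come from mining historical fraud data.
        static final double LARGE_AMOUNT = 10000.0;
        static final int MAX_TX_PER_HOUR = 20;

        /** Returns the names of the rules the transaction violates. */
        static List<String> evaluate(Transaction tx, String homeCountry) {
            List<String> alerts = new ArrayList<String>();
            if (tx.amount >= LARGE_AMOUNT) alerts.add("LARGE_AMOUNT");
            if (tx.txCountLastHour > MAX_TX_PER_HOUR) alerts.add("HIGH_VELOCITY");
            if (!tx.country.equals(homeCountry)) alerts.add("FOREIGN_COUNTRY");
            return alerts;
        }

        public static void main(String[] args) {
            Transaction tx = new Transaction("ACCT-42", 12500.0, "RO", 25);
            // Prints: [LARGE_AMOUNT, HIGH_VELOCITY, FOREIGN_COUNTRY]
            System.out.println(evaluate(tx, "US"));
        }
    }

In a production system, each triggered rule would feed an alerting workflow rather than standard output, and the thresholds would be refined continuously as new fraud patterns are learned.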

Analyzing years of historical data and integrating new kinds of data are normal steps in such activities. Having a scalable, high-performance, cost-effective, robust data management framework is essential to support the ever-growing data volume in these data-intensive processes. Some of the core data research activities include mining, profiling, searching, match-merging, building predictive models, and adopting machine learning methods; a simple match-merge sketch follows. It is important to note that these activities use structured as well as semi-structured and unstructured data.
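As one illustration, the sketch below merges customer records from different source systems under a normalized name-plus-date-of-birth key, a greatly simplified form of the match-merge step. The record layout and the normalization rule are hypothetical assumptions, not a prescribed method.

    import java.util.HashMap;
    import java.util.Map;

    /** Minimal match-merge sketch: group records from different source systems
     *  under one normalized key so a customer can be profiled as a whole. */
    public class MatchMerge {

        /** Hypothetical customer record. */
        static class Customer {
            final String name;
            final String dateOfBirth;   // e.g., "1975-03-14"
            final String sourceSystem;

            Customer(String name, String dateOfBirth, String sourceSystem) {
                this.name = name;
                this.dateOfBirth = dateOfBirth;
                this.sourceSystem = sourceSystem;
            }
        }

        /** Normalize the name (lower-case, strip punctuation, collapse spaces) and pair it with the DOB. */
        static String matchKey(Customer c) {
            String normalized = c.name.toLowerCase()
                    .replaceAll("[^a-z ]", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            return normalized + "|" + c.dateOfBirth;
        }

        public static void main(String[] args) {
            Customer[] records = {
                new Customer("John  Q. Smith", "1975-03-14", "cards"),
                new Customer("john q smith", "1975-03-14", "mortgage"),
                new Customer("Jane Doe", "1980-07-02", "cards"),
            };
            // Count how many source records collapse into each merged identity.
            Map<String, Integer> merged = new HashMap<String, Integer>();
            for (Customer c : records) {
                String key = matchKey(c);
                Integer n = merged.get(key);
                merged.put(key, n == null ? 1 : n + 1);
            }
            // The two "John Q. Smith" records merge under john q smith|1975-03-14.
            System.out.println(merged);
        }
    }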

2. Hadoop for Analytics

Apache Hadoop* is an open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers. The full stack includes common utilities, a distributed file system, analytics and data storage platforms, and an application layer that manages distributed processing, parallel computation, workflow, and configuration management. In addition to high availability, the Hadoop framework is more cost-effective at handling large, complex, or unstructured data sets than conventional approaches and offers massive scalability and speed.

Hadoop can store any kind of data, both structured and unstructured, without requiring data conversion. Because data can be stored as-is from the source (no upfront schema design, no data loss), implementation is faster and quick data exploration becomes possible. Traditional tools and infrastructure struggle to address larger and more varied data sets arriving at high speed. As the volume, variety, and velocity of data increase, enterprises are turning to a new approach to data analytics based on the open source Apache Hadoop* platform; a minimal MapReduce example follows.
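To illustrate Hadoop's programming model, here is a minimal MapReduce job that totals transaction amounts per account, a common first pass when profiling accounts for anomalies. The CSV layout (accountId,timestamp,amount) is a hypothetical assumption; the classes used are the standard org.apache.hadoop.mapreduce API.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Sums transaction amounts per account from a CSV log laid out as
     *  "accountId,timestamp,amount" (hypothetical format). */
    public class AmountPerAccount {

        public static class AmountMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length >= 3) {
                    try {
                        // Emit (accountId, amount) for each transaction line.
                        context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
                    } catch (NumberFormatException e) {
                        // Skip header or malformed lines.
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable v : values) {
                    total += v.get();
                }
                context.write(key, new DoubleWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "amount-per-account");
            job.setJarByClass(AmountPerAccount.class);
            job.setMapperClass(AmountMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same per-account totals could also be produced with a few lines of Hive or Pig; the Java API is shown here because it exposes the map and reduce phases explicitly.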

Traditional data analysis of structured data is managed through models that define the parameters for a type of query. As data grew from megabytes to gigabytes and then to terabytes, data warehouse appliances emerged that use massively parallel processing (MPP) to distribute processing across compute nodes. Over time, these traditional systems were optimized to work at terascale with structured data.

At petabyte scale, RDBMSs and MPP systems cannot handle the volume of unstructured data. MPP systems have limited horizontal scalability, and the cost of adding proprietary appliances is often prohibitive.

The core Apache Hadoop ecosystem provides the following capabilities for effective data management (a brief usage sketch follows the list):

  • Core: A set of shared libraries
  • HDFS: The Hadoop filesystem
  • MapReduce: Parallel computation framework
  • ZooKeeper: Configuration management and coordination
  • HBase: Column-oriented database on HDFS
  • Hive: Data warehouse on HDFS with SQL-like access
  • Pig: Higher-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Mahout: A library of machine learning and data mining algorithms
  • Flume: Collection and import of log and event data
  • Sqoop: Imports data from relational databases
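As a brief illustration of using one of these components programmatically, the sketch below writes a flagged transaction into an HBase table using the classic (pre-1.0) HBase client API. The table name, column family, and row-key scheme are hypothetical assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Writes a flagged transaction into a hypothetical HBase table
     *  "fraud_alerts" with a column family "alert". */
    public class AlertWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            HTable table = new HTable(conf, "fraud_alerts");
            try {
                // Row key: accountId + transaction timestamp for fast per-account scans.
                Put put = new Put(Bytes.toBytes("ACCT-42#1371600000000"));
                put.add(Bytes.toBytes("alert"), Bytes.toBytes("rule"), Bytes.toBytes("HIGH_VELOCITY"));
                put.add(Bytes.toBytes("alert"), Bytes.toBytes("amount"), Bytes.toBytes("12500.00"));
                table.put(put);
            } finally {
                table.close();
            }
        }
    }

Storing alerts in HBase keeps them queryable in near real time by account, which suits the ongoing-monitoring pattern described earlier.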

Using the Hadoop ecosystem for fraud analytics can be the preferred solution to address the business and technology needs that are disrupting traditional data management and processing, and enterprises that adopt big data analytics can gain a competitive advantage.

3. Why IDH - Intel® Distribution for Apache Hadoop software?

IDH is a software platform that provides distributed data processing and data management for enterprise applications that analyze massive amounts of diverse data. The Intel Distribution includes Apache Hadoop and other software components with enhancements from Intel. Intel® Distribution for Apache Hadoop software is an enterprise-grade big data storage and analytics system that delivers real-time big data processing optimized for Intel processor-based infrastructure. It is supported by experts at Intel with deep optimization experience in the Apache Hadoop software stack as well as knowledge of the underlying processor, storage, and networking components.

[Figure: IDH architecture]

IDH is designed to enable the widest range of use cases on Hadoop by delivering the performance and security that enterprises need. Intel® Manager provides the management console for IDH. Designed to meet the needs of some of the most demanding enterprises in the world, Intel Manager simplifies the deployment, configuration, tuning, monitoring, and security of your Hadoop deployment. Together with Intel® Xeon® processors, SSDs, and Intel® 10GbE networking, IDH offers a robust platform upon which the ecosystem can innovate in delivering new analytics solutions. Intel delivers platform innovation in open source and is committed to supporting the Apache developer community with code and collaboration. Intel believes that every organization and individual should be able to generate value from all the data they can access.

The bottom line...

IDH is designed to reflect ongoing innovation in the hardware platform by delivering value in the Apache Hadoop software stack. Software engineers at Intel continue to enable advanced hardware capabilities in every layer of the software stack, from the hypervisor and Linux operating system to Java, Hadoop, HDFS, HBase, and Hive. This robust platform enables the entire software ecosystem to build innovative analytics solutions. Intel is committed to continuously strengthening IDH's enterprise capabilities from the ground up; the layers above are some of its key focus areas.

In summary, staying ahead of fraudsters is the key to preventing fraud, and organizations have to build an efficient people, process, and technology framework to combat it effectively. From a technology perspective, a robust data management framework is essential to enable analytics-driven fraud detection; IDH provides those capabilities and has enhanced the Hadoop framework to support enterprise needs. Intel continuously invests in research and enhances the capabilities of both the hardware and the software layers. Intel's global team of experienced professionals offers:

  • Decades of deployment expertise specific to Hadoop, BI, and the security domain
  • Best-practice deployment methodologies and tools
  • Advanced integration and operational assistance