AI has made big strides in the past several years. E-commerce sites recommend future purchases based on past purchases, Alexa* and other digital assistants respond to our inquiries, and social media platforms help us organize and tag our photos for easier search. More and more organizations are bringing the power of AI to bear on their processes, initiatives, and operations.
The International Comparison Program (ICP) team did exactly that. The ICP team in the World Bank Development Data Group used Intel's BigDL framework (a distributed deep-learning library for Apache Spark*) and an AWS Databricks* platform running on Intel® Xeon® processors to help classify more than 1 million crowdsourced photos before sharing the dataset with the public.
The ICP Pilot Study
The photos were collected as part of the pilot data collection study the ICP commissioned from December 2015 to August 2016. For the project, paid contributors used smartphones to gather photos and price-related data for a variety of household goods and services (in 162 categories from food to footwear) in 15 countries: Argentina, Bangladesh, Brazil, Cambodia, Colombia, Ghana, Indonesia, Kenya, Malawi, Nigeria, Peru, Philippines, South Africa, Venezuela, and Vietnam.
To efficiently compare all those photos within and across countries, the ICP team turned to AI and deep-learning models that could help review, search, and sort the images into 162 categories.
In short, they needed to automate the process of confirming that the crowdsourced photos matched the goods and services for which the observations were submitted—and to remove personally identifiable information (PII) from the photos along the way.
What Was the Point?
Why go to all this trouble? Here’s a bit of context. The ICP has existed for 50 years; this global data initiative is led by the World Bank under the auspices of the United Nations Statistical Commission. The ICP “measures world economies,” providing the kind of data that enables its parent organization, the World Bank Group, to pursue its larger mission: to reduce poverty, increase shared prosperity, and promote sustainable development by partnering with governments and the private sector around the world.
With an ultimate mission like that, the quality, integrity, and confidentiality of the data matter. The innovative crowdsourcing approach to data collection, coupled with intensive use of AI in the cloud, helped the World Bank team reduce the labor-intensive work of manually reviewing, searching, and sorting the images. The completed dataset was then made public and is used to train various deep-learning models.
The Two-Phased Process
To arrive at a usable dataset from photos that varied in quality and included a mix of typed and handwritten text in different languages, the team focused on cleaning the images and assessing their reliability. They did this by classifying the images with models running on Intel’s BigDL framework, labeling each image as tagged correctly, tagged incorrectly, or invalid.
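Conceptually, that three-way labeling is a simple decision rule: compare the model's predicted category and confidence against the crowdsourced tag. The function below is a hypothetical illustration of the idea, not the team's actual code; the confidence threshold and label strings are assumptions.

```python
def validate_image(predicted_category: str,
                   confidence: float,
                   crowdsourced_tag: str,
                   min_confidence: float = 0.5) -> str:
    """Assign one of the three validation outcomes used in Phase 1.

    An image whose prediction is too uncertain is treated as invalid;
    otherwise it is tagged correctly or incorrectly depending on whether
    the model agrees with the crowdsourced label.
    """
    if confidence < min_confidence:
        return "invalid"          # e.g., a blurry or off-topic photo
    if predicted_category == crowdsourced_tag:
        return "tagged correctly"
    return "tagged incorrectly"

print(validate_image("footwear", 0.92, "footwear"))  # tagged correctly
print(validate_image("food", 0.81, "footwear"))      # tagged incorrectly
print(validate_image("food", 0.30, "food"))          # invalid
```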
Next, the photos identified as tagged correctly were used to train a model that recognized the types of goods or services shown in each photo. The end-to-end process comprised the following steps:
- Define image quality and eliminate poor quality images.
- Classify images to validate existing labels.
- Identify images containing text in the existing dataset and outline the text regions.
- Recognize the words in that text.
- Determine whether the text contains PII.
- Blur areas with PII text.
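Once text has been recognized, the PII-detection step amounts to deciding which text spans are sensitive and then obscuring the corresponding image regions. The sketch below is a hypothetical illustration of that detect-then-redact idea using regular expressions for phone numbers and email addresses; the actual pipeline used deep-learning models for text detection and recognition, and blurred pixels rather than characters.

```python
import re

# Hypothetical PII patterns: phone-number-like digit runs and email addresses.
PII_PATTERNS = [
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # phone-like digits
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
]

def contains_pii(text: str) -> bool:
    """Return True if any recognized text looks like PII."""
    return any(p.search(text) for p in PII_PATTERNS)

def redact(text: str) -> str:
    """Replace PII spans with asterisks (a stand-in for image blurring)."""
    for pattern in PII_PATTERNS:
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text

print(contains_pii("Rice 5kg - 1200 pesos"))  # False: a price, not PII
print(redact("Call 555-123-4567 to order"))   # Call ************ to order
```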
The Solution Architecture
The team built a solution architecture on the AWS Cloud using Intel® Xeon® processors, Databricks Spark, and the Intel BigDL deep-learning framework. With BigDL, users can write their deep-learning applications as standard Spark programs, which run directly on top of existing Spark or Hadoop clusters.
This unified platform lets customers eliminate unnecessary dataset transfers between separate systems, retire separate hardware clusters (for example, distinct CPU and GPU clusters) in favor of a single CPU cluster, and thereby reduce system complexity and end-to-end latency.
Here’s the solution architecture for the ICP pilot study:
Model Development and Results
The World Bank team used the Inception* v1 model for transfer learning and fine-tuning on a partial dataset. The team loaded a pretrained Caffe* Inception v1 model into BigDL and added a fully connected layer with a customized softmax classifier.
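The added classification head is, conceptually, a linear (fully connected) layer followed by a softmax that turns raw scores into category probabilities. A minimal pure-Python sketch of that computation, with made-up toy weights rather than the team's trained parameters:

```python
import math

def fully_connected(features, weights, biases):
    """Linear layer: one output score per category."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 3 image features, 2 categories (weights are illustrative only).
features = [0.5, -1.2, 3.0]
weights = [[0.2, 0.1, 0.4], [-0.3, 0.5, 0.1]]
biases = [0.0, 0.1]
probs = softmax(fully_connected(features, weights, biases))
print(probs)  # two probabilities summing to 1.0
```

During fine-tuning, the pretrained Inception layers supply the features and only weights like these (plus, optionally, the earlier layers) are updated for the new categories.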
By starting from pretrained weights, the team reduced training time and improved model accuracy compared to training an Inception model from scratch. The team was also able to scale the model effectively on multinode clusters in AWS Databricks.
For Phase 1, the team first ran a test on a partial dataset (1,927 images, nine categories) to compare training from scratch vs. transfer learning vs. fine tuning:
Since fine-tuning with Inception v1 showed the best results, it was used to complete model training on the whole dataset (994,325 images, 69 categories). This training ran on a 20-node cluster of AWS R4.8xlarge instances with Intel® Xeon® processors and produced the following results:
| Nodes | Cores | Batch Size | Epochs | Training Time (sec) | Throughput (images/sec) | Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- |
The team then conducted scalability tests on the partial dataset, running Inception v1 on eight nodes vs. 16 nodes. The tests showed near-linear scaling with BigDL, with throughput increasing from 56.7 images/sec to 99.6 images/sec. Using a Spark-native deep-learning framework like BigDL let the model take full advantage of efficient distributed training.
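Those throughput figures imply near-linear scaling: doubling from eight to 16 nodes would ideally double throughput, and the measured numbers reach most of that ideal. A quick check of the scaling efficiency:

```python
def scaling_efficiency(base_throughput, scaled_throughput, node_factor):
    """Measured speedup divided by the ideal (linear) speedup."""
    return (scaled_throughput / base_throughput) / node_factor

# Figures from the ICP scalability test: 8 -> 16 nodes (a 2x node factor).
eff = scaling_efficiency(56.7, 99.6, node_factor=2)
print(f"{eff:.1%}")  # roughly 88% of ideal linear scaling
```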
| Nodes | Batch Size | Epochs | Throughput (images/sec) | Training Time |
| --- | --- | --- | --- | --- |
As a result of the partial-dataset training and scalability tests, the World Bank team built an application that automates the validation process, confirming that photos gathered through the crowdsourced data collection pilot matched the goods for which observations were submitted. That produced a clean, validated image dataset and a clear picture of the collection's reliability.
You can try image classification using BigDL with this World Bank code at https://github.com/intel-analytics/WorldBankPoC.
This is just one of many examples of Intel’s BigDL platform enabling the application of AI and deep learning to solve real-world challenges. Try applying BigDL to your business and data challenges. Join in and contribute to the project: https://github.com/intel-analytics/BigDL