Suneel is a Senior Software Engineer in Intel's Big Data Platform Engineering group and a committer and PMC member on the Apache Mahout project. Suneel first became involved with machine learning back in 2009 and has been working on machine learning projects since then. In early 2011 he became involved with the Apache Mahout project, and since then he has been very actively involved in it, contributing code and managing releases. Suneel was voted in as an Apache Mahout committer on April 3, 2013, and became a PMC member less than six months later. Suneel was a big part of the recent Mahout 0.8 release in July, possibly the most stable release of Mahout to date.
The Apache Mahout Machine Learning Library's goal is to build scalable machine learning libraries. Mahout's focus is primarily on Collaborative Filtering (Recommenders), Clustering, and Classification (known as the "3Cs"), as well as the infrastructure necessary to support those implementations. That includes math packages for statistics, linear algebra, and more, as well as Java primitive collections, local and distributed vector and matrix classes, and a variety of integration code for working with popular packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache Cassandra and more.
The next release of Mahout will be 0.9, planned for November 2013. It will be a minor release consisting mostly of bug fixes, removal of deprecated algorithms, and stabilization of the code base in preparation for the following 1.0 milestone, a major release planned for Q1 of 2014.
As the project moves toward a 1.0 release, the community is working to focus on key algorithms that are proven to scale in production and have seen widespread adoption. This post describes our planned future enhancements, based on feedback from the user community.
1. Better Clustering interfaces
Presently, the only interface to Mahout's clustering algorithms is the command line. The interface design needs to be improved with a RESTful API and/or a library API, similar to the Mahout Recommenders.
Below are the clustering algorithms presently in Mahout 0.8 and planned for future releases. All of the algorithm implementations in Mahout come in both sequential (single-threaded) and parallel (MapReduce) versions.
K-Means
K-Means is a standard clustering algorithm. Mahout's implementation of K-Means is described at Mahout K-Means Clustering.
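For readers new to the algorithm, here is a minimal, library-agnostic sketch of the standard (Lloyd's) K-Means iteration; this is illustrative Python, not Mahout's Java implementation:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Standard (Lloyd's) K-Means: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if its cluster is empty
                centroids[i] = tuple(sum(vals) / len(cluster)
                                     for vals in zip(*cluster))
    return centroids

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
print(sorted(kmeans(points, k=2)))  # the two cluster means, near (1.1, 0.9) and (8.1, 7.95)
```

Mahout's MapReduce version distributes the assignment step across mappers and the centroid update across reducers, but the per-iteration logic is the same.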
Canopy Clustering
Canopy clustering is often used as an initial step in more rigorous clustering techniques like K-Means. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data. Mahout's implementation of Canopy clustering is described at Mahout Canopy Clustering.
Fuzzy K-Means
Unlike K-Means, Fuzzy K-Means allows a data point to belong to several clusters (also known as soft clustering). Mahout's implementation of fuzzy clustering is described at Fuzzy K-Means.
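As a rough illustration of what "soft" assignment means (a sketch, not Mahout's code), each point receives a degree of membership in every cluster, controlled by a fuzziness parameter m > 1:

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy K-Means soft assignment: return the degree of membership
    of `point` in every cluster. The memberships sum to 1; m > 1
    controls how fuzzy the assignment is."""
    # Euclidean distance to each centroid (floored to avoid division by zero).
    dists = [max(sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5, 1e-12)
             for c in centroids]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exp for d_j in dists) for d_i in dists]

# A point halfway between two centroids belongs equally to both.
print(fuzzy_memberships((0.5, 0.0), [(0.0, 0.0), (1.0, 0.0)]))  # → [0.5, 0.5]
```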
Collapsed Variational Bayes (CVB)
The CVB algorithm implemented in Mahout for Latent Dirichlet Allocation (LDA) combines the advantages of both regular Variational Bayes and Gibbs Sampling. The algorithm relies on modeling the dependence of parameters on latent variables, which are in turn mutually independent. It was introduced in Mahout 0.8, replacing the project's earlier LDA implementation.
Streaming K-Means
Streaming K-Means is a new addition in Mahout 0.8 and, once mature, could replace the combination of Canopy and K-Means clustering in future Mahout releases; it holds a lot of promise for very fast, single-pass clustering of data.
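A rough, library-agnostic sketch of the single-pass idea follows; Mahout's actual Streaming K-Means also grows its distance cutoff adaptively and periodically collapses nearby centroids, which this sketch omits:

```python
def streaming_cluster(points, distance_cutoff):
    """Single-pass clustering sketch: each incoming point either folds
    into its nearest centroid (as a weighted mean) or, if it is farther
    than distance_cutoff, starts a new centroid of its own."""
    centroids = []  # list of (centroid, weight) pairs
    for p in points:
        if centroids:
            i = min(range(len(centroids)),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j][0])))
            c, w = centroids[i]
            if sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 <= distance_cutoff:
                # Fold the point into the nearest centroid (weighted mean).
                centroids[i] = (tuple((cv * w + pv) / (w + 1)
                                      for cv, pv in zip(c, p)), w + 1)
                continue
        centroids.append((p, 1))
    return centroids

clusters = streaming_cluster([(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)], 2.0)
print(len(clusters))  # → 2
```

The key property is that each point is examined exactly once, which is what makes the approach attractive for very large data sets.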
Simhash Clustering
Simhash clustering is presently not available in Mahout but would be a good addition to the repertoire. There have been several user requests for a way to determine similar documents in a large corpus. While Mahout's present RowSimilarityJob can be used to compute distances between documents, Simhash clustering would solve the problem more efficiently by clustering together documents whose fingerprints differ by only a small number of bits.
It should be quick and easy to build this by leveraging some of the existing code from Streaming K-Means clustering. The implementation would be based on Google's paper, Detecting Near-Duplicates for Web Crawling.
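A minimal sketch of the simhash fingerprint underlying that paper (Charikar's scheme); the whitespace tokenizer, MD5 token hash, and 64-bit fingerprint size here are arbitrary illustrative choices, not Mahout code:

```python
import hashlib

def simhash(tokens, bits=64):
    """Charikar-style simhash: every token votes +1/-1 on each bit
    position; the sign of the total gives that fingerprint bit.
    Near-duplicate documents get fingerprints differing in few bits."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of bits on which two fingerprints differ."""
    return bin(a ^ b).count("1")

d1 = "the quick brown fox jumps over the lazy dog".split()
d2 = "the quick brown fox jumped over the lazy dog".split()
print(hamming(simhash(d1), simhash(d2)))  # small distance: near-duplicates
```

Clustering then amounts to grouping documents whose fingerprints are within a few bits of each other, which is far cheaper than all-pairs distance computation.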
2. High Performance Classifiers
Mahout's classification algorithms include Naive Bayes, Complementary Naive Bayes, Random Forests, and Logistic Regression trained via single-threaded Stochastic Gradient Descent (SGD).
Mahout's implementation of Logistic Regression uses Stochastic Gradient Descent (SGD), based on the paper Large Scale Machine Learning using Stochastic Gradient Descent by Léon Bottou. Mahout's SGD implementation is an online learning algorithm, which means that new models can be trained while the system is running.
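As a rough, library-agnostic sketch (not Mahout's OnlineLogisticRegression API), online SGD training of a logistic model consumes one labeled example at a time, which is what makes continuous training possible:

```python
import math
import random

def sgd_logistic(stream, dim, lr=0.1):
    """Online logistic regression via SGD: the weight vector is updated
    one (features, label) example at a time, so training can continue
    while the system is serving predictions."""
    w = [0.0] * dim
    for x, y in stream:  # y is 0 or 1
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i in range(dim):
            w[i] += lr * (y - p) * x[i]
    return w

# Toy stream: the label is 1 exactly when the first feature is positive
# (the second feature is a constant bias term).
rng = random.Random(1)
stream = [((x1, 1.0), 1 if x1 > 0 else 0)
          for x1 in (rng.uniform(-1, 1) for _ in range(2000))]
w = sgd_logistic(stream, dim=2)
```

Because each update touches only the current example, the same loop works whether the examples come from a file or from live traffic.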
While the present single-threaded SGD implementation in Mahout is stable and performs reasonably well on large datasets, parallel SGD implementations (like Downpour SGD) have been shown to substantially outperform traditional SGD. The plan in that direction is to implement a version of Downpour SGD from Google's paper Large Scale Distributed Deep Networks.
Mahout's MapReduce implementation of the Random Forests classifier is based on the original Breiman-Cutler paper. Comparable implementations in other packages like R and scikit-learn have proven more accurate than the Mahout implementation. There is work underway to introduce a more scalable Streaming Random Forests implementation based on online learning.
Mahoutâ€™s classification and regression solutions need some rework to include recent advancements. A parallel Logistic Regression implementation and an Ensemble version of Random Forests would be great additions.
3. Dictionaries from input Sequence Files
All of Mahout's clustering and classification algorithms expect their input as term vectors. One of the preprocessing steps before invoking a classifier or clustering algorithm is to convert the input text files into vectors (essentially lists of weighted tokens). Mahout's seq2sparse utility, which converts Sequence Files to vectors, is one of the slowest steps in the pipeline. Modifying seq2sparse to use Lucene's Finite State Transducers (FST) as a dictionary type should speed this up notably. Additionally, a future version of Lucene (5.0) has an in-memory terms dictionary (see Lucene in-memory terms dictionary) that should provide another boost in performance.
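To make the step concrete, here is a rough Python sketch of what a seq2sparse-style conversion produces: a term-to-index dictionary plus one sparse TF-IDF vector per document. Mahout's actual utility runs as a series of MapReduce jobs with a configurable Lucene analyzer; the naive tokenizer and weighting below are illustrative assumptions:

```python
import math
from collections import Counter

def vectorize(docs):
    """Sketch of a seq2sparse-style step: build a dictionary mapping
    each term to an index, then emit one sparse TF-IDF vector
    (index -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    dictionary = {t: i for i, t in
                  enumerate(sorted({t for d in tokenized for t in d}))}
    df = Counter(t for d in tokenized for t in set(d))  # document frequency
    n = len(docs)
    vectors = []
    for d in tokenized:
        tf = Counter(d)  # term frequency within this document
        vectors.append({dictionary[t]: tf[t] * math.log(n / df[t]) for t in tf})
    return dictionary, vectors

dictionary, vectors = vectorize(["Apache Mahout scales", "Apache Hadoop scales"])
```

The dictionary is exactly the piece that an FST-backed implementation would compress: it maps every distinct term in the corpus to a vector index, and for large vocabularies its size and lookup cost dominate.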
Work is underway now on supporting Lucene's Finite State Transducers (FST) as a dictionary type, targeted for Mahout 1.0 (2014 Q1).
4. Better Integration with Hive and HBase
This is an often-requested feature from Mahout users that is presently not supported by Mahout. We at Intel are well positioned to make this happen as part of the IDH distribution.
5. Use JBlas for Matrix Factorization
JBlas has been shown to outperform the existing Mahout Math package when dealing with dense matrices. Initial benchmarks using JBlas have shown marked improvement over the traditional Mahout Math methods. This feature is targeted for Mahout 1.0 (2014 Q1).
6. Better Human Interfaces
It would be great to have products like Dataiku drive Mahout's capabilities. Dataiku does a very good job on the data-cleansing end of machine learning, something that is completely lacking in Mahout today.
There is presently work underway on Scala bindings for Mahout, which would be a good starting point toward the bigger goal of better human interfaces.
7. Generic Connector/Interface Framework with Visualization tools
Mahout presently lacks data visualization capabilities. It would be useful to have a generic connector/integration framework for third-party visualization tools like R, Matlab, GraphViz, Gephi, etc.
8. Bigger Community
There are some closely related communities working on implementing machine learning on platforms like Spark. More cross-fertilization with those communities would be useful.