Ritu Kama is the Director of Product Management for Big Data at Intel. She has over 15 years of experience in building software solutions for enterprises. She has led Engineering, QA and Solution Delivery organizations within the Datacenter Software Division for Security and Identity products. Last year she led product and program management for Intel’s Distribution of Hadoop and Big Data solutions. Prior to joining Intel, she led technical and architecture teams at IBM and Ascom. She has an MBA from the University of Chicago and a Bachelor’s degree in Computer Science.
HBase is a non-relational, column-oriented database that runs on top of the Hadoop Distributed File System (HDFS). HBase tables contain rows and columns. Each table has a primary key (the row key), which is used for all Get/Put/Scan/Delete operations on that table. To some extent this can be a shortcoming, because one may want to search by the values in, say, a given column rather than by key.
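To make the limitation concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase client API: the `table` dictionary, the sample rows and the `get`, `scan` and `find_by_value` helpers are all made up for illustration. Key-based access goes straight to the row or key range, while a lookup by column value has to inspect every row.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows addressed by row key.
# All names and data here are illustrative; this is not the HBase client API.
table = {
    "user001": {"name": "alice", "city": "chicago"},
    "user002": {"name": "bob", "city": "portland"},
    "user003": {"name": "carol", "city": "chicago"},
}
sorted_keys = sorted(table)


def get(row_key):
    """Key lookup: direct access, no other rows are touched."""
    return table.get(row_key)


def scan(start_key, stop_key):
    """Range scan over row keys in [start_key, stop_key)."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


def find_by_value(column, value):
    """Value lookup without an index: every row must be inspected."""
    return [k for k in sorted_keys if table[k].get(column) == value]
```

Calling `find_by_value("city", "chicago")` works, but only by walking the whole table, which is the cost a secondary index is meant to avoid.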
The IDH Integration with Lucene
The Intel® Distribution for Apache Hadoop* (IDH) addresses this limitation by incorporating native features that permit straightforward integration with Lucene. Lucene is a search library that operates on documents containing data fields and their values. The IDH-to-Lucene integration builds on the HBase Observer and Endpoint coprocessor concepts, which is what gives it the flexibility to run Lucene searches over HBase data.
Observers can be likened to triggers in an RDBMS, while Endpoints share some conceptual similarity with stored procedures. The mapping between HBase records and Lucene documents is handled by a convenience class called IndexMetadata. The HBase Observer monitors data updates to the HBase table and builds indexes synchronously. The indexes are stored in multiple shards, with each shard tied to a region. The HBase Endpoint dispatches search requests from the client to those regions.
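The division of labor described above can be sketched in a few lines of plain Python. This is a conceptual model only, under stated assumptions: the `Region` and `Table` classes and their methods are hypothetical names invented for this sketch, and none of it is the HBase coprocessor API or the IDH implementation. It shows a write hook that indexes synchronously (the Observer role), one index shard per region, and a search that is dispatched to every region and merged (the Endpoint role).

```python
from collections import defaultdict


class Region:
    """One region: a slice of the row-key space plus its own index shard."""

    def __init__(self, start_key, stop_key):
        self.start_key = start_key
        self.stop_key = stop_key              # exclusive; None means unbounded
        self.rows = {}
        self.index_shard = defaultdict(set)   # term -> row keys in this region

    def owns(self, row_key):
        return row_key >= self.start_key and (
            self.stop_key is None or row_key < self.stop_key)

    def put(self, row_key, text):
        self.rows[row_key] = text
        self._post_put(row_key, text)         # hook fires within the same call

    def _post_put(self, row_key, text):
        """Observer role: index the write synchronously, like a trigger.

        Simplification: overwrites and deletes do not clean up old terms.
        """
        for term in text.lower().split():
            self.index_shard[term].add(row_key)

    def search(self, term):
        """Endpoint role: answer a search from this region's shard only."""
        return set(self.index_shard.get(term.lower(), ()))


class Table:
    def __init__(self, regions):
        self.regions = regions

    def put(self, row_key, text):
        region = next(r for r in self.regions if r.owns(row_key))
        region.put(row_key, text)

    def search(self, term):
        """Client side: dispatch to every region and merge the results."""
        hits = set()
        for region in self.regions:
            hits |= region.search(term)
        return sorted(hits)
```

With two regions split at `"m"`, a row keyed `"apple"` is indexed in the first shard and a row keyed `"plum"` in the second, yet a single `table.search("fruit")` call returns both.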
When entering data into an HBase table, you first need to create an HBase-Lucene mapping using the IndexMetadata class. During insertion, the text in the mapped columns is broken into index terms and stored in the Lucene index file. The IDH implementation creates the Lucene index automatically. Once the index exists, you can search on any keyword. The implementation looks the word up in the Lucene index and retrieves the row IDs of the rows that contain it. Using those keys, you can then directly access the relevant rows in the database.
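The insert-then-search flow can be illustrated with a small inverted index in plain Python. Everything named here is an assumption made for the sketch: `INDEXED_COLUMNS` merely plays the role the text assigns to IndexMetadata and is not that class's API, the column names are invented, and the tokenizer is a simple regular expression rather than a Lucene analyzer.

```python
import re
from collections import defaultdict

# Hypothetical mapping: which HBase columns feed which index field.
# It stands in for the IndexMetadata mapping; it is not that class's API.
INDEXED_COLUMNS = {"info:title": "title", "info:body": "body"}

rows = {}                            # row key -> {column: value}
inverted_index = defaultdict(set)    # (field, term) -> row keys


def tokenize(text):
    """Break text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def put(row_key, columns):
    """Store the row, then index only the columns that are mapped."""
    rows[row_key] = columns
    for column, value in columns.items():
        field = INDEXED_COLUMNS.get(column)
        if field is None:
            continue                 # unmapped columns are stored, not indexed
        for term in tokenize(value):
            inverted_index[(field, term)].add(row_key)


def search(field, keyword):
    """Step 1: keyword -> row keys. Step 2: row keys -> rows."""
    row_keys = sorted(inverted_index.get((field, keyword.lower()), ()))
    return [(key, rows[key]) for key in row_keys]
```

The two-step shape of `search` mirrors the description: the index yields row keys, and the row keys are then used for ordinary direct gets against the table.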
IDH's HBase-Lucene integration extends HBase's capability and provides many advantages:
1. Search not only by row key but also by values.
2. Use multiple query types such as Starts, Ends, Contains, Range, etc.
3. Obtain ranking scores for the search results.
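As a rough illustration of what these query types mean, the following plain-Python sketch runs Starts, Ends, Contains and Range matches over the terms of a toy index and ranks the hits. The index contents, the `match_terms` and `search` helpers and the query-type strings are all assumptions made for this example. The score is a raw occurrence count, which is a deliberate simplification and not Lucene's ranking formula.

```python
from collections import Counter

# term -> Counter of {row_key: occurrences}; a toy index, not Lucene's format.
index = {
    "hadoop": Counter({"r1": 3, "r2": 1}),
    "hbase": Counter({"r2": 2}),
    "hive": Counter({"r3": 1}),
    "lucene": Counter({"r1": 1, "r3": 4}),
}


def match_terms(kind, *args):
    """Select indexed terms by query type."""
    if kind == "starts":
        return [t for t in index if t.startswith(args[0])]
    if kind == "ends":
        return [t for t in index if t.endswith(args[0])]
    if kind == "contains":
        return [t for t in index if args[0] in t]
    if kind == "range":
        low, high = args
        return [t for t in index if low <= t <= high]   # inclusive bounds
    raise ValueError(f"unknown query type: {kind}")


def search(kind, *args):
    """Return (row_key, score) pairs, highest score first, ties by row key."""
    scores = Counter()
    for term in match_terms(kind, *args):
        scores.update(index[term])      # add up occurrences per row
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `search("contains", "ba")` matches only the term `hbase`, while `search("range", "hb", "hz")` matches every term that sorts between the two bounds.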
Sample Code and Configuration Procedures
The following code illustrates the basic steps for implementing HBase-Lucene integration, followed by various examples of the search types that become available.
1. Modify hbase-site.xml using Intel Manager.
2. Create the index metadata.
3. Create an HBase table and attach the index metadata to it.
The following table lists the queries that can be implemented using this integration, which provide the benefits described above.
To learn more about the IDH HBase-to-Lucene integration and to get a copy of the source code with usage examples, please check out hadoop.intel.com/resources.
Bye for now,